Skip to content

Is it better to breakdown one image's different boxes into multiple questions?[Question]  #23

@james20141606

Description

@james20141606

Question

Hi, When I try to look into the json file for finetuning, I find things like:

 {
        "id": "VG_100K_2/2412354",
        "image": "vg/VG_100K_2/2412354.jpg",
        "conversations": [
            {
                "from": "human",
                "value": "<image>\nPlease provide a short description for this region: [0.32, 0.8, 0.59, 1.0]."
            },
            {
                "from": "gpt",
                "value": "The cupboards are white."
            },
            {
                "from": "human",
                "value": "Please provide the bounding box coordinate of the region this sentence describes: Bathroom floor is dark tiles."
            },
            {
                "from": "gpt",
                "value": "[0.21, 0.92, 0.36, 1.0]"
            },
            {
                "from": "human",
                "value": "Please provide the bounding box coordinate of the region this sentence describes: Lights of bathroom."
            },
            {
                "from": "gpt",
                "value": "[0.47, 0.19, 0.68, 0.27]"
            },
            {
                "from": "human",
                "value": "Please provide the bounding box coordinate of the region this sentence describes: small hole in bathroom door."
            },
            {
                "from": "gpt",
                "value": "[0.7, 0.86, 0.74, 0.91]"
            },
            {
                "from": "human",
                "value": "Please provide a short description for this region: [0.29, 0.62, 0.72, 0.94]."
            },
            {
                "from": "gpt",
                "value": "Bathroom counter top is green."
            },
            {
                "from": "human",
                "value": "Please provide the bounding box coordinate of the region this sentence describes: Two handles on the vanity."
            },
            {
                "from": "gpt",
                "value": "[0.43, 0.87, 0.54, 0.96]"
            },
            {
                "from": "human",
                "value": "Please provide a short description for this region: [0.32, 0.65, 0.71, 0.92]."
            },
            {
                "from": "gpt",
                "value": "Counter top is made of granite."
            },
            {
                "from": "human",
                "value": "Please provide the bounding box coordinate of the region this sentence describes: Man has dark colored hair."
            },
            {
                "from": "gpt",
                "value": "[0.19, 0.49, 0.32, 0.65]"
            },
            {
                "from": "human",
                "value": "Please provide the bounding box coordinate of the region this sentence describes: Bathroom lights are on."
            },
            {
                "from": "gpt",
                "value": "[0.49, 0.22, 0.74, 0.29]"
            },
            {
                "from": "human",
                "value": "Please provide the bounding box coordinate of the region this sentence describes: Man is wearing eyeglasses."
            },
            {
                "from": "gpt",
                "value": "[0.18, 0.48, 0.32, 0.64]"
            }
        ]
    },

I wonder if it is better to split this question to multiple questions so each question contain only one bounding box to make model less puzzled? Although it will increase the training load a lot, could it somehow make the model understand the bounding box better?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions