VLM Data Format (VQA)

Vision-Language Model (VLM) input and output format.

This document outlines the structure and requirements of the Vision-Language Model (VLM) input and output format. The format is used to manage and evaluate annotated image datasets containing question-and-answer pairs.

Schema

The following JSON example outlines the top-level components of the VLM format. Further details about each element will be provided in separate sections.

Example

The structure of the VLM JSON file format is outlined below. Additional information on the file layout can be found in the Data Structure section.

[
  {
    "id": "example_image",
    "image": "example_image.jpg",
    "conversations": [
      {
        "question_id": 1,
        "question": "Is there a bus in the image?",
        "answer": {
          "groundtruth": true,
          "model001": true
        }
      },
      {
        "question_id": 2,
        "question": "How many people are in the image?",
        "answer": {
          "groundtruth": 0,
          "model001": 3
        }
      },
      {
        "question_id": 3,
        "question": "What are the colors of the bus?",
        "answer": {
          "groundtruth": "red and white",
          "model001": "red and white"
        }
      },
      {
        "question_id": 4,
        "question": "What types of vehicles are in the image?",
        "answer": {
          "groundtruth": ["bus"],
          "model001": ["car"]
        }
      }
    ]
  },
  {
    "id": "example_image2",
    "image": "example_image2.jpg",
    "conversations": []
  }
]
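The schema above is plain JSON, so it can be loaded with any standard JSON library. As a minimal sketch, the snippet below parses a single entry (taken from the example above) and collects (image id, question_id, groundtruth) triples, which is the shape typically needed for evaluation:

```python
import json

# A minimal entry in the VLM format (copied from the schema example above).
raw = """
[
  {
    "id": "example_image",
    "image": "example_image.jpg",
    "conversations": [
      {"question_id": 1,
       "question": "Is there a bus in the image?",
       "answer": {"groundtruth": true, "model001": true}}
    ]
  }
]
"""

entries = json.loads(raw)

# Collect (image id, question_id, groundtruth) triples for evaluation.
triples = [
    (e["id"], c["question_id"], c["answer"]["groundtruth"])
    for e in entries
    for c in e["conversations"]
]
print(triples)  # [('example_image', 1, True)]
```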

VLM {}

| Name          | Definition                                                     | Type   | Required |
| ------------- | -------------------------------------------------------------- | ------ | -------- |
| id            | A unique identifier for the image.                             | string | true     |
| image         | The image file name. Should match the id.                      | string | true     |
| conversations | A list of question-and-answer pairs associated with the image. | array  | true     |
conversations {}

| Name        | Definition | Type | Required |
| ----------- | ---------- | ---- | -------- |
| question_id | A unique identifier for the question, representing its rank. | integer | false |
| question    | The text of the question. Required only during export; optional during import. Example: "What are the colors of the bus in the image?" | string | false |
| answer      | Contains answers from various sources. groundtruth: the human-provided ground-truth answer. modelXXX: a model-specific answer; replace modelXXX with the model ID. | object | false |

Reminder: When importing data, you only need to provide the question_id. The question field (e.g., "Is there a bus in the image?") is not required and will be ignored. The system will automatically use the question content defined in the ontology.
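Since the question text is ignored on import, an import payload can be built from nothing but ids and answers. The helper below is a hypothetical sketch (the function name is not part of the format) that assembles one entry with question_id and groundtruth only:

```python
import json

# Sketch: build a minimal import payload. On import, only question_id is
# needed; the question text comes from the ontology, so it is omitted here.
def make_entry(image_id, image_file, answers):
    """answers: dict mapping question_id -> groundtruth value."""
    return {
        "id": image_id,
        "image": image_file,
        "conversations": [
            {"question_id": qid, "answer": {"groundtruth": value}}
            for qid, value in sorted(answers.items())
        ],
    }

entry = make_entry("example_image", "example_image.jpg", {1: True, 2: 0})
print(json.dumps([entry], indent=2))
```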

answer field {}

The answer field in the VLM format can take four different formats, depending on the nature of the question and the expected response type (which must match the system ontology settings). Each format is described below with example usage.

| Type | Definition |
| ---- | ---------- |
| Boolean | Represents a true/false response to a binary question. Example: "groundtruth": true |
| Option (List) | Represents a list of values, typically for multiple-choice or categorical answers. Example: "groundtruth": ["car"]. Note: the VLM system currently supports single selection only; if multiple items are provided during import, the system uses the first item in the list as the answer. |
| Number | Represents a numerical response to a question requiring a quantitative answer. Example: "groundtruth": 3 |
| Text | Represents a free-form textual response to an open-ended question. Example: "groundtruth": "The bus is red and white." |
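The four answer formats map directly onto JSON/Python value types. As a sketch (the function name is hypothetical), an importer might normalize incoming values like this, including the first-item rule for lists; note that the bool check must come before the number check, since Python treats booleans as integers:

```python
# Sketch: normalize an imported answer value per the four supported formats.
# The list rule mirrors the doc: single selection only, first item wins.
def normalize_answer(value):
    if isinstance(value, bool):          # Boolean question
        return value
    if isinstance(value, (int, float)):  # Number question
        return value
    if isinstance(value, str):           # Text question
        return value
    if isinstance(value, list):          # Option (List): keep first item only
        return value[0] if value else None
    raise TypeError(f"Unsupported answer type: {type(value).__name__}")

print(normalize_answer(["bus", "car"]))  # bus
print(normalize_answer(True))            # True
```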


Data Structure

The dataset is organized into two main folders: images and annotations. Each image is uniquely identified and linked to its corresponding annotations via a unique id. The following structure ensures scalability, flexibility, and efficient management of the dataset.

File Structure

data/
├── images/
│   ├── abc.jpg
│   ├── abc2.png
│   ├── xyz.jpg
│   └── ... (other images)
├── annotations/
│   ├── vlm_annotation_1.json
│   ├── vlm_annotation_2.json
│   └── ... (other annotation files)

Details

  1. Images Folder (images/):

    • Contains all image files.

    • Supported formats: JPG, PNG.

    • Each image must have a unique name that matches its corresponding id.

  2. Annotations Folder (annotations/):

    • Contains JSON files storing the annotations.

    • Each JSON file can contain multiple annotated entries (to prevent a single file from becoming too large).

    • JSON files can be split logically (e.g., by image batches, categories, or custom criteria).
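Because each annotation id must match an image file name, it is worth validating the two folders against each other before import. A minimal sketch (the function name is hypothetical) that compares annotation ids against image file stems:

```python
from pathlib import Path

# Sketch: check that every annotated id has a matching image file.
# Directory layout follows the structure above; extensions are JPG/PNG.
def missing_images(ids, image_names):
    """ids: annotation ids; image_names: filenames found under images/."""
    stems = {Path(name).stem for name in image_names}
    return sorted(set(ids) - stems)

print(missing_images(["abc", "xyz"], ["abc.jpg", "abc2.png"]))  # ['xyz']
```

In practice the two arguments would come from the loaded annotation JSON and a directory listing of images/.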
