VLM Data Format (VQA)
Vision-Language Model (VLM) input and output format.
This document outlines the structure and requirements of the Vision-Language Model (VLM) input and output format. The format is used to manage and evaluate annotated datasets of question-and-answer pairs for images.
The following JSON example outlines the top-level components of the VLM format. Further details about each element will be provided in separate sections.
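A minimal sketch of one annotated entry is shown below; the id, file name, question text, and model key (model_abc) are illustrative:

```json
{
  "id": "0001",
  "images": "0001.jpg",
  "conversations": [
    {
      "question_id": 1,
      "question": "What are the colors of the bus in the image?",
      "answer": {
        "groundtruth": "The bus is red and white.",
        "model_abc": "The bus is red and white."
      }
    }
  ]
}
```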
The structure of the VLM JSON file format is outlined below. Additional information on the file structure can be found in the Data Structure Details section.
- id (string, required): A unique identifier for the image.
- images (string, required): The image file name. Must match the id.
- conversations (array, required): A list of question-and-answer pairs associated with the image.
- question_id (integer, optional): A unique identifier for the question, representing its rank.
- question (string, optional): The text of the question. Required only during export; optional during import. Example: "What are the colors of the bus in the image?"
- answer (object, optional): Contains answers from various sources. groundtruth is the human-provided ground truth answer; modelXXX holds a model-specific answer, where modelXXX is replaced with the model ID.
Note: If the answer field is empty, the question key does not need to be provided. This helps streamline the data structure for cases where no response is available.
Reminder: When importing data, you only need to provide the question_id. The question field (e.g., "Is there a bus in the image?") is not required and will be ignored; the system automatically uses the question content defined in the ontology.
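For example, an import payload can be as minimal as the sketch below (values are illustrative). The second entry has no answer yet, so its question key is also omitted:

```json
{
  "id": "0001",
  "images": "0001.jpg",
  "conversations": [
    { "question_id": 1, "answer": { "groundtruth": true } },
    { "question_id": 2 }
  ]
}
```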
The answer field in the VLM format can take four different formats, depending on the nature of the question and the expected response type (which must match the system ontology settings). Below is a detailed breakdown of each format with example usage.
Boolean
Represents a true or false response to binary questions.
Example: "groundtruth": true
Option (List)
Represents a list of values, typically for multiple-choice or categorical answers.
Example: "groundtruth": ["car"]
Note: The VLM system currently supports single selection only. During import, if multiple items are provided, the system automatically uses the first item in the list as the answer.
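For instance, an imported list with two items (illustrative values):

```json
{ "groundtruth": ["car", "bus"] }
```

would be stored as:

```json
{ "groundtruth": ["car"] }
```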
Number
Represents numerical responses to questions requiring quantitative answers.
Example: "groundtruth": 3
Text
Represents free-form textual responses to open-ended questions.
Example: "groundtruth": "The bus is red and white."
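Putting the formats together, a single answer object can combine a ground truth value with answers from several models (the model IDs below are illustrative):

```json
"answer": {
  "groundtruth": "The bus is red and white.",
  "model_abc": "The bus is red and white.",
  "model_xyz": "Red and white."
}
```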
Data Structure Details
The dataset is organized into two main folders: images and annotations. Each image is uniquely identified and linked to its corresponding annotations via a unique id. This structure keeps the dataset scalable, flexible, and efficient to manage.
Images Folder (images/):
- Contains all image files.
- Supported formats: JPG, PNG.
- Each image must have a unique name that matches its corresponding id.
Annotations Folder (annotations/):
- Contains JSON files storing the annotations.
- Each JSON file can contain multiple annotated entries (to prevent a single file from becoming too large).
- JSON files can be split logically (e.g., by image batches, categories, or custom criteria).
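For illustration, a dataset following this structure might be laid out as shown below; file and batch names are examples only:

```
dataset/
├── images/
│   ├── 0001.jpg
│   └── 0002.png
└── annotations/
    ├── batch_001.json
    └── batch_002.json
```

Here, the annotation entries for 0001.jpg would use "id": "0001", matching the image file name.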