VLM Data Format (VQA)
Vision-Language Model (VLM) input and output format.
This document outlines the structure and requirements for the Vision-Language Model (VLM) input and output format. The format is used to manage and evaluate annotated datasets with questions and answers for images.
Schema
The following JSON example outlines the top-level components of the VLM format. Further details about each element will be provided in separate sections.
Example
The structure of the VLM JSON file format is outlined below. Additional information on the file layout can be found in the Data Structure section.
[
  {
    "id": "example_image",
    "image": "example_image.jpg",
    "conversations": [
      {
        "question_id": 1,
        "question": "Is there a bus in the image?",
        "answer": {
          "groundtruth": true,
          "model001": true
        }
      },
      {
        "question_id": 2,
        "question": "How many people are in the image?",
        "answer": {
          "groundtruth": 0,
          "model001": 3
        }
      },
      {
        "question_id": 3,
        "question": "What are the colors of the bus?",
        "answer": {
          "groundtruth": "red and white",
          "model001": "red and white"
        }
      },
      {
        "question_id": 4,
        "question": "What types of vehicles are in the image?",
        "answer": {
          "groundtruth": ["bus"],
          "model001": ["car"]
        }
      }
    ]
  },
  {
    "id": "example_image2",
    "image": "example_image2.jpg",
    "conversations": []
  }
]

VLM {}
id (string, required)
A unique identifier for the image.

image (string, required)
The image file name. Should match the id.

conversations (array, required)
A list of question-and-answer pairs associated with the image.
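As a quick sanity check, the three required top-level fields can be validated when an entry is loaded. This is a minimal sketch; the `validate_entry` helper is illustrative and not part of the format:

```python
# Minimal sketch: check the required top-level fields of one VLM entry.
# The helper name is illustrative, not part of the VLM format itself.
REQUIRED_FIELDS = {"id": str, "image": str, "conversations": list}

def validate_entry(entry):
    """Return a list of problems found in one entry (empty means valid)."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in entry:
            problems.append(f"missing required field: {field}")
        elif not isinstance(entry[field], expected):
            problems.append(f"{field} must be {expected.__name__}")
    return problems

entry = {"id": "example_image", "image": "example_image.jpg", "conversations": []}
print(validate_entry(entry))  # []
```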
conversations {}

question_id (integer, optional)
A unique identifier for the question, representing its rank.

question (string, optional)
The text of the question. Required only during export, optional during import.
Example: "What are the colors of the bus in the image?"

answer (object, optional)
Contains answers from various sources:
- groundtruth: Human-provided ground truth answer.
- modelXXX: Model-specific answers. Replace modelXXX with the model ID.
Note:
If the answer field is empty, the question entry can be omitted entirely. This keeps the data structure compact in cases where no response is available.
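Following the note above, conversation entries without answers can be filtered out before export. This is a hedged sketch; the `prune_conversations` helper is hypothetical and not part of the VLM tooling:

```python
# Sketch: drop conversation entries whose answer field is empty before export.
# (Hypothetical helper; the actual VLM tooling is not shown here.)
def prune_conversations(conversations):
    return [c for c in conversations if c.get("answer")]

conversations = [
    {"question_id": 1, "question": "Is there a bus in the image?",
     "answer": {"groundtruth": True}},
    {"question_id": 2, "question": "How many people are in the image?",
     "answer": {}},
]
print(len(prune_conversations(conversations)))  # 1
```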
answer field {}
The answer field in the VLM format can take four different formats, depending on the nature of the question and the expected response type (which must match the system's ontology settings).
Below is a detailed breakdown of each format, including example usage.
Boolean
Represents a True or False response to binary questions.
Example: "groundtruth": true
Option (List)
Represents a list of values, typically for multi-choice or categorical answers.
Example: "groundtruth": ["car"]
Note:
The VLM system currently supports single selection only.
During import, if multiple items are provided, the system will automatically prioritize and use the first item in the list as the answer.
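The first-item rule described above can be sketched as a small normalization step. This is illustrative only; the actual import pipeline is not shown here:

```python
# Sketch of the documented import behavior for Option (list) answers:
# only single selection is supported, so the first list item is kept.
# (Illustrative only; not the real import code.)
def normalize_option(answer):
    if isinstance(answer, list) and len(answer) > 1:
        return [answer[0]]
    return answer

print(normalize_option(["bus", "car"]))  # ['bus']
```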
Number
Represents numerical responses to questions requiring quantitative answers.
Example: "groundtruth": 3
Text
Represents free-form textual responses to open-ended questions.
Example: "groundtruth": "The bus is red and white."
Data Structure
The dataset is organized into two main folders: images and annotations. Each image is uniquely identified and linked to its corresponding annotations via a unique id. The following structure ensures scalability, flexibility, and efficient management of the dataset.
File Structure
data/
├── images/
│ ├── abc.jpg
│ ├── abc2.png
│ ├── xyz.jpg
│ └── ... (other images)
├── annotations/
│ ├── vlm_annotation_1.json
│ ├── vlm_annotation_2.json
│ └── ... (other annotation files)Details
│ └── ... (other annotation files)

Details

Images Folder (images/):
- Contains all image files.
- Supported formats: JPG, PNG.
- Each image must have a unique name that matches its corresponding id.
Annotations Folder (annotations/):
- Contains JSON files storing the annotations.
- Each JSON file can contain multiple annotated entries (to prevent a single file from becoming too large).
- JSON files can be split logically (e.g., by image batches, categories, or custom criteria).