Data Splitting

Data splitting is an essential step in preparing your dataset for machine learning and AI model training. DataVerse offers a versatile and intuitive data splitting feature that allows users to efficiently segment their data slices based on specific criteria. This process is crucial for ensuring that your models are trained on diverse and representative data samples.

Hint

You can split data directly from within a data slice.

In Data Slice Detail, click "Split". A window will pop up asking you to confirm the split percentage for this data slice.

Splitting Options in DataVerse:

When you choose to split a data slice in DataVerse, you have four distinct options to tailor the process to your needs:

  1. Random: This option randomly divides your data, ensuring a mix of various elements in each subset.

  2. Sequence Independent: Splits data so that sequences are kept independent across subsets, which suits time-series or other sequential data. The resulting data slice may contain fewer images than expected, especially if there are too few unique sequences available.

  3. Class Balance: Ensures that each subset has a balanced representation of the classes in your Ground Truth data, which helps reduce bias in model training. DataVerse tries to achieve class balance within the set number of images. If the desired balance is not attainable because some classes have too few instances, the system randomly selects images to fill the remainder.

  4. Tag Balance: Balances the distribution of user-defined target tags (option and boolean types) across subsets, so that all tagged elements are represented. DataVerse tries to achieve tag balance within the set number of images. If the desired balance is not attainable because some tags have too few instances, the system randomly selects images to fill the remainder.
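
The behavior described for Sequence Independent and Class Balance can be illustrated with a minimal Python sketch. This is not DataVerse's implementation: the function names, the `(image_id, sequence_id)` and `(image_id, class_label)` pair layouts, and the one-label-per-image simplification are all assumptions made for illustration. It shows why a sequence-aware split can return fewer images than requested, and how a random fill can top up a class-balanced selection.

```python
import random
from collections import defaultdict


def sequence_independent_split(images, fraction, seed=0):
    """Select whole sequences only, never exceeding the target size.

    `images` is a list of (image_id, sequence_id) pairs (hypothetical layout).
    """
    target = round(len(images) * fraction)
    by_sequence = defaultdict(list)
    for image_id, sequence_id in images:
        by_sequence[sequence_id].append(image_id)

    sequence_ids = sorted(by_sequence)
    random.Random(seed).shuffle(sequence_ids)

    selected = []
    for sequence_id in sequence_ids:
        members = by_sequence[sequence_id]
        # A sequence is taken whole or not at all, so the result can
        # fall short of the target when no remaining sequence fits.
        if len(selected) + len(members) <= target:
            selected.extend(members)
    return selected


def class_balanced_sample(images, n, seed=0):
    """Take an equal quota per class, then fill any shortfall at random.

    `images` is a list of (image_id, class_label) pairs (hypothetical layout,
    simplified to one label per image).
    """
    if not images:
        return []
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for image_id, label in images:
        by_class[label].append(image_id)

    quota = n // len(by_class)
    selected = []
    for label in sorted(by_class):
        pool = by_class[label]
        rng.shuffle(pool)
        selected.extend(pool[:quota])  # rare classes contribute what they have

    # Random fill: top up with images not yet chosen, regardless of class.
    chosen = set(selected)
    leftovers = [image_id for image_id, _ in images if image_id not in chosen]
    rng.shuffle(leftovers)
    selected.extend(leftovers[: n - len(selected)])
    return selected
```

For example, with three sequences of four images each and a 50% split, only one whole sequence fits under the six-image target, so the sketch returns four images rather than six. With ten "car" images, two "bike" images, and a request for eight, the per-class quota is four, "bike" can supply only two, and the remaining two slots are filled with randomly chosen "car" images.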

Process and Visualization:

Once you select your preferred method and click 'Split', DataVerse segments your data according to your choice. After splitting, you can view and analyze the newly created data subsets in the Data Visualization section by querying on "Data Slice". There, you can also access metrics that describe the composition and characteristics of each subset.
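
If you also have the class labels of each subset available outside DataVerse, a quick composition check can complement the built-in metrics. The sketch below is an assumption-laden illustration, not a DataVerse API: `class_distribution` and the sample label lists are hypothetical names and data.

```python
from collections import Counter


def class_distribution(labels):
    """Return each class's share of a subset as a fraction between 0 and 1."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


# Hypothetical labels for two subsets produced by a split.
train_labels = ["car", "car", "bike", "car"]
val_labels = ["car", "bike"]

for name, labels in [("train", train_labels), ("val", val_labels)]:
    print(name, class_distribution(labels))
```

Comparing the shares side by side makes it easy to spot a subset in which one class is over- or under-represented.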

Benefits of Data Splitting in DataVerse:

  • Enhanced Model Accuracy: Training your models on well-structured and balanced data increases the likelihood of higher accuracy and better generalization.

  • Bias Mitigation: Balanced datasets help reduce bias, leading to more reliable and ethical AI solutions.

  • Streamlined Workflow: DataVerse's data splitting simplifies the often complex process of preparing data for AI, saving time and effort.

  • Customizable to Needs: Whether dealing with sequential data or requiring class/tag balance, the platform adapts to diverse project requirements.

In summary, DataVerse's data splitting feature helps you prepare your data effectively for AI and machine learning model training.
