Data Sampling

Sampling Techniques in DataVerse for Data Slices

DataVerse offers several sampling techniques to help users efficiently select data from a data slice for analysis, training, or testing. The platform provides five sampling methods: Simple Random Sampling, Systematic Sampling, Sequence-balance Sampling, Class-balance Sampling, and Tag-balance Sampling. Each technique serves a different purpose, so choose the one that best fits your requirements.

Hint

You can run data sampling from within your data slice.

Simple Random Sampling

Simple Random Sampling randomly selects data items from the data slice, excluding any specified items, until the desired sample size is reached. Because every eligible item has the same chance of being selected, the sample is an unbiased representation of the data slice. This technique is ideal when a random subset of the data is required.
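
The description above can be sketched in a few lines of Python. This is only an illustration of the idea, not DataVerse's implementation: the function name `simple_random_sample` and its parameters (`items`, `sample_size`, `excluded`, `seed`) are hypothetical.

```python
import random


def simple_random_sample(items, sample_size, excluded=(), seed=None):
    """Pick up to sample_size items uniformly at random, skipping excluded ones.

    Hypothetical helper for illustration; not part of any DataVerse API.
    """
    excluded = set(excluded)
    # Drop the items the user asked to exclude before sampling.
    candidates = [item for item in items if item not in excluded]
    rng = random.Random(seed)
    # Never request more items than are actually available.
    k = min(sample_size, len(candidates))
    # random.Random.sample draws k distinct items without replacement.
    return rng.sample(candidates, k)


sample = simple_random_sample(range(1000), 200, excluded=[0, 1, 2], seed=42)
print(len(sample))
```

Passing a `seed` makes the draw repeatable, which helps when the same sample has to be reproduced later.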

Systematic Sampling

Systematic Sampling arranges the data items in the data slice in order, excluding any specified items, and selects them at a regular interval (for example, every 5th item) until the desired sample size is reached. Because the selected items are spread across the whole ordered slice, this method is useful when a more evenly distributed representation of the dataset is needed.
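
A minimal sketch of one common form of systematic sampling, taking every k-th item of the ordered data, is shown here. It assumes the interval is the number of eligible items divided by the sample size, which matches the "every 5th image" example later on this page. The function name and parameters are hypothetical.

```python
def systematic_sample(items, sample_size, excluded=()):
    """Select items at a fixed interval from an ordered collection.

    Hypothetical helper for illustration; not part of any DataVerse API.
    Assumes the interval is len(eligible items) // sample_size.
    """
    excluded = set(excluded)
    candidates = [item for item in items if item not in excluded]
    if sample_size <= 0 or not candidates:
        return []
    # The interval is at least 1 so small slices are still sampled.
    step = max(1, len(candidates) // sample_size)
    # Take the 1st, (1 + step)th, (1 + 2 * step)th item, and so on.
    return candidates[::step][:sample_size]


picked = systematic_sample(list(range(1, 1001)), 200)
print(picked[:3])  # [1, 6, 11]
```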

Sequence-balance Sampling

Sequence-balance Sampling selects a fixed number of data items from each sequence, excluding any specified items. If a sequence contains fewer items than the fixed number, all of its items are taken and the sample is filled up with other available data. This technique ensures a balanced representation of data from different sequences and is suitable for situations where equal representation across sequences is important.
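
The following sketch shows one way such a per-sequence quota with a fill-up step could work. How DataVerse chooses the fixed number is not stated here, so the code assumes it is the sample size divided by the number of sequences. The function name, the `(sequence_id, item_id)` input format, and the random choice within each sequence are also assumptions.

```python
import random
from collections import defaultdict


def sequence_balance_sample(items, sample_size, excluded=(), seed=None):
    """Take an equal quota from each sequence, then fill any shortfall.

    items is a list of (sequence_id, item_id) pairs.
    Hypothetical helper for illustration; not part of any DataVerse API.
    """
    rng = random.Random(seed)
    excluded = set(excluded)
    by_sequence = defaultdict(list)
    for sequence_id, item_id in items:
        if item_id not in excluded:
            by_sequence[sequence_id].append((sequence_id, item_id))
    if not by_sequence or sample_size <= 0:
        return []

    # Assumed quota: split the requested size evenly across sequences.
    quota = sample_size // len(by_sequence)
    selected, leftover = [], []
    for members in by_sequence.values():
        rng.shuffle(members)
        selected.extend(members[:quota])   # up to the quota per sequence
        leftover.extend(members[quota:])   # kept for the fill-up step

    # Fill the remainder from whatever data is still available.
    shortfall = sample_size - len(selected)
    if shortfall > 0:
        rng.shuffle(leftover)
        selected.extend(leftover[:shortfall])
    return selected
```

With two sequences of 50 and 900 items and a sample size of 200, this sketch takes all 50 items from the small sequence, 100 from the large one, and then fills the remaining 50 from the large sequence.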

Class-balance Sampling

The 'Class Balance' sampling strategy samples the data so that the target classes you selected are balanced. It uses undersampling, which reduces the number of entries drawn from over-represented classes. If the number of entries containing all selected classes falls short of the desired sample size, the system randomly selects additional entries to meet the quota.
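
To make the undersampling idea concrete, here is a simplified sketch. It assumes each entry carries a single class label and that the per-class quota is the sample size divided by the number of target classes; real entries may contain several classes, and DataVerse's actual balancing rule is not described here. All names are hypothetical.

```python
import random
from collections import defaultdict


def class_balance_sample(items, target_classes, sample_size, seed=None):
    """Undersample over-represented target classes, then fill randomly.

    items is a list of (item_id, class_label) pairs (one label per entry,
    which is a simplifying assumption).
    Hypothetical helper for illustration; not part of any DataVerse API.
    """
    rng = random.Random(seed)
    # Remove duplicate class names while keeping their order.
    target_classes = list(dict.fromkeys(target_classes))
    if sample_size <= 0 or not target_classes:
        return []

    by_class = defaultdict(list)
    leftover = []
    for item_id, label in items:
        if label in target_classes:
            by_class[label].append(item_id)
        else:
            leftover.append(item_id)  # only used to fill a shortfall

    quota = sample_size // len(target_classes)
    selected = []
    for label in target_classes:
        members = by_class.get(label, [])
        rng.shuffle(members)
        selected.extend(members[:quota])  # cap large classes at the quota
        leftover.extend(members[quota:])

    # Randomly add entries if the balanced picks fall short of the quota.
    shortfall = sample_size - len(selected)
    if shortfall > 0:
        rng.shuffle(leftover)
        selected.extend(leftover[:shortfall])
    return selected
```

For 90 "car" entries, 10 "person" entries, and a sample size of 40, the sketch keeps 20 cars and all 10 persons, then draws the last 10 entries at random from what remains.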

Tag-balance Sampling

The 'Tag Balance' sampling strategy aims to distribute images evenly according to the presence of target tags (option and boolean types) across datasets. The system tries to achieve tag balance within the requested number of images. If the desired balance is not attainable because there are too few tagged instances, the system randomly selects images to fill the remainder.
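
A rough sketch of balancing on a single tag follows. It groups images by the value of one tag (an option value or a boolean), takes an equal share from each group, and fills the rest at random. The image format (a dict with `id` and `tags` keys), the function name, and the equal-share rule are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def tag_balance_sample(images, tag_name, sample_size, seed=None):
    """Balance a sample across the values of one tag, then fill randomly.

    images is a list of dicts such as {"id": 1, "tags": {"night": True}}.
    Hypothetical helper for illustration; not part of any DataVerse API.
    """
    rng = random.Random(seed)
    if sample_size <= 0 or not images:
        return []

    groups = defaultdict(list)
    leftover = []
    for image in images:
        value = image.get("tags", {}).get(tag_name)
        if value is None:
            leftover.append(image)       # image does not carry the tag
        else:
            groups[value].append(image)  # False is a valid boolean value

    # Assumed rule: an equal share of the sample for each tag value.
    quota = sample_size // max(1, len(groups))
    selected = []
    for members in groups.values():
        rng.shuffle(members)
        selected.extend(members[:quota])
        leftover.extend(members[quota:])

    # Fill the remainder at random when a tag value has too few images.
    rng.shuffle(leftover)
    selected.extend(leftover[: sample_size - len(selected)])
    return selected
```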

Example

Consider the following scenario: seq1 has 100 images, and seq2 has 900 images. If you want to sample 200 images, the sampling techniques would work as follows:

  • Sequence-balance Sampling: seq1 would contribute 100 images (all available images), and seq2 would contribute the remaining 100 images.

  • Systematic Sampling: The 1,000 images (100 from seq1 and 900 from seq2) would be arranged in order, and every 5th image (1st, 6th, 11th, etc.) would be selected, because 1,000 / 200 = 5, until 200 images are sampled.
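
The arithmetic in this scenario can be checked with a short script. It assumes the images are ordered with all of seq1 first and all of seq2 after it, and that Sequence-balance Sampling uses an equal quota of 200 / 2 = 100 images per sequence; neither detail is specified above.

```python
from collections import Counter

# Hypothetical ordered slice: 100 images from seq1, then 900 from seq2.
images = [("seq1", i) for i in range(100)] + [("seq2", i) for i in range(900)]
sample_size = 200

# Systematic: interval of 1000 // 200 = 5, so the 1st, 6th, 11th, ... image.
step = len(images) // sample_size
systematic = images[::step][:sample_size]
systematic_counts = Counter(seq for seq, _ in systematic)

# Sequence-balance: assumed equal quota per sequence, capped by availability.
sizes = Counter(seq for seq, _ in images)
quota = sample_size // len(sizes)
balanced_counts = {seq: min(quota, size) for seq, size in sizes.items()}

print(dict(systematic_counts))  # {'seq1': 20, 'seq2': 180}
print(balanced_counts)          # {'seq1': 100, 'seq2': 100}
```

Under these assumptions, Systematic Sampling keeps the original 1:9 ratio between the sequences (20 and 180 images), while Sequence-balance Sampling draws 100 images from each.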

By providing various sampling techniques, DataVerse enables users to tailor their data selection process to their specific needs, ensuring an efficient and targeted approach to data analysis, model training, and evaluation.
