Training/Validation Split
Splitting data into training and validation sets is standard practice, but here we look at a specific mechanism, stratified sampling, which avoids bias in the distribution of output labels across the splits.
Problem
For example, in a classification problem the labels may be unevenly distributed. If we split the dataset at random, there is a chance that some labels end up missing from the training set or from the test set entirely.
Solution
To address this problem, we use stratified sampling: we keep the ratio of the labels consistent across the training, validation, and test sets. This way we avoid the situation where one class is denser in the training set than in the test set.
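A minimal sketch of the idea, assuming scikit-learn is available: the `stratify` parameter of `train_test_split` keeps the class ratio of an imbalanced label vector consistent between the two splits (the data here is synthetic and purely illustrative).

```python
from collections import Counter
from sklearn.model_selection import train_test_split

X = list(range(100))
y = [0] * 90 + [1] * 10  # imbalanced labels: 90% class 0, 10% class 1

# Passing the labels to `stratify` preserves the class ratio in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Both splits keep the original 90/10 balance:
# the minority class gets 8 of 80 training rows and 2 of 20 test rows.
print(Counter(y_train))
print(Counter(y_test))
```

Without `stratify=y`, a purely random split could put all ten minority-class samples into the training set, leaving the test set with no class-1 examples at all.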
Demo
In this code, we need to specify which value the stratified sampling is based on.
Use the index to look up the labels.
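The steps above can be sketched with scikit-learn's `StratifiedShuffleSplit` (an assumed choice of API; the variable names are illustrative): the splitter is told which label array to stratify on, and it yields row indices that we use to look up the labels for each split.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 16 + [1] * 4)  # 80% / 20% class balance

# The second argument to split() is the value stratification is based on.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
for train_idx, test_idx in splitter.split(X, y):
    # Use the returned indices to build the label arrays for each split.
    y_train, y_test = y[train_idx], y[test_idx]
    print("class-1 ratio in train:", y_train.mean())
    print("class-1 ratio in test:", y_test.mean())
```

Both printed ratios come out to 0.2, matching the full dataset's class balance, because the splitter allocates the four class-1 samples proportionally (3 to train, 1 to test).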