Training/Validation Split

Splitting data is fairly standard, but we will look at a special mechanism, stratified sampling, which avoids bias in the label distribution across the splits.

Problem

For example, in a classification problem the labels may be unevenly distributed. If we split the dataset at random, there is a chance that some labels end up missing from the training set or from the test set.
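A minimal sketch of the problem (the data here is made up for illustration): with only two samples of a rare class, a plain random split can put both of them on one side.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 98 "common" samples and only 2 "rare" ones.
labels = np.array(["common"] * 98 + ["rare"] * 2)

# Plain random split, no stratification.
split_train, split_test = train_test_split(
    labels, test_size=0.2, random_state=0
)

# Depending on the seed, one of these counts can be zero,
# leaving a split with no "rare" examples at all.
print("rare in train:", (split_train == "rare").sum())
print("rare in test:", (split_test == "rare").sum())
```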

Solution

To address this problem, we use "stratified sampling": we keep the ratio of the labels consistent across the train, validation, and test sets. This way we avoid the situation where one class is denser in the training set than in the test set.

Demo

from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(
    df.index.values,
    df.label.values,
    test_size=0.2,
    random_state=10,
    stratify=df.label.values  # the stratified sampling part
)

In this code, the stratify argument specifies which values the stratified sampling is based on; here we stratify by the labels.

Use the returned indices to tag each row with its split

df.loc[X_train, "data_type"] = "train"
df.loc[X_val, "data_type"] = "val"
df.groupby(["category", "label", "data_type"]).count()
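Putting it together, here is a self-contained sketch on a toy DataFrame (the 80/20 class imbalance is invented for illustration; the label column matches the snippet above) that verifies both splits keep the original label ratio:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 80 samples of class 0, 20 samples of class 1.
df = pd.DataFrame({"label": [0] * 80 + [1] * 20})

X_train, X_val, y_train, y_val = train_test_split(
    df.index.values,
    df.label.values,
    test_size=0.2,
    random_state=10,
    stratify=df.label.values,  # keep label ratios consistent
)

# Tag each row with the split it belongs to.
df["data_type"] = "train"
df.loc[X_val, "data_type"] = "val"

# Both splits preserve the original 80/20 ratio.
print(df.groupby("data_type").label.value_counts(normalize=True))
```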
