# Training/Validation Split

## Problem

In a classification problem, the labels are often unevenly distributed. If we split the dataset at random, there is a chance that some labels end up missing from the training set or from the validation/test set.
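The risk can be illustrated with a minimal sketch. The toy labels (98 samples of class 0, 2 of class 1) and the number of trials are assumptions made up for illustration; they are not taken from the dataset used on this page.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy labels: 98 samples of class 0 and only 2 of class 1.
labels = np.array([0] * 98 + [1] * 2)
indices = np.arange(len(labels))

n_trials = 50
missing = 0
for seed in range(n_trials):
    # Plain random split, with no stratification.
    _, _, _, y_val = train_test_split(
        indices, labels, test_size=0.2, random_state=seed
    )
    # Count the splits whose validation set has no class-1 sample at all.
    if 1 not in y_val:
        missing += 1

print(f"{missing} of {n_trials} random splits have no class-1 sample in the validation set")
```

With a class this rare, a sizeable share of the random splits should leave the validation set without any example of it, which makes metrics for that class impossible to compute.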

## Solution

To address this problem, we use stratified sampling: the ratio of the labels is kept consistent across the train, validation, and test sets. This way, no class ends up over-represented in the training set relative to the test set, or the other way around.

## Demo

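```python
from sklearn.model_selection import train_test_split

# Split the row indices (not the rows themselves) so that each row
# can be tagged with its split later on.
X_train, X_val, y_train, y_val = train_test_split(
    df.index.values,
    df.label.values,
    test_size=0.2,             # hold out 20% of the rows for validation
    random_state=10,           # fixed seed for a reproducible split
    stratify=df.label.values,  # the stratified sampling part
)
```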

In this code, the `stratify` argument specifies which values the stratified sampling is based on. Here it is the label column, so the label ratios are preserved in both splits.
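A quick way to confirm that the ratios are preserved is to compare the label shares of both splits. The sketch below is self-contained and uses a hypothetical toy `df` (90 `"neg"` rows, 10 `"pos"` rows) as a stand-in for the real dataset; only the `label` column name comes from the code above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 90 "neg" rows and 10 "pos" rows.
df = pd.DataFrame({"label": ["neg"] * 90 + ["pos"] * 10})

X_train, X_val, y_train, y_val = train_test_split(
    df.index.values,
    df.label.values,
    test_size=0.2,
    random_state=10,
    stratify=df.label.values,
)

# Share of each label within each split.
train_share = pd.Series(y_train).value_counts(normalize=True)
val_share = pd.Series(y_val).value_counts(normalize=True)

print(train_share)
print(val_share)
```

Both printed series should show the same label shares as the full dataset. On real data the shares match only approximately when the class counts do not divide evenly.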

## Use the index to tag each row with its split

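```python
# X_train and X_val hold row indices, so they can be used with .loc
# to mark which split each row belongs to.
df.loc[X_train, "data_type"] = "train"
df.loc[X_val, "data_type"] = "val"

# Count rows per category, label, and split to check the distribution.
df.groupby(['category', 'label', 'data_type']).count()
```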

