# Encode with Tokenizer and Prepare the dataset for Machine Learning

## Packages

1. **BertTokenizer** - provided in the [transformers](https://huggingface.co/transformers/index.html) package
2. **TensorDataset** - provided in the [torch](https://pytorch.org/docs/stable/torch.html) package

```python
import torch  # needed later for torch.tensor
from transformers import BertTokenizer
from torch.utils.data import TensorDataset
```

## BertTokenizer

`BertTokenizer` is the tokenizer that accompanies the BERT model. The pretrained checkpoint name and the parameters passed to `from_pretrained` determine which variant is loaded. In this case, we use the lowercasing tokenizer for `bert-base-uncased`:

```python
tokenizer = BertTokenizer.from_pretrained(
    'bert-base-uncased',
    do_lower_case=True
)
```

With this tokenizer, we can convert our text into token IDs:

```python
encoded_data_train = tokenizer.batch_encode_plus(
    df[df.data_type=='train'].text.values.tolist(), # the training texts, as a list of strings
    add_special_tokens=True, # add [CLS] and [SEP] so BERT knows where each sequence begins and ends
    return_attention_mask=True, # the mask is 1 for real tokens and 0 for padding
    padding='max_length', # pad shorter sequences up to max_length (replaces the deprecated pad_to_max_length=True)
    truncation=True, # cut longer sequences down to max_length
    max_length=256, # fixed length of every encoded sequence
    return_tensors='pt' # return PyTorch tensors
)
```
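To make the roles of padding and the attention mask concrete, here is a minimal sketch that reproduces the idea by hand on already-tokenized input. The helper `pad_and_mask` is hypothetical (it is not part of `transformers`), the word IDs are illustrative, and the truncation is simplified: the real tokenizer keeps the closing `[SEP]` token when it truncates.

```python
import torch

def pad_and_mask(sequences, max_length, pad_id=0):
    """Pad (or truncate) lists of token IDs to max_length and build the mask.

    Hypothetical helper for illustration only; pad_id=0 is assumed to be
    the [PAD] token ID, as in the bert-base-uncased vocabulary.
    """
    input_ids, attention_mask = [], []
    for seq in sequences:
        seq = seq[:max_length]            # naive truncation to max_length
        n_pad = max_length - len(seq)     # number of padding positions needed
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return torch.tensor(input_ids), torch.tensor(attention_mask)

# Two toy sequences of different lengths; 101 and 102 stand for [CLS] and [SEP]
sequences = [[101, 7592, 102], [101, 7592, 2088, 999, 102]]
ids, mask = pad_and_mask(sequences, max_length=6)
print(ids)
print(mask)
```

Both tensors end up with shape `(number of texts, max_length)`, which is what lets them be stacked into a single dataset later.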

Our input text is now tokenized into numeric data. The returned `encoded_data_train` is a dictionary-like object, from which we take the input IDs and attention masks. The labels come directly from the dataframe:

```python
input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(df[df.data_type=='train'].label.values)
```

## TensorDataset

```python
dataset_train = TensorDataset(
    input_ids_train, # data
    attention_masks_train, # mask
    labels_train # label
)
```

The data is now ready to be fed into the machine learning model.
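As a rough sketch of how such a dataset is typically consumed, the example below wraps toy tensors in a `TensorDataset` and iterates over it with a `DataLoader`. The tensor sizes, the label values, and `batch_size=2` are assumptions chosen for illustration; they stand in for `input_ids_train`, `attention_masks_train`, and `labels_train`.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins for the real tensors: 5 examples, 8 token positions each
input_ids = torch.randint(1, 1000, (5, 8))
attention_masks = torch.ones(5, 8, dtype=torch.long)
labels = torch.tensor([0, 1, 0, 2, 1])

# All tensors must share the same first dimension (the number of examples)
dataset = TensorDataset(input_ids, attention_masks, labels)

# Indexing returns one (input_ids, attention_mask, label) tuple per example
first_ids, first_mask, first_label = dataset[0]

# A DataLoader groups the examples into batches; the last batch may be smaller
loader = DataLoader(dataset, batch_size=2, shuffle=False)
batch_shapes = [tuple(batch[0].shape) for batch in loader]
print(len(dataset), batch_shapes)
```

For training, `shuffle=True` is the more common choice so that each epoch sees the examples in a different order.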

