Encode with a Tokenizer and Prepare the Dataset for Machine Learning

Usually, what we did in the “Train-Validation-Test Split” section is enough for ML. But NLP is different: we need a tokenizer to convert text into a numeric representation that the machine can learn from.

Packages

  1. BertTokenizer - provided in the transformers package

  2. TensorDataset - provided in the torch package

import torch
from transformers import BertTokenizer
from torch.utils.data import TensorDataset

BertTokenizer

It is the tokenizer that ships with the BERT model. We can choose which kind of tokenizer we want by specifying parameters. In this case, we use the following tokenizer:

tokenizer = BertTokenizer.from_pretrained(
    'bert-base-uncased', # the pre-trained lowercase BERT vocabulary
    do_lower_case=True # lowercase the input text before tokenizing
)
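
To see what the tokenizer produces, here is a minimal sanity check (the sample sentence and the printed output are illustrative, not from the original dataset):

tokens = tokenizer.tokenize("BERT converts text into numbers.") # split text into WordPiece tokens
print(tokens) # e.g. ['bert', 'converts', 'text', 'into', 'numbers', '.']
print(tokenizer.convert_tokens_to_ids(tokens)) # the vocabulary IDs the model actually consumes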

Using this tokenizer, we can transform our data into tokens:

encoded_data_train = tokenizer.batch_encode_plus(
    df[df.data_type=='train'].text.values, # the texts we provide
    add_special_tokens=True, # add [CLS] and [SEP] so BERT knows where a sentence begins and ends
    return_attention_mask=True, # the attention mask marks real tokens (1) versus padding (0)
    padding='max_length', # pad shorter sequences up to max_length
    truncation=True, # cut longer sequences down to max_length
    max_length=256, # the fixed sequence length
    return_tensors='pt' # return PyTorch tensors
)

So now our input text data is tokenized into numeric data. The returned encoded_data_train is a dictionary:

input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(df[df.data_type=='train'].label.values)
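
As a quick sanity check, the three tensors should agree on the number of training examples (the shapes shown are a sketch; the row count depends on your data):

print(input_ids_train.shape) # (num_train_examples, 256)
print(attention_masks_train.shape) # same shape as input_ids_train
print(labels_train.shape) # (num_train_examples,)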

TensorDataset

dataset_train = TensorDataset(
    input_ids_train, # data
    attention_masks_train, # mask
    labels_train # label
)
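
The validation split is prepared the same way. This sketch assumes the DataFrame marks those rows with data_type=='val', matching the convention used for the training rows above:

encoded_data_val = tokenizer.batch_encode_plus(
    df[df.data_type=='val'].text.values,
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',
    truncation=True,
    max_length=256,
    return_tensors='pt'
)

input_ids_val = encoded_data_val['input_ids']
attention_masks_val = encoded_data_val['attention_mask']
labels_val = torch.tensor(df[df.data_type=='val'].label.values)

dataset_val = TensorDataset(input_ids_val, attention_masks_val, labels_val)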

Then the data is ready to feed into the machine learning model.
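
To batch the examples during training, a TensorDataset is typically wrapped in a DataLoader. A minimal sketch (the batch size of 32 is an arbitrary choice, not from the original):

from torch.utils.data import DataLoader, RandomSampler

dataloader_train = DataLoader(
    dataset_train,
    sampler=RandomSampler(dataset_train), # shuffle training examples each epoch
    batch_size=32
)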
