Encode with Tokenizer and Prepare the Dataset for Machine Learning
Usually, what we did in the “Train-Validation-Test Split” section would be enough for ML. But NLP is different: we need a tokenizer to convert the raw text into a numeric representation that the machine can learn from.
from transformers import BertTokenizer
from torch.utils.data import TensorDataset
BertTokenizer
It is the tokenizer that comes with the BERT model. We can choose what kind of tokenizer we want by specifying its parameters. In this case, we use the following tokenizer.
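As a minimal sketch, assuming the uncased base checkpoint bert-base-uncased (a common choice; swap in whichever pretrained model you are fine-tuning):

from transformers import BertTokenizer

# 'bert-base-uncased' is an assumption here; use the checkpoint
# that matches the model you plan to fine-tune
tokenizer = BertTokenizer.from_pretrained(
    'bert-base-uncased',
    do_lower_case=True  # the uncased model expects lower-cased input
)

# Quick sanity check: one sentence becomes sub-word tokens,
# and encoding maps them to integer IDs with [CLS]/[SEP] added
print(tokenizer.tokenize('Hello, BERT!'))  # ['hello', ',', 'bert', '!']
print(tokenizer.encode('Hello, BERT!'))    # list of integer token IDs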
And by using this tokenizer, we are able to transform our data into tokens:
encoded_data_train = tokenizer.batch_encode_plus(
    df[df.data_type=='train'].text.values,  # the training sentences we provide
    add_special_tokens=True,     # add [CLS]/[SEP] so BERT knows where a sentence begins and ends
    return_attention_mask=True,  # the attention mask marks real tokens (1) vs. padding (0)
    pad_to_max_length=True,      # pad shorter sentences up to max_length
    max_length=256,              # truncate or pad every sentence to 256 tokens
    return_tensors='pt'          # return PyTorch tensors
)
So now our input text data is tokenized into number-based data. The returned encoded_data_train is a dictionary containing the input_ids, attention_mask, and token_type_ids tensors.
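From here we can bundle the tensors into the TensorDataset imported earlier. A minimal sketch, assuming the integer-encoded labels live in a 'label' column of df (adjust the column name to your dataset):

import torch
from torch.utils.data import TensorDataset

input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
# Assumption: labels are already integer-encoded in df.label
labels_train = torch.tensor(df[df.data_type=='train'].label.values)

# Each index now yields a (input_ids, attention_mask, label) triple,
# ready to be wrapped in a DataLoader for training
dataset_train = TensorDataset(input_ids_train, attention_masks_train, labels_train)
print(len(dataset_train))  # number of training examples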