Encode with Tokenizer and Prepare the Dataset for Machine Learning
Usually, what we did in the “Train-Validation-Test Split” section is enough preparation for ML. But NLP is different: we need a tokenizer to convert text into a numeric representation that the machine can learn from.
Package
BertTokenizer - provided in the transformers package
TensorDataset - provided in the torch package
BertTokenizer
It is the tokenizer provided with the BERT model. We can choose what kind of tokenizer we want by specifying its parameters. In this case, we use the following tokenizer:
By using this tokenizer, we can transform our data into tokens.
So now our input text data is tokenized into number-based data. The returned encoded_data_train is a dictionary.
TensorDataset
By wrapping the encoded tensors and labels in a TensorDataset, we have the data prepared to feed into the machine learning model.
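A sketch of that wrapping step; the tensor values here are hypothetical placeholders, while in practice `input_ids` and `attention_masks` come out of the tokenizer's encoded dictionary and `labels` from your label column:

```python
import torch
from torch.utils.data import TensorDataset

# Placeholder tensors; real ones come from the tokenizer output and labels.
input_ids = torch.tensor([[101, 1996, 3185, 102], [101, 6659, 5436, 102]])
attention_masks = torch.ones_like(input_ids)
labels = torch.tensor([1, 0])

# Bundle the tensors so indexing returns one (ids, mask, label) example,
# ready to be consumed by a DataLoader during training.
dataset_train = TensorDataset(input_ids, attention_masks, labels)

ids, mask, label = dataset_train[0]
print(len(dataset_train))  # 2
```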