Encode with Tokenizer and Prepare the Dataset for Machine Learning
Usually, what we did in the “Train-Validation-Test Split” section is enough for ML. But NLP is different: we need a tokenizer to convert text into a numeric representation that the machine can learn from.
Package
BertTokenizer - provided in the transformers package
TensorDataset - provided in the torch package
import torch
from transformers import BertTokenizer
from torch.utils.data import TensorDataset
BertTokenizer
It is a tokenizer provided with the BERT model. We can choose which kind of tokenizer we want by specifying parameters. In this case, we use the following tokenizer:
tokenizer = BertTokenizer.from_pretrained(
    'bert-base-uncased', # the pretrained vocabulary to load
    do_lower_case=True
)
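To sanity-check the tokenizer, we can run it on a sample sentence (the sentence below is just an illustration, not from our dataset):

sample_text = "Hello, BERT!"
# split the sentence into WordPiece tokens
print(tokenizer.tokenize(sample_text)) # e.g. ['hello', ',', 'bert', '!']
# map the sentence to vocabulary ids, with [CLS]/[SEP] added
print(tokenizer.encode(sample_text, add_special_tokens=True))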
Using this tokenizer, we can transform our training data into tokens:
encoded_data_train = tokenizer.batch_encode_plus(
    df[df.data_type=='train'].text.values, # the data we provide
    add_special_tokens=True, # let BERT know the beginning & end of a sentence
    return_attention_mask=True, # the attention mask marks which tokens are real and which are padding
    pad_to_max_length=True, # if a sequence is shorter than max_length, fill it with padding tokens
    max_length=256, # the maximum sequence length
    return_tensors='pt' # return PyTorch tensors
)
Now our input text data is tokenized into a number-based representation. The returned encoded_data_train is a dictionary, so we can extract the pieces we need:
input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(df[df.data_type=='train'].label.values)
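As a quick sanity check, the three tensors should line up on the first dimension; a minimal sketch, assuming the encoding above:

print(input_ids_train.shape)       # (num_train_samples, 256)
print(attention_masks_train.shape) # (num_train_samples, 256)
print(labels_train.shape)          # (num_train_samples,)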
TensorDataset
dataset_train = TensorDataset(
    input_ids_train, # data
    attention_masks_train, # mask
    labels_train # label
)
With this, the data is prepared to feed into the machine learning model.
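In practice, the TensorDataset is usually wrapped in a DataLoader so the model receives shuffled mini-batches during training. Here is a minimal sketch; the batch size of 32 is an assumption, not from the original text:

from torch.utils.data import DataLoader, RandomSampler

dataloader_train = DataLoader(
    dataset_train,
    sampler=RandomSampler(dataset_train), # shuffle training samples each epoch
    batch_size=32 # assumed value; tune it for your GPU memory
)

# each batch is a tuple of (input_ids, attention_mask, labels)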