Encode with Tokenizer and Prepare the Dataset for Machine Learning
Usually, what we did in the “Train-Validation-Test Split” section is enough for ML. NLP is different: a model cannot learn from raw text, so we need a tokenizer to convert the text into a numerical representation the machine can learn from.
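As a quick illustration of what the tokenizer does (the sample sentence below is just a stand-in, not from our dataset), it splits text into sub-word tokens and maps each one to an integer ID from BERT's vocabulary:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
print(tokenizer.tokenize('hello world'))  # ['hello', 'world']
print(tokenizer.encode('hello world'))    # the word IDs, wrapped in [CLS] (id 101) and [SEP] (id 102)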
Package
from transformers import BertTokenizer
from torch.utils.data import TensorDataset
BertTokenizer
tokenizer = BertTokenizer.from_pretrained(
    'bert-base-uncased',
    do_lower_case=True
)
encoded_data_train = tokenizer.batch_encode_plus(
    df[df.data_type=='train'].text.values,  # the texts we provide
    add_special_tokens=True,      # add [CLS]/[SEP] so BERT knows where a sentence begins and ends
    return_attention_mask=True,   # the attention mask marks real tokens (1) vs. padding (0)
    pad_to_max_length=True,       # pad shorter sequences up to max_length (padding='max_length' in newer transformers versions)
    max_length=256,               # the maximum sequence length
    return_tensors='pt'           # return PyTorch tensors
)
TorchDataset
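batch_encode_plus returns a dictionary containing the input_ids and attention_mask tensors. To finish preparing the dataset, we can wrap them, together with the labels, in a TensorDataset. A minimal sketch, assuming the labels live in a df.label column (that column name is an assumption, since the original code for this step is not shown):

import torch

input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
# 'label' is an assumed column name holding the integer class labels
labels_train = torch.tensor(df[df.data_type=='train'].label.values)

dataset_train = TensorDataset(input_ids_train, attention_masks_train, labels_train)

The same steps would be repeated for the validation split, so each split ends up as a TensorDataset ready for a DataLoader.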