DataLoader, Optimizer, Scheduler
Previously we prepared the dataset for ML and downloaded the model. Here we specify how to set up the ML training process: how data is fed in, the learning rate, and the number of training epochs.
DataLoader: how our data will be fed into the model
Let’s say we have 5,000 data points. It is not realistic to feed all 5,000 to the model at once, and it is also not efficient to feed only one piece of data at a time (mini-batch gradients generally learn better and help prevent overfitting). This is where the DataLoader comes in: it specifies how much data we feed into the model at once, and how those data are picked.
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

# Use RandomSampler for training data: we don't want the order in which
# the data happens to be listed to disrupt the learning process.
# SequentialSampler is fine for validation data: the order doesn't matter
# when we are only evaluating.
dataloader_train = DataLoader(
    dataset_train,                         # the tokenized data we feed to the model
    sampler=RandomSampler(dataset_train),  # shuffle the training data
    batch_size=32                          # how many examples we feed in at once
)
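To sanity-check the loader, we can pull out a single batch. This is a minimal sketch assuming dataset_train is a TensorDataset of (input_ids, attention_mask, labels) tensors, as is common after tokenizing with a transformers tokenizer:

batch = next(iter(dataloader_train))       # one randomly sampled batch
input_ids, attention_mask, labels = batch  # unpack the three tensors
print(input_ids.shape)                     # e.g. torch.Size([32, seq_len])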
Optimizer: how fast the model is going to learn
The optimizer specifies the algorithm used to update the model's weights from the gradients, as well as how fast those updates happen (the learning rate). AdamW (Adam with decoupled weight decay) is generally a good choice.
# transformers.AdamW is deprecated in recent versions of the library;
# torch.optim.AdamW is the drop-in replacement.
from torch.optim import AdamW

optimizer = AdamW(
    model.parameters(),  # the weights the optimizer will update
    lr=1e-5,             # learning rate: small values like this are typical for fine-tuning
    eps=1e-8             # epsilon for numerical stability in Adam's denominator
)
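The learning rate set above is stored in the optimizer's parameter groups; this is the value the scheduler below will shrink on every step. A quick way to inspect it:

# The scheduler mutates this value on each scheduler.step() call.
print(optimizer.param_groups[0]["lr"])  # 1e-05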
Scheduler: how the learning rate changes during training
The scheduler wraps the optimizer and sets the general rules of training: how many total steps there are, and how the learning rate changes over them. A linear schedule with warmup ramps the learning rate up from 0 for the first num_warmup_steps, then decays it linearly to 0 over the remaining steps.
from transformers import get_linear_schedule_with_warmup

epochs = 4  # example value: number of full passes over the training set

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,  # no warmup phase (the default)
    num_training_steps=len(dataloader_train) * epochs  # batches per epoch * epochs
)
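To show how the three pieces fit together, here is a minimal training-loop sketch. It assumes model is a Hugging Face model that returns a loss when labels are passed in, and that batches unpack as above; note that scheduler.step() is called once per batch, right after optimizer.step():

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

for epoch in range(epochs):
    model.train()
    for batch in dataloader_train:
        input_ids, attention_mask, labels = (t.to(device) for t in batch)

        model.zero_grad()
        outputs = model(input_ids=input_ids,
                        attention_mask=attention_mask,
                        labels=labels)
        loss = outputs.loss        # HF models compute the loss when labels are given
        loss.backward()            # backpropagate

        optimizer.step()           # update the weights
        scheduler.step()           # then advance the learning-rate schedule one step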