DataLoader, Optimizer, Scheduler

Previously we prepared the dataset for ML and downloaded the model. Here we define the training process itself: how the data is fed in, how fast the model learns, and how long it trains.

DataLoader: how our data will be fed into the model

Let’s say we have 5,000 training examples. It is not realistic to feed all 5,000 to the model at once, and it is also not efficient to feed only one example at a time (mini-batch gradients give more stable learning and help prevent overfitting). This is where the DataLoader comes in: it specifies how many examples we feed into the model at once, and how those examples are picked.

from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

# it is good to use RandomSampler for training data, because we don’t want the order in which the data is listed to disrupt the learning process
# it is fine to use SequentialSampler for validation data, because the order doesn’t matter for evaluation

dataloader_train = DataLoader(
    dataset_train, # the tokenized data we are going to feed into the model
    sampler=RandomSampler(dataset_train), # shuffle the order of training examples
    batch_size=32 # how many examples we feed in at once
)
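
For the validation set, the same idea applies but with SequentialSampler, since order does not matter for evaluation. A minimal sketch, assuming a tokenized validation set named dataset_val exists alongside dataset_train:

dataloader_val = DataLoader(
    dataset_val, # the tokenized validation data (hypothetical name)
    sampler=SequentialSampler(dataset_val), # order does not matter for evaluation
    batch_size=32
)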

Optimizer: how fast the model is going to learn

The optimizer specifies the algorithm used to update the model’s weights, as well as how fast those updates happen (the learning rate). AdamW is generally a good choice.

from transformers import AdamW # deprecated in recent transformers releases; torch.optim.AdamW is a drop-in replacement

optimizer = AdamW(
    model.parameters(), # the weights the optimizer will update
    lr=1e-5, # learning rate: how large each update step is
    eps=1e-8 # small constant for numerical stability
)
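
To see how the optimizer is used in practice: each training step computes the loss on one batch, backpropagates to get gradients, and lets AdamW apply the update. A minimal sketch, assuming each batch is a dict of tensors (input_ids, attention_mask, labels) and the model returns a loss:

optimizer.zero_grad() # clear gradients left over from the previous step
outputs = model(**batch) # forward pass; the loss is computed because labels are included
loss = outputs.loss
loss.backward() # backpropagate to compute gradients
optimizer.step() # AdamW updates the weights using the lr and eps set above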

Scheduler: how the learning rate changes during training

The scheduler wraps the optimizer and adjusts its learning rate over the course of training. To configure it we need the total number of training steps, which comes from the number of batches per epoch and the number of epochs.

from transformers import get_linear_schedule_with_warmup

epochs = 4 # number of passes over the training data (example value)

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0, # no warmup steps
    num_training_steps=len(dataloader_train)*epochs # batches per epoch * number of epochs
)
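
Putting the three pieces together, a typical loop iterates over epochs and batches, calling scheduler.step() right after optimizer.step() so the learning rate decays linearly over the num_training_steps set above. A rough sketch under the same assumptions as before (batches are dicts of tensors, the model returns a loss):

import torch

for epoch in range(epochs):
    model.train()
    for batch in dataloader_train:
        optimizer.zero_grad() # reset gradients
        loss = model(**batch).loss # forward pass on one batch
        loss.backward() # compute gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0) # optional: keep updates stable
        optimizer.step() # update the weights
        scheduler.step() # move the learning rate along its linear schedule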
