1. Introduction
How to prepare a large dataset from the Hub
How to use the high-level Trainer API to fine-tune a model
How to use a custom training loop
How to leverage the Accelerate library to easily run that custom training loop on any distributed setup
1.1. What is fine-tuning?
Fine-tuning means adapting a pre-trained model to a given downstream task.
2. Processing the data
In this chapter we train on the MRPC (Microsoft Research Paraphrase Corpus) dataset.
2.1. Loading a dataset from the Hub
Hugging Face provides not only models and tokenizers but also datasets.
!pip install datasets
from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")
raw_datasets
DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})
raw_train_dataset = raw_datasets["train"]
raw_train_dataset[0]
{'idx': 0,
 'label': 1,
 'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .'}
The labels are already integers. Let's check which values the label column can take:
raw_train_dataset.features
{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None),
 'idx': Value(dtype='int32', id=None)}
So the task takes two sentences and predicts whether they are equivalent (i.e. paraphrases of each other).
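As a quick check (a small sketch assuming the datasets ClassLabel API), the integer labels can be mapped back to their names:

label_feature = raw_train_dataset.features["label"]
# Label 1 maps to names[1], i.e. 'equivalent'
label_feature.int2str(raw_train_dataset[0]["label"])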
2.2. Preprocessing a dataset
We convert the text into model inputs using a tokenizer.
from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenized_sentences_1 = tokenizer(raw_datasets["train"]["sentence1"])
tokenized_sentences_2 = tokenizer(raw_datasets["train"]["sentence2"])
However, sentence1 and sentence2 are not fed to the model separately; the two sentences are tokenized together as a single pair.
inputs = tokenizer("This is the first sentence.", "This is the second one.")
inputs
{
  'input_ids': [101, 2023, 2003, 1996, 2034, 6251, 1012, 102, 2023, 2003, 1996, 2117, 2028, 1012, 102],
  'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
  'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
}
token_type_ids: indicates whether each token belongs to the first or the second sentence.
Decoding the input_ids gives the following:
tokenizer.convert_ids_to_tokens(inputs["input_ids"])
['[CLS]', 'this', 'is', 'the', 'first', 'sentence', '.', '[SEP]', 'this', 'is', 'the', 'second', 'one', '.', '[SEP]']
Note the special tokens such as [CLS] and [SEP].
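To see the sentence split more concretely, each token can be paired with its token_type_ids entry (a small sketch reusing the inputs above):

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"])
# Tokens up to and including the first [SEP] get type 0; the second sentence gets type 1.
list(zip(tokens, inputs["token_type_ids"]))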
Now let's look at how to tokenize the whole dataset. The most basic approach is the following:
tokenized_dataset = tokenizer(
    raw_datasets["train"]["sentence1"],
    raw_datasets["train"]["sentence2"],
    padding=True,
    truncation=True,
)
However, this returns a plain dictionary rather than a Dataset, and it only works if the entire tokenized dataset fits in memory. Instead, we use the Dataset.map method, which applies a function to each element (or batch of elements) of the dataset. We define a tokenize function as below and apply it to every element with Dataset.map.
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets
DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'idx', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['attention_mask', 'idx', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids'],
        num_rows: 408
    })
    test: Dataset({
        features: ['attention_mask', 'idx', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids'],
        num_rows: 1725
    })
})
As shown above, the input_ids, attention_mask, and token_type_ids columns have been added.
2.3. Dynamic padding
The dataset will be wrapped in a DataLoader and consumed batch by batch. To avoid unnecessary padding, each batch is padded only up to the length of its longest sequence. Here we use the DataCollatorWithPadding class to do this padding.
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
samples = tokenized_datasets["train"][:8]
samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]}
[len(x) for x in samples["input_ids"]]
[50, 59, 47, 67, 59, 50, 62, 32]
Passing these samples through the data_collator:
batch = data_collator(samples)
{k: v.shape for k, v in batch.items()}
{'attention_mask': torch.Size([8, 67]),
 'input_ids': torch.Size([8, 67]),
 'token_type_ids': torch.Size([8, 67]),
 'labels': torch.Size([8])}
Every tensor has been padded to 67, the length of the longest sample in the batch.
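As a sanity check (a sketch reusing samples and batch from above), the padded length should match the longest raw sequence in the batch:

# 67 is the longest of the eight input_ids lists, so every tensor is padded to that length.
assert batch["input_ids"].shape[1] == max(len(ids) for ids in samples["input_ids"])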
3. Fine-tuning a model with the Trainer API
The Transformers library provides the Trainer class to make fine-tuning easy.
3.1. Training
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
First, load the dataset and tokenizer and run the preprocessing as before.
Next, create a TrainingArguments object:
from transformers import TrainingArguments

training_args = TrainingArguments("test-trainer")
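Only the output directory is required here; hyperparameters such as the number of epochs, batch size, and learning rate can optionally be passed as well (the values below are illustrative, not from the original):

training_args = TrainingArguments(
    "test-trainer",
    num_train_epochs=3,              # illustrative values, not tuned
    per_device_train_batch_size=8,
    learning_rate=5e-5,
)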
Next, create the model:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)
Remarkably, training now runs simply by calling the train method:
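trainer.train()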
3.2. Evaluation
# predict returns the logits for each validation example together with the labels
predictions = trainer.predict(tokenized_datasets["validation"])
print(predictions.predictions.shape, predictions.label_ids.shape)
import numpy as np

# Convert the logits into predicted class ids
preds = np.argmax(predictions.predictions, axis=-1)
from datasets import load_metric

metric = load_metric("glue", "mrpc")
metric.compute(predictions=preds, references=predictions.label_ids)
{'accuracy': 0.8578431372549019, 'f1': 0.8996539792387542}
def compute_metrics(eval_preds):
    # Called by the Trainer at evaluation time to report accuracy and F1
    metric = load_metric("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
4. A full training
Let's walk through the full training process without the help of the Trainer class.
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
First, prepare the dataset as before. Then post-process tokenized_datasets so that it contains only the columns the model expects, in PyTorch format:
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")
tokenized_datasets["train"].column_names
['attention_mask', 'input_ids', 'labels', 'token_type_ids']
Next, define the DataLoaders:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], batch_size=8, collate_fn=data_collator
)
Sanity-check the DataLoader:
for batch in train_dataloader:
    break
{k: v.shape for k, v in batch.items()}
{'attention_mask': torch.Size([8, 65]),
 'input_ids': torch.Size([8, 65]),
 'labels': torch.Size([8]),
 'token_type_ids': torch.Size([8, 65])}
Prepare the model:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
Pass a batch through the model to check that everything works:
outputs = model(**batch)
print(outputs.loss, outputs.logits.shape)
tensor(0.5441, grad_fn=<NllLossBackward>) torch.Size([8, 2])
Set up the optimizer and learning rate scheduler:
from transformers import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)
from transformers import get_scheduler

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
print(num_training_steps)
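With 3,668 training examples and a batch size of 8, train_dataloader has ceil(3668 / 8) = 459 batches, so num_training_steps comes out to 3 × 459 = 1377.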
4.1. Training loop
Move the model to the GPU for training:
import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
device
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
4.2. Evaluation loop
from datasets import load_metric

metric = load_metric("glue", "mrpc")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()
5. Accelerate
With the Accelerate library, the same training loop can run on multiple GPUs or any other distributed setup. Let's look at the corresponding code.
from transformers import AdamW, AutoModelForSequenceClassification, get_scheduler

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
optimizer = AdamW(model.parameters(), lr=3e-5)

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
+ from accelerate import Accelerator
from transformers import AdamW, AutoModelForSequenceClassification, get_scheduler

+ accelerator = Accelerator()

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
optimizer = AdamW(model.parameters(), lr=3e-5)

- device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
- model.to(device)

+ train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare(
+     train_dataloader, eval_dataloader, model, optimizer
+ )

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
-       batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
-       loss.backward()
+       accelerator.backward(loss)
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
To use it from a script, configure Accelerate first and then launch the script with it:
# accelerate config
# accelerate launch train.py
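In a notebook, the training code can instead be wrapped in a function and started with notebook_launcher. The training_function passed below is not defined in the original; a minimal sketch (learning rate scheduler omitted for brevity) might look like this:

# Hypothetical wrapper around the Accelerate training loop above,
# reusing checkpoint, train_dataloader, eval_dataloader, and num_epochs.
def training_function():
    accelerator = Accelerator()
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
    optimizer = AdamW(model.parameters(), lr=3e-5)
    train_dl, eval_dl, model, optimizer = accelerator.prepare(
        train_dataloader, eval_dataloader, model, optimizer
    )
    model.train()
    for epoch in range(num_epochs):
        for batch in train_dl:
            outputs = model(**batch)
            accelerator.backward(outputs.loss)
            optimizer.step()
            optimizer.zero_grad()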
from accelerate import notebook_launcher

notebook_launcher(training_function)
Related source: Link