# Ch3. Fine-tuning a model with the Trainer API

## 1. Introduction

* How to prepare a large dataset from the Hub
* How to use the high-level Trainer API to fine-tune a model
* How to use a custom training loop
* How to leverage the Accelerate library to easily run that custom training loop on any distributed setup

### 1.1. What is fine-tuning?

Fine-tuning means adapting a pre-trained model to a given task by training it further on data for that task.

## 2. Processing the data

In this chapter we train on the MRPC (Microsoft Research Paraphrase Corpus) dataset.

### 2.1. Loading a dataset from the Hub

Hugging Face provides datasets as well as models and tokenizers.

```
!pip install datasets
```

```python
from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")
raw_datasets
```

```
DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})
```

* Let's look at a single example from the dataset:

```python
raw_train_dataset = raw_datasets["train"]
raw_train_dataset[0]
```

```
{'idx': 0,
 'label': 1,
 'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .'}
```

* The label is already stored as an integer. Let's check which label values exist:

```python
raw_train_dataset.features
```

```
{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None),
 'idx': Value(dtype='int32', id=None)}
```

* So the task is to take two sentences and decide whether they are equivalent (paraphrases of each other) or not.

### 2.2. Preprocessing a dataset

* Use the tokenizer to convert the text into model inputs.

```python
from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenized_sentences_1 = tokenizer(raw_datasets["train"]["sentence1"])
tokenized_sentences_2 = tokenizer(raw_datasets["train"]["sentence2"])
```

* The model does not take sentence1 and sentence2 separately, though. The two sentences are tokenized together as a pair.

```python
inputs = tokenizer("This is the first sentence.", "This is the second one.")
inputs
```

```
{ 
  'input_ids': [101, 2023, 2003, 1996, 2034, 6251, 1012, 102, 2023, 2003, 1996, 2117, 2028, 1012, 102],
  'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
  'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
}
```

* `token_type_ids`: marks which tokens belong to the first sentence (0) and which to the second (1)

Converting the `input_ids` back into tokens gives the following.

```python
tokenizer.convert_ids_to_tokens(inputs["input_ids"])
```

```
['[CLS]', 'this', 'is', 'the', 'first', 'sentence', '.', '[SEP]', 'this', 'is', 'the', 'second', 'one', '.', '[SEP]']
```

* Special tokens such as \[CLS] and \[SEP] are visible here.
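The layout of a sentence pair can be sketched in plain Python. This is only an illustration under stated assumptions: `encode_pair` is a hypothetical helper that works on already-split tokens, not the real tokenizer, and it reproduces only the special-token placement and the `token_type_ids` pattern shown in the output above.

```python
def encode_pair(tokens_a, tokens_b):
    """Hypothetical helper: lay out two pre-split sentences BERT-style."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], the first sentence and its [SEP];
    # segment 1 covers the second sentence and the final [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # No padding yet, so every position is attended to.
    attention_mask = [1] * len(tokens)
    return {
        "tokens": tokens,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


first = ["this", "is", "the", "first", "sentence", "."]
second = ["this", "is", "the", "second", "one", "."]
pair = encode_pair(first, second)
print(pair["tokens"])
print(pair["token_type_ids"])
```

The eight zeros and seven ones match the `token_type_ids` returned by the tokenizer for the same two sentences.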

Now let's see how to build the tokenized dataset. The most basic approach looks like this.

```python
tokenized_dataset = tokenizer(
    raw_datasets["train"]["sentence1"],
    raw_datasets["train"]["sentence2"],
    padding=True,
    truncation=True,
)
```

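This works, but it returns a plain dictionary (with keys such as `input_ids`, `attention_mask`, and `token_type_ids`) rather than a Dataset, and it needs enough RAM to hold the whole tokenized dataset at once. To keep the data as a Dataset, use the `Dataset.map` method instead. It applies a function to each element of the dataset (or to batches of elements when `batched=True`). Define a tokenize function as below and pass it to `map` so that every element gets tokenized.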

```python
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)
```

```python
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets
```

```
DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'idx', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['attention_mask', 'idx', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids'],
        num_rows: 408
    })
    test: Dataset({
        features: ['attention_mask', 'idx', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids'],
        num_rows: 1725
    })
})
```

As shown above, the `input_ids`, `attention_mask`, and `token_type_ids` columns have been added.
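How a batched `map` adds columns can be sketched without the library. The code below is a simplified stand-in, not the actual `datasets` implementation: `batched_map` and `toy_tokenize` are hypothetical names, and the "tokenizer" just counts whitespace-separated words.

```python
def batched_map(columns, function, batch_size=1000):
    """Hypothetical stand-in for Dataset.map(function, batched=True).

    `columns` is a dict mapping column names to equally long lists.
    """
    num_rows = len(next(iter(columns.values())))
    new_columns = {}
    for start in range(0, num_rows, batch_size):
        # The function receives a dict of lists, one slice per column.
        batch = {name: values[start:start + batch_size]
                 for name, values in columns.items()}
        for name, values in function(batch).items():
            new_columns.setdefault(name, []).extend(values)
    # Existing columns are kept; returned keys become new columns.
    result = {name: list(values) for name, values in columns.items()}
    result.update(new_columns)
    return result


def toy_tokenize(batch):
    # Toy replacement for the tokenizer: count words in each sentence pair.
    return {
        "num_words": [
            len(s1.split()) + len(s2.split())
            for s1, s2 in zip(batch["sentence1"], batch["sentence2"])
        ]
    }


columns = {
    "sentence1": ["a b", "c", "d e f"],
    "sentence2": ["x", "y z", "w"],
}
mapped = batched_map(columns, toy_tokenize, batch_size=2)
print(sorted(mapped))
print(mapped["num_words"])
```

Because the function sees many rows per call, a fast tokenizer can process them together, which is why `batched=True` is used in the real call.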

### 2.3. Dynamic padding

Batches are drawn from the Dataset through a DataLoader. To avoid unnecessary padding, each batch is padded only up to the length of its own longest sequence.

Here we use the `DataCollatorWithPadding` class to apply this padding.

```python
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
```

```python
samples = tokenized_datasets["train"][:8]
samples = {
    k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]
}
[len(x) for x in samples["input_ids"]]
```

```
[50, 59, 47, 67, 59, 50, 62, 32]
```

* Passing the sample data to `data_collator` gives:

```python
batch = data_collator(samples)
{k: v.shape for k, v in batch.items()}
```

```
{'attention_mask': torch.Size([8, 67]),
 'input_ids': torch.Size([8, 67]),
 'token_type_ids': torch.Size([8, 67]),
 'labels': torch.Size([8])}
```

Every tensor is padded to 67, the length of the longest sequence in the batch.
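The idea behind dynamic padding can be sketched in plain Python. This is an illustration only: `pad_batch` is a hypothetical helper, the token ids are dummies, and `pad_id=0` is an assumption chosen to match BERT's `[PAD]` id.

```python
def pad_batch(sequences, pad_id=0):
    """Hypothetical helper: pad every sequence to the longest one in this batch."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks real tokens, 0 marks padding the model should ignore.
    attention_mask = [
        [1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


lengths = [50, 59, 47, 67, 59, 50, 62, 32]
sequences = [[1] * n for n in lengths]  # dummy token ids with the lengths above
padded = pad_batch(sequences)
print({len(row) for row in padded["input_ids"]})
```

A batch whose longest sequence is shorter gets a smaller padded length, so no batch pays for the longest sequence in the whole dataset.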

## 3. Fine-tuning a model with the Trainer API

The Transformers library provides a `Trainer` class that makes fine-tuning easy.

### 3.1. Training

```python
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
```

* First load the dataset and the tokenizer and run the preprocessing.

Next, create a `TrainingArguments` object.

```python
from transformers import TrainingArguments

training_args = TrainingArguments("test-trainer")
```

Then create the model.

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
```

```python
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)
```

```python
trainer.train()
```

Simply calling the `train` method is all it takes to start training.

### 3.2. Evaluation

```python
predictions = trainer.predict(tokenized_datasets["validation"])
print(predictions.predictions.shape, predictions.label_ids.shape)
```

```
(408, 2) (408,)
```

```python
import numpy as np
preds = np.argmax(predictions.predictions, axis=-1)
```

```python
import evaluate  # pip install evaluate; replaces the removed datasets.load_metric

metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=preds, references=predictions.label_ids)
```

```
{'accuracy': 0.8578431372549019, 'f1': 0.8996539792387542}
```

* Wrapping the steps above into a single function:

```python
def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "mrpc")
    logits, labels = eval_preds
    # Pick the class with the highest logit for each example.
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
```

```python
training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

trainer.train()
```
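What `compute_metrics` reports can be reproduced by hand on a tiny example. The sketch below is plain Python under stated assumptions: `argmax` and `accuracy_and_f1` are hypothetical helpers, the logits are made up, and F1 is computed for the positive class (label 1, `equivalent`).

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy_and_f1(predictions, references, positive=1):
    """Hypothetical helper: accuracy and binary F1 for the positive class."""
    pairs = list(zip(predictions, references))
    correct = sum(p == r for p, r in pairs)
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": correct / len(pairs), "f1": f1}


toy_logits = [[0.2, 1.5], [2.0, -1.0], [0.1, 0.3], [-0.5, 0.9]]  # made-up values
toy_labels = [1, 0, 0, 1]
toy_preds = [argmax(row) for row in toy_logits]
toy_scores = accuracy_and_f1(toy_preds, toy_labels)
print(toy_preds, toy_scores)
```

Here three of four predictions are correct, and the single false positive lowers precision to 2/3 while recall stays at 1.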

## 4. A full training

This section walks through the same training process without the help of the `Trainer` class.

```python
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
```

**First prepare the dataset, then convert it into the format the model expects.**

```python
tokenized_datasets = tokenized_datasets.remove_columns(
    ["sentence1", "sentence2", "idx"]
)
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")
tokenized_datasets["train"].column_names
```

```
['attention_mask', 'input_ids', 'labels', 'token_type_ids']
```

Next, define the dataloaders.

```python
from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], batch_size=8, collate_fn=data_collator
)
```

Check one batch from the dataloader:

```python
for batch in train_dataloader:
    break
{k: v.shape for k, v in batch.items()}
```

```
{'attention_mask': torch.Size([8, 65]),
 'input_ids': torch.Size([8, 65]),
 'labels': torch.Size([8]),
 'token_type_ids': torch.Size([8, 65])}
```

**Prepare the model.**

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
```

Pass the batch through the model to confirm that everything works.

```python
outputs = model(**batch)
print(outputs.loss, outputs.logits.shape)
```

```
tensor(0.5441, grad_fn=<NllLossBackward>) torch.Size([8, 2])
```

Set up the optimizer and the learning rate scheduler.

```python
from torch.optim import AdamW  # transformers.AdamW is deprecated

optimizer = AdamW(model.parameters(), lr=5e-5)
```

```python
from transformers import get_scheduler

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps
)
print(num_training_steps)
```

```
1377
```
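The value 1377 follows from the dataset size, and the linear schedule can be written out directly. This is a plain-Python sketch, not the library's scheduler: `linear_lr` is a hypothetical helper that assumes zero warmup steps, as in the call above.

```python
import math

num_train_rows = 3668  # size of the MRPC train split shown earlier
batch_size = 8
num_epochs = 3

# The last, smaller batch still counts as a step, hence the ceiling.
steps_per_epoch = math.ceil(num_train_rows / batch_size)
num_training_steps = num_epochs * steps_per_epoch
print(steps_per_epoch, num_training_steps)


def linear_lr(step, base_lr=5e-5, total_steps=num_training_steps):
    """Hypothetical helper: linear decay from base_lr to 0 with no warmup."""
    return base_lr * max(0.0, (total_steps - step) / total_steps)


for step in (0, num_training_steps // 2, num_training_steps):
    print(step, linear_lr(step))
```

The learning rate starts at 5e-5 and reaches 0 exactly at the last step, which is why the scheduler needs to know the total number of steps up front.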

### 4.1. Training loop

Move the model to the GPU for training.

```python
import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
device
```

```
device(type='cuda')
```

```python
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
```
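The order of operations inside the loop (forward, backward, optimizer step, scheduler step, zero the gradients) can be traced on a toy problem. The sketch below is plain Python and purely illustrative: the "model" is a single weight `w` fitted to `y = 2 * x`, and the gradient and update are written by hand in place of PyTorch's autograd and optimizer.

```python
# Toy data for y = 2 * x; the "model" is one weight w (an illustrative stand-in).
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w = 0.0
grad = 0.0
base_lr = 0.05
num_epochs = 20
num_training_steps = num_epochs * len(data)

step = 0
for epoch in range(num_epochs):
    for x, y in data:
        pred = w * x                  # forward pass, like outputs = model(**batch)
        loss = (pred - y) ** 2        # like outputs.loss
        grad += 2 * (pred - y) * x    # like loss.backward(): gradients accumulate
        # Linear decay with no warmup, evaluated at the current step.
        lr = base_lr * (num_training_steps - step) / num_training_steps
        w -= lr * grad                # like optimizer.step()
        step += 1                     # like lr_scheduler.step()
        grad = 0.0                    # like optimizer.zero_grad()

print(round(w, 4))
```

Gradients accumulate across `backward` calls, which is why the loop resets them after every update; skipping the reset would mix gradients from earlier batches into each step.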

### 4.2. Evaluation loop

```python
import evaluate  # pip install evaluate; replaces the removed datasets.load_metric

metric = evaluate.load("glue", "mrpc")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)
    
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()
```
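The `add_batch` then `compute` pattern accumulates counts over all batches before producing a score. A minimal sketch of that pattern, assuming a hypothetical `RunningAccuracy` class that tracks accuracy only (the real GLUE metric also reports F1):

```python
class RunningAccuracy:
    """Hypothetical accumulator mirroring the add_batch / compute pattern."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def add_batch(self, predictions, references):
        # Store only running counts, not the individual predictions.
        self.correct += sum(p == r for p, r in zip(predictions, references))
        self.total += len(references)

    def compute(self):
        return {"accuracy": self.correct / self.total}


running = RunningAccuracy()
# Two batches of different sizes, as happens with a final partial batch.
running.add_batch(predictions=[1, 0, 1], references=[1, 0, 0])
running.add_batch(predictions=[1], references=[1])
result = running.compute()
print(result)
```

Accumulating counts gives 3 correct out of 4. Averaging the two per-batch accuracies (2/3 and 1) would instead give about 0.83, overweighting the small final batch.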

### 4.3. Accelerate

With the [Accelerate library](https://github.com/huggingface/accelerate), the same training loop can run on multiple GPUs. The code below shows what changes.

* Original training loop

```python
import torch
from torch.optim import AdamW  # transformers.AdamW is deprecated
from tqdm.auto import tqdm
from transformers import AutoModelForSequenceClassification, get_scheduler

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
optimizer = AdamW(model.parameters(), lr=3e-5)

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps
)

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
```

* Training loop with Accelerate

```diff
+ from accelerate import Accelerator
  from torch.optim import AdamW
  from transformers import AutoModelForSequenceClassification, get_scheduler

+ accelerator = Accelerator()

  model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
  optimizer = AdamW(model.parameters(), lr=3e-5)

- device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
- model.to(device)

+ train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare(
+     train_dataloader, eval_dataloader, model, optimizer
+ )

  num_epochs = 3
  num_training_steps = num_epochs * len(train_dataloader)
  lr_scheduler = get_scheduler(
      "linear",
      optimizer=optimizer,
      num_warmup_steps=0,
      num_training_steps=num_training_steps
  )

  progress_bar = tqdm(range(num_training_steps))

  model.train()
  for epoch in range(num_epochs):
      for batch in train_dataloader:
-         batch = {k: v.to(device) for k, v in batch.items()}
          outputs = model(**batch)
          loss = outputs.loss
-         loss.backward()
+         accelerator.backward(loss)

          optimizer.step()
          lr_scheduler.step()
          optimizer.zero_grad()
          progress_bar.update(1)
```

* Before running, Accelerate has to be configured, and the script has to be launched through it.

```
accelerate config            # answer the prompts once to describe your setup
accelerate launch train.py   # run the training script on that setup
```

* In Jupyter notebooks

```python
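from accelerate import notebook_launcher

# training_function is assumed to be a function you define that wraps the
# training code above (model creation, accelerator.prepare, training loop).
notebook_launcher(training_function)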
```

{% embed url="<https://github.com/huggingface/accelerate/tree/main/examples>" %}

## Related source links

{% embed url="<https://colab.research.google.com/drive/1XL-koWW9BgUQjtpGBBnjixBtlEBr4BSC?usp=sharing>" %}

