Ch2. Using Transformers

1. Introduction

How to use tokenizers and models to replicate the pipeline API's behavior
How to load and save models and tokenizers
Different tokenization approaches, such as word-based, character-based, and subword-based
How to handle multiple sentences of varying lengths

The Transformers library was designed to be easy to use and flexible.

In this chapter we learn how to use models and tokenizers from start to finish.

2. Behind the pipeline

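The pipeline consists of three steps: the raw text is preprocessed by a tokenizer, the result is passed through the model, and the predictions are post-processed into the final output. Let's look at each step.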

2.1. Preprocessing with a Tokenizer

A model cannot process raw sentences directly. The raw text is therefore first converted into tokens, a step called tokenization. The tokenizer is responsible for:

Splitting the input into words, subwords, or symbols (like punctuation) that are called tokens
Mapping each token to an integer
Adding additional inputs that may be useful to the model
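
The three steps above can be sketched with a toy tokenizer. This is only an illustration: the vocabulary, the ID values, and the toy_encode function are made up for this example and are not what AutoTokenizer does internally.

```python
# Toy illustration of the three preprocessing steps.
# The vocabulary and IDs are hypothetical, not those of a real checkpoint.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "i": 4, "hate": 5, "this": 6, "so": 7, "much": 8, "!": 9}


def toy_encode(text):
    # Step 1: split the input into tokens (words and punctuation).
    tokens = text.lower().replace("!", " !").split()
    # Step 2: map each token to an integer, falling back to [UNK].
    ids = [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]
    # Step 3: add extra inputs the model may need (special tokens, attention mask).
    input_ids = [TOY_VOCAB["[CLS]"]] + ids + [TOY_VOCAB["[SEP]"]]
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}


encoded = toy_encode("I hate this so much!")
print(encoded)
```

A real tokenizer uses a learned subword vocabulary, so its IDs will differ, but the returned dictionary has the same two keys as the output shown in this section.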

The preprocessing must be exactly the same as the one used when the model was pretrained, so this information has to be downloaded from the Model Hub. See the example code below.

from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

The from_pretrained method of the AutoTokenizer class fetches the tokenizer that belongs to the given checkpoint.

raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.", 
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

{
'input_ids': 
  tensor([
    [  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 
     17662, 12172,  2607,  2026,  2878,  2166,  1012,   102],
    [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,
         0,     0,     0,     0,     0,     0,     0,     0]
  ]), 
'attention_mask': 
  tensor([
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
  ])
}

With the tokenizer created this way, raw text can be tokenized easily.

2.2. Going through the model

As with the tokenizer, a pretrained model can be downloaded.

from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

2.3. A high-dimensional vector?

Batch size: the number of sequences processed at a time
Sequence length: the length of the numerical representation of the sequence
Hidden size: the vector dimension of each model input

outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

torch.Size([2, 16, 768])

The last dimension, the hidden size, is usually large (768 here), which is why the output is called high-dimensional.

2.4. Model heads: Making sense out of numbers

[Did not fully understand this part]

from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)

print(outputs.logits.shape)

torch.Size([2, 2])
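
One way to picture a head: it takes the high-dimensional hidden states produced by the base model and projects them down to one score (logit) per label, which is why the shape goes from [2, 16, 768] to [2, 2]. The sketch below is a hand-written linear layer in plain Python applied to the first token's vector. The tiny sizes (3 tokens, hidden size 4, 2 labels) and all weight values are made up for illustration; a real head is learned during training and may contain more than one layer.

```python
# Hypothetical shapes: batch of 2 sequences, 3 tokens each, hidden size 4.
hidden_states = [
    [[0.5, -1.0, 0.0, 2.0], [0.1, 0.2, 0.3, 0.4], [1.0, 1.0, 1.0, 1.0]],
    [[-2.0, 0.5, 1.0, 0.0], [0.0, 0.0, 0.0, 0.0], [0.3, 0.3, 0.3, 0.3]],
]

# Made-up head parameters: one weight row and one bias per label.
weights = [[1.0, 0.0, -1.0, 0.5], [-1.0, 0.5, 1.0, 0.0]]
bias = [0.1, -0.1]


def classification_head(batch):
    logits = []
    for sequence in batch:
        first_token = sequence[0]  # vector of the first token in the sequence
        logits.append([
            sum(w * x for w, x in zip(row, first_token)) + b
            for row, b in zip(weights, bias)
        ])
    return logits


logits = classification_head(hidden_states)
print(len(logits), len(logits[0]))  # batch size, number of labels
```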

2.5. Postprocessing the output

The model outputs logits, not probabilities. Applying softmax to the logits gives the probability of each label.

import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

tensor([[4.0195e-02, 9.5980e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward>)

model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}

That is, for the first sentence the label probabilities are 'NEGATIVE': 0.0402 and 'POSITIVE': 0.9598; for the second sentence they are 'NEGATIVE': 0.9995 and 'POSITIVE': 0.0005.
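
The postprocessing step can also be written out by hand to see what softmax does. The sketch below uses plain Python instead of torch; the logit values are assumed for illustration, and only the id2label mapping is taken from the output above.

```python
import math

# Illustrative logits for two sentences and two labels (assumed values).
logits = [[-1.5607, 1.6123], [4.1692, -3.3464]]
id2label = {0: "NEGATIVE", 1: "POSITIVE"}


def softmax(row):
    # Subtract the max for numerical stability before exponentiating.
    top = max(row)
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


labels = []
for row in logits:
    probs = softmax(row)
    best = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
    labels.append(id2label[best])
    print(id2label[best], round(probs[best], 4))
```

Each row of probabilities sums to 1, and the predicted label is simply the index with the highest probability looked up in id2label.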

3. Models

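Let's look more closely at creating and using a model. The AutoModel class used so far can instantiate a model from any checkpoint; in this section we use the BERT classes directly.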

3.1. Creating a Transformer

To initialize a BERT model, first create a configuration object and then build the model from it.

from transformers import BertConfig, BertModel

# Building the config
config = BertConfig()

# Building the model from the config
model = BertModel(config)

# Inspect the configuration
print(config)

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.10.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

As shown above, the config holds the attributes used to build the model, such as hidden_size and num_hidden_layers.

3.2. Different loading methods

Creating a model as in 3.1 gives a model initialized with random values. By loading a pretrained model instead, we get better performance and save training time.

from transformers import BertModel

model = BertModel.from_pretrained("bert-base-cased")

The model is now initialized with the weights of the "bert-base-cased" checkpoint.

3.3. Saving methods

model.save_pretrained("directory_on_my_computer")

config.json: the attributes necessary to build the model architecture, plus metadata such as where the checkpoint originated
pytorch_model.bin: the state dictionary, which contains all your model's weights

3.4. Using a Transformer model for inference

Encode the text with the tokenizer
Convert the encoded IDs to a tensor
Get the output by calling the model with the inputs

4. Tokenizers

Tokenization approaches were also covered in Boostcamp, so here we only briefly review them and how to use them.

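Word-based
- Splits the text into words
- "Jim Henson was a puppeteer" => [Jim, Henson, was, a, puppeteer]
- Many words end up as the unknown token ([UNK]) because they are not in the vocabulary.
Character-based
- Splits the text into individual characters
- "Jim Henson was a puppeteer" => [J, i, m, ..., t, e, e, r]
- The meaning carried by whole words is lost. (In Chinese each character carries meaning, so this approach can perform well.)
Subword tokenization
- Splits words into meaningful subword units
- "Let's do tokenization!" => [Let's, do, token, ization, !]
- This reduces the number of [UNK] tokens while still preserving the meaning of words.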

Specific techniques in use include Byte-level BPE (GPT-2), WordPiece (BERT), and SentencePiece or Unigram.
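
The three approaches can be compared on the "Jim Henson" sentence with a few lines of plain Python. This is a rough sketch: the subword vocabulary is hypothetical, and the greedy longest-match split only imitates the spirit of WordPiece rather than reproducing any real tokenizer.

```python
text = "Jim Henson was a puppeteer"

# Word-based: split on whitespace.
word_tokens = text.split()

# Character-based: every character (here excluding spaces) is a token.
char_tokens = [ch for ch in text if ch != " "]

# Hypothetical subword vocabulary; "##" marks a piece that continues a word.
SUBWORD_VOCAB = {"Jim", "Hen", "##son", "was", "a", "puppet", "##eer"}


def subword_tokenize(word):
    # Greedy longest-match-first split of a single word.
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in SUBWORD_VOCAB:
                pieces.append(piece)
                break
            end -= 1
        else:
            return ["[UNK]"]  # no vocabulary piece matched at this position
        start = end
    return pieces


subword_tokens = [p for w in word_tokens for p in subword_tokenize(w)]
print(word_tokens)
print(len(char_tokens))
print(subword_tokens)
```

Word-based tokenization gives short sequences but needs a huge vocabulary, character-based tokenization gives a tiny vocabulary but long sequences, and the subword split sits in between.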

4.1. Loading and Saving

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

tokenizer("Using a Transformer network is simple")

{'input_ids': [101, 7993, 170, 11303, 1200, 2443, 1110, 3014, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

# Saving Tokenizer
tokenizer.save_pretrained("directory_on_my_computer")

4.2. Encoding

STEP1) Split the text into tokens (tokenization)
STEP2) Convert the tokens to input IDs using the vocabulary

4.3. Tokenization

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)

print(tokens)

['Using', 'a', 'transform', '##er', 'network', 'is', 'simple']

4.4. From tokens to input IDs

Convert the tokens to integer IDs.

ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)

[7993, 170, 11303, 1200, 2443, 1110, 3014]

4.5. Decoding

decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
print(decoded_string)

'Using a Transformer network is simple'
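
Decoding does more than look each ID up in the vocabulary: pieces that belong to the same word have to be glued back together. A minimal sketch of that idea follows, assuming a made-up ID table and the "##" continuation marker seen in the tokens above; the real decode method handles many more cases.

```python
# Hypothetical ID-to-token table; real checkpoints ship their own vocabulary.
ID_TO_TOKEN = {10: "Using", 11: "a", 12: "transform", 13: "##er",
               14: "network", 15: "is", 16: "simple"}


def toy_decode(ids):
    words = []
    for token in (ID_TO_TOKEN[i] for i in ids):
        if token.startswith("##") and words:
            words[-1] += token[2:]  # glue the continuation onto the previous word
        else:
            words.append(token)
    return " ".join(words)


decoded = toy_decode([10, 11, 12, 13, 14, 15, 16])
print(decoded)
```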

5. Handling multiple sequences

5.1. Models expect a batch of inputs

If we write and run code using only what we have learned so far, we get an error.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = torch.tensor(ids)
# This line will fail.
model(input_ids)

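The reason is that the model expects a batch of sequences, i.e. a 2D input_ids tensor (a list of lists of IDs), whereas we passed a single 1D sequence. Wrapping the IDs in another list adds the missing batch dimension: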

input_ids = torch.tensor([ids])

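Above, the conversion was needed because there was only one input sentence, but the idea is easier to grasp with several sentences: the sentences the model should process are collected into a batch, and the batch is passed as the input.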

batched_ids = [ids, ids]
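
The difference between a single sequence and a batch is just one extra dimension. The sketch below checks this with a small hypothetical shape helper on plain Python lists; the three IDs are only sample values.

```python
def shape(nested):
    # Return the dimensions of a regular nested list, e.g. [[1, 2, 3]] -> (1, 3).
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0] if nested else None
    return tuple(dims)


ids = [1045, 5223, 2023]   # one sequence of token IDs (1D)
single_batch = [ids]       # batch containing one sequence (2D)
batched_ids = [ids, ids]   # batch containing two identical sequences (2D)

print(shape(ids), shape(single_batch), shape(batched_ids))
```

Only the two 2D variants have the (batch size, sequence length) layout that the model expects.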

5.2. Padding the inputs

Padding makes all inputs in a batch the same size, because a tensor must be rectangular.

batched_ids = [
  [200, 200, 200],
  [200, 200]
]
# After padding the shorter sequence with a padding ID:
padding_id = 100

batched_ids = [
  [200, 200, 200],
  [200, 200, padding_id]
]

In practice, use tokenizer.pad_token_id as the padding ID.
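
The padding step itself can be written as a small helper. This is a sketch in plain Python; the pad_batch name is hypothetical, and pad_id=0 is an assumption chosen to match the zeros in the padded output shown in 2.1.

```python
def pad_batch(sequences, pad_id):
    # Pad every sequence on the right so all rows have the longest length.
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (longest - len(seq)) for seq in sequences]


padded = pad_batch([[200, 200, 200], [200, 200]], pad_id=0)
print(padded)
```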

model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]
batched_ids = [[200, 200, 200], [200, 200, tokenizer.pad_token_id]]

print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)
print(model(torch.tensor(batched_ids)).logits)

tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward>)
tensor([[ 1.5694, -1.3895],
        [ 1.3373, -1.2163]], grad_fn=<AddmmBackward>)

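Note that the logits for the second sequence in the batch differ from the logits obtained when that sequence was passed on its own. The attention layers contextualize every token, so the padding token also influences the result. To get the same values, use an attention mask.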

5.3. Attention Masks

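An attention mask is a tensor with the same shape as the input IDs, filled with 0s and 1s. A 1 marks a token that should be attended to, and a 0 marks a token (such as padding) that the attention layers should ignore so that it does not affect the result.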

batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id]
]

attention_mask = [
  [1, 1, 1],
  [1, 1, 0]
]

outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)
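
Why the mask restores the original result can be seen in a simplified view of attention: masked positions are given a very large negative score (negative infinity in this sketch) before the softmax, so their weight becomes zero. The code below is plain Python with made-up scores and is a simplification, not the library's implementation. It assumes at least one position is unmasked.

```python
import math


def masked_softmax(scores, mask):
    # Positions with mask 0 get -inf, so exp() turns them into weight 0.
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    top = max(masked)
    exps = [math.exp(s - top) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up attention scores for two real tokens and one padding token.
with_padding = masked_softmax([2.0, 1.0, 0.5], mask=[1, 1, 0])
without_padding = masked_softmax([2.0, 1.0], mask=[1, 1])
print(with_padding)
print(without_padding)
```

The weights of the two real tokens are the same in both cases, which is the reason the padded sequence and the unpadded sequence produce matching outputs once the mask is supplied.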

5.4. Longer sequences

When a sequence is too long for the model, either use a model that supports longer sequences or truncate the sequence to the maximum length.
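
Truncation amounts to slicing the list of IDs. A minimal sketch, assuming a limit of 512 tokens and a made-up sequence of 600 IDs; unlike a real tokenizer, this version does not reserve room for special tokens.

```python
def truncate(ids, max_length):
    # Keep only the first max_length token IDs.
    return ids[:max_length]


# Hypothetical long sequence of 600 IDs and an assumed limit of 512.
long_ids = list(range(600))
short_ids = truncate(long_ids, max_length=512)
print(len(short_ids))
```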

6. Putting it all together

6.1. Tokenizer

from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)

sequences = [
  "I've been waiting for a HuggingFace course my whole life.",
  "So have I!"
]

model_inputs = tokenizer(sequences)

# Will pad the sequences up to the maximum sequence length
model_inputs = tokenizer(sequences, padding="longest")

# Will pad the sequences up to the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, padding="max_length")

# Will pad the sequences up to the specified max length
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)

sequences = [
  "I've been waiting for a HuggingFace course my whole life.",
  "So have I!"
]

# Will truncate the sequences that are longer than the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, truncation=True)

# Will truncate the sequences that are longer than the specified max length
model_inputs = tokenizer(sequences, max_length=8, truncation=True)

sequences = [
  "I've been waiting for a HuggingFace course my whole life.",
  "So have I!"
]

# Returns PyTorch tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")

# Returns TensorFlow tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="tf")

# Returns NumPy arrays
model_inputs = tokenizer(sequences, padding=True, return_tensors="np")

6.2. Special tokens

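sequence = "I've been waiting for a HuggingFace course my whole life."

# Calling the tokenizer directly adds the special tokens the model expects
model_inputs = tokenizer(sequence)
print(model_inputs["input_ids"])

# tokenize + convert_tokens_to_ids does not add special tokens
tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

# Decoding shows the difference: the first string includes [CLS] and [SEP]
print(tokenizer.decode(model_inputs["input_ids"]))
print(tokenizer.decode(ids))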

6.3. All in one

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = [
  "I've been waiting for a HuggingFace course my whole life.",
  "So have I!"
]

tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)