Ch1. Transformer models

0. Setup

!pip install transformers # light version
!pip install transformers[sentencepiece] # ✔ development version

import transformers

1. Introduction

In this course you learn how to use Hugging Face's Transformers, Datasets, Tokenizers, and Accelerate libraries, as well as the Hugging Face Hub.

The first part, Ch. 1-4, covers how to load Transformer models from the Hugging Face Hub and use them.

How to use the pipeline() function to solve NLP tasks such as text generation and classification
About the Transformer architecture
How to distinguish between encoder, decoder and encoder-decoder architectures and use cases

2. Natural Language Processing (NLP)

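NLP is a field of machine learning and linguistics concerned with understanding everything related to human language. Its goal goes beyond understanding individual words to grasping the meaning that a sentence carries.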

Classifying whole sentences
Classifying each word in a sentence
Generating text content
Extracting an answer from a text
Generating a new sentence from an input text

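NLP covers a variety of tasks like those above, and it also extends to applications such as speech recognition and, in computer vision, generating a description of an image.

Computers do not process information the way people do. For example, a person can easily see how the two sentences 'I am hungry' and 'I am sad' are similar, but this is very hard for an ML model. We will look at models built to learn language this complex.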

3. Transformers, what can they do?

Many fields and companies already use Hugging Face and Transformer models. With the Transformers library you can create models yourself and share them.

pipeline

It connects a model with its necessary preprocessing and postprocessing steps, allowing us to directly input any text and get an intelligible answer.

Let's try several tasks with models loaded through pipeline.

With a sentiment-analysis pipeline, simply passing an input returns the result.
If no model is specified, the default model (distilbert-base-uncased-finetuned-sst-2-english) is used.
The model is downloaded and loaded into memory when the classifier object is created.
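
The pipeline call that this passage describes is not shown on this page, so the following is a minimal sketch of what such a call can look like, assuming the transformers package is installed and the machine can download the default checkpoint. The helper names build_classifier and label_of are illustrative and not part of the library.

```python
def build_classifier():
    """Create a sentiment-analysis pipeline (downloads the default model on first use)."""
    # Imported lazily so that label_of below has no hard dependency on transformers.
    from transformers import pipeline
    return pipeline("sentiment-analysis")


def label_of(classifier, text):
    """Return (label, score) for a single input string."""
    # For one input string the pipeline returns a list holding one dict
    # with a "label" and a "score" entry.
    result = classifier(text)[0]
    return result["label"], round(result["score"], 4)
```

Calling label_of(build_classifier(), "I've been waiting for a HuggingFace course my whole life.") returns a (label, score) pair; the exact score depends on the checkpoint and library version.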

A pipeline runs through the following three steps.

The text is preprocessed into a format the model can understand.
The preprocessed inputs are passed to the model.
The predictions of the model are post-processed, so you can make sense of them.
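
The three steps above can be sketched with a toy example. This is not how the library implements pipeline; the vocabulary, the per-token scores, and every function name below are made up purely to show where preprocessing, the model call, and post-processing sit relative to each other.

```python
import math

# Toy stand-ins for the three stages; every name and number here is hypothetical.
VOCAB = {"[UNK]": 0, "i": 1, "love": 2, "hate": 3, "this": 4}
LABELS = ["NEGATIVE", "POSITIVE"]
# Per-token contribution to the (negative, positive) logits.
TOKEN_LOGITS = {2: (-2.0, 2.0), 3: (2.0, -2.0)}


def preprocess(text):
    """Step 1: turn raw text into token ids the model can consume."""
    return [VOCAB.get(word, VOCAB["[UNK]"]) for word in text.lower().split()]


def model(token_ids):
    """Step 2: map token ids to raw scores (logits), one per label."""
    neg = sum(TOKEN_LOGITS.get(t, (0.0, 0.0))[0] for t in token_ids)
    pos = sum(TOKEN_LOGITS.get(t, (0.0, 0.0))[1] for t in token_ids)
    return [neg, pos]


def postprocess(logits):
    """Step 3: convert logits to probabilities (softmax) and pick a readable label."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    probs = [e / sum(exps) for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


def toy_pipeline(text):
    """Chain the three steps, mirroring the order described above."""
    return postprocess(model(preprocess(text)))
```

With these made-up scores, toy_pipeline("I love this") yields the POSITIVE label and toy_pipeline("I hate this") yields NEGATIVE.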

3.1. Zero-shot classification

Sentence classification task, using candidate labels that you choose at inference time
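
A minimal sketch of zero-shot classification, assuming transformers is installed and the default checkpoint for this task can be downloaded. The helper names build_zero_shot and rank_labels are illustrative.

```python
def build_zero_shot():
    """Create a zero-shot-classification pipeline (downloads a default model)."""
    from transformers import pipeline  # lazy import keeps rank_labels dependency-free
    return pipeline("zero-shot-classification")


def rank_labels(classifier, text, candidate_labels):
    """Return (label, score) pairs, highest score first."""
    # The labels are supplied at call time through candidate_labels; the result
    # is a dict whose "labels" and "scores" lists are aligned with each other.
    result = classifier(text, candidate_labels=candidate_labels)
    pairs = zip(result["labels"], result["scores"])
    # Sorted defensively so the ordering does not rely on the pipeline's own.
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)
```

For example, rank_labels(build_zero_shot(), "This is a course about the Transformers library", ["education", "politics", "business"]) ranks the three labels for that sentence.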

3.2. Text generation

Text generation task

3.3. Using any model from the Hub in a pipeline

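Load a model from the Hugging Face Hub and use it in a pipeline
Try the num_return_sequences and max_length parameters
- num_return_sequences : number of sequences to generate
- max_length : maximum length of each generated sequence, in tokens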
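
A minimal sketch of the two parameters in use, assuming transformers is installed and that the distilgpt2 checkpoint (chosen here only as a small example model) can be downloaded from the Hub. The helper names are illustrative.

```python
def build_generator(model_name="distilgpt2"):
    """Create a text-generation pipeline for a specific Hub checkpoint."""
    from transformers import pipeline  # lazy import keeps generate_texts dependency-free
    return pipeline("text-generation", model=model_name)


def generate_texts(generator, prompt, max_length=30, num_return_sequences=2):
    """Return the generated strings for one prompt."""
    outputs = generator(
        prompt,
        max_length=max_length,                      # cap on the length of each sequence
        num_return_sequences=num_return_sequences,  # how many sequences to generate
    )
    # For a single prompt the pipeline returns a list of dicts,
    # each with a "generated_text" entry.
    return [output["generated_text"] for output in outputs]
```

Generation involves sampling, so generate_texts(build_generator(), "In this course, we will teach you how to") generally returns different text on each run.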

3.4. Mask filling

Fill in the blank marked by a mask token (<mask>, [MASK])
top_k : number of candidates to return
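
A minimal sketch of mask filling with top_k, assuming transformers is installed and that the distilroberta-base checkpoint (an example choice) can be downloaded. Which mask token to write depends on the checkpoint: RoBERTa-style models use <mask> and BERT-style models use [MASK]. The helper names are illustrative.

```python
def build_unmasker(model_name="distilroberta-base"):
    """Create a fill-mask pipeline for a specific Hub checkpoint."""
    from transformers import pipeline  # lazy import keeps top_tokens dependency-free
    return pipeline("fill-mask", model=model_name)


def top_tokens(unmasker, text, top_k=2):
    """Return (token, score) pairs for the top_k candidate fillings."""
    # Each candidate is a dict with entries such as "token_str" and "score".
    candidates = unmasker(text, top_k=top_k)
    return [(c["token_str"].strip(), c["score"]) for c in candidates]
```

For example, top_tokens(build_unmasker(), "This course will teach you all about <mask> models.") returns the two most likely fillings; unmasker.tokenizer.mask_token shows which mask token a loaded pipeline expects.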

3.5. Named entity recognition

Classification of named entities (person, location, organization)
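
A minimal sketch of named entity recognition, assuming a recent transformers release (one that accepts the aggregation_strategy argument) and a downloadable default checkpoint. The helper names are illustrative.

```python
def build_ner():
    """Create an NER pipeline that merges word pieces into whole entities."""
    from transformers import pipeline  # lazy import keeps entities_by_type dependency-free
    return pipeline("ner", aggregation_strategy="simple")


def entities_by_type(ner, text):
    """Group the recognized entity strings by their predicted type."""
    grouped = {}
    for entity in ner(text):
        # With aggregation enabled, each entity dict carries an "entity_group"
        # (for example PER, ORG, LOC) and the matched "word".
        grouped.setdefault(entity["entity_group"], []).append(entity["word"])
    return grouped
```

For example, entities_by_type(build_ner(), "My name is Sylvain and I work at Hugging Face in Brooklyn.") groups the detected names by person, organization, and location.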

3.6. Question Answering

Task of finding the answer to a question within a given context passage
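
A minimal sketch of question answering, assuming transformers is installed and the default checkpoint can be downloaded. This kind of pipeline extracts the answer from the supplied context rather than generating it, which the start and end offsets make visible. The helper names are illustrative.

```python
def build_qa():
    """Create a question-answering pipeline (downloads a default model)."""
    from transformers import pipeline  # lazy import keeps extract_answer dependency-free
    return pipeline("question-answering")


def extract_answer(qa, question, context):
    """Return the answer text and the context slice that the offsets point to."""
    result = qa(question=question, context=context)
    # "start" and "end" are character offsets into the context string.
    span = context[result["start"]:result["end"]]
    return result["answer"], span
```

For example, extract_answer(build_qa(), "Where do I work?", "My name is Sylvain and I work at Hugging Face in Brooklyn") returns the answer together with the matching slice of the context.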

3.7. Summarization

from transformers import pipeline

summarizer = pipeline("summarization")
summarizer("""
    America has changed dramatically during recent years. Not only has the number of 
    graduates in traditional engineering disciplines such as mechanical, civil, 
    electrical, chemical, and aeronautical engineering declined, but in most of 
    the premier American universities engineering curricula now concentrate on 
    and encourage largely the study of engineering science. As a result, there 
    are declining offerings in engineering subjects dealing with infrastructure, 
    the environment, and related issues, and greater concentration on high 
    technology subjects, largely supporting increasingly complex scientific 
    developments. While the latter is important, it should not be at the expense 
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other 
    industrial countries in Europe and Asia, continue to encourage and advance 
    the teaching of engineering. Both China and India, respectively, graduate 
    six and eight times as many traditional engineers as does the United States. 
    Other industrial countries at minimum maintain their output, while America 
    suffers an increasingly serious decline in the number of engineering graduates 
    and a lack of well-educated engineers.
""")

Text summarization

3.8. Translation

from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")

[{'translation_text': 'This course is produced by Hugging Face.'}]

Translation task

4. How do Transformers work?

4.1. Transformers are language models

Language modeling here refers to the task of predicting the word that follows a given sequence of words.
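
Next-word prediction can be illustrated with a count-based bigram model. This is a deliberately tiny stand-in and not a Transformer: it only counts which word follows which in a made-up corpus, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count, for every word, which words follow it in the corpus."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def predict_next(counts, word):
    """Return the most frequent follower of word, or None if the word is unseen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


corpus = ["I am hungry", "I am sad", "I am hungry now"]
counts = train_bigram(corpus)
print(predict_next(counts, "am"))  # "hungry" follows "am" twice, "sad" once
```

A Transformer language model plays the same role as predict_next, but it conditions on the whole preceding context instead of a single previous word.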

Unfortunately, training a model, especially a large one, requires a large amount of data. This becomes very costly in terms of time and compute resources.

Training from scratch on new data every single time would be an enormous waste.

4.2. Transfer Learning

Transfer learning applies Model A, trained on Task A, to Task B. This has been found to give better performance than training for Task B from scratch.

Fine-tuning is the training done after a model has been pretrained. To perform fine-tuning, you first acquire a pretrained language model, then perform additional training with a dataset specific to your task.

As a result, fine-tuning can reach good results with less time, fewer resources, and less data, and it also makes it practical to experiment with different hyperparameters.
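
The benefit of starting from pretrained weights can be shown with a toy numeric example. This is a hypothetical one-variable linear regression and not a language model: the two tasks, the learning rate, and the step counts are invented for illustration, but the pattern (pretrain on a large related task, then train briefly on a small one) is the same idea.

```python
def mse(w, b, data):
    """Mean squared error of the line y = w * x + b on the data."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def train(w, b, data, steps, lr=0.05):
    """Plain gradient descent on the mean squared error."""
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / len(data)
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b


# "Task A": the pretraining task, y = 2x.
task_a = [(x, 2.0 * x) for x in (0.0, 1.0, 2.0, 3.0)]
# "Task B": a related downstream task with little data, y = 2x + 0.5.
task_b = [(x, 2.0 * x + 0.5) for x in (0.0, 1.0, 2.0)]

pretrained = train(0.0, 0.0, task_a, steps=200)    # long training on Task A
fine_tuned = train(*pretrained, task_b, steps=10)  # short extra training on Task B
from_scratch = train(0.0, 0.0, task_b, steps=10)   # same budget, no pretraining

print(mse(*fine_tuned, task_b), mse(*from_scratch, task_b))
```

With the same ten training steps on Task B, the run that starts from the Task A weights ends with a lower error than the run that starts from zero.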

4.3. General architecture

The basic structure of the Transformer is written up in Boostcamp - [U Stage] NLP - (07-08강) Transformer, so refer to those notes.

The Transformer is split into an Encoder and a Decoder: the Encoder processes the input data and passes that information to the Decoder, which generates the output data.

4.4. Attention layers

This is also covered well in the Boostcamp notes, so refer to them.

Attention was introduced as a way to relieve problems such as the bottleneck in Seq2Seq models. The Transformer drops the RNN part of Seq2Seq and builds the model from Attention alone, and this technique played a large role in improving performance.
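
As a rough illustration of an attention computation without any RNN, here is a pure-Python sketch of scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, the form commonly associated with the Transformer. It is a single head with no learned projections, and the function names are illustrative.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the weighted average of the value vectors.
        dim = len(values[0])
        outputs.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)])
    return outputs


out = attention([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
print(out)  # the query matches the first key more, so the first value gets more weight
```

Every query looks at every key in one step, so no information has to be squeezed through a single fixed-size hidden state.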

4.5. The original architecture

4.6. Architectures vs. checkpoints

Architecture: This is the skeleton of the model - the definition of each layer and each operation that happens within the model.
Checkpoints: These are the weights that will be loaded in a given architecture.
Model: This is an umbrella term that isn't as precise as "architecture" or "checkpoint": it can mean both.

For example, BERT is an architecture, and bert-base-cased is a checkpoint.
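
The distinction can be made concrete with a short sketch, assuming transformers and PyTorch are installed. Building a model from a configuration gives the architecture with randomly initialized weights, while from_pretrained also loads a checkpoint into it. The two function names are illustrative.

```python
def random_init_bert():
    """Architecture only: a BERT model with randomly initialized weights."""
    from transformers import BertConfig, BertModel  # lazy import, needs PyTorch
    return BertModel(BertConfig())


def pretrained_bert(checkpoint="bert-base-cased"):
    """Architecture plus checkpoint: BERT with trained weights loaded from the Hub."""
    from transformers import BertModel  # lazy import, needs PyTorch
    return BertModel.from_pretrained(checkpoint)
```

Both functions return a model with the BERT architecture; only the second one downloads the bert-base-cased weights and is useful without further training.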

5. Encoder models

These models are built from only the Encoder part of the Transformer; BERT is the best-known example. Encoder models have the following characteristics.

Bi-directional: attends to context in both directions, so it captures the flow of context better.
Good at extracting meaningful information
Performs well on sequence classification, question answering, and masked language modeling.
NLU: Natural Language Understanding
BERT, RoBERTa, ALBERT

These models appear to be strong at understanding context.

6. Decoder models

If an Encoder model uses only the Encoder part of the Transformer, a Decoder model uses only the Decoder part. GPT is the best-known example.

Great at causal tasks: generating sequences
- Guessing the next word in a sentence
NLG: Natural Language Generation
GPT-2, GPT Neo
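
The difference between the bi-directional encoder and the next-word-guessing decoder can be pictured as an attention mask. The sketch below is illustrative pure Python, with 1 meaning "position i may attend to position j"; real implementations apply such masks inside the attention layers.

```python
def bidirectional_mask(n):
    """Encoder-style mask: every position may attend to every other position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style mask: position i may attend only to positions j <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


for row in causal_mask(4):
    print(row)
```

Because a decoder never sees later positions, it can be trained to guess the next word; an encoder sees the whole sentence at once, which suits understanding tasks.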

7. Sequence-to-sequence models

Encoder-Decoder models (= seq-to-seq models) use both the Encoder and the Decoder of the Transformer. They perform well on tasks that produce a new sentence from a given one, such as summarization, translation, or generative question answering.

Sequence to Sequence Tasks: many to many (Translations, Summarization)
Weights are not necessarily shared across the encoder and decoder
input distribution different from output distribution
BART, mBART, Marian, T5

8. Bias and limitations

Keep in mind a limitation of pretrained models: because they are trained on large amounts of scraped data, some of that data is undesirable.

from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
result = unmasker("This man works as a [MASK].")
print([r["token_str"] for r in result])

result = unmasker("This woman works as a [MASK].")
print([r["token_str"] for r in result])

['lawyer', 'carpenter', 'doctor', 'waiter', 'mechanic']
['nurse', 'waitress', 'teacher', 'maid', 'prostitute']

In the example above, the model lists occupations stereotypically associated with each gender. This happens even though BERT was trained on data that seems trustworthy, namely English Wikipedia and BookCorpus.
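
One simple way to make the gap concrete is to compare the two printed lists directly. The sketch below copies the outputs shown above into plain lists, so it runs without loading the model; the helper name shared is illustrative.

```python
# Predictions copied from the output printed above.
man_jobs = ["lawyer", "carpenter", "doctor", "waiter", "mechanic"]
woman_jobs = ["nurse", "waitress", "teacher", "maid", "prostitute"]


def shared(a, b):
    """Return the items that appear in both lists, sorted alphabetically."""
    return sorted(set(a) & set(b))


print(shared(man_jobs, woman_jobs))  # an empty list means no occupation is common to both
```

The two prompts differ by a single word, yet the top five predictions have no occupation in common.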

9. Summary

