(Lecture 04) LSTM and GRU

This lecture covers LSTM and GRU and how they differ from the vanilla RNN.
Further Reading
Understanding LSTM Networks
Further Question
Besides BPTT, is there a way to mitigate the gradient vanishing/exploding problem while keeping the RNN/LSTM/GRU architecture?
In an RNN/LSTM/GRU-based language model, is there a way to ease the difficulty of carrying information from the early time steps?

※ Problems with the vanilla RNN

It is hard to use context from states that lie far in the past relative to the current node.
As that distance grows, gradients vanish or explode.
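
The second problem can be seen in a toy setting. The sketch below assumes a scalar, linear recurrence $h_t = w \cdot h_{t-1}$ (an illustrative simplification that is not part of the lecture), for which the chain rule gives $\partial h_T / \partial h_0 = w^T$. The function name `linear_rnn_gradient` is hypothetical.

```python
def linear_rnn_gradient(w: float, steps: int) -> float:
    """Return d h_T / d h_0 for the toy recurrence h_t = w * h_{t-1}."""
    grad = 1.0
    for _ in range(steps):
        grad *= w  # chain rule: every time step contributes one factor of w
    return grad


# The same weight is multiplied in at every step, so the gradient
# shrinks geometrically for |w| < 1 and grows geometrically for |w| > 1.
for w in (0.5, 1.0, 1.5):
    print(f"w={w}: gradient after 20 steps = {linear_rnn_gradient(w, 20):.3e}")
```

Only a weight of exactly 1 leaves the gradient unchanged over 20 steps; anything smaller or larger drifts toward zero or blows up.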

1. LSTM (Long Short-Term Memory)

Core idea: pass the cell state information straight through time, with no transformation other than elementwise gating
The name means that short-term memory is carried over a long span

1.1. Basic Structure

i : input gate, whether to write to cell
f : forget gate, whether to erase cell
o : output gate, how much to reveal cell
g : gate gate, how much to write to cell

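As shown in the figure above, the LSTM takes as input not only $x_t$ and $h_{t-1}$, the inputs of the vanilla RNN, but also the cell state $C_{t-1}$. The cell state summarizes the time steps processed so far, that is, information about the past words.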

Forget Gate

$f_t = \sigma(W_f\ \cdot \ [h_{t-1}, x_t] + b_f)$
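Decides which part of the information carried over from the past to drop. Each component of $f_t$ lies in $(0, 1)$ and scales the matching component of $C_{t-1}$: values near 0 erase it, values near 1 keep it.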
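
A minimal sketch of the forget gate acting as a soft mask on the previous cell state. All numbers (`W_f`, `b_f`, `h_prev`, `x_t`, `C_prev`) are made up for illustration, and plain Python lists stand in for vectors and matrices.

```python
import math


def sigmoid(a: float) -> float:
    return 1.0 / (1.0 + math.exp(-a))


def forget_gate(W_f, b_f, h_prev, x_t):
    """f_t = sigmoid(W_f . [h_prev, x_t] + b_f), one value per cell unit."""
    concat = h_prev + x_t  # list concatenation plays the role of [h_{t-1}, x_t]
    return [
        sigmoid(sum(w * v for w, v in zip(row, concat)) + b)
        for row, b in zip(W_f, b_f)
    ]


# Hypothetical sizes: 2 cell units, 1 input feature.
W_f = [[0.0, 0.0, 4.0],    # unit 0: a large x_t pushes the gate toward 1 (keep)
       [0.0, 0.0, -4.0]]   # unit 1: a large x_t pushes the gate toward 0 (erase)
b_f = [0.0, 0.0]
h_prev, x_t, C_prev = [0.1, -0.2], [2.0], [1.0, 1.0]

f_t = forget_gate(W_f, b_f, h_prev, x_t)
kept = [f * c for f, c in zip(f_t, C_prev)]  # elementwise f_t * C_{t-1}
print("f_t  =", f_t)
print("kept =", kept)
```

With these values the first unit keeps almost all of its memory and the second loses almost all of it, which is the "whether to erase cell" behavior listed for `f`.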

Gate Gate (Input Gate)

Generate the information to be added and scale it by the input gate
- $i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$
- $\widetilde{C}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)$
Generate the new cell state by adding the current information to the previous cell state
- $C_t = f_t \cdot C_{t-1} + i_t \cdot \widetilde{C}_t$
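
The cell update can be checked by hand with a few numbers. In the sketch below the gate values are idealized picks (a real sigmoid never reaches exactly 0 or 1), and `update_cell` is a hypothetical helper name.

```python
def update_cell(f_t, i_t, C_tilde, C_prev):
    """C_t = f_t * C_{t-1} + i_t * C~_t, computed elementwise."""
    return [f * c + i * g for f, i, g, c in zip(f_t, i_t, C_tilde, C_prev)]


C_prev = [0.8, 0.8, 0.8]        # previous cell state
C_tilde = [-0.5, -0.5, -0.5]    # candidate values from the tanh branch
f_t = [1.0, 0.0, 0.5]           # keep / erase / keep half
i_t = [0.0, 1.0, 0.5]           # write nothing / write fully / write half

C_t = update_cell(f_t, i_t, C_tilde, C_prev)
print(C_t)
```

Unit 0 keeps its old value 0.8, unit 1 is overwritten with the candidate -0.5, and unit 2 blends the two to roughly 0.15.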

Output Gate

Generate the hidden state by passing the cell state through tanh and the output gate
Pass this hidden state to the next time step and, if needed, to the output or the next layer
- $o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$
- $h_t = o_t \cdot \tanh(C_t)$
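
Putting the gates together, one LSTM time step might look like the sketch below. It is a minimal pure-Python illustration, not a reference implementation: the `params` dictionary layout, the helper names, and the random initialization are assumptions, and the input gate is given its own parameters `W_i`, `b_i`.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def affine(W, b, v):
    """Return W . v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(params, h_prev, C_prev, x_t):
    """One LSTM time step: returns the new hidden state and cell state."""
    concat = h_prev + x_t  # [h_{t-1}, x_t]
    f_t = [sigmoid(a) for a in affine(params["W_f"], params["b_f"], concat)]
    i_t = [sigmoid(a) for a in affine(params["W_i"], params["b_i"], concat)]
    o_t = [sigmoid(a) for a in affine(params["W_o"], params["b_o"], concat)]
    C_tilde = [math.tanh(a) for a in affine(params["W_c"], params["b_c"], concat)]
    # Additive cell update, then expose part of the cell through the output gate.
    C_t = [f * c + i * g for f, c, i, g in zip(f_t, C_prev, i_t, C_tilde)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, C_t)]
    return h_t, C_t


# Hypothetical sizes and randomly initialized parameters, for illustration only.
random.seed(0)
hidden, inputs = 3, 2
params = {}
for name in ("f", "i", "o", "c"):
    params[f"W_{name}"] = [
        [random.uniform(-0.5, 0.5) for _ in range(hidden + inputs)]
        for _ in range(hidden)
    ]
    params[f"b_{name}"] = [0.0] * hidden

h, C = [0.0] * hidden, [0.0] * hidden
for x_t in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    h, C = lstm_step(params, h, C, x_t)
print("h =", h)
print("C =", C)
```

Both `h` and `C` are threaded from one step to the next; only `h` would be handed to an output layer or a stacked layer.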

2. GRU (Gated Recurrent Unit)

A lightweight variant of the LSTM

$z_t = \sigma(W_z \cdot [h_{t-1}, x_t])$
$r_t = \sigma(W_r \cdot [h_{t-1}, x_t])$
$\widetilde{h}_t = \tanh(W \cdot [r_t \cdot h_{t-1}, x_t])$
$h_t = (1-z_t)\cdot h_{t-1} + z_t \cdot \widetilde{h}_t$ 🌟 The two weights, $1-z_t$ and $z_t$, sum to 1.
cf. $C_t = f_t \cdot C_{t-1} + i_t \cdot \widetilde{C}_t$ in LSTM

It is designed to need less memory and to compute faster. An LSTM has both a cell state and a hidden state, whereas a GRU has only a hidden state.

Uses $h_t$ instead of $C_t$
Uses $1 - z_t$, i.e. one minus the input-gate weight, in place of the forget gate
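
The GRU equations translate into a single-state update. The following is a minimal sketch under the same conventions as the formulas above (no bias terms); the `params` layout, helper names, and random initialization are assumptions made for illustration.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def matvec(W, v):
    """Return W . v for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def gru_step(params, h_prev, x_t):
    """One GRU time step: only the hidden state is carried forward."""
    concat = h_prev + x_t  # [h_{t-1}, x_t]
    z_t = [sigmoid(a) for a in matvec(params["W_z"], concat)]  # update gate
    r_t = [sigmoid(a) for a in matvec(params["W_r"], concat)]  # reset gate
    gated = [r * h for r, h in zip(r_t, h_prev)] + x_t         # [r_t * h_{t-1}, x_t]
    h_tilde = [math.tanh(a) for a in matvec(params["W"], gated)]
    # Convex combination: the weights (1 - z_t) and z_t sum to 1.
    return [(1.0 - z) * h + z * g for z, h, g in zip(z_t, h_prev, h_tilde)]


# Hypothetical sizes and randomly initialized parameters, for illustration only.
random.seed(0)
hidden, inputs = 3, 2
params = {
    name: [
        [random.uniform(-0.5, 0.5) for _ in range(hidden + inputs)]
        for _ in range(hidden)
    ]
    for name in ("W_z", "W_r", "W")
}

h = [0.0] * hidden
for x_t in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    h = gru_step(params, h, x_t)
print("h =", h)
```

The step uses three weight matrices, against four for an LSTM step (three gates plus the candidate), and it returns one state instead of two.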

3. Backpropagation in LSTM&GRU

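The needed information is carried forward by addition into the state rather than by repeated multiplication with the same weight matrix, so gradient vanishing and exploding over long sequences are mitigated compared with the vanilla RNN.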
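
The contrast can be written out with the chain rule. The vanilla RNN form $h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b)$ is assumed here, since these notes do not spell it out. The LSTM and GRU lines follow only the direct path through the state and ignore the indirect dependence of the gates on $h_{t-1}$, so they are a simplified picture rather than the full gradient.

$$
\text{RNN:}\quad \frac{\partial h_t}{\partial h_{t-1}} = \operatorname{diag}(1 - h_t^2)\, W_{hh}
\quad\Rightarrow\quad
\frac{\partial h_T}{\partial h_k} = \prod_{t=k+1}^{T} \operatorname{diag}(1 - h_t^2)\, W_{hh}
$$

$$
\text{LSTM (direct path):}\quad \frac{\partial C_t}{\partial C_{t-1}} = \operatorname{diag}(f_t)
\quad\Rightarrow\quad
\frac{\partial C_T}{\partial C_k} = \operatorname{diag}\Big(\prod_{t=k+1}^{T} f_t\Big)
$$

$$
\text{GRU (direct path):}\quad \frac{\partial h_t}{\partial h_{t-1}} = \operatorname{diag}(1 - z_t)
$$

Here $h_t^2$ and the product of the $f_t$ are taken elementwise, and the RNN product is ordered from $t = T$ on the left down to $t = k+1$ on the right. In the RNN the same $W_{hh}$ appears in every factor, so the product tends to shrink or grow geometrically with $T - k$. Along the direct LSTM path each factor is a per-step, per-unit gate value in $(0, 1)$ that the network can learn to hold close to 1, which is why the problem is mitigated rather than removed.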