Keywords
- Parameter sharing
- Sequence
- Back-propagation through time (BPTT)
* This post is structured around and based on "Deep Learning" by Ian Goodfellow, with my own opinions added.
* This is a continuation of my last post:
https://24bean.tistory.com/entry/Sequence-Modeling-Recurrent-Recursive-Nets-as-introduction
Recurrent Neural Networks
Here are some examples of important design patterns for recurrent neural networks:
- Recurrent networks that produce an output at each time step and have recurrent connections between hidden units.
- Recurrent networks that produce an output at each time step and have recurrent connections only from the output at one time step to the hidden units at the next time step.
Forward propagation
Forward propagation begins with a specification of the initial state h(0). Then, we apply the following update equations:
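A sketch of these standard updates in the book's notation, assuming a hyperbolic tangent hidden layer and a softmax output layer:

```latex
a^{(t)} = b + W h^{(t-1)} + U x^{(t)}
h^{(t)} = \tanh\!\big(a^{(t)}\big)
o^{(t)} = c + V h^{(t)}
\hat{y}^{(t)} = \operatorname{softmax}\!\big(o^{(t)}\big)
```

Here U, W, and V are the input-to-hidden, hidden-to-hidden, and hidden-to-output weight matrices, and b and c are bias vectors; these are the parameters shared across every time step.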
Figure: An RNN whose only recurrence is the feedback connection from the output to the hidden layer. At each time step t, the input is x(t), the hidden-layer activations are h(t), the outputs are o(t), the targets are y(t), and the loss is L(t).
This RNN is, in fact, less powerful than those in the family represented by the first figure. The network shown here is trained to put a specific output value into o, and o is the only information it is allowed to send to the future. There are no direct connections from h going forward; the previous h is connected to the present only indirectly, via o. Unless o is very high-dimensional and rich, it will usually lack important information about the past, which makes this RNN less powerful. The advantage is that each time step can be trained in isolation from the others.
Here is the basic equation computed during forward propagation; the total loss is simply the sum of the per-time-step losses.
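With L(t) the negative log-likelihood of y(t) given the inputs seen so far (as assumed later in the gradient derivation), the total loss for a sequence is just the sum over time steps:

```latex
L\big(\{x^{(1)},\dots,x^{(T)}\},\{y^{(1)},\dots,y^{(T)}\}\big)
  = \sum_{t} L^{(t)}
  = -\sum_{t} \log p_{\text{model}}\!\big(y^{(t)} \mid x^{(1)},\dots,x^{(t)}\big)
```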
Computing the gradient of this loss function with respect to the parameters is an expensive operation. The runtime is O(T) and cannot be reduced by parallelization, because the forward propagation graph is inherently sequential: each time step can be computed only after the previous one.
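As a concrete illustration, here is a minimal NumPy sketch of this forward pass (the function and variable names are my own, not the book's); the loop over time steps is what makes the O(T) cost impossible to parallelize:

```python
import numpy as np

def rnn_forward(x_seq, h0, U, W, V, b, c):
    """Sketch of forward propagation for an RNN with hidden-to-hidden recurrence.

    x_seq         : list of input vectors x(1)...x(T)
    h0            : initial hidden state h(0)
    U, W, V, b, c : shared parameters, reused at every time step
    """
    h = h0
    hiddens, y_hats = [], []
    for x_t in x_seq:                       # inherently sequential: h(t) needs h(t-1)
        a_t = b + W @ h + U @ x_t           # a(t) = b + W h(t-1) + U x(t)
        h = np.tanh(a_t)                    # h(t) = tanh(a(t))
        o_t = c + V @ h                     # o(t) = c + V h(t)
        y_hat = np.exp(o_t - o_t.max())
        y_hat /= y_hat.sum()                # yhat(t) = softmax(o(t))
        hiddens.append(h)
        y_hats.append(y_hat)
    return hiddens, y_hats
```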
Teacher Forcing and Networks with Output Recurrence
Teacher forcing is a training technique applicable to RNNs that have connections from their output at one time step to their hidden units at the next time step. During training, the model receives the ground-truth output y(t) as the extra input at time t+1, instead of feeding back its own prediction.
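Teacher forcing can be seen as emerging directly from the maximum-likelihood criterion. For a sequence of two time steps, for example, the log-likelihood factorizes as:

```latex
\log p\big(y^{(1)}, y^{(2)} \mid x^{(1)}, x^{(2)}\big)
  = \log p\big(y^{(2)} \mid y^{(1)}, x^{(1)}, x^{(2)}\big)
  + \log p\big(y^{(1)} \mid x^{(1)}, x^{(2)}\big)
```

At training time the model is therefore asked to predict y(2) given the inputs so far and the ground-truth y(1), so feeding the correct previous output back in is exactly what maximum likelihood requires.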
The disadvantage of strict teacher forcing arises if the network is later going to be used in an open-loop mode, with the network's own outputs fed back as input. In that case the kind of inputs the network sees during training can be quite different from the kind of inputs it will see at test time, which can make the model unreliable.
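To make this train/test mismatch concrete, here is a small sketch of the same output-recurrence RNN run in the two modes; the names, shapes, and zero-vector start token are illustrative assumptions rather than anything prescribed by the book:

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def step(feedback, x_t, U, W, V, b, c):
    """One step of an RNN whose recurrence runs from the output to the next hidden state."""
    h = np.tanh(b + U @ x_t + W @ feedback)
    o = c + V @ h
    return softmax(o)

def run_teacher_forced(x_seq, y_seq, params):
    """Training mode: the ground-truth target y(t-1) is fed back into h(t)."""
    U, W, V, b, c = params
    feedback = np.zeros_like(c)            # placeholder "previous output" at t = 1
    y_hats = []
    for t, x_t in enumerate(x_seq):
        y_hats.append(step(feedback, x_t, U, W, V, b, c))
        feedback = y_seq[t]                # teacher forcing: use the true target, not y_hats[t]
    return y_hats                          # compare against y_seq to compute the loss

def run_open_loop(x_seq, params):
    """Test-time (open-loop) mode: the model's own prediction is fed back instead."""
    U, W, V, b, c = params
    feedback = np.zeros_like(c)
    y_hats = []
    for x_t in x_seq:
        y_hat = step(feedback, x_t, U, W, V, b, c)
        y_hats.append(y_hat)
        feedback = y_hat                   # closed loop on the model's own output
    return y_hats
```

If the distribution of the fed-back predictions at test time drifts away from the distribution of ground-truth targets seen during training, the network is operating on inputs it was never trained on, which is exactly the weakness described above.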
Computing the Gradient in a Recurrent Neural Network
Computing the gradient through a recurrent neural network is straightforward: one simply applies the generalized back-propagation algorithm to the unrolled computational graph. This use of back-propagation on the unrolled graph is called the back-propagation through time (BPTT) algorithm.
We begin the recursion with the nodes immediately preceding the final loss.
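Since the total loss is simply the sum of the per-time-step losses, the gradient of L with respect to each L(t) is one:

```latex
\frac{\partial L}{\partial L^{(t)}} = 1
```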
In this derivation we assume that the outputs o(t) are used as the argument to the softmax function to obtain the vector ŷ(t) of probabilities over the output. We also assume that the loss is the negative log-likelihood of the true target y(t) given the input so far. The gradient on the outputs at time step t, for all i and t, is then as follows.
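Under these assumptions (softmax outputs and negative log-likelihood loss), this is the familiar "softmax minus one-hot" expression:

```latex
\big(\nabla_{o^{(t)}} L\big)_i
  = \frac{\partial L}{\partial o_i^{(t)}}
  = \frac{\partial L}{\partial L^{(t)}}\,\frac{\partial L^{(t)}}{\partial o_i^{(t)}}
  = \hat{y}_i^{(t)} - \mathbf{1}_{i = y^{(t)}}
```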
We work our way backwards, starting from the end of the sequence. At the final time step T, h(T) only has o(T) as a descendant (in terms of the loss), so its gradient is as follows.
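Because o(T) = c + V h(T), back-propagating through this single connection gives:

```latex
\nabla_{h^{(T)}} L = V^{\top}\, \nabla_{o^{(T)}} L
```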
We can then iterate backwards in time, from t = T − 1 down to t = 1, to back-propagate gradients through time.
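For t < T, h(t) has both o(t) and h(t+1) as descendants, so both paths contribute; with tanh hidden units the Jacobian of h(t+1) with respect to its pre-activation is diag(1 − (h(t+1))²), giving:

```latex
\nabla_{h^{(t)}} L
  = \Big(\tfrac{\partial h^{(t+1)}}{\partial h^{(t)}}\Big)^{\!\top} \nabla_{h^{(t+1)}} L
  + \Big(\tfrac{\partial o^{(t)}}{\partial h^{(t)}}\Big)^{\!\top} \nabla_{o^{(t)}} L
  = W^{\top} \operatorname{diag}\!\big(1 - (h^{(t+1)})^{2}\big)\, \nabla_{h^{(t+1)}} L
  + V^{\top}\, \nabla_{o^{(t)}} L
```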
Once the gradients on the internal nodes of the computational graph are obtained, we can obtain the gradients on the parameter nodes. Because the parameters are shared across many time steps, we must take some care when denoting calculus operations involving these variables.
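Summing the per-time-step contributions (and still assuming the tanh/softmax updates above), the parameter gradients work out to:

```latex
\nabla_{c} L = \sum_{t} \nabla_{o^{(t)}} L
\nabla_{b} L = \sum_{t} \operatorname{diag}\!\big(1 - (h^{(t)})^{2}\big)\, \nabla_{h^{(t)}} L
\nabla_{V} L = \sum_{t} \big(\nabla_{o^{(t)}} L\big)\, h^{(t)\top}
\nabla_{W} L = \sum_{t} \operatorname{diag}\!\big(1 - (h^{(t)})^{2}\big)\, \big(\nabla_{h^{(t)}} L\big)\, h^{(t-1)\top}
\nabla_{U} L = \sum_{t} \operatorname{diag}\!\big(1 - (h^{(t)})^{2}\big)\, \big(\nabla_{h^{(t)}} L\big)\, x^{(t)\top}
```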
In short: we back-propagate through the graph unfolded in time, compute the gradient contribution at every time step, and then sum those per-step contributions, because the same parameters are shared across all layers of the unrolled graph.
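Here is a minimal NumPy sketch of that procedure for the tanh/softmax RNN sketched earlier; all names are my own, and y_seq is assumed to hold integer class labels. Each parameter gradient is accumulated across time steps precisely because the parameters are shared:

```python
import numpy as np

def bptt(x_seq, y_seq, hiddens, y_hats, h0, U, W, V):
    """Back-propagation through time for the tanh/softmax RNN sketched above.

    hiddens and y_hats come from the forward pass; y_seq holds integer labels.
    Returns gradients for U, W, V, b, c, each summed over all time steps.
    """
    T = len(x_seq)
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    db, dc = np.zeros(W.shape[0]), np.zeros(V.shape[0])
    dh_next = np.zeros(W.shape[0])              # gradient flowing back from time t+1

    for t in reversed(range(T)):
        do = y_hats[t].copy()
        do[y_seq[t]] -= 1.0                     # grad on o(t): yhat(t) - one-hot(y(t))
        h_t = hiddens[t]
        h_prev = hiddens[t - 1] if t > 0 else h0

        dc += do
        dV += np.outer(do, h_t)

        dh = V.T @ do + dh_next                 # contributions from o(t) and from h(t+1)
        da = (1.0 - h_t ** 2) * dh              # back through tanh: diag(1 - h(t)^2)

        db += da
        dW += np.outer(da, h_prev)
        dU += np.outer(da, x_seq[t])
        dh_next = W.T @ da                      # pass gradient back to h(t-1)

    return dU, dW, dV, db, dc
```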
Modeling sequences conditioned on context, RNNs as directed graphical models, bidirectional RNNs, and the encoder-decoder architecture will be covered in upcoming posts.