
Sequence Modeling : Recurrent & Recursive Nets as RNN

24_bean 2023. 1. 16. 21:02

Keyword

  • Parameter sharing
  • Sequence
  • Back-propagation through time (BPTT)

 

* This post is structured around and based on "Deep Learning" by Ian Goodfellow, with my own opinions added.

* This post continues from my last post:

https://24bean.tistory.com/entry/Sequence-Modeling-Recurrent-Recursive-Nets-as-introduction

 



Recurrent Neural Networks

The computational graph to compute the training loss of a recurrent network that maps an input sequence of x values to a corresponding sequence of output o values.

 

 

Here are some examples of important design patterns for recurrent neural networks:

  • Recurrent networks that produce an output at each time step and have recurrent connections between hidden units.
  • Recurrent networks that produce an output at each time step and have recurrent connections only from the output at one time step to the hidden units at the next time step.

Forward propagation

 Forward propagation begins with a specification of the initial state h(0). Then, for each time step from t = 1 to t = T, we apply the following update equations:
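(The original equation images did not survive; reconstructed here following the book's standard formulation, assuming tanh hidden units and a softmax output layer.)

$$
\begin{aligned}
\boldsymbol{a}^{(t)} &= \boldsymbol{b} + \boldsymbol{W}\boldsymbol{h}^{(t-1)} + \boldsymbol{U}\boldsymbol{x}^{(t)} \\
\boldsymbol{h}^{(t)} &= \tanh\!\left(\boldsymbol{a}^{(t)}\right) \\
\boldsymbol{o}^{(t)} &= \boldsymbol{c} + \boldsymbol{V}\boldsymbol{h}^{(t)} \\
\hat{\boldsymbol{y}}^{(t)} &= \operatorname{softmax}\!\left(\boldsymbol{o}^{(t)}\right)
\end{aligned}
$$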

 

 

 An RNN whose only recurrence is the feedback connection from the output to the hidden layer. At each time step t, the input is x(t), the hidden-layer activations are h(t), the outputs are o(t), the targets are y(t), and the loss is L(t). (Left) Circuit diagram. (Right) Unfolded computational graph.

 

 This RNN is strictly less powerful than those in the family represented by the first figure. The network is trained to put a specific output value into o, and o is the only information it is allowed to send to the future. There are no direct connections from h going forward; the previous h is connected to the present only indirectly, through the output it produced. Unless o is very high-dimensional and rich, it will usually lack important information from the past. This makes the RNN in this figure less powerful, but each time step can be trained in isolation from the others.

 

 These are the basic update equations applied during forward propagation (shown above), where the parameters are the bias vectors b and c, along with the weight matrices U, V and W, respectively for the input-to-hidden, hidden-to-output and hidden-to-hidden connections.
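As a concrete illustration of these equations, here is a minimal NumPy sketch of the forward pass. This is my own sketch, not from the book; the function name, shapes, and toy usage are assumptions made purely for illustration.

```python
import numpy as np

def rnn_forward(x_seq, h0, U, W, V, b, c):
    """Forward pass of a vanilla RNN with tanh hidden units and softmax outputs."""
    h = h0
    hidden_states, predictions = [h0], []
    for x_t in x_seq:
        a_t = b + W @ h + U @ x_t        # a(t) = b + W h(t-1) + U x(t)
        h = np.tanh(a_t)                 # h(t) = tanh(a(t))
        o_t = c + V @ h                  # o(t) = c + V h(t)
        e = np.exp(o_t - o_t.max())      # numerically stable softmax
        predictions.append(e / e.sum())  # y_hat(t)
        hidden_states.append(h)
    return hidden_states, predictions

# Toy usage: 4-dimensional inputs, 8 hidden units, 3 output classes.
rng = np.random.default_rng(0)
U, W, V = rng.normal(size=(8, 4)), rng.normal(size=(8, 8)), rng.normal(size=(3, 8))
b, c, h0 = np.zeros(8), np.zeros(3), np.zeros(8)
hs, y_hats = rnn_forward([rng.normal(size=4) for _ in range(5)], h0, U, W, V, b, c)
```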

 

 Computing the gradient of this loss function with respect to the parameters is an expensive operation. The runtime is O(T) and cannot be reduced by parallelization, because the forward propagation graph is inherently sequential: each time step can be computed only after the previous one.

 


Teacher Forcing and Networks with Output Recurrence

 

 Illustration of teacher forcing. Teacher forcing is a training technique that is applicable to RNNs that have connections from their output to the hidden states at the next time step.

An example of teacher forcing: at each time step of the sequence, the model is literally given the ground-truth word that comes next, rather than having to rely on its own previous output.

 

 The disadvantage of strict teacher forcing arises if the network is going to be later used in an open-loop mode, with the network outputs fed back as input. In this case, the kind of inputs that the network sees during training could be quite different from the kind of inputs that it will see at test time, which can make the model unreliable.
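To make the training/test mismatch concrete, here is a minimal sketch (my own, not from the book) contrasting a teacher-forced training pass with open-loop generation, for a toy RNN whose recurrence runs through the previous output token. All names, sizes, and the start-of-sequence token are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden = 5, 8
E = rng.normal(size=(hidden, vocab_size))   # embedding of the previous token
W = rng.normal(size=(hidden, hidden))       # hidden-to-hidden weights
V = rng.normal(size=(vocab_size, hidden))   # hidden-to-output weights

def step(h, prev_token):
    """One step: consume the previous token, update h, return next-token probabilities."""
    h = np.tanh(W @ h + E[:, prev_token])
    logits = V @ h
    e = np.exp(logits - logits.max())
    return h, e / e.sum()

targets = [1, 3, 2, 4]           # ground-truth sequence y(1)..y(T)

# --- Teacher forcing (training): feed the ground-truth previous token ---
h, loss, prev = np.zeros(hidden), 0.0, 0     # 0 = assumed start-of-sequence token
for y_t in targets:
    h, probs = step(h, prev)
    loss += -np.log(probs[y_t])  # negative log-likelihood of the true target
    prev = y_t                   # ground truth, NOT the model's own prediction

# --- Open-loop generation (test time): feed back the model's own output ---
h, prev, generated = np.zeros(hidden), 0, []
for _ in range(len(targets)):
    h, probs = step(h, prev)
    prev = int(np.argmax(probs)) # the model's own prediction is fed back
    generated.append(prev)
```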

 


Computing the Gradient in a Recurrent Neural Network

 Computing the gradient through a recurrent neural network is straightforward: one simply applies the generalized back-propagation algorithm to the unrolled computational graph. This use of back-propagation on the unrolled graph is called the back-propagation through time (BPTT) algorithm.

 

 We begin the recursion with the nodes immediately preceding the final loss.
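Taking the total loss to be the sum of the per-time-step losses (reconstructed, following the book), this starting point is simply:

$$
L = \sum_{t} L^{(t)}, \qquad \frac{\partial L}{\partial L^{(t)}} = 1 .
$$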

 

 

 In this derivation we assume that the outputs o(t) are used as the argument to the softmax function to obtain the vector ŷ(t) of probabilities over the output. We also assume that the loss is the negative log-likelihood of the true target y(t) given the input so far. The gradient on the outputs at time step t, for all i and t, is as follows.
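(Reconstructed; this is the usual softmax-plus-negative-log-likelihood gradient.)

$$
\left(\nabla_{\boldsymbol{o}^{(t)}} L\right)_i
= \frac{\partial L}{\partial o_i^{(t)}}
= \frac{\partial L}{\partial L^{(t)}}\,\frac{\partial L^{(t)}}{\partial o_i^{(t)}}
= \hat{y}_i^{(t)} - \mathbf{1}_{i = y^{(t)}} .
$$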

 

 

 We work our way backwards, starting from the end of the sequence. At the final time step T, h(T) only has o(T) as a descendant (in terms of the negative log-likelihood), so its gradient is simple:
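(Reconstructed, with T denoting the final time step.)

$$
\nabla_{\boldsymbol{h}^{(T)}} L = \boldsymbol{V}^{\top}\,\nabla_{\boldsymbol{o}^{(T)}} L .
$$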

 

 

 We can then iterate backwards in time to back-propagate gradients through time: for t < T, h(t) has as descendants both o(t) and h(t+1).
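The reconstructed recursion, assuming tanh hidden units (whose Jacobian at step t+1 is the diagonal matrix diag(1 − (h^{(t+1)})²)), is:

$$
\begin{aligned}
\nabla_{\boldsymbol{h}^{(t)}} L
&= \left(\frac{\partial \boldsymbol{h}^{(t+1)}}{\partial \boldsymbol{h}^{(t)}}\right)^{\!\top} \nabla_{\boldsymbol{h}^{(t+1)}} L
 + \left(\frac{\partial \boldsymbol{o}^{(t)}}{\partial \boldsymbol{h}^{(t)}}\right)^{\!\top} \nabla_{\boldsymbol{o}^{(t)}} L \\
&= \boldsymbol{W}^{\top}\operatorname{diag}\!\left(1 - \left(\boldsymbol{h}^{(t+1)}\right)^{2}\right)\nabla_{\boldsymbol{h}^{(t+1)}} L
 + \boldsymbol{V}^{\top}\nabla_{\boldsymbol{o}^{(t)}} L .
\end{aligned}
$$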

 

 

 Once the gradients on the internal nodes of the computational graph are obtained, we can obtain the gradients on the parameter nodes. Because the parameters are shared across many time steps, we must take some care when denoting calculus operations involving these variables.
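Treating the shared weights as if each time step had its own copy and then summing the per-step contributions, the parameter gradients work out to the following (reconstructed for the tanh/softmax case):

$$
\begin{aligned}
\nabla_{\boldsymbol{c}} L &= \sum_{t} \nabla_{\boldsymbol{o}^{(t)}} L \\
\nabla_{\boldsymbol{b}} L &= \sum_{t} \operatorname{diag}\!\left(1 - \left(\boldsymbol{h}^{(t)}\right)^{2}\right) \nabla_{\boldsymbol{h}^{(t)}} L \\
\nabla_{\boldsymbol{V}} L &= \sum_{t} \left(\nabla_{\boldsymbol{o}^{(t)}} L\right) \boldsymbol{h}^{(t)\top} \\
\nabla_{\boldsymbol{W}} L &= \sum_{t} \operatorname{diag}\!\left(1 - \left(\boldsymbol{h}^{(t)}\right)^{2}\right) \left(\nabla_{\boldsymbol{h}^{(t)}} L\right) \boldsymbol{h}^{(t-1)\top} \\
\nabla_{\boldsymbol{U}} L &= \sum_{t} \operatorname{diag}\!\left(1 - \left(\boldsymbol{h}^{(t)}\right)^{2}\right) \left(\nabla_{\boldsymbol{h}^{(t)}} L\right) \boldsymbol{x}^{(t)\top}
\end{aligned}
$$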

 

 So, in short: back-propagate through the graph unfolded in time, accumulate the gradient contribution from every time step, and keep in mind that the same parameters are shared across all time steps (all layers of the unrolled graph).
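For a concrete (and again assumed, not book-provided) illustration, here is a minimal NumPy sketch of BPTT that implements the gradients above; the names mirror the forward-pass sketch earlier in the post.

```python
import numpy as np

def rnn_bptt(x_seq, targets, h0, U, W, V, b, c):
    """Back-propagation through time for a tanh/softmax RNN.

    Returns the loss and the gradients of the shared parameters, accumulated
    over all time steps of the unrolled graph.
    """
    # Forward pass: store h(t) and y_hat(t) for reuse in the backward pass.
    hs, y_hats, h = [h0], [], h0
    for x_t in x_seq:
        h = np.tanh(b + W @ h + U @ x_t)
        o = c + V @ h
        e = np.exp(o - o.max())
        hs.append(h)
        y_hats.append(e / e.sum())

    loss = -sum(np.log(y_hats[t][targets[t]]) for t in range(len(targets)))

    grads = {name: np.zeros_like(p) for name, p in
             dict(U=U, W=W, V=V, b=b, c=c).items()}
    dh_next = np.zeros_like(h0)                   # grad wrt h(t+1); zero beyond T

    for t in reversed(range(len(x_seq))):
        do = y_hats[t].copy()
        do[targets[t]] -= 1.0                     # y_hat(t) - one_hot(y(t))
        dh = V.T @ do + dh_next                   # grad wrt h(t)
        da = (1.0 - hs[t + 1] ** 2) * dh          # through tanh: diag(1 - h(t)^2)

        grads['c'] += do
        grads['V'] += np.outer(do, hs[t + 1])
        grads['b'] += da
        grads['W'] += np.outer(da, hs[t])         # h(t-1) is hs[t] (hs[0] = h(0))
        grads['U'] += np.outer(da, x_seq[t])
        dh_next = W.T @ da                        # contribution to grad wrt h(t-1)

    return loss, grads
```

Checking these gradients against finite differences is an easy way to validate a sketch like this.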

 


 

Modeling sequences conditioned on context, and RNNs as directed graphical models, will be covered in an upcoming post, along with bidirectional RNNs and encoder-decoder architectures.