
Sequence Modeling: Recurrent & Recursive Nets, an introduction

24_bean 2023. 1. 15. 19:45

Keywords

  • Parameter sharing
  • Sequence
  • Back-propagation through time (BPTT)

 

* This post is based on "Deep Learning" by Ian Goodfellow, with my own opinions added.


 

Intro

A recurrent neural network is a neural network that is specialized for processing a sequence of values x(1), x(2), ..., x(i).

 

Parameter sharing makes it possible to extend and apply the model to examples of different forms (for example, different lengths) and to generalize across them. If we had separate parameters for each value of the time index, we could not generalize to sequence lengths not seen during training, nor share statistical strength across different sequence lengths and across different positions in time.

 

Such sharing is particularly important when a specific piece of information can occur at multiple positions within the sequence.

 

A traditional fully connected feedforward network would have separate parameters for each input feature, so it would need to learn all of the rules of the language separately at each position in the sentence. By comparison, a recurrent neural network shares the same weights across several time steps.
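As a rough illustration (my own sketch, not from the book), here is a minimal NumPy comparison between the two ideas: a feedforward-style model that allocates a separate weight matrix for every position, and a recurrent update that reuses one set of parameters at every time step. All names and sizes here are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_in, d_h = 5, 3, 4            # sequence length, input size, hidden size
x = rng.normal(size=(T, d_in))    # one toy input sequence

# Feedforward-style: a separate weight matrix for every position in the sequence.
W_per_pos = [rng.normal(size=(d_h, d_in)) for _ in range(T)]   # T distinct matrices
h_ff = [np.tanh(W_per_pos[t] @ x[t]) for t in range(T)]        # position-specific rules

# Recurrent-style: one shared set of parameters applied at every time step.
W = rng.normal(size=(d_h, d_h))   # state-to-state weights
U = rng.normal(size=(d_h, d_in))  # input-to-state weights
b = np.zeros(d_h)
h = np.zeros(d_h)
for t in range(T):                # the same W, U, b are reused for every t
    h = np.tanh(W @ h + U @ x[t] + b)
```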

 

The convolution operation also allows a network to share parameters across time, but that sharing is shallow: the parameter sharing manifests in the application of the same convolution kernel at each time step, so each output depends only on a small neighborhood of the input.

This recurrent formulation results in the sharing of parameters through a very deep computational graph.
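Here is a small sketch of that difference (my own example, not the book's): a 1-D convolution reuses one kernel across positions but each output only sees a short window, while a recurrent update threads the state through every step, so the final state depends on the entire history.

```python
import numpy as np

x = np.arange(10, dtype=float)            # a toy 1-D sequence
kernel = np.array([0.25, 0.5, 0.25])      # one kernel shared across all positions

# Shallow sharing: each y[t] is a function of only three neighboring inputs.
y = np.convolve(x, kernel, mode="valid")

# Deep sharing: s depends on every earlier input through the repeated update.
w, u = 0.9, 0.5                           # toy scalar parameters, reused each step
s = 0.0
for x_t in x:
    s = np.tanh(w * s + u * x_t)
```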

In practice, recurrent networks usually operate on minibatches of such sequences, with a different sequence length for each member of the minibatch.
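One common way to handle this (a sketch under my own assumptions, not something prescribed by the book) is to pad every sequence in the minibatch to the length of the longest one and keep each example's true length, so that anything computed past the end of a sequence can be masked out.

```python
import numpy as np

# Three sequences of different lengths, each with 2 features per step.
seqs = [np.ones((3, 2)), np.ones((5, 2)), np.ones((2, 2))]
lengths = np.array([len(s) for s in seqs])            # true lengths: [3, 5, 2]
T_max, d_in = lengths.max(), 2

# Pad with zeros up to T_max -> one array of shape (batch, T_max, d_in).
batch = np.zeros((len(seqs), T_max, d_in))
for i, s in enumerate(seqs):
    batch[i, : len(s)] = s

# The mask marks which (example, time step) entries hold real data.
mask = np.arange(T_max)[None, :] < lengths[:, None]   # boolean, shape (batch, T_max)
```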

 


 

Unfolding Computational Graphs

 

Unfolding turns a recursive or recurrent computation into a computational graph that has a repetitive structure, typically corresponding to a chain of events.

So basically, "unfolding" literally means unrolling the computational graph. To see that the state corresponds to the hidden units of the network, let's look at the equations below.

 

Below is the basic formal equation for a dynamical system, where s(t) is the state of the system at time t and θ is a fixed set of parameters:

s(t) = f(s(t-1); θ)   (the classical form of a dynamical system)
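To make the recurrence concrete, here is a tiny sketch (mine, with an arbitrary choice of f and θ) that iterates the dynamical system for a few steps; each state is computed only from the previous state and the fixed parameter.

```python
import numpy as np

theta = 0.5                 # a fixed parameter (arbitrary value for this example)

def f(s, theta):
    """One step of the dynamical system: s(t) = f(s(t-1); theta)."""
    return np.tanh(theta * s + 1.0)

s = 0.0                     # initial state s(0)
for t in range(1, 4):       # unroll three steps: s(1), s(2), s(3)
    s = f(s, theta)
    print(f"s({t}) = {s:.4f}")
```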

 

This would be the typical RNN formulation, in which the state is driven by an external input x(t) as well as by the previous state:

h(t) = f(h(t-1), x(t); θ)   (the typical RNN state update)
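A minimal sketch of that update (the tanh nonlinearity and the matrix shapes are my own choices for illustration): the parameters θ = (W, U, b) stay fixed, and each new state depends on the previous state and the current input.

```python
import numpy as np

rng = np.random.default_rng(1)
d_h, d_in = 4, 3
W = rng.normal(size=(d_h, d_h))   # recurrent (state-to-state) weights
U = rng.normal(size=(d_h, d_in))  # input-to-state weights
b = np.zeros(d_h)

def rnn_step(h_prev, x_t):
    """h(t) = f(h(t-1), x(t); theta) with theta = (W, U, b)."""
    return np.tanh(W @ h_prev + U @ x_t + b)

h = np.zeros(d_h)                 # h(0)
x_t = rng.normal(size=d_in)       # one input vector x(1)
h = rnn_step(h, x_t)              # h(1)
```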

 

In addition to this basic recurrence, typical RNNs add extra architectural features, such as output layers that read information out of the state h to make predictions.

When the recurrent network is trained to perform a task that requires predicting the future from the past, the network typically learns to use h(t) as a kind of lossy summary of the task-relevant aspects of the past sequence of inputs up to t.

* h(t) is a fixed-length vector, which is why this summary of an arbitrarily long past is necessarily lossy.

 

Consider a recurrent network with no outputs: it just processes information from the input x by incorporating it into the state h that is passed forward through time.

 

What we call unfolding is the operation that maps such a circuit (a compact graph containing a cycle) to a computational graph with repeated pieces, one per time step. The unfolded graph now has a size that depends on the sequence length.
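A small sketch of unfolding (my own code, not from the book): the compact description is just one transition function applied in a loop, and unrolling it over a concrete sequence produces one copy of that computation per time step, so the number of repeated pieces equals the sequence length.

```python
import numpy as np

rng = np.random.default_rng(2)
d_h, d_in = 4, 3
W, U, b = rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_in)), np.zeros(d_h)

def unfold(x_seq):
    """Unroll the recurrence over x_seq, keeping one hidden state per time step."""
    h = np.zeros(d_h)
    states = []
    for x_t in x_seq:                     # one repeated "piece" per element of x_seq
        h = np.tanh(W @ h + U @ x_t + b)
        states.append(h)
    return states

x_seq = rng.normal(size=(6, d_in))        # a length-6 sequence
states = unfold(x_seq)
print(len(states))                        # 6 -- the unfolded graph grows with the sequence
```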

 

So now we can represent the unfolded recurrence after t steps with a function g(t):

h(t) = g(t)(x(t), x(t-1), ..., x(2), x(1)) = f(h(t-1), x(t); θ)

 

The function g(t) takes the whole past sequence (x(t), x(t-1), ..., x(2), x(1)) as input and produces the current state, but the unfolded recurrent structure allows us to factorize g(t) into repeated application of a function f. The unfolding process thus introduces two major advantages:

 

  1. Regardless of the sequence length, the learned model always has the same input size, because it is specified in terms of a transition from one state to another, rather than in terms of a variable-length history of states.
  2. It is possible to use the same transition function f with the same parameters at every time step.

 

These two factors make it possible to learn a single model f that operates on all time steps and all sequence lengths, rather than needing to learn a separate model g(t) for every possible time step. The shared model also allows generalization to sequence lengths that did not appear in the training set, and allows the model to be estimated with far fewer training examples than would be required without parameter sharing.
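To illustrate the point (again a sketch with made-up shapes and values, not code from the book): the very same transition function and parameters can be applied to sequences of any length, including lengths never seen during training.

```python
import numpy as np

rng = np.random.default_rng(3)
d_h, d_in = 4, 3
W, U, b = rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_in)), np.zeros(d_h)

def final_state(x_seq):
    """Run the shared transition f over an input sequence of any length."""
    h = np.zeros(d_h)
    for x_t in x_seq:
        h = np.tanh(W @ h + U @ x_t + b)
    return h

# The same parameters handle a length-3 and a length-7 sequence alike.
h_short = final_state(rng.normal(size=(3, d_in)))
h_long = final_state(rng.normal(size=(7, d_in)))
print(h_short.shape, h_long.shape)        # both (4,) -- same model, different lengths
```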

 


 

* More content will be coming soon...