Introduction to the Neural Turing Machine

This is the 24th article of ISer Advent Calendar 2019.

Introduction

"Neural Turing Machine"(NTM) is the neural network architecture introduced by DeepMind team in 2014. *1It is the combination of recurrent neural networks and external memory resources. As the name of the architecture mentions, it is analogous to a Turing Machine. In this article, I will summarize the basic idea of NTM and show the implementation using Pytorch. The repository is here.

 

github.com

 

 

What is a Neural Turing Machine?

f:id:ski_sukisuki:20191224225434p:plain

Fig1

The overall architecture of the NTM is shown in Fig1. The "Controller" is an ordinary neural network such as an RNN, and it controls two types of heads: "Read Heads" and "Write Heads". Through these heads, the controller interacts with the "Memory", an N × M matrix, where N is the number of memory locations (rows) and M is the vector size at each location. Next, let's see the details of how each component works.
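To make the structure concrete, here is a minimal PyTorch skeleton of these components. The class and layer names are my own illustration, and the heads are heavily simplified (they emit raw weightings instead of the full addressing parameters, and the forward pass is omitted), so treat it as a sketch rather than a faithful implementation.

```python
import torch.nn as nn

class NTMSketch(nn.Module):
    """Sketch of the NTM structure: controller + read/write heads + N x M memory."""
    def __init__(self, input_size, output_size, controller_size, N, M):
        super().__init__()
        self.N, self.M = N, M                                     # memory is an N x M matrix
        # Controller: an ordinary recurrent network; it also sees the previous read vector
        self.controller = nn.LSTMCell(input_size + M, controller_size)
        # Heads map the controller state to head outputs (simplified here)
        self.read_head = nn.Linear(controller_size, N)            # -> weighting w_t over locations
        self.write_head = nn.Linear(controller_size, N + 2 * M)   # -> w_t, erase e_t, add a_t
        self.output = nn.Linear(controller_size + M, output_size)
```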

 

Reading

At each time \(t\), let \(\boldsymbol{w}_t\) be a (normalized) vector of weightings over the \(N\) locations emitted by a read head. The length \(M\) read vector \(\boldsymbol{r}_t\) returned by the head is defined as a convex combination of the row-vectors \(\boldsymbol{M}_t(i)\) in the memory.

$$ \boldsymbol{r}_t \leftarrow \sum_i w_t(i)\boldsymbol{M}_t(i)$$

In other words, the read head emits a weight vector, and the reading mechanism is simply a weighted sum of the memory rows, i.e. a vector–matrix product of the weighting and the memory matrix.
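As a quick sketch (the function name and shapes are mine, assuming a single read head), the reading step is one line of PyTorch:

```python
import torch

def read(memory, w):
    """r_t = sum_i w_t(i) * M_t(i): a weighted sum of the memory rows.

    memory: (N, M) tensor, w: (N,) normalized weighting. Returns the length-M read vector.
    """
    return w @ memory

memory = torch.randn(128, 20)            # N = 128 locations, M = 20
w = torch.softmax(torch.randn(128), 0)   # normalized weighting over locations
r = read(memory, w)                      # shape (20,)
```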

Writing

Writing has two parts: an erase followed by an add. Define \(\boldsymbol{w}_t\) as above, this time emitted by a write head, and let \(\boldsymbol{e}_t\) be an erase vector whose \(M\) elements all lie in the range (0, 1). The erase mechanism works as follows.

$$ \tilde{\boldsymbol{M}}_t(i) \leftarrow \boldsymbol{M}_{t-1}(i)\left[\boldsymbol{1} - w_t(i)\boldsymbol{e}_t\right] $$

Each write head also produces a length \(M\) add vector \(\boldsymbol{a}_t\), which is added to the memory after the erase step has been performed.

$$ \boldsymbol{M}_t(i) \leftarrow \tilde{\boldsymbol{M}}_{t}(i) + w_t(i)\boldsymbol{a}_t $$
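Both steps can be written without an explicit loop over locations by using outer products. The following is a sketch under the same assumptions as above (single write head, my own function name):

```python
import torch

def write(memory, w, e, a):
    """Erase then add, applied to every memory row at once.

    memory: (N, M), w: (N,), e and a: (M,), with the elements of e in (0, 1).
    """
    erased = memory * (1 - torch.outer(w, e))  # M~_t(i) = M_{t-1}(i) [1 - w_t(i) e_t]
    return erased + torch.outer(w, a)          # M_t(i)  = M~_t(i) + w_t(i) a_t
```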

 

f:id:ski_sukisuki:20191224233525p:plain

Fig2



Then, how is \(\boldsymbol{w}_t\) determined? That mechanism (the so-called addressing mechanism) is shown in Fig2: roughly, it combines content-based addressing (comparing a key vector against each memory row) with location-based addressing (interpolation, shifting, and sharpening of the previous weighting). Introducing all the steps in the diagram would be too long, so I skip the details in this article.
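Just to give a taste of the first step, here is a sketch of the content-based part: the weighting is a softmax over the cosine similarity between a key \(\boldsymbol{k}_t\) and each memory row, scaled by a key strength \(\beta_t\). (The function name is mine; see the paper for the remaining steps.)

```python
import torch
import torch.nn.functional as F

def content_addressing(memory, key, beta):
    """Content-based weighting: softmax of beta * cosine_similarity(key, M_t(i)).

    memory: (N, M), key: (M,), beta: non-negative scalar key strength. Returns (N,) weights.
    """
    similarity = F.cosine_similarity(memory, key.unsqueeze(0), dim=1)  # (N,)
    return F.softmax(beta * similarity, dim=0)
```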

Intuition

Let's think about the intuition behind each mechanism. Firstly, this architecture looks like a Turing machine if you view the memory as the tape of a Turing machine. In the reading mechanism, if you set \(w_t(i') = 1\) for a specific \(i'\) and set all other \(w_t(i) = 0\), the result is exactly that row of the memory. By changing the weight vector, the read head can extract information from the memory in many ways. The write head can do similar things: it can erase and add information in a linear way.

Of course, you can also think of this architecture as an extension of neural networks. It is known that RNNs are Turing-complete, but slow learning and poor generalization have been problems, especially for algorithmic tasks. In the NTM, the external memory (you could also call it "working memory") lets the neural network learn how to manipulate a memory resource. As a result, it captures more of the algorithmic aspect of the task, which leads to better generalization.

Implementation

I implemented the NTM referring to this repository. Practical tips for the implementation are introduced by M. Collier and J. Beel (2018).*2

I tested it on the copy task, in which the output should be the same 0/1 pattern as the input.
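For reference, a minimal way to generate copy-task data (random bits, then a delimiter flag, after which the network has to reproduce the bits) could look like the sketch below. The function name and the extra delimiter channel are my own choices, not necessarily those of the repository I referred to.

```python
import torch

def copy_task_example(min_len=1, max_len=20, width=8, batch_size=1):
    """One copy-task example: random bit vectors, a delimiter, then silence
    while the network is expected to output the bits again."""
    seq_len = torch.randint(min_len, max_len + 1, (1,)).item()
    bits = torch.randint(0, 2, (seq_len, batch_size, width)).float()
    inputs = torch.zeros(2 * seq_len + 1, batch_size, width + 1)
    inputs[:seq_len, :, :width] = bits
    inputs[seq_len, :, width] = 1.0            # delimiter flag in the extra channel
    targets = torch.zeros(2 * seq_len + 1, batch_size, width)
    targets[seq_len + 1:] = bits               # expected output after the delimiter
    return inputs, targets
```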

f:id:ski_sukisuki:20191225001141p:plain

Fig3

Fig3 shows the training curve: the cost drops sharply after about 15 thousand sequences. It took almost an hour to complete the training on my PC.

f:id:ski_sukisuki:20191225001845g:plain

Fig4

Fig4 shows how the network behaves at different learning steps. Note that the network was trained with sequences of length 1 to 20. However, as Fig4 shows, it performs well even when the input length is 80. This is the generalization ability of the NTM.

 

 

There are more tasks on which the NTM shows high performance, but I have not tried them yet... I will update this article if I test more. Bye!!

*1:Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. “Neural Turing Machines,” 1–26. http://arxiv.org/abs/1410.5401

*2:Collier, Mark, and Joeran Beel. 2018. “Implementing Neural Turing Machines.” Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 11141 LNCS: 94–104. https://doi.org/10.1007/978-3-030-01424-7_10