Mon, August 29
FINAL REPORT
Organization | mlpack
Project | Transformer and BERT in mlpack
Student | Mrityunjay Tripathi
Mentor | Mikhail Lozhnikov
Abstract
In this report, I will sum up the work done over the three months of my GSoC project. The aim of the project was to implement the Transformer model and the BERT model in mlpack.
Objectives
- Add Embedding Layer.
- Add Linear3D Layer.
- Add Multihead Attention Layer.
- Add Positional Encoding Layer.
- Add Transformer Encoder.
- Add Transformer Decoder.
- Add the Transformer model.
- Implement the BERT Model.
- Add tokenizers for BERT.
- Add BLEU Score Metric.
- Load pre-trained weights of the BERT model from TensorFlow into mlpack.
- Fine-tune BERT model on GLUE benchmarks.
Contributions before GSoC
- Added Poisson Negative Log-Likelihood Loss [PR #2196]
- Added Huber Loss [PR #2199]
- Templating return type of Loss functions [PR #2339]
- Inclusion of all layers in layer.hpp file [PR #2353]
- Added Softmax Layer [PR #2351]
Project Work
Embedding Layer [PR #2398]
The Lookup class stores word embeddings and retrieves them using tokens. The existing Lookup layer in mlpack had the limitation that it could process only one batch at a time, and this PR aimed to fix that. For more details view the pull request.
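The sketch below is a minimal illustration of what an embedding lookup computes for a whole batch, written in plain Armadillo rather than against the actual mlpack Lookup API; the shapes (tokens stored as a sequenceLength x batchSize matrix of 1-based indices, output columns stacking one vector per token) are assumptions made for the example.

```cpp
// Illustrative Armadillo sketch of a batched embedding lookup.
#include <armadillo>

int main()
{
  const size_t vocabSize = 10, embeddingSize = 4;
  const size_t seqLen = 3, batchSize = 2;

  // One embedding vector per vocabulary entry (column-major, as in mlpack).
  arma::mat embeddings(embeddingSize, vocabSize, arma::fill::randn);

  // A batch of token indices: each column is one sequence.
  arma::Mat<size_t> tokens = { { 1, 4 }, { 7, 2 }, { 3, 9 } };

  // For every sequence in the batch, stack the looked-up vectors so that
  // each output column has shape (embeddingSize * seqLen, 1).
  arma::mat output(embeddingSize * seqLen, batchSize);
  for (size_t b = 0; b < batchSize; ++b)
    for (size_t t = 0; t < seqLen; ++t)
      output.submat(t * embeddingSize, b, (t + 1) * embeddingSize - 1, b) =
          embeddings.col(tokens(t, b) - 1);

  output.print("embedded batch:");
}
```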
Linear3D Layer [PR #2508]
The existing Linear layer in mlpack did not account for the extra dimension a data point can have, such as 3D input where every batch contains multiple data points and each data point has 'n' features. With the existing Linear layer we would have to vectorize each slice so that the shape becomes (sequenceLength * embeddingSize, batchSize), which wrongly treats sequenceLength * embeddingSize as the number of features. Even if we did that, the number of parameters (the weight would have shape (sequenceLength * embeddingSize, outSize)) would be much higher than it should be. The Linear3D layer instead applies the same linear transformation to every data point in the sequence. For more details view the pull request.
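The following is a minimal sketch of that idea in plain Armadillo, not the mlpack Linear3D API: one shared (outSize x embeddingSize) weight is applied to every token of every sequence, with the batch represented as a cube of shape (embeddingSize, seqLen, batchSize) purely for illustration.

```cpp
// Sketch: a shared weight applied to each token, so the parameter count is
// outSize * embeddingSize (+ outSize for the bias), independent of seqLen.
#include <armadillo>

int main()
{
  const size_t embeddingSize = 4, outSize = 6, seqLen = 5, batchSize = 2;

  arma::mat weight(outSize, embeddingSize, arma::fill::randn);
  arma::vec bias(outSize, arma::fill::randn);
  arma::cube input(embeddingSize, seqLen, batchSize, arma::fill::randn);

  arma::cube output(outSize, seqLen, batchSize);
  for (size_t b = 0; b < batchSize; ++b)
    output.slice(b) = weight * input.slice(b) + arma::repmat(bias, 1, seqLen);

  output.slice(0).print("first sequence after the Linear3D-style transform:");
}
```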
Multihead Attention Layer [PR #2375]
I then started implementing the Multihead Attention layer, the main component of the Transformer model; refer to the paper Attention Is All You Need. Initially, I faced some problems regarding the API, as the FFN class in mlpack takes just one input. The implementation was also a little challenging, since in mlpack we need to explicitly define the backpropagation function, unlike other libraries that use automatic differentiation. Implementing the algorithm in C++ helped me understand every part of the paper very clearly. For more details view the pull request.
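Below is a minimal single-head sketch of the scaled dot-product attention at the core of the layer, softmax(K^T Q / sqrt(d_k)) applied to V, written in plain Armadillo for clarity; the actual mlpack layer splits the projections into several heads and also implements the backward pass explicitly.

```cpp
// Single-head scaled dot-product attention, one column per token.
#include <armadillo>
#include <cmath>

// Column-wise softmax with the usual max-subtraction for numerical stability.
arma::mat Softmax(arma::mat x)
{
  const arma::rowvec colMax = arma::max(x, 0);
  x.each_row() -= colMax;
  arma::mat e = arma::exp(x);
  const arma::rowvec colSum = arma::sum(e, 0);
  e.each_row() /= colSum;
  return e;
}

int main()
{
  const size_t dk = 8, seqLen = 5;

  // Query, key and value matrices (dk x seqLen).
  arma::mat Q(dk, seqLen, arma::fill::randn);
  arma::mat K(dk, seqLen, arma::fill::randn);
  arma::mat V(dk, seqLen, arma::fill::randn);

  // scores(i, j): attention weight of query token j on key token i.
  arma::mat scores = Softmax((K.t() * Q) / std::sqrt((double) dk));
  arma::mat attended = V * scores;  // dk x seqLen

  attended.print("attention output:");
}
```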
Fix Initialization methods of weights and add accessor methods for weights and biases in ANN layers [PR #2404]
Some of the ANN layers in mlpack did not have methods to access and modify their weights and biases. This PR aimed to add that feature. While working on it, I also found some problems in the initialization of the matrices. For more details view the pull request.
Positional Encoding [PR #2557]
Positional Encoding injects information about the relative or absolute position of the tokens in the sequence. This encoding is applied after the Embedding layer and just before the Transformer Encoder. The implementation is again based on Attention Is All You Need. The BERT model requires three types of embeddings: token, segment, and positional. The token and segment embeddings can be implemented using the Lookup layer in the ann namespace of mlpack, but injecting information about the relative positions of tokens required a separate layer. For more details view the pull request.
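The sinusoidal encoding from Attention Is All You Need is PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i + 1) = cos(pos / 10000^(2i/d)). The sketch below computes that table in plain Armadillo for illustration; in the layer this matrix is added to the embedded input.

```cpp
// Sinusoidal positional encoding table (dModel x maxLen).
#include <armadillo>
#include <cmath>

int main()
{
  const size_t dModel = 8, maxLen = 10;

  // pe(d, pos): encoding of dimension d at position pos.
  arma::mat pe(dModel, maxLen);
  for (size_t pos = 0; pos < maxLen; ++pos)
  {
    for (size_t i = 0; i < dModel / 2; ++i)
    {
      const double angle = pos / std::pow(10000.0, (2.0 * i) / dModel);
      pe(2 * i, pos) = std::sin(angle);
      pe(2 * i + 1, pos) = std::cos(angle);
    }
  }

  pe.print("positional encoding (dModel x maxLen):");
}
```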
Transformer [PR #16]
The Transformer model relies entirely on attention mechanisms. Unlike earlier methods such as RNNs or LSTMs, the Transformer allows parallelization. It was not straightforward to implement the Transformer model in mlpack using the existing ANN layers; the workaround, suggested by Mikhail Lozhnikov, was to use the Concat layer to merge two ANN modules. The current implementation is not yet merged into master, but it is almost complete and will be merged soon.
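Since the FFN class accepts a single input matrix, a decoder that needs both the target embedding and the encoder output has to receive them packed into one matrix. The sketch below only illustrates that packing/unpacking idea in plain Armadillo; the exact layout and layer composition used in the PR differ, so treat the shapes here as assumptions for the example.

```cpp
// Packing two logical inputs into one matrix and recovering them by rows.
#include <armadillo>
#include <iostream>

int main()
{
  const size_t dModel = 4, srcLen = 6, tgtLen = 5, batchSize = 2;

  arma::mat target(dModel * tgtLen, batchSize, arma::fill::randn);
  arma::mat memory(dModel * srcLen, batchSize, arma::fill::randn);

  // Pack both inputs into one matrix, one column per batch element.
  arma::mat packed = arma::join_cols(target, memory);

  // Inside the network the two blocks can be recovered with submatrix views.
  arma::mat targetPart = packed.rows(0, dModel * tgtLen - 1);
  arma::mat memoryPart = packed.rows(dModel * tgtLen, packed.n_rows - 1);

  std::cout << "packed: " << packed.n_rows << " x " << packed.n_cols << "\n"
            << "target part rows: " << targetPart.n_rows << "\n"
            << "memory part rows: " << memoryPart.n_rows << std::endl;
}
```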
BERT [PR #30]
BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. The BERT model uses the encoder part of the Transformer and is then fine-tuned for various downstream tasks. The current implementation of the BERT model is complete. The tokenizers required for preprocessing are yet to be worked on, and I hope they will be completed soon.
BLEU Score [PR #2477]
BLEU, or the Bilingual Evaluation Understudy, is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. For more details view the pull request.
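As a rough illustration, the sketch below computes a sentence-level BLEU against a single reference: modified n-gram precisions for n = 1..4 combined by a geometric mean with a brevity penalty. Real BLEU (and the implementation in the PR) works at corpus level, supports multiple references, and usually applies smoothing; all of that is omitted here.

```cpp
// Simplified sentence-level BLEU with one reference and no smoothing.
#include <cmath>
#include <iostream>
#include <map>
#include <string>
#include <vector>

using Tokens = std::vector<std::string>;

// Count the n-grams of length n in a token sequence.
std::map<Tokens, size_t> NGrams(const Tokens& words, size_t n)
{
  std::map<Tokens, size_t> counts;
  for (size_t i = 0; i + n <= words.size(); ++i)
    ++counts[Tokens(words.begin() + i, words.begin() + i + n)];
  return counts;
}

double Bleu(const Tokens& candidate, const Tokens& reference)
{
  double logPrecisionSum = 0.0;
  for (size_t n = 1; n <= 4; ++n)
  {
    const auto candCounts = NGrams(candidate, n);
    const auto refCounts = NGrams(reference, n);

    size_t clipped = 0, total = 0;
    for (const auto& kv : candCounts)
    {
      const auto it = refCounts.find(kv.first);
      const size_t refCount = (it == refCounts.end()) ? 0 : it->second;
      clipped += std::min(kv.second, refCount);  // clipped n-gram count
      total += kv.second;
    }
    if (clipped == 0 || total == 0)
      return 0.0;  // no smoothing in this sketch
    logPrecisionSum += 0.25 * std::log((double) clipped / total);
  }

  // Brevity penalty: penalize candidates shorter than the reference.
  const double bp = candidate.size() >= reference.size() ? 1.0 :
      std::exp(1.0 - (double) reference.size() / candidate.size());
  return bp * std::exp(logPrecisionSum);
}

int main()
{
  const Tokens cand = { "the", "cat", "sat", "on", "the", "mat" };
  const Tokens ref  = { "the", "cat", "is", "on", "the", "mat" };
  std::cout << "BLEU: " << Bleu(cand, ref) << std::endl;
}
```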
Future Work
Some things remain to be implemented in BERT and the related preprocessing of textual data. The text fed to the BERT model needs to be preprocessed in a special way known as WordPiece tokenization, as described in the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. A sketch of the idea follows.
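The sketch below shows the greedy longest-match-first strategy behind WordPiece tokenization: each word is split into the longest vocabulary prefix, the remaining pieces are matched with a "##" continuation prefix, and words that cannot be covered map to "[UNK]". The tiny vocabulary is made up for the example, and details such as maximum word length are omitted.

```cpp
// Greedy longest-match-first WordPiece tokenization of a single word.
#include <iostream>
#include <string>
#include <unordered_set>
#include <vector>

std::vector<std::string> WordPiece(const std::string& word,
                                   const std::unordered_set<std::string>& vocab)
{
  std::vector<std::string> pieces;
  size_t start = 0;
  while (start < word.size())
  {
    // Find the longest piece in the vocabulary starting at 'start'.
    size_t end = word.size();
    std::string piece;
    while (end > start)
    {
      std::string candidate = word.substr(start, end - start);
      if (start > 0)
        candidate = "##" + candidate;  // continuation of a word
      if (vocab.count(candidate)) { piece = candidate; break; }
      --end;
    }
    if (piece.empty())
      return { "[UNK]" };  // no vocabulary piece covers this word
    pieces.push_back(piece);
    start = end;
  }
  return pieces;
}

int main()
{
  const std::unordered_set<std::string> vocab =
      { "play", "##ing", "##ed", "un", "##able" };

  for (const std::string& word : { "playing", "played", "unable", "foo" })
  {
    std::cout << word << " ->";
    for (const std::string& p : WordPiece(word, vocab))
      std::cout << " " << p;
    std::cout << std::endl;
  }
}
```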
Conclusion
Working with mlpack was a truly thrilling experience, and the project was challenging to implement. Almost all of the objectives were achieved, and hopefully the remaining ones will soon be completed and merged. I learned many new things in these three months: the visitor and variant concepts, using Valgrind and LeakSanitizer for memory checks, creating documentation with Doxygen, building C/C++ projects with CMake, writing unit tests with Boost or Catch2, and how serialization works. I just cannot compare myself to where I was six months ago.
Acknowledgment
The mlpack community is extremely helpful, and I am especially indebted to Mikhail Lozhnikov, who was there to help me at every point. Without his help, the project would have been much more complicated. I look forward to continuing to contribute to mlpack and enriching it in the best possible way.