Abstract: This study focuses on constructing a Korean Sentence-BERT model with a novel method: student-teacher knowledge distillation. The limitations of BERT have been well explored in previous publications. BERT has proven ineffective at deriving sentence-level embeddings and is not applicable in practical situations where large amounts of sentence-level embeddings are required, such as document classification and clustering. Sentence-BERT was developed to alleviate these issues and create a model that can derive sentence embeddings efficiently and accurately.
This study explores a transfer learning method in Sentence-BERT, which allows even low-resource languages such as Korean to leverage the power of models trained on high-resource languages such as English. Using translated sentence pairs in the source and target languages, the student model learns to map the translated sentences to the same points in the vector space as the teacher model, using a simple mean squared error loss. In this experiment, an English model was used as the teacher model and a cross-lingual model was used as the student model. To the knowledge of this author, no Korean Sentence-BERT model had been trained using this method as of the date of publication of this paper.
Authors: Yours truly, Feb 2022.
–
Paper Summary
Introduction
- High-performing NLP models depend on obtaining large amounts of reliable, clean training data. This is not an issue for high-resource languages (English, German, etc.), but it is a major obstacle for low-resource languages such as Korean.
- To alleviate this discrepancy in the size of available training data, researchers have continually experimented with cross-lingual transfer learning methods.
- Transfer learning transfers knowledge learned on one dataset to another dataset or task.
- This paper focuses on knowledge distillation for Sentence-BERT (S-BERT) from English to Korean data
SBERT Knowledge Distillation
The paper also has a short section about BERT, since it is the godfather of all recent NLP models. However, I've already discussed BERT multiple times on this blog (I think), so its details are omitted from this post.
Training Architecture
- Like all embedding models, the main goal of this model is to map similar sentences close to each other. But because this is a cross-lingual model, the more specific goal is to ensure that En-Ko translated sentence pairs are mapped to the same point in the vector space. This means the model is (theoretically) learning the semantic meanings of the sentences and aligning them across languages.
- The model architecture is shown in the figure below. Basically, "knowledge distillation" is a fancy word for transfer learning: there is a high-performing teacher model and a student model that distills the knowledge of the teacher model; see the sketch after this list.
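To make the objective concrete, here is a minimal sketch of the distillation loss with random tensors standing in for the actual sentence embeddings (the 768 dimensions and the variable names are illustrative, not taken from the paper):

```python
import torch
import torch.nn.functional as F

# Illustrative stand-ins for sentence embeddings (768 dims, as in BERT-base / XLM-R base).
teacher_en = torch.randn(768)                      # teacher embedding of the English sentence (fixed target)
student_en = torch.randn(768, requires_grad=True)  # student embedding of the same English sentence
student_ko = torch.randn(768, requires_grad=True)  # student embedding of the Korean translation

# Both the source sentence and its translation are pulled toward the teacher's vector,
# so En-Ko pairs end up at (approximately) the same point in the shared space.
loss = F.mse_loss(student_en, teacher_en) + F.mse_loss(student_ko, teacher_en)
loss.backward()
```

In actual training, the student embeddings come from the student model's forward pass so that the gradient updates its weights, while the teacher stays frozen and only provides target vectors.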
Advantages
- Knowledge distillation allows us to leverage the power of models trained on large amounts of high-resource-language data for tasks in low-resource languages.
- This method ultimately trains a multilingual model that can be used for cross-lingual tasks. Although this paper trains a bilingual model, the same method can be used to train a truly multilingual model (more than two languages).
- Unlike LASER and LSTM-based models, this model is trained for specific tasks, so it can achieve better performance on a given task (although it may be less apt for a wide range of general tasks).
English and Korean Multilingual SBERT
This section describes the actual model trained and evaluated in this paper.
Setup
- The teacher model: English monolingual bert-base-nli-stsb-mean-tokens
- Why? For its proven ability in English across a variety of tasks, including STSb and clustering.
- The student model: base XLM-RoBERTa (XLM-R)
- Why? XLM-R is a multilingual model trained on over 88 languages, and it uses the SentencePiece tokenizer, which is not language-specific and works with non-Roman languages (see the loading sketch below).
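A minimal loading sketch with the sentence-transformers library, assuming the checkpoint names above resolve as sentence-transformers / Hugging Face identifiers (the max sequence length is an illustrative choice, not the paper's setting):

```python
from sentence_transformers import SentenceTransformer, models

# Teacher: the English SBERT named above.
teacher = SentenceTransformer("bert-base-nli-stsb-mean-tokens")

# Student: XLM-R base wrapped with mean pooling so it outputs sentence embeddings.
word_embedding = models.Transformer("xlm-roberta-base", max_seq_length=128)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension())  # mean pooling by default
student = SentenceTransformer(modules=[word_embedding, pooling])
```

Both encoders produce 768-dimensional vectors, so the student can regress directly onto the teacher's outputs.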
Data
Despite all the advantages mentioned above, this kind of model is difficult to train because parallel datasets between the source and target languages are needed.
- Existing En-Ko datasets: Bible, Conversations, JHE, En-Ko News Corpus, KAIST Parallel Corpus, OPUS, JW300, WikiMatrix
- The data was pre-processed and heavily filtered using a rule-based algorithm
- Augmented datasets: English and Korean dictionaries and news sites were crawled and aligned.
- The alignment was done with the En-Ko SBERT trained on the existing En-Ko datasets; a rough sketch of this kind of similarity-based alignment follows below.
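The paper's exact alignment procedure isn't reproduced here; the sketch below only shows the general idea of scoring crawled candidate pairs with the intermediate En-Ko SBERT and keeping high-similarity matches (the checkpoint path, the example sentences, and the 0.7 threshold are all hypothetical):

```python
from sentence_transformers import SentenceTransformer, util

aligner = SentenceTransformer("output/en-ko-sbert-stage1")  # hypothetical path to the intermediate model

en_candidates = ["The president visited Seoul on Monday.", "Stocks fell sharply on Friday."]
ko_candidates = ["대통령이 월요일에 서울을 방문했다.", "금요일에 주가가 급락했다."]

en_emb = aligner.encode(en_candidates, convert_to_tensor=True)
ko_emb = aligner.encode(ko_candidates, convert_to_tensor=True)

# Cosine similarity between every English/Korean candidate pair.
scores = util.cos_sim(en_emb, ko_emb)

# Keep the best Korean match for each English sentence if it clears the threshold.
pairs = []
for i in range(len(en_candidates)):
    j = int(scores[i].argmax())
    if float(scores[i][j]) >= 0.7:
        pairs.append((en_candidates[i], ko_candidates[j], float(scores[i][j])))
```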
Training
Two models were trained for this experiment.
- Model trained with the clean and augmented datasets
- Model trained with raw data (of the existing En-Ko datasets) w/o augmented datasets
Preliminary training results revealed that the first model (clean + augmented data) outperformed the second (raw data only) on all training and unsupervised evaluations (MSE, KorSTS, KLUE STS), but by a small margin.
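For reference, the standard sentence-transformers recipe for this kind of multilingual distillation looks roughly like the sketch below; the parallel-data file name and the hyperparameters are illustrative, not the settings used in the paper:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, losses
from sentence_transformers.datasets import ParallelSentencesDataset

teacher = SentenceTransformer("bert-base-nli-stsb-mean-tokens")
student = SentenceTransformer("xlm-roberta-base")  # mean pooling is added automatically

# One tab-separated "english_sentence<TAB>korean_sentence" pair per line (illustrative file name).
train_data = ParallelSentencesDataset(student_model=student, teacher_model=teacher)
train_data.load_data("en-ko-parallel.tsv")

train_dataloader = DataLoader(train_data, shuffle=True, batch_size=64)
train_loss = losses.MSELoss(model=student)  # MSE against the teacher's embeddings

student.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=4,
    warmup_steps=1000,
)
```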
Evaluation Tasks
The models were evaluated on the following tasks against the base XLM-R model and KR-SBERT, a monolingual Korean SBERT model (a rough sketch of STS-style scoring follows the list).
- Supervised KorSTS: The clean model outperforms all other models
- Supervised KLUE STS: The clean model outperforms all other models
- En-Ko Translation Matching: XLM-R outperforms the clean model by 1 point
- Document Classification: KR-SBERT outperforms the clean model by about 1.6 points
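As a rough illustration, STS-style tasks are typically scored by the Spearman correlation between the model's cosine similarities and human-annotated scores; the model path and the example pairs below are made up, not actual KorSTS data:

```python
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("output/en-ko-sbert")  # hypothetical path to the trained bilingual model

# (sentence1, sentence2, gold score on a 0-5 scale) -- illustrative examples only.
examples = [
    ("한 남자가 기타를 치고 있다.", "남자가 악기를 연주하고 있다.", 4.2),
    ("한 남자가 기타를 치고 있다.", "아이가 밥을 먹고 있다.", 0.4),
    ("오늘 날씨가 좋다.", "날씨가 화창하다.", 4.6),
]

emb_a = model.encode([e[0] for e in examples], convert_to_tensor=True)
emb_b = model.encode([e[1] for e in examples], convert_to_tensor=True)
predicted = util.cos_sim(emb_a, emb_b).diagonal().cpu().tolist()
gold = [e[2] for e in examples]

# Higher Spearman correlation = predicted similarities rank pairs the same way humans do.
print(spearmanr(predicted, gold).correlation)
```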
Model Analysis
- A heatmap of English and Korean sentence embeddings shows that, despite the differences in length and tokenization between Korean and English, the model draws similar patterns for each sentence.
- This is a rough indication that the knowledge distillation, i.e. the alignment of translated sentences, succeeded (an approximation of this kind of plot is sketched below).
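The figure itself isn't reproduced here, but a plot in this spirit can be approximated by drawing the embedding vectors of a translated pair row by row and checking that the rows show a similar pattern (the model path, the example pair, and the 64-dimension cutoff are illustrative):

```python
import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("output/en-ko-sbert")  # hypothetical path to the trained bilingual model

sents = ["A man is playing a guitar.", "한 남자가 기타를 치고 있다."]
emb = model.encode(sents)  # numpy array of shape (2, 768)

# Each row is one sentence's embedding; a well-aligned En-Ko pair shows a visibly similar pattern.
plt.imshow(emb[:, :64], aspect="auto", cmap="viridis")  # first 64 dimensions for readability
plt.yticks(range(len(sents)), sents)
plt.xlabel("embedding dimension")
plt.colorbar()
plt.show()
```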
Conclusion
- Further research could involve a deeper comparison with XLM-R, in particular why XLM-R was able to outperform the bilingual model trained in this paper on the translation matching task.
- Another interesting aspect that requires further research is the effect of the quality of the parallel training datasets: does higher quality mean better results, or does quantity beat quality?
–
My Thoughts
- This is my own paper, but it has already been more than four months since I ran the experiments and wrote it up. Honestly, I couldn't remember much before writing this post, but once I opened the PDF and reread the introduction, it all came flooding back. It made me feel that writing the paper was worth it after all.
- Personally, I'm a little disappointed with the evaluation results. In particular, I still wonder why XLM-R did better on the translation matching task. It was only a one-point gap, but I reran the experiment several times trying to close it and the gap never shrank. During my thesis defense, though, a PhD researcher at NAVER told me it's the kind of gap that could even out or flip with continued parameter tuning. And then told me to run more experiments. Haha.
- Something I had wondered about throughout my master's while reading papers was data quantity vs. quality, so I was glad to get to study it, even at a small scale, in this paper. The data was so small, though, that I can't really call the results conclusive. Still, it was interesting that even with such small data there was a gap in the results. I suspect the quantity vs. quality debate will never truly be settled; I still think it's a riddle researchers will have to keep working on.
- I ran a huge number of experiments, constantly tweaking parameters and rearranging the training data, focused on getting good results. Because I was experimenting non-stop, sometimes running three or four models a day, tracking them became really difficult. I would kick off a run and then forget which parameter settings or which data I had used. That drove home the importance of record-keeping, and after losing a day's worth of experiments I started meticulously logging in a spreadsheet when each model was run and under which conditions. Now that I'm at a company, there seem to be model serving platforms that organize this for you. I'd like to look into how my company's platform is set up and what features it offers.
- I was already employed when I started writing the thesis, and honestly it felt somewhat rushed since I was writing while working; I had lingering regrets right up to the final defense. For a master's thesis it seemed too simple, and I couldn't get a feel for how to approach the model analysis, but my advisor didn't offer much direction. That still stings a little, but rereading it now, the structure is solid and it's logically well written. Am I tooting my own horn too much? With time, the hard parts, the frustrations, and the regrets all fade naturally.
- Anyway, that's how I wrapped up my master's degree in computational linguistics. It was fun, and I have no regrets about the time.