Facebook LASER
Last year, Facebook released code for LASER, or Language-Agnostic SEntence Representations. As stated on their GitHub, LASER is “a library to calculate and use multilingual sentence embeddings.” I’ve read their papers, including the one explaining the algorithm behind LASER and the one where they use data mined with LASER to train their M2M-100 language model.
To summarize very briefly, LASER is an encoder model trained on 93 languages that can, in theory, produce quality sentence embeddings for all of them. You can use it for multilingual document classification, multilingual mining, multilingual similarity comparisons, and various other multilingual tasks. The authors also tweaked the algorithm used to compare sentence embeddings so that the cosine-similarity threshold for similar sentences is effectively adjusted to your dataset, all to make the process more efficient and more accurate.
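My reading of that margin criterion from their paper is roughly the following. This is only a sketch with numpy, assuming L2-normalized embeddings and a neighbor count `k` that I picked for illustration, not their actual implementation:

```python
import numpy as np

def margin_score(x, y, x_neighbors, y_neighbors, k=4):
    """Ratio-margin score between two sentence embeddings.

    x, y        : L2-normalized embeddings of the candidate pair
    x_neighbors : embeddings of the k nearest neighbors of x in the other language
    y_neighbors : embeddings of the k nearest neighbors of y

    The cosine of the pair is divided by the average cosine of each side's
    neighborhood, so a pair only scores high if it stands out from the
    similarity "background" around it; that is what makes the threshold
    effectively dynamic per sentence instead of one fixed cutoff.
    """
    cos_xy = float(np.dot(x, y))
    avg_x = np.mean([float(np.dot(x, z)) for z in x_neighbors[:k]])
    avg_y = np.mean([float(np.dot(y, z)) for z in y_neighbors[:k]])
    return cos_xy / ((avg_x + avg_y) / 2.0)
```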
It all sounds very good. Maybe too good… When I actually set out to use LASER for multilingual text alignment, I found it to be error-ridden and of poor quality. Here’s why.
Importing functions in LASER…
If you go to the official LASER GitHub, it appears to have been updated as recently as four months ago, at the time I’m writing this. To be fair, I don’t think the models or any of the main components of the repo have been updated that frequently since its release in 2019.
Still, that should be no excuse. When you clone the git repository and run any of their pre-written scripts, you will 100% get an error. Honestly, with any repo that you clone, you are likely to get an error because of differences in environments or version changes that the authors have not yet been able to account for. However, the first error you get when you use LASER is an import error: the classes and functions in LASER/source/lib are not made available to the main scripts in LASER/source/.
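The fix is just to make sure LASER/source/lib is on the Python path before the scripts under LASER/source run. Something along these lines is roughly what I used; the fallback path is illustrative, and the $LASER environment variable convention follows how the repo expects to be set up, so adjust for your own clone:

```python
import os
import sys

# The LASER scripts expect a $LASER env var pointing at the repo checkout;
# adjust the fallback path to wherever you cloned the repository.
LASER_ROOT = os.environ.get("LASER", "/path/to/LASER")

# Put both source/ and source/lib/ on the path so the modules can find each other.
sys.path.insert(0, os.path.join(LASER_ROOT, "source"))
sys.path.insert(0, os.path.join(LASER_ROOT, "source", "lib"))
```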
And this is just such a stupid error that I can’t believe the researchers at Facebook made it. It’s not even difficult to fix – just annoying. Anyway when you fix this error, you get another one…
Languages for LASER
LASER proudly states on its repo…
We now provide an encoder which was trained on 93 languages, written in 23 different alphabets [6]. This includes all European languages, many Asian and Indian languages, Arabic, Persian, Hebrew, ..., as well as various minority languages and dialects.
So reading this, you naturally assume that using all 93 of these languages through this repo will be a seamless experience. However, when I entered the language code for Korean, ko, I was greeted with a language-code error.
Upon closer examination of their source code, I realized they only provide tokenization/task support for a handful of languages, fewer than 10. I ended up writing my own code for Korean tokenization and preprocessing to feed into the BPE step and then the encoder.
To be clear, this part is not that difficult. I just wish they had been more transparent on the front page about which languages this specific repo can process and which languages need to be tokenized manually. For those who haven’t dealt with Korean tokenization before, it involves installing a package and learning how to use it. Fortunately, I did have experience tokenizing Korean sentences, so this was not a hard problem to solve; but if they had been more transparent from the start, I would not have had to go through their code line by line to find out what the problem was.
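For anyone facing the same gap, the Korean step boils down to something like this. It’s a sketch using KoNLPy’s Okt tagger as a stand-in tokenizer; the exact package matters less than producing one whitespace-tokenized sentence per line that FastBPE can consume:

```python
# Sketch: pre-tokenize Korean text into whitespace-separated morphemes
# so it can be fed to FastBPE and then the LASER encoder.
from konlpy.tag import Okt

okt = Okt()

def tokenize_korean_file(in_path: str, out_path: str) -> None:
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            tokens = okt.morphs(line.strip())  # morpheme-level tokens
            fout.write(" ".join(tokens) + "\n")
```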
Using LASER for Multilingual Mining
So I’ve fixed all the import errors, figured out how to integrate a Korean tokenization package into LASER, and I’m ready to go. Just to clarify, the task I’m attempting is to align English and Korean sentences that are given unordered. This is the similarity task in the LASER repo.
This task basically follows these steps (sketched in code after the list):
- Tokenize text of both languages
- Apply FastBPE to tokenized text file
- Encode the BPE text files using the LASER encoder
- Create a FAISS index of the embeddings from (3)
- For each sentence, search the index for the most similar embedding. The most similar should be the parallel sentence in the other language.
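Roughly, steps 3 through 5 come down to the following. This is a sketch assuming the LASER embeddings have already been written to disk as raw float32 vectors; the file names and the similarity threshold are my own illustrative choices, not anything the repo prescribes:

```python
# Sketch of steps 3-5: load embeddings, build a FAISS index over the Korean
# side, and take the nearest Korean neighbor for each English sentence.
import faiss
import numpy as np

DIM = 1024  # LASER sentence embeddings are 1024-dimensional

def load_embeddings(path: str) -> np.ndarray:
    emb = np.fromfile(path, dtype=np.float32).reshape(-1, DIM)
    faiss.normalize_L2(emb)  # normalize so inner product == cosine similarity
    return emb

en = load_embeddings("english.emb")   # illustrative file names
ko = load_embeddings("korean.emb")

index = faiss.IndexFlatIP(DIM)        # exact inner-product search
index.add(ko)

scores, ids = index.search(en, 1)     # best Korean match for each English sentence

THRESHOLD = 0.8                       # my cutoff, tune per dataset
pairs = [(i, int(ids[i][0]), float(scores[i][0]))
         for i in range(len(en)) if scores[i][0] >= THRESHOLD]
```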
There are obvious benefits to this architecture. It’s simple and fast, and LASER already provides you with an encoder that they assure you performs well.
I tried to align bilingual news articles from Joongang Ilbo using this method.
On some articles, LASER did a pretty good job:
Chanel's pragmatism is considered the key to her success, along with the marketing targeting the high class. 이렇듯 샤넬의 실용주의는 최상류층을 겨냥한 귀족 마케팅과 함께 브랜드의 성공 요인으로 꼽힌다.
On July 1, prices of major Chanel products are expected to rise by 12 percent, and the waiting lines are getting even longer. 내달 1일 샤넬백 주요 제품의 가격이 12% 인상한다는 소식이 들려오자 최근 샤넬 매장의 대기행렬은 더욱더 길어지고 있다.
On others it did okay…
A Korean diplomatic official leaked to the local press the allegation that Japan had provisionally agreed to a brief meeting between the two leaders but unilaterally called it off because of Korea's military drill to defend the Dokdo islets in the East Sea. 발단은 지난주 말 영국에서 열린 주요 7개국(G7) 정상회의에서 문재인 대통령과 스가 요시히데(菅義偉) 일본 총리 간의 회담이 불발된 데 있다.
On some it completely missed the mark…
More than 20,000 people joined the event, and reviewers wrote that they had eaten their first theater popcorn in a long time. 일회용품 자제 차원에서 뚜껑 있는 다회용 식품용기를 가져오면 6000원에 가득 채워주는 행사였다.
This is disappointing, because I had already set a pretty high threshold on the similarity scores these sentence pairs had to clear. Not only were most pairs at the okay-to-bad level, this also caused a serious loss of data. In the end, I came away with only 3,766 sentence pairs, having started from 1,187 articles of 20~30 sentences each. Even by modest estimates, that’s around 20,000 sentences each in English and Korean.
Using SBERT for Multilingual Mining Instead
To see if SBERT could perform better, I tested the same similarity task using a multilingual SBERT model that I had been training. I had doubts about this model because the data I used to fine-tune it was extremely limited. It turns out I had nothing to worry about.
Here are the same sentences LASER got so wrong…
A Korean diplomatic official leaked to the local press the allegation that Japan had provisionally agreed to a brief meeting between the two leaders but unilaterally called it off because of Korea's military drill to defend the Dokdo islets in the East Sea. 한국 외교 당국자는 "약식 회담을 하기로 잠정합의를 해 놓고 독도 훈련을 핑계로 일본이 일방적으로 약속을 깨버렸다"는 사실을 언론에 공개했다.
More than 20,000 people joined the event, and reviewers wrote that they had eaten their first theater popcorn in a long time. 전국 참여자가 2만명에 이르렀고 "오랜만에 극장 팝콘 맛 봤다"는 후기가 줄이었다.
As you can see if you are fluent in both languages, SBERT did an excellent job selecting the appropriate pairs. Using SBERT, I was able to extract 13,171 English-Korean sentence pairs, more than three times what LASER produced, even if that still falls short of the original dataset’s size (I set a similarity threshold here as well).
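For reference, the SBERT side of the comparison needs very little code. This is a sketch using the sentence-transformers library; the model name is a placeholder for the multilingual checkpoint I had fine-tuned, and the threshold is again my own choice:

```python
# Sketch: align English and Korean sentences with a multilingual SBERT model.
# "my-multilingual-sbert" stands in for the fine-tuned checkpoint; any
# multilingual sentence-transformers model can be dropped in.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("my-multilingual-sbert")

def align(en_sents, ko_sents, threshold=0.7):
    en_emb = model.encode(en_sents, convert_to_tensor=True, normalize_embeddings=True)
    ko_emb = model.encode(ko_sents, convert_to_tensor=True, normalize_embeddings=True)
    sims = util.cos_sim(en_emb, ko_emb)  # (len(en), len(ko)) similarity matrix
    pairs = []
    for i, row in enumerate(sims):
        j = int(row.argmax())            # best Korean candidate for sentence i
        if float(row[j]) >= threshold:
            pairs.append((en_sents[i], ko_sents[j], float(row[j])))
    return pairs
```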
LASER Datasets
If LASER’s method is flawed, then without question the data collected with LASER is too. The LASER team released two large-scale datasets: CCMatrix and WikiMatrix. CCMatrix is a collection of sentence pairs mined from the web and then aligned using LASER; WikiMatrix is “bitext extraction for 1620 language pairs in WikiPedia”.
CCMatrix
I actually had trouble even downloading this dataset, and after I did manage to download part of it, there was an issue with the alignment. Instead of trying to figure it out, I ended up dropping it altogether after I saw the quality of WikiMatrix.
WikiMatrix
So I’m aware that Wikipedia is an open online encyclopedia where users contribute to each page, and that the English and Korean pages aren’t going to be exact translations of each other. But there are several problems with the WikiMatrix dataset. First, not all the sentences in the English file are English. I found Japanese, Arabic, and Korean text mixed in with the English, and vice versa: there are English sentences in the Korean file. This means that, at times, the two sides of a “pair” are duplicate sentences in the same language. Second, the aligned sentences sometimes have nothing to do with each other…
(1) "Dutch get a kick out of baseball, too". 2013년 9월 1일에 확인함. “Dutch get a kick out of baseball, too”.
(2) "20명 사망 女사우나 진입 시기 등 구조 골든타임 논란(종합)". “20명 사망 女사우나 진입 시기 등 구조 골든타임 논란(종합)”.
(3) Yet life is not the less worth living because of any of these, nor has any man truly lived until he has tasted of them all. 이러한 이상은 내세(來世)에서만 추구할 수 있다는 것이며, 현세에서는 불가능하다.
The most common type of error was (1): two identical English sentences with a small snippet of Korean, included as “parallel data.” It is actually harder to find pairs that are genuinely parallel…
My Thoughts
- So when I first read about LASER, I thought it was going to be an amazing tool, fast and effective. What I found was very different. Not only are there errors in something as basic as importing a module, there are also errors in downloading the datasets and in other parts of the implementation.
- Also, having used their algorithm and encoder, I find it hard to believe that they were able to achieve such great results using just this data…
- Is it okay to sacrifice quality for quantity? Is more always better when we’re talking about data for machine learning?