Abstract: Almost all existing machine translation models are built on top of character-based vocabularies: characters, subwords or words. Rare characters from noisy text or character-rich languages such as Japanese and Chinese, however, can unnecessarily take up vocabulary slots and limit the vocabulary's compactness. Representing text at the level of bytes and using the 256-byte set as the vocabulary is a potential solution to this issue, but high computational cost has prevented it from being widely deployed or used in practice. In this paper, we investigate byte-level subwords, specifically byte-level BPE (BBPE), which is more compact than a character vocabulary and has no out-of-vocabulary tokens, yet is more efficient than using pure bytes only.
–
Paper Summary
Goal: examine byte-level “subwords” that tokenize text into variable-length byte n-grams, as opposed to character-level subwords, which represent text as a sequence of character n-grams.
Byte Level Text Representation
- UTF-8 Encoding: each Unicode character is encoded as 1 to 4 bytes, so a sentence is a sequence of bytes, not characters; there are about 138K Unicode characters covering over 150 languages, yet UTF-8 can represent all of them with only the 256 possible byte values
- Byte sequence is longer than character sequence –> segment the byte sequence into variable-length byte n-grams (byte-level “subwords”), i.e., BBPE (a toy sketch follows this list)
- BBPE symbols can be partial characters shared by different characters, or combinations of complete and partial characters –> this necessitates a large context around each symbol for disambiguation and for learning character boundaries
- Contextualize BBPE embeddings using a depth-wise convolutional layer or a bidirectional recurrent layer with gated recurrent units (Bi-GRU)
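A minimal Python sketch of the two ideas above: viewing a sentence as its UTF-8 byte sequence and merging frequent adjacent symbol pairs into byte-level subwords. This is my own toy illustration, not the paper's implementation; the example sentence and the number of merge steps are arbitrary.

```python
# Toy illustration of byte-level text representation and BBPE-style merges
# (not the paper's code). A sentence is viewed as its UTF-8 byte sequence;
# merging frequent adjacent symbol pairs then yields variable-length byte
# n-grams, exactly as BPE does with characters.
from collections import Counter

sentence = "Übersetzung 翻訳"                    # mixed Latin / Japanese text
print(len(sentence))                             # 14 characters
print(len(sentence.encode("utf-8")))             # 19 UTF-8 bytes (longer sequence)

# Start from single bytes and repeatedly merge the most frequent adjacent pair.
tokens = [bytes([b]) for b in sentence.encode("utf-8")]

def merge_step(tokens):
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return tokens
    a, b = max(pairs, key=pairs.get)             # most frequent adjacent pair
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
            out.append(a + b)                    # new byte n-gram symbol; it may
            i += 2                               # cover only part of a character
        else:
            out.append(tokens[i])
            i += 1
    return out

for _ in range(5):
    tokens = merge_step(tokens)
print(tokens)  # some symbols are partial characters (not valid UTF-8 on their own)
```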
Decoding with Byte-Level Subwords
- While any sentence can be represented as a byte sequence, the converse is not necessarily true: there are byte sequences that do not decode to valid character sequences.
- But such cases are rare
- A common error is redundant repeating bytes –> recover as many Unicode characters as possible from the invalid byte sequence with a dynamic-programming procedure that runs in linear time (a rough sketch follows this list)
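A rough sketch of that recovery step, assuming a simple left-to-right scan that keeps every byte span which decodes to a character and drops bytes that cannot start one. The paper's actual procedure is a dynamic-programming search that maximizes the number of recovered characters, so this greedy version only approximates the idea.

```python
# Sketch of recovering characters from a model output that is not valid UTF-8
# (e.g. stray or redundant continuation bytes). Greedy left-to-right scan,
# linear time; an approximation of the paper's dynamic-programming recovery.
def recover_characters(byte_seq: bytes) -> str:
    out, i, n = [], 0, len(byte_seq)
    while i < n:
        for width in (1, 2, 3, 4):                   # UTF-8 characters are 1-4 bytes
            try:
                out.append(byte_seq[i:i + width].decode("utf-8"))
                i += width
                break
            except UnicodeDecodeError:
                continue
        else:
            i += 1                                   # drop one undecodable byte
    return "".join(out)

# A redundant continuation byte (b"\x9c") is dropped, the rest is recovered.
print(recover_characters("翻訳".encode("utf-8") + b"\x9c" + b"ok"))  # 翻訳ok
```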
Experimental Settings
- Datasets: En-De, Ja-En, Si-En, X-En
- Transformer models implemented in Fairseq
Results and Analysis (BPE vs. BBPE)
- Symbol Frequency Distribution: the BBPE vocabulary has the flexibility to decompose rare characters into byte n-grams from the vocabulary instead of including them directly, freeing vocabulary slots for other frequent symbols
- BBPE symbols are more evenly distributed than BPE ones
- By setting different BBPE vocabulary sizes, we can control the level of rare character decomposition and symbol sharing across different characters
- Cross-Lingual Sharing: in multilingual settings, symbol sharing also happens across different languages despite their different writing systems –> allows maximizing parameter sharing not only in the model but also in the vocabulary of a universal model
- Sequence Length: BBPE produces longer tokenized sequences (and hence longer training time) than BPE because each byte-level symbol covers less text than a character-level one
- But BBPE is optimized for the same compression-based objective as BPE and is still far more efficient than a character or pure byte vocabulary
- Contextualization: compare three ways of contextualizing token embeddings: none, a 1-layer convolution, and a 1-layer Bi-GRU, evaluated on X-En with the Transformer-base model (see the Bi-GRU sketch at the end of this section)
- Figure 4: all vocabularies benefit from contextualization
- Performance gains are more significant for fine-grained vocabularies: byte, character and BBPE
- For BBPE, long-range contextual information from the Bi-GRU brings over a 4% gain in validation BLEU in all cases
- Encoding context into the token embeddings reduces the difficulty of learning attention over multiple source tokens and makes model training easier
- Noisy Character Sets: BPE includes all characters, even rare or noisy ones, while BBPE with small vocabulary settings (2K and 4K) excludes rare characters; performance is comparable to the BPE 32K baseline with smaller model capacity
- Character-rich Languages: languages using logographic writing systems, such as Chinese and Japanese, can have over 50K characters, while only a small portion of them are frequently used; on such languages, BBPE 4K is comparable to or outperforms BPE when using the Transformer-large model
- Many-to-En Translation: evaluated on 58 languages parallel to English with 10.8K characters from different writing systems, between which characters are not necessarily shared
- Characters do share byte n-grams
- BBPE 2K and 4K both beat BPE baseline on overall BLEU
- The byte model and character model perform significantly better than all BPE and BBPE models in this multilingual setting, possibly because BBPE and BPE suffer from imbalanced source and target sentence lengths as well as varying token granularities in multilingual parallel sentences
- Nonetheless, BBPE is still the most practical solution since it strikes a good balance between performance (better BLEU than BPE) and speed (much shorter tokenized sequences than character or byte models)
- Transfer Learning: because BBPE contains all UTF-8 bytes and has no out-of-vocabulary tokens, BBPE-based models can be transferred between languages with non-overlapping character sets
- The transferred model gains 0.9-1.8 BLEU points over the baselines, suggesting the generality of pretrained BBPE embeddings and their ability to adapt to different languages with unseen characters
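As referenced in the Contextualization item above, here is a minimal PyTorch sketch of contextualizing BBPE token embeddings with a 1-layer Bi-GRU before they enter the Transformer encoder. The module name, vocabulary size and dimensions are placeholders I picked for illustration, not the released Fairseq configuration.

```python
# Sketch: contextualize byte-level subword embeddings with a 1-layer Bi-GRU
# so each position sees its neighbors, which helps resolve symbols that are
# only partial characters. Placeholder sizes, not the paper's exact setup.
import torch
import torch.nn as nn

class ContextualizedBBPEEmbedding(nn.Module):
    def __init__(self, vocab_size=4096, embed_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Half-sized hidden states per direction so the concatenated
        # forward/backward outputs keep the original embedding dimension.
        self.bigru = nn.GRU(embed_dim, embed_dim // 2, num_layers=1,
                            batch_first=True, bidirectional=True)

    def forward(self, token_ids):             # token_ids: (batch, seq_len)
        x = self.embed(token_ids)              # (batch, seq_len, embed_dim)
        x, _ = self.bigru(x)                   # contextualized embeddings, same shape
        return x                               # feed these into the Transformer encoder

emb = ContextualizedBBPEEmbedding()
out = emb(torch.randint(0, 4096, (2, 30)))     # dummy batch of BBPE token ids
print(out.shape)                               # torch.Size([2, 30, 512])
```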
My Thoughts
- Following yesterday's SentencePiece paper, I read the somewhat more recent BBPE paper. I always just use whatever tokenizer comes with a model, but I was curious why a particular tokenizer is chosen and how it works, so I read the SentencePiece paper, an evolution of WordPiece, and then the BBPE paper, an evolution of BPE.
- First of all, reading this paper helped me understand why the SentencePiece paper described its decoder as the inverse of its encoder. With a byte-level representation, decoding often cannot be done by simply reversing the encoding (invalid output / redundant byte errors), so a different algorithm is used for decoding. Since text is fundamentally sequential data, the error patterns are handled with dynamic programming. As always, you have to read a lot of papers to understand what other papers are even talking about…
- In a way, BBPE is an even lower-level representation than SentencePiece; it splits text to the point where humans can no longer interpret the pieces. It feels like representations keep getting more abstract? Because it uses byte subwords, the vocabulary is smaller than a character vocabulary and it is more efficient than using raw bytes. Honestly, if you look closely, byte subwords probably carry little meaning for people; it feels like a tokenization method tailored more to the computer. The ultimate goal of NLP is for computers to someday understand natural language, but I don't think we should assume computers will understand it the way humans do. To some extent, we have to think like a computer(?).
- Before reading the paper, I noticed the authors all seemed to be of Asian descent and wondered why, but it turned out they put a lot of care into making East Asian languages, especially Chinese and Japanese, processable alongside English. It could be applied to other character-rich languages too, but it reminded me once again that research tends to focus on countries with a certain… power? global influence? scholarly influence!? Fortunately, Korean NLP research also benefits as Chinese and Japanese NLP advance. Haha