Codility Lesson 6
Lesson 6: Sorting. 1. Distinct. Instructions: Write a function def solution(A) that, given an array A consisting of N integers,...
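The excerpt cuts off before the full statement, but Codility's "Distinct" asks for the number of distinct values in A. A minimal sketch of the standard O(N log N) approach, assuming that task:

```python
def solution(A):
    # Codility "Distinct": count the distinct values in A.
    # Sorting groups equal values together, so every position where a
    # value differs from its predecessor marks a new distinct value.
    if not A:
        return 0
    A = sorted(A)
    count = 1
    for i in range(1, len(A)):
        if A[i] != A[i - 1]:
            count += 1
    return count
```

In Python, `len(set(A))` gives the same answer in one line; the sorted version mirrors the comparison-based reasoning this Sorting lesson is about.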
English to Korean Multilingual Transfer Learnin...
Abstract: This study focuses on constructing a Korean Sentence-BERT model with a novel method, student-teacher knowledge distillation. The limitations...
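The teaser names the method but cuts off before describing it. For orientation, a minimal sketch of a student-teacher distillation loss of the kind used to transfer sentence embeddings across languages (the encoder callables and batch format are assumptions, not the post's actual code):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student, teacher, en_batch, ko_batch):
    # The frozen English teacher provides target embeddings; the student
    # is trained to reproduce them for BOTH the English source and its
    # Korean translation, so the two languages land in a shared space.
    with torch.no_grad():
        target = teacher(en_batch)
    return (F.mse_loss(student(en_batch), target) +
            F.mse_loss(student(ko_batch), target))
```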
The Uncomfortable Truth About Facebook LASER...
Facebook LASER Last year, Facebook released the code for LASER, or Language-Agnostic SEntence Representations. As stated on its GitHub page, LASER...
Billion-scale similarity search with GPUs
Abstract: Similarity search finds application in specialized database systems handling complex data such as images or videos, which are typically...
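This is the FAISS paper. A minimal usage sketch of the library's exact-search baseline (dimensions and data are illustrative; the GPU variants wrap the same interface):

```python
import numpy as np
import faiss  # pip install faiss-cpu or faiss-gpu

d = 128                                            # embedding dimension
xb = np.random.rand(10_000, d).astype("float32")   # database vectors
xq = np.random.rand(5, d).astype("float32")        # query vectors

index = faiss.IndexFlatL2(d)    # brute-force L2 search baseline
index.add(xb)
D, I = index.search(xq, 4)      # distances and ids of the 4 nearest neighbors
```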
Extractive Summarization in NLP: Training with ...
Summarization in NLP Among the many challenges faced by Natural Language Processing (NLP) researchers today, the summarization task is perhaps...
Codility Lessons 2-5
Codility I’ve been putting my algorithm coding skills to the test through Codility. Here’s how I solved the problems in...
How to Use Streamlit
Streamlit Introduction Again, I was training a simple chatbot and I wanted to upload it online so that other users...
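To give a feel for why Streamlit suits this use case, a minimal sketch of an app that echoes user input (the response logic is a placeholder to swap for a real chatbot call):

```python
# app.py -- run with: streamlit run app.py
import streamlit as st

st.title("Simple Chatbot Demo")

user_input = st.text_input("Say something:")
if user_input:
    # Placeholder response; replace with a real model call.
    st.write(f"Bot: you said '{user_input}'")
```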
How to Use PyTorch Lightning
PyTorch Lightning Introduction I was training a simple GPT2 chatbot the other day and came across code that utilized...
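For context, the core of PyTorch Lightning is organizing the model, training step, and optimizer into one LightningModule; a minimal sketch (the toy linear model is illustrative only):

```python
import torch
import pytorch_lightning as pl

class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)  # toy model for illustration

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.cross_entropy(self.layer(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# trainer = pl.Trainer(max_epochs=3)
# trainer.fit(LitModel(), train_dataloader)  # train_dataloader: your DataLoader
```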
Useful Python Trivia for Convenient Coding
Last time I put together the Python functions and tips needed for NLP; I have no idea whether anyone else reads that post, but I keep going back to it myself lol. In fact...
Margin-based Parallel Corpus Mining with Multil...
Abstract: Machine translation is highly sensitive to the size and quality of the training data, which has led to an...
Sentence-BERT: Sentence Embeddings using Siames...
Abstract: BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019) have set a new state-of-the-art performance on sentence-pair...
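A minimal sketch of using a trained SBERT model through the authors' sentence-transformers library (the checkpoint name is one of the library's published models, not necessarily the post's choice):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(["A man is eating food.",
                    "Someone is having a meal."])
print(util.cos_sim(emb[0], emb[1]))  # cosine similarity of the sentence pair
```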
Mutual Information and Diverse Decoding Improve...
Abstract: Sequence-to-sequence neural translation models learn semantic and syntactic relations between sentence pairs by optimizing the likelihood of the target...
DialoGPT: Large-Scale Generative Pre-training ...
Abstract: We present a large, tunable neural conversational response generation model, DialoGPT (dialogue generative pre-trained transformer). Trained on...
GPT-2 and GPT-3: Towards a More General Languag...
About GPT Since I never cared much about text generation, all I knew about GPT was roughly that it's an autoregressive model... and that it's used for generation... But recently...
Beyond English-Centric Multilingual Machine Tra...
Abstract: Existing work in translation demonstrated the potential of massively multilingual machine translation by training a single model able to...
Mastering HuggingFace
Anyone studying or researching NLP these days has surely used HuggingFace, or at least heard of it. From BERT to GPT, the models from just about any NLP paper...
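A minimal sketch of the library's typical entry point, the Auto classes (the checkpoint name is illustrative):

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Hello, HuggingFace!", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, seq_len, hidden_size)
```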
Attention is All You Need
Abstract: The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The...
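The paper's central operation is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V; a direct PyTorch rendering:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    # softmax(Q K^T / sqrt(d_k)) V, with optional masking of positions
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return F.softmax(scores, dim=-1) @ V
```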
Reformer: The Efficient Transformer
Abstract: Large Transformer models routinely achieve state-of-the-art results on a number of tasks, but training these models can be prohibitively...
LexRank: Graph-based Lexical Centrality as Sali...
Abstract: We introduce a stochastic graph-based method for computing relative importance of textual units for Natural Language Processing. We test...
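LexRank scores sentences by their centrality in a similarity graph, computed as a stationary distribution; a numpy sketch of the power iteration (threshold and damping values are illustrative, not the paper's tuned settings):

```python
import numpy as np

def lexrank_scores(sim, threshold=0.1, damping=0.85, iters=50):
    # sim: (n, n) pairwise sentence similarities; the diagonal
    # (self-similarity) guarantees every row has at least one edge.
    adj = (sim >= threshold).astype(float)
    P = adj / adj.sum(axis=1, keepdims=True)  # row-stochastic transitions
    n = len(P)
    scores = np.full(n, 1.0 / n)
    for _ in range(iters):                    # power iteration
        scores = (1 - damping) / n + damping * P.T @ scores
    return scores                             # higher = more central sentence
```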
MAD-X: An Adapter-Based Framework for Multi-Tas...
Abstract: The main goal behind state-of-the-art pre-trained multilingual models such as multilingual BERT and XLM-R is enabling and bootstrapping...
Useful Python Trivia for NLP
Today I want to round up the basic Python functions that are good to know for NLP work. These are functions I actually use a lot in my assignments and experiments...
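The teaser doesn't say which functions the post covers, so as a flavor of the genre, a few staples that come up constantly in NLP scripting:

```python
from collections import Counter

text = "the quick brown fox jumps over the lazy dog"

tokens = text.split()            # whitespace tokenization
vocab = Counter(tokens)          # token -> frequency
print(vocab.most_common(2))      # [('the', 2), ('quick', 1)]

# enumerate/zip for index-aligned iteration over parallel lists
lengths = [len(t) for t in tokens]
for i, (tok, n) in enumerate(zip(tokens, lengths)):
    pass  # e.g. build (position, token, length) records

print(" ".join(tokens))          # join tokens back into a string
```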
Parameter-Efficient Transfer Learning for NLP
Abstract: Fine-tuning large pre-trained models is an effective transfer mechanism in NLP. However, in the presence of many downstream tasks,...
Neural Machine Translation with Byte-Level Subw...
Abstract: Almost all existing machine translation models are built on top of character-based vocabularies: characters, subwords or words. Rare characters...
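The paper's byte-level alternative builds on the fact that UTF-8 maps any text into a fixed 256-symbol base vocabulary; a two-line illustration:

```python
# Any character decomposes into UTF-8 bytes, so there is no
# out-of-vocabulary symbol, only longer byte sequences for rare scripts.
s = "안녕"                       # two Korean characters
print(list(s.encode("utf-8")))   # [236, 149, 136, 235, 133, 149] -> 6 bytes
```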
SentencePiece: A simple and language independen...
Abstract: This paper describes SentencePiece, a language-independent subword tokenizer and detokenizer designed for Neural-based text processing, including Neural Machine Translation....
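A minimal sketch of training and applying a SentencePiece model with the Python bindings (corpus path, model prefix, and vocabulary size are placeholders):

```python
import sentencepiece as spm

# Train a unigram model on a raw-text corpus, one sentence per line.
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="spm_ko",
    vocab_size=8000, model_type="unigram",
)

sp = spm.SentencePieceProcessor(model_file="spm_ko.model")
print(sp.encode("한국어 텍스트 처리", out_type=str))  # subword pieces
```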
Korean NLP Preprocessing Module | Building a Korean Preprocessing Module
Every time I do it, I'm reminded that preprocessing is a genuinely labor-intensive task. But it is a crucial one: model performance varies enormously with the quality of the training data....
Korean-English Parallel Corpora | Downloading OPUS Corpora...
Parallel Corpus To train multilingual NLP models for cross-lingual tasks, you need parallel corpora. Usually a parallel corpus consists of...
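The excerpt cuts off, but the usual format is two line-aligned plain-text files, one per language; a sketch of pairing them up (file names hypothetical):

```python
# Line i of the .ko file translates line i of the .en file.
with open("opus.ko-en.ko", encoding="utf-8") as f_ko, \
     open("opus.ko-en.en", encoding="utf-8") as f_en:
    pairs = [(ko.strip(), en.strip()) for ko, en in zip(f_ko, f_en)]

print(pairs[0])  # (Korean sentence, its English translation)
```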
Korean Corpus List and Preprocessing Prep
Korean Corpora The performance of a pre-trained model depends entirely on the quantity and quality of its data. Korean doesn't have as many corpora as English, but the number is steadily growing. Below are the ones I...
RoBERTa: A Robustly Optimized BERT Pretraining ...
Abstract: We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many...
Building a GitHub Pages Blog
As someone who isn't on great terms with GitHub, I struggled a bit after deciding to build a GitHub Pages blog. In case I ever build another blog later...
Cross-Lingual Alignment vs. Joint Training: A C...
Abstract: Learning multilingual representations of text has proven a successful method for many cross-lingual transfer learning tasks. There are two...
Making Monolingual Sentence Embeddings Multilin...
Abstract: We present an easy and efficient method to extend existing sentence embedding models to new languages. This makes it possible to...
Adversarial NLI: A New Benchmark for Natural La...
Abstract: We introduce a new large-scale NLI benchmark dataset, collected via an iterative, adversarial human-and-model-in-the-loop procedure. We show that training...