suyeon

An NLP Newbie's Growth Journal 📝

Writing
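Sentence Writing Techniques

I had recently been wondering, "How can I write better in Korean?" when I happened to come across, on my company's learning portal, journalist Bae Sang-bok (배상복)'s writing...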

Python
Codility Lesson 6

Lesson 6: Sorting 1. Distinct Instructions Write a function def solution(A) that, given an array A consisting of N integers,...

Paper Review
English to Korean Multilingual Transfer Learning with Sentence-BERT

Abstract: This study focuses on constructing a Korean Sentence-BERT model in a novel method, using student-teacher knowledge distillation. The limitations...

The Last Year

Where I’ve Been I don't really write this blog for other people to read, but to give a quick update for my own records: I finished graduate school and...

NLP Model
The Uncomfortable Truth About Facebook LASER...

Facebook LASER Last year, Facebook released code for LASER, or Language-Agnostic SEntence Representations. As it states on their GitHub, LASER...

Paper Review
Billion-scale similarity search with GPUs

Abstract: Similarity search finds application in specialized database systems handling complex data such as images or videos, which are typically...

NLP Model
Extractive Summarization in NLP: Training with BERT

Summarization in NLP Among the many challenges faced by Natural Language Processing (NLP) researchers today, the summarization task is perhaps...

Python
Codility Lessons 2-5

Codility I’ve been putting my algorithm coding skills to the test through Codility. Here’s how I solved the problems in...

Guides
How to Use Streamlit

Streamlit Introduction Again, I was training a simple chatbot and I wanted to upload it online so that other users...

Guides
How to Use PyTorch Lightning

PyTorch Lightning Introduction I was training a simple GPT2 chatbot the other day and came across some code that utilized...

Guides
Handy Python Tips and Tricks for Easier Coding

Last time I put together the Python functions and tips I need for NLP, and while I have no idea whether anyone else reads that post, I keep going back to it myself, lol. In effect...

Paper Review
Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings

Abstract: Machine translation is highly sensitive to the size and quality of the training data, which has led to an...

Paper Review   NLP Model
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

Abstract: BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019) have set a new state-of-the-art performance on sentence-pair...

Algorithm
Mutual Information and Diverse Decoding Improve Neural Machine Translation

Abstract: Sequence-to-sequence neural translation models learn semantic and syntactic relations between sentence pairs by optimizing the likelihood of the target...

Paper Review
DIALOGPT: Large-Scale Generative Pre-training for Conversational Response Generation

Abstract: We present a large, tunable neural conversational response generation model, DIALOGPT (dialogue generative pre-trained transformer). Trained on...

Paper Review   NLP Model
GPT-2 and GPT-3: Towards a More General Language Model

About GPT I was never very interested in text generation, so all I really knew about GPT was that it is an autoregressive model used for generation. But recently...

Paper Review
Beyond English-Centric Multilingual Machine Translation

Abstract: Existing work in translation demonstrated the potential of massively multilingual machine translation by training a single model able to...

Guides
Mastering HuggingFace

Anyone studying or doing research in NLP these days has surely used, or at least heard of, HuggingFace. From BERT to GPT, most of the models that appear in NLP papers...

Paper Review   NLP Model
Attention is All You Need

Abstract: The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The...

Paper Review
Reformer: The Efficient Transformer

Abstract: Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can be prohibitively...

Paper Review   Algorithm
LexRank: Graph-based Lexical Centrality as Salience in Text Summarization

Abstract: We introduce a stochastic graph-based method for computing relative importance of textual units for Natural Language Processing. We test...

Paper Review
MAD-X: An Adapter-Based Framework for Multi-Task Cross-Lingual Transfer

Abstract: The main goal behind state-of-the-art pre-trained multilingual models such as multilingual BERT and XLM-R is enabling and bootstrapping...

Guides
Handy Python Tips and Tricks for NLP

Today I want to go over basic Python functions that are good to know when working on NLP tasks. These are functions I actually use a lot in my own assignments and experiments...

Paper Review   NLP Model
Parameter-Efficient Transfer Learning for NLP

Abstract: Fine-tuning large pre-trained models is an effective transfer mechanism in NLP. However, in the presence of many downstream tasks,...

Paper Review
Neural Machine Translation with Byte-Level Subwords

Abstract: Almost all existing machine translation models are built on top of character-based vocabularies: characters, subwords or words. Rare characters...

Paper Review
SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

Abstract: This paper describes SentencePiece, a language-independent subword tokenizer and detokenizer designed for Neural-based text processing, including Neural Machine Translation....

Guides
Korean NLP Preprocessing Module | Building a Korean Preprocessing Module

As I am reminded every time, preprocessing is really labor-intensive work. But it is also extremely important, because model performance varies enormously with the quality of the training data...

Guides
Korean-English Parallel Corpora | How to Download and Use OPUS Corpora

Parallel Corpus To train multilingual NLP models for cross-lingual tasks, you need parallel corpora. Usually a parallel corpus consists of...

Guides
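Korean Corpus List and Preprocessing Preparation

Korean Corpora The performance of a pre-trained model depends entirely on the quantity and quality of its data. Korean does not have as many corpora as English, but the number is steadily growing. Below is what I...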

Paper Review   NLP Model
RoBERTa: A Robustly Optimized BERT Pretraining Approach

Abstract: We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many...

Guides
Creating a GitHub Pages Blog

As someone who is not very familiar with GitHub, I struggled a bit after deciding to build a GitHub Pages blog. In case I ever make another blog later...

Paper Review
Cross-Lingual Alignment vs. Joint Training: A Comparative Study and a Simple Unified Framework

Abstract: Learning multilingual representations of text has proven a successful method for many cross-lingual transfer learning tasks. There are two...

Paper Review
Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation

Abstract: We present an easy and efficient method to extend existing sentence embedding models to new languages. This allows to...

Paper Review
Adversarial NLI: A New Benchmark for Natural La...

Abstract: We introduce a new large-scale NLI benchmark dataset, collected via an iterative, adversarial human-and-model-in-the-loop procedure. We show that training...

Paper Review
Cross-Lingual Ability of Multilingual BERT: An Empirical Study

Abstract: Recent work has exhibited the surprising cross-lingual abilities of multilingual BERT (M-BERT) – surprising since it is trained without...

Guides
Markdown Cheatsheet

Markdown is a way to style text on the web. You control the display of the document; formatting words as...