DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation


Abstract: We present a large, tunable neural conversational response generation model, DialoGPT (dialogue generative pre-trained transformer). Trained on 147M conversation-like exchanges extracted from Reddit comment chains over a period spanning from 2005 through 2017, DialoGPT extends the Hugging Face PyTorch transformer to attain a performance close to human both in terms of automatic and human evaluation in single-turn dialogue settings. We show that conversational systems that leverage DialoGPT generate more relevant, contentful and context-consistent responses than strong baseline systems. The pre-trained model and training pipeline are publicly released to facilitate research into neural response generation and the development of more intelligent open-domain dialogue systems.

Original Paper

Authors: Zhang et al., 2019 (Microsoft)

Paper Summary

Introduction

  • GPT-2 demonstrated that transformer models trained on very large datasets can capture long-term dependencies in textual data and generate text that is fluent, lexically diverse, and rich in content.
  • DialoGPT extends GPT-2 to address the challenges of conversational neural response generation.
  • Challenges in modelling conversations: conversation involves the possibly competing goals of two participants and is intrinsically more diverse in the range of potential responses; it therefore poses a greater one-to-many problem than other text generation tasks such as neural machine translation, text summarization, and paraphrasing.
  • Problems with open-domain neural response generation systems: content or style inconsistency (Li et al., 2016b; Zhang et al., 2019; Gao et al., 2019c), lack of long-term contextual information (Serban et al., 2017), and blandness (Li et al., 2016a; Zhang et al., 2018; Qin et al., 2019).

  • DialoGPT: an autoregressive (AR) language model that uses a multi-layer transformer as its model architecture
    • Trained on large-scale dialogue pairs/sessions extracted from Reddit discussion chains
    • Assumption: the model can capture the joint distribution of $P(Target, Source)$ in conversational flow with finer granularity
    • Evaluation: benchmark dataset (DSTC-7) & new reference test dataset extracted from Reddit

Dataset

  • The dataset is extracted from comment chains scraped from Reddit spanning 2005 through 2017.
  • We filter the data by removing instances where (a few of these filters are sketched in code at the end of this section):
    1. there is a URL in the source or target,
    2. the target contains word repetitions of at least three words,
    3. the response does not contain at least one of the top-50 most frequent English words (e.g., “the”, “of”, “a”), since this likely indicates it is not an English sentence,
    4. the response contains special markers such as “[” or “]”, as this could be markup language,
    5. source and target sequences together are longer than 200 words, or
    6. the target contains offensive language, identified by phrase matching against a large blocklist.
  • We also excluded a large number of subreddits that had been identified as likely to contain offensive content.
  • In addition, we aggressively filtered out blandness, e.g., removing instances where the responses contained 90% of tri-grams that have been seen more than 1000 times.
  • After filtering, the dataset comprises 147,116,725 dialogue instances, in total 1.8 billion words.
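A minimal sketch of how a few of these filtering heuristics could look in code. The function name `keep_pair`, the regexes, and the `TOP50_WORDS` set are illustrative assumptions, not the paper's actual pipeline:

```python
import re

# Illustrative stand-in for the top-50 most frequent English words.
TOP50_WORDS = {"the", "of", "a", "and", "to", "in", "is", "you", "that", "it"}

def keep_pair(source: str, target: str) -> bool:
    """Return True if a (source, target) pair passes the filters sketched above."""
    text = f"{source} {target}"
    # 1. No URLs in source or target.
    if re.search(r"https?://|www\.", text):
        return False
    # 2. No word repeated three or more times in a row in the target.
    if re.search(r"\b(\w+)( \1){2,}\b", target):
        return False
    # 3. The response must contain at least one very frequent English word.
    if not TOP50_WORDS & set(target.lower().split()):
        return False
    # 4. No markup-like special markers in the response.
    if "[" in target or "]" in target:
        return False
    # 5. Source and target together must be at most 200 words.
    if len(text.split()) > 200:
        return False
    return True

print(keep_pair("Does money buy happiness?", "Depends how you spend it."))  # True
```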

Method

Model Architecture

  • based on GPT-2 architecture, which adopts the generic transformer language model (Vaswani et al., 2017) and leverages a stack of masked multi-head self-attention layers to train on massive web-text data.
  • GPT-2 is able to characterize human language data distributions at a fine-grained level, presumably due to its large model capacity and superior efficiency
  • GPT-2: a 12-to-48 layer transformer with layer normalization, a modified initialization scheme that accounts for model depth, and byte-pair encoding for the tokenizer
  • Use GPT-2 to model a multi-turn dialogue session as a long text and frame the generation task as language modeling (a minimal sketch follows the numbered steps)
    1. Concatenate all dialogue turns within a dialogue session into a long text $x_1, \dots, x_N$ (where $N$ is the sequence length), terminated by an end-of-text token
    2. Denote the source sentence (dialogue history) as $S = x_1, \dots, x_m$ and the target sentence (ground-truth response) as $T = x_{m+1}, \dots, x_N$
    3. The conditional probability $P(T \mid S)$ can be written as the product of a series of conditional probabilities: $p(T \mid S) = \prod_{n=m+1}^{N} p(x_n \mid x_1, \dots, x_{n-1})$
    4. The conditional probability of a multi-turn dialogue session $T_1, \dots, T_K$ can be written as $p(T_2, \dots, T_K \mid T_1) = \prod_{i=2}^{K} p(T_i \mid T_1, \dots, T_{i-1})$
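A minimal sketch of this formulation using the Hugging Face `transformers` library. The example turns are made up; masking the source positions with `-100` (which PyTorch's cross-entropy ignores) makes the loss correspond to $-\log p(T \mid S)$:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Flatten a (made-up) dialogue session into one long text, with turns
# separated and terminated by the end-of-text token.
turns = ["Does money buy happiness?", "Depends how much money you spend on it."]
text = tokenizer.eos_token.join(turns) + tokenizer.eos_token
input_ids = tokenizer(text, return_tensors="pt").input_ids

# Compute the LM loss only on the target tokens: positions belonging to the
# source S are set to -100, which the cross-entropy loss ignores.
source_len = len(tokenizer(turns[0] + tokenizer.eos_token).input_ids)
labels = input_ids.clone()
labels[:, :source_len] = -100

loss = model(input_ids, labels=labels).loss  # -log p(T | S), averaged over T
print(float(loss))
```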

Mutual Information Maximization

  • Open-domain text generation models are notorious for generating bland, uninformative samples. To address this problem, we implement a maximum mutual information (MMI) scoring function (Li et al., 2016a; Zhang et al., 2018)
  • MMI employs a pre-trained backward model to predict source sentences from given responses, i.e., $P(Source \mid Target)$.
  • We first generate a set of hypotheses using top-K sampling. Then we use the probability of $P(Source \mid Hypothesis)$ to rerank all hypotheses.
  • Intuitively, maximizing backward model likelihood penalizes the bland hypotheses, as frequent and repetitive hypotheses can be associated with many possible queries, thus yielding a lower probability for any specific query.
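A minimal sketch of MMI re-ranking under stated assumptions: `forward` and `backward` stand in for the DialoGPT forward model and a backward model fine-tuned on reversed (response, source) pairs; here both are loaded as plain GPT-2 just to make the snippet runnable:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
forward = GPT2LMHeadModel.from_pretrained("gpt2")
backward = GPT2LMHeadModel.from_pretrained("gpt2")  # stand-in for the reversed-pair model

source = "Does money buy happiness?" + tokenizer.eos_token
src_ids = tokenizer(source, return_tensors="pt").input_ids

# 1) Generate a set of hypotheses with top-K sampling (K = 10).
outputs = forward.generate(
    src_ids,
    do_sample=True,
    top_k=10,
    num_return_sequences=16,
    max_new_tokens=30,
    pad_token_id=tokenizer.eos_token_id,
)

def backward_loss(hyp_ids: torch.Tensor) -> float:
    """-log P(Source | Hypothesis) under the backward model."""
    ids = torch.cat([hyp_ids, src_ids], dim=-1)
    labels = ids.clone()
    labels[:, : hyp_ids.shape[-1]] = -100  # score only the source tokens
    return backward(ids, labels=labels).loss.item()

# 2) Re-rank: keep the hypothesis with the lowest backward-model loss.
hyps = [out[src_ids.shape[-1]:].unsqueeze(0) for out in outputs]
best = min(hyps, key=backward_loss)
print(tokenizer.decode(best[0], skip_special_tokens=True))
```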

Result

Experimental Details

  • Model Size: 117M, 345M, 762M
  • Vocab: 50,257
  • GPUs: 16 Nvidia V100s with NVLink
  • Noam learning rate scheduler with 16,000 warm-up steps (a minimal sketch of this schedule follows the list)
  • Learning rate selected based on validation loss
  • Up to 5 epochs for the small and medium models, 3 epochs for the large model
  • Speeding up training: compress the data into a lazy-loading database file and leverage separate asynchronous data processes to scale training
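The Noam schedule (from Vaswani et al., 2017) warms the learning rate up linearly and then decays it with the inverse square root of the step. A minimal sketch; `d_model` and the toy optimizer below are placeholder assumptions, only the 16,000 warm-up steps come from the paper:

```python
import torch

def noam_lambda(step: int, warmup: int = 16000, d_model: int = 1024) -> float:
    """Noam schedule: linear warm-up, then inverse-square-root decay."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# With base lr = 1.0, the lambda's value is the effective learning rate.
params = torch.nn.Linear(4, 4).parameters()
optimizer = torch.optim.Adam(params, lr=1.0)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lambda)
```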

DSTC-7 Dialogue Generation Challenge

  • The DSTC (Dialog System Technology Challenges) 7 track (Galley et al., 2019) is an end-to-end conversational modeling task, in which the goal is to generate conversation responses that go beyond chitchat by injecting information that is grounded in external knowledge.
  • There is no specific or predefined goal; instead, it targets human-like interactions where the underlying goal is often ill-defined or unknown in advance, of the kind seen in work and other productive environments
  • The DSTC-7 test data contains conversation threads from Reddit. We filtered conversations by number of responses, turn length, etc., yielding a 5-reference test set of 2,208 examples.
  • Evaluation metrics: BLEU, METEOR, NIST, Entropy, Dist-n (a minimal multi-reference scoring sketch follows)
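A minimal sketch of multi-reference scoring with NLTK; the token lists are made up, not DSTC-7 data, and NLTK's implementations may differ in detail from the challenge's official scripts:

```python
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.nist_score import sentence_nist

# Each hypothesis is scored against all references (DSTC-7 uses 5 per example).
references = [
    "money can buy a lot of happiness".split(),
    "it depends on how you spend it".split(),
]
hypothesis = "it depends how you spend it".split()

print(sentence_bleu(references, hypothesis))       # BLEU-4 by default
print(sentence_nist(references, hypothesis, n=4))  # NIST up to 4-grams
```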

Results

  • Scores for DialoGPT with 345M parameters are better across the board than with 117M parameters
  • Beam search (with beam width 10) dramatically improves BLEU and DIST scores, and marginally improves NIST and METEOR.
  • Presumably, the model learns background information during pre-training and is unhindered by the lack of a grounding document.
  • The automatic scores for DialoGPT are higher than those for humans. This does not mean that the generation is more “realistic” than human output, but is probably attributable to the one-to-many nature of conversation.

A New Reddit Multi-reference Dataset

  • We further evaluate DialoGPT on a multi-reference test set with 6K examples.
  • We test our method on two settings: training from scratch and fine-tuning using GPT-2 as the pre-trained model.
  • Comparing training from scratch with fine-tuning from the pre-trained GPT-2 model, the smaller models benefit more from starting with GPT-2.

Re-ranking the Response Using MMI

  • We generate 16 samples for each input source sentence using top-K sampling (K = 10) with the 345M model fine-tuned from the GPT-2 medium model.
  • This is followed by a re-ranking step using a backward model, which is also a 345M model fine-tuned from the GPT-2 medium model. The response that yields the lowest backward-model loss is evaluated.
  • MMI re-ranking produces more diverse responses with higher NIST, METEOR, Entropy, and Dist scores, but with a slight drop in BLEU.

Generation Examples

  • Interestingly, our model exhibits the ability to address commonsense questions to some extent, presumably owing to the rich amount of information that can be learned from Reddit data.
  • In some cases, instead of giving the “desired” answer, the system generates an alternative, reasonable answer.


Human Evaluation

  • We evaluated 2000 randomly sampled test sources from the Reddit 6K test dataset using crowd-sourcing.
  • Systems were paired, and each pair of system outputs was randomly presented to 3 judges, who ranked them for relevance, informativeness, and how human-like the generated responses are, using a 3-point Likert-like scale.
  • A strong preference can be observed for DialoGPT over PersonalityChat.
  • Table 7 also suggests that the “vanilla” DialoGPT medium model may already be close to human response quality.
  • Unexpectedly, we found that judges may prefer the MMI variant over human responses, probably because many of the true human responses are erratic or idiosyncratic, or are tied to internet memes that happened to be unfamiliar to the judges.

My Thoughts

  • Questions I had while reading the paper: Was the re-ranking (backward) model trained with the same settings and on the same data? Did the authors verify that the evaluation data does not overlap with the training data? Has the Reddit 6K dataset been released?
  • I will look into these questions further and update this post accordingly.