Abstract
To build a retrieval-based dialogue system, we can exploit conversation logs to extract question-answer pairs. However, these question-answer pairs are hidden in the conversation log, interleaved with one another. The task of separating different sub-topics from such interspersed messages is called conversation disentanglement. In this paper, we examine the task of judging whether two Reddit messages belong to the same topic dialogue, and we find that performance degrades when the training and testing data are split by time. In practice, this is a very hard task even for humans, since only the two messages are given and there is no surrounding context. However, if the goal is instead to predict whether one message is a reply to the other, the judgment becomes much easier. By changing the way the data are prepared, we achieve better performance with DA-LSTM (Dual Attention LSTM) and BERT-based models on the newly defined Reply prediction task.