英文摘要 |
To better understand and utilize lexical and syntactic mapping between various language expressions, it is often first necessary to perform sentence alignment on the provided data. Up until now, the character trigram overlapping ratio was considered to be the best similarity measure on the text simplification corpus. In this paper, we aim to show that a newer embedding-based similarity metric will be preferable to the traditional SOTA metric on the paragraph-paraphrased corpus. We report a series of experiments designed to compare different alignment search strategies as well as various embedding- and non-embedding-based sentence similarity metrics in the paraphrased sentence alignment task. Additionally, we explore the problem of aligning and extracting sentences with imposed restrictions, such as controlling sentence complexity. For evaluation, we use paragraph pairs sampled from the Webis-CPC-11 corpus containing paraphrased paragraphs. Our results indicate that modern embedding-based metrics such as those utilizing SentenceBERT or BERTScore significantly outperform the character trigram overlapping ratio in the sentence alignment task in the paragraph-paraphrased corpus. |