Chinese Abstract |
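The rapid growth of the Internet has ushered in the era of big data, and automatic summarization has consequently become a popular research topic in recent years. Extractive automatic summarization selects, according to a predefined summarization ratio, important sentences from text documents or spoken documents that represent the gist or topic of the original document. Extractive summarization can be viewed as an information retrieval (IR) problem; in related work, methods that use language modeling to select important sentences have shown promising preliminary results on both text and spoken document summarization. This thesis continues that line of research and makes three main contributions. First, because the notion of relevance information has already been developed successfully in the IR field, this thesis incorporates relevance information to re-estimate the sentence language models and also tries a tri-mixture model (TriMM), with the aim of describing the semantic content of each sentence more precisely and thereby improving summarization performance. Second, beyond language models, this thesis investigates the effectiveness of probabilistic retrieval models for spoken document summarization. Finally, this thesis examines how different language model smoothing techniques affect spoken document summarization. The spoken document summarization experiments use the Public Television Service broadcast news corpus (MATBN); the results show that, compared with other existing unsupervised summarization methods, the novel summarization methods we apply yield clear performance improvements. |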
English Abstract |
Due to the rapid development of the Internet and the arrival of the big data era, automatic summarization has emerged as a popular research topic. The aim of automatic summarization is to select important text or spoken sentences that represent the topic (theme) of the original text or spoken document according to a predefined summarization ratio. In this study, we frame the automatic summarization task as an ad hoc information retrieval (IR) problem and employ the mathematically sound language modeling (LM) framework for extractive speech summarization, which can perform important sentence selection in an unsupervised manner and has shown preliminary success. The main contribution of this paper is three-fold. First, by virtue of relevance modeling, we explore several effective sentence modeling formulations to enhance the sentence models involved in the LM-based summarization framework, and we make the first use of the tri-mixture model to improve the performance of extractive speech summarization. Second, since language modeling suffers from the data sparseness problem, for which the common solution is to adopt smoothing techniques, we investigate three different smoothing approaches to evaluate how they influence summarization performance. Third, we further apply the well-studied ranking model BM25 and its variants from the IR community to rank important sentences in extractive speech summarization. Experiments conducted on a publicly available dataset (MATBN) show that the methods we apply achieve effective summarization performance when compared with other well-practiced and state-of-the-art unsupervised methods. |
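The LM-based sentence selection described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation. It assumes a unigram sentence model, Jelinek-Mercer (linear interpolation) smoothing, and the document itself as the background model; the abstract does not say which three smoothing techniques were studied. The function names `rank_sentences` and `summarize` and the parameters `lam` and `ratio` are hypothetical.

```python
import math
from collections import Counter


def rank_sentences(sentences, lam=0.5):
    """Rank tokenized sentences by the document likelihood log P(D | S).

    Assumption: each sentence S is a unigram model smoothed with a
    background model estimated from the whole document D
    (Jelinek-Mercer interpolation with weight `lam`).
    Returns sentence indices, best first.
    """
    doc_counts = Counter(w for sent in sentences for w in sent)
    doc_len = sum(doc_counts.values())
    scored = []
    for idx, sent in enumerate(sentences):
        sent_counts = Counter(sent)
        sent_len = len(sent)
        log_lik = 0.0
        for word, count in doc_counts.items():
            p_sent = sent_counts[word] / sent_len if sent_len else 0.0
            p_bg = count / doc_len  # background probability, always > 0
            log_lik += count * math.log(lam * p_sent + (1.0 - lam) * p_bg)
        scored.append((log_lik, idx))
    # Higher likelihood first; ties broken by original position.
    return [idx for _, idx in sorted(scored, key=lambda t: (-t[0], t[1]))]


def summarize(sentences, ratio=0.1, lam=0.5):
    """Select top-ranked sentences until the word budget (ratio) is met."""
    budget = ratio * sum(len(sent) for sent in sentences)
    chosen, used = [], 0
    for idx in rank_sentences(sentences, lam):
        if used >= budget:
            break
        chosen.append(idx)
        used += len(sentences[idx])
    return sorted(chosen)  # restore document order


if __name__ == "__main__":
    doc = [
        ["stocks", "fell", "today"],
        ["stocks", "markets", "fell", "sharply"],
        ["weather", "sunny"],
    ]
    print(rank_sentences(doc))
    print(summarize(doc, ratio=0.4))
```

In practice the background model would more plausibly be estimated from a large external corpus, and the relevance-based or tri-mixture sentence models mentioned above would replace the plain maximum-likelihood estimate `p_sent`.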