Chinese Abstract |
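Current portal sites offer aggregated news services, but because the content is updated frequently and compiled from many news websites, users face information overload and find it hard to judge the importance of news documents. News documents have two characteristics: (1) they describe news events; (2) they may refer to different points in time. This study therefore considers two features of news documents: (1) the time a document was generated; (2) the importance of different events. First, we use latent semantic analysis (LSA), which filters noise from the documents through dimensionality reduction, surfaces the latent semantics, and avoids the problems of synonymy and polysemy. Next, we use a genetic algorithm (GA), which considers multiple points in the search space at once rather than a single point and can therefore reach the overall best solution in the region more quickly. The end goal is to generate query term suggestions for users' news searches. The experimental results show that adding the importance and time features to the original LSA matrix does yield better performance than the original LSA matrix alone. Adding the GA, however, did not improve the results further, because its randomized search over a wide space turns up some less relevant terms. |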
English Abstract |
Many portal sites provide integrated news content. However, users suffer from information overload because the news articles are updated frequently and aggregated from different news sources. News articles have two notable properties: (a) they describe news events; (b) they may refer to different times of those events. In this thesis, we consider two features of news articles: (i) the time at which an article was generated; (ii) the importance of different terms. We first use Latent Semantic Analysis (LSA) to reduce noise in the news articles and to expose the latent semantics of terms and articles, which addresses the problems of synonymy and polysemy. We then use a Genetic Algorithm (GA), which explores many candidate solutions simultaneously, in order to quickly find a locally optimal solution. The goal is to suggest query terms for users' news searches. According to the experimental results, the LSA matrix augmented with the time and importance features performs better than the original LSA matrix. However, adding the GA did not yield better results, because its randomized search over a wide space may lead to some unrelated terms. |
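The GA step described above can be illustrated with a minimal sketch. The abstract does not specify the encoding, fitness function, or genetic operators, so everything here is an assumption made for illustration: the function name `genetic_select`, the per-term relevance `scores` (imagined as coming from LSA similarity to a query), the union-based crossover, and the truncation selection. It shows only the general idea of evolving a population of candidate term sets in parallel, and also why random mutation can pull weakly related terms into a suggestion.

```python
import random


def genetic_select(scores, k, pop_size=30, generations=60,
                   mutation_rate=0.1, seed=0):
    """Pick k term indices with a high total score using a simple GA.

    All parameters and operators are hypothetical; the thesis does not
    describe its actual GA configuration.
    """
    rng = random.Random(seed)  # seeded so runs are repeatable
    n = len(scores)

    def fitness(individual):
        # Assumed fitness: sum of the relevance scores of the chosen terms.
        return sum(scores[i] for i in individual)

    def random_individual():
        return tuple(sorted(rng.sample(range(n), k)))

    def crossover(a, b):
        # Child draws k distinct terms from the union of both parents.
        pool = sorted(set(a) | set(b))
        return tuple(sorted(rng.sample(pool, k)))

    def mutate(individual):
        # Occasionally swap one chosen term for a random unchosen one.
        if rng.random() >= mutation_rate:
            return individual
        outside = [i for i in range(n) if i not in individual]
        if not outside:
            return individual
        genes = list(individual)
        genes[rng.randrange(k)] = rng.choice(outside)
        return tuple(sorted(genes))

    # The population holds many candidate solutions at once.
    population = [random_individual() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]  # keep the fitter half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b)))
        population = parents + children
    return max(population, key=fitness)


# Usage with made-up terms and scores.
terms = ["election", "vote", "ballot", "weather", "recipe", "candidate"]
scores = [0.9, 0.8, 0.7, 0.1, 0.05, 0.85]
best = genetic_select(scores, k=3)
print([terms[i] for i in best])
```

Because the fitter half of each generation is carried over unchanged, the best fitness never decreases, but nothing guarantees that the global optimum is reached within a fixed number of generations.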