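Chinese Abstract |
This study compares three data mining algorithms (artificial neural network, logistic regression, and decision tree) on how well they predict five-year survival in cervical cancer when the models are trained on samples from different diagnosis years, and it validates their external generalization. The study uses the Cancer Incidence Public-use Database (CIPUD) from the Surveillance, Epidemiology, and End Results (SEER) program of the U.S. National Cancer Institute (NCI). From the years 1973 to 2000, 156,502 records with 72 variables were selected. After data cleaning, the 18 variables most relevant to predicting five-year cervical cancer survival were retained, along with 2,022 records of cervical cancer diagnosed in 1988-1996. The sample was divided by diagnosis year into 8 different sets of training and test samples, which were fed into the three algorithms, artificial neural network, decision tree, and logistic regression, to build models. AUC (area under the ROC curve) and accuracy were used to evaluate predictive ability and to identify model designs that yield good predictions. The results show that, in internal validation, the model with the best predictive power was artificial neural network model 1, with an AUC of 0.9392 and an accuracy of 0.9474. In external validation, artificial neural network model 7 had the best AUC, 0.6455. In internal validation, the artificial neural network and the decision tree both outperformed logistic regression on AUC and accuracy. In external validation, the artificial neural network and logistic regression both outperformed the decision tree on AUC. Models built with the artificial neural network and logistic regression showed better external generalization, while models built with the artificial neural network and the decision tree showed better accuracy. To obtain better external validation results, the training sample can draw on data from the preceding 2-3 years or more. |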
English Abstract |
The purpose of this study was to compare the performance of an artificial neural network (ANN), a decision tree (C5), and logistic regression (LR) in predicting the 5-year survivability of cervical cancer, and to validate their external generalization. The data were drawn from the SEER (Surveillance, Epidemiology, and End Results) program of the NCI (National Cancer Institute) in the United States, covering the years 1973 to 2000. There were 156,502 cases with 72 variables. After data cleaning, 2,022 cases diagnosed during the years 1988 to 1996 and 18 variables remained. The dataset was divided into 8 groups of training sets and test sets according to the year in which the patients were diagnosed. The 8 training sets were used with three algorithms, 1) ANN, 2) C5, and 3) LR, to build the models. Model performance in predicting the 5-year survivability of cervical cancer patients was measured by accuracy and AUC (area under the ROC curve). ANN had the best internal validation results for AUC and accuracy (AUC, 0.9392; accuracy, 0.9474) on model 1 and the best external validation AUC (0.6455) on model 7. ANN and C5 outperformed LR in internal validation. ANN and LR both performed better than C5 on AUC in external validation. Overall, ANN and LR generalized better externally, while ANN and C5 classified more accurately. |
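
The evaluation described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's implementation: the record fields (`year`, `score`, `survived`), the helper names, the 0.5 threshold, and the toy data are all hypothetical, and treating external validation as testing on diagnosis years that were not used for training is an assumption about the design.

```python
from typing import List, Mapping, Sequence, Set, Tuple

# Hypothetical record layout: diagnosis year, a model's predicted survival
# probability, and the observed 5-year survival (1 = survived, 0 = did not).
Record = Mapping[str, float]


def split_by_year(
    records: Sequence[Record], train_years: Set[int], test_years: Set[int]
) -> Tuple[List[Record], List[Record]]:
    """Split records into training and test sets by diagnosis year."""
    train = [r for r in records if r["year"] in train_years]
    test = [r for r in records if r["year"] in test_years]
    return train, test


def accuracy(
    labels: Sequence[int], scores: Sequence[float], threshold: float = 0.5
) -> float:
    """Fraction of cases whose thresholded score matches the true label."""
    hits = sum(int(s >= threshold) == y for y, s in zip(labels, scores))
    return hits / len(labels)


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a randomly chosen positive case is scored
    above a randomly chosen negative case (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Made-up toy records, for illustration only (not SEER data).
    toy = [
        {"year": 1988, "score": 0.9, "survived": 1},
        {"year": 1989, "score": 0.2, "survived": 0},
        {"year": 1990, "score": 0.7, "survived": 1},
        {"year": 1990, "score": 0.6, "survived": 0},
        {"year": 1990, "score": 0.3, "survived": 0},
    ]
    # Assumed design: train on earlier years, test on a later, unseen year.
    train, test = split_by_year(toy, {1988, 1989}, {1990})
    labels = [int(r["survived"]) for r in test]
    scores = [r["score"] for r in test]
    print("AUC:", auc(labels, scores), "accuracy:", accuracy(labels, scores))
```

The pairwise AUC is quadratic in the number of cases, which is acceptable for a sample of about two thousand records; a rank-based computation would be the usual choice for larger data.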