| English Abstract |
According to the 2024 leading causes of death among Taiwanese citizens, announced by the Ministry of Health and Welfare on June 16, 2025, malignant neoplasms have consistently ranked first among the ten leading causes of death, with lung cancer remaining the leading cause of cancer mortality (19.4%). Because lung cancer remains one of the deadliest cancers worldwide, prevention and early diagnosis are increasingly important. However, lung cancer etiology is multifactorial, involving demographic characteristics, lifestyle behaviors, comorbidities, and environmental exposures, and no single indicator currently describes its risk formation mechanisms comprehensively and accurately. This study investigates the associations between factors related to early-stage lung cancer patients and model predictive performance, and evaluates the potential of several machine learning methods for early lung cancer risk assessment. Data were drawn from the cancer registry systems of Hsinchu Cathay General Hospital, Taipei Cathay General Hospital, and Sijhih Cathay General Hospital for 2010 to 2023, focusing on confirmed early-stage lung cancer cases. The variables analyzed comprised demographic data and cancer-related clinical features. Five machine learning models were constructed and compared: Logistic Regression, Decision Tree, Random Forest, Extreme Gradient Boosting (XGBoost), and the hybrid model proposed in this study (Monotonic XGBoost). Model performance was evaluated using accuracy, recall, F1-score, ROC-AUC, and confusion matrices, and SHAP (SHapley Additive exPlanations) was used to analyze relative feature importance and improve model interpretability.
The results indicated that, under the data framework and variable settings of this study, the Hybrid (Monotonic XGBoost) model showed the best overall predictive performance, achieving a ROC-AUC of 0.819, a recall of 0.872, and a low number of false negatives (FN = 10). This suggests potential clinical value in reducing missed diagnoses and is consistent with the high-sensitivity requirements of clinical screening. In the feature importance analysis, SHAP results showed that the EGFR-ALK gene interaction term contributed substantially to model predictions in this dataset. This finding reflects how the model uses the molecular-level information available within existing clinical data structures to improve classification performance; it does not imply a definitive or universal causal role of genetic factors in early lung cancer. Because genetic testing is not performed uniformly in clinical practice and depends on testing timing and healthcare workflows, the importance of the gene variables reported here should be interpreted as a data-driven predictive association, not as a direct basis for lung cancer risk assessment in the general population. In contrast, demographic variables such as age and education level showed consistent auxiliary predictive effects across models, underscoring the role of non-molecular factors in risk stratification. In summary, the Hybrid (Monotonic XGBoost) model achieved an F1-score of 0.814, showed robust overall classification performance, and was the best-performing of the models compared for early lung cancer risk assessment. This study seeks to establish a high-sensitivity early warning model to support clinical decision-making and promote early diagnosis. |
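
The evaluation metrics named in the abstract (accuracy, recall, F1-score) all follow from the four confusion-matrix counts. The sketch below illustrates that relationship in Python. The helper name `classification_metrics` and the counts passed to it are hypothetical and chosen only for illustration; they are not the study's data or code.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts.

    tp, fp, fn, tn: true positives, false positives,
    false negatives, and true negatives.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall (sensitivity): the share of actual positives the model catches.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts for illustration only (not taken from the study).
metrics = classification_metrics(tp=40, fp=10, fn=10, tn=40)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because recall depends only on true positives and false negatives, each additional false negative lowers it directly, which is why a screening-oriented model is judged on recall and the false-negative count in addition to accuracy.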