English Abstract
Closed- and open-vocabulary analysis are two approaches to computerized text analysis. In closed-vocabulary analysis, the frequencies of those words in a text that are listed in a psycholinguistic dictionary are counted. Psychologists have indicated that the frequencies of such dictionary words reflect an author's mental state. In open-vocabulary analysis, machine learning or artificial intelligence (AI) is used to extract textual features. AI language models can generate semantic representation vectors for words. According to previous studies, the features extracted in open-vocabulary analysis can accurately predict the authors' sex and age, among other attributes. However, this finding requires reexamination because flaws in these studies may have been overlooked. First, the demographic variables may be signaled by clusters of specific words that are not listed in the dictionaries, which would favor open-vocabulary analysis. Second, none of these studies involved an AI language model. AI language models are developed for general natural language processing and are therefore well suited to vocabulary analysis. Third, the performance of closed- and open-vocabulary analysis has been evaluated using linear regression, with the extracted features as predictors and the demographic variables as dependent variables; however, a linear relationship between the linguistic features and these dependent variables has not been established. In this study, we compared closed- and open-vocabulary analysis. For closed-vocabulary analysis, three dictionaries were tested, and for open-vocabulary analysis, an AI language model called BERT was tested. In the closed-vocabulary analysis, the words listed in the dictionaries were used as linguistic features, whereas in the open-vocabulary analysis, the tokenized representation vectors of the words in a text were used as linguistic features. These linguistic features were fed into a three-layered neural network to capture both linear and nonlinear relationships between the independent (input) and dependent (output) variables. The output nodes of this model were then used to classify the sentiment of the text (i.e., positive or negative). According to the results, BERT outperformed the three dictionaries in predicting the sentiment of the texts, a finding consistent with those of previous studies. This superior performance was presumably because the linguistic features generated by BERT represent both the meanings of words and their relationships with neighboring words (i.e., their context).
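The following is a minimal sketch, not the authors' code, contrasting the two pipelines described above: dictionary-category frequencies for closed-vocabulary analysis, and BERT token vectors fed into a three-layer network for open-vocabulary sentiment classification. The toy dictionary, the "bert-base-uncased" checkpoint, the mean-pooling step, and the layer sizes are illustrative assumptions rather than details taken from the study.

```python
# Illustrative sketch only; dictionary, model checkpoint, and layer sizes are assumptions.
from collections import Counter

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

# --- Closed-vocabulary analysis: frequencies of dictionary words/categories ---
# A toy lexicon stands in for a real psycholinguistic dictionary (e.g., LIWC-style).
TOY_DICTIONARY = {"happy": "positive", "glad": "positive", "sad": "negative", "angry": "negative"}

def closed_vocabulary_features(text: str) -> list[float]:
    """Relative frequency of each dictionary category in the text."""
    tokens = text.lower().split()
    counts = Counter(TOY_DICTIONARY[t] for t in tokens if t in TOY_DICTIONARY)
    categories = sorted(set(TOY_DICTIONARY.values()))
    return [counts[c] / max(len(tokens), 1) for c in categories]

# --- Open-vocabulary analysis: BERT token vectors into a small classifier ---
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

class ThreeLayerClassifier(nn.Module):
    """Three-layer feed-forward network over mean-pooled BERT token vectors."""
    def __init__(self, hidden_size: int = 768, num_classes: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, 256), nn.ReLU(),  # layer 1: nonlinear mapping
            nn.Linear(256, 64), nn.ReLU(),           # layer 2
            nn.Linear(64, num_classes),              # layer 3: sentiment logits
        )

    def forward(self, texts: list[str]) -> torch.Tensor:
        encoded = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            token_vectors = bert(**encoded).last_hidden_state  # (batch, tokens, 768)
        # Mean-pool the token representation vectors into one feature vector per text.
        pooled = token_vectors.mean(dim=1)
        return self.net(pooled)

print(closed_vocabulary_features("I am happy but also sad"))   # e.g., [0.166, 0.166]
model = ThreeLayerClassifier()
print(model(["I am happy but also sad"]).shape)                # torch.Size([1, 2]): positive/negative logits
```

In this sketch the closed-vocabulary features depend only on whether a word appears in the dictionary, whereas the BERT-based features are contextual token vectors, which illustrates why the latter can encode relationships with neighboring words.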