Toward Enhancing Automated Credibility Assessment: A Model for Question Type Classification and a Tool for Linguistic Analysis
Assistant Professor, Rutgers University
The three objectives of this dissertation were to develop a question type model for predicting linguistic features of responses to interview questions, create a tool for linguistic analysis of documents, and use lexical bundle analysis to identify linguistic differences between fraudulent and non-fraudulent financial reports. First, the Moffitt Question Type Model (MQTM) was developed to aid in predicting linguistic features of responses to questions. It focuses on three context-independent features of questions: tense (past vs. present vs. future), perspective (introspective vs. extrospective), and abstractness (concrete vs. conjectural). The MQTM was tested on the responses of guilty (n = 27) and innocent (n = 20) interviewees to real-world pre-polygraph examination questions. The responses were grouped according to question type, and the linguistic cues from each group's transcripts were compared using independent samples t-tests, with the following results: future tense questions elicited more future tense words than either past or present tense questions, and present tense questions elicited more present tense words than past tense questions; introspective questions elicited more cognitive process words and affective words than extrospective questions; and conjectural questions elicited more auxiliary verbs, tentativeness words, and cognitive process words than concrete questions. Second, a tool for linguistic analysis of text documents, Structured Programming for Linguistic Cue Extraction (SPLICE), was developed to help researchers and software developers compute linguistic values for dictionary-based cues and cues that require natural language processing techniques. SPLICE provides a GUI for researchers and an API for developers. Finally, an analysis of 560 lexical bundles detected linguistic differences between 101 fraudulent and 101 non-fraudulent 10-K filings.
Phrases such as "the fair value of" and "goodwill and other intangible assets" were used at a much higher rate in fraudulent 10-Ks. A principal component analysis reduced the number of variables to 88 orthogonal components, which were used in a discriminant analysis that classified the documents with 71% accuracy. The findings of this dissertation suggest that the MQTM could be used to predict features of interviewee responses in most contexts and that lexical bundle analysis is a viable tool for discriminating between fraudulent and non-fraudulent text.
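To make the lexical bundle step concrete, the following is a minimal sketch of how recurrent word sequences might be identified and turned into per-document rates. It is an illustration only and is not the procedure or code used in the dissertation: the four-word bundle length, the two-document dispersion threshold, the whitespace tokenizer, the sample sentences, and the function names (tokenize, ngrams, find_bundles, bundle_rates) are all assumptions made for this example.

```python
from collections import Counter


def tokenize(text):
    # Simplified tokenizer (an assumption): lowercase, split on whitespace,
    # and strip common punctuation from the edges of each word.
    words = (w.strip('.,;:!?()"\'') for w in text.lower().split())
    return [w for w in words if w]


def ngrams(tokens, n):
    # All contiguous n-word sequences, joined into single strings.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def find_bundles(docs, n=4, min_docs=2):
    # Keep n-grams that appear in at least `min_docs` different documents.
    # Both thresholds are illustrative, not values taken from the dissertation.
    doc_freq = Counter()
    for text in docs:
        doc_freq.update(set(ngrams(tokenize(text), n)))
    return sorted(b for b, k in doc_freq.items() if k >= min_docs)


def bundle_rates(text, bundles, n=4, per=1000):
    # Occurrences of each bundle per `per` words, so that documents of
    # different lengths can be compared on the same scale.
    tokens = tokenize(text)
    counts = Counter(ngrams(tokens, n))
    total = max(len(tokens), 1)
    return {b: counts[b] * per / total for b in bundles}


# Hypothetical sample sentences, invented for this sketch.
docs = [
    "The fair value of the asset was estimated.",
    "We measured the fair value of the notes.",
    "Revenue increased during the year.",
]
bundles = find_bundles(docs, n=4, min_docs=2)
rates = bundle_rates(docs[0], bundles)
```

Rates computed this way for each document could then serve as the input variables for a dimensionality reduction and classification step such as the principal component and discriminant analyses described above.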