A System of Deception and Fraud Detection Using Reliable Linguistic Cues Including Hedging, Disfluencies, and Repeated Phrases

Sean Humpherys
Assistant Professor, West Texas A&M
Given the increasing problem of fraud, crime, and national security threats, assessing credibility is a recurring research topic in Information Systems and in other disciplines. Decision support systems can help, but the success of such a system depends on reliable cues that distinguish deceptive from truthful behavior and on a proven classification algorithm. This investigation aims to identify linguistic cues that distinguish deceivers from truthtellers and to demonstrate how those cues can successfully classify deception and truth. Three new datasets were gathered: 202 fraudulent and nonfraudulent financial disclosures (10-Ks); a laboratory experiment in which participants were asked twelve questions and answered some deceptively and others truthfully (Cultural Interviews); and a mock crime experiment in which some participants stole a ring from an office and all participants were interviewed as to their guilt or innocence (Mock Crime). Transcribed participant responses were investigated for distinguishing cues and used for classification testing. Disfluencies (e.g., um, uh, repeated phrases), hedging words (e.g., perhaps, may), and interjections (e.g., okay, like) are theoretically developed as potential cues to deception. Past research provides conflicting evidence regarding disfluency use and deception. Some researchers opine that deception increases cognitive load, which lowers attentional resources, which increases speech errors and thereby increases disfluency use (i.e., Cognitive-Load Disfluency theory). Other researchers argue against the causal link between disfluencies and speech errors, positing that disfluencies are controllable and that deceivers strategically avoid them so as not to appear hesitant or untruthful (i.e., Suppression-Disfluency theory). A series of t-tests, repeated measures GLMs, and nested-model design regressions disconfirm the Suppression-Disfluency theory.
Um, uh, and interjections are used at an increased rate by deceivers in spontaneous speech. Reverse order questioning did not increase disfluency use. Fraudulent 10-Ks have a higher mean count of hedging words. Statistical classifiers and machine learning algorithms are demonstrated on the three datasets. Logistic regression with feature reduction by backward Wald stepwise selection had the highest classification accuracies (69%-87%). Accuracies are compared to those of professional interviewers and of previously researched classification models. In many cases the new models demonstrated improvements. 10-Ks are classified with 69% overall accuracy.
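
As a rough illustration of how cue counts like those described above might be extracted from a transcript, the following minimal sketch computes the rate of each cue category per 100 words and counts immediately repeated phrases. The word lists contain only the examples named in this abstract, and the function names (tokenize, cue_rates, repeated_phrases) are hypothetical; none of this is the study's actual lexicon or feature-extraction procedure.

```python
import re

# Illustrative cue lexicons built only from the examples in the abstract.
# These are assumptions for the sketch, not the study's actual word lists.
CUES = {
    "disfluency": {"um", "uh"},
    "hedge": {"perhaps", "may"},
    "interjection": {"okay", "like"},
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def cue_rates(text):
    """Return the rate of each cue category per 100 words of the text."""
    tokens = tokenize(text)
    total = len(tokens)
    rates = {}
    for name, lexicon in CUES.items():
        count = sum(1 for token in tokens if token in lexicon)
        rates[name] = 100.0 * count / total if total else 0.0
    return rates


def repeated_phrases(text, n=2):
    """Count immediate repetitions of an n-word phrase, e.g. 'I was I was'."""
    tokens = tokenize(text)
    count = 0
    for i in range(len(tokens) - 2 * n + 1):
        if tokens[i:i + n] == tokens[i + n:i + 2 * n]:
            count += 1
    return count


if __name__ == "__main__":
    sample = "Um, I was I was, uh, perhaps at home, okay"
    print(cue_rates(sample))
    print(repeated_phrases(sample))
```

Rates are normalized by response length so that longer answers are not counted as more disfluent simply because they contain more words. Note that naive word matching treats every occurrence of "like" or "may" as a cue regardless of its grammatical role, so a real system would need some form of disambiguation.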