Humans vs Machines in TOEFL Speaking Scoring
Modified on Mon, 17 Oct 2022 at 02:04 PM
When scoring TOEFL Speaking, how effective is the current human-machine assessment model? To what extent does a machine evaluate TOEFL Speaking performance phenomena the way a human does?
ETS uses human raters and SpeechRater to calculate TOEFL Speaking scores. While ETS does not disclose how human scores and SpeechRater scores are combined to generate TOEFL Speaking scores, we can have fun guessing. For example, it’s possible (maybe even likely) that human raters do more of the Topic Development scoring, while machines do more of the Delivery and Language Use scoring.
Why this matters
Many teachers use only the TOEFL Speaking rubrics to estimate holistic TOEFL Speaking scores for their students. When teachers do this, they ignore a chunk of important and knowable data: the dimension scores measured analytically by the ETS SpeechRater scoring engine.
Complementary strengths? Evaluation of a hybrid human-machine scoring approach for a test of oral academic English
Human raters and machine scoring systems potentially have complementary strengths in evaluating language ability; specifically, it has been suggested that automated systems might be used to make consistent measurements of specific linguistic phenomena, whilst humans evaluate more global aspects of performance. We report on an empirical study that explored the possibility of combining human and machine scores using responses from the speaking section of the TOEFL iBT® test. Human raters awarded scores for three sub-constructs: delivery, language use and topic development. The SpeechRaterSM automated scoring system produced scores for delivery and language use. Composite scores computed from three different combinations of human and automated analytic scores were equally or more reliable than human holistic scores, probably due to the inclusion of multiple observations in composite scores. However, composite scores calculated solely from human analytic scores showed the highest reliability and reliability steadily decreased as more machine scores replaced human scores.
Larry Davis and Spiros Papageorgiou
Davis, L., & Papageorgiou, S. (2021). Complementary strengths? Evaluation of a hybrid human-machine scoring approach for a test of oral academic English. Assessment in Education: Principles, Policy & Practice, 28(4), 437–455.
Theoretically, there are at least three ways to hybridize scoring.
ETS uses a parallel hybrid model. Human raters evaluate each of the four speaking responses (tasks) “holistically-by-task” on the 0-4 scale in the TOEFL Speaking rubrics, considering three main “sub-constructs”: Delivery, Language Use, and Topic Development. SpeechRater independently evaluates each of the four speaking tasks “analytically” according to information contained in the audio. "To the extent possible, machine scoring incorporates linguistic phenomena that reflect the same sub-constructs and...holistic scores from human and machine raters are combined to help ensure adequate construct coverage in the final scores."
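As a toy illustration of the parallel model, the two score streams might be blended with a weighted average. ETS does not disclose its actual formula, so the weights and sample scores below are pure assumptions; the 0-4 task scale and the 0-30 section scale, however, are the real TOEFL Speaking scales.

```python
# Illustrative sketch only: ETS does not publish how human holistic and
# SpeechRater scores are combined. The 0.75/0.25 weighting is an
# assumption for demonstration, not the actual ETS formula.

HUMAN_WEIGHT = 0.75
MACHINE_WEIGHT = 0.25

def combined_task_score(human_holistic: float, machine_score: float) -> float:
    """Blend one task's human holistic score (0-4) with its machine score (0-4)."""
    return HUMAN_WEIGHT * human_holistic + MACHINE_WEIGHT * machine_score

def section_score(task_scores: list[float]) -> int:
    """Average the four task scores (0-4) and scale to the 0-30 section score."""
    mean = sum(task_scores) / len(task_scores)
    return round(mean / 4 * 30)

# Four tasks, each with a (human holistic, machine) score pair -- invented data
pairs = [(3, 3.2), (4, 3.6), (3, 2.8), (3, 3.0)]
task_scores = [combined_task_score(h, m) for h, m in pairs]
print(section_score(task_scores))  # prints 24
```

If the real combination were heavily human-weighted, as guessed above for Topic Development, shifting the weights per sub-construct would be a small extension of the same idea.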
Critique of the parallel approach
Automated scoring has flaws, including “construct under-representation”: computers are poor at measuring high-level language phenomena such as persuasiveness or completeness of ideas (i.e. Topic Development).
Under one divergent contributory approach, humans could evaluate Topic Development and SpeechRater could evaluate Delivery and Language Use.
Humans were asked to score a large dataset of pre-recorded TOEFL responses analytically - similar to SpeechRater. Instead of the traditional “holistic” score they would typically assign, they awarded scores based on specific sub-constructs.
The following three divergent contributory score combinations were evaluated:
1. Delivery (machine) + Language Use (human) + Topic Development (human)
2. Delivery (human) + Language Use (machine) + Topic Development (human)
3. Delivery (machine) + Language Use (machine) + Topic Development (human)
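The composite scoring in the study can be sketched as a simple mean of the three analytic sub-scores, with each sub-score supplied by either a human rater or SpeechRater. The sample scores below are invented for illustration.

```python
# Sketch of a divergent contributory composite: each sub-construct score
# comes from either a human or SpeechRater, and the composite is their
# mean (the study averaged analytic scores into composites).
# All numbers below are made-up illustrative data.

def composite(delivery: float, language_use: float, topic_dev: float) -> float:
    """Mean of the three analytic sub-scores, each on the 0-4 scale."""
    return (delivery + language_use + topic_dev) / 3

human = {"delivery": 3.0, "language_use": 3.5, "topic_dev": 2.5}
machine = {"delivery": 2.8, "language_use": 3.2}

# Combination 1: machine Delivery + human Language Use + human Topic Development
c1 = composite(machine["delivery"], human["language_use"], human["topic_dev"])
# Combination 2: human Delivery + machine Language Use + human Topic Development
c2 = composite(human["delivery"], machine["language_use"], human["topic_dev"])

print(round(c1, 2), round(c2, 2))
```

Note that Topic Development is always human-scored: SpeechRater produced scores only for Delivery and Language Use.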
Human-machine composite scores were slightly more reliable than human holistic scores, although "the most reliable composite scores were constructed solely from human analytic scores."
Reaching agreement about the quality of Topic Development was difficult for human raters.
Adding it up
When the analytic scores were averaged into composite scores, hybrid combinations of human and machine outperformed human holistic scoring; however, a composite formed solely from human analytic scores was more reliable than any hybrid one.
The hybrid advantage over human holistic scoring was very slight, and human raters proved no less reliable than SpeechRater at scoring Delivery and Language Use.
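The paper attributes the reliability gain of composites to averaging multiple observations. The Spearman-Brown prophecy formula, a standard psychometric result (not taken from the study itself), shows why averaging k parallel measurements raises reliability:

```python
# Spearman-Brown prophecy formula: predicted reliability when k parallel
# measurements of reliability r are averaged. Illustrates the general
# principle only; the 0.70 figure is an arbitrary example, not study data.

def spearman_brown(r: float, k: int) -> float:
    """Reliability of the mean of k parallel measurements, each with reliability r."""
    return k * r / (1 + (k - 1) * r)

print(round(spearman_brown(0.70, 1), 3))  # one analytic score
print(round(spearman_brown(0.70, 3), 3))  # mean of three analytic scores
```

Averaging three sub-scores of reliability 0.70 yields a composite reliability near 0.875, which is consistent with the study's finding that composites beat single holistic scores.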