Humans vs Machines in TOEFL Speaking Scoring
Modified on Mon, 17 Oct 2022 at 02:04 PM
When scoring TOEFL Speaking, how effective is the current human-machine assessment model? To what extent does a machine evaluate TOEFL Speaking performance phenomena the way a human does?
ETS uses human raters and SpeechRater to calculate TOEFL Speaking scores. While ETS does not disclose how human scores and SpeechRater scores are combined to generate TOEFL Speaking scores, we can have fun guessing. For example, it’s possible (maybe even likely) that human raters do more of the Topic Development scoring, while machines do more of the Delivery and Language Use scoring.
Why this matters
Many teachers use only the TOEFL Speaking rubrics to estimate holistic TOEFL Speaking scores for their students. When teachers do this, they ignore a chunk of important and knowable data: the dimension scores measured analytically by the ETS SpeechRater scoring engine.
Complementary strengths? Evaluation of a hybrid human-machine scoring approach for a test of oral academic English
Human raters and machine scoring systems potentially have complementary strengths in evaluating language ability; specifically, it has been suggested that automated systems might be used to make consistent measurements of specific linguistic phenomena, whilst humans evaluate more global aspects of performance. We report on an empirical study that explored the possibility of combining human and machine scores using responses from the speaking section of the TOEFL iBT® test. Human raters awarded scores for three sub-constructs: delivery, language use and topic development. The SpeechRaterSM automated scoring system produced scores for delivery and language use. Composite scores computed from three different combinations of human and automated analytic scores were equally or more reliable than human holistic scores, probably due to the inclusion of multiple observations in composite scores. However, composite scores calculated solely from human analytic scores showed the highest reliability and reliability steadily decreased as more machine scores replaced human scores.
Larry Davis and Spiros Papageorgiou
Davis, L., & Papageorgiou, S. (2021). Complementary strengths? Evaluation of a hybrid human-machine scoring approach for a test of oral academic English. Assessment in Education: Principles, Policy & Practice, 28(4), 437–455.
Theoretically, there are at least three ways to hybridize scoring.
ETS uses a parallel hybrid model. Human raters evaluate each of the four speaking responses (tasks) “holistically-by-task” on the 0-4 scale in the TOEFL Speaking rubrics, considering three main “sub-constructs”: Delivery, Language Use, and Topic Development. SpeechRater independently evaluates each of the four speaking tasks “analytically” according to information contained in the audio. "To the extent possible, machine scoring incorporates linguistic phenomena that reflect the same sub-constructs and...holistic scores from human and machine raters are combined to help ensure adequate construct coverage in the final scores."
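As a toy illustration of the parallel model, the two score streams might be blended with a weighted average. ETS does not disclose its actual formula, so the weights and sample scores below are pure assumptions; the 0-4 task scale and the 0-30 section scale, however, are the real TOEFL Speaking scales.

```python
# Illustrative sketch only: ETS does not publish how human holistic and
# SpeechRater scores are combined. The 0.75/0.25 weighting is an
# assumption for demonstration, not the actual ETS formula.

HUMAN_WEIGHT = 0.75
MACHINE_WEIGHT = 0.25

def combined_task_score(human_holistic: float, machine_score: float) -> float:
    """Blend one task's human holistic score (0-4) with its machine score (0-4)."""
    return HUMAN_WEIGHT * human_holistic + MACHINE_WEIGHT * machine_score

def section_score(task_scores: list[float]) -> int:
    """Average the four task scores (0-4) and scale to the 0-30 section score."""
    mean = sum(task_scores) / len(task_scores)
    return round(mean / 4 * 30)

# Four tasks, each with a (human holistic, machine) score pair -- invented data
pairs = [(3, 3.2), (4, 3.6), (3, 2.8), (3, 3.0)]
task_scores = [combined_task_score(h, m) for h, m in pairs]
print(section_score(task_scores))  # prints 24
```

If the real combination were heavily human-weighted, as guessed above for Topic Development, shifting the weights per sub-construct would be a small extension of the same idea.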
Critique of the parallel approach
Automated scoring has flaws, including “construct under-representation”: computers are poor at measuring high-level language phenomena such as persuasiveness or completeness of ideas (i.e. Topic Development).
Under one divergent contributory approach, humans could evaluate Topic Development and SpeechRater could evaluate Delivery and Language Use.
Humans were asked to score a large dataset of pre-recorded TOEFL responses analytically - similar to SpeechRater. Instead of the traditional “holistic” score they would typically assign, they awarded scores based on specific sub-constructs.
The following three divergent contributory score combinations were evaluated:
1. Delivery (machine) + Language Use (human) + Topic Development (human)
2. Delivery (human) + Language Use (machine) + Topic Development (human)
3. Delivery (machine) + Language Use (machine) + Topic Development (human)
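The composite scoring in the study can be sketched as a simple mean of the three analytic sub-scores, with each sub-score supplied by either a human rater or SpeechRater. The sample scores below are invented for illustration.

```python
# Sketch of a divergent contributory composite: each sub-construct score
# comes from either a human or SpeechRater, and the composite is their
# mean (the study averaged analytic scores into composites).
# All numbers below are made-up illustrative data.

def composite(delivery: float, language_use: float, topic_dev: float) -> float:
    """Mean of the three analytic sub-scores, each on the 0-4 scale."""
    return (delivery + language_use + topic_dev) / 3

human = {"delivery": 3.0, "language_use": 3.5, "topic_dev": 2.5}
machine = {"delivery": 2.8, "language_use": 3.2}

# Combination 1: machine Delivery + human Language Use + human Topic Development
c1 = composite(machine["delivery"], human["language_use"], human["topic_dev"])
# Combination 2: human Delivery + machine Language Use + human Topic Development
c2 = composite(human["delivery"], machine["language_use"], human["topic_dev"])

print(round(c1, 2), round(c2, 2))
```

Note that Topic Development is always human-scored: SpeechRater produced scores only for Delivery and Language Use.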
Human-machine composite scores were slightly more reliable than human holistic scores, although "the most reliable composite scores were constructed solely from human analytic scores."
Reaching agreement about the quality of Topic Development was difficult for human raters.
Adding it up
When the analytic scores were averaged into composite scores, hybrid combinations of human and machine outperformed human holistic scoring; however, a composite formed solely from human analytic scores was more reliable than any hybrid one.
The hybrid advantage over human holistic scoring was very slight, and human raters proved no less reliable than SpeechRater at scoring Delivery and Language Use.
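The paper attributes the reliability gain of composites to averaging multiple observations. The Spearman-Brown prophecy formula, a standard psychometric result (not taken from the study itself), shows why averaging k parallel measurements raises reliability:

```python
# Spearman-Brown prophecy formula: predicted reliability when k parallel
# measurements of reliability r are averaged. Illustrates the general
# principle only; the 0.70 figure is an arbitrary example, not study data.

def spearman_brown(r: float, k: int) -> float:
    """Reliability of the mean of k parallel measurements, each with reliability r."""
    return k * r / (1 + (k - 1) * r)

print(round(spearman_brown(0.70, 1), 3))  # one analytic score
print(round(spearman_brown(0.70, 3), 3))  # mean of three analytic scores
```

Averaging three sub-scores of reliability 0.70 yields a composite reliability near 0.875, which is consistent with the study's finding that composites beat single holistic scores.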