Humans vs Machines in TOEFL Speaking Scoring

Modified on Mon, 17 Oct 2022 at 02:04 PM

The following article is a summary of "Complementary strengths? Evaluation of a hybrid human-machine scoring approach for a test of oral academic English".


When scoring TOEFL Speaking, how effective is the current human-machine assessment model? To what extent does a machine evaluate TOEFL Speaking performance phenomena the way a human does?

Teaching point

ETS uses human raters and SpeechRater to calculate TOEFL Speaking scores. While ETS does not disclose how human scores and SpeechRater scores are combined to generate TOEFL Speaking scores, we can have fun guessing. For example, it’s possible (maybe even likely) that human raters do more of the Topic Development scoring, while machines do more of the Delivery and Language Use scoring.

Why this matters

Many teachers use only the TOEFL Speaking rubrics to estimate holistic TOEFL Speaking scores for their students. When they do, they ignore a chunk of important and knowable data: the dimension scores measured analytically by the ETS SpeechRater scoring engine.

Article Information


Complementary strengths? Evaluation of a hybrid human-machine scoring approach for a test of oral academic English


Human raters and machine scoring systems potentially have complementary strengths in evaluating language ability; specifically, it has been suggested that automated systems might be used to make consistent measurements of specific linguistic phenomena, whilst humans evaluate more global aspects of performance. We report on an empirical study that explored the possibility of combining human and machine scores using responses from the speaking section of the TOEFL iBT® test. Human raters awarded scores for three sub-constructs: delivery, language use and topic development. The SpeechRaterSM automated scoring system produced scores for delivery and language use. Composite scores computed from three different combinations of human and automated analytic scores were equally or more reliable than human holistic scores, probably due to the inclusion of multiple observations in composite scores. However, composite scores calculated solely from human analytic scores showed the highest reliability and reliability steadily decreased as more machine scores replaced human scores.


Larry Davis and Spiros Papageorgiou 



Davis, L., & Papageorgiou, S. (2021). Complementary strengths? Evaluation of a hybrid human-machine scoring approach for a test of oral academic English. Assessment in Education: Principles, Policy & Practice, 28(4), 437–455.


Article Summary

Hybrid scoring

Theoretically, there are at least three ways to hybridize scoring: 

  1. a confirmatory hybrid approach (machine confirms human score)
  2. a parallel contributory approach (machine and human score the same response, scores are combined)
  3. a divergent contributory approach (machine and human score different sub-constructs, scores are combined)

ETS scoring

ETS uses a parallel approach in TOEFL Speaking. Human raters score "holistically-by-task" and consider three main “sub-constructs”: 

  1. Delivery (fluency, pronunciation, rhythm, etc.)
  2. Language Use (vocabulary, grammar, complexity of syntax, etc.)
  3. Topic Development (how coherently the response is structured)

Parallel Approach

ETS uses a parallel hybrid model. Human raters evaluate each of the four speaking responses (tasks) “holistically-by-task” according to a 0-4 scoring system in the TOEFL Speaking rubrics. SpeechRater independently evaluates each of the four speaking tasks “analytically” according to information contained in the audio. "To the extent possible, machine scoring incorporates linguistic phenomena that reflect the same sub-constructs and...holistic scores from human and machine raters are combined to help ensure adequate construct coverage in the final scores."
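ETS does not publish the formula it uses to combine the two scores, so any concrete version is a guess. The sketch below assumes a simple weighted average of the human holistic score and the machine score, both on the 0-4 scale; the weight value is entirely hypothetical.

```python
# Hypothetical sketch of a parallel hybrid combination.
# ETS does not disclose its actual formula; the weighting here is invented.

def parallel_hybrid_score(human_holistic: float, machine_score: float,
                          human_weight: float = 0.5) -> float:
    """Combine a human holistic score and a machine score (both 0-4)."""
    return human_weight * human_holistic + (1 - human_weight) * machine_score

# Example: human rater awards 3.0, SpeechRater's score maps to 3.5.
print(parallel_hybrid_score(3.0, 3.5))  # 3.25
```

With equal weights the result is simply the midpoint of the two scores; shifting `human_weight` toward 1.0 would model a system that trusts the human rater more.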

Critique of the parallel approach

Automated scoring has flaws, including “construct under-representation”. Basically, computers are bad at measuring high-level language phenomena like persuasiveness or completeness of ideas (e.g. Topic Development).

The study

Under one divergent contributory approach, humans could evaluate Topic Development and SpeechRater could evaluate Delivery and Language Use.

Humans were asked to score a large dataset of pre-recorded TOEFL responses analytically, as SpeechRater does. Instead of the traditional "holistic" score they would typically assign, they awarded scores based on specific sub-constructs.

The following four combinations of analytic scores were evaluated (the first three are divergent contributory hybrids; the fourth is fully human):

1. Delivery (machine) + Language Use (human) + Topic Development (human)

2. Delivery (human) + Language Use (machine) + Topic Development (human)

3. Delivery (machine) + Language Use (machine) + Topic Development (human)

4. Delivery (human) + Language Use (human) + Topic Development (human)


Human-machine composite scores were slightly more reliable than human holistic scores, although "the most reliable composite scores were constructed solely from human analytic scores."

Topic Development

Reaching agreement about the quality of Topic Development was difficult for human raters.  

Adding it up 

When the analytic scores were averaged into a composite score, hybrid combinations of human and machine scores outperformed human holistic scoring. However, a composite formed solely of human analytic scores was more reliable than any hybrid one.
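The averaging step above can be sketched in a few lines. The sub-construct scores below are invented for illustration; the study itself averaged analytic scores on the 0-4 scale into a composite, with SpeechRater supplying only Delivery and Language Use.

```python
# Sketch of a divergent contributory composite, as in the study.
# All score values here are invented for illustration (0-4 scale).

def composite(delivery: float, language_use: float,
              topic_development: float) -> float:
    """Average the three analytic sub-construct scores."""
    return (delivery + language_use + topic_development) / 3

human = {"delivery": 3.0, "language_use": 3.0, "topic_development": 2.5}
machine = {"delivery": 3.2, "language_use": 2.8}  # SpeechRater scores no TD

# Combination 3: machine Delivery + machine Language Use + human Topic Development
score = composite(machine["delivery"], machine["language_use"],
                  human["topic_development"])
print(round(score, 2))  # 2.83
```

Swapping which dictionary each sub-construct score is drawn from reproduces the other three combinations evaluated in the study.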

The takeaway 

Human-machine hybrid scoring was slightly more reliable than human holistic scoring. The increase in reliability was very slight, however, and human raters proved no less reliable than the machine at scoring Delivery and Language Use.
