Scoring TOEFL® Speaking responses - Overview

Modified on Sat, 20 May 2023 at 02:28 PM

Summary (TL;DR)

  • My Speaking Score provides insights into TOEFL Speaking performance using SpeechRater data.
  • SpeechRater data includes norm-referenced measures (e.g., Speaking Rate, Discourse Coherence) and criterion-referenced measures (overall SpeechRater score).
  • Norm-referenced measures compare performance to other test takers, while criterion-referenced measures assess performance based on predetermined criteria.
  • ETS applies calculations and algorithms to ensure fairness and accuracy in the assessment process.
  • The TOEFL Speaking section also involves human scoring, providing a valuable human perspective on speech evaluation.

My Speaking Score & SpeechRater

My Speaking Score uses SpeechRater data to give you insights into your TOEFL Speaking performance. SpeechRater data is a rich source of information that can help you make decisions about how you are preparing for your TOEFL exam.

My Speaking Score provides you with two types of TOEFL Speaking performance data: your overall SpeechRater score, on a scale of 0 to 4, and percentile scores for twelve individual speech "dimensions".

Norm-referenced Scoring

The twelve dimensions on a My Speaking Score SpeechRater report are norm-referenced measures. They include percentile scores for aspects of your speech such as Speaking Rate, Discourse Coherence, and Vocabulary Diversity. 

My Speaking Score groups this data by construct: Delivery (Fluency), Language Use (Vocabulary and Grammar), and Topic Development (Discourse Coherence). 
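As a rough illustration of this grouping, the sketch below sorts dimension percentiles into constructs. It covers only the three dimensions named in this article, the assignment of each dimension to a construct is an assumption based on the construct descriptions above, and the percentile values are made up. It does not reflect how My Speaking Score or SpeechRater store their data.

```python
# Assumed mapping from dimension to construct (only the three dimensions
# named in this article; the real report contains twelve).
CONSTRUCT_OF = {
    "Speaking Rate": "Delivery",
    "Vocabulary Diversity": "Language Use",
    "Discourse Coherence": "Topic Development",
}


def group_by_construct(percentiles):
    """Group {dimension: percentile} pairs under their construct name."""
    grouped = {}
    for dimension, score in percentiles.items():
        # Dimensions without a known construct are kept under "Unmapped".
        construct = CONSTRUCT_OF.get(dimension, "Unmapped")
        grouped.setdefault(construct, {})[dimension] = score
    return grouped


# Made-up percentile scores, for illustration only.
sample = {
    "Speaking Rate": 62,
    "Vocabulary Diversity": 48,
    "Discourse Coherence": 71,
}

for construct, dimensions in group_by_construct(sample).items():
    print(construct, dimensions)
```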

See a sample report here.

In this norm-referenced twelve-dimension dataset, your results compare your performance to a reference group - a large number of test takers who have already taken the test. In other words, these dimension measures provide information about how your performance compares to that of other test takers. For example, Speaking Rate may indicate whether your speaking pace is faster or slower compared to the average rate of other test takers. 
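To make the idea of a norm-referenced measure concrete, here is a minimal sketch of a percentile rank calculation. The reference group, the speaking rates, and the mid-rank formula are all assumptions chosen for illustration; ETS does not publish how SpeechRater percentiles are actually computed.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(value, reference):
    """Mid-rank percentile: share of the reference group below `value`,
    counting members tied with `value` as half."""
    ordered = sorted(reference)
    if not ordered:
        raise ValueError("reference group must not be empty")
    below = bisect_left(ordered, value)
    ties = bisect_right(ordered, value) - below
    return 100.0 * (below + 0.5 * ties) / len(ordered)


# Made-up speaking rates (words per minute) for a hypothetical reference group.
reference_rates = [95, 110, 120, 125, 130, 140, 150, 160, 170, 185]

my_rate = 150
print(f"Percentile rank: {percentile_rank(my_rate, reference_rates):.0f}")
```

The same speaking rate would earn a different percentile against a different reference group, which is what distinguishes a norm-referenced measure from a criterion-referenced one.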


Criterion-referenced Scoring

Criterion-referenced measures, on the other hand, assess an individual's performance against specific predetermined criteria. The overall score given by SpeechRater, ranging from 0 to 4, is a criterion-referenced measure. It indicates the proficiency level achieved by the test taker based on the predefined scoring criteria established by ETS and expressed by SpeechRater for the TOEFL Speaking section.

By incorporating both norm-referenced and criterion-referenced measures, My Speaking Score provides users with a comprehensive understanding of their performance in relation to other test takers as well as the established criteria for TOEFL Speaking scoring. This combination of data helps users gauge their strengths and areas for improvement, enabling them to enhance their overall speaking abilities.

SpeechRater: Limitations

SpeechRater employs a range of calculations and algorithms to evaluate speech performance across different dimensions, including speaking rate, pronunciation fluency, vocabulary usage, and more. While specific details about the calculations and factors considered are not publicly disclosed, ETS, the provider of SpeechRater, ensures fairness, consistency, and accuracy in the assessment process.

It is important to understand that the overall SpeechRater score is influenced by a combination of factors that are known only to ETS. While individual dimension scores provide insights into specific aspects of speech, the overall score takes into account additional calculations to provide a comprehensive assessment.

Occasionally, outliers may occur in the scoring process, causing the overall score to deviate from the percentile scores in individual dimensions. These outliers do not necessarily indicate a problem with your speech or performance but are a result of ETS's specific scoring methodology and the factors they consider.

ETS is a reputable organization with extensive expertise in assessment systems, and SpeechRater is designed to provide objective and reliable feedback. While the specific details of the calculations remain undisclosed, the ultimate aim is to deliver accurate and helpful assessments to users.

It is crucial to remember that the SpeechRater score is just one measure of speech performance. Feedback from qualified teachers, consistent practice, and continuous improvement are equally important. Gathering more data over time will also help you assess the consistency of your performance and track your progress.
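One simple way to look at consistency over time is to summarize the same dimension across several practice responses. The sketch below assumes you have copied your percentile scores into a list yourself; the numbers are made up, and the function name and summary fields are hypothetical rather than part of any My Speaking Score feature.

```python
from statistics import mean, pstdev


def summarize_attempts(percentiles):
    """Summarize one dimension's percentile scores, oldest attempt first."""
    if not percentiles:
        raise ValueError("at least one attempt is required")
    return {
        "attempts": len(percentiles),
        "mean": mean(percentiles),
        # A smaller spread suggests more consistent performance.
        "spread": pstdev(percentiles),
        # Positive change means the latest attempt beat the first one.
        "change": percentiles[-1] - percentiles[0],
    }


# Made-up Speaking Rate percentiles from five practice responses.
speaking_rate_history = [40, 55, 50, 60, 65]

summary = summarize_attempts(speaking_rate_history)
print(f"Attempts: {summary['attempts']}")
print(f"Average percentile: {summary['mean']:.1f}")
print(f"Spread (standard deviation): {summary['spread']:.1f}")
print(f"Change from first to latest: {summary['change']:+d}")
```

A handful of attempts is a small sample, so treat the spread and change as rough indicators rather than precise measurements.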

Human Scoring

In addition to the automated scoring provided by SpeechRater, the TOEFL Speaking section also involves human scoring. After test takers complete their responses, they are evaluated by trained human raters who follow the official scoring rubrics provided by ETS.

Human raters assess various aspects of the responses, including delivery, language use, and topic development, just like the automated scoring system. They provide a separate score based on their evaluation of the overall quality of the response in terms of these criteria.

The human scoring component adds an important layer of evaluation to ensure a fair and comprehensive assessment of test takers' speaking abilities. It provides a valuable perspective that takes into account the nuanced aspects of spoken language and offers a human judgment that complements the automated scoring system.

By combining both automated and human scoring, the TOEFL Speaking section aims to provide a reliable and accurate assessment of test takers' speaking proficiency.

TOEFL Speaking Scoring Rubrics

Download TOEFL Speaking rubrics

Copyright © 2019 by Educational Testing Service. All rights reserved.

Each TOEFL iBT Speaking response is scored holistically on a 4-band scale, with 4 being the highest score and 1 the lowest. A score of 0 is assigned to responses in which the speaker is unwilling or unable to provide a response to the question. The score on a speaking task represents an overall judgment of how well the response communicates the intended message. Raters evaluate the overall quality of the language and discourse features of the responses.

Two scoring rubrics are used to guide raters in evaluating the responses. The Independent Speaking Rubric is used to evaluate responses to the independent task. The Integrated Scoring Rubric is used to evaluate responses to the 3 integrated tasks.

Key Features of the Scoring Rubrics

Both scoring rubrics, the Independent Scoring Rubric and the Integrated Scoring Rubric, define the key characteristics of responses in terms of 3 important dimensions: delivery, language use, and topic development. When raters evaluate responses, they consider all 3 dimensions equally. No one dimension is weighted more heavily than another.


Delivery

The 2 key features that characterize delivery are clarity of speech and pace. Pronunciation, stress, and intonation most often determine the clarity of speech in a response. The rate of speech, length of utterances, and the degree of hesitancy or choppiness all factor into the pace of a response. For example, a level 4 delivery is characterized by speech that is generally clear and uses stress and intonation patterns effectively. Mostly fluid speech is sustained for the required time. There may be some minor problems but they do not cause difficulty for the listener. There is a certain ease of presentation in responses that show level 4 delivery. At the very lowest band level for delivery, speakers generally have problems maintaining the flow of speech for the time allotted. The speech in low level responses tends to be fragmented and contains long pauses or it may be mostly unintelligible due to serious pronunciation difficulties.

Language Use 

The most salient features pertaining to language use at the highest band level are the efficiency of word choice and grammatical structures to convey meaning and the automaticity with which they are created. What stands out in a response that shows level 4 language use are a control of a wide range of vocabulary and a comfort with a variety of structures. Moving down the scale, word choice becomes less efficient and more vague. It takes more words to convey the same information. At the lower band levels, a smaller range of vocabulary and structures is evident. Responses at the lower band levels for language use are therefore marked by frequent repetitions and greater difficulty expressing meaning clearly.

Topic Development 

The topic development demands of the independent task differ to some extent from those of the integrated tasks.

For the independent task, in which test takers speak about their own experiences, preferences, or opinions, topic development is characterized by the fullness of the content provided in the response and its overall coherence. At the highest level, responses address the tasks by clearly communicating a point of view and providing well-supported reasons or explanations with some elaboration. Responses do not need to be tightly organized, with a clear beginning, middle, and end; but in responses at the highest band level, ideas progress smoothly and cohesively, making the response easy for the listener to follow. Moving down the scale, fewer supporting details and less elaboration are evident. Ideas are not as well connected and their progression not as cohesive. At the lowest band level, very little relevant content is expressed; ideas are very general and vague.

In integrated tasks, test takers speak about information that they have listened to and/or read. For these tasks topic development is characterized by the accuracy and completeness of the content provided in the response as well as its overall coherence. At the highest level, responses address the tasks by presenting relevant information from the reading and listening material and organizing it in a way that makes it easy for the listener to follow the progression of ideas. Responses in the highest band provide the major information requested as well as some supporting detail or elaboration on the topic. Responses at this level may contain minor inaccuracies or omissions, as long as they do not impede the overall coherence and completeness. Moving down the scale, fewer supporting details and less elaboration are evident, generally with more omissions and inaccuracies. The progression of ideas becomes less fluid and the connection between pieces of information becomes less clear. The amount of content decreases and the response may become somewhat repetitive. At the lowest band level, very little relevant content is expressed; ideas presented are very general or inaccurate.
