This post is authored by Sree Krishna (during migration the author's name could not be carried over).
Spoken English Test - one of the factors that influences the quality of this test is the raters. The quality of the evaluations depends on the degree of consensus between each rater and the standard rater.
A note on Consensus Estimates:
Consensus estimates can be useful in diagnosing problems with the raters’ interpretations of how to apply the rating scale.
Consensus estimates of inter-rater reliability are based on the assumption that reasonable raters should be able to come to exact agreement about how to apply the various levels of a scoring scale. If two raters come to exact agreement on how to use the rating scale to score, then the two raters may be said to share a common interpretation of the construct.
“If raters can be trained to the point where they agree on how to interpret a rating scale, then scores given by the two judges may be treated as equivalent”
There are different methods of computing Consensus estimates:
Percent-agreement: The most popular method for computing a consensus estimate of inter-rater reliability is through the use of the simple percent-agreement figure. Percent agreement is calculated by adding up the number of cases that received the same rating by both raters and dividing that number by the total number of cases rated by the two raters.
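As a quick illustration, here is a minimal sketch of that calculation in Python, using hypothetical scores from two raters on a 1-4 rating scale:

```python
# A sketch of the simple percent-agreement figure, using hypothetical
# scores from two raters on a 1-4 rating scale.

def percent_agreement(ratings_a, ratings_b):
    """Proportion of cases that received the same rating from both raters."""
    if len(ratings_a) != len(ratings_b):
        raise ValueError("Both raters must rate the same set of cases.")
    agreements = sum(1 for a, b in zip(ratings_a, ratings_b) if a == b)
    return agreements / len(ratings_a)

rater_1 = [3, 4, 2, 3, 1, 4, 3, 2]  # hypothetical scores
rater_2 = [3, 4, 2, 2, 1, 4, 3, 3]

print(percent_agreement(rater_1, rater_2))  # 0.75, i.e. 75% agreement
```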
The statistic also has some distinct disadvantages, however. For example, if the behavior of interest has a low incidence of occurrence in the population, then it is possible to get artificially inflated percent-agreement figures simply because most of the values fall under one category of the rating scale (Hayes & Hatch, 1999). Another disadvantage to using the simple percent-agreement figure is that it is often time consuming and labor intensive to train judges to the point of exact agreement.
Cohen’s kappa statistic - designed to estimate the degree of consensus between two judges after correcting the percent-agreement figure for the amount of agreement that could be expected by chance alone. Kappa is a highly useful statistic when one is concerned that the percent-agreement statistic may be artificially inflated due to the fact that most observations fall into a single category.
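A minimal sketch of the kappa calculation, again in Python with hypothetical data, assuming two raters scoring the same set of cases; the chance-expected agreement is estimated from each rater's marginal proportions:

```python
# A sketch of Cohen's kappa: the percent-agreement figure corrected for
# the agreement expected by chance. Data are hypothetical.
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """kappa = (p_o - p_e) / (1 - p_e)."""
    n = len(ratings_a)
    # Observed agreement (the simple percent-agreement figure).
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement, estimated from each rater's marginal proportions.
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum((counts_a[c] / n) * (counts_b[c] / n) for c in counts_a)
    return (p_o - p_e) / (1 - p_e)

rater_1 = [3, 4, 2, 3, 1, 4, 3, 2]  # same hypothetical scores as above
rater_2 = [3, 4, 2, 2, 1, 4, 3, 3]

print(cohens_kappa(rater_1, rater_2))  # about 0.65 (vs. 0.75 raw agreement)
```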
The disadvantage of the kappa coefficient is that it can be somewhat difficult to interpret. Uebersax (1987) has noted that one major problem is that values of kappa may differ depending upon the proportion of respondents falling into each category of a rating scale. Thus, kappa values for different items or from different studies cannot be meaningfully compared unless the base rates are identical.
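To make the base-rate issue concrete, the hypothetical example below shows two pairs of raters with identical 90% observed agreement whose kappa values nonetheless differ sharply because the category base rates differ:

```python
# A small illustration (with hypothetical data) of the base-rate problem:
# identical observed agreement can yield very different kappa values.
from collections import Counter

def kappa(a, b):
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    p_e = sum((counts_a[c] / n) * (counts_b[c] / n) for c in counts_a)
    return (p_o - p_e) / (1 - p_e)

# Balanced base rates: the cases are split evenly between two categories.
balanced_1 = [1] * 9 + [0] * 9 + [1, 0]
balanced_2 = [1] * 9 + [0] * 9 + [0, 1]

# Skewed base rates: nearly all cases fall into one category.
skewed_1 = [1] * 18 + [1, 0]
skewed_2 = [1] * 18 + [0, 1]

print(kappa(balanced_1, balanced_2))  # about 0.80
print(kappa(skewed_1, skewed_2))      # about -0.05, despite 90% agreement
```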