Sometimes the knowledge and skills deemed educationally important are difficult to pin down. Conceptual understanding, problem solving and critical thinking are widely valued, yet they are nebulous and difficult to specify. We might recognise conceptual understanding when we see it, but struggle to define it explicitly and comprehensively. This is a challenge for assessment, which traditionally relies on precise and detailed assessment criteria. In principle, precise and detailed criteria enable a shared understanding of what is being assessed. These criteria ensure marking is as reliable and objective as possible. But in practice the spirit of what is being assessed can be lost, and assessors vary in their interpretation of the criteria. This is unfair because a student’s final mark becomes partly dependent on the preferences and idiosyncrasies of the marker.
At the Mathematics Education Centre at Loughborough University we have been exploring an alternative approach to marking to help with the assessment of nebulous knowledge and skills. Comparative judgement (Pollitt 2012) involves no marking and no assessment criteria. Instead, an assessor is presented with two pieces of students’ work and asked which is better in terms of a construct such as conceptual understanding. Many such pairings are presented to several assessors and the outcomes are statistically analysed to construct a scaled rank order of students’ work from worst to best. The rank order can then be used for assessment procedures such as grading.
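To make the statistical step more concrete, here is a minimal sketch of how pairwise outcomes could be turned into a scaled rank order. It assumes a Bradley-Terry model fitted with a simple iterative update; the article does not state which model is used in practice, and the function name, the virtual reference script and the toy judgements are all illustrative assumptions.

```python
import math

def fit_bradley_terry(judgements, n_scripts, iterations=500):
    """Estimate one score per script from (winner, loser) index pairs.

    Assumption: a Bradley-Terry model, in which the chance that script i
    beats script j is strength[i] / (strength[i] + strength[j]).
    """
    # Each script gets half a virtual win from one virtual comparison with a
    # reference script of strength 1, so that a script which never wins
    # (or never loses) still receives a finite score.
    wins = [0.5] * n_scripts
    pair_counts = {}
    for winner, loser in judgements:
        wins[winner] += 1
        key = (min(winner, loser), max(winner, loser))
        pair_counts[key] = pair_counts.get(key, 0) + 1

    strength = [1.0] * n_scripts
    for _ in range(iterations):
        # Contribution of the virtual comparison against the reference script
        denom = [1.0 / (s + 1.0) for s in strength]
        for (i, j), n in pair_counts.items():
            share = n / (strength[i] + strength[j])
            denom[i] += share
            denom[j] += share
        strength = [w / d for w, d in zip(wins, denom)]

    # Log-strengths give scores on an interval-like scale
    return [math.log(s) for s in strength]

# Hypothetical judgements on three scripts, written as (winner, loser)
judgements = [(2, 1), (2, 0), (1, 0), (2, 1), (1, 0), (0, 1)]
scores = fit_bradley_terry(judgements, n_scripts=3)

# Script indices ordered from worst to best
rank_order = sorted(range(3), key=lambda i: scores[i])
```

The scores, not just the ordering, are what make the rank order "scaled": the gap between two scripts reflects how consistently one was preferred over the other.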
Comparative judgement is based on the long-standing principle that human beings are very reliable when comparing one object to another, but are often very unreliable at gauging a single object’s position on a scale (Laming 1984). For instance, it is likely that you can state whether the temperature of the room in which you are reading this article is higher or lower than that outside the building. However, you are probably unable to produce an accurate estimate of the temperature in degrees Celsius.
Recent technological developments mean the principle of comparative judgement is now viable for assessing nebulous learning outcomes. Student work can be delivered online to assessors for pairwise judging without the need for printing and sending round large numbers of hard copies. Judgement decisions can be recorded in real time and used to select subsequent pairings of scripts, thereby reducing the number of judgements required to construct a scaled rank order.
For example, consider a scenario in which there are 100 scripts. For every script to be compared once with every other script would require almost 5000 individual pairwise judgements. Fortunately, in practice this can be reduced to as few as 500 judgements using adaptive algorithms (Pollitt 2012) or random sampling (Suzuki, Yasui and Ojima 2010). Even so, the statistical modelling always requires several times more judgements than there are scripts.
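The arithmetic behind these figures can be checked with a short sketch. It counts every possible pairing of 100 scripts and then draws 500 of them uniformly at random. This is plain random sampling chosen for illustration, not the adaptive algorithm of Pollitt (2012) or the specific procedure evaluated by Suzuki, Yasui and Ojima (2010), and the seed is arbitrary.

```python
import itertools
import random

N_SCRIPTS = 100
N_JUDGEMENTS = 500

# Every unordered pairing of two different scripts: n * (n - 1) / 2
all_pairs = list(itertools.combinations(range(N_SCRIPTS), 2))
full_design_size = len(all_pairs)  # 4950, i.e. "almost 5000"

# Draw a reduced set of distinct pairings uniformly at random
rng = random.Random(0)  # fixed seed so the sketch is repeatable
sampled_pairs = rng.sample(all_pairs, N_JUDGEMENTS)

# Each judgement involves two scripts, so on average each script
# appears in 2 * 500 / 100 = 10 judgements
mean_appearances = 2 * N_JUDGEMENTS / N_SCRIPTS
```

With uniform sampling some scripts will be seen more often than others, which is one reason adaptive selection of pairings can be attractive.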
A potential limitation of the comparative judgement approach is that it can be more resource intensive than marking. It requires a group of assessors rather than just one. This is fine for large scale contexts, such as national school tests, that traditionally need many examiners. However, it is a barrier in small scale contexts such as a university lecturer or school teacher marking his or her own students’ work.
The resource demands of comparative judgement can be addressed through peer assessment. Students can be set a task and then comparatively judge one another’s work. This satisfies the requirement for a group of assessors rather than an individual assessor. Returning to the above example, if 100 students are each allocated 20 judgements then a total of 2000 judgements will be recorded. This satisfies the requirement for several times more judgements than there are scripts.
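One way such an allocation might be organised is sketched below. It assumes, purely for illustration, that each student has submitted exactly one script, that pairings are drawn at random, and that students should never judge their own script; none of these details is specified in the article, and the function name is hypothetical.

```python
import random

def allocate_peer_judgements(n_students, per_student, seed=0):
    """Give each student `per_student` random pairings of peers' scripts.

    Assumption: student k wrote script k and never judges their own work.
    A student may occasionally be given the same pairing twice.
    """
    rng = random.Random(seed)
    allocation = {}
    for judge in range(n_students):
        others = [s for s in range(n_students) if s != judge]
        # rng.sample returns two different scripts for each pairing
        allocation[judge] = [tuple(rng.sample(others, 2))
                             for _ in range(per_student)]
    return allocation

allocation = allocate_peer_judgements(n_students=100, per_student=20)
total_judgements = sum(len(pairs) for pairs in allocation.values())  # 100 * 20
```

Under these assumptions each script appears in 40 judgements on average (2000 judgements, two scripts per judgement, 100 scripts).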
A further attraction of peer-based comparative judgement is that it might encourage higher-order learning. When undertaking comparative judgement, students need to gauge the relative quality of peers’ understanding and communication of the assessed construct. This can be particularly beneficial for nebulous constructs that lend themselves to open-ended, unstructured assessment tasks. Students, often used to structured test questions, can be unsettled when presented with an unstructured assessment. Viewing and judging how others attempted the same task can help build confidence and strategies for tackling future open-ended, unstructured tasks.
A recent example
Peer assessment needs to be reliable and valid to be of practical use. We recently evaluated school students’ performance when using comparative judgement to assess conceptual understanding. A mathematics teacher of 12- to 15-year-olds provided us with 24 responses to the test question shown in the box. We wanted to find out whether students of the same age could assess the responses using comparative judgement.
The responses to the question were anonymised and scanned, then uploaded to a secure comparative judgement website. Students aged 13 to 15 from three other schools then conducted comparative judgement on the responses. They were asked to decide, for each pairing, which student had the better understanding of fractions. Their decisions were based on the ordering of the fractions and the quality of the written explanations. Once the judging was completed, a scaled rank order of responses was constructed for each school. The three scaled rank orders were very similar and correlated strongly with one another, showing that the assessment was reliable.
We also recruited experts (teachers and mathematics education researchers) to assess the students’ answers using comparative judgement. This enabled us to compare the rank orders produced by the students with those of the experts. We found that the student and expert rank orders correlated strongly and the students’ collective interpretation of the better understanding of fractions was very similar to that of the experts.
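Agreement between two rank orders, whether between schools or between students and experts, can be quantified with a rank correlation. The sketch below uses Spearman's coefficient as one reasonable choice; the article does not say which coefficient was used in the study, and the scores shown are invented for illustration rather than taken from it.

```python
def ranks(scores):
    """Rank scores from 1 (lowest) to n (highest); assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result

def spearman(scores_a, scores_b):
    """Spearman rank correlation for two untied score lists of equal length."""
    n = len(scores_a)
    rank_a, rank_b = ranks(scores_a), ranks(scores_b)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical scaled scores for the same five responses
student_scores = [-1.2, 0.3, 0.9, -0.4, 1.5]
expert_scores = [-0.8, 0.1, 1.1, 0.4, 1.3]

rho = spearman(student_scores, expert_scores)
```

A value close to 1 indicates that the two groups ordered the responses in nearly the same way; a value near 0 indicates little agreement.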
In the coming months we will undertake a study to investigate the learning benefits of students undertaking peer assessment using comparative judgement. Students will sit a specially designed conceptual test similar to that outlined above. The students will then undertake pairwise judgements of their peers’ responses before sitting the test again. We will compare the gains made from pre- to post-test with reference to a control group who will not undertake the peer assessment exercise. We anticipate that the learning gains will be statistically significant, and that children’s conceptual development will be qualitatively evident in the content of the post-tests compared to the pre-tests.
We are also currently developing technological tools for implementing comparative judgement. Readers interested in using comparative judgement should contact Ian Jones at firstname.lastname@example.org
Laming, D. (1984). The relativity of “absolute” judgements. British Journal of Mathematical and Statistical Psychology, 37, 152–183.
Pollitt, A. (2012). The method of Adaptive Comparative Judgement. Assessment in Education: Principles, Policy & Practice, 19, 281–300.
Suzuki, T., Yasui, S., and Ojima, Y. (2010). Evaluating adaptive paired comparison experiments. In H. Lenz, P. Wilrich, & W. Schmid (Eds.), Frontiers in Statistical Quality Control 9 (pp. 341–350). Physica-Verlag: Germany.
University of Loughborough