Exam-Grading Assistance
Making meaningful assessment possible at scale
Last updated February 28, 2025.
When assessing learning in many mathematical and scientific disciplines, the path to the solution matters as much as the answer itself: derivations, proofs, sketches, and so on.
We are exploring how multimodal, reasoning-capable large language models can assist in grading open-ended, handwritten exams. We use Bayesian statistics to quantify the confidence in these automated judgements.
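As an illustrative sketch only (not the actual model used in this project), one simple way to attach a Bayesian confidence level to automated grading is a Beta-Binomial posterior over the probability that an AI grade agrees with a human grade; the counts below are hypothetical:

```python
# Hypothetical sketch: Beta-Binomial posterior over AI/human agreement.
# A Beta(alpha, beta) prior is updated with observed agreement counts;
# grading could then be automated only where posterior confidence is high.

def posterior_params(agreements, trials, alpha=1.0, beta=1.0):
    """Update a Beta(alpha, beta) prior with agreement observations."""
    return alpha + agreements, beta + (trials - agreements)

def posterior_mean(alpha, beta):
    """Posterior mean of the agreement probability."""
    return alpha / (alpha + beta)

# Hypothetical calibration data: 45 of 50 AI grades matched the human grade.
a, b = posterior_params(agreements=45, trials=50)
confidence = posterior_mean(a, b)
```

With a uniform prior, 45 agreements in 50 trials gives a posterior mean of 46/52, i.e. roughly 0.88; a threshold on this quantity is one way such a confidence criterion could be operationalized.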
The graph below shows the correspondence between points awarded by the AI and by teaching assistants. By setting confidence thresholds, half of a large-scale thermodynamics exam could be graded automatically with a coefficient of determination of R² = 0.92; on average, the AI system awarded 3.4 points more (out of 60) than the teaching assistants, with a regression slope of 1.04, close to the ideal value of 1.
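The two agreement metrics above, the coefficient of determination R² and the regression slope, can be computed from paired scores as follows; the data here are synthetic, purely for illustration:

```python
# Sketch: least-squares slope and R^2 between human- and AI-assigned points.
# Scores are synthetic; variable names are illustrative, not from the study.

def fit_slope_r2(x, y):
    """Fit y = slope * x + intercept by least squares; return (slope, R^2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return slope, 1 - ss_res / ss_tot

ta_points = [10, 25, 30, 42, 55]   # synthetic TA scores (out of 60)
ai_points = [12, 27, 33, 44, 58]   # synthetic AI scores (out of 60)
slope, r2 = fit_slope_r2(ta_points, ai_points)
```

A slope near 1 means the AI neither systematically compresses nor stretches the grading scale, while R² measures how tightly the paired scores cluster around that line.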

We are now investigating additional exams in physics, mathematics, and chemistry.