Dr. Maryam Raiyat Aliabadi is a postdoctoral teaching and research fellow in the Faculty of Computer Science. In this Q&A, she explains how she and her team are using GenAI to automatically grade exams for her cloud computing course. Responses have been edited for clarity.
Q: Can you describe what led you to develop a Generative AI auto-grading project?
I began thinking about how to apply effective assessment methods in larger classes and how to scale them. Computer science courses use online assessment platforms like PrairieLearn, which support autograding and randomization across diverse question formats, including multiple-choice, true/false, fill-in-the-blank, numerical input, code/programming, multiple answer, matching, and graphical input questions. This diversity enables a comprehensive range of assessments. However, specialized requirements like those in cloud computing courses may call for supplementary tools or tailored questions.
The course relies heavily on scenario-based questions where students must justify their choice of specific cloud services based on project requirements. This type of open-ended question doesn’t fit well with automated grading systems that expect predefined correct answers. For the midterm and final exams over two semesters, the accuracy of automated grading for fill-in-the-blank questions was under 40%. This meant that TAs and I had to manually grade a significant portion of the work. Additionally, the automated system didn’t support grading for short answers or open-ended questions, requiring a lot of manual effort.
Q: How did you address this challenge, and what early results did you and your team experience?
To address this, we explored using semantic interpretation for automated grading to better understand student responses. We decided to incorporate ChatGPT into the grading process through PrairieLearn. However, integrating an external grader into the existing pipeline was not straightforward, and we spent a lot of time working through those integration issues before we could connect the ChatGPT API to the grading system. Thankfully, we were able to resolve them.
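To illustrate, here is a minimal sketch of the kind of ChatGPT API call such an integration relies on. It assumes the `openai` Python SDK and an `OPENAI_API_KEY` environment variable; the model name and prompt wording are illustrative assumptions, not the team's actual configuration.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_grader(prompt: str) -> str:
    """Send a grading prompt to the model and return its text reply."""
    response = client.chat.completions.create(
        model="gpt-4o",  # model choice is an assumption, not the team's setting
        messages=[
            {"role": "system", "content": "You are a strict but fair exam grader."},
            {"role": "user", "content": prompt},
        ],
        temperature=0,  # keep grading as reproducible as possible
    )
    return response.choices[0].message.content
```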
We started by improving the accuracy of grading for fill-in-the-blank questions. Traditionally, instructors had to list every correct variation of an answer in a server.py file in PrairieLearn. Our new approach instead uses AI to semantically compare student responses with the correct answers. This change improved the grading accuracy for fill-in-the-blank questions from an average of 39% to 99%. For the 90 students in my cloud computing courses, both midterm and final exams showed high accuracy with this new system.
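A hedged sketch of how such a semantic check might plug into a PrairieLearn server.py grading hook, reusing the `ask_grader` helper sketched above; the element name `blank1` and the prompt are assumptions, not the course's actual question code.

```python
def is_equivalent(student_answer: str, reference_answer: str) -> bool:
    """Ask the model for a YES/NO semantic-equivalence judgment."""
    prompt = (
        f"Reference answer: {reference_answer}\n"
        f"Student answer: {student_answer}\n"
        "In the context of a cloud computing course, do these mean the "
        "same thing? Reply with exactly YES or NO."
    )
    # ask_grader is the ChatGPT helper sketched in the previous block
    return ask_grader(prompt).strip().upper().startswith("YES")


def grade(data):
    # "blank1" is a placeholder element name, not from the actual course
    submitted = data["submitted_answers"].get("blank1", "")
    reference = data["correct_answers"]["blank1"]
    score = 1.0 if is_equivalent(submitted, reference) else 0.0
    data["partial_scores"]["blank1"] = {"score": score, "weight": 1}
    data["score"] = score
```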
The second aspect was the grading of short-answer questions, which the existing system could not grade automatically. We added this feature using ChatGPT. The new system carefully evaluates student responses, highlighting which parts are correct and which are not. This feedback helps students understand their mistakes and resubmit improved answers. Additionally, the system provides partial grading based on the correctness of different parts of the answer. We tested this new feature on the exam questions from the cloud computing courses and achieved 98% accuracy in grading short-answer questions, a significant enhancement of the grading system's capabilities.
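The following sketch shows one way a short-answer grader with partial credit and feedback could be structured, again using the `ask_grader` helper above; the JSON response format and function name are illustrative assumptions rather than the team's implementation.

```python
import json

def grade_short_answer(question: str, reference: str, student_answer: str):
    """Return (score between 0 and 1, feedback string) for a short answer."""
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Student answer: {student_answer}\n"
        "Grade the student answer against the reference. Respond with JSON "
        'only, in the form {"score": <number from 0 to 1>, "feedback": '
        '"<which parts are correct and which need revision>"}.'
    )
    reply = ask_grader(prompt)  # ChatGPT helper sketched earlier
    result = json.loads(reply)  # assumes the model follows the JSON format
    return float(result["score"]), result["feedback"]
```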
[…]
Another significant aspect is the constructive feedback provided by Generative AI. It helps students clearly understand which parts of their responses are correct and which parts require further consideration. Such feedback encourages critical thinking and enables students to improve their answers when resubmitting exams or assignments.
Q: What is the next step in the project?
Currently, we are working on grading long-answer scenario-based questions, which is the most challenging part of the project.
Unlike the earlier question types, there isn’t a single correct version of the answer for these questions. Each student may approach them differently, and there may be multiple correct answers. To grade these questions manually, we use rubrics as guidelines. We then format these rubrics into specific rules for the auto-grader.
Using the language model, the system calculates a similarity score between each rubric item and the student’s response. Based on these scores and predefined weights, the system applies partial grading logic to calculate the overall mark for the question.
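A minimal sketch of the weighted aggregation described here, assuming a `similarity` function (for example, an LLM judgment or an embedding-based cosine similarity) that scores each rubric item against the response on a 0–1 scale; the rubric items and weights shown are invented for illustration.

```python
def rubric_score(response: str, rubric: dict[str, float], similarity) -> float:
    """Weighted average of per-rubric-item similarity scores (each 0 to 1)."""
    total_weight = sum(rubric.values())
    weighted = sum(weight * similarity(item, response)
                   for item, weight in rubric.items())
    return weighted / total_weight if total_weight else 0.0


# Illustrative rubric for a scenario question; items and weights are invented.
example_rubric = {
    "Justifies the choice of a managed database service": 0.4,
    "Addresses the scalability requirements in the scenario": 0.4,
    "Discusses cost trade-offs": 0.2,
}
```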
This part is currently under implementation, and we expect to complete it by the end of the summer. For future work, we are considering adding more features to assess not only text-based responses but also assignments that include code, text, and visuals. Each type of response requires different capabilities from the AI model, so we need to explore how to integrate these capabilities into our system.
Q: What challenges do you see emerging from this method of grading with Generative AI?
It’s worth mentioning that generative AI can make mistakes. We’ve encountered cases of false positives and false negatives. False positives occur when the auto-grader marks an answer as correct, but manual grading indicates it’s incorrect. This discrepancy helps us identify areas for improvement in both the auto-grading system and manual grading processes.
Conversely, false negatives occur when the auto-grader marks a correct answer as incorrect. In such cases, human oversight is necessary to ensure fairness and accuracy in grading. These instances highlight how the integration of generative AI and manual grading can complement each other to enhance accuracy and fairness in assessment.
Compared to the extensive manual effort required by TAs and instructors to grade a large number of students, the few mistakes made by the auto-grader are a reasonable compromise. So, it’s worth considering the use of AI-based assessment tools.
Q: Do you think this system could be applied to other courses and disciplines?
So far, we have tested this enhanced grading system only on my cloud computing course exams. I’m now talking to other instructors in the Faculty of Computer Science and other departments to gather more test data and achieve more comprehensive results.
We are working to make our tool available as open-source to help save time and effort in assessment for the wider community. Each instructor may use Generative AI differently based on their course context and needs. While my experience centers on auto-grading, there are numerous other applications for Generative AI in teaching and learning. For example, it can be used to create personalized tutorials or generate practice exams based on lecture slides. The creativity of teachers in utilizing Generative AI to address their specific needs is key to its effective implementation.
Q: Is there anything else you’d like to add or mention regarding your project or any other aspect?
One crucial aspect that often sparks debate is the ethical considerations surrounding privacy. Ensuring the confidentiality of student information is paramount. In our project, we take great care to preserve student privacy by sharing only anonymized responses for assessment. No personal information, such as names or IDs, is shared with Generative AI models. Additionally, while our current auto-grader version is private, if we decide to make it open-source, instructors must obtain student consent before using it for grading to address concerns about unintentional cheating. Ethical considerations are essential in projects like these, and we are mindful of privacy and intellectual property aspects.
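As an illustration of the anonymization step described above, the following hypothetical snippet replaces student identifiers with opaque hashes so that only the response text and a non-reversible ID ever reach the model; it is a sketch under those assumptions, not the project's actual code.

```python
import hashlib

def anonymize(student_id: str, response_text: str, salt: str) -> dict:
    """Replace the student ID with an opaque hash before any external call."""
    opaque_id = hashlib.sha256((salt + student_id).encode()).hexdigest()[:12]
    return {"submission_id": opaque_id, "response": response_text}
```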