Abstract:
Large language models (LLMs) are rapidly reshaping higher education. However, their quantitative impact on high-stakes, high-complexity STEM assessments, and the human-AI collaborative mechanisms underlying that impact, remain poorly understood, leaving a gap in AI-driven reform of educational evaluation. We conducted two complementary controlled experiments within the theoretical mechanics course at Tsinghua University: an “AI Challenge” targeting elite undergraduates at competition-level difficulty, and an “AI-Assisted Final Exam Pilot” for students retaking the course at standard difficulty. The data reveal a significant synergistic gain: in the final exam pilot, the average score of the AI-assisted group reached 60.2, substantially outperforming both the control group without assistance (41.2) and the standalone AI baseline (39.0). This indicates that effective human-AI collaboration can exceed the capability of either the student or the AI working alone. Our analysis identifies strategic AI usage as the key determinant of collaborative efficacy. Quantitative examination of interaction logs and problem-solving strategies shows that the variance in students' AI dependence across five standardized stages, which we define as “AI Selectivity,” correlates significantly and positively with collaborative gains. High-performing students tended to apply AI selectively while maintaining overall cognitive independence, whereas indiscriminate AI usage often resulted in inefficient collaboration and suboptimal outcomes. Our findings suggest a critical shift in STEM education: from assessing routine algorithmic proficiency toward fostering independent critical judgment and AI supervisory capabilities. This work offers empirical evidence to inform educational paradigms in the era of artificial intelligence.
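To make the “AI Selectivity” definition concrete, the following is a minimal sketch of how such a variance could be computed for one student. It is illustrative only: the function name `ai_selectivity`, the encoding of AI dependence as a score between 0 and 1 per stage, the choice of population (rather than sample) variance, and the example values are all assumptions, not details taken from the study.

```python
from statistics import pvariance

NUM_STAGES = 5  # the abstract refers to five standardized stages


def ai_selectivity(dependence_by_stage):
    """Variance of one student's AI dependence across the stages.

    Assumptions (not from the study): each entry is a dependence score
    in [0, 1], and population variance is the intended statistic.
    """
    if len(dependence_by_stage) != NUM_STAGES:
        raise ValueError(f"expected {NUM_STAGES} stage scores")
    return pvariance(dependence_by_stage)


# Hypothetical students, for illustration only.
uniform_user = [0.6, 0.6, 0.6, 0.6, 0.6]    # leans on AI equally at every stage
selective_user = [0.0, 0.0, 1.0, 0.0, 0.0]  # uses AI at a single stage

print(ai_selectivity(uniform_user))
print(ai_selectivity(selective_user))
```

Under these assumptions, a student who relies on AI uniformly has zero selectivity regardless of how heavily they use it, while a student who concentrates AI use in particular stages has a higher value.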