Abstract
The use of artificial intelligence (AI) to score foreign language proficiency exams is expanding, driven by the need for scalability, faster turnaround, and cost efficiency, especially in large-scale, high-stakes contexts. AI-assisted scoring commonly includes automated marking of selected-response items, automated essay scoring (AES) for writing, and speech technologies (automatic speech recognition and pronunciation/prosody models) for speaking assessments. Despite these operational benefits, AI scoring introduces critical challenges: construct underrepresentation, bias and fairness risks across accents and demographic groups, limited explainability, vulnerability to gaming, domain shift across prompts and test forms, and governance issues related to data privacy and accountability. This article synthesizes the key problems associated with AI-based scoring and proposes a solutions framework centered on construct validity, human-in-the-loop moderation, rigorous psychometric calibration, continuous bias auditing, robust security controls, and transparent candidate-facing policies (including appeal mechanisms). It argues that AI can be used responsibly in language assessment only when it is embedded within a defensible assessment design that prioritizes validity, reliability, and equity.

This work is licensed under a Creative Commons Attribution 4.0 International License.
