Topline
A blind study led by Stanford Law School professor Julian Nyarko, published Monday, found that AI-generated responses outperformed those written by fellow law professors in 75% of nearly 3,000 head-to-head comparisons, a result the authors themselves called surprising.
Key Facts
When law professors were handed a stack of anonymized answers to student contract questions and asked to pick the better one, they reached for the AI’s response three times out of four.
Across 16 law schools, professors evaluated almost 3,000 anonymized matchups without knowing whether a given answer came from a machine or a colleague.
Professors flagged AI answers as pedagogically misleading or harmful just 3.5% of the time, against 12% for peer-written answers, meaning the human responses were more than three times as likely to be deemed potentially damaging to a student’s understanding.
Nyarko, who directs Stanford’s Legal Innovation through Frontier Technology Lab, said the group is “not advocating for wholesale adoption of AI tutors,” but that “our data suggests that blanket skepticism may be equally unwarranted.”
Why Was Contract Law Tested?
Contract law was chosen precisely because it resists the answer key. The 40 questions used in the study—the kind a student might raise after class or in office hours—demanded synthesis of competing arguments and a defensible conclusion rather than rote recall, testing whether a model could reason where there is no single right answer.
Key Background
The paper was authored by Nyarko with liftlab researcher Alejandro Salinas as first author, alongside colleagues from Yale, New York University, the University of Chicago and other institutions. Participants wrote their own answers before grading anyone else’s, evaluations were conducted blind using multiple scoring methods and the AI outputs were calibrated to match the length and structure of human responses. The team tested a range of systems, including commercial tutoring tools and Google’s NotebookLM, and found that performance varied. Even where the models were hampered by limited context, evaluators still frequently favored them over their human peers. The findings land in the middle of an unresolved debate inside legal education, where some schools are racing to integrate AI while others warn of hallucinations, student overreliance and the slow erosion of the critical-thinking skills a legal education exists to build.
What To Watch For
The authors are emphatic that quality and deployment are separate questions, and they have only addressed the first. Nyarko said the conversation should now move from whether AI can produce accurate, high-quality legal answers to how it can best benefit students.
