AI takes JEE (advanced) test, does well... but not enough for IIT seat
The findings come at a time when AI-based tools are showing promise in cracking tests, raising questions about what it holds for the future of work
An artificial intelligence (AI) module, based on the model underpinning ChatGPT, scored well enough to be in the 80-90 percentile of one of India’s toughest engineering college admission tests but did not score high enough to secure a seat in the premier Indian Institutes of Technology (IITs), an experiment by IIT-Delhi researchers has claimed.
The findings come at a time when AI-based tools are showing promise in cracking tests, such as the American college admission test SAT and the quantitative section of the Graduate Record Examination (GRE), raising questions about what this holds for the future of work once powerful tools such as these become widely adopted.
The experiments with the joint entrance exam (JEE) advanced offer new insights about what such systems, called large language models (LLMs), can achieve, and struggle with.
“GPT-4 got around 35 percent of the questions right, which would put it somewhere in the top 80 to 90 percentile of students. It would have to be in the top 90-100 percentile in order to get into an IIT... It’s almost there,” said Daman Arora, one of the researchers, who was pursuing an MTech at IIT-Delhi’s computer science department along with his colleague Himanshu Gaurav Singh when they collaborated on the project. Mausam, a computer science professor at IIT-Delhi and the founding head of IIT-Delhi’s Yardi School of Artificial Intelligence, oversaw the research.
To give the AI model the challenge, the duo created JEE Bench, with 515 pre-engineering mathematics, physics, and chemistry problems from the past eight editions of the IIT JEE-Advanced Exam.
The tests showed that GPT-4, the latest version of the LLM, performed better than all older versions. The paper said that while GPT-3 had “near random performance”, GPT-3.5 could solve 30% of the questions. GPT-4 performed well with physics and chemistry questions but struggled in “retrieving relevant concepts required to solve the problem and performing algebraic manipulation and arithmetic”.
This is probably because the complexity of reasoning is highest in mathematics and lowest in chemistry, the paper noted. “The typical failure modes of GPT-4, the best model, are errors in algebraic manipulation, difficulty in grounding abstract concepts into mathematical equations accurately and failure in retrieving relevant domain-specific concepts,” the authors wrote in the paper, which is yet to be peer reviewed.
Another issue that AI might face in comparison to humans while writing the test is risk evaluation, said the researchers. The exam contains negative marking; the candidate is awarded a score of +3 for a correct answer, -1 for an incorrect one, and zero when not answered.
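The arithmetic behind that risk evaluation can be sketched in a few lines of Python. This is an illustrative calculation based only on the marking scheme quoted above (+3, -1, 0); the function names and the idea of a single "probability of being right" per question are assumptions made for the example, not the method used in the paper.

```python
# Marking scheme as described in the article:
# +3 for a correct answer, -1 for an incorrect one, 0 if left unanswered.
CORRECT, INCORRECT, SKIPPED = 3, -1, 0


def expected_score(p_correct: float) -> float:
    """Expected marks from attempting a question when the candidate
    (human or AI) is right with probability p_correct (hypothetical input)."""
    return p_correct * CORRECT + (1 - p_correct) * INCORRECT


def break_even_probability() -> float:
    """Confidence at which attempting and skipping have equal expected marks."""
    return (SKIPPED - INCORRECT) / (CORRECT - INCORRECT)


def should_attempt(p_correct: float) -> bool:
    """Attempt only when the expected marks beat leaving the question blank."""
    return expected_score(p_correct) > SKIPPED


if __name__ == "__main__":
    print("break-even confidence:", break_even_probability())
    for p in (0.2, 0.5, 0.9):
        print(p, round(expected_score(p), 2), should_attempt(p))
```

Under this scheme, attempting a question pays off on average only when the chance of being right exceeds one in four; a test taker that answers everything regardless of confidence gives up marks on the questions where it is effectively guessing.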
A preliminary version of the research paper, titled “Have LLMs Advanced Enough? A Challenging Problem Solving Benchmark For Large Language Models”, has been submitted to conferences.
The updated paper will be presented at the Empirical Methods in Natural Language Processing conference in Singapore in December.
To be sure, LLMs do not have reasoning capabilities. “Let’s be clear: An LLM does not create any new truths; they are architecturally incapable of abductive reasoning. LLMs only generate statistically interesting strings of words that are surprisingly coherent yet untethered to any metric for truth,” wrote noted American computer scientist Grady Booch, in a post on Twitter (now X), in March this year.
Professor Mausam, who goes by just one name, said that AI is getting closer to cracking entrance tests such as JEE by the day.
He said, “In a similar study, we tested how well it could perform on material science related questions in GATE, the Master’s entrance test, and it performed rather well.”
OpenAI, when it released GPT-4, said: “GPT-4 exhibits human-level performance on the majority of these professional and academic exams. Notably, it passes a simulated version of the Uniform Bar Examination with a score in the top 10% of test takers”.
With AI making its way into allied sectors, and amid chatter of students using AI to cheat, two other IIT professors decided in February-March this year to test whether AI could help students with their class assignments.
Ishaan Gupta, an assistant professor in the department of Biochemical Engineering and Biotechnology at IIT-Delhi, said: “I asked students in March to take the help of AI, to frame a code to solve a problem in the field of Bioinformatics. I found that it reduced the time it took for students to finish the assignment. Once they had the framework, they could take on slightly more challenging tasks.”
He added that AI can be used in the field as a tool to increase efficiency, since his students do not code every day and might not be familiar with the syntax.
They used OpenAI’s GPT-4, a multimodal model that accepts image and text inputs and emits text outputs.