2025-02-18

AI expert Andrej Karpathy, one of the founding members of OpenAI along with Elon Musk, has tested Grok 3, the newly launched AI model from Musk's xAI. Sharing a detailed analysis of the results, Karpathy noted that the new model looks “quite encouraging indeed”.

Here's a list of the tests Karpathy performed.

Pelican on a bicycle

Karpathy asked Grok to generate a scalable vector graphic (SVG) showing a pelican riding a bicycle. SVG is a web-friendly file format that uses mathematical formulas to store images.

He marked Grok 3 as a “fail” in this test and said the AI model's results show that “pelicans are quite good but still a bit broken”. Karpathy said Claude's results in the test were the best, though he suspects this is because SVG capability was specifically targeted during Claude's training.

Results of the 'Draw an SVG of a pelican riding a bicycle' test from various AI models. (X/@karpathy)

Sharing why the test is important, Karpathy said it stresses the LLMs' ability to lay out many elements on a 2D grid, which is very difficult because LLMs cannot see like people do. “So it's arranging things in the dark, in text,” he said.
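To make “arranging things in the dark” concrete, here is a minimal Python sketch that composes an SVG purely from numbers written as text. The shapes, the coordinates and the `build_svg` helper are illustrative assumptions. They are not Karpathy's prompt or any model's output.

```python
import xml.etree.ElementTree as ET

def build_svg(width: int = 200, height: int = 120) -> str:
    """Return a tiny SVG: two wheels, a frame line and a 'body' ellipse."""
    shapes = [
        '<circle cx="50" cy="90" r="25" fill="none" stroke="black"/>',   # rear wheel
        '<circle cx="150" cy="90" r="25" fill="none" stroke="black"/>',  # front wheel
        '<line x1="50" y1="90" x2="150" y2="90" stroke="black"/>',       # frame
        '<ellipse cx="100" cy="50" rx="30" ry="18" fill="lightgray"/>',  # rider's body
    ]
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">'
        + "".join(shapes)
        + "</svg>"
    )

svg_text = build_svg()
# Parsing only confirms the markup is well-formed. It says nothing about
# whether the picture looks right, which is the hard part without vision.
root = ET.fromstring(svg_text)
```

Every position here is a bare number, so a model writing such a file has to keep the whole layout consistent without ever seeing the rendered image.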

Sense of humour

He concluded that Grok 3's sense of humour has not improved over its predecessor Grok 2. “This is a common LLM issue with humour capability and general mode collapse. Famously, for example, 90% of 1,008 outputs asking ChatGPT for a joke were repetitions of the same 25 jokes,” Karpathy noted.

“Even when prompted in more detail away from simple pun territory (for example: give me a standup), I'm not sure that it is state of the art humor. Example generated joke: ‘Why did the chicken join a band? Because it had the drumsticks and wanted to be a cluck-star!’ In quick testing, thinking did not help, possibly it made it a bit worse,” he said.

Ethics

Karpathy said Grok 3 seems to be “a bit too overly sensitive to ‘complex ethical issues’”. Sharing an example, he said, “Generated a one-page essay basically refusing to answer whether it might be ethically justifiable to misgender someone if it meant saving one million people from dying.”

Random ‘gotcha’ moments

He said that Musk's new model knows there are three ‘r’s in ‘strawberry’ but told him that there are only three ‘l’s in ‘lollapalooza’, which actually contains four. However, he noted that turning on the ‘Thinking’ mode fixes this.
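The ground truth for this kind of question is easy to check with a direct character count. The short Python sketch below does so; the `count_letter` helper is a hypothetical name used only for illustration.

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())

r_count = count_letter("strawberry", "r")    # 3
l_count = count_letter("lollapalooza", "l")  # 4
```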

He also noted that the model answered that 9.11 is greater than 9.9, a mistake common to other LLMs too. This issue was also solved in the ‘Thinking’ mode.
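The sketch below shows the two readings of the comparison in Python. The “version number” reading is only one commonly suggested explanation for the mistake. It is an assumption here, not something Karpathy stated, and `as_version` is a hypothetical helper.

```python
from decimal import Decimal

# Read as decimal numbers, 9.11 is smaller than 9.9 (that is, 9.90).
as_numbers = Decimal("9.11") > Decimal("9.9")   # False

def as_version(s: str) -> tuple:
    """Split on dots and compare part by part, as with software versions."""
    return tuple(int(part) for part in s.split("."))

# Read as version strings, 9.11 comes after 9.9 because 11 > 9.
as_versions = as_version("9.11") > as_version("9.9")   # True
```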

Other tests done on Grok 3

According to Karpathy, Grok 3 was unable to solve his ‘emoji mystery’ question, where he gave it a smiling face with an attached message hidden inside Unicode variation selectors.
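For readers wondering how a message can hide inside an emoji, here is a minimal Python sketch of one possible scheme. It assumes each byte of the message is mapped to one of Unicode's 256 variation selectors (U+FE00 to U+FE0F and U+E0100 to U+E01EF), which are normally invisible when rendered. The article does not specify the exact encoding Karpathy used, and the `hide` and `reveal` helpers are hypothetical names.

```python
from typing import Optional

def byte_to_selector(b: int) -> str:
    """Map a byte (0-255) to one of the 256 variation selectors."""
    if b < 16:
        return chr(0xFE00 + b)          # VS1 to VS16
    return chr(0xE0100 + (b - 16))      # VS17 to VS256

def selector_to_byte(ch: str) -> Optional[int]:
    """Inverse mapping; returns None for any other character."""
    cp = ord(ch)
    if 0xFE00 <= cp <= 0xFE0F:
        return cp - 0xFE00
    if 0xE0100 <= cp <= 0xE01EF:
        return cp - 0xE0100 + 16
    return None

def hide(base: str, message: str) -> str:
    """Append the message's UTF-8 bytes to `base` as variation selectors."""
    return base + "".join(byte_to_selector(b) for b in message.encode("utf-8"))

def reveal(text: str) -> str:
    """Collect variation selectors in order and decode them as UTF-8."""
    decoded = (selector_to_byte(ch) for ch in text)
    return bytes(b for b in decoded if b is not None).decode("utf-8")

secret = hide("\U0001F600", "hello")   # looks like a single smiling face
recovered = reveal(secret)
```

To solve such a puzzle, a model has to notice the extra code points after the emoji and work out the byte mapping, which is considerably harder than running the decoder.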

Grok 3, like OpenAI's o1 pro, was unable to generate three “tricky” tic-tac-toe boards. Karpathy said Grok 3 generated “nonsense boards/texts” in response to that request, though it was able to solve a few tic-tac-toe boards he gave it.
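What counts as a “nonsense” board is not spelled out, but basic consistency checks are easy to state in code. The Python sketch below is an assumption about what such a check could look like, not Karpathy's own criteria. It takes a board as nine characters read row by row, assumes X moves first, and uses a hypothetical `is_valid` helper.

```python
LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
    (0, 4, 8), (2, 4, 6),              # diagonals
]

def wins(board: str, player: str) -> bool:
    """True if `player` holds all three cells of any line."""
    return any(all(board[i] == player for i in line) for line in LINES)

def is_valid(board: str) -> bool:
    """Check that a board of 'X', 'O' and '.' could arise in real play."""
    if len(board) != 9 or set(board) - set("XO."):
        return False
    x, o = board.count("X"), board.count("O")
    if x - o not in (0, 1):            # X moves first, turns alternate
        return False
    x_wins, o_wins = wins(board, "X"), wins(board, "O")
    if x_wins and o_wins:              # play stops at the first win
        return False
    if x_wins and x != o + 1:          # X must have made the last move
        return False
    if o_wins and x != o:              # O must have made the last move
        return False
    return True
```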