
Key takeaways (TL;DR)
  • Current foundation models are ill-suited to the open-ended questions expected of an entry-level finance analyst.
  • A majority of models struggled with tool use in general, and with information retrieval in particular, leading to inaccurate answers, most notably from smaller models such as Llama 4 Scout and Mistral Small 3.1.
  • On average, models performed best on the simple quantitative (37.57% average accuracy) and qualitative retrieval (30.79% average accuracy) tasks. These tasks are easy but time-intensive for finance analysts.
  • On our hardest tasks, the models performed much worse. Ten models scored 0% on the Trends task, and the best result on it was only 28.6%, from Claude Sonnet 3.7.
  • o3 is the best-performing model, reaching 48.3% accuracy, but at an average cost of $3.69 per question. It is followed closely by Claude Sonnet 3.7 Thinking, which achieved 44.1% accuracy at a much lower $1.05 per question.
[….]

My Thoughts 💭

Vals.ai is busting Big Tech's claims that AI will replace everyone. At this cost and with performance this poor, AI is proving unproductive. Accuracy this low is hard to justify for something that takes billions of dollars to develop.
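
Combining the accuracy and price figures quoted above gives a rough sense of cost-effectiveness: cost per *correct* answer, i.e. cost per question divided by accuracy. This is a minimal sketch using only the numbers from the excerpt; the metric itself is my own framing, not something the Vals.ai report computes.

```python
# Figures taken from the benchmark summary above.
# name: (accuracy, cost per question in USD)
models = {
    "o3": (0.483, 3.69),
    "Claude Sonnet 3.7 Thinking": (0.441, 1.05),
}

for name, (accuracy, cost) in models.items():
    # Expected spend to obtain one correct answer
    cost_per_correct = cost / accuracy
    print(f"{name}: ${cost_per_correct:.2f} per correct answer")

# Output:
# o3: $7.64 per correct answer
# Claude Sonnet 3.7 Thinking: $2.38 per correct answer
```

By this measure, Claude Sonnet 3.7 Thinking is roughly three times cheaper per correct answer than o3, which only sharpens the point: even the cheaper option is expensive for work a junior analyst does routinely.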