
AI Topics for Lawyers to Watch in 2025, Part Two: Benchmarking


In this three-part series, I’m highlighting AI topics for lawyers to watch this year. Part one last week explored AI agents and offered considerations for lawyers who are interested in using them. Come back next week for part three, when I’ll check in on the status of the 40+ AI-related copyright infringement scraping cases and discuss what we might see in 2025.

Issue 12 

Welcome to part two of AI Topics for Lawyers to Watch in 2025, which explores benchmarking. In case you missed the headlines in 2024 about AI benchmarking, today I’m diving into what benchmarks are, what’s happening with benchmarking of AI tools for lawyers, why it matters, what you can expect to see in 2025, and how lawyers can think about benchmarks in relation to their evaluations of AI tools for legal practice. 

What are Benchmarks?  

Benchmarks are datasets and tasks that have been standardized to measure the capabilities of an AI model across an industry.[i] For example, in 2023, researchers created a benchmark called LegalBench, which included 162 legal reasoning tasks evaluated across 20 large language models (“LLMs”).[ii] Additionally, in 2024, Stanford RegLab and the Institute for Human-Centered AI published a study (which I will refer to in this newsletter as the “Stanford study”) evaluating GPT-4 and AI tools from Thomson Reuters and LexisNexis on legal research tasks.[iii] The Stanford study described one of its contributions as constructing the first benchmark dataset, consisting of 202 legal queries, for evaluating vulnerabilities in AI tools for the legal industry.[iv]

In the context of AI model assessments, benchmarks can be distinguished from “evals” (short for evaluations), which are intended to measure the real-world performance of an AI tool at a deeper level, and from tests, which are intended to validate whether a specific tool performs as anticipated.[v] It is important to recognize that the terms benchmark, evals, and tests are not used with perfect consistency across the industry in relation to AI tools for lawyers.
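For readers who want a concrete picture of what a benchmark is mechanically, the idea can be reduced to a fixed set of inputs paired with reference answers and a scoring rule. The sketch below is purely illustrative: the queries, reference answers, and simple substring-matching rule are hypothetical and are not drawn from LegalBench, the Stanford study, or any real benchmark.

```python
# Hypothetical mini-benchmark: fixed queries paired with reference answers.
benchmark = [
    {"query": "What is the limitations period in this scenario?",
     "reference": "three years"},
    {"query": "Which court issued the controlling decision?",
     "reference": "Ninth Circuit"},
]

def score(model_answers, benchmark_items):
    """Return the fraction of model answers that contain the reference answer.

    Real benchmarks use far more sophisticated scoring; a simple
    case-insensitive substring match is used here only for illustration.
    """
    correct = sum(
        1
        for item, answer in zip(benchmark_items, model_answers)
        if item["reference"].lower() in answer.lower()
    )
    return correct / len(benchmark_items)

# One correct answer and one incorrect ("hallucinated") answer -> 0.5 accuracy.
answers = ["The limitations period is three years.",
           "It was the Second Circuit."]
print(score(answers, benchmark))  # prints 0.5
```

Because every tool is scored against the same fixed inputs and the same rule, results become comparable across tools — which is the whole point of standardization.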

Challenges with Benchmarking and Evaluating AI Tools for the Legal Industry 

Researchers have pointed out that applications built using LLMs are challenging to evaluate because LLMs produce open-ended responses with an effectively unlimited output space.[vi] The sheer number of use cases and AI tools that have been built for lawyers in a short period of time makes creating comprehensive benchmarks and evaluations even harder. The speed at which LLMs are evolving compounds the difficulty, as benchmarking information can be quickly rendered obsolete.[vii] Additionally, the lack of transparency in the operation of LLMs makes benchmarking more difficult.[viii] (You can learn more about the challenges with transparency in relation to generative AI models in the first chapter of A Lawyer’s Practical Guide to AI, which you can read for free here.) Finally, there is the risk of test leakage, which occurs when the data used for testing finds its way into an LLM’s training data.[ix]

Why are Benchmarking and Evaluations Important? 

Given the challenges associated with benchmarking and evals of AI tools for lawyers, is this a topic that a typical lawyer needs to devote attention to in 2025? My vote is yes. Benchmarking and evals matter for lawyers who use AI, or are thinking about using it, for two reasons: they can in theory provide objective standards for comparing certain AI tools, and they give lawyers data about where AI for the legal industry stands at points in time and how AI tool options are progressing as we move through 2025 and beyond. In most cases, some benchmarking and evals data, even if imperfect, will be more useful to lawyers than no data. The Stanford study identified the most important takeaway of its results as the need for thorough and transparent benchmarks and evaluations of AI tools for the legal industry.[x]

An Essential Consideration for Assessing Benchmarks and Evals 

Lawyers who pay attention to benchmarking and evals in 2025 will need to distinguish between independent benchmarking and evaluation efforts, such as the ones discussed above, and internal efforts by the companies making AI tools for lawyers. For instance, in 2024, Harvey published its BigLaw Bench, which it described as a public version of the company’s internal dataset for evaluating the performance of LLMs and model systems on legal tasks.[xi]

While an AI tool company’s own benchmarks and evaluations may provide useful data, it’s important to be mindful of the source of any data you rely on for decision-making. Consider the Stanford study referenced above, which found that Thomson Reuters/Westlaw’s AI-Assisted Research tool hallucinated one third of the time, while Thomson Reuters’s Ask Practical Law AI and LexisNexis’s Lexis+ AI produced hallucinations in more than one of every six responses. All of these hallucination rates were lower than GPT-4’s (which the study found hallucinated 43 percent of the time), yet still substantial.[xii] The study further noted that its findings were not intended as an unbiased estimate of the population-wide hallucination rate in AI legal queries, but rather to determine whether retrieval augmented generation (“RAG”) techniques had eliminated hallucinations.[xiii] LexisNexis and Thomson Reuters both pushed back against the study, responding that their internal testing and customer feedback demonstrate higher accuracy than the study results, with Thomson Reuters asserting an accuracy rate of approximately 90% for its AI-Assisted Research tool.[xiv] Following the preprint publication of the Stanford study, another AI tool company, Paxton AI, announced that it had achieved 94% accuracy on tasks in the benchmark created by the Stanford study. As the body of available benchmarking and evaluation data grows over time, remember to consider the source of any information you assess when selecting AI tools for your organization.

What Benchmarking Efforts Can We Expect to See in 2025?

Vals AI has announced that it is currently working with a selection of U.S.-based law firms on a benchmarking study of legal industry AI platforms to be published in 2025. Outside the U.S., London-based Litig is pursuing an AI benchmarking initiative. Additionally, I anticipate that 2025 will bring more announcements of internal benchmarking and evaluation efforts from companies that make AI tools for lawyers.

Practical Tips for Lawyers in the Absence of Benchmarking Data 

As mentioned above, in most cases some benchmarking and evals data, even if imperfect, will be more useful to lawyers than no data. But considering there are over 50 use cases for legal industry AI tools, and over 200 such tools on the market, the vast majority of AI tools for lawyers are unlikely to be included in independent benchmarking or evals in the near future. This is the reality of navigating a world with incomplete and imperfect information, and it should not be used as justification to delay learning about and evaluating AI until better information exists, or to consider only the AI tool options that have benchmarking and evals data.

Instead, lawyers should be prepared to conduct their own evaluations and testing of the AI tools that they have identified as being most promising for their unique organizations before adopting AI solution(s). Evals and testing can help you assess any performance claims about an AI tool for yourself, and determine if the tool is a good match for your organization. Chapter 5 of A Lawyer’s Practical Guide to AI lays out a process to help you identify whether there are AI tools that potentially meet your organization’s needs, and if so, how to evaluate them before implementing them. 

Additionally, lawyers should always verify output from generative AI tools for accuracy, even (or perhaps especially) as accuracy improves. As the Stanford study noted, its results serve as evidence of lawyers’ responsibilities to supervise and verify AI output.[xv]

See you back here next week for part three, for a big picture update on the status of the 40+ AI-related copyright infringement scraping cases, and what we might see in 2025. 

Thanks for being here. 

Jennifer Ballard
Good Journey Consulting

 

[i] Shayan Mohanty, John Singleton, and Parag Mahajani, LLM benchmarks, evals and tests, Thoughtworks (Oct. 31, 2024), https://www.thoughtworks.com/en-us/insights/blog/generative-ai/LLM-benchmarks,-evals,-and-tests

[ii] Neel Guha, Julian Nyarko, Daniel Ho, Christopher Ré, Adam Chilton, Aditya K, Alex Chohlas-Wood, Austin Peters, Brandon Waldon, Daniel Rockmore, Diego Zambrano, Dmitry Talisman, Enam Hoque, Faiz Surani, Frank Fagan, Galit Sarfaty, Gregory Dickinson, Haggai Porat, Jason Hegland, Jessica Wu, Joe Nudell, Joel Niklaus, John Nay, Jonathan Choi, Kevin Tobia, Margaret Hagan, Megan Ma, Michael Livermore, Nikon Rasumov-Rahe, Nils Holzenberger, Noam Kolt, Peter Henderson, Sean Rehaag, Sharad Goel, Shang Gao, Spencer Williams, Sunny Gandhi, Tom Zur, Varun Iyer, Zehua Li, LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models, arXiv:2308.11462v1, 1 (2023), https://arxiv.org/pdf/2308.11462

[iii] Varun Magesh, Faiz Surani, Matthew Dahl, Mirac Suzgun, Christopher D. Manning, and Daniel E. Ho, Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools, at 3, Stanford RegLab preprint (2024), https://dho.stanford.edu/wp-content/uploads/Legal_RAG_Hallucinations.pdf

[iv] Id. at 3, 21. 

[v] Mohanty et al., supra note i. 

[vi] Mohanty et al., supra note i. 

[vii] Magesh et al., supra note iii at 21. 

[viii] Sean Harrington, Evaluating Generative AI for Legal Research: A Benchmarking Project, AI Law Librarians, (May 24, 2024), https://www.ailawlibrarians.com/2024/05/24/new-project-evaluating-genai/

[ix] Magesh et al., supra note iii at 21. 

[x] Magesh et al., supra note iii at 24.  

[xi] Introducing BigLaw Bench, Harvey, (Aug. 29, 2024), https://www.harvey.ai/blog/introducing-biglaw-bench

[xii] Magesh et al., supra note iii at 4, 13. 

[xiii] Magesh et al., supra note iii at 22. 

[xiv] Jeremy Kahn, What a study of AI copilots for lawyers says about the future of AI for everyone, Fortune, (Jun. 4, 2024, 13:31), https://fortune.com/2024/06/04/stanford-hai-legal-ai-copilot-study-rag-llms-future-of-ai/

[xv] Magesh et al., supra note iii at 1. 
