
Making LLMs Worth Every Penny: Balancing Costs & Performance in Financial NLP

February 27, 2024


Advances in Natural Language Processing (NLP), especially in text classification, have been substantial. Yet applying these technologies in data-limited domains like finance remains a challenge. Traditional classifiers demand thousands of labeled examples, which makes them impractical in such settings. Recent few-shot techniques, including contrastive learning, and Large Language Models (LLMs) like GPT-4, which require as few as 1-5 examples per class, offer promising alternatives.

The Few-Shot Approach and LLMs in Finance

In finance, where labeled data is often scarce, few-shot methods and LLMs such as GPT-3.5, GPT-4, Cohere’s Command-nightly, and Anthropic’s Claude 2 have emerged as practical solutions. Using only 1 to 3 examples per class, these models have shown impressive performance. In particular, samples curated by human experts significantly outperform randomly selected ones, which shows that data quality is crucial in few-shot scenarios.
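As a rough illustration of the few-shot setup, the sketch below assembles a classification prompt from a handful of labeled examples per class. The function name `build_few_shot_prompt`, the prompt wording, and the example queries are all hypothetical; this is not the prompt format used in the study.

```python
from typing import Dict, List


def build_few_shot_prompt(examples: Dict[str, List[str]], query: str) -> str:
    """Assemble a classification prompt from a few labeled examples per class."""
    labels = sorted(examples)
    lines = [
        "Classify the customer query into one of these intents: "
        + ", ".join(labels) + ".",
        "",
    ]
    # One "Query / Intent" pair per labeled example.
    for label in labels:
        for text in examples[label]:
            lines.append(f"Query: {text}")
            lines.append(f"Intent: {label}")
            lines.append("")
    # The unlabeled query goes last, leaving the intent for the model to fill in.
    lines.append(f"Query: {query}")
    lines.append("Intent:")
    return "\n".join(lines)


# Hypothetical expert-picked examples (one per class here, for brevity).
examples = {
    "change_pin": ["How do I change the PIN on my card?"],
    "get_physical_card": ["When will my physical card arrive?"],
}
prompt = build_few_shot_prompt(examples, "I want a new PIN")
print(prompt)
```

The resulting string would be sent to whichever LLM is being evaluated. Swapping randomly sampled examples for expert-curated ones only changes the contents of `examples`, not the prompt structure.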

Cost Analysis of LLMs

While LLMs offer enhanced performance, they come with substantial operating costs. A detailed cost analysis shows that using LLMs like GPT-4 can be expensive, especially for small organizations. For instance, a 3-shot experiment with GPT-4 on approximately 3k samples costs about $740, or roughly $0.25 per classified sample. This highlights the need for methods that leverage LLMs without incurring such expenses.
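The arithmetic behind such figures can be sketched as follows. The first calculation uses only the numbers quoted above ($740 for roughly 3,000 samples). The second is a generic token-based estimator whose token counts and prices are placeholder assumptions, not actual vendor pricing.

```python
def cost_per_call(total_cost_usd: float, n_samples: int) -> float:
    """Average cost of classifying one sample."""
    return total_cost_usd / n_samples


def estimate_cost(
    n_samples: int,
    prompt_tokens: int,
    output_tokens: int,
    usd_per_1k_prompt: float,
    usd_per_1k_output: float,
) -> float:
    """Estimate total spend when every call has a similar token footprint."""
    per_call = (
        (prompt_tokens / 1000) * usd_per_1k_prompt
        + (output_tokens / 1000) * usd_per_1k_output
    )
    return n_samples * per_call


# Figures quoted in the article: about $740 for approximately 3k samples.
per_call = cost_per_call(740.0, 3000)
print(f"approx. ${per_call:.2f} per sample")

# Placeholder values, chosen only to show how the estimator is called.
total = estimate_cost(
    n_samples=1000,
    prompt_tokens=5000,
    output_tokens=10,
    usd_per_1k_prompt=0.01,
    usd_per_1k_output=0.02,
)
print(f"estimated total: ${total:.2f}")
```

Because the prompt dominates the token count in few-shot classification, shrinking the number of in-prompt examples is the most direct way to lower the per-call cost.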

[Figure: LLM Costs]

Retrieval-Augmented Generation (RAG) for Cost Reduction

To address the high costs of LLMs, we used Retrieval-Augmented Generation (RAG). This approach selects a small fraction of the most relevant examples for each inference call, which shortens the prompt and drastically reduces costs. For example, using RAG, Anthropic's Claude 2 achieved a higher micro-F1 (µ-F1) score than GPT-4 while costing $700 less. This result shows the potential of cost-effective LLM inference in financial settings.
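The retrieval step can be sketched as a top-k similarity search over the pool of labeled examples. The version below uses bag-of-words cosine similarity from the standard library as a stand-in; a real system would more likely use a sentence-embedding model, and the function names and example pool here are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def bow(text: str) -> Counter:
    """Lowercased bag-of-words vector (a simple stand-in for an embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve_top_k(
    query: str, pool: List[Tuple[str, str]], k: int
) -> List[Tuple[str, str]]:
    """Return the k (text, label) examples most similar to the query."""
    q = bow(query)
    ranked = sorted(pool, key=lambda ex: cosine(q, bow(ex[0])), reverse=True)
    return ranked[:k]


# Hypothetical labeled pool; only the retrieved subset would enter the prompt.
pool = [
    ("How do I change the PIN on my card?", "change_pin"),
    ("When will my physical card arrive?", "get_physical_card"),
    ("What is the exchange rate for euros?", "exchange_rate"),
]
selected = retrieve_top_k("I forgot my PIN and want to change it", pool, k=2)
print(selected)
```

Only the `selected` examples are placed in the prompt for that query, so prompt length scales with `k` rather than with the size of the full example pool.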

Data Generation in Low-Resource Settings

In scenarios where collecting large datasets is infeasible, as is often the case in finance, generating additional data with GPT-4 has proven effective. Starting from a limited dataset (231 examples hand-picked by a human expert) and using GPT-4 for data augmentation produced significant performance improvements. However, there is a threshold beyond which the quality of generated data degrades, which underscores the importance of using real data over artificial data when possible.

Error Analysis in LLMs and MLMs

Error analysis of the top-performing models, the LLMs GPT-3.5 and GPT-4 and the masked language model (MLM) MPNet-v2, revealed that many misclassifications stem from overlapping vocabulary and contextual similarity between classes. For example, GPT-4 frequently misclassified instances labeled "Get Physical Card" as "Change Pin" because the word "PIN" appeared in the test instances. These findings highlight how hard it is to differentiate specific concerns in financial transactions, and why clear and distinct labels matter.
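A simple way to surface such patterns is to count the most frequent (true label, predicted label) pairs among the errors. The sketch below is a minimal illustration with made-up predictions; the helper name `top_confusions` and the label strings are hypothetical, and the data is not from the study.

```python
from collections import Counter
from typing import List, Tuple


def top_confusions(
    y_true: List[str], y_pred: List[str], n: int = 5
) -> List[Tuple[Tuple[str, str], int]]:
    """Return the n most common (true, predicted) pairs among misclassifications."""
    pairs = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return pairs.most_common(n)


# Made-up labels and predictions, for illustration only.
y_true = ["get_physical_card", "get_physical_card", "change_pin", "get_physical_card"]
y_pred = ["change_pin", "change_pin", "change_pin", "get_physical_card"]

confusions = top_confusions(y_true, y_pred)
for (true_label, predicted_label), count in confusions:
    print(f"{true_label} -> {predicted_label}: {count}")
```

Inspecting the test instances behind the top pairs is then a manual step, which is how a shared keyword such as "PIN" would be spotted.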


Our comprehensive study of few-shot text classification using LLMs and MLMs underscores the trade-off between cost and performance. Conversational LLMs offer a practical way to get accurate results in data-limited scenarios like finance. While cost remains a consideration, especially for smaller organizations, methods like RAG offer a way to use LLMs cost-effectively. To further support the development of robust financial AI systems, we provide the financial community with a human expert-curated subset of Banking77.
