
Fine-Tuning Large Language Models for Specialized Domains: A Practical Guide

Part 2: Fine-tuning and Evaluating LLMs

Apr 25, 2025

In our previous article, we shed light on the limitations of general-purpose AI models in extracting intricate legal clauses from contracts, highlighting the need for domain-specific fine-tuning.

In this article, we will dive into the specifics of supervised fine-tuning of large language models (LLMs) using LoRA (Low-Rank Adaptation), a parameter-efficient fine-tuning (PEFT) technique. This targeted approach involves training the LLM on a curated dataset of labeled text from the target domain, in our case legal contracts, allowing it to learn how to extract complex clauses. By leveraging highly curated labeled data, we can significantly enhance the accuracy and reliability of the LLM in tasks such as contract analysis and legal document review.

Supervised Fine-Tuning: A Closer Look

Why fine-tune an LLM instead of an NLP model?

Fine-tuning an LLM offers several advantages over using a standalone NLP model. Traditional NLP models such as spaCy pipelines or BERT-based transformers require extensive task-specific training data to reach high accuracy, and they may struggle with out-of-distribution generalization, failing to perform well on unseen data or tasks. Furthermore, even for models that have been fine-tuned on a specific task, adding new labels or tasks often requires additional training data, limiting their adaptability to new requirements.

In contrast, LLMs are pre-trained on massive datasets, allowing them to capture a broader range of linguistic patterns and complex relationships. Additionally, LLMs are typically more robust and can handle more nuanced tasks, making them a more versatile choice for many traditional NLP applications.

Full Fine-Tuning vs PEFT

Full fine-tuning of a large language model involves training the entire model and updating all of its weights on the target task. While this approach usually offers superior performance on specialized tasks, with improved relevance and accuracy, it comes with notable drawbacks: high computational and resource costs, a risk of overfitting, catastrophic forgetting, and bias amplification.

A more efficient alternative is PEFT. Specifically, LoRA (Low-Rank Adaptation) enables much cheaper fine-tuning by freezing the pre-trained weights and adding a small set of learnable low-rank weight matrices, called adapters, alongside them. This lets the model adapt to the target task without significantly altering the original model's weights, which is particularly beneficial when working with limited labeled data. By leveraging the strengths of the pre-trained model and adapting it to the target task, PEFT can improve performance while reducing the risk of overfitting.

For an in-depth comparison, check out this research article explaining the differences in detail.
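For readers who want to see what this looks like in code, here is a minimal LoRA setup sketch using the Hugging Face peft library. The checkpoint name, rank, and target modules below are illustrative assumptions, not the exact configuration used in this article:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the base model whose weights stay frozen (illustrative checkpoint name)
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-14B-Instruct")

lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling factor for the adapter output
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# Wrap the base model; only the small adapter weights are trainable
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all weights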

LoRA Architecture

In this article, we will fine-tune Qwen 2.5 14B using LoRA and evaluate its performance.

Fine-Tuning Qwen 2.5 14B

In part 1 of this series, we evaluated Qwen 7B on legal clause extraction. However, after fine-tuning that model, we noticed significant hallucinations that made its performance even worse. We therefore decided to redo the base-model evaluation using Qwen 14B instead of 7B, running it against the labeled legal clauses and calculating the precision, recall, and F1 scores:

# Performance scores of the base model Qwen 14B
{
  "overall_metrics": {
    "precision": 0.27192982456140347,
    "recall": 0.24122807017543862,
    "f1": 0.2518796992481203
  }
}

In this article, we are going to fine-tune Qwen 14B to improve its ability to extract complex legal clauses. Here are the steps:

  • Data Labeling: Labeling the data involves annotating it with relevant labels, tags, or descriptions that can help the models understand the context and relationships between different data points.
Labeled dataset visualization in UbiAI
  • Data Pre-processing: We convert the labeled data into an instruction-tuning dataset with a system prompt, a user prompt, and a response. This format is required for LLM fine-tuning (a conversion sketch follows this list).

Here is an example of the labeled annotations that form the response portion of a prompt-response pair:

 {"annotation": [{"label": "Parties", "text": "Golfers Incorporated"}, {"label": "Governing Law", "text": "The Agreement shall be governed by and construed under the laws of the State of Florida in the United States of America,  and venue for any such legal action shall be in the Circuit Court or County Court in Orlando, FL or the U.S. District Court having jurisdiction  over Orlando, FL."}, {"label": "Anti-Assignment", "text": "Neither party to this Agreement shall assign the rights and benefits herein without the prior written consent of the other party."}, {"label": "Insurance", "text": "Company agrees, at its own expense, to obtain and maintain general comprehensive liability insurance, with an insurance  company that has a rating of A++ (per AM Best), insuring North as a \"named insured party\", against any claims, suits, losses and damages  arising out of or caused by Company's use of North's Likeness."}, {"label": "Insurance", "text": "Such insurance policy  shall be maintained with limits of not less than two million dollars ($2,000,000)."}, {"label": "Insurance", "text": "A copy of such insurance policy shall be provided to North within thirty (30) days after execution of this Agreement."}]} 
  • Fine-Tuning: This step uses UbiAI’s no-code platform to fine-tune the Qwen 14B model without requiring extensive coding expertise. The platform provides an intuitive interface that allows users to upload their data, select a model, and fine-tune it for specific tasks.
UbiAI Model Training Configuration Window
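As mentioned in the pre-processing step above, here is a minimal sketch of how a labeled record could be turned into a chat-style instruction-tuning example. The prompt wording and message format are illustrative assumptions, not UbiAI's exact internal format:

import json

# Illustrative system prompt (an assumption, not the one used in this article)
SYSTEM_PROMPT = (
    "You are a legal assistant. Extract the requested clauses from the "
    "contract below and return them as JSON."
)

def to_instruction_example(contract_text: str, record: dict) -> dict:
    # One training example: system prompt, user prompt (the contract text),
    # and the expected response (the labeled annotations serialized as JSON)
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": contract_text},
            {"role": "assistant", "content": json.dumps(record["annotation"])},
        ]
    }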

Evaluating Fine-tuned LLMs

To create the ground truth data, we label an additional 20 contracts using the UbiAI platform and run evaluations against this labeled data. Following our previous article, we use the Levenshtein distance between the target and predicted data:

# Assuming the python-Levenshtein package provides the edit distance
# (an assumption; any Levenshtein implementation would work)
from Levenshtein import distance as levenshtein_distance

def text_similarity(self, text1: str, text2: str) -> float:
    """Calculate similarity between two texts."""
    text1 = self.normalize_text(text1)
    text2 = self.normalize_text(text2)

    if not text1 or not text2:
        return 0.0

    # Calculate Levenshtein distance, normalized by the longer text length
    dist = levenshtein_distance(text1, text2)
    max_len = max(len(text1), len(text2))
    if max_len == 0:
        return 0.0

    similarity = 1 - (dist / max_len)
    return similarity
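The normalize_text helper used above is not shown in the article; here is a plausible sketch (an assumption on our part, not the actual implementation) that lowercases the text and collapses whitespace before comparison, written as a method of the same evaluator class:

import re

def normalize_text(self, text: str) -> str:
    # Hypothetical normalization: lowercase, collapse runs of whitespace,
    # and strip surrounding punctuation before the edit-distance comparison
    text = (text or "").lower()
    text = re.sub(r"\s+", " ", text).strip()
    return text.strip(".,;:")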

Let’s look at the evaluation results on 20 labeled contracts before and after fine-tuning:

Before fine-tuning:

{
  "overall_metrics": {
    "precision": 0.27192982456140347,
    "recall": 0.24122807017543862,
    "f1": 0.2518796992481203
  }
}

After fine-tuning:

{
  "overall_metrics": {
    "precision": 0.6165413533834587,
    "recall": 0.5350877192982456,
    "f1": 0.5511695906432749
  }
}

Overall, F1 more than doubles after fine-tuning (from 0.25 to 0.55), with similar gains in precision and recall!
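For reference, here is a minimal sketch of how such overall scores can be computed once each predicted clause span has been matched (or not) against the ground truth; the formulas are the standard precision, recall, and F1 definitions:

def compute_metrics(matches: int, num_predicted: int, num_reference: int) -> dict:
    # Precision: matched predictions over all predictions;
    # recall: matched predictions over all ground-truth spans
    precision = matches / num_predicted if num_predicted else 0.0
    recall = matches / num_reference if num_reference else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}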

Looking at the performance metrics for each label, we see that, in addition to the overall increase, the fine-tuned model detected labels that the base model had not extracted at all.

Although we see a significant increase in performance, there is still some room for improvement. Let’s check some of the predictions against the ground truth labels:

{
  "label": "Agreement Date",
  "reference_text": "7th day of September, 1999.",
  "api_text": "this 7th day of September, 1999",
  "similarity": 0.8333333333333334,
  "matched": true
},
{
  "label": "Effective Date",
  "reference_text": null,
  "api_text": "this 7th day of September, 1999",
  "similarity": 0.0,
  "matched": false,
  "false_positive": true
}

The “api_text” field is the prediction from the fine-tuned model, and “reference_text” is the ground truth. For instance, we notice that the model is still producing false positives for the Effective Date label. This indicates that we might need to provide more examples or improve our prompt description.
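To make the matched and false_positive flags concrete, here is a sketch of one plausible per-label matching rule built on the text_similarity function from earlier (shown there as a method); the 0.8 threshold is our assumption, consistent with the examples above, not a value stated in the article:

def match_prediction(pred_text: str, reference_texts: list[str], threshold: float = 0.8):
    # Returns (best_similarity, matched) for one predicted clause span
    if not reference_texts:
        # No ground-truth span carries this label, so any prediction
        # for it is counted as a false positive
        return 0.0, False
    best = max(text_similarity(pred_text, ref) for ref in reference_texts)
    return best, best >= threshold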

Conclusion

In this article, we explored the process of fine-tuning large language models for specialized domains, specifically in the context of legal contract analysis. We demonstrated the effectiveness of using the LoRA technique and LLMs like Qwen 14B in improving the accuracy and reliability of clause extraction tasks. By leveraging labeled data and fine-tuning these models, we can achieve significant performance gains and improve the overall quality of our results. However, as we’ve seen, there is still room for improvement, and future work will focus on addressing these limitations and pushing the boundaries of what’s possible with fine-tuned LLMs.

Stay tuned for our next article, where we will dive into knowledge distillation and synthetic data generation!

If you have any questions, send us an email at admin@ubiai.tools or connect with me directly on LinkedIn.


Written by Walid Amamou

Founder and CEO of UBIAI | PhD in Physics.
