Fine-Tuning Large Language Models for Specialized Domains: A Practical Guide
Part 1: Selecting and Evaluating the Right LLM
Introduction
Relying on general-purpose AI models for high-accuracy applications in specific domains often reveals inherent limitations. For instance, in the healthcare sector, OpenAI’s GPT-3 was found to generate hallucinations in a study evaluating AI-generated responses for clinical use, highlighting its inability to consistently interpret complex medical terminology. In legal applications, early tests with ChatGPT showed that the model could fabricate legal citations, as seen in a case where an attorney submitted AI-generated references that did not exist.
This unreliability can manifest in various risks, including misdiagnosis, discrimination, the spread of misinformation, privacy breaches, and even accidents. The very general nature of these models also makes it difficult to anticipate their potential uses and associated risks, even within regulated environments. It becomes evident that for optimal performance in specialized applications, a more tailored approach is required.
Fine-tuning large language models has emerged as a crucial process for adapting these advanced algorithms to specific tasks or domains, thereby enhancing their performance on specialized applications. Fine-tuning offers considerable advantages, including reduced computational costs compared to training a model from the ground up, and the ability to leverage cutting-edge models without the need for extensive foundational development.
However, a significant hurdle in developing specialized LLMs is the scarcity of high-quality, annotated training data. Languages with fewer digital resources face even greater challenges in obtaining sufficient linguistic data for training and annotation. Annotating data is often expensive, time-consuming, and susceptible to inconsistencies and biases introduced by human annotators.
One approach that has gained popularity for addressing the data scarcity problem is knowledge distillation, which transfers the advanced capabilities of large, often proprietary, LLMs to smaller, more accessible models. In this process, the larger LLM acts as a “teacher,” guiding the training of a smaller “student” model by labeling unlabeled data or providing target responses, often combined with automated data validation. But first, we need to (1) determine whether a pre-trained language model can accurately output the right information without prior domain fine-tuning, and (2) understand where and why the model fails.
In this series of 5 articles, we will show how to:
- Article 1: Select and evaluate a base model against a human-labeled legal dataset
- Article 2: Fine-tune smaller LLMs and evaluate their performance
- Article 3: Distill knowledge and generate data from the fine-tuned model for further fine-tuning
- Article 4: Deploy and monitor
- Article 5: Evaluate fine-tuned models in production
Selecting and Evaluating the Right LLM
For this first article of the series, we are going to evaluate pre-trained LLMs on their ability to correctly extract legal clauses from complex contracts, benchmarked against human-labeled datasets. In particular, we are interested in extracting the important clauses below:
- PARTIES
- AGREEMENT DATE
- EFFECTIVE DATE
- EXPIRATION DATE
- RENEWAL TERM
- GOVERNING LAW
- EXCLUSIVITY
- NO-SOLICIT OF CUSTOMERS
- NO-SOLICIT OF EMPLOYEES
- ROFR/ROFO/ROFN
- ANTI-ASSIGNMENT
- PRICE RESTRICTIONS
- MINIMUM COMMITMENT
- LICENSE GRANT
- POST-TERMINATION SERVICES
- WARRANTY DURATION
- INSURANCE
- COVENANT NOT TO SUE
- CHANGE OF CONTROL
- AUDIT RIGHTS
- UNCAPPED LIABILITY
- CAP ON LIABILITY
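For reference in the evaluation code that follows, these target labels can be collected in a simple Python list (the TARGET_LABELS name is ours, for illustration; the casing matches the labels used in the extraction prompt later in this article):

# Target clause labels for extraction and evaluation
TARGET_LABELS = [
    "Parties", "Agreement Date", "Effective Date", "Expiration Date",
    "Renewal Term", "Governing Law", "Exclusivity", "No-Solicit Of Customers",
    "No-Solicit Of Employees", "Rofr/Rofo/Rofn", "Anti-Assignment",
    "Price Restrictions", "Minimum Commitment", "License Grant",
    "Post-Termination Services", "Warranty Duration", "Insurance",
    "Covenant Not To Sue", "Change Of Control", "Audit Rights",
    "Uncapped Liability", "Cap On Liability",
]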
The initial step involves carefully selecting the most appropriate base model for the task at hand. To do so, we need to evaluate the base model against ground truth data provided by subject matter experts. Below is an example of clauses labeled by human experts.
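The labels follow the same "annotation" schema that appears in the extraction prompt later in this article. The record below is an illustrative, constructed example rather than an actual entry from the dataset:

"annotation": [
    {
        "label": "Agreement Date",
        "text": "made and entered into as of June 1, 2020",
        "propertiesList": [],
        "commentsList": []
    },
    {
        "label": "Governing Law",
        "text": "shall be governed by and construed in accordance with the laws of the State of Delaware",
        "propertiesList": [],
        "commentsList": []
    }
]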
Unlike traditional text, legal contracts contain legalese that is difficult for even humans to understand. Therefore, it is very important to evaluate our base model against high-quality datasets labeled by subject matter experts, to quantify the model's precision, recall, and F1 scores.
For our base model, we are going to prompt Qwen 2.5 7B Instruct Turbo to extract the relevant clauses and benchmark it against 20 expert-labeled contracts. We will focus on the F1, precision, and recall metrics; their definitions are below for reference:
F1 = 2 * (Precision * Recall) / (Precision + Recall)  # Ranges from 0 to 1, with 1 being the best possible score; a balanced measure of performance, especially useful for imbalanced datasets
Precision = True Positives / (True Positives + False Positives)  # The proportion of correct positive predictions out of all positive predictions made by the model
Recall = True Positives / (True Positives + False Negatives)  # The proportion of actual positive samples that were correctly identified by the model
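As a worked example: with 17 true positives, 0 false positives, and 1 false negative, precision = 17/17 = 1.0, recall = 17/18 ≈ 0.944, and F1 = 2 * (1.0 * 0.944) / (1.0 + 0.944) ≈ 0.971. These are exactly the numbers we will see for the agreement date entity later in this article.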
To generate the LLM response, we are going to call the UbiAI API and use the prompt below for zero-shot extraction. To help the model understand the task better, we provide a detailed description for each clause to extract as shown below:
# Prompt configuration for zero-shot clause extraction
class UBIAI:
    def __init__(self):
        self.system_prompt = "You are a legal expert analyst proficient at extracting legal clauses from contracts"
        self.user_prompt = '''Extract the following entities from the contract below in JSON format:
Example of JSON format:
"annotation": [
    {
        "label": "Document Name",
        "text": "DISTRIBUTOR AGREEMENT",
        "propertiesList": [],
        "commentsList": []
    },
    {
        "label": "Parties",
        "text": "Distributor",
        "propertiesList": [],
        "commentsList": []
    },
    {
        "label": "Parties",
        "text": "Electric City Corp.",
        "propertiesList": [],
        "commentsList": []
    }
]
Here are the entities:
PARTIES: Extract the parts (if any) of this contract related to "Parties" that should be reviewed by a lawyer. Details: The two or more parties who signed the contract
AGREEMENT DATE: Extract the parts (if any) of this contract related to "Agreement Date" that should be reviewed by a lawyer. Details: The date of the contract
EFFECTIVE DATE: Extract the parts (if any) of this contract related to "Effective Date" that should be reviewed by a lawyer. Details: The date when the contract is effective
EXPIRATION DATE: Extract the parts (if any) of this contract related to "Expiration Date" that should be reviewed by a lawyer. Details: On what date will the contract's initial term expire?
RENEWAL TERM: Extract the parts (if any) of this contract related to "Renewal Term" that should be reviewed by a lawyer. Details: What is the renewal term after the initial term expires? This includes automatic extensions and unilateral extensions with prior notice
GOVERNING LAW: Extract the parts (if any) of this contract related to "Governing Law" that should be reviewed by a lawyer. Details: Which state/country's law governs the interpretation of the contract?
EXCLUSIVITY: Extract the parts (if any) of this contract related to "Exclusivity" that should be reviewed by a lawyer. Details: Is there an exclusive dealing commitment with the counterparty? This includes a commitment to procure all "requirements" from one party of certain technology, goods, or services, a prohibition on licensing or selling technology, goods, or services to third parties, or a prohibition on collaborating or working with other parties, whether during the contract or after the contract ends (or both).
NO-SOLICIT OF CUSTOMERS: Extract the parts (if any) of this contract related to "No-Solicit Of Customers" that should be reviewed by a lawyer. Details: Is a party restricted from contracting or soliciting customers or partners of the counterparty, whether during the contract or after the contract ends (or both)?
NO-SOLICIT OF EMPLOYEES: Extract the parts (if any) of this contract related to "No-Solicit Of Employees" that should be reviewed by a lawyer. Details: Is there a restriction on a party's soliciting or hiring employees and/or contractors from the counterparty, whether during the contract or after the contract ends (or both)?
ROFR/ROFO/ROFN: Extract the parts (if any) of this contract related to "Rofr/Rofo/Rofn" that should be reviewed by a lawyer. Details: Is there a clause granting one party a right of first refusal, right of first offer or right of first negotiation to purchase, license, market, or distribute equity interest, technology, assets, products or services?
ANTI-ASSIGNMENT: Extract the parts (if any) of this contract related to "Anti-Assignment" that should be reviewed by a lawyer. Details: Is consent or notice required of a party if the contract is assigned to a third party?
PRICE RESTRICTIONS: Extract the parts (if any) of this contract related to "Price Restrictions" that should be reviewed by a lawyer. Details: Is there a restriction on the ability of a party to raise or reduce prices of technology, goods, or services provided?
MINIMUM COMMITMENT: Extract the parts (if any) of this contract related to "Minimum Commitment" that should be reviewed by a lawyer. Details: Is there a minimum order size or minimum amount or units per-time period that one party must buy from the counterparty under the contract?
LICENSE GRANT: Extract the parts (if any) of this contract related to "License Grant" that should be reviewed by a lawyer. Details: Does the contract contain a license granted by one party to its counterparty?
POST-TERMINATION SERVICES: Extract the parts (if any) of this contract related to "Post-Termination Services" that should be reviewed by a lawyer. Details: Is a party subject to obligations after the termination or expiration of a contract, including any post-termination transition, payment, transfer of IP, wind-down, last-buy, or similar commitments?
WARRANTY DURATION: Extract the parts (if any) of this contract related to "Warranty Duration" that should be reviewed by a lawyer. Details: What is the duration of any warranty against defects or errors in technology, products, or services provided under the contract?
INSURANCE: Extract the parts (if any) of this contract related to "Insurance" that should be reviewed by a lawyer. Details: Is there a requirement for insurance that must be maintained by one party for the benefit of the counterparty?
COVENANT NOT TO SUE: Extract the parts (if any) of this contract related to "Covenant Not To Sue" that should be reviewed by a lawyer. Details: Is a party restricted from contesting the validity of the counterparty's ownership of intellectual property or otherwise bringing a claim against the counterparty for matters unrelated to the contract?
CHANGE OF CONTROL: Extract the parts (if any) of this contract related to "Change Of Control" that should be reviewed by a lawyer. Details: Does one party have the right to terminate or is consent or notice required of the counterparty if such party undergoes a change of control, such as a merger, stock sale, transfer of all or substantially all of its assets or business, or assignment by operation of law?
AUDIT RIGHTS: Extract the parts (if any) of this contract related to "Audit Rights" that should be reviewed by a lawyer. Details: Does a party have the right to audit the books, records, or physical locations of the counterparty to ensure compliance with the contract?
UNCAPPED LIABILITY: Extract the parts (if any) of this contract related to "Uncapped Liability" that should be reviewed by a lawyer. Details: Is a party's liability uncapped upon the breach of its obligation in the contract? This also includes uncapped liability for a particular type of breach such as IP infringement or breach of confidentiality obligation.
CAP ON LIABILITY: Extract the parts (if any) of this contract related to "Cap On Liability" that should be reviewed by a lawyer. Details: Does the contract include a cap on liability upon the breach of a party's obligation? This includes time limitation for the counterparty to bring claims or maximum amount for recovery.'''
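The request code itself is not shown in this article, so here is a minimal sketch of how the call might be wired up, assuming an OpenAI-compatible chat-completions endpoint. The URL, model identifier, and response shape below are placeholder assumptions for illustration, not UbiAI's documented API:

import requests

def extract_clauses(self, contract_text: str) -> str:
    """Send one contract to the model and return its raw JSON answer.
    The endpoint URL and model name below are illustrative placeholders."""
    payload = {
        "model": "qwen-2.5-7b-instruct-turbo",  # hypothetical model identifier
        "messages": [
            {"role": "system", "content": self.system_prompt},
            {"role": "user", "content": self.user_prompt + "\n\nContract:\n" + contract_text},
        ],
        "temperature": 0,  # deterministic output makes the evaluation reproducible
    }
    response = requests.post(
        "https://api.example.com/v1/chat/completions",  # placeholder URL
        json=payload,
        timeout=120,
    )
    response.raise_for_status()
    # OpenAI-compatible endpoints return the text under choices[0].message.content
    return response.json()["choices"][0]["message"]["content"]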
Next, we run our evaluation on the 20 labeled contracts using the script below.
from Levenshtein import distance as levenshtein_distance  # pip install Levenshtein

def text_similarity(self, text1: str, text2: str) -> float:
    """Calculate similarity between two texts (method of the evaluator class)."""
    text1 = self.normalize_text(text1)
    text2 = self.normalize_text(text2)
    if not text1 or not text2:
        return 0.0
    # Levenshtein distance: the number of single-character edits
    # (insertions, deletions, substitutions) between the two strings
    dist = levenshtein_distance(text1, text2)
    max_len = max(len(text1), len(text2))
    if max_len == 0:
        return 0.0
    # Normalize to a score between 0 and 1, where 1 means identical strings
    similarity = 1 - (dist / max_len)
    return similarity
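The normalize_text helper referenced above is not shown in the article; a minimal sketch, assuming it simply lowercases the text and collapses whitespace:

import re

def normalize_text(self, text: str) -> str:
    """Lowercase and collapse whitespace so that superficial formatting
    differences do not inflate the edit distance (illustrative helper)."""
    if not text:
        return ""
    text = text.lower().strip()
    return re.sub(r"\s+", " ", text)  # collapse runs of whitespace to a single space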
Essentially, the text_similarity method normalizes two texts, calculates their Levenshtein distance, and returns a similarity score between 0 and 1. We define a threshold of 0.9: a prediction scoring above it counts as a match (true positive); otherwise it is counted as a false negative. Based on these matches, the method increments true positives, false positives, or false negatives, and finally computes precision, recall, and F1 scores from those counts.
# Consider it a match if the best similarity is above the threshold (0.9)
if best_similarity >= 0.9:
    matched = True
    self.label_metrics[norm_label]['true_positives'] += 1
    self.label_metrics[norm_label]['matches'].append(best_match)
    matches += 1
else:
    self.label_metrics[norm_label]['false_negatives'] += 1
    false_negatives += 1
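The final scoring step is not shown in the snippet above; here is a minimal sketch of how the accumulated counts translate into the reported metrics (the compute_metrics name is ours, not from the original script):

def compute_metrics(true_positives: int, false_positives: int, false_negatives: int) -> dict:
    """Turn raw counts into precision, recall, and F1, guarding against division by zero."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

For example, compute_metrics(17, 0, 1) reproduces the agreement date scores reported below.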
Here are the results:
"overall_metrics": {
"precision": 0.15284072652493705,
"recall": 0.15843826495882918,
"f1": 0.1418777182958675
}
These scores are quite low overall. Let's check the scores of some individual entities:
"parties": {
"metrics": {
"true_positives": 37,
"false_positives": 7,
"false_negatives": 54,
"precision": 0.8409090909090909,
"recall": 0.4065934065934066,
"f1": 0.5481481481481482
},
"matches": []
},
"agreement date": {
"metrics": {
"true_positives": 17,
"false_positives": 0,
"false_negatives": 1,
"precision": 1.0,
"recall": 0.9444444444444444,
"f1": 0.9714285714285714
},
"matches": []
},
"exclusivity": {
"metrics": {
"true_positives": 1,
"false_positives": 13,
"false_negatives": 12,
"precision": 0.07142857142857142,
"recall": 0.07692307692307693,
"f1": 0.07407407407407408
},
"matches": []
}
The model struggles with long, sentence-level clauses but is more accurate with shorter entities such as party names and dates. Let's look at some of the discrepancies to get a quick idea of where the model is having a hard time:
{
"reference_text": "I-ON INTERACTIVE, INC.",
"api_text": "i-on interactive, a Florida corporation",
"similarity_score": 0.5128205128205128
},
In this specific case, the model extracted the organization name correctly but in a different form than the labeled text: the extra suffix "a Florida corporation" inflates the edit distance even though the core name matches. Our current evaluation system is most likely over-penalizing the model. Let's look at a different example:
{
"reference_text": "PACIRA PHARMACEUTICALS, INC.",
"api_text": "Premier Biomedical, Inc.",
"similarity_score": 0.5
},
In this case, the model's prediction is incorrect and is rightly counted as a false negative. To get more accurate evaluation metrics, a subject matter expert should review the model's responses and re-score the discrepancies manually. Nevertheless, the current evaluation method indicates that the base model struggles with long, sentence-level clauses while also missing many instances of short entities.
This is a perfect use case for fine-tuning. By showing the model the correct clauses and their boundaries, we can teach it to identify them correctly.
Conclusion
If there is one key takeaway from this article, it is that continuously evaluating your LLM is of the utmost importance. This is especially true for complex, domain-specific tasks like those in the legal domain. Models that seem to work well in general contexts may falter when faced with intricate provisions, uncommon terminology, or nuanced definitions. By rigorously benchmarking against expert-labeled data, teams can identify precisely where the model struggles, whether it is missing certain clauses altogether or failing to capture them in sufficient detail, and then take targeted steps to address these shortcomings.
In the next article, we will show how to fine-tune LLMs on legal clause extraction.
If you have any questions, send us an email at admin@ubiai.tools or connect with me directly on LinkedIn.