Fine-tuning Meta’s Llama 2 to power Jio Copilot — Part 1

Vignesh Prajapati
Published in Building Fynd · 9 min read · Nov 29, 2023

In this two-part series, we cover everything from our fine-tuning experiments to model quantization and the limitations of Meta’s Open Source Large Language Model — Llama 2. This series has been co-authored by Vignesh Prajapati and Sreeraman Mohan Girija from Fynd and Varun Vontimitta from Meta.

Over half a billion people in India use various services offered by Jio. This number is growing at an unprecedented rate every year and includes a significant young demographic. These young customers are discerning and quick to switch services if the experience is not delightful. We have been working on leveraging the power of AI and LLMs to give them the best experience we can while keeping it unified across our various services. For this, we are building a Copilot.

For our customers, it will be a one-stop shop for everything Jio. It can tell you when your JioMart order is out for delivery, provide a smart summary of the last episode you watched of your favourite show on JioCinema, or help you troubleshoot your JioFiber broadband connection.

Fynding the right LLM

We started with closed-source LLMs, which helped us achieve good results and build our prototype quickly. However, they posed challenges such as high latency, limited API usage quotas, and constraints with data transfer policies. These limitations made it difficult to create a truly seamless assistant. We wanted an open-LLM-based solution that could help us control latency, avoid API quota issues, and optimise model accuracy by effectively utilising our various internal datasets.

We evaluated several LLMs and finally settled on Llama 2, which came with the following advantages.

  1. User privacy: We retained complete control over our data and model, safeguarding user privacy.
  2. Time & cost savings: With Llama 2, we can eliminate rate limiting and quota issues of third-party GenAI model services and time lost in negotiations.
  3. State-of-the-Art research: Llama 2’s advanced language processing capabilities allowed us to build an accurate virtual assistant compared to other open LLMs.
  4. Faster and better community support: Robust ecosystem support from the open-source community ensures that we can iterate and progress rapidly on our development.

These factors collectively made Llama 2 the ideal choice for powering Jio Copilot.

Initial experiments with Llama 2 base models

Once we got access to the model, we started running some experiments. Llama 2 base models already had the context of some of the Jio Storefronts, which helped in Intent-Entity detection.

However, a challenge surfaced as the models occasionally ‘hallucinated’, suggesting products from memory with inaccurate web URLs and, sometimes, from unrelated sources like other eCommerce websites. To address these limitations, we experimented with adding more instructions and chain-of-thought prompts, which did not yield the expected results.

Even with a defined sample example in the prompt, the base model could not provide the expected output. The model-generated output had issues like inconsistent formatting, omission of the <out101> token, and hallucinated product names, prices, and size details drawn from its prior knowledge.

Additionally, we experimented with prompt-tuning parameters across multiple versions of system prompts and input messages, and observed the following issues:

  1. Inconsistent output format for Intent-Entity detection responses in single-turn and multi-turn chat conversations. E.g.:
Input
User: Show me red shirts

Output
Sure, I can assist you with your request! Based on your message, I detect the intent "product search" and the entities "red shirts". Here is my response:
Agent: Search
Query: Red shirts
Store: JioMart
Here are some red shirts available at JioMart: . If you have any specific preferences or requirements, feel free to let me know and I can narrow down the options further. How may I assist you today?

  2. Limited control over text generation: restricting the model from answering general Q&A, handling omission of necessary tokens, and avoiding redundant tokens. E.g.:

Input
User: Who is elon musk?

Output
As per the search results obtained from our language models (LLM), Elon Musk is a South African-born entrepreneur, inventor, and business magnate who is best known for his ambitious goals in revolutionizing transportation, energy, and space exploration through his companies SpaceX and Tesla. He has also been involved in other ventures such as Neuralink and The Boring Company.
I hope this information helps! Let me know if you have any further questions or if there's anything else I can assist you with.

  3. Omission of the <out101> token and hallucination of fake product names, sizes, prices, and other attributes. E.g.:

Input
User: Show me red shirts

Output
Sure, I can assist you with your query! Based on your message, I understand that you want to see red shirts available at Jio Storefronts. Here's my response:
Agent: Search
Query: Red shirts
Store: JioMart
Here are some red shirts that might interest you: . If you have any specific preferences or requirements, feel free to let me know and I can narrow down the options further. Is there anything else I can help you with?

  4. Detection of the wrong intent/agent when the input message contained multi-turn chat conversations. E.g.:

Input
User : What's the best moisturizer for dry skin?
Assistant: For dry skin, it's important to use a moisturizer that provides intense hydration. Let me find some moisturizers that could help with this: {agent:search;query:Dry skin moisturizer;store:Tira Beauty} <out101>. If you're seeking more details, I'm here to help!
User: Where is my Tira order?

Output
Assistant: Sure, Here is the status of your order on Tira Beauty. {agent:search;query:order status;store:tira_beauty} <out101>

In the above example, the Intent-Entity model detected the wrong agent, returning search instead of order_status. Sometimes, the model would also detect the wrong store entity, e.g., JioMart instead of TiraBeauty.
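For context, the directive embedded in these responses is simple enough to parse with a regular expression. The small helper below is an illustrative sketch (not our production router) that extracts the agent, query and store fields shown in the examples above:

import re
from typing import Optional

# Matches directives of the form {agent:search;query:red shirts;store:JioMart}
DIRECTIVE_RE = re.compile(r"\{agent:(?P<agent>[^;]+);query:(?P<query>[^;]+);store:(?P<store>[^}]+)\}")

def parse_directive(text: str) -> Optional[dict]:
    """Extract the agent, query and store fields from a model response."""
    match = DIRECTIVE_RE.search(text)
    if match is None:
        return None  # e.g. the model omitted the directive entirely
    return {k: v.strip() for k, v in match.groupdict().items()}

response = "Sure, here is the status of your order. {agent:order_status;query:order status;store:tira_beauty} <out101>"
print(parse_directive(response))
# {'agent': 'order_status', 'query': 'order status', 'store': 'tira_beauty'}

A response in which the model detects the wrong agent (search instead of order_status) parses cleanly but routes the request to the wrong internal service, which is exactly the failure mode described above.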

The above challenges prompted us to consider fine-tuning Llama 2 models to control the output format, tone and quality of text generated in multi-turn chat conversations for Jio Copilot.

Training & Deployment

Building the Copilot was a multi-step process which included,

  1. Dataset preparation
  2. Fine-tuning
  3. Compilation and Quantization
  4. Deployment

Dataset Preparation

Everything starts with the dataset. We needed to create a fine-tuning dataset for various use cases across Jio storefronts. This involved settling the format of the generated output, matching the tone of language to the context of the Jio ecosystem, and improving product recommendations based on multi-turn chat conversations between Jio Copilot and the user.

Data

Product discovery, order status, and product-related Q&A with multi-turn chat interactions were the first use cases we wanted to solve. To fulfil the fine-tuning prerequisites for these use cases, we had to generate synthetic data, iterating multiple times until we achieved superior results in the tasks and scenarios across use cases: Bot personality questions, Intent-Entity detection, Chat completion, and Question Answering.

Of the input data records, roughly 70% pertain to the Product discovery intent, with the remaining records covering the other intents.

Data Cleansing

We took the catalogue data from Jio Storefronts like JioMart, TiraBeauty, Netmeds, MilkBasket, JioCinema and JioFiber and identified all product categories from the dataset. We generated their most prominent question-answer pairs covering a diverse set of scenarios, similar to how users interact while exploring products and services. These question-answer pairs were then translated into LLM prompts to feed into the fine-tuning of the Intent-Entity and Chat completion tasks.
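As a concrete (and simplified) example, each question-answer pair can be serialised into Llama 2's chat prompt format before fine-tuning. The system prompt and the sample pair below are illustrative placeholders rather than our actual templates:

# Serialise a question-answer pair into Llama 2's chat prompt format.
# The system prompt and the sample pair below are placeholders.
def build_training_example(system_prompt: str, user_msg: str, assistant_msg: str) -> str:
    return (
        f"<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
        f"{user_msg} [/INST] {assistant_msg} </s>"
    )

system_prompt = "You are Jio Copilot. Detect the user's intent and respond with a directive."
user_msg = "Show me red shirts"
assistant_msg = "{agent:search;query:red shirts;store:JioMart} <out101>"

print(build_training_example(system_prompt, user_msg, assistant_msg))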

The primary portion of our training dataset was synthetically generated, leading to a mostly clean dataset. Nevertheless, challenges arose when attempting to format answers precisely. There were instances where the model struggled to identify the necessary intent, entities, or placeholders, resulting in answer format discrepancies.

After several iterations of data preparation and cleaning, we identified the optimised data format for improving the fine-tuning process of Llama 2.

Subsequently, we performed a series of operations to cleanse the records stored in multiple raw data files.
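As a simplified illustration, those cleansing operations can be thought of as record-level checks like the ones below. The field names and rules here are hypothetical stand-ins for our pipeline, covering deduplication and validation of the directive and placeholder-token formats described earlier:

import json
import re

DIRECTIVE_RE = re.compile(r"\{agent:[^;]+;query:[^;]+;store:[^}]+\}")

def is_valid_record(record: dict) -> bool:
    """Keep a synthetic Intent-Entity record only if its answer is well formed."""
    answer = record.get("answer", "")
    if not DIRECTIVE_RE.search(answer):
        return False  # missing or malformed intent/entity directive
    if "<out101>" not in answer:
        return False  # placeholder token omitted
    return True

def cleanse(raw_files: list[str]) -> list[dict]:
    seen, cleaned = set(), []
    for path in raw_files:
        with open(path) as f:
            for line in f:
                record = json.loads(line)
                key = (record["question"], record["answer"])
                if key in seen:
                    continue  # drop duplicate question-answer pairs
                seen.add(key)
                if is_valid_record(record):
                    cleaned.append(record)
    return cleaned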

Fine-tuning

Fine-tuning is a technique through which models are customised to perform specific tasks or behaviours. It involves taking an already trained base model and adapting it to a narrower subject or more focused goal.

We undertook a thorough model fine-tuning journey, marked by systematic experiments over multiple datasets and fine-tuning config variations. Beginning with simpler datasets, we steadily advanced to more intricate ones. With each iteration, we observed tangible progress.

We preferred the transformers and evaluate libraries from the HuggingFace family to make our fine-tuning and evaluation process more robust. We fine-tuned the Llama 2 7B and 13B models using PEFT- and QLoRA-based parameter- and memory-efficient techniques to limit performance decay. PEFT focuses on fine-tuning a minimal number of parameters, significantly reducing computational and storage costs while achieving performance comparable to full fine-tuning. QLoRA uses memory-saving techniques, including 4-bit NormalFloat (NF4) and Double Quantization, to reduce memory usage during fine-tuning.

We used the following hyperparameters:

peft_r: 8
lora_alpha: 32
lora_dropout: 0.05
peft_target_modules: ["q_proj", "v_proj"]
learning_rate: 1e-4
optimizer: "adamw_torch_fused"

The above parameters gave us a stable fine-tuning process with a smoothly declining training loss curve. We used the adamw_torch_fused optimizer, and the loss function was Cross Entropy Loss, as given in the default script settings. The Intent model was fine-tuned on ~25K data instances for 2 epochs with a batch size of 4 on Llama 2 13B chat-hf, completing in ~7 hours. The Chat model was fine-tuned on a similar number of data instances for 2 epochs with a batch size of 4 on the Llama 2 7B base hf model, completing in ~4 hours.
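Putting the pieces together, here is a minimal sketch of that setup using the transformers, peft and bitsandbytes libraries. It mirrors the hyperparameters listed above, while the model ID and loading details follow the standard HuggingFace path rather than our exact training script:

import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "meta-llama/Llama-2-13b-chat-hf"

# 4-bit NF4 quantization with Double Quantization, per the QLoRA recipe
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# LoRA adapters on the attention query/value projections, matching the
# hyperparameters listed above
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of the weights train

With rank-8 adapters on only the query and value projections, just a small fraction of the 13B parameters are trainable, which is what keeps the memory footprint and training time manageable.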

The quality and output format of Llama 2 fine-tuned models offered us the flexibility and control required to tailor the model to our specific use cases, enabling Jio Copilot to route between multiple internal search and order tracking services across storefronts effectively.

We achieved a better format and more meaningful results than the base Llama 2 models (7B and 13B) with a per-device batch size of 4 and 10 epochs. The following are the loss curves for the training and validation datasets.

To evaluate readiness, we also performed multiple additional manual tests:

  1. Safety checks — ethical use, bias detection, and content filtering for profanity, hate speech, violence, and other harmful content. (This is handled by default by the Llama 2 base model.)
  2. Model API response time — this should be less than 1 second for the Intent-Entity model and less than 3 seconds for the chat model response (see the measurement sketch after this list).
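One simple way to verify those response-time budgets is to time repeated calls to the model endpoint and check a high percentile. The endpoint URL and payload below are hypothetical stand-ins for our internal services:

import statistics
import time

import requests

INTENT_ENDPOINT = "http://localhost:8080/v1/intent"  # hypothetical internal endpoint
BUDGET_SECONDS = 1.0  # Intent-Entity model budget; use 3.0 for the chat model

def measure_latency(n_requests: int = 50) -> None:
    latencies = []
    for _ in range(n_requests):
        start = time.perf_counter()
        requests.post(INTENT_ENDPOINT, json={"message": "Show me red shirts"}, timeout=10)
        latencies.append(time.perf_counter() - start)
    p95 = statistics.quantiles(latencies, n=20)[-1]  # 95th percentile
    print(f"p95 latency: {p95:.3f}s (budget {BUDGET_SECONDS}s)")
    assert p95 < BUDGET_SECONDS, "Intent-Entity model exceeds latency budget"

measure_latency()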

We maintained a continuous cycle of iterations and experiments, consistently incorporating additional relevant data instances into new experiments driven by feedback from previous rounds of experimentation.

As we fine-tuned models with larger parameter counts, from 7B to 13B and further to 70B, accuracy improved. However, this brought heightened deployment complexity, increased costs, and longer latencies for the larger models. This realisation guided our decision to opt for the 13B parameter model for Intent-Entity detection, where output tokens are limited, while employing the 7B-chat-hf model for the chat completion task, where a higher token count was manageable. Striking this balance was essential in optimising the critical factors for our solution.

Let’s delve into the model prompts, data format, key takeaways across various fine-tuning experiments, and their successive iterations.

Prompts and expected output:
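As an illustration of their shape (the wording here is hypothetical, while the directive and token format match the examples earlier in this post):

Prompt
System: You are Jio Copilot. Detect the user's intent and entities, and respond with a directive of the form {agent:<agent>;query:<query>;store:<store>} followed by the <out101> token.
User: Show me red shirts

Expected output
{agent:search;query:red shirts;store:JioMart} <out101>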

We trained several Llama 2 models and experimented with diverse prompts, data formats, scenarios, and use cases to identify the factors that enhance the model’s performance.

Our fine-tuning experiments ran for around two weeks, during which we experimented with Intent-Entity detection, multi-turn chat conversations, and even some regional languages like Hindi.

We will cover this and more in the second and final part of the series, to be published next week. Stay tuned.

Credits

Development: Abhay Gawde, Dipen Rangwani, Shraddha Gami, Shreyash Shetty, Subhankar Halder, Sujoy Dev, Sumit Dafda, Sagarika Singh, Vignesh Prajapati and Jigar Dafda

Design: Abhishek Singh, Gaurav Sathe, Milind Bhuravane, Suhas Zambre, Taabish Sheikh
