
Leveraging qLoRA for Fine-Tuning of Task-Fine-Tuned Models Without Catastrophic Forgetting: A Case Study with LLaMA2(-chat) | by Aris Tsakpinis | Sep, 2023

Large language models (LLMs) like Anthropic's Claude or Meta's LLaMA2 have demonstrated impressive capabilities on a variety of natural language tasks. However, their knowledge and task-specific skills remain relatively generic. If you want to execute more specialized, domain-specific tasks that require explicit knowledge, you need to find ways to infuse models with knowledge and teach them task-specific behaviour. LLM-powered applications need to work properly in their target domain, provide accurate answers instead of hallucinating, and ensure security, privacy, and appropriate content. These challenges are commonly denoted as the "three Hs" of helpfulness, honesty, and harmlessness. Overcoming them has proved to be particularly important when designing FM-powered applications of enterprise-grade quality.

There are a few options for imparting domain knowledge into foundation models, ranging from in-context approaches such as retrieval augmentation to parametric approaches such as fine-tuning. As Heiko Hotz explains in his blogpost, picking the right approach (or a combination of both) comes with tradeoffs. In this blog, we'll focus on the parametric approach and demonstrate how to fine-tune the LLaMA2 model using PEFT (parameter-efficient fine-tuning) on Amazon SageMaker.

Our goal is to adapt LLaMA2 to a specific domain, picking up recent knowledge to overcome the "knowledge cutoff" problem, where models lack awareness of recent information that was not part of their training data. As opposed to task-specific fine-tuning, this is a much more achievable task for many practitioners, since they can simply use text corpora containing domain-specific information as training data instead of manually crafting or collecting task-specific datasets such as conversational or instruction datasets. Since task-specific models are beneficial for a lot of relevant LLM-powered use cases, we will also show that the proposed setup can be applied equally to models like LLaMA2-chat, which have already gone through task-specific fine-tuning, without losing their task-specific nature (e.g. instruction following, conversational behaviour, ...). By walking through this end-to-end workflow of knowledge infusion, we provide a practical guide for tuning foundation models to your specific needs.

The LLaMA2 models were released in July 2023 together with a research publication. In the paper, Touvron et al. state that LLaMA2 is "a collection of pre-trained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models." As stated, the LLaMA2 models come in three sizes: 7B, 13B and 70B. They are available as pure completion models as well as versions optimised for dialogue use cases. Pre-trained on roughly 2 trillion tokens, they support context lengths of up to 4096 tokens. The fine-tuning for dialogue use cases was carried out with over 100k examples and further optimised with over 1M training samples representing human preference.

Within AWS, the LLaMA2 models can be deployed with a single click through SageMaker JumpStart or sourced from the HuggingFace model hub via the AWSxHuggingFace LLM DLC.
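To make the JumpStart route concrete, here is a minimal sketch of deploying and querying the 13B base model from a SageMaker notebook. The model identifier, the payload format and the EULA handling via custom_attributes are assumptions based on the JumpStart catalogue and SageMaker Python SDK as of mid-2023 and may differ in newer SDK versions.

```python
# Minimal sketch: deployment of LLaMA2-13b via SageMaker JumpStart.
# Assumptions: SageMaker Python SDK >= 2.x (mid-2023); verify the model_id in the
# current JumpStart catalogue before running.
from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(model_id="meta-textgeneration-llama-2-13b")
predictor = model.deploy()  # uses the JumpStart default instance type for this model

payload = {
    "inputs": "Amazon SageMaker is",
    "parameters": {"max_new_tokens": 64, "top_p": 0.9, "temperature": 0.6},
}
# The Meta license has to be accepted explicitly; depending on the SDK version this is
# done per request (as below) or via accept_eula=True in model.deploy().
response = predictor.predict(payload, custom_attributes="accept_eula=true")
print(response)
```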
Fine-tuning leverages transfer learning to efficiently inject niche expertise into a foundation model like LLaMA2. The process involves updating the model's weights through continued pre-training on domain-specific data, while keeping the overall network architecture unchanged. Unlike full pre-training, which requires massive datasets and compute, fine-tuning is highly sample- and compute-efficient. Parameter-efficient fine-tuning (PEFT) techniques, such as the (q)LoRA approach, enable light-weight infusion of specialty knowledge into a general language model like LLaMA2 with minimal overhead.

When speaking about fine-tuning, two different approaches are possible: (1) continued pre-training on a raw, domain-specific text corpus with a plain language-modelling objective, and (2) task-specific fine-tuning on labelled data such as instruction or conversational datasets. In both cases the models utilise a self-supervised training approach, optimising towards a language-modelling (LM) specific loss function. Decoder-only models like LLaMA2 are tied to a causal language modelling (CLM) approach with a uni-directional context. In simple words, this means that they are trained towards predicting the subsequent token in an auto-regressive manner, based on the previous tokens as semantic context.

As mentioned above, PEFT techniques enable light-weight infusion of specialty knowledge into an LLM with minimal overhead, since only a subset of the model parameters is updated. Approaches like Low-Rank Adaptation (LoRA) or Quantized Low-Rank Adaptation (QLoRA) freeze the pre-trained model weights and inject trainable rank-decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. If you want to dive deeper into LoRA, I recommend checking out Mariano Kamp's blogpost.

On July 26th, AWS announced various features in the field of generative AI at the AWS Summit NYC. To share additional details, several announcement blogposts were published:

· Agents for Amazon Bedrock
· AWS Entity Resolution match
· Role of vector stores in generative AI applications
· Vector engine for Amazon OpenSearch Serverless
· AWS Glue Studio notebook powered by Amazon CodeWhisperer
· Amazon EC2 P5

Considering LLaMA2's knowledge cutoff date, this model in its pure form will unfortunately not be able to provide any information on these announcements. We want to change this by leveraging fine-tuning to infuse this knowledge into the foundation model (FM).

Since we are not aiming to fine-tune the model towards a specific task but simply want to ingest domain-specific knowledge, we can go with a classic CLM-based approach (option 1 above). Philipp Schmid describes in his very comprehensive blog how to fine-tune LLaMA2 models with QLoRA, focussing however on task-specific (instruction) fine-tuning. We will take the code samples shared with his blogpost as a starting point and adjust the code accordingly.

To be responsible with resource consumption, we will conduct the fine-tuning for the LLaMA2-13b and LLaMA2-13b-chat models. Fine-tuning the 7b and 70b versions works accordingly with an adjusted training cluster configuration (see Philipp's blog). After the fine-tuning itself, we will deploy the base models alongside the fine-tuned models and do a high-level performance comparison.

Subsequently, we will walk step by step through fine-tuning the models. If you want to access the full code repository, you can find it here.

For data loading we use LangChain's WebBaseLoader to load an array of websites identified by their URL.
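A minimal sketch of this loading step could look as follows. The URL list is a placeholder for the actual announcement posts, and the import path assumes the LangChain release available in 2023 (WebBaseLoader additionally requires beautifulsoup4).

```python
# Sketch: load the announcement blog posts into LangChain Documents.
from langchain.document_loaders import WebBaseLoader

urls = [
    # Placeholders -- replace with the URLs of the six announcement blog posts.
    "https://aws.amazon.com/blogs/aws/<agents-for-amazon-bedrock-announcement>/",
    "https://aws.amazon.com/blogs/aws/<amazon-ec2-p5-announcement>/",
]

loader = WebBaseLoader(urls)   # accepts a single URL or a list of URLs
docs = loader.load()           # one Document per URL, raw page text in .page_content

print(f"Loaded {len(docs)} documents, first excerpt: {docs[0].page_content[:200]!r}")
```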
Before we can start with data preparation and training, we need to accept the license agreement of the LLaMA2 models. This includes a registration on the Meta website with an email address matching the one of your HuggingFace account. Then we authenticate with the HuggingFace hub from our runtime.

First, we do some preprocessing on the raw websites. In a real-world use case, more emphasis could be put into this stage; for demo purposes we stick to simply stripping all larger concatenations of spaces so we get a cohesive and fluent text corpus. Then we load the list of documents into a HuggingFace Dataset.

In the next step we tokenise our text corpus to make it digestible for the LLM. For this we use the LLaMA2 tokeniser loaded from the HuggingFace hub. The tokenised corpus is then packed into batches matching the context length used for training (2048 tokens). Finally, we save the dataset to S3 for usage within a training job.

Now we can trigger an Amazon SageMaker training job that executes a CLM-tied QLoRA fine-tuning script on the preprocessed data. The hyperparameters and training script are adapted from Philipp's blogpost. The only exception is the number of training epochs, where we choose a relatively high value of 20 to account for the fact that our training dataset is rather small (~32k tokens). In real-world use cases, fine-tuning with larger datasets is advised. The training script itself can be found here. Please also note: since the accepted model license agreement is tied to your HuggingFace account, we need to specify a HuggingFace access token. The training job configuration, e.g. the training cluster configuration, was adapted from Philipp's blogpost as well. We can then execute the training job, which will perform the fine-tuning and save our model artefacts to S3.
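To make the preprocessing described above more concrete, here is a rough sketch of the tokenisation and packing logic. It assumes the cleaned documents already sit in a HuggingFace Dataset named dataset with a single text column; the column name, the gated model repo and the S3 output path are assumptions rather than exact values from the accompanying repository.

```python
# Sketch: tokenise the corpus with the LLaMA2 tokeniser and pack it into
# fixed-size blocks of 2048 tokens for causal language modelling.
from itertools import chain
from transformers import AutoTokenizer

# Gated repo: requires the accepted license; after logging in to the HuggingFace hub,
# the cached access token is picked up automatically.
model_id = "meta-llama/Llama-2-13b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)

block_size = 2048  # packing length used for training


def tokenize(batch):
    return tokenizer(batch["text"])


def pack(batch):
    # Concatenate all token ids of the batch and cut them into equally sized blocks;
    # the remainder that does not fill a full block is dropped.
    ids = list(chain.from_iterable(batch["input_ids"]))
    usable = (len(ids) // block_size) * block_size
    chunks = [ids[i : i + block_size] for i in range(0, usable, block_size)]
    return {"input_ids": chunks, "labels": [list(c) for c in chunks]}


lm_dataset = (
    dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
           .map(pack, batched=True, remove_columns=["input_ids", "attention_mask"])
)

# Save to S3 for the training job (requires the s3fs extra of the datasets library).
lm_dataset.save_to_disk("s3://<your-bucket>/processed/llama2/train")
```

The training job itself can then be launched with the SageMaker HuggingFace estimator, roughly along the following lines. The entry point, instance type, container versions and most hyperparameter names are assumptions adapted from Philipp Schmid's QLoRA example rather than exact values from this post; only the epoch count of 20 is taken from the text above.

```python
# Sketch: launch the QLoRA fine-tuning script as a SageMaker training job.
import sagemaker
from sagemaker.huggingface import HuggingFace

role = sagemaker.get_execution_role()  # works inside SageMaker; pass an IAM role ARN elsewhere

hyperparameters = {
    "model_id": "meta-llama/Llama-2-13b-chat-hf",   # or the plain completion model
    "dataset_path": "/opt/ml/input/data/training",  # where SageMaker mounts the channel below
    "epochs": 20,                                   # deliberately high for the tiny ~32k-token corpus
    "per_device_train_batch_size": 2,
    "lr": 2e-4,
    "hf_token": "<your-hf-access-token>",           # the gated model repo is tied to your HF account
    "merge_weights": True,                          # merge the LoRA adapters back into the base model
}

huggingface_estimator = HuggingFace(
    entry_point="run_clm.py",          # the CLM/QLoRA training script
    source_dir="./scripts",
    instance_type="ml.g5.4xlarge",     # assumption -- size up for the 13B/70B variants if needed
    instance_count=1,
    role=role,
    transformers_version="4.28",
    pytorch_version="2.0",
    py_version="py310",
    hyperparameters=hyperparameters,
)

# The "training" channel points to the packed dataset we saved to S3 above.
huggingface_estimator.fit({"training": "s3://<your-bucket>/processed/llama2/train"})
```

Once the job finishes, the (merged) model artefacts land in the training job's S3 output location, which is exactly what we point the deployment to in the next step.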
The two base models LLaMA2-13b and LLaMA2-13b-chat can be conveniently deployed via SageMaker JumpStart; this blogpost provides detailed instructions. The two fine-tuned models can be deployed by using the AWSxHuggingFace LLM DLC. For this, we point the model_data parameter of the HuggingFaceModel class to the S3 path of the model artefacts. Then we set the environment variable HF_MODEL_ID of the hosting container to the default model path within the DLC ("/opt/ml/model").

After having deployed all four models, we want to test their performance on an example question. The announcement blogpost on Amazon EC2 P5 instances states: "P5 instances provide 8 x NVIDIA H100 Tensor Core GPUs with 640 GB of high bandwidth GPU memory, 3rd Gen AMD EPYC processors, 2 TB of system memory, and 30 TB of local NVMe storage. P5 instances also provide 3200 Gbps of aggregate network bandwidth with support for GPUDirect RDMA, enabling lower latency and efficient scale-out performance by bypassing the CPU on internode communication."

For the chat models we frame the following question: "What are Amazon EC2 P5 instances? Which kind of GPUs are they equipped with?", using an inference configuration of max_new_tokens = 200, top_p = 0.9 and temperature = 0.01. We can clearly see that while the base model hallucinates about the GPU type (V100), the fine-tuned model provides us with the correct answer (H100). We also see that through qLoRA we can preserve the chat-fine-tuned nature of the base model and hence mitigate catastrophic forgetting: we can infuse knowledge into an LLM without having to do a fully fledged instruction/chat fine-tuning afterwards, simply by using a respectively task-fine-tuned model as the base model. The reason why this works lies in the nature of LoRA, where large parts of every layer of the neural network stay untouched while the layers are extended.

For the pure completion versions of the models we need to rephrase the question, since these models are not capable of understanding instructions or behaving in a conversational manner. Instead, they simply complete token sequences through auto-regressive next-token prediction. We therefore frame the prompt as follows: "Amazon EC2 P5 instances are equipped with GPUs of the type", again with an inference configuration of max_new_tokens = 200, top_p = 0.9 and temperature = 0.01. While fine-tuning the chat model with this small amount of data eliminated the hallucination, it does not seem to do so for the pure completion model in our setup. This might be due to the limited size of our fine-tuning dataset. More sophisticated prompt engineering and an optimised inference configuration could also be helpful.

In conclusion, this blog post delves into the critical process of infusing domain-specific knowledge into large language models (LLMs) like LLaMA2, emphasizing the importance of addressing challenges related to helpfulness, honesty, and harmlessness when designing LLM-powered applications of enterprise-grade quality. The primary focus here is on the parametric approach to fine-tuning, which efficiently injects niche expertise into foundation models without compromising their general linguistic capabilities.

The blog highlights the steps involved in fine-tuning LLaMA2 using parameter-efficient fine-tuning techniques such as the qLoRA approach, and how this process can be conducted on Amazon SageMaker. By adopting this approach, practitioners can adapt LLaMA2 to specific domains, ensuring that the models remain up to date with recent knowledge beyond their original training data.

The article also underscores the versatility of this approach, showing that it can be applied to models like LLaMA2-chat, which have already undergone task-specific fine-tuning. This opens up opportunities to infuse knowledge into LLMs without the need for extensive instruction or chat-based fine-tuning, preserving their task-specific nature.


