A step-by-step guide to accelerating large language models
Large language model (LLM) deployment
We live in an amazing age of large language models like ChatGPT, GPT-4, and Claude that can perform a wide range of tasks. In almost every field, from education and healthcare to art and business, large language models are being used to make service delivery more efficient. Over the past year, many excellent open source large language models have been released, such as Llama, Mistral, Falcon, and Gemma. Although these open source LLMs are available to anyone, deploying them can be very difficult: they can be very slow and require a lot of GPU compute to run in real time. A variety of tools and approaches have been developed to simplify large language model deployment.
Many deployment tools have been created to serve LLMs with faster inference, such as vLLM, CTranslate2, TensorRT-LLM, and llama.cpp. Quantization techniques are also used to reduce the GPU memory needed to load very large language models. This article explains how to deploy large language models using vLLM and quantization.
Latency and Throughput
Some of the key factors that affect the speed performance of large language models are GPU hardware requirements and model size. The larger the model, the more GPU compute it requires to run. Common benchmark metrics used to measure the speed performance of large language models are latency and throughput.
Latency: The time a large language model takes to generate a response, typically measured in seconds or milliseconds.
Throughput: The number of tokens a large language model generates per second or millisecond.
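For example, using illustrative numbers, a model that takes 2 seconds to produce a response containing 100 generated tokens has a latency of 2 seconds and a throughput of 100 / 2 = 50 tokens per second.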
Install required packages
Here are two packages needed to run large language models: Hugging Face Transformers and accelerate.
pip3 install transformers
pip3 install accelerate
What is Phi-2?
Phi-2 is Microsoft's state-of-the-art foundation model with 2.7 billion parameters. It was pre-trained on a variety of data sources, from code to textbooks. Learn more about Phi-2 here.
Benchmarking LLM latency and throughput using Hugging Face Transformers
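A minimal sketch of this benchmark, assuming the microsoft/phi-2 checkpoint, float16 weights, and an illustrative max_new_tokens value; the line numbers in the analysis below refer to the article's original snippet rather than this sketch.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the Phi-2 model and tokenizer, then tokenize the prompt.
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2", torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
prompt = "Generate a python code that accepts a list of numbers and returns the sum."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate a response and measure the latency.
start = time.time()
output = model.generate(**inputs, max_new_tokens=100)  # illustrative value
latency = time.time() - start

# Count the newly generated tokens and compute the throughput.
num_tokens = output.shape[1] - inputs["input_ids"].shape[1]
throughput = num_tokens / latency

print(f"Latency: {latency} seconds")
print(f"Throughput: {throughput} tokens/second")
print(tokenizer.decode(output[0], skip_special_tokens=True))
```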
generated output
Latency: 2.739394464492798 seconds
Throughput: 32.36171766303386 tokens/second
Generate a python code that accepts a list of numbers and returns the sum. [1, 2, 3, 4, 5]
A: def sum_list(numbers):
    total = 0
    for num in numbers:
        total += num
    return total
print(sum_list([1, 2, 3, 4, 5]))
Step-by-step code analysis
Lines 6-10: We loaded the Phi-2 model and tokenized the prompt "Generate a python code that accepts a list of numbers and returns the sum."
Lines 12-18: We generated a response from the model and measured the latency by timing how long the response took to generate.
Lines 21-23: We obtained the total number of tokens in the generated response and divided it by the latency to calculate the throughput.
Running the model on an A1000 (16GB) GPU gave a latency of ~2.7 seconds and a throughput of ~32 tokens per second.
What is vLLM?
vLLM is an open source library for serving large language models at low cost with low latency and high throughput.
How vLLM works
Transformers are the building blocks of large language models. A transformer network uses an attention mechanism to study and understand the context of the words it processes. The attention mechanism involves mathematical computations over matrices known as attention keys and values, and the memory used to handle these keys and values affects the model's speed. vLLM introduces a new attention mechanism called PagedAttention, which efficiently manages the memory allocated to the transformer's attention keys and values during token generation. vLLM's memory efficiency has proven very useful for running large language models with low latency and high throughput.
This is a high-level explanation of how vLLM works. For more technical details, please visit the vLLM documentation.
Install vLLM
pip3 install vllm==0.3.3
Running Phi-2 with vLLM
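A minimal sketch of the vLLM benchmark, assuming the microsoft/phi-2 checkpoint and an illustrative max_tokens value; the line numbers in the analysis below refer to the article's original snippet rather than this sketch.

```python
import time
from vllm import LLM, SamplingParams

# Load Phi-2 with vLLM and define the prompt and sampling parameters.
llm = LLM(model="microsoft/phi-2")
prompt = "Generate a python code that accepts a list of numbers and returns the sum."
sampling_params = SamplingParams(max_tokens=100)  # illustrative value

# Generate a response and measure the latency.
start = time.time()
outputs = llm.generate([prompt], sampling_params)
latency = time.time() - start

# Count the generated tokens and compute the throughput.
generated_text = outputs[0].outputs[0].text
num_tokens = len(outputs[0].outputs[0].token_ids)
throughput = num_tokens / latency

print(f"Latency: {latency} seconds")
print(f"Throughput: {throughput} tokens/second")
print(generated_text)
```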
generated output
Latency: 1.218436622619629 seconds
Throughput: 63.15334836428132 tokens/second
[1, 2, 3, 4, 5]
A: def sum_list(numbers):
    total = 0
    for num in numbers:
        total += num
    return total
numbers = [1, 2, 3, 4, 5]
print(sum_list(numbers))
Step-by-step code analysis
Lines 1-3: We imported the packages required to run Phi-2 with vLLM.
Lines 5-8: We loaded Phi-2 with vLLM, defined the prompt, and set important parameters for running the model.
Lines 10-16: We generated the model's response with llm.generate and calculated the latency.
Lines 19-21: We obtained the total number of tokens generated in the response and divided it by the latency to calculate the throughput.
Lines 23-24: We retrieved the generated text.
We ran Phi-2 with vLLM using the same prompt, "Generate a python code that accepts a list of numbers and returns the sum." On the same GPU, an A1000 (16GB), vLLM achieved a latency of ~1.2 seconds and a throughput of ~63 tokens/second, compared with Hugging Face Transformers' latency of ~2.7 seconds and throughput of ~32 tokens/second. Running large language models with vLLM produces the same accurate results as Hugging Face, with much lower latency and higher throughput.
Note: The latency and throughput reported here are rough benchmarks of vLLM's performance. Generation speed depends on many factors, including the length of the input prompt and the size of the GPU. According to the official vLLM report, serving LLMs with vLLM on a powerful GPU like the A100 in a production environment can achieve up to 24x higher throughput than Hugging Face Transformers.
Real-time benchmarking latency and throughput
Our method of calculating latency and throughput for running Phi-2 is a simple experiment, used to demonstrate how vLLM accelerates the performance of large language models. In real-world LLM use cases, such as chat-based systems where the model streams generated tokens, measuring latency and throughput is more complex.
Chat-based systems rely on streaming output tokens. Some of the key factors that affect LLM metrics are time to first token (the time the model takes to generate its first token), time per output token (the time spent on each generated output token), the input sequence length, the expected output, the total number of expected output tokens, and the model size. In chat-based systems, latency is typically a combination of time to first token plus time per output token multiplied by the total number of expected output tokens.
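For example, with assumed illustrative values of 0.2 seconds for time to first token, 0.03 seconds per output token, and 300 expected output tokens, the expected latency would be roughly 0.2 + 0.03 × 300 = 9.2 seconds.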
The longer the input sequence passed to the model, the slower the response. Some approaches used to run LLMs in real time involve batching users' input requests or prompts and performing inference on them simultaneously, which helps improve throughput. In general, using a powerful GPU and serving the LLM with an efficient tool such as vLLM improves both latency and throughput in real time.
Running a vLLM Deployment in Google Colab
What is Quantization?
Quantization is the conversion of a machine learning model from a higher precision to a lower precision by shrinking the model's weights into smaller bits, usually 8-bit or 4-bit. Deployment tools like vLLM are very useful for running inference on large language models with very low latency and high throughput. We can conveniently run Phi-2, with its 2.7 billion parameters, on Google Colab's T4 GPU using either Hugging Face or vLLM, but a 7-billion-parameter model such as Mistral 7B cannot run on Colab with Hugging Face or vLLM. Quantization is best suited for managing the GPU hardware requirements of large language models: when GPU availability is limited and we need to run a very large language model, quantization is the best way to load the LLM on constrained devices.
BitsandBytes
BitsandBytes is a Python library built with custom quantization functions for shrinking a model's weights to lower precision (8-bit and 4-bit).
Install BitsandBytes
pip3 install bitsandbytes
Quantization of the Mistral 7B model
Mistral 7B, MistralAI's 7-billion-parameter model, is one of the best state-of-the-art open source large language models. We will go through a step-by-step process for running Mistral 7B with different quantization techniques so that it can run on Google Colab's T4 GPU.
Quantization with 8-bit precision: The weights of the machine learning model are converted to 8-bit precision. BitsandBytes is integrated with Hugging Face Transformers to load the language model, using the same Hugging Face code with a few modifications for quantization.
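A minimal sketch of 8-bit loading with BitsandBytes, assuming the mistralai/Mistral-7B-Instruct-v0.1 checkpoint; the line numbers in the analysis below refer to the article's original snippet rather than this sketch.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Quantization configuration: load the model's weights with 8-bit precision.
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

# Load the model with the quantization config; device_map="auto" lets
# bitsandbytes/accelerate place the weights on the available GPU memory.
model_id = "mistralai/Mistral-7B-Instruct-v0.1"  # assumed checkpoint
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",
)

# Load the tokenizer.
tokenizer = AutoTokenizer.from_pretrained(model_id)
```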
Line 1: We imported the packages required to run the model, including the BitsAndBytesConfig class.
Lines 3-4: We defined the quantization configuration and set the parameter load_in_8bit to True to load the model's weights with 8-bit precision.
Lines 7-9: We passed the quantization configuration to the function that loads the model and set the device_map parameter so that bitsandbytes automatically allocates appropriate GPU memory for loading the model. Finally, we loaded the tokenizer.
Quantization with 4-bit precision: The weights of the machine learning model are converted to 4-bit precision.
The code for loading Mistral 7B with 4-bit precision is the same as for 8-bit precision, except for a few changes (a sketch of the 4-bit configuration follows this list):
- load_in_8bit is changed to load_in_4bit.
- A new parameter, bnb_4bit_compute_dtype, is introduced in BitsandBytesConfig to perform the model's computations in bfloat16. bfloat16 is a compute data type used with the model's weights for faster inference, and it can be used with both 4-bit and 8-bit precision. For 8-bit, simply change the parameter from bnb_4bit_compute_dtype to bnb_8bit_compute_dtype.
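A minimal sketch of the 4-bit configuration, again assuming the mistralai/Mistral-7B-Instruct-v0.1 checkpoint.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Quantization configuration: load in 4-bit and run computations in bfloat16.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = "mistralai/Mistral-7B-Instruct-v0.1"  # assumed checkpoint
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```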
NF4 (4-bit NormalFloat) and double quantization
NF4 (4-bit NormalFloat) is QLoRA's quantization method, an optimal quantization approach that provides better results than standard 4-bit quantization. It is combined with double quantization, in which quantization occurs twice: the quantized weights from the first quantization step are passed to a second quantization step to produce optimal float-range values for the model's weights. According to the QLoRA paper, NF4 with double quantization suffers no loss in accuracy performance. Read more technical details about NF4 and double quantization in the QLoRA paper.
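A minimal sketch of the NF4 configuration with double quantization, assuming the mistralai/Mistral-7B-Instruct-v0.1 checkpoint; the line numbers in the analysis below refer to the article's original snippet rather than this sketch.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Quantization configuration: 4-bit loading with the NF4 quantization type,
# double quantization, and bfloat16 as the compute data type.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load the quantized model weights and the tokenizer.
model_id = "mistralai/Mistral-7B-Instruct-v0.1"  # assumed checkpoint
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```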
Lines 4-9: We set additional parameters in BitsAndBytesConfig:
- load_in_4bit: loading the model with 4-bit precision is set to True.
- bnb_4bit_quant_type: the quantization type is set to nf4.
- bnb_4bit_use_double_quant: double quantization is set to True.
- bnb_4bit_compute_dtype: the bfloat16 compute data type is used for faster inference.
Lines 11-13: We have loaded the model’s weights and tokenizer.
Full code for model quantization
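A minimal sketch of a full quantized-inference script, assuming the mistralai/Mistral-7B-Instruct-v0.1 checkpoint, Mistral's [INST] prompt format, and an illustrative max_new_tokens value.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.1"  # assumed checkpoint

# NF4 quantization with double quantization and bfloat16 compute dtype.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Format the prompt in Mistral's instruction format and generate a response.
prompt = "[INST] What is Natural Language Processing? [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)  # illustrative value
print(tokenizer.decode(output[0]))
```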
generated output
<s> [INST] What is Natural Language Processing? [/INST] Natural Language Processing (NLP) is a subfield of artificial intelligence (AI) and
computer science that deals with the interaction between computers and human language. Its main objective is to read, decipher,
understand, and make sense of the human language in a valuable way. It can be used for various tasks such as speech recognition,
text-to-speech synthesis, sentiment analysis, machine translation, part-of-speech tagging, name entity recognition,
summarization, and question-answering systems. NLP technology allows machines to recognize, understand,
and respond to human language in a more natural and intuitive way, making interactions more accessible and efficient.</s>
Quantization is a very good approach for optimizing the execution of very large language models on small GPUs, and it can be applied to models such as Llama 70B, Falcon 40B, and MPT-30B. The LLM.int8 paper reports that very large language models suffer less accuracy loss when quantized than smaller models do. Quantization is therefore best applied to very large language models and does not work as well for smaller models because of the loss in accuracy performance.
Running Mistral 7B Quantization on Google Colab
Conclusion
This article provided a step-by-step approach to measuring the speed performance of large language models, explained how vLLM works, and showed how it can be used to improve the latency and throughput of large language models. Finally, we described quantization and how it can be used to load large language models on small GPUs.
Contact me via:
Email: olafenwaayoola@gmail.com
LinkedIn: https://www.linkedin.com/in/ayoola-olafenwa-003b901a9/
References