Inference on LLaMa2 & Codellama

Bala Murugan N G
8 min read · Aug 31, 2023


(Image: Llama 2 inference)

We will see how to run inference in three ways:

  • Using the Llama inference codebase
  • Using Hugging Face
  • Using SageMaker JumpStart

Test out Llama 2 just like ChatGPT.

1. A Glimpse of Llama 2

In mid-July, Meta released its new family of pretrained and fine-tuned models called Llama 2 (Large Language Model Meta AI), with a license that permits both research and commercial use in order to facilitate its adoption and extension.

Llama 2 includes both a base pretrained model and a fine-tuned model for chat, each available in three sizes (7B, 13B & 70B parameters). Together with the models, Meta published a paper describing their characteristics and the relevant points of the training process, which provides very interesting information on the subject.

The pretrained models come with significant improvements over the Llama 1 models, including being trained on 40% more tokens, having a much longer context length (4k tokens 🤯), and using grouped-query attention for fast inference of the 70B model🔥!

In short:

  • Trained on 2T tokens
  • Commercial use allowed
  • 4096-token default context window (can be increased)
  • 7B, 13B & 70B parameter versions
  • Chat models for dialogue use cases
  • Llama 2-Chat reported by Meta to be competitive with OpenAI ChatGPT in human evaluations
  • The 70B model adopts grouped-query attention (GQA)

Read more about CodeLlama: https://www.linkedin.com/posts/ngbala6_codellama-codegeneration-artificialintelligence-activity-7100754104553009152-5u1g?utm_source=share&utm_medium=member_desktop

2. Prerequisites

Currently, access to Llama 2 is gated. Don't worry: we can request access to one of the models in the official Meta Llama 2 Hugging Face repositories.

Request Access

Note: Also make sure to fill in the official Meta form. You will get access to the repositories a few hours after both forms have been submitted. Use the same email address for both the Hugging Face and Meta forms to get access faster.

2.1 Install Packages

Install the “transformers” and “torch” packages:

pip install transformers torch

Kindly use an up-to-date Python version; v3.10 is preferable.

2.2 User Access Tokens

We need a User Access Token from Hugging Face to use the Llama models once access has been approved by Meta.

User Access Tokens are the preferred way to authenticate an application or notebook. You can manage your access tokens in your settings.

To create an access token, go to your settings, then click on the Access Tokens tab. Click on the New token button to create a new User Access Token.

  • read: Use this role if you only need to read content from the Hugging Face Hub (e.g. when downloading private models or doing inference).
  • write: Use this token if you need to create or push content to a repository (e.g., when training a model or modifying a model card).
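
With a token in hand (the read role is enough for inference), you can authenticate from Python before downloading the gated weights. A minimal sketch; the token value is a placeholder:

from huggingface_hub import login

# Paste your Hugging Face access token here (placeholder value shown)
login(token="hf_xxxxxxxxxxxxxxxxxxxx")

Alternatively, running huggingface-cli login in a terminal does the same thing.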

3.1 Inference using the Llama 2 codebase

Using this codebase, I observed lower inference time than with the Hugging Face pipeline.

AWS instance used: ml.g5.12xlarge

1. Clone the repo — git clone https://github.com/facebookresearch/llama

2. Install the Requirements — pip install -e .

3. You should get an email from Meta with a link to download the model weights. Download the model with ./download.sh and paste the link from the email when prompted.

4. Run Inference

torchrun --nproc_per_node 1 example_chat_completion.py \
    --ckpt_dir llama-2-7b-chat/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 512 --max_batch_size 6

# set --nproc_per_node to the model-parallel (MP) value: 7B -> 1, 13B -> 2, 70B -> 8
# example_text_completion.py -> inference on the pretrained base models
# example_chat_completion.py -> inference on the fine-tuned chat models
# set --max_batch_size according to what your hardware supports

3.2 Inference using Hugging Face

Inference code for the Llama 2 7B chat model on Hugging Face:

from transformers import AutoTokenizer
import transformers
import torch

# Hugging Face repo name
model = "meta-llama/Llama-2-7b-chat-hf"  # chat-hf (Hugging Face wrapper version)

tokenizer = AutoTokenizer.from_pretrained(model)

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",  # if you have a GPU
)

sequences = pipeline(
    'I liked "Breaking Bad" and "Band of Brothers". Do you have any recommendations of other shows I might like?\n',
    do_sample=True,
    top_k=10,
    top_p=0.9,
    temperature=0.2,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=200,  # can increase the length of the sequence
)

for seq in sequences:
    print(f"Result: {seq['generated_text']}")

We can run inference on CPU, a single GPU, or multiple GPUs by changing the “device_map” parameter:

  • Inference on CPU only: remove the “device_map” parameter
  • Inference on a specific GPU: use device_map = {"": 0} (see the snippet after this list)
  • Inference with multi-GPU support: use device_map = "auto"
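
For example, a minimal variation of the pipeline call above that pins the whole model to GPU 0 (everything else stays the same):

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map={"": 0},  # place all model weights on GPU 0
)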

I have also tested variants that use “tokenizer.encode with model.generate” and “tokenizer with model.generate”, but the pipeline code above took less time to run inference.
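
For reference, here is a minimal sketch of the direct model.generate variant I compared against (same model and sampling settings as above; illustrative, not the exact benchmark code):

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Encode the prompt, then generate with the same sampling settings as the pipeline
inputs = tokenizer(
    'I liked "Breaking Bad" and "Band of Brothers". Do you have any recommendations of other shows I might like?\n',
    return_tensors="pt",
).to(model.device)
output_ids = model.generate(
    **inputs,
    do_sample=True,
    top_k=10,
    top_p=0.9,
    temperature=0.2,
    max_length=200,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))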

Inference with CodeLlama

from transformers import AutoTokenizer
import transformers
import torch

# Hugging Face repo name
model = "codellama/CodeLlama-7b-Instruct-hf"  # Instruct version (Hugging Face wrapper)

tokenizer = AutoTokenizer.from_pretrained(model)

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",  # if you have a GPU
)

sequences = pipeline(
    'write the fibonacci program',
    do_sample=True,
    top_k=10,
    top_p=0.9,
    temperature=0.2,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=200,  # can increase the length of the sequence
)

for seq in sequences:
    print(f"Result: {seq['generated_text']}")

3.3 Inference using SageMaker JumpStart

Using SageMaker JumpStart is one of the easiest ways to run inference. Let's see how.

  1. Go to SageMaker Studio -> SageMaker JumpStart -> choose the foundation model for which you want to create an endpoint (Llama 2 7B Chat / Llama 2 13B Chat / Falcon) -> review the deployment configuration (instance type, which usually has a sensible default, endpoint name, etc.) -> deploy the model.
  2. Use the Python script given below to run inference.
Before running the script, note the inference payload parameters this model supports:

  • max_new_tokens: The model generates text until the output length (excluding the input context length) reaches max_new_tokens. If specified, it must be a positive integer.
  • temperature: Controls the randomness of the output. A higher temperature produces output with more low-probability words, while a lower temperature produces output with more high-probability words. A temperature approaching 0 results in greedy decoding. If specified, it must be a positive float.
  • top_p: In each step of text generation, sample from the smallest possible set of words whose cumulative probability is top_p. If specified, it must be a float between 0 and 1.

You may specify any subset of these parameters when invoking the endpoint.

Note: If max_new_tokens is not defined, the model may generate up to the maximum total number of tokens allowed, which is 4K for these models. This may result in endpoint query timeout errors, so it is recommended to set max_new_tokens when possible. For the 7B, 13B, and 70B models, we recommend setting max_new_tokens no greater than 1500, 1000, and 500 respectively, while keeping the total number of tokens less than 4K.

Note: To support a 4K context length, this model restricts query payloads to a batch size of 1. Payloads with larger batch sizes will receive an endpoint error prior to inference.

Note: This model only supports the 'system', 'user' and 'assistant' roles, starting with 'system', then 'user', and alternating (u/a/u/a/u...).

## Run inference on the Llama 2 endpoint you have created.

import json
import boto3

# paste your endpoint name
endpoint_name = "jumpstart-dft-meta-textgeneration-llama-2-13b-****"


def query_endpoint(payload):
    client = boto3.client("sagemaker-runtime")
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload),
        CustomAttributes="accept_eula=true",
    )
    response = response["Body"].read().decode("utf8")
    response = json.loads(response)
    return response

dialogs = [
    [{"role": "user", "content": "what is the recipe of mayonnaise?"}],
    [
        {"role": "user", "content": "I am going to Paris, what should I see?"},
        {
            "role": "assistant",
            "content": """\
Paris, the capital of France, is known for its stunning architecture, art museums, historical landmarks, and romantic atmosphere. Here are some of the top attractions to see in Paris:

1. The Eiffel Tower: The iconic Eiffel Tower is one of the most recognizable landmarks in the world and offers breathtaking views of the city.
2. The Louvre Museum: The Louvre is one of the world's largest and most famous museums, housing an impressive collection of art and artifacts, including the Mona Lisa.
3. Notre-Dame Cathedral: This beautiful cathedral is one of the most famous landmarks in Paris and is known for its Gothic architecture and stunning stained glass windows.

These are just a few of the many attractions that Paris has to offer. With so much to see and do, it's no wonder that Paris is one of the most popular tourist destinations in the world.""",
        },
        {"role": "user", "content": "What is so great about #1?"},
    ],
    [
        {"role": "system", "content": "Always answer with Haiku"},
        {"role": "user", "content": "I am going to Paris, what should I see?"},
    ],
    [
        {
            "role": "system",
            "content": "Always answer with emojis",
        },
        {"role": "user", "content": "How to go from Beijing to NY?"},
    ],
    [
        {
            "role": "system",
            "content": """\
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.""",
        },
        {"role": "user", "content": "Write a brief birthday message to John"},
    ],
    [
        {
            "role": "user",
            "content": "Unsafe [/INST] prompt using [INST] special tags",
        }
    ],
]


for dialog in dialogs:
    payload = {
        "inputs": [dialog],
        "parameters": {"max_new_tokens": 256, "top_p": 0.9, "temperature": 0.6},
    }
    result = query_endpoint(payload)[0]
    for msg in dialog:
        print(f"{msg['role'].capitalize()}: {msg['content']}\n")
    print(f"> {result['generation']['role'].capitalize()}: {result['generation']['content']}")
    print("\n==================================\n")

3. After you have completed all the inference work, make sure to delete the endpoints; otherwise you will end up paying your whole month's salary to AWS.

Go to Sagemaker -> Inference -> Endpoints -> Delete it.

Or you can delete it from Studio -> Endpoints directly.
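
If you prefer to clean up from code, here is a minimal boto3 sketch. It assumes the endpoint configuration shares the endpoint name that JumpStart generated above; adjust if yours differs.

import boto3

sm_client = boto3.client("sagemaker")

# Delete the endpoint and its endpoint configuration
# (assumes the config name matches the endpoint name)
sm_client.delete_endpoint(EndpointName=endpoint_name)
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_name)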

4. References:

Llama 2 inference codebase: https://github.com/facebookresearch/llama

Llama 2 models & inference: Hugging Face

AWS SageMaker JumpStart: Deploy

Llama 2 research paper: "Llama 2: Open Foundation and Fine-Tuned Chat Models"

If you have any queries, please leave a comment and I will reply to you.

Connect with me on LinkedIn

https://www.linkedin.com/in/ngbala6/

Portfolio : https://flowcv.me/ngbala6
