vLLM#

Soniox-7B can be deployed using the vLLM OpenAI-compatible API server and used via the Chat Completions API. The correct conversation template is applied automatically. The server can be deployed either from a Docker image or directly from Python.

With Docker#

On a GPU-enabled host, you can run the Soniox-7B vLLM image with the following command:

docker run --gpus all \
  -e HF_TOKEN=$HF_TOKEN -p 8000:8000 \
  public.ecr.aws/r6l7m9m8/soniox-7b-vllm:latest \
    --host 0.0.0.0 \
    --port 8000 \
    --model soniox/Soniox-7B-v1.0 \
    --max-model-len 8192 \
    --enforce-eager \
    --dtype float16

This will download the model from Hugging Face. Make sure to set HF_TOKEN to your Hugging Face user access token.

Parameters passed to the container are forwarded to the vLLM server. For an explanation of these parameters, see Run vLLM server below.
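
Once the container is running, you can check that the server is reachable by listing the served models. This is a minimal sketch using the OpenAI Python client (covered in more detail under Client API below), assuming the server is listening on localhost:

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="none")

# List the models served by the container; soniox/Soniox-7B-v1.0 should
# appear once the weights have finished downloading and loading.
for model in client.models.list():
    print(model.id)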

Without Docker#

Alternatively, you can directly start the vLLM server on a GPU-enabled host.

Install vLLM#

First you need to install vLLM (or use conda install vllm if you are using Anaconda):

pip3 install -U vllm
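
To confirm the installation, you can print the installed vLLM version from Python (a quick check; the exact version depends on what pip resolved):

import vllm

# Importing vllm and printing its version confirms the package installed correctly.
print(vllm.__version__)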

Log in to Hugging Face#

You will also need to log in to the Hugging Face hub using:

huggingface-cli login
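
If you prefer to authenticate from Python instead of the CLI, the huggingface_hub package (installed as a dependency of vLLM) provides an equivalent login helper. A minimal sketch, assuming your token is available in the HF_TOKEN environment variable:

import os

from huggingface_hub import login

# Log in to the Hugging Face hub with the user access token from HF_TOKEN;
# this mirrors what `huggingface-cli login` does interactively.
login(token=os.environ["HF_TOKEN"])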

Run vLLM server#

You can now use the following command to start the server:

python3 -u -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --port 8000 \
    --model soniox/Soniox-7B-v1.0 \
    --max-model-len 8192 \
    --enforce-eager \
    --dtype float16

Explanation:

  • vllm.entrypoints.openai.api_server is the vLLM OpenAI-compatible API server module.
  • --max-model-len prevents going beyond the context length that the model was trained with.
  • --enforce-eager disables use of CUDA graphs to avoid a GPU memory leak.
  • --dtype specifies the computation data type. We recommend float16.

If you downloaded the model as a zip archive, --model should instead be the path to the Soniox-7B-v1.0 directory extracted from the archive. Note that in this case, clients must pass the same string as the model name, as in the sketch below.
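
For example, if the archive was extracted to /models/Soniox-7B-v1.0 (a hypothetical path used only for illustration), the server would be started with --model /models/Soniox-7B-v1.0 and a client request would look like this:

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="none")

# The model string must match the --model value the server was started with;
# /models/Soniox-7B-v1.0 is a hypothetical path used only for illustration.
choice = client.chat.completions.create(
    model="/models/Soniox-7B-v1.0",
    messages=[{"role": "user", "content": "3*7?"}],
).choices[0]
print(choice.message.content)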

Client API#

Clients should use the Chat Completions API with the vLLM server.

  • Base URL: http://hostname:8000/v1
  • API key: none
  • Model: soniox/Soniox-7B-v1.0

Here is an example of usage from Python. First, install the OpenAI Python client:

pip3 install -U openai

Then use it as follows:

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="none")

choice = client.chat.completions.create(
    model="soniox/Soniox-7B-v1.0",
    messages=[{"role": "user", "content": "3*7?"}],
    temperature=0.5,
).choices[0]

if choice.finish_reason != "stop":
    raise Exception(f"finish_reason is not stop but {choice.finish_reason}")

response = choice.message.content
print(response)
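
The server also supports streaming responses through the same Chat Completions endpoint. A minimal sketch, assuming the same server and model as above:

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="none")

# Request a streamed completion and print tokens as they arrive.
stream = client.chat.completions.create(
    model="soniox/Soniox-7B-v1.0",
    messages=[{"role": "user", "content": "3*7?"}],
    temperature=0.5,
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta is not None:
        print(delta, end="", flush=True)
print()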