SkyPilot is a framework for running LLMs, AI, and batch jobs on any cloud, offering maximum cost savings, highest GPU availability, and managed execution.

Below is an example of the SkyPilot config to deploy Soniox 7B.

SkyPilot Configuration

After installing SkyPilot, create a configuration file that tells SkyPilot how and where to deploy your inference server using our pre-built Docker container:

resources:
  cloud: ${CLOUD_PROVIDER}
  accelerators: A10G:1
  ports:
    - 8000

run: |
  docker run --gpus all -p 8000:8000 \
    --host 0.0.0.0 \
    --port 8000 \
    --model soniox/Soniox-7B-v1.0 \
    --max-model-len 8192 \
    --enforce-eager \
    --dtype float16

Once these environment variables are set, you can use sky launch to launch the inference server with the name soniox-7b:

sky launch -c soniox-7b soniox-7b.yaml --region us-east-1


When deployed this way, the model is reachable by anyone on the internet. You must secure it, either by exposing it only on your private network (change the --host option in the Docker command for that), by placing a load balancer with an authentication mechanism in front of it, or by configuring your instance networking appropriately.
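
One way to keep the server off the public internet, sketched below, is to remove the ports entry from the configuration and reach port 8000 through an SSH tunnel instead. This is an illustration rather than part of the setup described above: the helper name open_tunnel is hypothetical, and it assumes SkyPilot has added the soniox-7b cluster to your SSH configuration so that ssh soniox-7b works.

```shell
# Hypothetical helper: forward a local port to port 8000 on the cluster.
# Usage: open_tunnel <cluster-name> [local-port]
open_tunnel() {
  cluster="$1"
  local_port="${2:-8000}"
  # -N: run no remote command, only forward; -L: local port forwarding.
  ssh -N -L "${local_port}:localhost:8000" "$cluster"
}
```

Running open_tunnel soniox-7b in a separate terminal should then let requests to http://localhost:8000 reach the server for as long as the tunnel stays open.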

Test it out

To easily retrieve the IP address of the deployed soniox-7b cluster you can use:

sky status --ip soniox-7b
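
The server may take some time to download and load the model, so early requests can fail until it is ready. A minimal sketch of a polling helper follows. The name wait_until is hypothetical, and the /v1/models path in the usage note is an assumption based on the OpenAI-style API used in this guide.

```shell
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: wait_until <attempts> <delay-seconds> <command> [args...]
wait_until() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    # Stop as soon as the command exits with status 0.
    if "$@"; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  # Every attempt failed.
  return 1
}
```

For example, with IP set to the address returned above, wait_until 60 10 curl -sf -o /dev/null "http://$IP:8000/v1/models" would retry every 10 seconds for roughly 10 minutes before giving up.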

You can then use curl to send a chat completion request:

IP=$(sky status --ip soniox-7b)

curl http://$IP:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "soniox/Soniox-7B-v1.0",
    "messages": [{"role": "user", "content": "12 * 7?"}],
    "max_tokens": 128