TensorRT-LLM#

Install the engine#

To install TensorRT-LLM, follow the official TensorRT-LLM documentation.
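
As one possible route, recent TensorRT-LLM releases can be installed from NVIDIA's PyPI index. This is a minimal sketch assuming a Linux host with a supported CUDA toolkit and Python version; the official installation guide remains the authoritative reference and also covers the Docker-based setup.

# Sketch only: install TensorRT-LLM from NVIDIA's PyPI index. Check the
# official guide for the exact prerequisites and pinned versions.
pip3 install tensorrt_llm --extra-index-url https://pypi.nvidia.com

# Quick sanity check that the package imports.
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"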

To build and run the engine, you can follow the TensorRT-LLM Mistral example, adjusting its parameters for Soniox 7B. Below are build and run commands with the correct settings:

Build#

python3 build.py \
  --model_dir HF_MODEL_DIR \
  --dtype float16 \
  --remove_input_padding \
  --use_gpt_attention_plugin float16 \
  --enable_context_fmha \
  --use_gemm_plugin float16 \
  --output_dir ENGINE_DIR \
  --max_input_len 7168 \
  --max_output_len 7168 \
  --max_num_tokens 8192 \
  --max_batch_size 1

Note

  • HF_MODEL_DIR is the path to the local model directory. If you downloaded a zip file, it should be the extracted Soniox-7B-1.0 directory. If you downloaded the model from Hugging Face, you can find it in the HF cache at a location like ~/.cache/huggingface/hub/models--soniox--Soniox-7B-v1.0/snapshots/COMMIT_HASH. If you have not downloaded the model yet, see the download sketch after this note.
  • ENGINE_DIR is the path where the engine will be built.
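
If you have not downloaded the model yet, one way to fetch it is with the Hugging Face CLI. This is a sketch assuming the model is published as soniox/Soniox-7B-v1.0 (the repository id implied by the cache path above) and that the huggingface_hub package is installed:

# Sketch: download the model into a plain local directory that can then be
# passed as HF_MODEL_DIR. The repository id is inferred from the cache path
# shown in the note above.
huggingface-cli download soniox/Soniox-7B-v1.0 --local-dir ./Soniox-7B-v1.0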

Run#

python3 ../run.py \
  --max_input_len 7168 \
  --max_output_len 7168 \
  --tokenizer_dir HF_MODEL_DIR \
  --engine_dir ENGINE_DIR \
  --no_prompt_template \
  --add_special_tokens=false \
  --input_text "<s>[CLS:soniox] [INST] 12 plus 21? [/INST] 33.</s> [INST] Five minus one? [/INST]"

Make sure to use the correct conversation template: as in the --input_text example above, the prompt starts with <s>[CLS:soniox], each user turn is wrapped in [INST] ... [/INST], and each completed exchange ends with the assistant's reply followed by </s>.
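
As an illustration, here is a small Python sketch that assembles a prompt in this format. The build_prompt helper and the list-of-turns structure are assumptions for this sketch, not part of run.py or the model's tooling; only the token layout mirrors the --input_text example above.

def build_prompt(turns, system_tag="[CLS:soniox]"):
    # turns: list of (user_message, assistant_reply_or_None) tuples;
    # a None reply marks the turn the model should complete.
    prompt = "<s>" + system_tag
    for user, assistant in turns:
        prompt += f" [INST] {user} [/INST]"
        if assistant is not None:
            # A completed exchange ends with the reply followed by </s>.
            prompt += f" {assistant}</s>"
    return prompt

print(build_prompt([
    ("12 plus 21?", "33."),
    ("Five minus one?", None),
]))
# -> <s>[CLS:soniox] [INST] 12 plus 21? [/INST] 33.</s> [INST] Five minus one? [/INST]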

Deploy the engine#

Once the engine is built, it can be deployed with Triton Inference Server and its TensorRT-LLM backend. Follow the official documentation of the backend.
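
For orientation, the rough shape of such a deployment is sketched below. The container tag and paths are placeholders, and the TensorRT-LLM backend documentation defines the actual model repository layout and configuration (engine location, tokenizer settings, batching options).

# Sketch only: serve a Triton model repository prepared according to the
# TensorRT-LLM backend documentation. The image tag XX.YY is a placeholder and
# must match the TensorRT-LLM version used to build the engine; /path/to/model_repo
# is the prepared model repository containing the built engine.
docker run --rm --gpus all \
  -v /path/to/model_repo:/models \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  nvcr.io/nvidia/tritonserver:XX.YY-trtllm-python-py3 \
  tritonserver --model-repository=/models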