Model Registry Reference

Complete reference for built-in AI models supported by soong.

Overview

soong includes 7 pre-configured models optimized for different use cases. Each model includes:

  • VRAM estimation - Automatic calculation of memory requirements
  • GPU recommendation - Best GPU type for the model
  • Quantization - Memory-efficient compression
  • Use case guidance - What the model is good (and not good) for
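
A registry entry can be pictured as a small record of these fields. A minimal Python sketch of what such an entry might look like (the field names are illustrative assumptions, not soong's actual schema):

from dataclasses import dataclass

@dataclass
class ModelEntry:
    # Hypothetical registry entry; soong's real schema may differ.
    model_id: str            # e.g. "deepseek-r1-70b"
    hf_path: str             # HuggingFace repo path
    params_billions: float   # e.g. 70
    quantization: str        # "fp16", "int4", ...
    context_length: int      # e.g. 8192
    est_vram_gb: float       # drives the GPU recommendation
    min_gpu: str             # e.g. "A100 SXM4 (80 GB)"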

Quick Model Comparison

| Model | Size | VRAM | Min GPU | Best For | Cost/hr |
|-------|------|------|---------|----------|---------|
| Llama 3.1 8B | 8B | 18 GB | A10 (24GB) | Quick tasks, prototyping | ~$0.60 |
| Mistral 7B | 7B | 20 GB | A10 (24GB) | Long contexts, efficiency | ~$0.60 |
| Qwen 2.5 Coder 32B INT4 | 32B | 22 GB | A10 (24GB) | Code on budget | ~$0.60 |
| Code Llama 34B | 34B | 73 GB | A100 (80GB) | Code completion | ~$1.29 |
| Qwen 2.5 Coder 32B | 32B | 69 GB | A100 (80GB) | Fast code generation | ~$1.29 |
| DeepSeek-R1 70B | 70B | 44 GB | A100 (80GB) | Complex reasoning | ~$1.29 |
| Llama 3.1 70B | 70B | 44 GB | A100 (80GB) | General purpose | ~$1.29 |

Built-in Models

DeepSeek-R1 70B

Best for complex coding tasks requiring deep reasoning.

Model ID: deepseek-r1-70b
HuggingFace: deepseek-ai/DeepSeek-R1-Distill-Llama-70B
Parameters: 70 billion
Quantization: INT4 (GPTQ/AWQ)
Context: 8,192 tokens
Est. VRAM: 44 GB
Min GPU: 1x A100 SXM4 (80 GB)

VRAM Breakdown

| Component | Memory |
|-----------|--------|
| Base weights | 35.0 GB |
| KV cache | 4.0 GB |
| CUDA overhead | 2.0 GB |
| Activations | 3.5 GB |
| Total | 44.5 GB |

Use Cases

Good for:

  • ✅ Complex multi-step reasoning
  • ✅ Debugging difficult issues
  • ✅ Architecture decisions
  • ✅ Code review with explanations
  • ✅ Chain-of-thought problem solving

Not good for:

  • ❌ Simple/quick tasks (overkill)
  • ❌ Long context windows (8K limit)
  • ❌ Speed-critical applications

Notes

Chain-of-thought reasoning model. Slower but more accurate than general-purpose models. This variant distills DeepSeek-R1's reinforcement-learning-trained reasoning into a Llama 70B base.


Qwen 2.5 Coder 32B

Fast coding specialist with exceptional context length.

Model ID: qwen2.5-coder-32b
HuggingFace: Qwen/Qwen2.5-Coder-32B-Instruct
Parameters: 32 billion
Quantization: FP16
Context: 32,768 tokens (4x longer than DeepSeek)
Est. VRAM: 69 GB
Min GPU: 1x A100 SXM4 (80 GB)

VRAM Breakdown

| Component | Memory |
|-----------|--------|
| Base weights | 64.0 GB |
| KV cache | 4.0 GB |
| CUDA overhead | 2.0 GB |
| Activations | 6.4 GB |
| Total | 76.4 GB |

Use Cases

Good for:

  • ✅ Code generation and completion
  • ✅ Large file refactoring (32K context)
  • ✅ Fast iteration cycles
  • ✅ Multiple programming languages
  • ✅ Reviewing entire modules at once

Not good for:

  • ❌ Complex reasoning chains
  • ❌ Non-coding tasks
  • ❌ Tasks requiring world knowledge

Notes

Purpose-built for code. Supports 80+ programming languages. 4x longer context than DeepSeek makes it ideal for working with large files.


Qwen 2.5 Coder 32B INT4

Budget-friendly quantized version of Qwen Coder.

Model ID: qwen2.5-coder-32b-int4
HuggingFace: Qwen/Qwen2.5-Coder-32B-Instruct-AWQ
Parameters: 32 billion
Quantization: INT4 (AWQ)
Context: 32,768 tokens
Est. VRAM: 22 GB
Min GPU: 1x A10 (24 GB)

VRAM Breakdown

| Component | Memory |
|-----------|--------|
| Base weights | 16.0 GB |
| KV cache | 4.0 GB |
| CUDA overhead | 2.0 GB |
| Activations | 1.6 GB |
| Total | 23.6 GB |

Use Cases

Good for:

  • ✅ Same tasks as Qwen FP16
  • ✅ Cheaper GPU requirements (A10 vs A100)
  • ✅ Good quality/cost ratio
  • ✅ Cost-conscious development

Not good for:

  • ❌ Maximum accuracy (slight quality loss)
  • ❌ Tasks where FP16 precision matters

Notes

~5% quality loss vs FP16, but runs on much cheaper GPUs. AWQ (Activation-aware Weight Quantization) preserves accuracy better than naive INT4 rounding.


Llama 3.1 70B

General-purpose powerhouse for diverse tasks.

Model ID: llama-3.1-70b
HuggingFace: meta-llama/Llama-3.1-70B-Instruct
Parameters: 70 billion
Quantization: INT4
Context: 8,192 tokens
Est. VRAM: 44 GB
Min GPU: 1x A100 SXM4 (80 GB)

VRAM Breakdown

| Component | Memory |
|-----------|--------|
| Base weights | 35.0 GB |
| KV cache | 4.0 GB |
| CUDA overhead | 2.0 GB |
| Activations | 3.5 GB |
| Total | 44.5 GB |

Use Cases

Good for:

  • ✅ Broad task coverage
  • ✅ Instruction following
  • ✅ Writing and documentation
  • ✅ Code + general knowledge
  • ✅ Balanced performance

Not good for:

  • ❌ Pure coding (Qwen better)
  • ❌ Deep reasoning (DeepSeek better)
  • ❌ Very long contexts

Notes

Jack of all trades. Good baseline choice when task type is unclear. Strong instruction following and safer outputs due to extensive RLHF training.


Llama 3.1 8B

Fast and economical for simple tasks.

Model ID: llama-3.1-8b
HuggingFace: meta-llama/Llama-3.1-8B-Instruct
Parameters: 8 billion
Quantization: FP16
Context: 8,192 tokens
Est. VRAM: 18 GB
Min GPU: 1x A10 (24 GB)

VRAM Breakdown

| Component | Memory |
|-----------|--------|
| Base weights | 16.0 GB |
| KV cache | 4.0 GB |
| CUDA overhead | 2.0 GB |
| Activations | 1.6 GB |
| Total | 23.6 GB |

Use Cases

Good for:

  • ✅ Quick simple tasks
  • ✅ Lowest cost
  • ✅ Rapid prototyping
  • ✅ Simple code changes
  • ✅ Testing workflows

Not good for:

  • ❌ Complex reasoning
  • ❌ Large codebases
  • ❌ Nuanced decisions
  • ❌ Multi-step tasks

Notes

Use when speed/cost matters more than quality. Great for testing infrastructure or simple repetitive tasks.


Code Llama 34B

Meta's dedicated coding model (older generation).

Model ID: codellama-34b
HuggingFace: codellama/CodeLlama-34b-Instruct-hf
Parameters: 34 billion
Quantization: FP16
Context: 16,384 tokens
Est. VRAM: 73 GB
Min GPU: 1x A100 SXM4 (80 GB)

VRAM Breakdown

| Component | Memory |
|-----------|--------|
| Base weights | 68.0 GB |
| KV cache | 4.0 GB |
| CUDA overhead | 2.0 GB |
| Activations | 6.8 GB |
| Total | 80.8 GB |

Use Cases

Good for:

  • ✅ Code completion
  • ✅ Infilling (fill-in-the-middle)
  • ✅ 16K context window
  • ✅ Python, Java, C++

Not good for:

  • ❌ Latest techniques (2023 model)
  • ❌ Complex reasoning
  • ❌ Non-code tasks

Notes

Older but still capable. Consider Qwen 2.5 Coder as a newer alternative with better performance.


Mistral 7B

Efficient small model with a 32K context window.

Model ID: mistral-7b
HuggingFace: mistralai/Mistral-7B-Instruct-v0.3
Parameters: 7 billion
Quantization: FP16
Context: 32,768 tokens
Est. VRAM: 20 GB
Min GPU: 1x A10 (24 GB)

VRAM Breakdown

| Component | Memory |
|-----------|--------|
| Base weights | 14.0 GB |
| KV cache | 4.0 GB |
| CUDA overhead | 2.0 GB |
| Activations | 1.4 GB |
| Total | 21.4 GB |

Use Cases

Good for:

  • ✅ Long documents (32K context)
  • ✅ Low resource usage
  • ✅ Fast inference
  • ✅ Cost-efficient long-context work

Not good for:

  • ❌ Complex coding tasks
  • ❌ Tasks needing large model capacity

Notes

Great efficiency but limited capacity. The Instruct v0.3 release supports a full 32,768-token context window.


Model Selection Guide

By Use Case

| Task | Recommended Model | Why |
|------|-------------------|-----|
| Complex debugging | DeepSeek-R1 70B | Chain-of-thought reasoning |
| Large file refactoring | Qwen 2.5 Coder 32B | 32K context window |
| Budget coding | Qwen 2.5 Coder 32B INT4 | Good quality, cheap GPU |
| General purpose | Llama 3.1 70B | Balanced performance |
| Quick testing | Llama 3.1 8B | Fast and cheap |
| Long documents | Mistral 7B | 32K context, efficient |

By GPU Budget

| GPU Type | VRAM | Suitable Models |
|----------|------|-----------------|
| A10 (24 GB) | 24 GB | Llama 8B, Mistral 7B, Qwen 32B INT4 |
| A6000 (48 GB) | 48 GB | DeepSeek-R1 70B, Llama 70B |
| A100 (80 GB) | 80 GB | All models, especially Qwen 32B FP16 |
| H100 (80 GB) | 80 GB | All models with faster inference |

By Context Length Needs

| Context Needed | Models |
|----------------|--------|
| 8K tokens | DeepSeek-R1, Llama 8B/70B |
| 16K tokens | Code Llama 34B |
| 32K tokens | Qwen 2.5 Coder, Mistral 7B |

VRAM Estimation Formula

For any model, VRAM is estimated as:

Total VRAM = Base + KV Cache + Overhead + Activations

Base        = params_billions × bytes_per_param
KV Cache    = min(4.0, context_length / 2048) GB
Overhead    = 2.0 GB (CUDA, framework)
Activations = Base × 0.1

Quantization Levels

| Quantization | Bytes/Param | Memory vs FP32 | Quality Loss |
|--------------|-------------|----------------|--------------|
| FP32 | 4.0 | 100% | 0% (baseline) |
| FP16/BF16 | 2.0 | 50% | ~0-1% |
| INT8 | 1.0 | 25% | ~3-5% |
| INT4 (GPTQ/AWQ) | 0.5 | 12.5% | ~5-8% |
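
The formula above, combined with the bytes-per-param column in this table, is enough to reproduce each model's VRAM breakdown. A minimal Python sketch (illustrative only, not soong's actual implementation):

# Bytes per parameter for each quantization level (from the table above).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

def estimate_vram_gb(params_billions: float, quantization: str, context_length: int) -> float:
    """Apply the documented formula: Base + KV Cache + Overhead + Activations."""
    base = params_billions * BYTES_PER_PARAM[quantization]  # weights, in GB
    kv_cache = min(4.0, context_length / 2048)              # GB, capped at 4 GB
    overhead = 2.0                                          # CUDA + framework
    activations = base * 0.1
    return base + kv_cache + overhead + activations

# DeepSeek-R1 70B at INT4: 35.0 + 4.0 + 2.0 + 3.5 = 44.5 GB,
# matching its VRAM breakdown table above.
print(estimate_vram_gb(70, "int4", 8192))  # 44.5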

Custom Models

You can add your own models using the CLI or config file.

Via CLI

# Interactive
soong models add

# With flags
soong models add \
  --name my-model-70b \
  --hf-path myorg/custom-llama-70b \
  --params 70 \
  --quantization int4 \
  --context 8192

Via Config File

Edit ~/.config/gpu-dashboard/config.yaml:

custom_models:
  my-model-70b:
    hf_path: myorg/custom-llama-70b
    params_billions: 70
    quantization: int4
    context_length: 8192
    notes: Fine-tuned on domain data

See Configuration Reference for full schema.
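
To sanity-check a custom model's memory requirements before launching it, the config fields map directly onto the VRAM formula above. An illustrative sketch using PyYAML (the config path comes from this page; reusing estimate_vram_gb from the VRAM Estimation Formula section is an assumption, not a soong feature):

from pathlib import Path

import yaml  # pip install pyyaml

config_path = Path.home() / ".config/gpu-dashboard/config.yaml"
config = yaml.safe_load(config_path.read_text())

for name, spec in (config.get("custom_models") or {}).items():
    # estimate_vram_gb is the sketch from the VRAM Estimation Formula section.
    vram = estimate_vram_gb(
        spec["params_billions"], spec["quantization"], spec["context_length"]
    )
    print(f"{name}: ~{vram:.1f} GB estimated VRAM")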


Listing Models

# List all models (built-in + custom)
soong models

# Get detailed info
soong models info deepseek-r1-70b

Model Update Policy

Built-in models are updated when:

  1. New high-quality models release (e.g., Llama 4, GPT-4 class)
  2. Better quantization methods (e.g., ExLlamaV2, GGUF improvements)
  3. Lambda Labs adds new GPUs (requiring different recommendations)

Check releases for model registry updates.


Performance Benchmarks

Approximate tokens/second on Lambda Labs GPUs:

| Model | A10 (24GB) | A100 (80GB) | H100 (80GB) |
|-------|------------|-------------|-------------|
| Llama 8B FP16 | 80 tok/s | 120 tok/s | 200 tok/s |
| Qwen 32B INT4 | 40 tok/s | 60 tok/s | 100 tok/s |
| DeepSeek 70B INT4 | N/A | 30 tok/s | 50 tok/s |
| Qwen 32B FP16 | N/A | 45 tok/s | 75 tok/s |

Benchmarks are approximate; actual performance varies with context length and prompt complexity.


Frequently Asked Questions

Why INT4 for 70B models?

INT4 quantization (GPTQ/AWQ) lets 70B models fit on a single 80 GB GPU with minimal quality loss (~5-8%): at 0.5 bytes per parameter, the weights take ~35 GB, versus ~140 GB at FP16, which would require a multi-GPU setup.

Can I use GGUF models?

Currently, soong uses SGLang, which doesn't support GGUF. Models must be in HuggingFace Transformers format (safetensors).

What about LoRA adapters?

Custom LoRA adapters can be loaded by specifying the adapter path in the cloud-init scripts; this is not yet automated in the CLI.

How do I choose between DeepSeek and Qwen?

  • DeepSeek: Better for complex reasoning, architecture decisions, debugging
  • Qwen: Better for code generation, refactoring, working with large files

Why is Qwen FP16 so expensive (69 GB)?

FP16 stores 2 bytes per parameter versus 0.5 for INT4, so the 32B weights alone take 64 GB rather than 16 GB; with overhead, the total reaches ~69 GB. Use the INT4 version if cost matters.


See Also