Cost Optimization¶

Strategies to minimize GPU costs while maintaining productivity.

Overview¶

GPU instances can be expensive ($0.60-$2.00+/hour). This guide shows how to optimize costs without sacrificing development quality.

Cost Structure¶

Understanding the pricing model:

Total Cost = GPU Rate × Uptime Hours

Key variables:

GPU Rate: Determined by GPU type ($0.60/hr for A10, $0.80/hr for A6000, $1.29/hr for A100)
Uptime Hours: How long the instance runs

Optimization targets:

Choose the cheapest GPU that fits your model
Minimize unnecessary uptime
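
As a rough illustration of the pricing formula, the sketch below computes session cost from the rates quoted in this guide. The `GPU_RATES` table and the `session_cost` helper are hypothetical names introduced here for illustration; they are not part of the `soong` CLI.

```python
# Hourly rates quoted in this guide (USD). Illustrative only; check current pricing.
GPU_RATES = {"A10": 0.60, "A6000": 0.80, "A100": 1.29}


def session_cost(gpu: str, uptime_hours: float) -> float:
    """Total Cost = GPU Rate x Uptime Hours, rounded to cents."""
    return round(GPU_RATES[gpu] * uptime_hours, 2)


if __name__ == "__main__":
    # Compare the same 4-hour session across GPU types.
    for gpu in GPU_RATES:
        print(f"{gpu}: ${session_cost(gpu, 4):.2f} for a 4-hour session")
```

Both variables scale the total linearly, so halving uptime saves as much as halving the hourly rate.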

GPU Selection Strategies¶

Rule: Use the Smallest GPU That Fits¶

graph TD
    A[Model needs 26 GB] --> B{GPU Options}
    B --> C[A10: 24 GB ❌]
    B --> D[A6000: 48 GB ✅]
    B --> E[A100: 80 GB ✅]

    D --> F[$0.80/hr - BEST]
    E --> G[$1.29/hr - Overpaying]

    style F fill:#90EE90
    style G fill:#FFB6C1

Example: Qwen2.5-Coder 32B INT4

Needs: 26 GB VRAM
❌ A10 24GB ($0.60/hr): Too small
✅ A6000 48GB ($0.80/hr): Best choice
⚠️ A100 80GB ($1.29/hr): Fits, but overpaying by $0.49/hr

Savings: $0.49/hr × 4 hours = **$1.96 saved per session**
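
The selection rule can be sketched as a filter followed by a minimum: keep the GPUs with enough VRAM, then take the cheapest. The `Gpu` type and `cheapest_fit` function are assumptions made for this example, using the VRAM sizes and rates from this guide.

```python
from typing import NamedTuple, Optional, Sequence


class Gpu(NamedTuple):
    name: str
    vram_gb: int
    rate_per_hour: float  # USD


# VRAM sizes and rates as quoted in this guide.
GPUS = [
    Gpu("A10", 24, 0.60),
    Gpu("A6000", 48, 0.80),
    Gpu("A100", 80, 1.29),
]


def cheapest_fit(required_vram_gb: float, gpus: Sequence[Gpu] = GPUS) -> Optional[Gpu]:
    """Return the cheapest GPU with enough VRAM, or None if nothing fits."""
    candidates = [g for g in gpus if g.vram_gb >= required_vram_gb]
    return min(candidates, key=lambda g: g.rate_per_hour, default=None)


if __name__ == "__main__":
    choice = cheapest_fit(26)  # the 26 GB example above
    print(choice.name if choice else "no GPU fits")
```

In practice it may be worth adding headroom to the VRAM requirement for long contexts, since a model that barely fits can still run out of memory.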

Use INT4 When Quality Allows¶

Quantization reduces VRAM requirements, enabling cheaper GPUs:

Model	Quantization	VRAM	GPU	Cost/hr	Quality Loss
Qwen2.5-Coder 32B	FP16	46 GB	A6000	$0.80	0% (baseline)
Qwen2.5-Coder 32B	INT4	26 GB	A10	$0.60	~5%

Savings: $0.20/hr × 4 hours = **$0.80 saved per session**

When to use INT4:

✅ Code generation and completion
✅ Simple refactoring
✅ Prototyping
❌ Production code review
❌ Complex reasoning tasks

Model Selection by Budget¶

graph LR
    A[Budget?] --> B{Tier}

    B -->|Low $0.60/hr| C[Llama 3.1 8B]
    B -->|Medium $0.80/hr| D[Qwen2.5-Coder 32B]
    B -->|High $1.29/hr| E[DeepSeek-R1 70B]

    C --> F[A10 24GB]
    D --> G[A6000 48GB]
    E --> H[A100 80GB]

Cost comparison (4-hour session):

Tier	Model	GPU	Session Cost
Budget	Llama 3.1 8B	A10	$2.40
Medium	Qwen2.5-Coder 32B INT4	A10	$2.40
Medium+	Qwen2.5-Coder 32B	A6000	$3.20
Premium	DeepSeek-R1 70B	A100	$5.16

Strategy: Start with medium tier (Qwen INT4), escalate to premium only when needed.

Lease Management Strategies¶

Use Shorter Initial Leases¶

Start with 2 hours, extend as needed:

# Start short
soong start --hours 2  # $1.60

# Extend if needed (only if still working)
soong extend 2  # +$1.60, total $3.20

vs. starting with 4 hours:

# Start long
soong start --hours 4  # $3.20

# Finish after 2.5 hours but leave the instance running:
# the remaining 1.5 hours bills $1.20 for idle time

Savings: Avoid paying for unused time.
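
To see how lease length bounds the worst case, the sketch below estimates the cost of unused time when an instance is left running until its lease expires. It assumes billing continues until lease expiry, ignores the idle timeout, and assumes extensions come in fixed increments; `billed_hours` and `unused_cost` are hypothetical helpers, not `soong` features.

```python
import math

RATE = 0.80  # A6000 rate used in the examples above (USD/hour)


def billed_hours(work_hours: float, initial_lease: float, extend_by: float = 2.0) -> float:
    """Hours billed if the instance runs until its (extended) lease expires."""
    if work_hours <= initial_lease:
        return initial_lease
    # Extend in fixed increments until the lease covers the work.
    extensions = math.ceil((work_hours - initial_lease) / extend_by)
    return initial_lease + extensions * extend_by


def unused_cost(work_hours: float, initial_lease: float,
                extend_by: float = 2.0, rate: float = RATE) -> float:
    """Cost of the lease time left over after the work is done."""
    idle = billed_hours(work_hours, initial_lease, extend_by) - work_hours
    return round(idle * rate, 2)


if __name__ == "__main__":
    # 1.5 hours of real work under a 2-hour vs a 4-hour initial lease.
    print(unused_cost(1.5, initial_lease=2))
    print(unused_cost(1.5, initial_lease=4))
```

Stopping the instance manually as soon as the work is done avoids this cost entirely; the shorter lease only limits the damage when that step is forgotten.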

Monitor Lease Time¶

Set reminders to check status:

# Every hour
watch -n 3600 soong status

Or add to shell prompt:

# Add to ~/.bashrc or ~/.zshrc (zsh also needs: setopt PROMPT_SUBST)
gpu_status() {
  soong status 2>/dev/null | grep "Time Left" | awk '{print $NF}'
}

PS1='[GPU: $(gpu_status)] $ '

Extend Before Expiration¶

Avoid unexpected termination:

Time Left: 45m  # Yellow warning

Action: Extend now rather than waiting until 5 minutes remain:

soong extend 2

Development Workflow Optimization¶

Batch Work Sessions¶

Instead of multiple small sessions, batch work:

# ❌ Inefficient: 3 separate sessions
Session 1: 1 hour ($0.80)
Session 2: 1.5 hours ($1.20)
Session 3: 0.5 hours ($0.40)
Total: 3 hours, 3 launches = $2.40

# ✅ Efficient: 1 combined session
Session 1: 3 hours ($2.40)
Total: 3 hours, 1 launch = $2.40

Savings: The billed hours are the same, but two fewer launches avoid 10-20 minutes of startup overhead (5-10 min per launch).

Use Local Development First¶

Develop and test locally, use GPU only for model-specific work:

graph LR
    A[Write Code Locally] --> B[Test Locally]
    B --> C{Need Model?}
    C -->|No| A
    C -->|Yes| D[Launch GPU]
    D --> E[Run Inference]
    E --> F[Stop Instance]

Example workflow:

Write code on laptop (free)
Write tests on laptop (free)
Launch GPU instance for inference testing
Stop instance immediately after

Savings: Minimize GPU uptime to only essential tasks.

Stop Instances Between Sessions¶

Don't leave instances running overnight or during breaks:

# Before lunch
soong stop --yes

# After lunch
soong start

Savings: $0.80/hr × 1 hour lunch = **$0.80 saved**

Idle Detection Configuration¶

The status daemon automatically stops instances after 30 minutes of inactivity.

How It Works¶

The daemon monitors:

CPU usage
GPU utilization
Network activity
Active SSH sessions

Idle = all metrics below threshold for 30 minutes
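
The idle rule can be sketched as a check over a sliding window of samples. This is an illustration of the logic only, not the daemon's implementation: the threshold values, the one-sample-per-minute cadence, and the `Sample`, `is_quiet`, and `is_idle` names are all assumptions.

```python
from dataclasses import dataclass

IDLE_WINDOW_MINUTES = 30  # from this guide

# Hypothetical thresholds; the daemon's real values are not documented here.
CPU_THRESHOLD_PCT = 5.0
GPU_THRESHOLD_PCT = 5.0
NETWORK_THRESHOLD_KBPS = 10.0


@dataclass
class Sample:
    """One metrics reading, assumed to be taken once per minute."""
    cpu_pct: float
    gpu_pct: float
    network_kbps: float
    ssh_sessions: int


def is_quiet(s: Sample) -> bool:
    """True when every monitored metric is below its threshold."""
    return (
        s.cpu_pct < CPU_THRESHOLD_PCT
        and s.gpu_pct < GPU_THRESHOLD_PCT
        and s.network_kbps < NETWORK_THRESHOLD_KBPS
        and s.ssh_sessions == 0
    )


def is_idle(samples: list[Sample], window: int = IDLE_WINDOW_MINUTES) -> bool:
    """True when the most recent `window` samples are all quiet."""
    if len(samples) < window:
        return False  # not enough history yet
    return all(is_quiet(s) for s in samples[-window:])
```

Because every metric must stay quiet for the full window, a single open SSH session or a brief GPU burst resets the countdown.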

Benefits¶

Prevents forgetting to stop instances
No manual monitoring required
Automatic cost control

Limitations¶

An instance is not considered idle while any of these are active:

Long-running training jobs
Background processes
Active SSH tunnels (even when unused)

Solution: Rely on the lease limit (8 hours max) as a hard stop.

Cost Estimation¶

Before Launch¶

Always review cost estimates:

┌─────────────────────────────────────┐
│ Launch Instance                     │
├─────────────────────────────────────┤
│                                     │
│ Cost Estimate                       │
│                                     │
│ GPU: 1x A6000 (48 GB)              │
│ Rate: $0.80/hour                   │
│ Duration: 4 hours                  │
│                                     │
│ Estimated cost: $3.20              │
│                                     │
└─────────────────────────────────────┘

Skip the estimate prompt if you are confident:

soong start --yes

During Session¶

Monitor current and estimated total costs:

soong status

┌────────────────────────────────────────────────────────────┐
│ ID   │ Uptime │ Time Left │ Cost Now │ Est. Total         │
├──────┼────────┼───────────┼──────────┼────────────────────┤
│ a1b2 │ 2h 30m │ 1h 30m    │ $2.00    │ $3.20              │
└────────────────────────────────────────────────────────────┘

Cost Now: Amount spent so far
Est. Total: Projected cost if lease runs to completion
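
The two cost columns follow from the hourly rate and the two time columns. The sketch below reproduces the arithmetic for the sample row, assuming durations are formatted like `2h 30m`; `parse_duration`, `cost_now`, and `estimated_total` are hypothetical helpers written for this example.

```python
import re


def parse_duration(text: str) -> float:
    """Convert a duration such as '2h 30m', '45m', or '3h' to hours."""
    match = re.fullmatch(r"(?:(\d+)h)?\s*(?:(\d+)m)?", text.strip())
    if match is None or not text.strip():
        raise ValueError(f"unrecognized duration: {text!r}")
    hours = int(match.group(1) or 0)
    minutes = int(match.group(2) or 0)
    return hours + minutes / 60


def cost_now(rate: float, uptime: str) -> float:
    """Amount spent so far."""
    return round(rate * parse_duration(uptime), 2)


def estimated_total(rate: float, uptime: str, time_left: str) -> float:
    """Projected cost if the lease runs to completion."""
    return round(rate * (parse_duration(uptime) + parse_duration(time_left)), 2)


if __name__ == "__main__":
    # Sample row above: A6000 at $0.80/hr, 2h 30m elapsed, 1h 30m remaining.
    print(cost_now(0.80, "2h 30m"))
    print(estimated_total(0.80, "2h 30m", "1h 30m"))
```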

Cost Comparison Scenarios¶

Scenario 1: Simple Code Changes¶

Task: Fix a bug, add a feature

Options:

Approach	Model	GPU	Duration	Cost
Premium	DeepSeek-R1 70B	A100	1 hr	$1.29
Medium	Qwen2.5-Coder 32B	A6000	1 hr	$0.80
Budget	Llama 3.1 8B	A10	1 hr	$0.60

Recommendation: Budget (Llama 3.1 8B)
Savings: $0.69 vs premium

Scenario 2: Large Codebase Refactoring¶

Task: Refactor multiple files, need long context

Options:

Approach	Model	GPU	Duration	Cost
High Quality	Qwen2.5-Coder 32B (FP16)	A6000	4 hr	$3.20
Budget	Qwen2.5-Coder 32B (INT4)	A10	4 hr	$2.40

Recommendation: Budget (INT4)
Savings: $0.80 (25% cheaper)
Trade-off: ~5% quality loss (acceptable for refactoring)

Scenario 3: Complex Debugging¶

Task: Debug a difficult, multi-step issue

Options:

Approach	Model	GPU	Duration	Cost
Premium	DeepSeek-R1 70B	A100	3 hr	$3.87
Medium	Qwen2.5-Coder 32B	A6000	4 hr	$3.20

Recommendation: Premium (DeepSeek-R1)
Why: Better reasoning saves time (3 hrs vs 4 hrs)
Trade-off: $3.87 vs $3.20 (spend $0.67 more but save 1 hour)

Monthly Budget Planning¶

Estimate Monthly Usage¶

Monthly Cost = Sessions/week × Weeks × Avg Cost/session

Example:

3 sessions/week
4 weeks/month
$3.00/session average

Monthly Cost = 3 × 4 × $3.00 = $36/month
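
The same estimate as a small function, which makes it easy to try different usage levels. The `monthly_cost` name and its default of 4 weeks per month are assumptions for illustration.

```python
def monthly_cost(sessions_per_week: float, avg_cost_per_session: float,
                 weeks_per_month: float = 4) -> float:
    """Monthly Cost = Sessions/week x Weeks x Avg Cost/session."""
    return round(sessions_per_week * weeks_per_month * avg_cost_per_session, 2)


if __name__ == "__main__":
    # The example above: 3 sessions/week at $3.00 each.
    print(monthly_cost(3, 3.00))
```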

Usage Patterns¶

Usage Level	Sessions/Week	Hours/Week	Est. Monthly Cost
Light	2-3	4-6	$20-40
Medium	5-7	10-14	$50-80
Heavy	10+	20+	$100-200

Budget Allocation¶

Strategy: Allocate 60/30/10

60% - Budget tier (Llama 8B, Qwen INT4)
30% - Medium tier (Qwen FP16)
10% - Premium tier (DeepSeek-R1)

Example ($80/month budget):

$48 - Budget tier (20 sessions × $2.40)
$24 - Medium tier (7 sessions × $3.20)
$8 - Premium tier (2 sessions × $4.00)

Result: 29 sessions/month, average $2.76/session
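
A minimal sketch of the 60/30/10 split, using integer cents to avoid floating-point rounding when counting whole sessions. The per-session costs are the ones from the example above; `ALLOCATION` and `plan_sessions` are hypothetical names.

```python
# tier -> (share of budget in percent, cost per session in cents)
ALLOCATION = {
    "Budget": (60, 240),
    "Medium": (30, 320),
    "Premium": (10, 400),
}


def plan_sessions(monthly_budget_cents: int) -> dict[str, tuple[int, int]]:
    """Return {tier: (tier budget in cents, whole sessions affordable)}."""
    plan = {}
    for tier, (share_pct, session_cents) in ALLOCATION.items():
        tier_budget = monthly_budget_cents * share_pct // 100
        plan[tier] = (tier_budget, tier_budget // session_cents)
    return plan


if __name__ == "__main__":
    budget_cents = 80_00  # $80/month
    plan = plan_sessions(budget_cents)
    for tier, (tier_budget, sessions) in plan.items():
        print(f"{tier}: ${tier_budget / 100:.2f} -> {sessions} sessions")
    total = sum(sessions for _, sessions in plan.values())
    print(f"{total} sessions, ${budget_cents / total / 100:.2f} average per session")
```

Session counts round down, so a tier can leave part of its allocation unspent (the medium tier here covers 7 sessions for $22.40 of its $24).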

Advanced Strategies¶

Use Persistent Filesystem¶

Store work on the persistent filesystem (/home/ubuntu/workspace) to avoid redoing work:

# Lost work = wasted money
❌ /tmp/project  # Lost on termination

# Persistent work
✅ /home/ubuntu/workspace/project  # Survives termination

Benefit: Resume work in next session without starting over.

Parallel Development¶

Run multiple instances for parallel tasks:

# Terminal 1
soong start --name "frontend" --model llama-3.1-8b --hours 2

# Terminal 2
soong start --name "backend" --model llama-3.1-8b --hours 2

Cost: $2.40 total (2 instances × 2 hours × $0.60)
Time savings: 4 hours of sequential work → 2 hours in parallel (50% faster)
Effective rate: $1.20/hr combined for 2x throughput

Automate Stop¶

Add a cron job or script to auto-stop instances:

# Stop all instances at 6 PM
0 18 * * * soong stop --yes

Benefit: Never forget to stop instances.

Benchmarking Costs¶

Track your actual costs over time:

# View termination history
soong status --history --history-hours 168  # Last week

┌──────────────────────────────────────────────────────────┐
│            Termination History (Last 168 Hours)          │
├────────┬─────────┬────────────────┬────────┬────────────┤
│ Time   │ Inst ID │ Reason         │ Uptime │ GPU        │
├────────┼─────────┼────────────────┼────────┼────────────┤
│ Jan 01 │ a1b2c3  │ Lease expired  │ 4h 5m  │ a6000      │
│ Dec 31 │ x9y8z7  │ Manual stop    │ 2h 12m │ a10        │
│ Dec 30 │ m5n6o7  │ Idle timeout   │ 1h 30m │ a100       │
└────────┴─────────┴────────────────┴────────┴────────────┘

Total uptime: 7h 47m
Estimated cost: ~$6.52
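
To cross-check a history summary by hand, multiply each row's uptime by its GPU rate. The sketch below does this for the three sample rows, using the hourly rates quoted in this guide; the row layout and the `history_summary` helper are assumptions, not the tool's output format.

```python
# Hourly rates quoted in this guide (USD), keyed by the GPU column values.
GPU_RATES = {"a10": 0.60, "a6000": 0.80, "a100": 1.29}

# (gpu, uptime hours, uptime minutes) for each row in the sample history.
HISTORY = [
    ("a6000", 4, 5),
    ("a10", 2, 12),
    ("a100", 1, 30),
]


def history_summary(rows):
    """Return (total uptime in minutes, estimated cost in USD)."""
    total_minutes = sum(h * 60 + m for _, h, m in rows)
    total_cost = sum(GPU_RATES[gpu] * (h * 60 + m) / 60 for gpu, h, m in rows)
    return total_minutes, round(total_cost, 2)


if __name__ == "__main__":
    minutes, cost = history_summary(HISTORY)
    print(f"Total uptime: {minutes // 60}h {minutes % 60}m")
    print(f"Estimated cost: ${cost:.2f}")
```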

Analysis: If over budget, adjust strategy:

Use cheaper GPUs
Shorter sessions
More aggressive idle timeout

Cost Avoidance Checklist¶

Before every launch:

Do I need a GPU for this task?
Can I use a smaller model?
Can I use INT4 instead of FP16?
Can I use a cheaper GPU?
How long will this actually take? (start with shorter lease)
Will I use all the time? (better to extend than pre-pay)

During session:

Am I actually using the instance?
Can I finish in the current lease?
Should I stop now and resume later?

After session:

Did I stop the instance?
Is work saved to persistent filesystem?
Did I get value for money spent?

Best Practices Summary¶

GPU Selection: Always use the cheapest GPU that fits your model
Quantization: Use INT4 when 5% quality loss is acceptable
Lease Duration: Start short (2 hours), extend as needed
Monitoring: Check status regularly, extend before expiration
Development: Develop locally, use GPU only for model tasks
Idle Detection: Leverage automatic shutdown for forgotten instances
Budgeting: Track usage, allocate 60/30/10 across tiers
Automation: Script stops, set reminders, use persistent storage
