Architecture Documentation
This section provides technical documentation of the soong CLI system architecture, design decisions, and implementation details.
Contents
- System Design - Overall architecture and component interactions
- Cost Controls - Multi-layered cost protection mechanisms
Overview
The soong CLI is a production-ready system for managing GPU instances on Lambda Labs. It provides a command-line interface backed by a distributed architecture that emphasizes:
- Reliability: Retry logic with exponential backoff for API calls
- Cost Control: Multiple layers of protection against runaway costs
- Persistence: Integration with Lambda filesystems for state across sessions
- Security: Token-based authentication, secure configuration storage
- User Experience: Rich terminal UI with progress indicators and detailed status
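The retry behavior named under Reliability can be sketched as follows. This is a minimal illustration, not the actual Lambda API Client: the names `TransientAPIError` and `call_with_backoff`, the attempt count, the delay values, and the 10% jitter are all assumptions made for the example.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientAPIError(Exception):
    """Hypothetical marker for errors worth retrying (timeouts, HTTP 5xx)."""


def call_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 4,        # assumed value, not taken from the soong source
    base_delay: float = 1.0,      # assumed value, in seconds
    max_delay: float = 30.0,      # assumed cap, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying transient failures with exponentially growing delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except TransientAPIError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Delay doubles on each attempt (1s, 2s, 4s, ...), capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Up to 10% jitter so concurrent clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the helper testable without real waiting. Only errors classified as transient are retried, so authentication or validation failures surface immediately.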
Key Components
- CLI Tool (soong) - Python-based command-line interface
- Lambda API Client - HTTP client with retry logic and error handling
- SSH Tunnel Manager - Manages port forwarding for remote services
- Instance Manager - Lifecycle management and status polling
- Status Daemon (on GPU instance) - Health monitoring and lease management
- Configuration Manager - YAML-based configuration with validation
Design Principles
Fail-Safe Defaults
All default settings prioritize safety and cost control:
- Default lease: 4 hours (not maximum)
- Idle timeout: 30 minutes
- Hard timeout: 8 hours (absolute maximum)
- Cost estimates shown before launch
- Confirmation required for destructive operations
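As a rough sketch of how the timing defaults above could be expressed in code, the example below stores them in an immutable dataclass and resolves a requested lease against them. `LeaseLimits` and `resolve_lease` are hypothetical names. Whether the real CLI clamps or rejects a request above the hard timeout is not stated here, so the sketch assumes clamping.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class LeaseLimits:
    """Illustrative container for the documented defaults."""
    default_lease: timedelta = timedelta(hours=4)
    idle_timeout: timedelta = timedelta(minutes=30)
    hard_timeout: timedelta = timedelta(hours=8)


def resolve_lease(
    requested: Optional[timedelta],
    limits: LeaseLimits = LeaseLimits(),
) -> timedelta:
    """Return the lease to apply: the default if none is requested, never above the hard timeout."""
    if requested is None:
        return limits.default_lease  # the safe default, not the maximum
    if requested <= timedelta(0):
        raise ValueError("lease must be positive")
    # Assumption: over-long requests are clamped rather than rejected.
    return min(requested, limits.hard_timeout)
```

With these values, omitting the lease yields 4 hours, and a 12-hour request is reduced to the 8-hour hard timeout.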
Single Source of Truth
Configuration is centralized in ~/.config/gpu-dashboard/config.yaml with secure permissions (0600). Command-line flags override defaults but do not modify saved configuration.
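The override rule and the permission requirement can be illustrated with a short sketch. The helper names `effective_settings` and `is_owner_only` and the setting keys in the usage note are hypothetical. Parsing the YAML file itself is left out so that the example needs only the standard library.

```python
import os
import stat
from typing import Any, Dict, Mapping


def effective_settings(
    saved: Mapping[str, Any],
    flag_overrides: Mapping[str, Any],
) -> Dict[str, Any]:
    """Merge CLI flag values over saved settings without mutating either input."""
    merged = dict(saved)  # copy, so the saved configuration is left untouched
    # Flags the user did not pass (None) fall back to the saved value.
    merged.update({k: v for k, v in flag_overrides.items() if v is not None})
    return merged


def is_owner_only(path: str) -> bool:
    """True when the file mode is exactly 0600 (owner read/write, nothing else)."""
    return stat.S_IMODE(os.stat(path).st_mode) == 0o600
```

For example, `effective_settings({"lease_hours": 4}, {"lease_hours": 2})` returns a new dictionary with `lease_hours` set to 2, while the saved mapping still holds 4, matching the rule that flags never rewrite the stored configuration.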
Graceful Degradation
The system continues to function when optional services are unavailable:
- Cost estimates work without pricing API
- Status shows partial info when daemon unreachable
- History shows local cache when worker unavailable
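The history fallback above follows a common pattern: try the remote source, and on a connection failure serve the local copy while marking it as possibly stale. The sketch below assumes this shape. `load_history`, its two callables, and the record format are hypothetical and do not describe the real worker or cache interface.

```python
from typing import Callable, List, Tuple


def load_history(
    fetch_remote: Callable[[], List[dict]],
    read_cache: Callable[[], List[dict]],
) -> Tuple[List[dict], bool]:
    """Return (records, is_fresh); fall back to the local cache if the remote call fails."""
    try:
        return fetch_remote(), True
    except (ConnectionError, TimeoutError):
        # Worker unreachable: serve cached records and flag them as possibly stale.
        return read_cache(), False
```

Returning the freshness flag alongside the records lets the caller tell the user that the data came from the cache instead of failing the whole command.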
Progressive Disclosure
The CLI presents information hierarchically:
- Critical information first (instance ID, status, IP)
- Cost and timing details for active instances
- Extended details via flags (--history, --stopped)
- Debug information via SSH and logs
Next Steps
- System Design - Detailed architecture diagrams
- Cost Controls - Cost protection mechanisms