ControlOS™

Transform Your Data Center into a Cloud

Enterprise Infrastructure Platform

Executive Summary

ControlOS: The Data Center Operating System

ControlOS™ is an enterprise-grade platform that transforms bare metal data centers into fully-automated, multi-tenant cloud infrastructure. Built on proven open source technologies and designed for the demands of modern workloads including AI/ML, ControlOS delivers the agility of public cloud with the control, security, and economics of on-premises infrastructure.

Key Value Propositions

Cloud Experience

Self-service VM provisioning in minutes, not weeks

GPU-Ready

Native support for NVIDIA GPU passthrough for AI/ML workloads

Multi-Tenant

Securely serve multiple customers from shared infrastructure

Control Plane Native

Seamless integration with Control Plane's global platform

Open Architecture

Built on proven open source, no vendor lock-in

Enterprise Security

Architecture designed for SOC 2, ISO 27001, HIPAA, and GDPR compliance

By the Numbers

< 60s
VM Provisioning
99.99%
API Availability SLA
< 100ms
Live Migration Downtime
100%
Tenant Isolation
NVIDIA
GPU Passthrough Support
1,000+
Tested Cluster Nodes

The Challenge

Data Centers Face a Critical Inflection Point

Operational Complexity

  • Manual provisioning takes days or weeks
  • Lack of self-service for tenants
  • Fragmented tooling across compute, network, storage
  • Difficult to scale operations

Competitive Pressure

  • Public cloud offers instant provisioning
  • Tenants expect cloud-like experiences
  • AI/ML requires specialized GPU infrastructure
  • Pressure to reduce costs

Technical Debt

  • Expensive legacy virtualization licensing
  • Proprietary vendor lock-in
  • Difficult DevOps integration
  • Limited API capabilities

Platform Architecture

Five-Layer Design for Enterprise Reliability

Control Plane Cloud

Optional Global Integration

ControlOS™ API

REST API • CLI • SDKs • Terraform
COMPUTE
  • VMs
  • GPU Passthrough
  • Live Migration
  • Scheduling
🌐
NETWORK
  • VPC
  • Security Groups
  • Load Balancers
  • Floating IPs
💾
STORAGE
  • Block Storage
  • Object Storage
  • Snapshots
  • Backup
🔐
SECURITY
  • Identity
  • Encryption
  • Audit Logs
  • Compliance

BARE METAL INFRASTRUCTURE

Node Lifecycle • Discovery • Provisioning • BMC Management

Architecture Layers

Layer 5: Control Plane & API

Orchestration, REST API, multi-tenancy, integration. Envoy Gateway, PostgreSQL, NATS messaging, Vault secrets.

Layer 4: Software-Defined Storage

Ceph distributed storage providing block (RBD), object (S3-compatible RGW), and optional file (CephFS). Self-healing, multi-tier.

Layer 3: Software-Defined Networking

OVN/OVS providing tenant isolation, security groups, distributed routing, NAT, floating IPs, and load balancing.

Layer 2: Virtualization

KVM hypervisor with libvirt management, QEMU emulation, VFIO for GPU passthrough. Near-native performance.

Layer 1: Physical Infrastructure

Tinkerbell bare metal provisioning, automatic hardware discovery, BMC management, PXE boot with secure boot chain.

Technical Foundation

Production-Grade Components, Precisely Integrated

ControlOS is built on battle-tested open source components that power the world's largest infrastructure deployments. Each component was selected for its proven reliability at scale, and integrated with careful attention to failure modes, performance characteristics, and operational requirements.

⚙️

Component Inventory

Every component pinned to specific versions with documented upgrade paths

Component | Version | Role | HA Model
Tinkerbell | Latest | Bare metal provisioning via PXE/iPXE | Active-Passive
KVM/QEMU | Kernel | Type-1 hypervisor (hardware-assisted) | Per-node (stateless)
libvirt | 10.x+ | VM lifecycle, domain XML, migration | Per-node (state in PostgreSQL)
OVN | 24.x+ | SDN control plane (logical networks) | Active-Active
Open vSwitch | 3.3+ | Virtual switch with OpenFlow | Per-node (local state)
Ceph | Squid (19.x) | Distributed block/object storage | Active-Active
PostgreSQL | 17.x | Primary state database | Patroni (Leader/Replica)
NATS | 2.11+ | Message bus (pub/sub, request/reply) | Clustered
Keycloak | 26.x+ | Identity (OIDC/SAML provider) | Active-Active
Vault | 1.18+ | Secrets management with auto-unseal | HA with Raft
🖥️

Virtualization: KVM + libvirt + QEMU

The Linux-native hypervisor stack that powers Google Compute Engine, DigitalOcean, Linode, and most OpenStack deployments

How VMs Actually Run
KVM (Kernel-based Virtual Machine)

A Linux kernel module that turns the kernel into a Type-1 hypervisor. Uses VT-x/AMD-V hardware extensions for near-native CPU performance. VMs run as regular Linux processes.

QEMU (Quick Emulator)

Provides device emulation: virtio-blk for disks, virtio-net for networking, Q35 machine type for modern PCIe topology. Handles UEFI boot via OVMF firmware.

libvirt

Manages VM lifecycle via domain XML. Handles creation, migration, snapshots, and resource limits. Exposes unified API regardless of underlying hypervisor.

VFIO (Virtual Function I/O)

Provides direct PCI passthrough via IOMMU groups. GPUs are bound to the vfio-pci driver, giving VMs direct hardware access with negligible hypervisor overhead.

GPU Passthrough Configuration

Full GPU passthrough via VFIO requires proper IOMMU configuration and driver binding. Here's how ControlOS configures GPU nodes:

libvirt-domain.xml (GPU VM)
<!-- GPU Passthrough via VFIO -->
<hostdev mode='subsystem' type='pci' managed='yes'>
  <driver name='vfio'/>
  <source>
    <address domain='0x0000' bus='0x41' slot='0x00' function='0x0'/>
  </source>
</hostdev>

<!-- 1GB Huge Pages + Memory Locking -->
<memoryBacking>
  <hugepages>
    <page size='1048576' unit='KiB' nodeset='0'/>
  </hugepages>
  <locked/>
</memoryBacking>

<!-- Pin to same NUMA node as GPU -->
<numatune>
  <memory mode='strict' nodeset='0'/>
</numatune>

<!-- Expose host CPU features -->
<cpu mode='host-passthrough'>
  <topology sockets='1' cores='16' threads='1'/>
</cpu>
Why This Matters

GPU workloads require careful attention to memory locality. By pinning VM memory and CPUs to the same NUMA node as the GPU, we eliminate cross-socket DMA transfers that can reduce training throughput by 15-30%. Huge pages prevent TLB misses during large tensor operations.
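The NUMA pinning above is ultimately arithmetic: the VM's memory must fit in whole 1 GiB pages available on the GPU's NUMA node. A minimal Python sketch of that check (function names and the 96 GiB example are illustrative, not ControlOS internals):

```python
# Sketch: size the 1 GiB huge-page reservation for a GPU VM, mirroring
# the <memoryBacking> and <numatune> stanzas above. Illustrative only.

GIB = 1024 ** 3

def hugepages_needed(vm_memory_bytes: int, page_size_bytes: int = GIB) -> int:
    """Round the VM's memory up to whole huge pages (ceiling division)."""
    return -(-vm_memory_bytes // page_size_bytes)

def fits_numa_node(pages: int, node_free_bytes: int, page_size_bytes: int = GIB) -> bool:
    """With mode='strict', every page must come from the GPU's node."""
    return pages * page_size_bytes <= node_free_bytes

# A 96 GiB VM needs 96 one-gigabyte pages:
pages = hugepages_needed(96 * GIB)
print(pages)  # 96

# A node with 128 GiB free on NUMA node 0 can host it; 64 GiB cannot:
print(fits_numa_node(pages, 128 * GIB))  # True
print(fits_numa_node(pages, 64 * GIB))   # False
```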

Live Migration Implementation

ControlOS uses hybrid pre-copy/post-copy migration with auto-converge for large-memory VMs:

1
Pre-copy
Iteratively copy dirty pages
2
Auto-converge
Throttle vCPU if needed
3
Switchover
<100ms pause
4
Post-copy
Demand-page remainder

Migration flags: VIR_MIGRATE_LIVE | VIR_MIGRATE_TUNNELLED | VIR_MIGRATE_POSTCOPY | VIR_MIGRATE_AUTO_CONVERGE
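The four phases above can be sketched as a convergence check. This toy model (all numbers and names are illustrative) shows why auto-converge and post-copy exist: each pre-copy pass must re-send the pages the guest dirtied during the previous pass, so a write-heavy guest can outrun the link indefinitely:

```python
# Sketch: does pre-copy converge? Each iteration copies the pages
# dirtied during the previous copy; the remaining set shrinks only
# when the link is faster than the guest's dirty rate.

def precopy_converges(ram_gb: float, dirty_rate_gbps: float,
                      link_gbps: float, max_iters: int = 30,
                      target_gb: float = 0.1) -> bool:
    """True if the dirty set drops below the switchover target
    (~100 ms of transfer) within max_iters passes."""
    remaining = ram_gb
    for _ in range(max_iters):
        copy_time_s = remaining * 8 / link_gbps        # seconds for this pass
        remaining = dirty_rate_gbps / 8 * copy_time_s  # GB dirtied meanwhile
        if remaining <= target_gb:
            return True
    return False

# Idle-ish 64 GB VM on a 25 Gbps link: converges, brief switchover pause.
print(precopy_converges(ram_gb=64, dirty_rate_gbps=5, link_gbps=25))   # True
# Guest dirtying faster than the link: never converges. Auto-converge
# throttles vCPUs to lower the dirty rate; post-copy demand-pages the rest.
print(precopy_converges(ram_gb=64, dirty_rate_gbps=30, link_gbps=25))  # False
```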

🌐

Networking: OVN + Open vSwitch

Distributed SDN from the Open vSwitch project, proven at Red Hat, IBM, and in OpenStack deployments worldwide

OVN Architecture
Northbound Database

Stores logical topology: switches, routers, ports, ACLs, NAT rules. This is the API-facing layer where ControlOS configures tenant networks.

OVSDB Protocol (TCP 6641)
ovn-northd

Translates logical topology into physical flows. Computes shortest paths, generates logical flows, handles distributed routing decisions.

Active-Active with clustered DB
Southbound Database

Contains compiled flow rules and chassis bindings. Each compute node's ovn-controller subscribes to relevant entries.

OVSDB Protocol (TCP 6642)
ovn-controller (per node)

Runs on every compute node. Reads Southbound DB, programs local OVS with OpenFlow rules. Handles BFD for tunnel health monitoring.

Geneve Encapsulation
Security Group Implementation

Security groups are implemented as OVN ACLs with connection tracking for statefulness:

ovn-nbctl commands
# Create logical switch for tenant network
ovn-nbctl ls-add tenant-a-network

# Create port with MAC/IP binding (anti-spoofing)
ovn-nbctl lsp-add tenant-a-network vm-001-port
ovn-nbctl lsp-set-addresses vm-001-port "fa:16:3e:aa:bb:cc 10.100.0.10"
ovn-nbctl lsp-set-port-security vm-001-port "fa:16:3e:aa:bb:cc 10.100.0.10"

# Allow inbound SSH (to-lport = traffic TO the port)
ovn-nbctl acl-add tenant-a-network to-lport 1000 \
  'outport == "vm-001-port" && ip4 && tcp.dst == 22' allow-related

# Allow inbound HTTPS
ovn-nbctl acl-add tenant-a-network to-lport 1000 \
  'outport == "vm-001-port" && ip4 && tcp.dst == 443' allow-related

# Allow all outbound (from-lport = traffic FROM the port)
ovn-nbctl acl-add tenant-a-network from-lport 1000 \
  'inport == "vm-001-port" && ip4' allow-related

# Default deny inbound (lower priority)
ovn-nbctl acl-add tenant-a-network to-lport 900 \
  'outport == "vm-001-port"' drop
Why This Matters

OVN ACLs use Linux kernel connection tracking (conntrack) for stateful filtering at line rate. Rules are compiled to OpenFlow and executed in the kernel datapath—no userspace packet processing. Port security bindings prevent MAC/IP spoofing at the hypervisor level, providing defense-in-depth even if a VM is compromised.
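To make the priority semantics of the ACLs above concrete, here is a toy evaluator in Python. It models only the fields the example uses; real OVN matching happens in compiled OpenFlow, and the explicit priority-900 rule supplies the default:

```python
# Sketch: highest-priority matching ACL wins. Mirrors the four
# ovn-nbctl rules above; field names are simplified for illustration.

def evaluate(acls, packet):
    """Return the verdict of the highest-priority ACL whose match fits,
    or None (the example relies on its priority-900 rule as the default)."""
    for prio, match, verdict in sorted(acls, key=lambda a: -a[0]):
        if match(packet):
            return verdict
    return None

# (priority, match expression, verdict) for the example's ACLs:
acls = [
    (1000, lambda p: p["dir"] == "to" and p["proto"] == "tcp" and p["dst"] == 22,  "allow-related"),
    (1000, lambda p: p["dir"] == "to" and p["proto"] == "tcp" and p["dst"] == 443, "allow-related"),
    (1000, lambda p: p["dir"] == "from",                                           "allow-related"),
    (900,  lambda p: p["dir"] == "to",                                             "drop"),
]

print(evaluate(acls, {"dir": "to", "proto": "tcp", "dst": 22}))    # allow-related (SSH in)
print(evaluate(acls, {"dir": "to", "proto": "tcp", "dst": 8080}))  # drop (default deny)
print(evaluate(acls, {"dir": "from", "proto": "udp", "dst": 53}))  # allow-related (all outbound)
```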

💾

Storage: Ceph Distributed Storage

Exabyte-scale storage proven at CERN, Bloomberg, and major telecommunications providers

Ceph Architecture
Monitors (MON)

Maintain cluster maps using Paxos consensus. Store OSD map, CRUSH map, MDS map. Require quorum (N/2+1) for cluster operations.

3 or 5 per cluster
Object Storage Daemons (OSD)

One per physical disk. Handle replication, recovery, rebalancing. Use BlueStore backend for direct disk access (no filesystem overhead).

1 GB RAM per TB recommended
CRUSH Algorithm

Controlled Replication Under Scalable Hashing. Determines placement without central lookup. Clients compute locations directly from CRUSH map.

Pseudo-random, deterministic
RBD (RADOS Block Device)

Thin-provisioned block devices striped across OSDs. Supports snapshots, cloning, online resize. Directly accessed by QEMU via librbd.

Native QEMU integration
CRUSH Map for Failure Domain Isolation

The CRUSH map defines your failure domains. ControlOS configures rack-level isolation by default:

crush-map-configuration.sh
# Define physical hierarchy
ceph osd crush add-bucket dc1-row1-rack1 rack
ceph osd crush add-bucket dc1-row1-rack2 rack
ceph osd crush add-bucket dc1-row1-rack3 rack
ceph osd crush add-bucket dc1-row1 row
ceph osd crush add-bucket dc1 datacenter

# Build the tree
ceph osd crush move dc1-row1-rack1 row=dc1-row1
ceph osd crush move dc1-row1-rack2 row=dc1-row1
ceph osd crush move dc1-row1-rack3 row=dc1-row1
ceph osd crush move dc1-row1 datacenter=dc1

# Create CRUSH rule for rack-level failure domain
ceph osd crush rule create-replicated replicated_rack default rack

# Create pool (PG autoscaling is enabled by default)
ceph osd pool create vms replicated replicated_rack
ceph osd pool set vms size 3        # 3 replicas
ceph osd pool set vms min_size 2    # Operate with 2 (degraded)

# Enable RBD application tag
ceph osd pool application enable vms rbd

# Create isolated namespace per tenant
rbd namespace create vms/tenant-a
rbd namespace create vms/tenant-b
Why This Matters

With rack-level failure domains, losing an entire rack (power, ToR switch failure) never loses data—replicas exist in other racks. Tenant namespaces provide storage isolation: Tenant A cannot enumerate or access Tenant B's volumes, enforced at the RADOS layer. The min_size=2 setting allows continued I/O during single-replica failures while maintaining durability.
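The CRUSH property worth internalizing is that placement is a pure, deterministic function of the object name and the map. This Python toy uses rendezvous-style hashing to pick three distinct racks; real CRUSH uses straw2 buckets, so treat this as an analogy, not the algorithm:

```python
# Sketch: deterministic placement with no lookup service. Every client
# that holds the same rack list computes the same replica locations.
import hashlib

RACKS = ["dc1-row1-rack1", "dc1-row1-rack2", "dc1-row1-rack3", "dc1-row1-rack4"]

def place(object_name: str, replicas: int = 3):
    """Rank racks by a stable per-object weight; take the top N."""
    def weight(rack: str) -> int:
        return int(hashlib.sha256(f"{object_name}/{rack}".encode()).hexdigest(), 16)
    return sorted(RACKS, key=weight)[:replicas]

mapping = place("rbd_data.vm-001.0000000000000042")
print(mapping == place("rbd_data.vm-001.0000000000000042"))  # True: same answer everywhere
print(len(set(mapping)))  # 3 distinct racks = 3 failure domains
```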

Storage Performance Tiers
100K+
IOPS
NVMe Tier (4KB random)
3+
GB/s
NVMe Sequential
<0.5
ms
NVMe Latency (avg)
3x
Replication
Default Durability
🔧

Bare Metal Provisioning: Tinkerbell

CNCF project for declarative, workflow-based bare metal provisioning

Provisioning Stack
Smee (DHCP/TFTP)

Serves iPXE binary to PXE-booting servers. Provides boot script URL pointing to Tinkerbell workflow. DHCP (port 67), TFTP (port 69), and HTTP for iPXE scripts.

Hegel (Metadata)

Cloud-init compatible metadata service. Provides instance identity, network configuration, SSH keys. Servers query during boot like EC2 metadata service.

Tink (Workflow Engine)

Stores hardware definitions and workflow templates. Tracks action execution state. Provides gRPC API for workflow management.

Tink-Worker

Runs in-memory on target hardware. Executes containerized workflow actions (disk wipe, partition, image write). Reports status back to Tink server.

Provisioning Workflow
compute-node-workflow.yaml
# Tinkerbell Template (Kubernetes CRD format)
apiVersion: tinkerbell.org/v1alpha1
kind: Template
metadata:
  name: compute-node
spec:
  data: |
    version: "0.1"
    name: compute-node
    global_timeout: 3600
    tasks:
      - name: "os-installation"
        worker: "{{.device_1}}"
        actions:
          - name: "stream-image"
            image: quay.io/tinkerbell/actions/image2disk:v1.0.0
            timeout: 600
            environment:
              IMG_URL: http://images.controlos.local/ubuntu-22.04.raw.zst
              DEST_DISK: /dev/sda
              COMPRESSED: "true"
          - name: "write-netplan"
            image: quay.io/tinkerbell/actions/writefile:v1.0.0
            timeout: 90
            environment:
              DEST_DISK: /dev/sda3
              DEST_PATH: /etc/netplan/config.yaml
              CONTENTS: {{.netplan_config}}
          - name: "install-agent"
            image: controlos/node-bootstrap:v1.0.0
            timeout: 300
            environment:
              API_ENDPOINT: https://api.controlos.local
              NODE_TOKEN: {{.node_token}}
🛡️

High Availability Implementation

No single point of failure at any layer

Control Plane HA
API Servers (Stateless)

Multiple instances behind load balancer. Health checks on /healthz. Any instance can serve any request. Failed instance removed from pool in <10s.

PostgreSQL (Patroni)

Leader election via etcd. Synchronous replication to standby. Automatic failover in <30 seconds. Point-in-time recovery with WAL archiving.

etcd (Raft Consensus)

3 or 5 node cluster. Tolerates (N-1)/2 failures. Used by Patroni for PostgreSQL leader election. OVN uses its own clustered OVSDB protocol.

Vault (Integrated Storage)

Raft-based HA with auto-unseal via cloud KMS or HSM. Secrets remain available with N/2+1 nodes. Audit logging to immutable storage.
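The "3 or 5 node" guidance above follows directly from majority-quorum arithmetic, sketched here in plain Python (no ControlOS APIs involved):

```python
# Sketch: majority-quorum math for etcd, Vault Raft, and Ceph MONs.
# A cluster of N nodes needs floor(N/2)+1 alive, so it tolerates
# floor((N-1)/2) failures.

def quorum(n: int) -> int:
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    return (n - 1) // 2

for n in (3, 5, 7):
    print(n, quorum(n), tolerated_failures(n))
# 3 nodes: quorum 2, tolerates 1 failure
# 5 nodes: quorum 3, tolerates 2 failures
# Even sizes buy nothing: 4 nodes tolerate 1 failure, same as 3.
print(tolerated_failures(4) == tolerated_failures(3))  # True
```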

VM High Availability

VMs with HA enabled are automatically restarted on healthy hosts if their current host fails:

1
Detect
Host heartbeat missed (30s)
2
Fence
IPMI power off (prevent split-brain)
3
Schedule
Select new host
4
Restart
Boot from Ceph storage

Recovery time: typically <60 seconds from detection to VM running on new host

Failure Scenarios
Failure | Impact | Recovery
Single API node | None (load balanced) | Automatic (<10s)
PostgreSQL primary | Brief API unavailability | Patroni failover (<30s)
Single Ceph OSD | None (replicated) | Automatic recovery
Entire storage node | Degraded (still available) | Automatic rebalancing
Compute node | VMs on that node | HA restart (<60s)
Full rack loss | Degraded storage, some VMs | Automatic (CRUSH isolation)
📊

Measured Performance Characteristics

Benchmarks from production deployments, not theoretical maximums

8-15
seconds
VM Boot (API to SSH)
<100
ms
Live Migration Downtime
25
Gbps
Single VM Network
<100
μs
Network Latency (same rack)
Scalability Tested Limits
Resource | Tested Limit | Notes
Nodes per cluster | 1,000+ | Linear scaling with proper network
VMs per node | 200+ | Depends on VM size and resources
VMs per cluster | 100,000+ | With proper PostgreSQL sizing
Networks per tenant | 100 | OVN logical switches
Security group rules | 1,000 | Per security group
Volumes per tenant | 10,000 | Ceph RBD images

Core Capabilities

Compute

Virtual Machine Management

Instant Provisioning: VMs ready in < 60 seconds
Flexible Sizing: Custom vCPU, memory, and disk configurations
Live Migration: Move VMs between hosts with < 100ms downtime
GPU Passthrough: Full NVIDIA GPU access via VFIO
Nested Virtualization: Run VMs inside VMs for testing
Cloud-Init: Automated VM configuration on first boot
Snapshots: Point-in-time VM state capture
Templates: Golden images for rapid deployment

Supported GPU Models

GPU Model | Memory | Use Case
NVIDIA A100 | 40/80 GB | Large model training
NVIDIA H100 | 80 GB | Next-gen AI workloads
NVIDIA L40S | 48 GB | Inference, graphics
NVIDIA A10 | 24 GB | Inference, VDI
NVIDIA T4 | 16 GB | Cost-effective inference

Networking

Private Networks

Isolated L2/L3 networks per tenant

Security Groups

Stateful firewall rules

Floating IPs

Public IP addresses for VMs

Load Balancers

L4 load balancing with health checks

VPN Gateway

Site-to-site VPN connectivity

DNS Integration

Automatic DNS for VMs

Network Performance

Internal Bandwidth: 25-100 Gbps per node
Latency (same rack): < 100 μs
Latency (cross-rack): < 500 μs
Security Group Throughput: Line rate
Overlay Protocol: Geneve (RFC 8926)

Storage

Storage Performance Tiers

Tier | IOPS | Throughput | Latency
NVMe | 100,000+ | 3 GB/s | < 0.5 ms
SSD | 50,000 | 1 GB/s | < 1 ms
HDD | 5,000 | 200 MB/s | < 10 ms

Object Storage (S3-Compatible)

Full S3 API

AWS S3 API compatibility

Unlimited Buckets

Per tenant bucket creation

Versioning

Object version history

Lifecycle Policies

Automatic expiration and transitions

Security & Compliance

Defense in Depth Architecture

Layer 5: Application Security

API Authentication • RBAC • Audit Logging

Layer 4: Data Security

Encryption at Rest • Encryption in Transit • Key Management

Layer 3: Tenant Isolation

Network Isolation • Storage Isolation • Compute Isolation

Layer 2: Network Security

Segmentation • Firewalls • IDS/IPS • DDoS Protection

Layer 1: Infrastructure Security

Secure Boot • TPM • Host Hardening • Physical Security

Compliance Certifications

SOC 2 Type II Ready • ISO 27001 Ready • HIPAA Ready (with BAA) • GDPR Compliant • PCI DSS Ready

Encryption Standards

In Transit (API/Control): TLS 1.3
In Transit (Overlay): IPsec (OVN native)
At Rest (Storage): AES-256 dm-crypt
At Rest (Database): AES-256 (encrypted volume)
Secrets: Vault with auto-unseal
🔐

Authorization: OPA + Keycloak

Policy-based access control with external authorization

Authentication Flow
1
Login
OIDC/SAML to Keycloak
2
JWT Issued
Claims include tenant, roles
3
API Request
JWT in Authorization header
4
OPA Check
Policy decision
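What the gateway sees at step 3 is a JWT whose payload segment is base64url-encoded JSON. A minimal decoding sketch (claim names are illustrative; real verification would also check the signature against Keycloak's published JWKS keys, omitted here):

```python
# Sketch: peek at the tenant and role claims inside a JWT. This only
# decodes the payload; it does NOT verify the signature.
import base64
import json

def decode_claims(jwt: str) -> dict:
    """Decode the (unverified) payload segment of a JWT."""
    payload_b64 = jwt.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))

# Build a toy token shaped like the one Keycloak would issue:
claims = {"sub": "alice", "tenant": "tenant-a", "roles": ["tenant_admin"]}
payload = base64.urlsafe_b64encode(json.dumps(claims).encode()).decode().rstrip("=")
token = f"header.{payload}.signature"

print(decode_claims(token)["tenant"])  # tenant-a
```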
OPA Policy Implementation

Fine-grained authorization enforced at the API gateway using Open Policy Agent:

authz.rego
package controlos.authz

import future.keywords.in

default allow = false

# Tenant admin can manage their tenant's resources
allow {
    input.user.roles[_] == "tenant_admin"
    input.resource.tenant == input.user.tenant
}

# Users can manage VMs in their tenant
allow {
    input.user.roles[_] == "user"
    input.resource.tenant == input.user.tenant
    allowed_user_actions[input.action]
}

allowed_user_actions := {
    "vm:create", "vm:read", "vm:delete",
    "vm:start", "vm:stop", "volume:create",
    "volume:read", "network:read"
}

# Quota enforcement
deny[msg] {
    input.action == "vm:create"
    tenant_vms := count([vm |
        vm := data.vms[_]
        vm.tenant == input.user.tenant
    ])
    tenant_vms >= data.quotas[input.user.tenant].max_vms
    msg := "Tenant VM quota exceeded"
}
Why This Matters

OPA policies are evaluated in <1ms. They're version-controlled, testable, and auditable. Unlike embedded authorization code, OPA policies can be updated without redeploying services. Every API decision is logged with the policy version that made it—critical for compliance audits.
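For readers unfamiliar with Rego, the quota rule above translates to a few lines of Python. The data shapes mirror the policy's inputs and are illustrative:

```python
# Sketch: the deny[msg] quota rule in plain Python. A vm:create request
# is rejected once the tenant's current VM count reaches its quota.

def check_quota(action, user, vms, quotas):
    """Return a denial message, or None if the request may proceed."""
    if action != "vm:create":
        return None
    tenant_vms = sum(1 for vm in vms if vm["tenant"] == user["tenant"])
    if tenant_vms >= quotas[user["tenant"]]["max_vms"]:
        return "Tenant VM quota exceeded"
    return None

vms = [{"tenant": "tenant-a"}] * 10
quotas = {"tenant-a": {"max_vms": 10}}

print(check_quota("vm:create", {"tenant": "tenant-a"}, vms, quotas))  # Tenant VM quota exceeded
print(check_quota("vm:create", {"tenant": "tenant-a"}, [], quotas))   # None
```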

Multi-Tenancy

Complete Isolation at Every Layer

Tenant A

Isolated Environment
🖥️
VM 1
🖥️
VM 2
🌐
Network 10.A.0.0/16
Private VPC with isolation
💾
Storage Pool (RBD)
Dedicated Ceph namespace
🔒

Tenant B

Isolated Environment
🖥️
VM 1
🖥️
VM 2
🌐
Network 10.B.0.0/16
Private VPC with isolation
💾
Storage Pool (RBD)
Dedicated Ceph namespace

No communication possible between tenants

Isolation Guarantees

Compute: Separate QEMU processes, no shared memory
Network: OVN logical switches, VNI segmentation
Storage: Ceph namespaces, separate pools
API: Tenant-scoped tokens, RBAC
Secrets: Vault namespaces per tenant
🔒

How Isolation Is Enforced

Technical mechanisms at each layer

Network: OVN Logical Switches

Each tenant gets dedicated OVN logical switches with unique VNI (Virtual Network Identifier) tags. Geneve encapsulation ensures traffic never mixes at L2. OVN port-security binds MAC/IP to prevent spoofing.

VNI space: 16 million networks
Storage: Ceph RADOS Namespaces

Each tenant's volumes exist in a dedicated RBD namespace. Ceph capabilities (cephx) are scoped per-tenant—Tenant A's credentials cannot access Tenant B's namespace at the RADOS protocol level.

Enforced by Ceph MONs
Compute: Process Isolation

Each VM runs as a separate QEMU process with dedicated memory. No shared memory regions between VMs. SELinux/AppArmor provides mandatory access control on hypervisor hosts.

sVirt labeling per VM
Secrets: Vault Namespaces

Each tenant has a dedicated Vault namespace. Policies are scoped per-namespace—a token for Tenant A cannot access secrets in Tenant B's namespace, even with an admin role.

Hierarchical namespaces
Defense in Depth

Tenant isolation isn't a single barrier—it's enforced independently at every layer. A bug in one component (say, the API) cannot bypass network isolation (OVN) or storage isolation (Ceph namespaces). Each layer authenticates and authorizes independently.

Deployment Options

Small

10-50 VMs

6
Nodes Minimum
~200
vCPU Capacity
~50 TB
Storage
25 GbE
Network

3 converged control nodes + 3+ compute nodes
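The "~200 vCPU" small-tier figure can be reproduced with a simple sizing calculation. The core counts, host reservation, and 2:1 overcommit below are assumptions for illustration, not ControlOS defaults:

```python
# Sketch: vCPU capacity for a compute tier. Assumes each node reserves
# a couple of cores for the hypervisor and host agents, with the rest
# overcommitted at a conservative ratio.

def vcpu_capacity(nodes: int, cores_per_node: int,
                  overcommit: float = 2.0, host_reserved: int = 2) -> int:
    usable = cores_per_node - host_reserved
    return int(nodes * usable * overcommit)

# 3 compute nodes with 32 cores each (e.g. dual 16-core CPUs):
print(vcpu_capacity(nodes=3, cores_per_node=32))  # 180, i.e. roughly the ~200 vCPU tier
```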

Large

500+ VMs

100+
Nodes
10,000+
vCPU Capacity
5+ PB
Storage
100 GbE
Network

Multi-AZ with 6+ control, 30+ storage, 50+ compute

Deployment Timeline

Planning

1-2 weeks — Requirements, design, procurement

Infrastructure

1-2 weeks — Rack, power, network cabling

Deployment

1 week — ControlOS installation

Configuration

1 week — Customization, integration

Testing & Go-Live

1 week — Validation, performance tuning, production cutover

Total: 4-8 weeks from planning to production

Operational Excellence

Monitoring & Alerting

  • Pre-built Grafana dashboards
  • 100+ pre-configured alert rules
  • SLO tracking
  • Capacity planning
  • Centralized logging with Loki
  • Distributed tracing with Jaeger

Automated Operations

  • Auto-healing VM restart on failure
  • Automatic workload balancing
  • Ceph automatic data redistribution
  • Automated certificate renewal
  • Scheduled backups with retention

Zero-Downtime Maintenance

  • Rolling node updates with migration
  • Blue-green control plane updates
  • Online storage expansion
  • Online compute expansion

Support Tiers

Tier | Response Time | Coverage | Includes
Standard | < 4 hours | Business hours | Email, portal
Premium | < 1 hour | 24x7 | Email, portal, phone
Enterprise | < 15 minutes | 24x7 | Dedicated TAM, on-site option

Infrastructure as Code

Terraform Provider

main.tf
# Example: Complete application stack
terraform {
  required_providers {
    controlos = {
      source  = "controlplane/controlos"
      version = "~> 1.0"
    }
  }
}

provider "controlos" {
  endpoint = "https://api.controlos.example.com"
  api_key  = var.api_key
}

# Network
resource "controlos_network" "app" {
  name = "production-network"
  cidr = "10.100.0.0/24"
}

# Load Balanced Web Servers
resource "controlos_vm" "web" {
  count  = 3
  name   = "web-${count.index + 1}"
  image  = "ubuntu-22.04"
  flavor = "m1.medium"
  network_id = controlos_network.app.id
}

# GPU ML Server
resource "controlos_vm" "ml" {
  name   = "ml-training"
  image  = "ubuntu-22.04-cuda"
  flavor = "g1.xlarge"

  gpu {
    count = 2
    type  = "nvidia-a100"
  }
}

CI/CD Integration

GitHub Actions

Native integration with examples

GitLab CI

Native integration with examples

Jenkins

Pipeline examples, plugins

ArgoCD

GitOps integration

Comparison with Alternatives

ControlOS vs. VMware vSphere

Capability | VMware vSphere | ControlOS™
Licensing | Per-socket, expensive | Simple, predictable
GPU Passthrough | vGPU licensed separately | Included (VFIO)
Multi-Tenancy | Complex (NSX-T extra) | Native, included
API | SOAP-based, complex | REST, OpenAPI
Cloud Integration | VMware Cloud only | Control Plane + open
Open Source | Proprietary | Open core

Total Cost of Ownership

Cost Factor | Legacy Platforms | ControlOS™
Software Licensing | $$$$ | $$
GPU Licensing | $$$ | Included
Hardware Lock-in | $$$ (vendor premium) | $0 (open hardware)
Operations Staff | $$$ (specialists) | $$ (simplified ops)
3-Year TCO | $$$$$ | $$

Customer Success

Use Case

AI/ML Infrastructure Provider

A growing AI startup deployed ControlOS on 50 GPU nodes (200 NVIDIA A100 GPUs) with instant GPU VM provisioning, multi-tenant isolation for enterprise customers, and S3-compatible storage for datasets.

70%
Cost reduction vs. public cloud
Minutes
VM provisioning (was days)
100%
GPU utilization
Zero
Security incidents

Use Case

Managed Services Provider

An MSP deployed ControlOS across 3 data centers with per-tenant networks, quotas, and billing, white-label self-service portal, and SOC 2 Type II compliant architecture.

500+
Tenants on shared infrastructure
99.99%
Uptime achieved
SOC 2
Certification obtained
40%
Margin improvement

Pricing

Simple, Predictable Pricing Based on Managed Node Capacity

Tier | Nodes | Support | Price
Starter | Up to 10 | Standard | Contact Sales
Professional | Up to 50 | Premium | Contact Sales
Enterprise | Unlimited | Enterprise | Contact Sales

All Tiers Include

Full Platform

Complete ControlOS platform

GPU Support

GPU passthrough included

Multi-Tenancy

Full tenant isolation

API Access

REST API, CLI, SDKs

Terraform

Full Terraform provider

Updates

All software updates

Ready to Transform Your Data Center?

Schedule a demo to see ControlOS in action