Total Cost of Self-Hosted Inference | Geodd

Total Cost of Self-Hosted Inference

Bartosz Neuman
April 13, 2026

The total cost of self-hosted inference is not only the hourly GPU price. It includes GPU capacity, supporting compute, storage, networking, orchestration, observability, engineering time, runtime tuning, scaling, incident response, and the cost of keeping latency stable under production load.

Self-hosted inference can be cost-effective when utilization is predictable and the team has strong infrastructure capability. It becomes expensive when the team must overprovision GPUs, debug latency, maintain serving runtimes, and absorb operational failures internally.

The practical question is not “Is self-hosting cheaper?” The better question is: Can this team operate inference reliably at the required cost, latency, throughput, and support level for the next 6–12 months?

What is self-hosted inference?

Self-hosted inference means a team runs its own model serving infrastructure instead of relying fully on a managed inference provider.

The team may use cloud GPUs, bare-metal GPUs, Kubernetes, containers, open-source serving engines, internal monitoring, custom autoscaling, and its own deployment pipelines.

In this model, the customer usually owns most of the operational responsibility:

  • provisioning
  • deployment
  • runtime configuration
  • model serving
  • batching and scheduling
  • monitoring
  • scaling
  • debugging
  • incident response
  • cost control
  • upgrades

Self-hosted inference is different from managed inference.

Managed inference shifts more infrastructure operation to the provider. Serverless inference usually gives the customer an API endpoint without requiring direct infrastructure management. Dedicated inference gives the customer more workload isolation and control when dedicated or single-tenant resources are allocated, while the provider may still own provisioning, monitoring, orchestration, and operational support.

Why self-hosted inference cost is often underestimated

Self-hosted inference often looks cheaper when the comparison starts with GPU pricing.

That comparison is incomplete.

Cloud compute pricing depends on instance type, region, purchase model, and usage duration. AWS describes On-Demand EC2 as pay-as-you-go compute capacity by the hour or second, without upfront payment or long-term commitment. (aws.amazon.com)

GPU pricing may also exclude related infrastructure costs. Google Cloud states that its GPU pricing page does not cover disk and images, networking, sole-tenant nodes pricing, or VM instance pricing. It also notes that attached GPUs add cost in addition to the machine type. (cloud.google.com)

For inference, the GPU is only one part of the system.

A production inference stack also needs host compute, memory, storage, networking, orchestration, observability, deployment workflows, runtime optimization, failure handling, and people who can operate the system when it degrades.

The cost becomes more complex when the workload moves from prototype usage to sustained production traffic.

At that point, the buyer is usually asking a different question:

Is self-hosting still cheaper after utilization, latency, support, engineering time, and failure risk are included?

For Geodd’s target buyer, this is the real decision point. They are not only asking whether the model can run. They are asking whether the system will stay up, scale cleanly, avoid hidden cost, and remain a defensible infrastructure decision 6–12 months later.

What buyers are really trying to decide

Technical buying committees usually evaluate self-hosted inference across four questions.

Is the current setup still economically rational?

A low GPU rate does not guarantee a low cost per token.

If GPUs are idle, overprovisioned, poorly batched, or constrained by latency targets, the effective cost increases.

The team may be paying for capacity that is not producing useful work.

Is the team spending too much engineering time on inference operations?

Self-hosting requires people to operate the system.

That includes deployment, monitoring, scaling, runtime tuning, incident response, version upgrades, model migration, and cost analysis.

The engineering cost is easy to ignore because it may not appear on the infrastructure bill.

Will the system hold under production demand?

A setup can work at low traffic and degrade under concurrency.

The failure may show up as higher P99 latency, lower throughput, GPU memory pressure, request queueing, out-of-memory errors, or unstable token generation speed.

These problems matter because production users experience the system under load, not under isolated test requests.

Would managed or dedicated inference reduce operational burden?

The buyer is not always trying to avoid infrastructure entirely.

Often, they are deciding which parts of inference operations should remain internal and which parts should be owned by a provider.

That is a responsibility-boundary decision, not only a pricing decision.

Visible costs of self-hosted inference

Visible costs are the line items most teams already track.

They are necessary, but they do not represent the full cost.

| Cost area | What it includes | Why it matters |
|---|---|---|
| GPU compute | GPU instances, bare-metal GPUs, reserved capacity, committed usage, or on-demand usage | Usually the largest visible cost |
| Host compute | CPU, RAM, VM or server resources | Required to run the serving stack around the GPU |
| Storage | Model weights, logs, artifacts, datasets, cache storage | Can grow with model size and observability volume |
| Networking | Ingress, egress, inter-region traffic, load balancing | May increase with production traffic and distributed deployments |
| Orchestration | Kubernetes, containers, schedulers, service discovery | Required for repeatable deployment and scaling |
| Observability | Metrics, logs, traces, alerts, dashboards | Required to detect latency, errors, saturation, and incidents |
| Redundancy | Failover capacity, multi-zone or multi-region setup | Required when uptime matters |

Cloud pricing varies by purchase model. AWS lists several EC2 purchase options, including On-Demand Instances, Savings Plans, Spot Instances, On-Demand Capacity Reservations, and EC2 Capacity Blocks for ML. (aws.amazon.com)

For GPU capacity reservations, AWS states that EC2 Capacity Block reservation prices are updated regularly based on trends in supply and demand. (aws.amazon.com)

This matters for inference because production inference is often latency-sensitive and user-facing.

A lower-cost capacity model may not be acceptable if interruption risk, capacity availability, or support limitations affect the product.

Hidden costs of self-hosted inference

Hidden costs usually appear after the system is in production.

They are not always visible in the first architecture review.

Underutilized GPU capacity

Inference workloads are rarely perfectly steady.

Traffic can be bursty. Request lengths can vary. Output lengths can vary. Context windows can grow. Some hours may be quiet, while peak windows require more capacity.

If the team pays for fixed GPU capacity but uses only part of it, the effective cost per useful token increases.

Overprovisioning for peak demand

Teams often provision for the worst expected traffic window.

That may be necessary for reliability.

It also means the system may carry idle capacity outside peak periods.

This is one common way self-hosted inference becomes more expensive than expected.
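
The gap between peak-sized capacity and average demand can be sketched with simple arithmetic. The example below is a minimal illustration, not a capacity-planning tool. The function names, the per-GPU request rate, the traffic figures, and the 20% headroom are all hypothetical values chosen for the example.

```python
import math


def gpus_for_peak(peak_rps: float, rps_per_gpu: float, headroom: float = 0.0) -> int:
    """GPUs needed to serve peak traffic, with optional fractional headroom."""
    if rps_per_gpu <= 0:
        raise ValueError("rps_per_gpu must be positive")
    return math.ceil(peak_rps * (1.0 + headroom) / rps_per_gpu)


def average_utilization(avg_rps: float, gpus: int, rps_per_gpu: float) -> float:
    """Fraction of provisioned request capacity used at average traffic."""
    return avg_rps / (gpus * rps_per_gpu)


# Hypothetical workload: 120 req/s at peak, 40 req/s on average,
# and an assumed 10 req/s per GPU at the target latency.
gpus = gpus_for_peak(peak_rps=120.0, rps_per_gpu=10.0, headroom=0.2)
utilization = average_utilization(avg_rps=40.0, gpus=gpus, rps_per_gpu=10.0)

print(f"GPUs provisioned for peak: {gpus}")
print(f"Average utilization: {utilization:.0%}")
```

With these assumed numbers, the team provisions 15 GPUs and uses about 27% of that capacity on average. The per-GPU request rate is the most sensitive input, and it should come from measurement at the target latency rather than from a vendor benchmark.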

Latency targets that limit batching efficiency

Batching can improve throughput, but it can increase latency.

Databricks gives a practical example: on one NVIDIA A100 GPU, maximizing throughput with a batch size of 64 increased throughput by 14x while latency increased by 4x. (databricks.com)

That tradeoff is central to inference cost.

A team cannot optimize only for throughput if the product requires low TTFT, stable streaming, or tight P99 latency.
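
The tradeoff can be made concrete by asking which configuration is cheapest among those that still meet the latency target. The sketch below applies the 14x throughput and 4x latency ratios from the cited example to a hypothetical baseline. The baseline throughput, baseline latency, GPU hourly cost, and all names are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ServingConfig:
    name: str
    tokens_per_second: float  # sustained throughput per GPU
    latency_seconds: float    # latency observed at that batch size


def cost_per_million_tokens(gpu_hourly_cost: float, tokens_per_second: float) -> float:
    """GPU cost per one million tokens, assuming the GPU is kept fully busy."""
    return gpu_hourly_cost / (tokens_per_second * 3600.0) * 1_000_000


def cheapest_within_slo(
    configs: Sequence[ServingConfig],
    gpu_hourly_cost: float,
    latency_slo_seconds: float,
) -> Optional[ServingConfig]:
    """Return the lowest-cost config that still meets the latency SLO, if any."""
    eligible = [c for c in configs if c.latency_seconds <= latency_slo_seconds]
    if not eligible:
        return None
    return min(
        eligible,
        key=lambda c: cost_per_million_tokens(gpu_hourly_cost, c.tokens_per_second),
    )


# Baseline values are hypothetical; only the 14x / 4x ratios follow the cited example.
BASE_TPS, BASE_LATENCY, GPU_HOURLY_COST = 100.0, 1.0, 2.0
configs = [
    ServingConfig("batch_1", BASE_TPS, BASE_LATENCY),
    ServingConfig("batch_64", BASE_TPS * 14, BASE_LATENCY * 4),
]

for slo in (2.0, 5.0):
    choice = cheapest_within_slo(configs, GPU_HOURLY_COST, slo)
    print(f"SLO {slo}s ->", choice.name if choice else "no config meets the SLO")
```

Under a 2-second target only the unbatched configuration qualifies, so the team pays 14 times more per token than the throughput-optimized setting would suggest. Under a 5-second target the batched configuration becomes eligible.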

Engineering time

Self-hosted inference requires ongoing engineering work.

That work includes:

  • deployment automation
  • runtime tuning
  • batching configuration
  • autoscaling logic
  • monitoring setup
  • incident response
  • GPU memory debugging
  • model upgrades
  • serving engine upgrades
  • cost analysis

If senior engineers are pulled into infrastructure work repeatedly, the real cost is higher than the GPU bill.

Debugging and incident response

When a self-hosted endpoint degrades, the internal team owns the failure.

They need to determine whether the issue is caused by the model, runtime, GPU memory, scheduler, batching policy, network, queueing, dependency failure, or traffic pattern.

That work can be expensive because it requires production inference expertise.

Optimization debt

Inference cost changes over time.

A configuration that works for one model may not work for another.

A setup that works for short prompts may fail under long-context traffic.

A serving stack tuned for low concurrency may degrade when request volume increases.

Self-hosted inference needs continuous tuning as the product changes.

Opportunity cost

Every hour spent operating inference infrastructure is an hour not spent on product work, customer features, model quality, or core engineering priorities.

This cost matters most for small technical teams where the same people own both product and infrastructure.

Technical factors that drive inference cost

Inference cost is shaped by workload behavior.

The same GPU can have different cost efficiency depending on model size, context length, request pattern, batching strategy, and latency target.

Workload shape

The workload shape includes:

  • input tokens
  • output tokens
  • request rate
  • concurrent users
  • context length
  • burst behavior
  • streaming requirements
  • batch vs real-time usage
  • model size and architecture

A short-input classification workload has a different cost profile from a long-context chat workload.

A batch processing job has a different cost profile from a user-facing assistant that needs low TTFT.

Prefill and decode

LLM inference has two broad phases.

Prefill processes the input prompt.

Decode generates output tokens.

Long prompts increase prefill cost. Long outputs increase decode cost. High concurrency increases memory pressure and scheduling complexity.

This is why cost per request is often less useful than cost per input token, cost per output token, latency under concurrency, and throughput under a defined latency target.

Batching and scheduling

Batching groups requests together so the GPU can process more work efficiently.

Scheduling determines how requests are admitted, queued, prioritized, and executed.

More aggressive batching can improve throughput, but it may increase queueing delay or latency. This makes batching a cost and product-experience tradeoff, not only a performance optimization. (databricks.com)

KV cache memory

LLMs store key-value attention data during generation. This is commonly called the KV cache.

KV cache memory grows with sequence length and concurrency.

If KV cache memory is managed inefficiently, usable GPU capacity can fall. That can reduce batch size, lower throughput, increase latency, or require more GPUs.

The vLLM / PagedAttention paper identifies KV cache memory as a major constraint in LLM serving. It states that KV cache memory for each request is large and changes dynamically, and that inefficient management can waste memory through fragmentation and redundant duplication. (arxiv.org)

This is one reason inference cost cannot be understood from GPU type alone.
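
A rough sizing estimate shows how quickly KV cache memory grows. The sketch below uses the common approximation for a decoder-only transformer: one key vector and one value vector per layer for every cached token. The model shape, sequence length, and concurrency are hypothetical. The estimate assumes every sequence is at full length and ignores paging, fragmentation, and runtime overhead.

```python
def kv_cache_bytes_per_token(
    num_layers: int, num_kv_heads: int, head_dim: int, bytes_per_element: int
) -> int:
    """Bytes of KV cache per token: a key and a value vector for each layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def kv_cache_bytes(per_token: int, sequence_length: int, concurrent_sequences: int) -> int:
    """Total KV cache if every concurrent sequence holds sequence_length tokens."""
    return per_token * sequence_length * concurrent_sequences


# Hypothetical model shape: 32 layers, 32 KV heads, head dimension 128, FP16 cache.
per_token = kv_cache_bytes_per_token(
    num_layers=32, num_kv_heads=32, head_dim=128, bytes_per_element=2
)
total = kv_cache_bytes(per_token, sequence_length=4096, concurrent_sequences=16)

print(f"KV cache per token: {per_token / 2**20} MiB")
print(f"KV cache for 16 sequences of 4096 tokens: {total / 2**30} GiB")
```

For this assumed shape, each token takes 0.5 MiB and 16 full-length sequences take 32 GiB, before model weights are counted. Doubling either context length or concurrency doubles the figure.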

Quantization and precision

Quantization reduces numerical precision to reduce memory use or improve execution efficiency.

Common precision formats include FP16, FP8, INT8, and INT4.

Quantization can reduce cost in some cases, but it must be validated.

The impact depends on the model, workload, hardware, runtime, accuracy requirements, and context length.

It is not safe to claim that quantization has no quality impact across all models.
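
The memory side of quantization is straightforward to estimate, even though the quality side is not. The sketch below computes approximate weight memory from parameter count and bytes per parameter. The 7-billion-parameter model is a hypothetical example, and the estimate excludes quantization scales and zero-points, activations, KV cache, and runtime overhead. It says nothing about accuracy.

```python
# Nominal storage per parameter for each precision format.
BYTES_PER_PARAMETER = {"FP16": 2.0, "FP8": 1.0, "INT8": 1.0, "INT4": 0.5}


def weight_memory_gb(num_parameters: float, precision: str) -> float:
    """Approximate weight memory in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_parameters * BYTES_PER_PARAMETER[precision] / 1e9


# Hypothetical 7-billion-parameter model.
for precision in BYTES_PER_PARAMETER:
    print(f"{precision}: {weight_memory_gb(7e9, precision):.1f} GB")
```

For the assumed model this gives about 14 GB at FP16, 7 GB at FP8 or INT8, and 3.5 GB at INT4. Memory freed this way can go to KV cache or larger batches, but only if the quantized model still meets the accuracy requirement.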

Serving runtime

The serving runtime affects memory management, batching, scheduling, kernel execution, and throughput.

NVIDIA describes TensorRT-LLM as providing APIs to define LLMs and build TensorRT engines with optimizations for efficient inference on NVIDIA GPUs. (docs.nvidia.com)

vLLM describes PagedAttention as an approach that uses block-level KV cache memory management to reduce memory waste in LLM serving. (arxiv.org)

The right runtime depends on the model, hardware, latency target, concurrency, context length, and operational requirements.

P99 latency

Average latency is not enough for production inference.

A system can show acceptable average latency while P99 latency becomes unacceptable under burst traffic or memory pressure.

For many user-facing inference systems, P99 latency describes what the slowest 1% of requests experience, which the average can hide.

For internal batch workloads, throughput may matter more than P99 latency.

The cost model should match the product requirement.
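
A small example shows how an average can look healthy while the tail does not. The sketch below uses the nearest-rank percentile definition, which is one of several common definitions. The latency samples are synthetic and chosen only to illustrate the effect.

```python
import math
from statistics import mean
from typing import Sequence


def percentile_nearest_rank(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Synthetic latencies in seconds: 97 fast requests and 3 slow ones.
latencies = [0.2] * 97 + [3.0] * 3

print(f"mean: {mean(latencies):.3f}s")
for p in (50, 95, 99):
    print(f"P{p}: {percentile_nearest_rank(latencies, p)}s")
```

In this synthetic sample the mean is 0.284 seconds and both P50 and P95 are 0.2 seconds, yet P99 is 3.0 seconds. A dashboard that shows only the mean would not reveal that some users wait fifteen times longer than the median.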

Self-hosted inference vs managed inference vs dedicated inference

The right category depends on control needs, workload stability, team capability, and operational risk.

| Dimension | Self-hosted inference | Managed inference | Dedicated inference |
|---|---|---|---|
| Infrastructure control | Highest | Lower direct control | Higher control than serverless, lower burden than full self-hosting |
| Operational ownership | Mostly customer | Mostly provider | Shared, with provider owning defined infrastructure operations |
| Cost model | GPU, infra, people, operations | Usually usage-based or provider-defined | Usually capacity, workload, or contract-based |
| Utilization risk | Customer owns it | Provider abstracts or manages it | Provider helps align capacity to workload |
| Latency control | High, but requires expertise | Depends on provider abstraction | More configurable than serverless |
| Scaling burden | Customer-owned | Provider-managed | Provider-managed or coordinated |
| Incident response | Internal team | Provider support model | Provider engineering involvement, depending on agreement |
| Isolation | Customer-controlled | Usually shared unless specified | Stronger when single-tenant or reserved resources are allocated |
| Best fit | Predictable usage and strong internal infra capability | Fast deployment and lower operational burden | Production workloads needing isolation and predictable behavior |
| Main risk | Hidden operational cost | Less runtime control | Requires clear workload fit and responsibility boundary |

Key decision criteria

A technical buying committee should evaluate total cost across system behavior, not only infrastructure price.

| Decision criterion | What to evaluate | Why it matters |
|---|---|---|
| GPU utilization | Average utilization, peak utilization, idle capacity | Low utilization raises effective cost per token |
| Latency SLO | P95 and P99 latency targets | Tighter latency targets can require more capacity |
| TTFT | Time to first token under load | Affects perceived responsiveness in streaming experiences |
| Throughput | Tokens per second or requests per second at target latency | Throughput without latency context can be misleading |
| Workload shape | Input tokens, output tokens, context length, concurrency | Token pattern drives compute and memory behavior |
| Traffic pattern | Steady, bursty, seasonal, or unpredictable demand | Determines whether fixed capacity is efficient |
| Context length | Short, long-context, or mixed workloads | Long context increases memory pressure |
| Engineering capacity | Who owns tuning, monitoring, scaling, and incidents | People cost is part of total cost |
| Runtime maturity | Serving engine, batching, scheduling, quantization, memory management | Optimization depth affects cost behavior |
| Reliability requirement | Impact of degraded inference or downtime | Reliability failures create business cost |
| Support requirement | Who helps when the system fails under real demand | Weak support shifts recovery back to the internal team |
| 6–12 month fit | Expected growth in model size, traffic, and context length | Prevents short-term architecture decisions from becoming rework |

How to calculate the total cost of self-hosted inference

A useful cost model should include visible cost, utilization-adjusted cost, engineering cost, and risk cost.

1. Estimate visible infrastructure cost

Start with the direct infrastructure bill:

  • GPU capacity
  • host compute
  • memory
  • storage
  • networking
  • load balancing
  • monitoring tools
  • redundancy
  • backup or failover capacity

This should be calculated by region and provider because pricing and capacity availability vary. Cloud providers also separate pricing by instance type, commitment model, and related infrastructure components. (cloud.google.com)

2. Estimate utilization-adjusted cost

The paid GPU cost is not the same as useful GPU cost.

If the system pays for 100 GPU-hours but only uses 35 GPU-hours effectively, each useful GPU-hour costs about 2.9 times the headline rate (100 / 35), and the cost per useful token rises by the same factor.

Utilization should be measured against production traffic, not synthetic peak benchmarks alone.
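
The adjustment is a single division, sketched below for the 100 paid and 35 useful GPU-hours mentioned above. The function name and the $2.00 hourly rate are hypothetical. How "useful" hours are measured is itself an assumption that each team has to define for its own workload.

```python
def effective_cost_per_useful_gpu_hour(
    hourly_rate: float, paid_gpu_hours: float, useful_gpu_hours: float
) -> float:
    """Spread the full bill over only the GPU-hours that did useful work."""
    if not 0 < useful_gpu_hours <= paid_gpu_hours:
        raise ValueError("useful_gpu_hours must be in (0, paid_gpu_hours]")
    return hourly_rate * paid_gpu_hours / useful_gpu_hours


HOURLY_RATE = 2.00  # hypothetical headline price per GPU-hour
effective = effective_cost_per_useful_gpu_hour(HOURLY_RATE, 100.0, 35.0)

print(f"Headline rate: ${HOURLY_RATE:.2f} per GPU-hour")
print(f"Effective rate: ${effective:.2f} per useful GPU-hour")
print(f"Multiplier: {effective / HOURLY_RATE:.2f}x")
```

At the assumed rate, a $2.00 headline price becomes about $5.71 per useful GPU-hour, a multiplier of roughly 2.86.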

3. Estimate cost per token or request

For LLM workloads, cost should usually be evaluated by token behavior.

Useful metrics include:

  • cost per million input tokens
  • cost per million output tokens
  • cost per request
  • cost per concurrent user
  • cost at target P99 latency
  • cost at expected peak traffic
  • cost at expected average traffic

A system optimized for low cost per token at high batch size may not satisfy latency requirements for an interactive product.
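
One way to derive separate input and output token costs is to split the monthly bill by the share of GPU time spent in prefill versus decode. The sketch below is one possible allocation method, not a standard. The monthly cost, the 30% prefill share, and the token volumes are hypothetical, and the prefill share would need to come from the team's own runtime measurements.

```python
from typing import Tuple


def cost_per_million(cost: float, tokens: float) -> float:
    """Cost per one million tokens."""
    if tokens <= 0:
        raise ValueError("tokens must be positive")
    return cost / tokens * 1_000_000


def split_token_costs(
    monthly_cost: float,
    prefill_time_share: float,
    input_tokens: float,
    output_tokens: float,
) -> Tuple[float, float]:
    """Allocate monthly cost to input and output tokens by share of GPU time."""
    if not 0.0 <= prefill_time_share <= 1.0:
        raise ValueError("prefill_time_share must be between 0 and 1")
    prefill_cost = monthly_cost * prefill_time_share
    decode_cost = monthly_cost - prefill_cost
    return (
        cost_per_million(prefill_cost, input_tokens),
        cost_per_million(decode_cost, output_tokens),
    )


# Hypothetical month: $10,000 total cost, 30% of GPU time in prefill,
# 2 billion input tokens and 500 million output tokens.
input_cost, output_cost = split_token_costs(
    monthly_cost=10_000.0,
    prefill_time_share=0.3,
    input_tokens=2_000_000_000,
    output_tokens=500_000_000,
)

print(f"Cost per million input tokens:  ${input_cost:.2f}")
print(f"Cost per million output tokens: ${output_cost:.2f}")
```

With these assumed inputs, input tokens cost about $1.50 per million and output tokens about $14.00 per million. The same calculation should be repeated at average and peak traffic, because the prefill share and token mix usually change with load.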

4. Add engineering and operations cost

Include the people cost of:

  • setup
  • maintenance
  • monitoring
  • on-call
  • incident response
  • debugging
  • runtime upgrades
  • model migration
  • optimization work
  • documentation
  • internal support

This cost is often larger than expected when the team lacks dedicated MLOps or inference infrastructure capacity.
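
People cost can be estimated the same way as any other line item: hours multiplied by a loaded hourly rate. The sketch below is a minimal example. The activity breakdown, the hours, and the $120 loaded rate are hypothetical placeholders that a team would replace with its own time tracking.

```python
from typing import Mapping


def monthly_engineering_cost(
    hours_by_activity: Mapping[str, float], loaded_hourly_rate: float
) -> float:
    """Total monthly people cost: summed hours times a loaded hourly rate."""
    return sum(hours_by_activity.values()) * loaded_hourly_rate


# Hypothetical monthly hours spent operating the inference stack.
hours = {
    "on-call and incident response": 20.0,
    "runtime tuning and upgrades": 30.0,
    "monitoring and maintenance": 15.0,
    "debugging": 25.0,
}
engineering_cost = monthly_engineering_cost(hours, loaded_hourly_rate=120.0)

print(f"Hours per month: {sum(hours.values()):.0f}")
print(f"Engineering cost per month: ${engineering_cost:,.0f}")
```

These assumed figures add up to 90 hours and $10,800 per month, which never appears on the infrastructure bill.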

5. Add risk cost

Risk cost includes the business impact of degraded or unavailable inference.

This may include:

  • delayed product releases
  • customer-facing latency
  • failed requests
  • internal engineering interruptions
  • emergency re-architecture
  • higher support load
  • loss of confidence in the system

This cost is workload-specific.

A non-critical batch job has a different risk profile from a customer-facing production assistant.

6. Compare against managed or dedicated inference

The comparison should not be:

GPU hourly price vs provider token price.

A better comparison is:

Total internal operating cost vs total provider cost at the required reliability, latency, throughput, and support level.
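
A like-for-like comparison can be sketched by putting both options into the same cost structure. The example below is illustrative only. Every figure is hypothetical, the class and field names are assumptions, and the comparison is only valid if both options are sized for the same reliability, latency, throughput, and support level.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonthlyCost:
    direct_spend: float  # infrastructure bill, or provider fees
    engineering: float   # internal people cost to operate or integrate
    risk: float          # estimated cost of degraded or unavailable inference

    @property
    def total(self) -> float:
        return self.direct_spend + self.engineering + self.risk


def total_cost_per_million_tokens(cost: MonthlyCost, tokens_per_month: float) -> float:
    """Total monthly cost expressed per one million served tokens."""
    if tokens_per_month <= 0:
        raise ValueError("tokens_per_month must be positive")
    return cost.total / tokens_per_month * 1_000_000


# Hypothetical figures for one workload at the same service level.
TOKENS_PER_MONTH = 2_500_000_000
self_hosted = MonthlyCost(direct_spend=12_000.0, engineering=9_000.0, risk=1_500.0)
provider = MonthlyCost(direct_spend=20_000.0, engineering=1_500.0, risk=500.0)

for name, cost in (("self-hosted", self_hosted), ("provider", provider)):
    unit = total_cost_per_million_tokens(cost, TOKENS_PER_MONTH)
    print(f"{name}: ${cost.total:,.0f} per month, ${unit:.2f} per million tokens")
```

In this made-up case, the self-hosted direct spend is much lower ($12,000 versus $20,000), but the totals are close: $22,500 versus $22,000 per month. Different inputs would reverse the result, which is why the comparison has to be run with the team's own numbers.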

Fit / not fit: self-hosted inference

| Self-hosted inference is a fit when | Self-hosted inference may not fit when |
|---|---|
| GPU utilization is predictable and high | Traffic is uneven or hard to forecast |
| The team has strong infrastructure and MLOps capability | GPUs are frequently idle |
| Workloads are stable | P99 latency degrades under concurrency |
| Latency requirements are well understood | Engineers spend too much time on infrastructure |
| The team needs full control over the stack | The serving stack requires constant tuning |
| Engineering time is available | Observability is incomplete |
| The organization can handle incidents directly | Incident response depends on a small internal team |
| Long-term workload volume justifies ownership | Vendor support is weak or slow |
| Internal tooling is already mature | The architecture may need rework in 6–12 months |

Self-hosting should be treated as an operational commitment.

It is not only a deployment choice.

Risks and tradeoffs

Cost risk

A low GPU price can hide poor utilization, idle capacity, overprovisioning, and engineering overhead.

This is a common mistake in self-hosted inference cost analysis.

Reliability risk

A system can pass functional tests and still degrade under sustained production traffic.

Reliability depends on scheduling, memory behavior, monitoring, failover, capacity planning, and incident response.

Latency risk

P99 latency can fail before average latency looks concerning.

For interactive applications, this can affect the user experience even when the system appears healthy on aggregate dashboards.

Scaling risk

GPU inference often has different scaling constraints than stateless web workloads because model memory, KV cache memory, request queueing, warm capacity, runtime behavior, and traffic bursts affect scaling decisions.

Support risk

If support is weak, the internal team becomes the final escalation layer.

This risk matters most when the system fails under real demand.

Lock-in risk

Both self-hosted and managed inference create forms of lock-in.

Self-hosting can create lock-in through internal tooling, runtime assumptions, deployment scripts, and operational knowledge.

Managed inference can create provider dependency.

The better question is which dependency is safer for the workload and team.

Rework risk

A serving setup may work for one model, context length, or traffic pattern and fail after the product changes.

Rework risk should be included in the total cost calculation.

Responsibility boundaries

A clear responsibility boundary is one of the most important parts of the decision.

| Responsibility | Self-hosted inference | Managed inference | Dedicated inference |
|---|---|---|---|
| Application logic | Customer | Customer | Customer |
| Model choice | Customer | Customer / provider-supported | Customer / provider-supported |
| Model deployment | Customer | Provider-managed | Provider-managed or shared |
| GPU provisioning | Customer | Provider | Provider |
| Runtime optimization | Customer | Provider | Provider / shared |
| Batching and scheduling | Customer | Provider | Provider / shared |
| Monitoring | Customer | Provider | Provider |
| Scaling | Customer | Provider | Provider / coordinated |
| Incident response | Customer | Provider support model | Provider engineering involvement, depending on agreement |
| Cost optimization | Customer | Provider / shared | Provider / shared |
| Product behavior | Customer | Customer | Customer |

The boundary should be explicit before the team compares pricing.

Otherwise, the buyer may compare two options that do not include the same operational scope.

Common misconceptions about self-hosted inference cost

“GPU hourly price tells us the cost.”

GPU hourly price is only the visible compute price.

It does not include utilization, latency constraints, supporting infrastructure, observability, engineering time, or incident response.

“Higher throughput always means better economics.”

Higher throughput can reduce unit cost, but only if latency remains acceptable.

For interactive products, high throughput with poor TTFT or P99 latency may not be usable.

“Self-hosting means we have full control.”

Self-hosting gives more control, but it also gives the team more responsibility.

The team must operate the deployment, runtime, monitoring, scaling, and failure response.

“Managed inference removes all customer responsibility.”

Managed inference changes the responsibility boundary.

The provider may own infrastructure and runtime operations, but the customer still owns product requirements, application logic, model behavior expectations, and workload patterns.

“Dedicated inference is just raw GPU rental.”

Dedicated inference should not be evaluated as raw GPU access alone.

In a managed dedicated inference model, the value depends on isolation, orchestration, monitoring, runtime tuning, support, and responsibility ownership.

How Geodd frames inference cost

Geodd treats inference cost as an operational system problem, not only a GPU pricing problem.

Cost behavior depends on how models are deployed, optimized, scheduled, monitored, scaled, and supported under production load.

Geodd provides:

  • Serverless Inferencing and Dedicated Inferencing for teams that want managed inference options
  • Dedicated GPU infrastructure for teams that need dedicated compute
  • DeployPad as the deployment and orchestration layer
  • Optimised Model Engine as the model execution and optimization layer
  • MLOps Services as the operational support layer

Geodd’s product material describes managed inference as covering deployment, model and pipeline optimization, runtime execution, scaling under load, monitoring, and debugging.

DeployPad is described as Geodd’s control and orchestration layer for model selection, workload definition, infrastructure selection, deployment, monitoring, and usage management.

The Optimised Model Engine is described as Geodd’s execution layer for improving speed, latency, throughput, and predictability of inference under load.

For dedicated workloads, Geodd’s Dedicated Deployments provide single-tenant GPU clusters with managed provisioning, orchestration, monitoring, failure recovery, and workload isolation.

This does not mean managed inference or dedicated inference is always cheaper than self-hosting.

It means that for teams where cost is being driven by overprovisioning, low utilization, tuning work, observability gaps, or incident burden, managed inference or dedicated inference should be evaluated against the full cost of internal ownership.

Practical evaluation checklist

Before choosing or continuing with self-hosted inference, evaluate the following.

| Area | Questions to answer |
|---|---|
| Utilization | What is average GPU utilization? What is peak utilization? How much paid capacity is idle? |
| Traffic | Is demand steady, bursty, seasonal, or unpredictable? |
| Latency | What are P50, P95, and P99 latency under real load? |
| TTFT | Is time to first token acceptable for the product experience? |
| Throughput | How many tokens per second can the system sustain at the required latency? |
| Memory | Is KV cache memory limiting concurrency or context length? |
| Scaling | How quickly can the system add capacity during traffic changes? |
| Reliability | What happens when a GPU, node, region, or runtime fails? |
| Observability | Can the team see queueing, TTFT, TPOT, memory pressure, errors, and saturation? |
| Incidents | How many inference incidents or degraded-performance events occur per month? |
| Engineering time | How many hours are spent on tuning, debugging, deployment, and support? |
| Cost model | What is the cost per million input and output tokens under real usage? |
| Future fit | Will the system still work after model, traffic, and context length growth? |
| Ownership | Who is accountable when inference fails under production demand? |

Teams comparing internal operation against a provider should also review pricing, technical documentation, and the relevant product scope before comparing costs.

FAQ

What is the total cost of self-hosted inference?

The total cost of self-hosted inference includes GPU capacity, host compute, storage, networking, orchestration, observability, engineering time, runtime optimization, scaling, incident response, and the cost of reliability under production load.

Is self-hosted inference cheaper than managed inference?

Self-hosted inference can be cheaper when GPU utilization is high, traffic is predictable, and the team has strong infrastructure capability. It can become more expensive when utilization is low, latency tuning is difficult, or engineers spend significant time operating the stack.

What hidden costs are often missed in self-hosted inference?

Common hidden costs include idle GPUs, overprovisioning, monitoring, logging, latency debugging, runtime tuning, on-call work, incident response, model migration, and re-architecture when workload patterns change.

How does GPU utilization affect inference cost?

Low GPU utilization increases the effective cost per token because the team pays for capacity that is not producing useful work. High utilization can improve cost efficiency, but only if latency and reliability remain acceptable.

Why do latency targets increase inference cost?

Strict latency targets reduce batching flexibility. Batching can improve throughput, but larger batch sizes can increase latency. This means low-latency products may need more capacity than throughput-only benchmarks suggest. (databricks.com)

What metrics should teams benchmark before self-hosting inference?

Teams should benchmark TTFT, TPOT, P50 latency, P95 latency, P99 latency, throughput, concurrent requests, GPU utilization, memory utilization, error rate, cold-start behavior, and cost per million input and output tokens.

Databricks’ endpoint benchmarking guidance also frames latency and throughput as key endpoint performance metrics. (docs.databricks.com)

When does dedicated inference make more sense than self-hosting?

Dedicated inference may make more sense when the workload needs isolation, predictable performance, and more control than serverless inference, but the team does not want to own full infrastructure operations internally.

What should a managed inference provider own?

A managed inference provider should clearly define its responsibility for deployment, infrastructure operation, scaling, monitoring, runtime optimization, failure response, and debugging. The customer should still own application logic, product behavior, model requirements, and workload expectations.

Does managed inference remove all customer responsibility?

No. Managed inference changes the responsibility boundary. The provider may own the inference infrastructure and operations, but the customer still owns product requirements, application integration, model selection decisions, usage patterns, and business logic.

What is the main mistake in comparing self-hosted and managed inference?

The main mistake is comparing GPU hourly cost against managed inference pricing without including utilization, latency targets, engineering time, support, monitoring, scaling, and incident response.

The comparison should be based on total operating cost at the required reliability and performance level.