
Kubernetes Pod Requests and Limits: From Fundamentals to FinOps Excellence


CPU, Memory compute resources being applied to pod resources using requests and limits

Traditional virtual machine environments have long enforced resource allocations at deployment time: when you spin up a VM, you explicitly choose its CPU count and memory size, and those resources remain reserved for that VM for its lifetime. Kubernetes and container platforms operate differently: by default, containers can run without any specified resource requests or limits, allowing them to consume as much CPU or memory as the node can spare. This flexibility is powerful—but it also means teams must consciously define resource boundaries for each container, or risk noisy neighbors and unpredictable cluster behavior.


In the cloud-native era, mastering Kubernetes resource requests and limits is essential not only for application performance but also for cost efficiency. These settings determine how much CPU and memory each Pod requests (and is guaranteed to receive) and the limit it cannot exceed. Done right, they keep your apps stable and your cloud bills lean. Done wrong, they can lead to unstable services or wasted budget. This article begins at the 100-level fundamentals of requests/limits and Quality of Service (QoS) classes, then ramps up to 200-level real-world challenges (including impacts on stability, chargeback accuracy, and cost optimization), and finally delves into 300-level deep technical insights on CPU throttling and Pod evictions. Throughout, we connect these concepts to FinOps best practices – demonstrating how savvy resource management in Kubernetes leads to financial operations excellence. Let’s dive in.


Kubernetes Resource Requests and Limits 101 (The Fundamentals)

What Are Requests and Limits? In Kubernetes, every container in a Pod can specify a resource request (the minimum CPU/memory it needs) and a resource limit (the maximum it’s allowed to use). The scheduler uses requests to decide which node can fit the Pod, treating the request as a guaranteed slice of resources for that Pod. Limits, on the other hand, are enforced at runtime: if a container tries to use more than its CPU limit, the Linux kernel will throttle it, and if it uses more memory than its memory limit, it can be terminated (OOM killed). For example, if a container requests 0.5 CPU and 200Mi of memory, Kubernetes ensures the node has at least that free for the container. If it also has a limit of 1 CPU and 300Mi, the container can burst up to 1 full CPU when needed and use up to 300Mi memory, but no more. This prevents any single container from hogging resources beyond a defined point. If no limit is set, a container can theoretically use all available resources on the node – which can be dangerous in multi-tenant clusters.
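To make this concrete, here is a minimal Pod manifest matching the example above; the Pod name and image are placeholders, but the requests and limits mirror the 0.5 CPU / 200Mi request and 1 CPU / 300Mi limit described:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-example            # placeholder name
spec:
  containers:
  - name: web
    image: nginx:1.25          # placeholder image
    resources:
      requests:
        cpu: "500m"            # 0.5 CPU reserved for scheduling decisions
        memory: "200Mi"        # memory the Pod is guaranteed to have
      limits:
        cpu: "1"               # throttled if it tries to exceed 1 full CPU
        memory: "300Mi"        # OOM-killed if it tries to exceed 300Mi
```

Applied with kubectl apply -f, the scheduler will only place this Pod on a node that still has at least 500m CPU and 200Mi memory unreserved.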


Quality of Service (QoS) Classes: Kubernetes uses the combination of requests and limits to assign each Pod a QoS class that reflects its priority under resource pressure. There are three QoS classes: Guaranteed, Burstable, and BestEffort; minimal example manifests for each are sketched after the list below.


  • Guaranteed: A Pod is “Guaranteed” if every container in it has memory and CPU requests equal to their limits. This means the Pod’s required resources are fully reserved on its node. Guaranteed Pods have the highest priority; Kubernetes will only evict them as a last resort.


  • Burstable: If a Pod does not meet the strict criteria for Guaranteed but has at least a request set for one or more resources, it is classed as Burstable. These Pods have some reserved resources (their requests) but can burst up to their limits. They get medium priority – they won’t be the first killed, but they are not as safe as Guaranteed. Most real-world workloads end up as Burstable: e.g. a container requests 1 CPU but can use up to 2 CPU if available (limit 2).


  • BestEffort: If a Pod has no resource requests at all (i.e. all requests are 0 or not set), it’s treated as BestEffort. BestEffort Pods have the lowest priority. They have no guaranteed resources – Kubernetes will allow them to run only if there’s leftover capacity. Under contention, BestEffort Pods are the first to be evicted or throttled since the scheduler assumes they’re “optional” work.
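To make the three classes concrete, the minimal (placeholder) manifests below show how the same container lands in each class depending on its resources stanza; you can confirm the assignment on a running Pod via its status.qosClass field:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: qos-guaranteed                 # placeholder names throughout
spec:
  containers:
  - name: app
    image: nginx:1.25                  # placeholder image
    resources:                         # requests equal to limits -> Guaranteed
      requests: { cpu: "500m", memory: "256Mi" }
      limits:   { cpu: "500m", memory: "256Mi" }
---
apiVersion: v1
kind: Pod
metadata:
  name: qos-burstable
spec:
  containers:
  - name: app
    image: nginx:1.25
    resources:                         # request set, higher limit -> Burstable
      requests: { cpu: "250m", memory: "128Mi" }
      limits:   { cpu: "1", memory: "512Mi" }
---
apiVersion: v1
kind: Pod
metadata:
  name: qos-besteffort
spec:
  containers:
  - name: app
    image: nginx:1.25                  # no requests or limits at all -> BestEffort
```

For example, kubectl get pod qos-guaranteed -o jsonpath='{.status.qosClass}' prints Guaranteed once the Pod is created.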


Why QoS Matters: QoS classes directly tie into cluster stability. The Kubernetes scheduler and kubelet use these classes to decide eviction order during resource shortages. For instance, if a node runs out of memory, Kubernetes will reclaim resources by killing Pods starting with BestEffort, then Burstable (those that exceed their request), and only if absolutely necessary touching Guaranteed Pods. This behavior means that if you accidentally deploy a critical service without requests (thus BestEffort), it could be terminated without warning when the node is under pressure. Even Burstable Pods can be at risk if they constantly exceed their requested memory. In short, setting requests and limits properly isn’t just an academic exercise – it determines which apps survive a resource crunch.


A Simple Example: Imagine a Kubernetes node with 4 CPU and 16GB RAM total. You schedule two Pods: Pod A requests 2 CPU and 4GB, Pod B requests 2 CPU and 4GB. They exactly fill the node’s capacity in terms of requests, so no other Pods can be scheduled there (ensuring these two get the resources they asked for). If Pod A has no CPU limit, it could potentially use all 4 CPUs while Pod B is idle – but if Pod B suddenly needs its share, the kernel’s fair-share scheduling (CPU weights derived from each Pod’s request) pushes Pod A back to its proportional slice. If Pod A had a limit of 2 CPU, Kubernetes (via Linux cgroups) would strictly cap it at 2 CPUs even if the other CPUs are free. Likewise, if neither Pod set any requests (BestEffort), Kubernetes might jam many Pods onto the node – and if they all try to use CPU or memory simultaneously, the node could become overloaded, leading to chaos (CPU contention or the kernel OOM killer terminating processes). This simple scenario shows why requests and limits exist: they are safety rails for fair resource sharing.


For a step-by-step walkthrough on configuring resource requests and limits (and how those translate into QoS classes), check out the official Kubernetes documentation: Quality of Service for Pods.


Kubernetes Requests & Limits 201: Real-World Challenges in Production

Once you grasp the basics, the next step is understanding how these settings play out in production environments – and how they connect to FinOps (financial operations) concerns. Poorly tuned requests and limits can hurt your application stability, lead to inaccurate cost allocations, and inflate your cloud bill with waste. Let’s explore some common challenges:


Over-Provisioning vs. Under-Provisioning: One frequent issue is over-provisioning – setting requests far higher than the app typically needs “just in case.” This might make everyone feel safe, but it means a lot of allocated resources sit idle. In a large cluster, many over-provisioned Pods lead to low overall utilization: you’re paying for capacity that’s never used because requests tie it up. In fact, a recent Google Cloud analysis found that CPU and memory requests are often substantially over-provisioned on average. Even the most efficient teams had room to improve, especially on CPU requests. This study identified workload rightsizing – adjusting requests to match actual usage – as the single biggest opportunity to reduce waste in Kubernetes environments. If you request 4 CPUs for a service that only ever uses 1 CPU, that’s 3 CPUs reserved but idle for every replica (perhaps multiplied across dozens of Pods and nodes), adding cost with no benefit.


The opposite problem, under-provisioning, can be just as troublesome. If you set requests too low (or not at all), you pack many pods onto a node, but when they actually need to use CPU or memory, there isn’t enough to go around. The kernel will start killing processes, or Kubernetes will evict Pods. Under-provisioning often happens when teams deploy BestEffort Pods or give a tiny request to avoid “wasting” resources. The Google Cloud report warns that indiscriminately deploying BestEffort or under-requested Burstable workloads can hurt cluster algorithms (bin packing, scheduling, autoscaling) and negatively affect end-user experience. In other words, trying to save cost by not reserving resources can backfire with reliability issues.


Stability and Performance Impacts: Improper requests and limits directly impact application performance. For instance, setting a memory limit too low can cause a container to get OOM-killed when it tries to use more memory, crashing your app. Setting a CPU limit too low might cause heavy CPU throttling (as we’ll deep-dive in the next section), making your app slow or unresponsive during spikes. Conversely, not setting any limit could let one noisy neighbor consume so much CPU that it starves other critical services on the same node. Every setting is a trade-off. Many production outages or incidents can be traced to these configurations – like a service being killed because it burst slightly above an artificial memory cap, or a Pod going slow because it was throttled by a CPU limit during peak load.


QoS and Eviction in Practice: The QoS classes we covered are not just theoretical – they decide who lives and who dies under pressure. A common poor practice is deploying many BestEffort Pods (no requests) to maximize node utilization. While it might increase reported utilization, it’s a ticking time bomb. If the node experiences any resource crunch, those BestEffort Pods will be the first to go. And they’ll be killed without graceful termination. If those Pods are stateless workers, maybe that’s fine; but if they hold any important workload, your users will notice the sudden disappearance. Even Burstable Pods that use much more memory than their request are at risk – they can be evicted when the node runs low on memory, because Kubernetes prioritizes reclaiming memory from Pods that exceeded what they reserved. The production challenge is finding the right balance: you want to give each Pod enough requested resources to run reliably, but not so much that you waste expensive infrastructure on slack capacity.


Chargeback and Showback Accuracy: In a multi-team or multi-application cluster, FinOps teams often implement chargeback/showback models – attributing cloud costs to the teams or products that incur them. Kubernetes complicates this because resources are shared on nodes. How do you decide what portion of a node’s cost goes to a given Pod or team? The industry practice is to use resource requests as the accounting unit, under the assumption that requests represent the resources “reserved” for that Pod. Many Kubernetes cost tools and platforms allocate costs based on each Pod’s requests (sometimes using usage metrics as well, but requests are a common baseline). This means if a team sets very high requests, they could be charged for more than they actually use (effectively paying for their idle buffer). On the other hand, if they set no requests (BestEffort), the cost tools might attribute zero cost to that Pod – even if it actually consumed a lot of CPU! This is not just hypothetical: the Google Cloud report explicitly notes that when Pods don’t set requests (or are grossly under-requested), it “can limit showback and chargeback accuracy.” For example, a BestEffort Pod could chew up lots of CPU and memory, but since it requested none, no cost is attributed to it. This undermines the whole point of FinOps, which is to make teams accountable for the resources they use. From a fairness perspective, every team should pay for what they use – but in Kubernetes, if you’re not careful, a team could overuse resources while appearing blameless on cost reports. Properly sizing requests fixes this: it aligns cost attribution with reality and discourages abusing “free” unrequested capacity.


Kubernetes QoS classes and how they relate to FinOps practices, enabling chargeback and showback.

Cost Optimization and FinOps Practices: Effective Kubernetes resource management is at the heart of Kubernetes cost optimization. The 2024 “State of Kubernetes Cost Optimization” report by Google Cloud identified four golden signals for cost efficiency: Workload Rightsizing, Demand-Based Downscaling, Cluster Bin Packing, and Discount Coverage. Let’s break those down in plain terms and see how they relate to requests and limits:


  • Workload Rightsizing: Are your Pods’ requests in line with their actual usage? Or are you consistently requesting 4x more CPU than needed? Rightsizing is about eliminating over-provisioning without harming performance. The report found rightsizing to be the most important signal – clusters that focus on it see significantly less waste. A key finding was that even “Elite” teams who set memory requests efficiently still often over-provision CPU, so there’s usually room to improve. The takeaway: continuously review and adjust your requests/limits based on real utilization data. This directly saves money and improves cluster throughput.


  • Demand-Based Downscaling: Do you scale your workloads (and cluster) down when demand is low? This goes beyond individual Pod requests to how you automate resource management. Kubernetes offers tools like the Horizontal Pod Autoscaler (HPA) to reduce Pod replicas when load drops, and the Cluster Autoscaler (CA) to remove idle nodes. But these only work properly if requests/limits are set sanely. The report noted that Elite performers scale down 4× more than low performers by aggressively using autoscaling. In numbers, top teams enabled Cluster Autoscaler 1.4× more, HPA 2.3× more, and even Vertical Pod Autoscaler 18× more than the lowest performers. Simply put, they let the system automatically rightsize and eliminate waste. A cluster won’t scale itself down if Pods declare they need all the resources all the time – so setting reasonable requests enables downscaling. Conversely, if you never set requests/limits, HPA can’t safely scale (it relies on targets like CPU utilization % which depend on requests) and the Cluster Autoscaler might not remove nodes because it sees pods that could use resources. An interesting insight from the report: just turning on Cluster Autoscaler isn’t enough – you need to actually configure your workloads (HPA/VPA) to scale down at idle, otherwise the nodes stay full of (under-used) pods and can’t be removed.


  • Cluster Bin Packing: How efficiently are you packing Pods onto nodes? This metric compares requested resources versus the allocatable resources of nodes. If everyone asks for tiny slices, Kubernetes might schedule too many pods per node and cause contention; if everyone asks for huge slices that they don’t use, nodes sit half-empty in practice. Good bin packing requires accurate requests and a mix of workloads that complement each other’s usage patterns. The challenge is that if some pods have no requests (BestEffort), the scheduler might over-pack the node (since it thinks those pods cost nothing), leading to runtime issues. On the flip side, overly conservative requests under-fill nodes. The goal is to aim for high utilization without overload – something achievable only if requests correlate to real usage. One warning from the field: clusters with lots of BestEffort pods can trick you into thinking the nodes are underutilized (since requested vs allocatable looks low), tempting you to scale down nodes. But if you do, you could suddenly evict those BestEffort pods and disrupt apps. In essence, misleading bin packing data due to bad requests can cause false optimization moves. Proper requests give you accurate bin packing metrics so you know when you truly have excess capacity.


  • Discount Coverage: This one is slightly beyond Kubernetes configs – it’s about using cloud provider discounts (like Spot instances or reserved instances/commitments) to reduce costs. The FinOps tie-in here is that after you’ve minimized waste with the first three signals, you should also pay less for the capacity you do need. Kubernetes doesn’t manage this directly, but top teams leverage it heavily. The report noted Elite teams use Spot VMs and long-term commitments 16× more than low performers. This FinOps practice means your cluster’s actual spend is optimized, not just the utilization. While discount coverage isn’t about requests/limits per se, it’s part of FinOps excellence to combine efficient resource usage with smart purchasing. For example, if you rightsize and free up 30% of your cluster, you might downscale those nodes or move some workloads to cheaper Spot instances – saving real dollars. The key is that rightsizing is often the first step; only then can you correctly size your commitments and confidently use Spot for non-critical loads.


In summary, the production challenge is aligning technical resource management with financial outcomes. Misconfigurations like running every Pod as BestEffort or chronically inflating requests undermine both reliability and cost efficiency. Conversely, doing things right – setting accurate requests, using limits judiciously, enabling autoscalers – leads to stable apps and optimal cloud bills. Next, we’ll deep dive into why some misconfigurations (like tiny CPU limits or no requests) cause the specific problems they do, and later we’ll compile a checklist of best practices to achieve FinOps excellence in Kubernetes.


Deep Dive (301): CPU Throttling and Pod Evictions – Why Misconfigurations Hurt

At this 300-level, let’s get into the nitty-gritty of what happens under the hood with improper resource settings. We’ll explore two common issues: CPU throttling due to CPU limits, and Pod eviction / OOM kills due to memory pressure and QoS. Understanding these will underscore why the best practices are what they are. For this section, we draw on Alexandru Lazarev’s excellent deep-dive “CPU Limits in Kubernetes: Why Your Pod Is Idle But Still Throttled” and his accompanying 100-slide technical presentation—resources that go deeper than you ever wanted to know about CFS throttling and cgroup quotas. Let’s unpack the key insights.


CPU Limits and Throttling: The Hidden Performance Killer

Setting a CPU limit for a container in Kubernetes triggers the Linux Kernel’s CFS (Completely Fair Scheduler) bandwidth control mechanism under the hood. By default, the kernel uses a 100ms window (called the CFS period) to enforce CPU quotas. If a container’s CPU usage exceeds its share within that window, the process is throttled – essentially put on hold until the next 100ms cycle. This is how Kubernetes ensures no container goes beyond its CPU limit: it doesn’t slow it down in a smooth way, it literally interrupts the execution when the quota is used up and resumes later.
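As a rough sketch of how that quota is derived (the Pod below is hypothetical): a CPU limit is multiplied by the 100ms period to produce the bandwidth quota, which shows up in the container's cgroup v2 cpu.max file on the node.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: throttle-demo            # hypothetical example
spec:
  containers:
  - name: app
    image: nginx:1.25            # placeholder image
    resources:
      requests:
        cpu: "200m"
      limits:
        cpu: "400m"              # 0.4 CPU
# With the default 100ms CFS period, the runtime sets this container's
# cgroup v2 cpu.max to approximately "40000 100000":
#   quota = limit × period = 0.4 × 100,000µs = 40,000µs of CPU time per period
# Once those 40ms are consumed, the container is throttled until the next
# 100ms window begins.
```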


Why is that a problem? Imagine you set a CPU limit of 0.4 cores (i.e. 400 millicpu) for a web service container. That means in each 100ms interval, it can use at most 40ms of CPU time (0.4 of 100ms). Now suppose the service gets a spike of traffic and needs 200ms of CPU time in a burst to handle some requests. With no CPU limit, it could use a full core for 200ms and finish the work. With a 0.4 limit, it uses 40ms out of each 100ms and then gets throttled for the rest of that interval, so it needs five intervals to accumulate its 200ms of CPU time. Lazarev’s analysis illustrates this scenario: a task that would finish in 200ms ends up taking about 440ms (2.2× longer) when constrained by a 0.4 CPU limit – four full 100ms periods plus 40ms into the fifth, with the throttle kicking in 4 times and stretching out the response. For a user, this might mean a web request that usually takes 0.2s suddenly takes nearly 0.5s – a noticeable lag.


An image depicting cgroup CPU processing periods using CPU limits

Now, you might think: “Okay, but if my service rarely hits that limit, I’m fine.” The scary part is that throttling can occur even when average CPU usage is low. It’s the short bursts that get penalized. Your monitoring might show the container at only 20% CPU usage on average (which looks safe under a 40% limit), yet that container experiences periodic throttling during bursty moments. Lazarev points out this paradox: dashboards often show low CPU usage while the app is suffering from latency spikes, making the root cause elusive.


Effects of CPU Throttling: Throttling isn’t just a number on a graph – it manifests as real issues in your application. Some examples observed:


  • Slower processing and queue backups: If your service can’t use more CPU when needed, requests line up. This increases latency and can overflow request queues.


  • Failed health checks: Kubernetes liveness and readiness probes might time out if the app gets throttled during critical moments, leading to container restarts even though nothing is “wrong” except CPU was artificially constrained.


  • Garbage Collection pauses: Runtimes like Java’s JVM or .NET’s CLR might stall if they can’t get enough CPU to finish GC cycles promptly. Lazarev notes that CPU limits can cause GC to lag, sometimes triggering OOM conditions because memory isn’t freed in time.


  • Missed heartbeats or election timeouts: If your Pod is part of a distributed system (say a database cluster or Kafka node), throttling might cause it to miss heartbeat messages or leader election calls, which can cause cluster instability or failovers.


All of this stems from a CPU ceiling that is, ironically, set too low. And it often goes unnoticed until a serious incident, because average metrics don’t scream “CPU bound” when limits are the culprit. This is why some experts now suggest: if possible, avoid setting CPU limits for throughput-critical or latency-sensitive apps. Give them a request (so they get scheduled with a guaranteed share), but either set a very high limit or no limit, relying on the kernel’s normal scheduling to allocate CPU fairly. The kernel’s CFS without specific limits will proportionally share CPU based on weights (which Kubernetes derives from the CPU request). That means if two busy containers have equal CPU requests and are competing on a saturated node, they’ll roughly split the available CPU 50/50. CPU limits, however, put up a hard wall and can introduce these throttling side-effects.


To be clear, CPU limits aren’t evil – they exist to protect truly shared clusters (e.g. multi-tenant environments where you don’t want one team’s app using all CPUs). But you should use them with caution. Lazarev’s deep dive (and his comprehensive slide deck) goes into how CFS and cgroups v2 enforce cpu.max quotas and demonstrates that even a Pod that appears “idle” might hit hidden throttling if the limit is misconfigured. The depth of his analysis is beyond our scope here (involving Linux scheduler internals), but the key lesson for Kubernetes practitioners is: improper CPU limits can degrade application performance in non-obvious ways. Always monitor throttling metrics (e.g. container_cpu_cfs_throttled_seconds_total in Prometheus) if you do set CPU limits, and consider removing or raising limits for workloads that can be trusted not to abuse the CPU. It’s often better to rely on CPU requests for baseline scheduling and let bursty workloads use extra CPU when it’s available – that hardware is there to be used, after all, and if no other Pod needs it at that moment, why artificially restrict it?
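If you do keep CPU limits, make the throttling visible. Here is a sketch of a Prometheus alerting rule built on the standard cAdvisor counters; the 25% threshold, durations, and labels are arbitrary choices to adapt to your environment:

```yaml
groups:
- name: cpu-throttling
  rules:
  - alert: ContainerCPUThrottlingHigh
    # Fraction of CFS periods in which the container was throttled
    expr: |
      rate(container_cpu_cfs_throttled_periods_total[5m])
        / rate(container_cpu_cfs_periods_total[5m]) > 0.25
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "Container {{ $labels.container }} throttled in more than 25% of CFS periods"
```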


Prometheus CPU throttling (CFS) graph

Memory Limits and Pod Evictions: The OOM Killer and QoS in Action

On the memory side, the dynamics are a bit different. Unlike CPU, we can’t “throttle” memory usage gradually – a process either uses memory or not. Kubernetes memory limits are enforced by the kernel cgroup as a hard cap. If a container tries to allocate more memory than its limit, the kernel’s OOM killer will step in and terminate the process (typically the container’s main process). In Kubernetes terms, the container (and thus Pod) is OOMKilled. This is a blunt but necessary mechanism; it’s better than the entire node crashing due to one runaway container. But it means setting a memory limit too low will cause your app to be killed unexpectedly when it reaches that threshold. Always ensure your memory limit is above your application’s realistic peak usage.

Now, what about evictions? Kubernetes eviction is a higher-level concept than the kernel OOM killer. Eviction kicks in when a node itself is under resource pressure (like overall memory on the node is scarce). The kubelet will preemptively evict Pods to relieve pressure and prevent total node failure. When doing so, it respects QoS classes:


  1. It will evict BestEffort Pods first (since they made no reservations).

  2. If more relief is needed, it will evict Burstable Pods that are using more memory than their request (they’ve gone into “surplus” usage).

  3. Guaranteed Pods are evicted last (only if pressure still cannot be relieved and the node itself is at risk of failing).

Memory limit image showing node pressure pod effects on different QoS classes

This ties back to why BestEffort is dangerous for any important workload. Under memory pressure, your BestEffort Pods are essentially sacrificial lambs. For example, if a node’s memory is 95% used, the kubelet might evict a few BestEffort Pods to free memory. They get no grace period – just a SIGKILL, goodbye. If those were doing something important, tough luck (and hopefully your app is designed to restart and recover). Even Burstable Pods aren’t entirely safe: if they’re consuming above what they requested, they’re fair game for eviction once all BestEffort are gone or if none exist.


An anti-pattern seen in the wild is intentionally deploying certain workloads as BestEffort to fill up unused capacity (for example, a batch job that processes data when resources are free). This can be okay if you truly don’t mind them being killed, but some teams forget what’s BestEffort in their cluster. We’ve heard of cases where a critical logging agent or monitoring sidecar was accidentally left with no requests – during a surge it got evicted, and suddenly visibility into the system was lost because the monitoring agent died first! Always double-check that anything important has at least a small request to put it in Burstable class, if not Guaranteed.


Another angle: Pod Priority (a separate feature from QoS) can be used to control eviction order, but QoS is still a baseline. Even a high-priority Pod, if it’s BestEffort, might be terminated before a lower-priority Guaranteed Pod in some scenarios. Generally, QoS is the first filter, then Priority within the same QoS class.


In sum, memory misconfigurations manifest as either OOM kills (if limit too low for that container) or evictions (if request too low and node under pressure). Both result in your Pod crashing. The solution is straightforward: give your app enough memory request to cover its typical needs (so it’s unlikely to be evicted) and set a reasonable limit above that (to allow headroom but still cap runaway situations). And of course, monitor – if you see OOMKilled events, that’s a sign the limit is too low or the app memory usage spiked unexpectedly (which might indicate a bug or just the need to increase the allotment).


The Cost Connection:

It’s worth noting how these technical issues tie back to FinOps. A throttled CPU might mean slower transactions – in an e-commerce app, that could indirectly mean lost revenue or unhappy customers (a business cost). Frequent OOM kills might force you to oversize Pods out of paranoia, which means more memory provisioned cluster-wide and thus higher costs. There’s a sweet spot: allocate just enough resources to avoid performance degradation or crashes, but not so much that you’re paying for unused capacity. Achieving this requires profiling and monitoring your workloads continuously.


Alexandru Lazarev’s deep dive, for example, provides a thought-leadership perspective: he challenges the “widely accepted practice” of slapping CPU limits on everything, showing that it can sometimes do more harm than good, particularly in latency-sensitive systems. This kind of nuanced understanding is what differentiates teams that treat Kubernetes as a simple deployment target from those that achieve FinOps excellence with Kubernetes – the latter really know what’s happening under the hood and tune their clusters accordingly.


We’ve seen what can go wrong, from wasted resources to throttled CPUs and evicted Pods. Now let’s turn to a more positive note: how do we apply these insights for FinOps success?


FinOps Excellence: Best Practices for Kubernetes Resource Management

At this point, it’s clear that managing requests and limits in Kubernetes is a balancing act between performance and cost. Embracing FinOps excellence means you consistently govern this balance across your organization’s clusters. Here we distill a set of actionable best practices (and highlight poor vs. good practices) to help you optimize both reliability and cost. Think of this as your Kubernetes resource management checklist for technical and financial success:


1. Always Set Requests (Avoid BestEffort Pods)

Poor Practice

Deploying Pods with no resource requests (BestEffort) to “use every drop of the machine.” This often leads to those Pods getting evicted first under pressure, and they contribute zero to cost estimates (skewing chargeback).


Better Practice

Always set at least a minimal CPU and memory request for every Pod in production. This ensures the Pod is in at least Burstable class and won’t be arbitrarily killed when the node is full. Even a small request (e.g. 50m CPU, 100Mi memory) for low-priority pods is better than none. By setting requests, you not only protect the Pod from sudden eviction, but you also make its resource usage visible to the scheduler and any cost tools. A Google report explicitly advises creating awareness among developers about the importance of setting requests – it’s foundational to both reliability and cost accountability.
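One way to make this stick is a namespace-level LimitRange that injects defaults whenever a team forgets to set them; the namespace and values below are placeholders to adapt:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-requests
  namespace: team-a            # placeholder namespace
spec:
  limits:
  - type: Container
    defaultRequest:            # applied when a container omits requests
      cpu: "50m"
      memory: "100Mi"
    default:                   # applied when a container omits limits
      cpu: "500m"
      memory: "256Mi"
```

With this in place, a container deployed without any resources stanza still receives requests, so it cannot silently land in the BestEffort class.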


2. Right-Size Your Requests and Limits

Poor Practice

“One-size-fits-all” resource values – e.g. every microservice gets 1 CPU, 1GB by default because that’s what an older service needed – or never revisiting initial guesses. This leads to over-provisioning (waste) or under-provisioning (performance issues).


Better Practice

Measure actual usage and adjust. Implement a culture of workload rightsizing. Use metrics from production (via Prometheus, DataDog, etc.) to see each deployment’s peak and average usage. Then set requests slightly above the average or p95 usage, and limits at some safe margin above the peak (depending on how much burst you want to allow). This may involve periodic tuning – applications evolve, usage patterns change. Some teams run automated tools to suggest rightsizing (e.g. Kubernetes Vertical Pod Autoscaler in recommendation mode, or open-source tools like Goldilocks). Even without fancy tools, you can do quarterly reviews of resource usage. Rightsizing ensures you’re not paying for phantom resources that aren’t used. The Google report found most clusters have substantial over-provisioning, and tackling that first yields the biggest cost reductions. One caution: do this gradually and observe how the app behaves as you lower requests; you don’t want to accidentally undercut performance. But in general, if you see a deployment using only 100m CPU and it has a 1 CPU request, you have room to lower it and free 900m for other uses or node reduction.


3. Use Horizontal Pod Autoscaler (HPA) for Burstiness

Poor Practice

Running a fixed number of Pod replicas at all times, sized for peak load (or worse, sized below peak and just hoping nothing goes wrong). This either wastes resources during low traffic or fails to handle high traffic, neither of which is ideal.


Better Practice

Enable the Horizontal Pod Autoscaler on workloads with variable demand. HPA scales the replica count based on metrics (like CPU or custom metrics). Importantly, HPA uses the CPU/memory requests as part of its utilization calculation – it doesn’t consider limits. So correct requests make your HPA decisions more accurate. By using HPA, you can have (for example) 2 replicas when load is low and automatically scale up to 10 replicas when load is high. This ties into FinOps by implementing demand-based downscaling: when demand drops again, HPA will scale back down, allowing the cluster to potentially remove nodes (via Cluster Autoscaler) and save cost. Elite teams were found to leverage HPA much more and thus achieve 4× greater scale-down during off-peak hours. HPA is a powerful tool to ensure you’re only running the pods you need at any given time. Make sure to configure sensible target metrics (e.g. scale at 70% CPU utilization) and have enough headroom in limits to actually let pods reach that utilization.
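A minimal HorizontalPodAutoscaler matching those numbers might look like the sketch below; the HPA and Deployment names are placeholders:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa                    # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                      # placeholder Deployment to scale
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70     # percent of each Pod's CPU *request*
```

Because the utilization target is measured against requests, an inflated request quietly raises the real CPU level at which scaling kicks in – another reason rightsizing matters.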


4. Leverage Cluster Autoscaler (CA) and Right-Size Nodes

Poor Practice

Having a fixed-size Kubernetes cluster or over-provisioning nodes “just in case.” This often results in a lot of unused capacity (nodes with low utilization) or, conversely, running out of room if usage spikes unexpectedly.


Better Practice

Enable Cluster Autoscaler on your cluster if you’re on a cloud platform. CA will add new nodes when pods are pending (can’t be scheduled due to lack of resources) and remove nodes when they become idle. It works hand-in-hand with HPA. For truly efficient operations, use both: HPA scales pods down, freeing nodes, and CA then scales nodes down to cut costs. One key insight (from the report and experience): simply turning on CA isn’t enough – if your pods never scale down or if you request resources that leave fragmentation, CA might not remove nodes. Make sure to deploy HPA/VPA so that pods relinquish resources when possible. Additionally, choose appropriate node sizes for your workloads to improve bin packing. Sometimes using a mix of instance types (some large, some small) allows the scheduler to pack pods more tightly and CA to remove entire nodes when load is low. A well-tuned CA means you don’t pay for idle VMs. This is FinOps gold: it directly translates to not being billed when you don’t need the capacity. Most cloud providers support CA in their managed Kubernetes – take advantage of it if you can.


💡 Ensure Graceful Termination with Signal Handling 💡

When HPA scales pods down or the Cluster Autoscaler evicts nodes, Kubernetes sends a SIGTERM to each container. If your application doesn’t handle that signal properly—closing connections, flushing state, or deregistering—you risk data loss or errors. For a deep dive on writing containers that gracefully shut down in Kubernetes, check out our TechStacksDecoded article on “Kubernetes Container Signal Handling.”


5. Be Cautious with CPU Limits (and Use Throttling Metrics)

Poor Practice

Applying strict CPU limits to every container by default (e.g. limit = request in all cases) without understanding the app’s needs. This can inadvertently choke high-performance apps or introduce latency due to throttling (as we explored in depth).


Better Practice

Decide on CPU limits case-by-case. For many workloads, you might set no explicit CPU limit at all – just a request. This allows the app to use any available CPU on the node when it needs to, while still ensuring fairness via the request (the kernel gives proportional CPU shares based on request weights). If you are in a multi-tenant cluster or you know a specific service shouldn’t go beyond a certain CPU for business reasons, go ahead and set a limit, but monitor it. Use dashboards or alerts on the CPU throttling metrics to catch if your app is being throttled frequently. If you see significant throttling and performance issues, consider raising the limit or removing it. Another strategy is to use Kubernetes LimitRange and ResourceQuota policies: for example, enforce that all pods in a namespace have some request, and perhaps set sane maxima. But avoid a blanket policy like “limit = request” if it’s not truly needed for all – that can negate any benefit of bursting. The best practice is: memory limits – yes (protect the node), CPU limits – maybe (only if needed). This nuanced approach comes from understanding the harm a too-low CPU limit can do. By not overusing CPU limits, you let the infrastructure work to your advantage (use spare cycles when available), which can improve throughput without costing more.
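For namespace-level guardrails, a ResourceQuota caps what a team can claim in aggregate without forcing a per-container “limit = request” policy; the numbers below are illustrative:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a          # placeholder namespace
spec:
  hard:
    requests.cpu: "20"       # total CPU the namespace may request
    requests.memory: 64Gi    # total memory the namespace may request
    limits.memory: 96Gi      # cap aggregate memory limits; no CPU limit required
```

Note that once a quota constrains requests.cpu or requests.memory, Pods in that namespace must declare those requests (or pick them up from a LimitRange default) to be admitted – a useful side effect for the chargeback discussion later.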


6. Set Memory Limits to Prevent Runaways

Poor Practice

Not setting memory limits because “our app never uses more than X” or because you fear OOM kills. This can be risky – a bug or unforeseen usage pattern might make your app consume all node memory and crash the whole node.


Better Practice

Always set a memory limit that is reasonably above the expected usage but within safe bounds. Essentially, you’re drawing a line: “if the app goes beyond this, something’s wrong enough that we prefer it to be restarted.” This prevents one container from OOMing the entire node. Coupled with that, ensure the memory request is a value the app usually needs. For example, if an app normally uses ~500Mi, you might set request=500Mi, limit=800Mi or 1Gi. That way it has breathing room, but if it goes haywire (memory leak, etc.) and tries to use 2Gi, it will get OOM-killed at 1Gi and not jeopardize other pods. Yes, that pod will restart, but that’s better than the whole node going down. Monitor your pods for OOMKilled events and adjust limits if you consistently hit them during normal operation (that’s a sign your limit is too low). Memory limits are a crucial safety net.


7. Use Vertical Pod Autoscaler (VPA) for Continuous Tuning

(This is a more advanced but useful practice.)

Poor Practice

Manually guessing resource sizes for dozens of microservices and not revisiting them for months.


Better Practice

Use Vertical Pod Autoscaler in “recommendation” or “auto” mode for workloads that are not extremely latency-sensitive. VPA can monitor the actual usage of pods over time and suggest updated requests (and optionally limits) for them. In recommendation mode, it doesn’t change anything automatically, but you can check its suggestions and then modify your deployments accordingly. In auto mode (use with caution), it will actually adjust the requests/limits and restart pods during low traffic periods to apply them. The benefit is you offload the tedious work of monitoring and tuning each deployment. If you’re running at scale (many teams, many services), VPA can be a big help to ensure everything is getting the Goldilocks treatment (not too little, not too much). The data from the field shows Elite clusters enabling VPA 18× more than low performers – indicating they invest in this kind of automation heavily. Even if you don’t use VPA, at least periodically audit your largest deployments for rightsizing opportunities.
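A sketch of a VPA object in recommendation-only mode (this assumes the VPA components and CRDs are installed in the cluster; names are placeholders):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-vpa                  # placeholder name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                    # placeholder Deployment to observe
  updatePolicy:
    updateMode: "Off"            # recommend only; never modify running Pods
```

kubectl describe vpa web-vpa then surfaces lower-bound, target, and upper-bound recommendations that you can fold back into the Deployment's requests.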


8. Employ In-Place Resource Resize (Kubernetes v1.33+)

Poor Practice

Whenever workloads need a sizing tweak, teams roll out new Pod versions or manually restart Deployments—causing unnecessary downtime, churning replicas, and delaying FinOps optimizations.


Better Practice

Use the Kubernetes In-Place Resource Resize feature (KEP-1287), available in beta from v1.33 onward. This lets you adjust CPU and memory requests and limits on running Pods without restarting them. Paired with VPA recommendations, you can non-disruptively shrink or expand resources in production (a minimal sketch follows the list below), achieving:

  • Zero-downtime tuning (no replica restarts)

  • Faster rightsizing cycles (instant application of FinOps insights)

  • Reduced operational churn (fewer rolling updates)

  • More agile cost optimization (right-sized resources, always)
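As a hedged sketch of what that can look like (field names follow KEP-1287, but exact kubectl flags and defaults may vary by version): a container can declare a resizePolicy, and new values are applied through the Pod's resize subresource instead of a rollout.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: resize-demo              # hypothetical example
spec:
  containers:
  - name: app
    image: nginx:1.25            # placeholder image
    resizePolicy:
    - resourceName: cpu
      restartPolicy: NotRequired # CPU can change without restarting the container
    - resourceName: memory
      restartPolicy: NotRequired
    resources:
      requests: { cpu: "500m", memory: "256Mi" }
      limits:   { cpu: "1", memory: "512Mi" }
# Later, resize in place (no Pod recreation), for example:
#   kubectl patch pod resize-demo --subresource resize --patch \
#     '{"spec":{"containers":[{"name":"app","resources":{"requests":{"cpu":"250m"}}}]}}'
```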

Learn more about how In-Place Resource Resize works in our deep dive on Kubernetes v1.33 -


9. Implement Showback/Chargeback with Guardrails

Poor Practice

Not tying resource usage to any cost accountability, or ignoring Kubernetes in cloud cost accounting. This can lead to teams abusing resources (even unintentionally) because “nobody is watching the meter.”


Better Practice

Adopt a Kubernetes cost allocation tool or process (e.g. Kubecost, Cloud provider cost insights, or in-house reporting) to show teams how their requested vs. used resources translate to dollars. Make it transparent – for example, monthly reports per team namespace, showing requested CPU-hours vs. actual usage and the cost impact. When teams see that their app consistently requests 4 vCPUs but uses 1, and that this costs $X per month of waste, they’ll be incentivized to fix it. Tie this into your FinOps program: create cost optimization goals around increasing utilization. However, ensure you have guardrails: for instance, enforce via policy that no deployment can be BestEffort (requests=0) by using admission controllers or OPA policies. The combination of visibility and policy ensures no one “cheats” the system. The earlier quote from the report about BestEffort pods not being charged is a warning – so your guardrail might be “no BestEffort pods in production namespaces, period.” This way, every team has some skin in the game. Chargeback models vary (some charge by requests, some by actual usage, or a blend), but whatever you choose, align it such that setting sensible requests is rewarded.
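One common guardrail, shown here with Kyverno as an assumed policy engine (the policy name and scope are illustrative), is to reject Pods that omit CPU or memory requests:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-requests              # illustrative policy name
spec:
  validationFailureAction: Enforce    # reject non-compliant Pods outright
  rules:
  - name: containers-must-set-requests
    match:
      any:
      - resources:
          kinds:
          - Pod
    validate:
      message: "CPU and memory requests are required (no BestEffort Pods)."
      pattern:
        spec:
          containers:
          - resources:
              requests:
                cpu: "?*"             # any non-empty value
                memory: "?*"
```

Equivalent rules can be expressed with OPA Gatekeeper or a ValidatingAdmissionPolicy; the point is that the guardrail is enforced at admission time rather than discovered on a cost report.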


10. Consider Pod Priority and QoS for Reliability

Poor Practice

Treating all workloads the same in terms of importance. In a crunch, you might inadvertently evict a critical service before a less critical one just because of how requests were set.


Better Practice

Use Pod Priorities in conjunction with QoS to ensure the most important services survive the longest during resource crises. For example, set your core revenue-service Pods with a high priority class and sufficient Guaranteed QoS, and maybe batch jobs with a lower priority. Kubernetes will evict lower priority pods first if it comes down to choosing (after considering QoS). This isn’t directly a cost optimization, but it protects business value, which is the other half of FinOps (remember, FinOps is about reducing cost while maximizing value). Priorities, plus proper requests, ensure you’re not penny-pinching in a way that kills the crown jewels of your platform.
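A sketch of the pieces involved; the class name, value, and workload are arbitrary:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: business-critical        # arbitrary class name
value: 100000                    # higher value = preempted/evicted later
globalDefault: false
description: "Revenue-critical services"
---
apiVersion: v1
kind: Pod
metadata:
  name: checkout-service         # placeholder workload
spec:
  priorityClassName: business-critical
  containers:
  - name: app
    image: nginx:1.25            # placeholder image
    resources:                   # requests == limits -> Guaranteed QoS as well
      requests: { cpu: "1", memory: "1Gi" }
      limits:   { cpu: "1", memory: "1Gi" }
```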


11. Optimize Node Sizing and Leverage Spot Instances

Poor Practice

Running only one type of node (instance) and never updating node sizes as workload patterns change; or ignoring cheaper pricing options like Spot instances for suitable workloads.


Better Practice

Reevaluate your node instance types and usage of Spot (or other discounts) periodically. Sometimes using a bigger node can improve bin packing (if your workloads can share it well) and reduce per-unit cost. Conversely, lots of tiny nodes might increase overhead. Also identify which workloads can tolerate interruptions (e.g. stateless workers, batch jobs) and consider running those on Spot instances or preemptible VMs with much lower cost. The top FinOps performers achieved extremely high discount coverage – they didn’t just optimize usage, they optimized the price of the infrastructure supporting that usage. In Kubernetes, you can mix node groups in a cluster (some on-demand, some spot) and use taints/tolerations or nodeSelectors to schedule certain pods to the cheaper nodes. Just make sure to use PodDisruptionBudgets and perhaps PriorityClasses wisely so that if the cloud revokes the Spot VMs, your cluster can handle it (and possibly fall back to on-demand). This is advanced, but it’s the icing on the cake: after you’ve minimized waste, minimize unit cost of what you do need. FinOps is achieved when you serve the same workload for less money and with no loss in reliability or performance.
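A hedged sketch of steering an interruption-tolerant workload onto a spot node pool: the node label and taint key below are assumptions – substitute whatever your cloud provider or node provisioner actually applies – and the PodDisruptionBudget keeps voluntary disruptions bounded:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-worker                 # placeholder workload
spec:
  replicas: 6
  selector:
    matchLabels: { app: batch-worker }
  template:
    metadata:
      labels: { app: batch-worker }
    spec:
      nodeSelector:
        node-pool: spot              # assumed label on the spot node pool
      tolerations:
      - key: "spot"                  # assumed taint applied to spot nodes
        operator: "Exists"
        effect: "NoSchedule"
      containers:
      - name: worker
        image: nginx:1.25            # placeholder image
        resources:
          requests: { cpu: "500m", memory: "512Mi" }
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: batch-worker-pdb
spec:
  minAvailable: 4                    # keep most replicas up during node drains
  selector:
    matchLabels: { app: batch-worker }
```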


By following these best practices, you’ll avoid the common pitfalls (like BestEffort sprawl, throttling woes, or unjustified cloud bills) and move toward a Kubernetes setup that is efficient, resilient, and financially optimized. It’s not a one-time task but an ongoing process – as the quote goes, “you can’t manage what you can’t see”, so keep investing in observability of both technical and cost metrics. This way, Kubernetes becomes not just an infrastructure platform but a FinOps enabler – giving your team the confidence to innovate rapidly, with resources allocated wisely and cost under control.


Conclusion

Kubernetes Pod requests and limits might start as simple knobs to turn, but as we’ve journeyed from fundamentals to FinOps excellence, we’ve seen they have far-reaching effects. They influence how the scheduler places your workloads, how the kernel shares CPU cycles, how (and if) your pods survive under duress, and how your cloud spend maps to each application. Mastering this aspect of Kubernetes is a hallmark of engineering maturity – it means you’re squeezing the most value out of your infrastructure while safeguarding performance.


In practice, achieving this mastery requires collaboration: developers must understand the importance of specifying accurate requests/limits, DevOps/SRE engineers must provide tooling and policies to guide them, and FinOps analysts must feed back insights about cost and utilization. The organizations that do this – as evidenced by the “Elite” cost-optimized Kubernetes clusters – are running at a different level, with high efficiency and strong reliability hand-in-hand. They treat resource management not as a static configuration, but as a continuous improvement process that involves monitoring, autoscaling, and cost analysis.


By applying the knowledge and best practices outlined here – from avoiding throttling pitfalls to embracing autoscalers and rightsizing – you can tune your Kubernetes environment for both technical performance and financial performance. FinOps excellence in Kubernetes is absolutely attainable: it’s about making every CPU cycle and every gigabyte of memory work towards your business value, and not paying a penny more than necessary for idle or unused capacity.


Lead your team by example: start implementing these practices, educate others on why they matter, and cultivate a culture where efficiency is a shared responsibility. Kubernetes is a powerful platform, and with great power comes great responsibility – to use resources wisely. Master your Pod requests and limits, and you’ll be well on your way to Kubernetes-powered innovation that’s as cost-effective as it is cutting-edge.
