Document URL: https://docs.openshift.com/container-platform/3.3/admin_guide/out_of_resource_handling.html#out-of-resource-schedulable-resources-and-eviction-policies

Section Number and Name:

Describe the issue: The documentation says:

kubeletArguments:
  eviction-hard:
  - "memory.available<500Mi"
  system-reserved:
  - "1.5Gi"

- Node memory capacity of 10Gi.
- Operator wants to reserve 10 percent of memory capacity for system daemons (kernel, node, etc.).
- Operator wants to evict pods at 95 percent memory utilization to reduce thrashing and incidence of system OOM.

Issue: Somehow, the 1.5Gi is calculated as 1Gi (10% of 10Gi) + 0.5Gi (5% of 10Gi). We need more clarification on the math behind this.

Suggestions for improvement:

Additional information:
@Derek, looks like this bit of content was sourced from you. Can you please help us clarify the math here?
I agree this is confusing; let me try to explain better.

A node reports two values:
1. capacity: how much resource is on the machine
2. allocatable: how much resource is made available for scheduling

The goal is to allow the scheduler to fully allocate a node and to not have evictions occur. Evictions should only occur if pods use more than their requested amount of resource.

If a node has 10Gi of capacity, and we want to reserve 10% of that capacity for the system daemons, we do the following:

capacity = 10Gi
system-reserved = 10Gi * .10 = 1Gi

The node allocatable value in this setting becomes:

allocatable = capacity - system-reserved = 9Gi

This means that, by default, the scheduler will schedule pods that request up to 9Gi of memory to that node.

If we want to turn on eviction so that eviction is triggered when available memory falls below 5% of capacity, we need the scheduler to see allocatable as 8.5Gi. To do this, the math becomes the following:

capacity = 10Gi
eviction-threshold = 10Gi * .05 = .5Gi
system-reserved = (10Gi * .10) + eviction-threshold = 1.5Gi
allocatable = capacity - system-reserved = 8.5Gi

The key piece of information is that you need to set system-reserved equal to the amount of resource you want to reserve for system daemons plus the amount of resource you want to reserve before triggering evictions.
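The math above corresponds directly to the kubeletArguments stanza quoted in the report: the 500Mi hard-eviction threshold is 5% of the 10Gi capacity, and the 1.5Gi system reservation covers 1Gi for system daemons plus the 0.5Gi eviction threshold:

```yaml
# node-config.yaml fragment (values from the 10Gi example above)
kubeletArguments:
  eviction-hard:
  - "memory.available<500Mi"   # 5% of 10Gi capacity
  system-reserved:
  - "1.5Gi"                    # 1Gi for system daemons + 0.5Gi eviction threshold
```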
Work in progress: https://github.com/openshift/openshift-docs/pull/3384
Hello Derek, thank you. Your explanation is clear, but it does not cover all the variables. If we have only 'allocatable' and we want to trigger eviction when we run out of it, then it is as you described: we have to reserve some memory for system services, and we want some for a threshold. But there are two separate options, "eviction-hard:" and "eviction-soft:". How are they used? Where do they fit in the scenario you described?
The usage of a soft eviction is more common when you are targeting a certain level of utilization but are willing to tolerate temporary spikes. I would recommend that the soft eviction threshold always trigger before the hard eviction threshold (that is, at a higher memory.available cutoff), but the grace period is operator-specific. The system reservation should also cover the soft eviction threshold. Let's update the original scenario as follows:

If a node has 10Gi of capacity, and we want to reserve 10% of that capacity for the system daemons, we do the following:

capacity = 10Gi
system-reserved = 10Gi * .10 = 1Gi

The node allocatable value in this setting becomes:

allocatable = capacity - system-reserved = 9Gi

This means that, by default, the scheduler will schedule pods that request up to 9Gi of memory to that node.

If we want eviction to be triggered when the node observes available memory fall below 10% of capacity for 30 seconds, or immediately when it falls below 5% of capacity, we need the scheduler to see allocatable as 8Gi:

capacity = 10Gi
soft-eviction-threshold = 10Gi * .10 = 1Gi
system-reserved = (10Gi * .10) + soft-eviction-threshold = 2Gi
allocatable = capacity - system-reserved = 8Gi

So basically, you need to ensure your system reservation covers the greater of your eviction thresholds.
I copy/pasted bad text in my previous comment. The usage of a soft eviction is more common when you are targeting a certain level of utilization but are willing to tolerate temporary spikes. I would recommend that the soft eviction threshold always trigger before the hard eviction threshold (that is, at a higher memory.available cutoff), but the grace period is operator-specific. The system reservation should also cover the soft eviction threshold.

If we want eviction to be triggered when the node observes available memory fall below 10% of capacity for 30 seconds, or immediately when it falls below 5% of capacity, we need the scheduler to see allocatable as 8Gi. So basically, you need to ensure your system reservation covers the greater of your eviction thresholds.
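A node-config fragment matching this corrected scenario might look like the following. This is a sketch, not text from the docs: the eviction-soft and eviction-soft-grace-period entries are illustrative, assuming the same kubeletArguments list format quoted in the original report.

```yaml
# Hypothetical node-config.yaml fragment for the soft + hard eviction scenario
kubeletArguments:
  eviction-soft:
  - "memory.available<1Gi"     # 10% of 10Gi capacity
  eviction-soft-grace-period:
  - "memory.available=30s"     # tolerate the spike for 30s before evicting
  eviction-hard:
  - "memory.available<500Mi"   # 5% of capacity: evict immediately
  system-reserved:
  - "2Gi"                      # 1Gi for daemons + 1Gi (the greater eviction threshold)
```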
Thanks! The PR is now updated.
@Alexander Does this look good now? https://github.com/openshift/openshift-docs/pull/3384
Commits pushed to master at https://github.com/openshift/openshift-docs

https://github.com/openshift/openshift-docs/commit/c7c279eb08db7c95414087abde8f7e9d9e00523e
Bug 1400887, Added clarifying details around kubeletArguments in the Example Scenario section

https://github.com/openshift/openshift-docs/commit/940f08ef59992c746e9ba84a95e87d67031db8c5
Merge pull request #3384 from ahardin-rh/set-kubeletArguments
Bug 1400887, Added clarifying details around kubeletArguments in the Example Scenario section
@derek
Commit pushed to master at https://github.com/openshift/openshift-docs

https://github.com/openshift/openshift-docs/commit/cf0e80637c72ad19e895bac705e9db9e4d99657e
Merge pull request #4266 from mburke5678/oor-reorg
BUG 1400887 Reorganize the Out of Resource Handling Topic
Released to 3.5
@Michael - please provide a link to the released docs before closing.
Released to 3.5 https://docs.openshift.org/latest/admin_guide/out_of_resource_handling.html