Description of problem:

Customer has followed the recommendations from bz#1400887. However, the system is still failing to raise the MemoryPressure flag and initiate eviction.

Latest comment from the customer:

Finally, I had time to look into this again. So, I did as you suggested and configured the following:

kubeletArguments:
  eviction-hard:
  - "memory.available<1Gi"
  eviction-soft:
  - "memory.available<100Mi"
  eviction-soft-grace-period:
  - "memory.available=60s"
  eviction-max-pod-grace-period:
  - "180"
  eviction-pressure-transition-period:
  - "60s"
  kube-reserved:
  - "cpu=0.5,memory=1Gi"
  system-reserved:
  - "cpu=0.5,memory=2Gi"

Although, I personally do not understand why it should be "hard-eviction > soft-eviction". I would read the current configuration as:
- perform soft eviction if the available memory falls below 100Mi
- perform hard eviction if the available memory falls below 1Gi

I would want the system to first try soft eviction, and only if that does not help, as a last measure, perform hard eviction. Now it is the opposite way around. "memory.available" does not sound to me like "how much the allocatable memory is exceeded".

Also, you copy-pasted the following from Bugzilla:

"If we want to turn on eviction so that eviction is triggered when the node observes available memory falls below 10% of capacity for 30s, or immediately when it falls below 5% of capacity, we need the scheduler to see allocatable as 8Gi."

- eviction is triggered when the node observes available memory falls below 10% of capacity -> soft eviction
- immediately when it falls below 5% of capacity -> hard eviction

10% of capacity is a higher number than 5% of capacity. Nevertheless, I configured the above and made the test again.
7 minutes ago, I had the following:

NAME                  READY     STATUS    RESTARTS   AGE   IP           NODE
best-effort-1-2b4nq   1/1       Running   0          15m   10.94.1.33   ip-10-191-1-78.eu-west-1.compute.internal
best-effort-1-2qba6   1/1       Running   0          10m   10.94.1.36   ip-10-191-1-78.eu-west-1.compute.internal
best-effort-1-9z960   1/1       Running   0          19m   10.94.1.32   ip-10-191-1-78.eu-west-1.compute.internal
best-effort-1-dkbaf   1/1       Running   0          22m   10.94.1.31   ip-10-191-1-78.eu-west-1.compute.internal
best-effort-1-hn53i   1/1       Running   0          9m    10.94.1.37   ip-10-191-1-78.eu-west-1.compute.internal
best-effort-1-mjru7   1/1       Running   0          12m   10.94.1.35   ip-10-191-1-78.eu-west-1.compute.internal
best-effort-1-sj98t   1/1       Running   0          3m    10.94.1.40   ip-10-191-1-78.eu-west-1.compute.internal
best-effort-1-t0tcf   1/1       Running   0          4m    10.94.1.39   ip-10-191-1-78.eu-west-1.compute.internal
best-effort-1-v14e0   1/1       Running   0          7m    10.94.1.38   ip-10-191-1-78.eu-west-1.compute.internal
best-effort-1-x9dlc   1/1       Running   0          13m   10.94.1.34   ip-10-191-1-78.eu-west-1.compute.internal
best-effort-1-ygjfg   1/1       Running   0          1m    10.94.1.41   ip-10-191-1-78.eu-west-1.compute.internal
burstable-2-1102g     1/1       Running   0          25m   10.94.1.28   ip-10-191-1-78.eu-west-1.compute.internal
burstable-2-2gc7g     1/1       Running   0          25m   10.94.1.5    ip-10-191-1-78.eu-west-1.compute.internal
burstable-2-3l2i3     1/1       Running   0          25m   10.94.1.10   ip-10-191-1-78.eu-west-1.compute.internal
burstable-2-5xxzl     1/1       Running   0          25m   10.94.1.25   ip-10-191-1-78.eu-west-1.compute.internal
burstable-2-ie53c     1/1       Running   0          25m   10.94.1.3    ip-10-191-1-78.eu-west-1.compute.internal
burstable-2-v1kw6     1/1       Running   0          25m   10.94.1.27   ip-10-191-1-78.eu-west-1.compute.internal
burstable-2-vgmx4     1/1       Running   0          25m   10.94.1.7    ip-10-191-1-78.eu-west-1.compute.internal
burstable-2-yufqn     1/1       Running   0          25m   10.94.1.16   ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-0asmf    1/1       Running   0          28m   10.94.1.19   ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-14s5s    1/1       Running   0          28m   10.94.1.9    ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-3ysuq    1/1       Running   0          28m   10.94.1.26   ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-4hf9q    1/1       Running   0          28m   10.94.1.13   ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-4icxh    1/1       Running   0          28m   10.94.1.23   ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-cpg6u    1/1       Running   0          28m   10.94.1.4    ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-cpif3    1/1       Running   0          28m   10.94.1.29   ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-hqhrh    1/1       Running   0          28m   10.94.1.30   ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-kl5f6    1/1       Running   0          28m   10.94.1.18   ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-kt91m    1/1       Running   0          28m   10.94.1.22   ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-ms7hi    1/1       Running   0          28m   10.94.1.11   ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-p2dre    1/1       Running   0          28m   10.94.1.8    ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-pc2qh    1/1       Running   0          28m   10.94.1.12   ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-q9y6y    1/1       Running   0          28m   10.94.1.6    ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-qgmkh    1/1       Running   0          28m   10.94.1.24   ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-rtrdl    1/1       Running   0          28m   10.94.1.17   ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-tmcsu    1/1       Running   0          28m   10.94.1.14   ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-tq1w8    1/1       Running   0          28m   10.94.1.15   ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-uy0kf    1/1       Running   0          28m   10.94.1.20   ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-yillh    1/1       Running   0          25d   10.94.1.21   ip-10-191-1-78.eu-west-1.compute.internal

-> 20 guaranteed, 8 burstable, 11 best-effort

That was the moment I started writing this comment, which took me 7 minutes.
Then I checked back and saw:

NAME                  READY     STATUS    RESTARTS   AGE   IP        NODE
best-effort-1-5jai6   0/1       Pending   0          7m    <none>    ip-10-191-1-78.eu-west-1.compute.internal
best-effort-1-625zq   0/1       Pending   0          7m    <none>    ip-10-191-1-78.eu-west-1.compute.internal
best-effort-1-9dv5e   0/1       Pending   0          7m    <none>    ip-10-191-1-78.eu-west-1.compute.internal
best-effort-1-a1wjb   0/1       Pending   0          7m    <none>    ip-10-191-1-78.eu-west-1.compute.internal
best-effort-1-cfvtw   0/1       Pending   0          7m    <none>    ip-10-191-1-78.eu-west-1.compute.internal
best-effort-1-dcxf7   0/1       Pending   0          7m    <none>    ip-10-191-1-78.eu-west-1.compute.internal
best-effort-1-fkj5j   0/1       Pending   0          7m    <none>    ip-10-191-1-78.eu-west-1.compute.internal
best-effort-1-o92lq   0/1       Pending   0          7m    <none>    ip-10-191-1-78.eu-west-1.compute.internal
best-effort-1-pngqk   0/1       Pending   0          7m    <none>    ip-10-191-1-78.eu-west-1.compute.internal
best-effort-1-td7ei   0/1       Pending   0          7m    <none>    ip-10-191-1-78.eu-west-1.compute.internal
best-effort-1-y6r0u   0/1       Pending   0          7m    <none>    ip-10-191-1-78.eu-west-1.compute.internal
burstable-2-4w6zs     0/1       Pending   0          7m    <none>    ip-10-191-1-78.eu-west-1.compute.internal
burstable-2-9fmq0     0/1       Pending   0          7m    <none>    ip-10-191-1-78.eu-west-1.compute.internal
burstable-2-cr3xz     0/1       Pending   0          7m    <none>    ip-10-191-1-78.eu-west-1.compute.internal
burstable-2-gcsqw     0/1       Pending   0          7m    <none>    ip-10-191-1-78.eu-west-1.compute.internal
burstable-2-pxnhj     0/1       Pending   0          7m    <none>    ip-10-191-1-78.eu-west-1.compute.internal
burstable-2-twezh     0/1       Pending   0          7m    <none>    ip-10-191-1-78.eu-west-1.compute.internal
burstable-2-x1kla     0/1       Pending   0          7m    <none>    ip-10-191-1-78.eu-west-1.compute.internal
burstable-2-ycwz8     0/1       Pending   0          7m    <none>    ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-1f67k    0/1       Pending   0          7m    <none>    ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-1lu47    0/1       Pending   0          7m    <none>    ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-2cpir    0/1       Pending   0          7m    <none>    ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-3wbi2    0/1       Pending   0          7m    <none>    ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-7pl46    0/1       Pending   0          7m    <none>    ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-agt1q    0/1       Pending   0          7m    <none>    ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-dcncj    0/1       Pending   0          7m    <none>    ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-fnpdn    0/1       Pending   0          7m    <none>    ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-gxrqn    0/1       Pending   0          7m    <none>    ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-hmc1m    0/1       Pending   0          7m    <none>    ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-i5x08    0/1       Pending   0          7m    <none>    ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-i8js4    0/1       Pending   0          7m    <none>    ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-iot42    0/1       Pending   0          7m    <none>    ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-o2952    0/1       Pending   0          7m    <none>    ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-onflu    0/1       Pending   0          7m    <none>    ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-oo0ru    0/1       Pending   0          7m    <none>    ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-qfigq    0/1       Pending   0          7m    <none>    ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-xwoe0    0/1       Pending   0          7m    <none>    ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-ytlov    0/1       Pending   0          7m    <none>    ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-zkrwp    0/1       Pending   0          7m    <none>    ip-10-191-1-78.eu-west-1.compute.internal

So, no, this configuration did not improve the situation; the node died:

ip-10-191-1-78.eu-west-1.compute.internal   NotReady   138d

The describe node output now shows the following:

Type        Status    LastHeartbeatTime                 LastTransitionTime                Reason             Message
----        ------    -----------------                 ------------------                ------             -------
OutOfDisk   Unknown   Mon, 16 Jan 2017 08:38:24 +0100   Mon, 16 Jan 2017 08:39:55 +0100   NodeStatusUnknown  Kubelet stopped posting node status.
Ready       Unknown   Mon, 16 Jan 2017 08:38:30 +0100   Mon, 16 Jan 2017 08:39:55 +0100   NodeStatusUnknown  Kubelet stopped posting node status.
MemoryPressure   False   Mon, 16 Jan 2017 08:38:29 +0100   Wed, 09 Nov 2016 17:20:50 +0100   KubeletHasSufficientMemory   kubelet has sufficient memory available

Also, during scaling up, I never saw MemoryPressure True.

Again, I used

curl --cacert /etc/origin/master/ca.crt --cert /etc/origin/master/admin.crt --key /etc/origin/master/admin.key https://10.191.1.78:10250/stats/summary --insecure | less

on a master to query the used memory. And quite soon (with a low number of best-effort pods), the used memory was actually already bigger than allocatable. No pod eviction ever took place.

Version-Release number of selected component (if applicable): 3.3.0

How reproducible:
1. Configure the kubelet as above.
2. Add some amount of guaranteed pods.
3. Add burstable pods.
4. When burstable pods can no longer fit on the node, start adding best-effort pods.

Actual results:
When there is no memory left, all pods switch to the "Pending" state; then the node dies.

Expected results:
Best-effort pods are killed first; guaranteed pods survive and continue to function.
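The /stats/summary endpoint queried above returns JSON, and the value the kubelet compares against the memory.available thresholds is node.memory.availableBytes. As a minimal sketch, not run against a live node, the snippet below extracts that field from a hypothetical saved copy of the response (the real response carries many more fields):

```shell
# Hypothetical saved copy of a /stats/summary response, trimmed to the
# one field relevant to memory.available eviction thresholds.
cat > summary.json <<'EOF'
{"node": {"memory": {"availableBytes": 965672960}}}
EOF

# Extract the value the kubelet compares against the configured thresholds.
python3 -c 'import json; print(json.load(open("summary.json"))["node"]["memory"]["availableBytes"])'
```

Saving a series of these responses over time (as the attached summary logs do) is what makes the threshold analysis later in this bug possible.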
I am confused by the configuration. I would expect the following configuration:

kubeletArguments:
  eviction-hard:
  - "memory.available<100Mi"
  eviction-soft:
  - "memory.available<1Gi"
  eviction-soft-grace-period:
  - "memory.available=60s"
  eviction-max-pod-grace-period:
  - "180"
  eviction-pressure-transition-period:
  - "60s"
  kube-reserved:
  - "cpu=0.5,memory=1Gi"
  system-reserved:
  - "cpu=0.5,memory=2Gi"

With the above configuration, if the node observes <100Mi of available memory, it initiates eviction immediately. If the node observes <1Gi of available memory for 60s, it initiates eviction.

What is the rate of memory usage growth on the node in question? What are the individual pods on the node doing to consume memory, and at what rate? Be aware that the node collects memory usage every 10s, so if you induce an OOM within that 10s period, the oom_killer will be invoked by the OS.
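To make the ordering concrete, here is a small illustrative sketch (my own, not kubelet code) of how the two thresholds relate once converted to bytes. Eviction fires when memory.available drops *below* a threshold, so as free memory declines the larger soft threshold (1Gi) is crossed first, and the hard threshold (100Mi) only later:

```shell
hard=$((100 * 1024 * 1024))     # eviction-hard: memory.available<100Mi
soft=$((1024 * 1024 * 1024))    # eviction-soft: memory.available<1Gi

# Classify a few sample availableBytes readings the way the kubelet would:
# 2Gi (no pressure), 512Mi (soft crossed), 50Mi (hard crossed).
for available in 2147483648 536870912 52428800; do
  if [ "$available" -lt "$hard" ]; then
    echo "${available}: below hard threshold - evict immediately"
  elif [ "$available" -lt "$soft" ]; then
    echo "${available}: below soft threshold - evict after the 60s grace period"
  else
    echo "${available}: no eviction"
  fi
done
```

This is why the soft threshold should be the larger number: it gives the grace-period path a chance to act before the immediate hard path is reached.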
The customer has tried this as well. In fact, this was his first version of the configuration. It does not change anything; the behaviour is exactly the same.

Within bz#1400887 you wrote: "I would recommend that the soft eviction threshold is always less than the hard eviction threshold, but the time period is operator specific." So, this was an attempt to make it work.

In a lab, when I used the example from the documentation (only eviction-hard and system-reserved set), all worked as expected: best-effort pods were stopped first. But the configuration with all options set led to a situation where all pods were stopped and the node turned to the NotReady state. I also ran "oc describe node" to check the MemoryPressure flag and never saw it raised.

As for the rate, new pods were added no more often than one per minute (I did read about the 10s intervals, so I went much more slowly). The pod itself is an EAP application created with the "oc new-app jboss-eap70-openshift --name=<...>" command. All pods are the same except for their resource requests/limits. I think this is easily reproducible.
I have reviewed the attached information. From the summary log data provided, I ran the following script:

$ cat parser.sh
#!/usr/bin/env bash

# memory.available<100Mi (100 * 1024 * 1024 bytes)
export hard_threshold=104857600
# memory.available<1Gi (1024 * 1024 * 1024 bytes)
export soft_threshold=1073741824
# track whether the soft threshold was met in the previous sample
export soft_threshold_met=0

for i in $( find . -name "*.log" -type f | sort ); do
  availableBytes=$(cat $i | python -c 'import sys, json; print json.load(sys.stdin)["node"]["memory"]["availableBytes"]')
  if [ "$availableBytes" -lt "$hard_threshold" ]; then
    echo "item $i - hard threshold met, availableBytes $availableBytes"
  fi
  if [ "$availableBytes" -lt "$soft_threshold" ]; then
    echo "item $i - soft threshold met, availableBytes $availableBytes"
    soft_threshold_met=1
  else
    if [ "$soft_threshold_met" -eq "1" ]; then
      echo "item $i - soft threshold no longer met, availableBytes $availableBytes"
      soft_threshold_met=0
    fi
  fi
done

$ ./parser.sh
item ./2017.02.14-14.52.48-summary.log - soft threshold met, availableBytes 1065074688
item ./2017.02.14-14.52.59-summary.log - soft threshold met, availableBytes 965672960
item ./2017.02.14-14.53.09-summary.log - soft threshold met, availableBytes 965672960
item ./2017.02.14-14.53.19-summary.log - soft threshold met, availableBytes 971403264
item ./2017.02.14-14.53.29-summary.log - soft threshold met, availableBytes 975388672
item ./2017.02.14-14.53.40-summary.log - soft threshold met, availableBytes 975388672
item ./2017.02.14-14.53.50-summary.log - soft threshold met, availableBytes 972570624
item ./2017.02.14-14.54.00-summary.log - soft threshold no longer met, availableBytes 1745387520

It appears that in the test scenario we crossed the soft threshold (i.e. 1073741824 bytes) for ~1m. I am continuing to investigate to try and determine why the MemoryPressure condition is not raised.
Are there node logs associated with the shift32-n2.gsslab.pnq.redhat.com machine?
Adding more detailed notes as I evaluate the provided information.

Per my previous comment, the soft eviction threshold was crossed at the following time:

item ./2017.02.14-14.52.59-summary.log - soft threshold met, availableBytes 965672960

The node should have reported MemoryPressure=True at this time. If I look at the attached log-file output, oc describe node was performed at Tue Feb 14 14:55:30 IST 2017 and shows the following:

2m 2m 1 {kubelet shift32-n2.gsslab.pnq.redhat.com} Normal NodeHasInsufficientMemory Node shift32-n2.gsslab.pnq.redhat.com status is now: NodeHasInsufficientMemory

The above tells me that MemoryPressure was observed by the kubelet here:

https://github.com/openshift/ose/blob/enterprise-3.3/vendor/k8s.io/kubernetes/pkg/kubelet/kubelet.go#L3365

And it made this observation at ~14:53, which is consistent with the values shown in the summary log.

After ~1m from observing the threshold being crossed (consistent with the configured grace period of 60s), I see the eviction threshold was met and reclaim was initiated:

1m 1m 1 {kubelet shift32-n2.gsslab.pnq.redhat.com} Warning EvictionThresholdMet Attempting to reclaim memory

Per the summary log information, this appears consistent with occurring between:

item ./2017.02.14-14.53.50-summary.log - soft threshold met, availableBytes 972570624
item ./2017.02.14-14.54.00-summary.log - soft threshold no longer met, availableBytes 1745387520

The same oc describe node output (performed at Tue Feb 14 14:55:30 IST 2017) also shows the following:

17h 1s 2 {kubelet shift32-n2.gsslab.pnq.redhat.com} Normal NodeHasSufficientMemory Node shift32-n2.gsslab.pnq.redhat.com status is now: NodeHasSufficientMemory

This event was thrown ~1m after the eviction action occurred, which is consistent with eviction-pressure-transition-period: "60s" as configured.
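As a quick sanity check of this timeline, the timestamps embedded in the summary log filenames above can be diffed directly; the gap between the soft threshold first being crossed and reclaim starting matches the configured 60s grace period (GNU date assumed for the -d flag):

```shell
# Timestamps taken from the summary log filenames: soft threshold first
# met at 14:52:59, reclaim observed by the 14:54:00 sample.
crossed=$(date -d "2017-02-14 14:52:59" +%s)
reclaim=$(date -d "2017-02-14 14:54:00" +%s)

# ~60s, matching eviction-soft-grace-period (the extra second comes from
# the 10s sampling interval between summary snapshots).
echo "elapsed: $((reclaim - crossed))s"
```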
The log-file attached shows the following prior to the oc describe node output:

# oc get pods -w
NAME                  READY     STATUS              RESTARTS   AGE
best-effort-1-boktc   0/1       Evicted             0          5m
best-effort-1-oxob7   0/1       Pending             0          50s
best-effort-1-s0muh   0/1       Pending             0          45s
burstable-2-19gbf     0/1       Evicted             0          14m
burstable-2-fshvh     0/1       ContainerCreating   0          45s
guaranteed-6-0pth4    1/1       Running             0          16m
guaranteed-6-jz678    0/1       Evicted             0          16m
guaranteed-6-kysun    0/1       ContainerCreating   0          41s
guaranteed-6-poqmd    0/1       ContainerCreating   0          24s
guaranteed-6-rmszf    0/1       Evicted             0          16m

From this information, I can see that 4 pods were evicted and the two new best-effort pods are Pending as expected (presumably because they have not yet been scheduled to the node).

The eviction manager is more aggressive in ose 3.3 than it is in ose 3.4, where we did work to try and ensure that it does not act on stale stats in subsequent eviction decisions, which could happen if cAdvisor housekeeping became out of sync with the eviction manager; see: https://github.com/kubernetes/kubernetes/pull/32724

I can try locally to increase the eviction-pressure-transition-period beyond 60s, but from the information provided, it looks like the system is working as designed for 3.3.
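For reference, the transition period in question exists to damp flapping of the node condition. The toy simulation below (my own sketch, not kubelet code) shows the intended behaviour: the node keeps reporting MemoryPressure=True until the threshold has been un-met for the whole eviction-pressure-transition-period, so a shorter period lets the condition clear sooner after reclaim:

```shell
transition_period=60   # eviction-pressure-transition-period in seconds

last_met=0             # timestamp when the threshold was last observed met
condition=False

# Each sample is "timestamp:state", where state says whether the soft
# threshold was met at that 10s-ish sampling tick.
for sample in "0:met" "10:unmet" "30:unmet" "70:unmet" "95:unmet"; do
  t=${sample%%:*}
  state=${sample##*:}
  if [ "$state" = "met" ]; then
    last_met=$t
    condition=True
  elif [ "$condition" = "True" ] && [ $((t - last_met)) -ge "$transition_period" ]; then
    # Threshold has been un-met for the full transition period: clear it.
    condition=False
  fi
  echo "t=${t}s MemoryPressure=$condition"
done
```

With a 5m0s period, the condition in this simulation would stay True through all of the samples shown, which is the flap-damping behaviour the default is chosen for.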
I recommend the customer keep the eviction-pressure-transition-period at its default of 5m0s rather than reducing it to 60s. Is there a reason the customer wanted to reduce this amount of time?