Bug 1413666 - [3.3] Memory pressure flag is not raised:
Summary: [3.3] Memory pressure flag is not raised:
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Derek Carr
QA Contact: DeShuai Ma
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-01-16 16:08 UTC by Alexander Koksharov
Modified: 2020-08-13 08:49 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-03-01 15:44:00 UTC
Target Upstream Version:
Embargoed:



Description Alexander Koksharov 2017-01-16 16:08:33 UTC
Description of problem:

Customer has followed the recommendations from bz#1400887. However, the system is still failing to raise the MemoryPressure flag and initiate eviction.

Latest comment from the customer:

Finally, I had time to look into this again. So, I did as you suggested and configured the following:

kubeletArguments:
  eviction-hard:
    - "memory.available<1Gi"
  eviction-soft:
    - "memory.available<100Mi"
  eviction-soft-grace-period:
    - "memory.available=60s"
  eviction-max-pod-grace-period:
    - "180"
  eviction-pressure-transition-period:
    - "60s"
  kube-reserved:
    - "cpu=0.5,memory=1Gi"
  system-reserved:
    - "cpu=0.5,memory=2Gi"

However, I personally do not understand why it should be "hard-eviction > soft-eviction". I would read the current configuration as:
- perform soft eviction if the available memory falls below 100Mi
- perform hard eviction if the available memory falls below 1Gi

I would want the system to first try soft eviction and, only if that does not help, perform hard eviction as a last resort. Right now it is the other way around.

"memory.available" does not sound to me as "how much the allocatable memory is exceeded".

Also, you copy-pasted the following from Bugzilla: "If we want to turn on eviction so that eviction is triggered when the node observes available memory falls below 10% of capacity for 30s, or immediately when it falls below 5% of capacity, we need the scheduler to see allocatable as 8Gi."
- eviction is triggered when the node observes available memory falls below 10% of capacity -> soft eviction
- immediately when it falls below 5% of capacity -> hard eviction
10% of capacity is a higher number than 5% of capacity.

Nevertheless, I configured the above and ran the test again. Seven minutes ago, I had the following:

NAME                  READY     STATUS    RESTARTS   AGE       IP           NODE
best-effort-1-2b4nq   1/1       Running   0          15m       10.94.1.33   ip-10-191-1-78.eu-west-1.compute.internal
best-effort-1-2qba6   1/1       Running   0          10m       10.94.1.36   ip-10-191-1-78.eu-west-1.compute.internal
best-effort-1-9z960   1/1       Running   0          19m       10.94.1.32   ip-10-191-1-78.eu-west-1.compute.internal
best-effort-1-dkbaf   1/1       Running   0          22m       10.94.1.31   ip-10-191-1-78.eu-west-1.compute.internal
best-effort-1-hn53i   1/1       Running   0          9m        10.94.1.37   ip-10-191-1-78.eu-west-1.compute.internal
best-effort-1-mjru7   1/1       Running   0          12m       10.94.1.35   ip-10-191-1-78.eu-west-1.compute.internal
best-effort-1-sj98t   1/1       Running   0          3m        10.94.1.40   ip-10-191-1-78.eu-west-1.compute.internal
best-effort-1-t0tcf   1/1       Running   0          4m        10.94.1.39   ip-10-191-1-78.eu-west-1.compute.internal
best-effort-1-v14e0   1/1       Running   0          7m        10.94.1.38   ip-10-191-1-78.eu-west-1.compute.internal
best-effort-1-x9dlc   1/1       Running   0          13m       10.94.1.34   ip-10-191-1-78.eu-west-1.compute.internal
best-effort-1-ygjfg   1/1       Running   0          1m        10.94.1.41   ip-10-191-1-78.eu-west-1.compute.internal
burstable-2-1102g     1/1       Running   0          25m       10.94.1.28   ip-10-191-1-78.eu-west-1.compute.internal
burstable-2-2gc7g     1/1       Running   0          25m       10.94.1.5    ip-10-191-1-78.eu-west-1.compute.internal
burstable-2-3l2i3     1/1       Running   0          25m       10.94.1.10   ip-10-191-1-78.eu-west-1.compute.internal
burstable-2-5xxzl     1/1       Running   0          25m       10.94.1.25   ip-10-191-1-78.eu-west-1.compute.internal
burstable-2-ie53c     1/1       Running   0          25m       10.94.1.3    ip-10-191-1-78.eu-west-1.compute.internal
burstable-2-v1kw6     1/1       Running   0          25m       10.94.1.27   ip-10-191-1-78.eu-west-1.compute.internal
burstable-2-vgmx4     1/1       Running   0          25m       10.94.1.7    ip-10-191-1-78.eu-west-1.compute.internal
burstable-2-yufqn     1/1       Running   0          25m       10.94.1.16   ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-0asmf    1/1       Running   0          28m       10.94.1.19   ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-14s5s    1/1       Running   0          28m       10.94.1.9    ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-3ysuq    1/1       Running   0          28m       10.94.1.26   ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-4hf9q    1/1       Running   0          28m       10.94.1.13   ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-4icxh    1/1       Running   0          28m       10.94.1.23   ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-cpg6u    1/1       Running   0          28m       10.94.1.4    ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-cpif3    1/1       Running   0          28m       10.94.1.29   ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-hqhrh    1/1       Running   0          28m       10.94.1.30   ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-kl5f6    1/1       Running   0          28m       10.94.1.18   ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-kt91m    1/1       Running   0          28m       10.94.1.22   ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-ms7hi    1/1       Running   0          28m       10.94.1.11   ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-p2dre    1/1       Running   0          28m       10.94.1.8    ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-pc2qh    1/1       Running   0          28m       10.94.1.12   ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-q9y6y    1/1       Running   0          28m       10.94.1.6    ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-qgmkh    1/1       Running   0          28m       10.94.1.24   ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-rtrdl    1/1       Running   0          28m       10.94.1.17   ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-tmcsu    1/1       Running   0          28m       10.94.1.14   ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-tq1w8    1/1       Running   0          28m       10.94.1.15   ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-uy0kf    1/1       Running   0          28m       10.94.1.20   ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-yillh    1/1       Running   0          25d       10.94.1.21   ip-10-191-1-78.eu-west-1.compute.internal

-> 20 guaranteed, 8 burstable, 11 best-effort

That was the moment I started writing this comment, which took me 7 minutes. Then I checked back and saw:

NAME                  READY     STATUS    RESTARTS   AGE       IP        NODE
best-effort-1-5jai6   0/1       Pending   0          7m        <none>    ip-10-191-1-78.eu-west-1.compute.internal
best-effort-1-625zq   0/1       Pending   0          7m        <none>    ip-10-191-1-78.eu-west-1.compute.internal
best-effort-1-9dv5e   0/1       Pending   0          7m        <none>    ip-10-191-1-78.eu-west-1.compute.internal
best-effort-1-a1wjb   0/1       Pending   0          7m        <none>    ip-10-191-1-78.eu-west-1.compute.internal
best-effort-1-cfvtw   0/1       Pending   0          7m        <none>    ip-10-191-1-78.eu-west-1.compute.internal
best-effort-1-dcxf7   0/1       Pending   0          7m        <none>    ip-10-191-1-78.eu-west-1.compute.internal
best-effort-1-fkj5j   0/1       Pending   0          7m        <none>    ip-10-191-1-78.eu-west-1.compute.internal
best-effort-1-o92lq   0/1       Pending   0          7m        <none>    ip-10-191-1-78.eu-west-1.compute.internal
best-effort-1-pngqk   0/1       Pending   0          7m        <none>    ip-10-191-1-78.eu-west-1.compute.internal
best-effort-1-td7ei   0/1       Pending   0          7m        <none>    ip-10-191-1-78.eu-west-1.compute.internal
best-effort-1-y6r0u   0/1       Pending   0          7m        <none>    ip-10-191-1-78.eu-west-1.compute.internal
burstable-2-4w6zs     0/1       Pending   0          7m        <none>    ip-10-191-1-78.eu-west-1.compute.internal
burstable-2-9fmq0     0/1       Pending   0          7m        <none>    ip-10-191-1-78.eu-west-1.compute.internal
burstable-2-cr3xz     0/1       Pending   0          7m        <none>    ip-10-191-1-78.eu-west-1.compute.internal
burstable-2-gcsqw     0/1       Pending   0          7m        <none>    ip-10-191-1-78.eu-west-1.compute.internal
burstable-2-pxnhj     0/1       Pending   0          7m        <none>    ip-10-191-1-78.eu-west-1.compute.internal
burstable-2-twezh     0/1       Pending   0          7m        <none>    ip-10-191-1-78.eu-west-1.compute.internal
burstable-2-x1kla     0/1       Pending   0          7m        <none>    ip-10-191-1-78.eu-west-1.compute.internal
burstable-2-ycwz8     0/1       Pending   0          7m        <none>    ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-1f67k    0/1       Pending   0          7m        <none>    ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-1lu47    0/1       Pending   0          7m        <none>    ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-2cpir    0/1       Pending   0          7m        <none>    ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-3wbi2    0/1       Pending   0          7m        <none>    ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-7pl46    0/1       Pending   0          7m        <none>    ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-agt1q    0/1       Pending   0          7m        <none>    ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-dcncj    0/1       Pending   0          7m        <none>    ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-fnpdn    0/1       Pending   0          7m        <none>    ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-gxrqn    0/1       Pending   0          7m        <none>    ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-hmc1m    0/1       Pending   0          7m        <none>    ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-i5x08    0/1       Pending   0          7m        <none>    ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-i8js4    0/1       Pending   0          7m        <none>    ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-iot42    0/1       Pending   0          7m        <none>    ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-o2952    0/1       Pending   0          7m        <none>    ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-onflu    0/1       Pending   0          7m        <none>    ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-oo0ru    0/1       Pending   0          7m        <none>    ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-qfigq    0/1       Pending   0          7m        <none>    ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-xwoe0    0/1       Pending   0          7m        <none>    ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-ytlov    0/1       Pending   0          7m        <none>    ip-10-191-1-78.eu-west-1.compute.internal
guaranteed-3-zkrwp    0/1       Pending   0          7m        <none>    ip-10-191-1-78.eu-west-1.compute.internal


So, no, this configuration did not improve the situation; the node died.
ip-10-191-1-78.eu-west-1.compute.internal     NotReady   138d

oc describe node now shows the following:
  Type                  Status          LastHeartbeatTime                       LastTransitionTime                      Reason                          Message
  ----                  ------          -----------------                       ------------------                      ------                          -------
  OutOfDisk             Unknown         Mon, 16 Jan 2017 08:38:24 +0100         Mon, 16 Jan 2017 08:39:55 +0100         NodeStatusUnknown               Kubelet stopped posting node status.
  Ready                 Unknown         Mon, 16 Jan 2017 08:38:30 +0100         Mon, 16 Jan 2017 08:39:55 +0100         NodeStatusUnknown               Kubelet stopped posting node status.
  MemoryPressure        False           Mon, 16 Jan 2017 08:38:29 +0100         Wed, 09 Nov 2016 17:20:50 +0100         KubeletHasSufficientMemory      kubelet has sufficient memory available

Also, during scaling up, I never saw MemoryPressure become True. Again, I used
curl --cacert /etc/origin/master/ca.crt --cert /etc/origin/master/admin.crt --key /etc/origin/master/admin.key https://10.191.1.78:10250/stats/summary --insecure | less
on a master to query memory usage. Quite soon (with only a small number of best-effort pods), used memory was already larger than allocatable. No pod eviction ever took place.
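For reference, the node-level figure that eviction compares against these thresholds can be pulled straight out of that same summary payload; a minimal sketch using the same endpoint and credentials as the curl above (Python 2 syntax, as used elsewhere in this bug):

# Extract the node-level availableBytes value from the kubelet summary API.
curl -s --cacert /etc/origin/master/ca.crt \
     --cert /etc/origin/master/admin.crt \
     --key /etc/origin/master/admin.key \
     --insecure https://10.191.1.78:10250/stats/summary | \
  python -c 'import sys, json; print json.load(sys.stdin)["node"]["memory"]["availableBytes"]'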
 


Version-Release number of selected component (if applicable):
3.3.0

How reproducible:
- configure the kubelet as above
- add some number of guaranteed pods
- add burstable pods
- once burstable pods no longer fit on the node, start adding best-effort pods

Observed behaviour: when the node runs out of memory, all pods switch to the "Pending" state and then the node dies.
Expected behaviour: best-effort pods are evicted first; guaranteed pods survive and continue to function. Example manifests for the three QoS classes are sketched below.
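A minimal sketch of the three pod classes used in the reproduction (the image and pod names here are hypothetical; the customer's pods were EAP applications, see comment 2). requests == limits gives Guaranteed, requests < limits gives Burstable, and no requests/limits gives BestEffort:

oc create -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-example
spec:
  containers:
  - name: app
    image: registry.example.com/memhog:latest   # hypothetical image
    resources:
      requests: {cpu: 100m, memory: 512Mi}
      limits: {cpu: 100m, memory: 512Mi}
---
apiVersion: v1
kind: Pod
metadata:
  name: burstable-example
spec:
  containers:
  - name: app
    image: registry.example.com/memhog:latest   # hypothetical image
    resources:
      requests: {memory: 256Mi}
      limits: {memory: 1Gi}
---
apiVersion: v1
kind: Pod
metadata:
  name: besteffort-example
spec:
  containers:
  - name: app
    image: registry.example.com/memhog:latest   # hypothetical image
EOF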



Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Derek Carr 2017-01-16 16:29:52 UTC
I am confused by the configuration.

I would expect the following configuration:

kubeletArguments:
  eviction-hard:
    - "memory.available<100Mi"
  eviction-soft:
    - "memory.available<1Gi"
  eviction-soft-grace-period:
    - "memory.available=60s"
  eviction-max-pod-grace-period:
    - "180"
  eviction-pressure-transition-period:
    - "60s"
  kube-reserved:
    - "cpu=0.5,memory=1Gi"
  system-reserved:
    - "cpu=0.5,memory=2Gi"

In the above configuration, the node initiates eviction immediately when it observes <100Mi of available memory, and initiates eviction after observing <1Gi of available memory for 60s.
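A quick arithmetic sketch of that ordering (memory.available is the amount of memory still free, so it counts down as pods consume memory, and the soft threshold, being the larger value, is crossed first):

# Thresholds from the configuration above, expressed in bytes.
soft=$((1024 * 1024 * 1024))   # memory.available < 1Gi   -> soft eviction, after the 60s grace period
hard=$((100 * 1024 * 1024))    # memory.available < 100Mi -> hard eviction, acted on immediately
echo "soft threshold: ${soft} bytes (crossed first as free memory shrinks)"
echo "hard threshold: ${hard} bytes (crossed later, when almost nothing is left)"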

What is the rate of memory usage growth on the node in question? What are the individual pods on the node doing to consume memory, and at what rate? Be aware that the node collects memory usage every 10s, so if you induce an OOM within that 10s window, the kernel oom_killer will be invoked by the OS.
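To capture what the node actually observes around those 10s samples, one option (a hedged sketch; the file naming is only illustrative) is to poll the summary endpoint on a timer and keep timestamped snapshots for offline analysis:

# Poll the kubelet summary endpoint every 10s and keep timestamped snapshots.
while true; do
    curl -s --cacert /etc/origin/master/ca.crt \
         --cert /etc/origin/master/admin.crt \
         --key /etc/origin/master/admin.key \
         --insecure https://10.191.1.78:10250/stats/summary \
         > "$(date +%Y.%m.%d-%H.%M.%S)-summary.log"
    sleep 10
done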

Comment 2 Alexander Koksharov 2017-01-17 13:10:07 UTC
The customer has tried this as well; in fact, it was his first version of the configuration. It does not change anything: the behaviour is exactly the same.

In bz 1400887 you wrote: "I would recommend that the soft eviction threshold is always less than the hard eviction threshold, but the time period is operator specific." So this was an attempt to make that work.

In a lab, when I used the example from the documentation (only eviction-hard and system-reserved set), everything worked as expected: best-effort pods were stopped first.
But the configuration with all options set led to a situation where all pods were stopped and the node turned NotReady. I also ran "oc describe node" to check the MemoryPressure flag and never saw it raised.

As for the rate, new pods were added no more often than one per minute (I did read about the 10s intervals, so I went much more slowly than that). The pods themselves are EAP applications created with the "oc new-app jboss-eap70-openshift --name=<...>" command. All pods are the same except for their resource requests/limits.

I think this is easily reproducible.

Comment 16 Derek Carr 2017-02-20 22:50:09 UTC
I have reviewed the attached information.

From the summary log data provided, I ran the following script:

$ cat parser.sh
#!/usr/bin/env bash
# memory.available<100Mi
export hard_threshold=104857600
# memory.available<1Gi
export soft_threshold=1073741824
# track when thresholds were met
export soft_threshold_met=0
for i in $( find . -name "*.log" -type f | sort ); do
    availableBytes=$(cat $i | python -c 'import sys, json; print json.load(sys.stdin)["node"]["memory"]["availableBytes"]')
    if [ "$availableBytes" -lt "$hard_threshold" ];
    then
        echo "item $i - hard threshold met, availableBytes $availableBytes"
    fi
    if [ "$availableBytes" -lt "$soft_threshold" ];
    then
        echo "item $i - soft threshold met, availableBytes $availableBytes"
        soft_threshold_met=1
    else
        if [ "$soft_threshold_met" -eq "1" ]; then
            echo "item $i - soft threshold no longer met, availableBytes $availableBytes"
            soft_threshold_met=0
        fi
    fi
done

$ ./parser.sh
item ./2017.02.14-14.52.48-summary.log - soft threshold met, availableBytes 1065074688
item ./2017.02.14-14.52.59-summary.log - soft threshold met, availableBytes 965672960
item ./2017.02.14-14.53.09-summary.log - soft threshold met, availableBytes 965672960
item ./2017.02.14-14.53.19-summary.log - soft threshold met, availableBytes 971403264
item ./2017.02.14-14.53.29-summary.log - soft threshold met, availableBytes 975388672
item ./2017.02.14-14.53.40-summary.log - soft threshold met, availableBytes 975388672
item ./2017.02.14-14.53.50-summary.log - soft threshold met, availableBytes 972570624
item ./2017.02.14-14.54.00-summary.log - soft threshold no longer met, availableBytes 1745387520

It appears that in the test scenario we crossed the soft threshold (i.e. 1073741824 bytes) for roughly 1m.

I am continuing to investigate why the MemoryPressure condition is not raised.

Comment 17 Derek Carr 2017-02-20 22:55:42 UTC
Are there node logs associated with the shift32-n2.gsslab.pnq.redhat.com machine?

Comment 18 Derek Carr 2017-02-21 00:09:01 UTC
Adding more detailed notes as I evaluate the provided information.

Per my previous comment, the soft eviction threshold was crossed at the following time:

item ./2017.02.14-14.52.59-summary.log - soft threshold met, availableBytes 965672960

The node should have reported MemoryPressure=True at this time.

If I look at the attached log-file output, oc describe node was performed at Tue Feb 14 14:55:30 IST 2017 and shows the following:

  2m		2m		1	{kubelet shift32-n2.gsslab.pnq.redhat.com}			Normal		NodeHasInsufficientMemory	Node shift32-n2.gsslab.pnq.redhat.com status is now: NodeHasInsufficientMemory

The above tells me that MemoryPressure was observed by the kubelet here:
https://github.com/openshift/ose/blob/enterprise-3.3/vendor/k8s.io/kubernetes/pkg/kubelet/kubelet.go#L3365

And it made this observation at ~14:53 which is consistent with the values shown in the summary log.

About 1m after the threshold was observed as crossed (consistent with the 60s grace period provided), I see the eviction threshold was met and reclaim was initiated:

  1m		1m		1	{kubelet shift32-n2.gsslab.pnq.redhat.com}			Warning		EvictionThresholdMet		Attempting to reclaim memory

Per the summary log information, this appears to have occurred between:

item ./2017.02.14-14.53.50-summary.log - soft threshold met, availableBytes 972570624
item ./2017.02.14-14.54.00-summary.log - soft threshold no longer met, availableBytes 1745387520

Looking again at the attached oc describe node output from Tue Feb 14 14:55:30 IST 2017, I also see the following:

  17h		1s		2	{kubelet shift32-n2.gsslab.pnq.redhat.com}			Normal		NodeHasSufficientMemory		Node shift32-n2.gsslab.pnq.redhat.com status is now: NodeHasSufficientMemory

The event was emitted ~1m after the eviction action occurred, which is consistent with the configured eviction-pressure-transition-period of "60s".

The log-file attached shows the following prior to the oc describe node output:

# oc get pods -w
NAME                  READY     STATUS              RESTARTS   AGE
best-effort-1-boktc   0/1       Evicted             0          5m
best-effort-1-oxob7   0/1       Pending             0          50s
best-effort-1-s0muh   0/1       Pending             0          45s
burstable-2-19gbf     0/1       Evicted             0          14m
burstable-2-fshvh     0/1       ContainerCreating   0          45s
guaranteed-6-0pth4    1/1       Running             0          16m
guaranteed-6-jz678    0/1       Evicted             0          16m
guaranteed-6-kysun    0/1       ContainerCreating   0          41s
guaranteed-6-poqmd    0/1       ContainerCreating   0          24s
guaranteed-6-rmszf    0/1       Evicted             0          16m

From this information, I can see 3 pods were evicted and the two new best-effort pods are pending as expected (presumably because they have not yet been scheduled to the node).

The eviction manager is more aggressive in ose 3.3 than in ose 3.4, where we did work to ensure that it does not act on stale stats in subsequent eviction decisions, which could happen if cAdvisor housekeeping became out of sync with the eviction manager:
see: https://github.com/kubernetes/kubernetes/pull/32724

I can try locally to increase the eviction-pressure-transition-period beyond 60s, but from the information provided, it looks like the system is working as designed for 3.3.

Comment 19 Derek Carr 2017-02-21 00:35:32 UTC
I recommend the customer keep the eviction-pressure-transition-period at its default of 5m0s rather than reducing it to 60s. Is there a reason the customer wanted to reduce this time?

