Description of problem:

Under at least two types of workloads that were fine in OCP 3.5, the SVT team is seeing node readiness flapping on 3.6.74 and 3.6.75.

1. On a 100-node cluster, running cluster-loader to create 200 projects with 5000 running pause pods. After ~150 or so projects and 3K pods are created, many nodes start flapping Ready/NotReady with messages like this in the log:

May 17 13:29:58 192 atomic-openshift-node: I0517 13:29:58.447773 51063 kubelet.go:1757] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m5.146059486s ago; threshold is 3m0s]
May 17 13:30:02 192 atomic-openshift-node: I0517 13:30:02.277003 51063 kubelet_node_status.go:402] Recording NodeNotReady event message for node 192.1.0.40
May 17 13:30:02 192 atomic-openshift-node: I0517 13:30:02.277040 51063 kubelet_node_status.go:705] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2017-05-17 13:30:02.276982027 -0400 EDT LastTransitionTime:2017-05-17 13:30:02.276982027 -0400 EDT Reason:KubeletNotReady Message:PLEG is not healthy: pleg was last seen active 3m8.975341225s ago; threshold is 3m0s}
May 17 13:30:03 192 atomic-openshift-node: I0517 13:30:03.447992 51063 kubelet.go:1757] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m10.146302683s ago; threshold is 3m0s]
May 17 13:30:08 192 atomic-openshift-node: I0517 13:30:08.448232 51063 kubelet.go:1757] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m15.146531326s ago; threshold is 3m0s]
May 17 13:30:13 192 atomic-openshift-node: I0517 13:30:13.448446 51063 kubelet.go:1757] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m20.14676978s ago; threshold is 3m0s]

2. Running concurrent builds on a 1 master/1 infra/2 node cluster.

Version-Release number of selected component (if applicable):
3.6.74

How reproducible:
Always, in these environments

Steps to Reproduce:
1. 100-node 3.6.74 cluster: 3 master/etcd, 3 infra, 100 compute nodes
2. Run cluster-loader (https://github.com/openshift/svt/tree/master/openshift_scalability) with the configuration below
3. oc get nodes -w | grep NotReady

Actual results:
After 150 - 200 projects are created you'll start seeing nodes flap. Checking oc get pods --all-namespaces shows many pods in ContainerCreating. Node logs will show PLEG not healthy messages.

Expected results:
Pod creation works successfully, as in 3.5.

Additional info:
cluster-loader config:

projects:
  - num: 300
    basename: svt-6-
    tuning: default
    templates:
      - num: 1
        file: ./content/build-config-template.json
      - num: 1
        file: ./content/build-template.json
      - num: 1
        file: ./content/image-stream-template.json
      - num: 5
        file: ./content/deployment-config-1rep-pause-template.json
        parameters:
          - ENV_VALUE: "asodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij12"
      - num: 10
        file: ./content/deployment-config-2rep-pause-template.json
        parameters:
          - ENV_VALUE: "asodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij12"
      - num: 2
        file: ./content/ssh-secret-template.json
      - num: 1
        file: ./content/route-template.json
# rcs and services are implemented in deployments.
tuningsets:
  - name: default
    templates:
      stepping:
        stepsize: 5
        pause: 20 s
      rate_limit:
        delay: 250 ms
quotas:
  - name: default
I didn't complete the description of the second scenario: 1 master/etcd, 1 infra, 2 compute nodes, running 30 concurrent builds repeatedly using https://github.com/openshift/svt/blob/master/openshift_performance/ose3_perf/scripts/build_test-README.md

For the first scenario (cluster load-up), docker is 1.12.6-16.
For the second scenario (concurrent builds), docker is 1.12.6-28.
docker is responsive to docker ps and docker info. No messages of the sort "Is the docker daemon running?" The only errors I see from docker in the logs are of this sort:

May 17 11:13:05 192 dockerd-current: time="2017-05-17T11:13:05.121153019-04:00" level=error msg="Handler for POST /v1.24/containers/864f0898ef549452c8aa392ccac7c3622055b142d7809f5de6ae26fabd667436/stop?t=10 returned error: Container 864f0898ef549452c8aa392ccac7c3622055b142d7809f5de6ae26fabd667436 is already stopped"
May 17 11:13:05 192 dockerd-current: time="2017-05-17T11:13:05.121199385-04:00" level=error msg="Handler for POST /v1.24/containers/864f0898ef549452c8aa392ccac7c3622055b142d7809f5de6ae26fabd667436/stop returned error: Container 864f0898ef549452c8aa392ccac7c3622055b142d7809f5de6ae26fabd667436 is already stopped"
Here is the curl for a node in NotReady:

# HELP kubelet_pleg_relist_interval_microseconds Interval in microseconds between relisting in PLEG.
# TYPE kubelet_pleg_relist_interval_microseconds summary
kubelet_pleg_relist_interval_microseconds{quantile="0.5"} 5.3671457e+07
kubelet_pleg_relist_interval_microseconds{quantile="0.9"} 6.2126903e+07
kubelet_pleg_relist_interval_microseconds{quantile="0.99"} 6.2126903e+07
kubelet_pleg_relist_interval_microseconds_sum 3.52310665464e+11
kubelet_pleg_relist_interval_microseconds_count 324919
# HELP kubelet_pleg_relist_latency_microseconds Latency in microseconds for relisting pods in PLEG.
# TYPE kubelet_pleg_relist_latency_microseconds summary
kubelet_pleg_relist_latency_microseconds{quantile="0.5"} 5.2662612e+07
kubelet_pleg_relist_latency_microseconds{quantile="0.9"} 6.1126786e+07
kubelet_pleg_relist_latency_microseconds{quantile="0.99"} 6.1126786e+07
kubelet_pleg_relist_latency_microseconds_sum 2.7190932586e+10
kubelet_pleg_relist_latency_microseconds_count 324919
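For reference, a minimal sketch (naive line-oriented parsing, assuming exactly the exposition format pasted above) that converts the raw microsecond quantiles to seconds so they are easier to compare against the 3m threshold:

```python
# Convert the kubelet PLEG relist latency quantiles above from
# microseconds to seconds. Assumes the exact Prometheus summary
# text format shown in this comment; not production parsing.

def parse_pleg_quantiles(metrics_text):
    """Return {quantile: seconds} for kubelet_pleg_relist_latency_microseconds."""
    out = {}
    for line in metrics_text.splitlines():
        if line.startswith('kubelet_pleg_relist_latency_microseconds{'):
            label, value = line.split('} ')
            quantile = label.split('quantile="')[1].rstrip('"')
            out[quantile] = float(value) / 1e6  # microseconds -> seconds
    return out

sample = '''\
kubelet_pleg_relist_latency_microseconds{quantile="0.5"} 5.2662612e+07
kubelet_pleg_relist_latency_microseconds{quantile="0.9"} 6.1126786e+07
kubelet_pleg_relist_latency_microseconds{quantile="0.99"} 6.1126786e+07'''

print(parse_pleg_quantiles(sample))
# p50 is ~52.7s and p99 is ~61.1s on this NotReady node
```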
Ok, so the worst relists complete in about 61s, with the relist interval being 62s, i.e. relist completion time + relistPeriod. For comparison, on my idle cluster the worst relists complete in about 23ms. So docker is under pressure here. I could definitely see the 0.99999 quantile (not calculated) being >3m, hence the flapping. We need to figure out why the docker API response times are getting into that ~50-60s range.
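The arithmetic above can be sketched quickly. This assumes the upstream kubelet relistPeriod of 1s; the kubelet marks PLEG unhealthy when the time since the last completed relist exceeds the 3m threshold, so the observed gap is roughly relist duration + relistPeriod:

```python
# Back-of-the-envelope PLEG health check math. A single relist slower
# than ~179s is enough to flip the node NotReady; the ~61s p99 seen on
# this node is not enough alone, but a tail outlier would be.

RELIST_PERIOD_S = 1.0   # upstream kubelet default (assumed)
THRESHOLD_S = 180.0     # "threshold is 3m0s" from the node logs

def pleg_gap(relist_duration_s):
    """Gap between successive 'last seen active' timestamps."""
    return relist_duration_s + RELIST_PERIOD_S

def flips_not_ready(relist_duration_s):
    return pleg_gap(relist_duration_s) > THRESHOLD_S

print(flips_not_ready(61))   # p99 observed: False, stays Ready
print(flips_not_ready(185))  # a tail-quantile outlier: True, goes NotReady
```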
Setting correct severity as this will block Online from using 3.6 as well as our own scale testing (scheduled for June 5).
I do see pods in ContainerCreating status for long periods of time after a while in these runs. This is a rate we've sustained in the past on previous versions of Docker/OCP. I am going to try an experiment to slow things down and see if that makes a difference. If it does, I'll try to bisect.
Moving to containers to figure out the long docker API response times.
No chance to get --loglevel=5 logs from a problematic node?
Some other odd stuff:

Jun 2 12:42:50 svt-n-2-67 atomic-openshift-node: I0602 12:42:50.463923 6244 proxy.go:120] Watch Added event for NetNamespace "svt-6-9"
...
Jun 2 12:42:57 svt-n-2-67 atomic-openshift-node: E0602 12:42:57.317028 6244 eventqueue.go:109] event processing failed: skipped adding service rules for serviceEvent: Added, Error: failed to find netid for namespace: svt-6-9 in vnid map

Seems rather odd that, having added a watch for the netns, we wouldn't be able to find it in the vnid map.
*** Bug 1456943 has been marked as a duplicate of this bug. ***
After looking further at it, yes, we have a network VNID issue that is causing pod creation failures, but I believe that is due to long blocking calls in the kubelet that do not allow the SDN eventqueue loops to run. That appears to be due to the docker delays, since:

1) it's not the PLEG eventChannel blocking, because it has a capacity of 1000 and we certainly don't have 1000 pods running locally
2) the only other potential blocker in PLEG relist() is PLEG's updateCache() -> GetPodStatus(), which of course calls into Docker. It takes 7 seconds for each loop of updateCache() (as seen by looking for getSandboxIDByPodUID). That's almost certainly all in the dockershim runtime.
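To illustrate the blocking-call arithmetic above: relist() updates its cache serially, one GetPodStatus() per pod with changes, so at the observed ~7s per call the per-pod latencies add up quickly. This is a sketch of the arithmetic, not kubelet code:

```python
# With ~7s per GetPodStatus() call (observed via getSandboxIDByPodUID),
# a modest number of pods with pending changes is enough for one serial
# updateCache() pass to exceed the 3m PLEG threshold.

GET_POD_STATUS_S = 7.0   # per-pod dockershim latency observed in the logs
THRESHOLD_S = 180.0      # PLEG health threshold (3m0s)

def relist_duration(pods_with_changes):
    # updateCache() loops over pods serially, so total time is additive
    return pods_with_changes * GET_POD_STATUS_S

pods = 0
while relist_duration(pods) <= THRESHOLD_S:
    pods += 1
print(pods)  # smallest pod count that blocks a single relist past 3m
```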
*** Bug 1451110 has been marked as a duplicate of this bug. ***
Sorry about that - was hoping for an overlap, as I had data collection going only during that time. Also, be aware that the pbench performance data uses UTC timestamps.

Please disregard that data; I have new data. sar data for the node is here:
http://pbench.perf.lab.eng.bos.redhat.com/results/EC2::ip-172-31-49-249/nodes_flapping_a/tools-default/ip-172-31-21-61.us-west-2.compute.internal/sar/

Node syslog is attached. The sar data covers the time around the NotReady incident at 15:32:31 in the log. In the pbench/sar data this will show as 19:32:31 (EDT + 4).
Created attachment 1295938 [details] syslog of node going NotReady New syslog of node going NotReady. Log time is 15:32:31, matching performance data is 19:32:31
Monitoring is being added to verify that this issue is related to the AWS configuration and BurstBalance QoS. If BurstBalance proves to be the issue then this issue will be re-assigned to the Ops team.
I hit this with io1 (provisioned IOPS) EBS volumes as docker storage on the node. They were provisioned with 2400 IOPS. The workload was scaling up from 250 pause pods to 500 pause pods on the node. How can I see BurstBalance as described in comment 26?

During the scale-up, I see:

Aug 11 21:14:13 ip-172-31-39-235 atomic-openshift-node: I0811 21:14:13.214977 87049 kubelet_node_status.go:410] Recording NodeNotReady event message for node ip-172-31-29-59.us-west-2.compute.internal
Aug 11 21:14:13 ip-172-31-39-235 atomic-openshift-node: I0811 21:14:13.215085 87049 kubelet_node_status.go:717] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2017-08-11 21:14:13.214952238 +0000 UTC LastTransitionTime:2017-08-11 21:14:13.214952238 +0000 UTC Reason:KubeletNotReady Message:PLEG is not healthy: pleg was last seen active 3m6.796043404s ago; threshold is 3m0s}

From the pbench data the node does not appear to be IO- or CPU-bound - the instance is 4 vCPU and 16GB RAM. I'll link the data in a follow-up private comment.
Created attachment 1313230 [details] syslogs matching with data in comment 31 and comment 32
PR https://github.com/openshift/origin/pull/16692 was merged to make the kubelet not check docker for hostports when none are defined. This may help reduce the iptables load.
Is there any workaround for this? This issue has the potential to cause an avalanche effect on the nodes, e.g.:

Scenario:
1. Use the drain command on node1 to evict pods and update.
2. node2 (under load) changes state to "NotReady", which prompts Kubernetes to also evict pods from node2.
3. This causes more load on the remaining nodes, and a higher probability of further failures (i.e. more nodes changing state to NotReady).
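The cascade described in the scenario above can be illustrated with a rough model (the pod and node counts here are purely illustrative, not from this cluster): each node that goes NotReady and has its pods evicted pushes the per-node load on the survivors higher, making the next failure more likely.

```python
# Rough model of the avalanche effect: as Ready nodes drop out, the
# same pod population is packed onto fewer nodes, so per-node load
# (and hence docker/PLEG pressure) climbs on each survivor.

def pods_per_node(total_pods, ready_nodes):
    """Average pod count per remaining Ready node (ceiling division)."""
    return -(-total_pods // ready_nodes)

total_pods = 500
for nodes in (4, 3, 2):  # nodes flapping NotReady one by one
    print(nodes, pods_per_node(total_pods, nodes))
# per-node load climbs 125 -> 167 -> 250 as nodes drop out
```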
This is backported to 3.6 with https://github.com/openshift/ose/pull/900. That will reduce the contention on the iptables lock.
https://bugzilla.redhat.com/show_bug.cgi?id=1503702 is the bug tracking the change to iptables in the kernel to improve performance.
*** Bug 1486914 has been marked as a duplicate of this bug. ***
*** Bug 1493182 has been marked as a duplicate of this bug. ***
*** Bug 1493603 has been marked as a duplicate of this bug. ***
*** Bug 1474875 has been marked as a duplicate of this bug. ***
*** Bug 1503252 has been marked as a duplicate of this bug. ***
After looking at a representative sample of the online cluster, it looks like this has been resolved by the work that:

- added a longer iptables timeout (5s)
- changed the kubelet not to call iptables on pod add/delete (unless there is a hostport)
- applied the kernel fix to make iptables faster

None of the five online nodes I checked had "PLEG is not healthy" errors in the logs.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days