Description of problem:

Under at least two types of workloads that were fine in OCP 3.5, the SVT team is seeing node readiness flapping on 3.6.74 and 3.6.75.

1. On a 100-node cluster, running cluster-loader to create 200 projects with 5000 running pause pods. After ~150 or so projects and 3K pods are created, many nodes start flapping Ready/NotReady with messages like this in the log:

May 17 13:29:58 192 atomic-openshift-node: I0517 13:29:58.447773 51063 kubelet.go:1757] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m5.146059486s ago; threshold is 3m0s]
May 17 13:30:02 192 atomic-openshift-node: I0517 13:30:02.277003 51063 kubelet_node_status.go:402] Recording NodeNotReady event message for node 192.1.0.40
May 17 13:30:02 192 atomic-openshift-node: I0517 13:30:02.277040 51063 kubelet_node_status.go:705] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2017-05-17 13:30:02.276982027 -0400 EDT LastTransitionTime:2017-05-17 13:30:02.276982027 -0400 EDT Reason:KubeletNotReady Message:PLEG is not healthy: pleg was last seen active 3m8.975341225s ago; threshold is 3m0s}
May 17 13:30:03 192 atomic-openshift-node: I0517 13:30:03.447992 51063 kubelet.go:1757] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m10.146302683s ago; threshold is 3m0s]
May 17 13:30:08 192 atomic-openshift-node: I0517 13:30:08.448232 51063 kubelet.go:1757] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m15.146531326s ago; threshold is 3m0s]
May 17 13:30:13 192 atomic-openshift-node: I0517 13:30:13.448446 51063 kubelet.go:1757] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m20.14676978s ago; threshold is 3m0s]

2. Running concurrent builds on a 1 master/1 infra/2 node cluster.

Version-Release number of selected component (if applicable):
3.6.74

How reproducible:
Always, in these environments

Steps to Reproduce:
1. 100-node 3.6.74 cluster: 3 master/etcd, 3 infra, 100 compute nodes
2. Run cluster-loader (https://github.com/openshift/svt/tree/master/openshift_scalability) with the configuration below
3. oc get nodes -w | grep NotReady

Actual results:
After 150 - 200 projects are created you'll start seeing nodes flap. Checking oc get pods --all-namespaces shows many pods in ContainerCreating. Node logs will show PLEG not healthy messages.

Expected results:
Pod creation works successfully, as in 3.5.

Additional info:
cluster-loader config:

projects:
  - num: 300
    basename: svt-6-
    tuning: default
    templates:
      - num: 1
        file: ./content/build-config-template.json
      - num: 1
        file: ./content/build-template.json
      - num: 1
        file: ./content/image-stream-template.json
      - num: 5
        file: ./content/deployment-config-1rep-pause-template.json
        parameters:
          - ENV_VALUE: "asodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij12"
      - num: 10
        file: ./content/deployment-config-2rep-pause-template.json
        parameters:
          - ENV_VALUE: "asodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij12"
      - num: 2
        file: ./content/ssh-secret-template.json
      - num: 1
        file: ./content/route-template.json
# rcs and services are implemented in deployments.
tuningsets:
  - name: default
    templates:
      stepping:
        stepsize: 5
        pause: 20 s
      rate_limit:
        delay: 250 ms
quotas:
  - name: default
I didn't complete the description of the second scenario: 1 master/etcd, 1 infra, 2 compute nodes, running 30 concurrent builds repeatedly using https://github.com/openshift/svt/blob/master/openshift_performance/ose3_perf/scripts/build_test-README.md

For the first scenario (cluster load-up), docker is 1.12.6-16.
For the second scenario (concurrent builds), docker is 1.12.6-28.
docker is responsive to docker ps and docker info. No messages of the sort "Is the docker daemon running?" The only errors I see from docker in the logs are of this sort:

May 17 11:13:05 192 dockerd-current: time="2017-05-17T11:13:05.121153019-04:00" level=error msg="Handler for POST /v1.24/containers/864f0898ef549452c8aa392ccac7c3622055b142d7809f5de6ae26fabd667436/stop?t=10 returned error: Container 864f0898ef549452c8aa392ccac7c3622055b142d7809f5de6ae26fabd667436 is already stopped"
May 17 11:13:05 192 dockerd-current: time="2017-05-17T11:13:05.121199385-04:00" level=error msg="Handler for POST /v1.24/containers/864f0898ef549452c8aa392ccac7c3622055b142d7809f5de6ae26fabd667436/stop returned error: Container 864f0898ef549452c8aa392ccac7c3622055b142d7809f5de6ae26fabd667436 is already stopped"
Here is the curl for a node in NotReady:

# HELP kubelet_pleg_relist_interval_microseconds Interval in microseconds between relisting in PLEG.
# TYPE kubelet_pleg_relist_interval_microseconds summary
kubelet_pleg_relist_interval_microseconds{quantile="0.5"} 5.3671457e+07
kubelet_pleg_relist_interval_microseconds{quantile="0.9"} 6.2126903e+07
kubelet_pleg_relist_interval_microseconds{quantile="0.99"} 6.2126903e+07
kubelet_pleg_relist_interval_microseconds_sum 3.52310665464e+11
kubelet_pleg_relist_interval_microseconds_count 324919
# HELP kubelet_pleg_relist_latency_microseconds Latency in microseconds for relisting pods in PLEG.
# TYPE kubelet_pleg_relist_latency_microseconds summary
kubelet_pleg_relist_latency_microseconds{quantile="0.5"} 5.2662612e+07
kubelet_pleg_relist_latency_microseconds{quantile="0.9"} 6.1126786e+07
kubelet_pleg_relist_latency_microseconds{quantile="0.99"} 6.1126786e+07
kubelet_pleg_relist_latency_microseconds_sum 2.7190932586e+10
kubelet_pleg_relist_latency_microseconds_count 324919
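For reference, a minimal sketch (naive line-oriented parsing, assuming exactly the exposition format pasted above) that converts the raw microsecond quantiles to seconds so they are easier to compare against the 3m threshold:

```python
# Convert the kubelet PLEG relist latency quantiles above from
# microseconds to seconds. Assumes the exact Prometheus summary
# text format shown in this comment; not production parsing.

def parse_pleg_quantiles(metrics_text):
    """Return {quantile: seconds} for kubelet_pleg_relist_latency_microseconds."""
    out = {}
    for line in metrics_text.splitlines():
        if line.startswith('kubelet_pleg_relist_latency_microseconds{'):
            label, value = line.split('} ')
            quantile = label.split('quantile="')[1].rstrip('"')
            out[quantile] = float(value) / 1e6  # microseconds -> seconds
    return out

sample = '''\
kubelet_pleg_relist_latency_microseconds{quantile="0.5"} 5.2662612e+07
kubelet_pleg_relist_latency_microseconds{quantile="0.9"} 6.1126786e+07
kubelet_pleg_relist_latency_microseconds{quantile="0.99"} 6.1126786e+07'''

print(parse_pleg_quantiles(sample))
# p50 is ~52.7s and p99 is ~61.1s on this NotReady node
```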
Ok, so the worst relists complete in about 61s, with the relist interval being 62s, i.e. relist completion time + relistPeriod. For comparison, on my idle cluster the worst relists complete in about 23ms. So docker is under pressure here. I could definitely see the 0.99999 quantile (not calculated) being >3m, hence the flapping. We need to figure out why the docker API response times are getting into that ~50-60s range.
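The arithmetic above can be sketched quickly. This assumes the upstream kubelet relistPeriod of 1s; the kubelet marks PLEG unhealthy when the time since the last completed relist exceeds the 3m threshold, so the observed gap is roughly relist duration + relistPeriod:

```python
# Back-of-the-envelope PLEG health check math. A single relist slower
# than ~179s is enough to flip the node NotReady; the ~61s p99 seen on
# this node is not enough alone, but a tail outlier would be.

RELIST_PERIOD_S = 1.0   # upstream kubelet default (assumed)
THRESHOLD_S = 180.0     # "threshold is 3m0s" from the node logs

def pleg_gap(relist_duration_s):
    """Gap between successive 'last seen active' timestamps."""
    return relist_duration_s + RELIST_PERIOD_S

def flips_not_ready(relist_duration_s):
    return pleg_gap(relist_duration_s) > THRESHOLD_S

print(flips_not_ready(61))   # p99 observed: False, stays Ready
print(flips_not_ready(185))  # a tail-quantile outlier: True, goes NotReady
```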
Setting correct severity as this will block Online from using 3.6 as well as our own scale testing (scheduled for June 5).
I do see pods in ContainerCreating status for long periods of time after a while in these runs. This is a rate we've sustained in the past on previous versions of Docker/OCP. I am going to try an experiment to slow things down and see if that makes a difference. If it does, I'll try to bisect.
Moving to containers to figure out the long docker API response times.
No chance to get --loglevel=5 logs from a problematic node?
Some other odd stuff:

Jun 2 12:42:50 svt-n-2-67 atomic-openshift-node: I0602 12:42:50.463923 6244 proxy.go:120] Watch Added event for NetNamespace "svt-6-9"
...
Jun 2 12:42:57 svt-n-2-67 atomic-openshift-node: E0602 12:42:57.317028 6244 eventqueue.go:109] event processing failed: skipped adding service rules for serviceEvent: Added, Error: failed to find netid for namespace: svt-6-9 in vnid map

Seems rather odd that, having added a watch for the netns, we wouldn't be able to find it in the vnid map.
*** Bug 1456943 has been marked as a duplicate of this bug. ***
After looking further at it, yes, we have a network VNID issue that is causing pod creation failures, but I believe that is due to long blocking calls in the kubelet that do not allow the SDN eventqueue loops to run. That appears to be due to the docker delays, since:

1) it's not the PLEG eventChannel blocking, because it has a capacity of 1000 and we certainly don't have 1000 pods running locally
2) the only other potential blocker in PLEG relist() is PLEG's updateCache() -> GetPodStatus(), which of course calls into Docker. It takes 7 seconds for each loop of updateCache() (as seen by looking for getSandboxIDByPodUID). That's almost certainly all in the dockershim runtime.
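To illustrate the blocking-call arithmetic above: relist() updates its cache serially, one GetPodStatus() per pod with changes, so at the observed ~7s per call the per-pod latencies add up quickly. This is a sketch of the arithmetic, not kubelet code:

```python
# With ~7s per GetPodStatus() call (observed via getSandboxIDByPodUID),
# a modest number of pods with pending changes is enough for one serial
# updateCache() pass to exceed the 3m PLEG threshold.

GET_POD_STATUS_S = 7.0   # per-pod dockershim latency observed in the logs
THRESHOLD_S = 180.0      # PLEG health threshold (3m0s)

def relist_duration(pods_with_changes):
    # updateCache() loops over pods serially, so total time is additive
    return pods_with_changes * GET_POD_STATUS_S

pods = 0
while relist_duration(pods) <= THRESHOLD_S:
    pods += 1
print(pods)  # smallest pod count that blocks a single relist past 3m
```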
*** Bug 1451110 has been marked as a duplicate of this bug. ***
Sorry about that - was hoping for an overlap, as I had data collection going only during that time. Also, be aware that the pbench performance data uses UTC timestamps.

Please disregard that data; I have new data. sar data for the node is here:
http://pbench.perf.lab.eng.bos.redhat.com/results/EC2::ip-172-31-49-249/nodes_flapping_a/tools-default/ip-172-31-21-61.us-west-2.compute.internal/sar/

Node syslog is attached. The sar data covers the time around the NotReady incident at 15:32:31 in the log. In the pbench/sar data this will show as 19:32:31 (EDT + 4).
Created attachment 1295938 [details] syslog of node going NotReady New syslog of node going NotReady. Log time is 15:32:31, matching performance data is 19:32:31
Monitoring is being added to verify that this issue is related to the AWS configuration and BurstBalance QoS. If BurstBalance proves to be the issue then this issue will be re-assigned to the Ops team.
I hit this with io1 (provisioned IOPS) EBS volumes as docker storage on the node. They were provisioned with 2400 IOPS. The workload was scaling up from 250 pause pods to 500 pause pods on the node. How can I see BurstBalance as described in comment 26?

During the scale-up, I see:

Aug 11 21:14:13 ip-172-31-39-235 atomic-openshift-node: I0811 21:14:13.214977 87049 kubelet_node_status.go:410] Recording NodeNotReady event message for node ip-172-31-29-59.us-west-2.compute.internal
Aug 11 21:14:13 ip-172-31-39-235 atomic-openshift-node: I0811 21:14:13.215085 87049 kubelet_node_status.go:717] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2017-08-11 21:14:13.214952238 +0000 UTC LastTransitionTime:2017-08-11 21:14:13.214952238 +0000 UTC Reason:KubeletNotReady Message:PLEG is not healthy: pleg was last seen active 3m6.796043404s ago; threshold is 3m0s}

From the pbench data the node does not appear to be IO- or CPU-bound - the instance is 4 vCPU and 16GB RAM. I'll link the data in a follow-up private comment.
Created attachment 1313230 [details] syslogs matching with data in comment 31 and comment 32
PR https://github.com/openshift/origin/pull/16692 was merged to make the kubelet not check docker for hostports when none are defined. This may help reduce the iptables load.
Is there any workaround for this? This issue has the potential to cause an avalanche effect on the nodes, e.g.:

Scenario:
1. Use the drain command on node1 to evict pods and update.
2. node2 (under load) changes state to "NotReady", which prompts Kubernetes to also evict pods from node2.
3. This causes more load on the remaining nodes, and a higher probability of further failures (i.e. more nodes changing state to NotReady).
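The cascade described in the scenario above can be illustrated with a rough model (the pod and node counts here are purely illustrative, not from this cluster): each node that goes NotReady and has its pods evicted pushes the per-node load on the survivors higher, making the next failure more likely.

```python
# Rough model of the avalanche effect: as Ready nodes drop out, the
# same pod population is packed onto fewer nodes, so per-node load
# (and hence docker/PLEG pressure) climbs on each survivor.

def pods_per_node(total_pods, ready_nodes):
    """Average pod count per remaining Ready node (ceiling division)."""
    return -(-total_pods // ready_nodes)

total_pods = 500
for nodes in (4, 3, 2):  # nodes flapping NotReady one by one
    print(nodes, pods_per_node(total_pods, nodes))
# per-node load climbs 125 -> 167 -> 250 as nodes drop out
```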
This is backported to 3.6 with https://github.com/openshift/ose/pull/900. That will reduce the contention on the iptables lock.
https://bugzilla.redhat.com/show_bug.cgi?id=1503702 is the bug tracking the change to iptables in the kernel to improve performance.
*** Bug 1486914 has been marked as a duplicate of this bug. ***
*** Bug 1493182 has been marked as a duplicate of this bug. ***
*** Bug 1493603 has been marked as a duplicate of this bug. ***
*** Bug 1474875 has been marked as a duplicate of this bug. ***
*** Bug 1503252 has been marked as a duplicate of this bug. ***
After looking at a representative sample of the online cluster, it looks like this has been resolved by the work that:

- added a longer iptables timeout (5s)
- changed the kubelet not to call iptables on pod add/delete (unless there is a hostport)
- applied the kernel fix to make iptables faster

None of the five online nodes I checked had "PLEG is not healthy" errors in the logs.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days