Created attachment 1327993 [details]
syslog at loglevel 5 for node that went NotReady

Description of problem:

Scaling a 50 x m4.xlarge compute node cluster to a density of 240 pause pods per node (achievable in 3.6) in 1000 namespaces. After 950 namespaces and 11,400 pods are created, nodes start going into NotReady and pods get stuck in Pending. The logs contain the message below as the NotReady reason:

NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized

Full logs for a failed node at loglevel 5 are attached.

Sep 19 12:21:28 ip-172-31-50-91.us-west-2.compute.internal atomic-openshift-node[8577]: I0919 12:21:28.593007 8577 kubelet_node_status.go:433] Recording NodeNotReady event message for node ip-172-31-50-91.us-west-2.compute.internal
Sep 19 12:21:28 ip-172-31-50-91.us-west-2.compute.internal atomic-openshift-node[8577]: I0919 12:21:28.593036 8577 kubelet_node_status.go:757] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2017-09-19 12:21:28.592998669 +0000 UTC LastTransitionTime:2017-09-19 12:21:28.592998669 +0000 UTC Reason:KubeletNotReady Message:runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized}
Sep 19 12:21:28 ip-172-31-50-91.us-west-2.compute.internal atomic-openshift-node[8577]: I0919 12:21:28.598079 8577 sdn_controller.go:205] AddHostSubnetRules for ip-172-31-61-231.us-west-2.compute.internal (host: "ip-172-31-61-231.us-west-2.compute.internal", ip: "172.31.61.231", subnet: "172.20.84.0/24")
Sep 19 12:21:28 ip-172-31-50-91.us-west-2.compute.internal atomic-openshift-node[8577]: I0919 12:21:28.613218 8577 sdn_controller.go:205] AddHostSubnetRules for ip-172-31-61-32.us-west-2.compute.internal (host: "ip-172-31-61-32.us-west-2.compute.internal", ip: "172.31.61.32", subnet: "172.20.65.0/24")
Sep 19 12:21:28 ip-172-31-50-91.us-west-2.compute.internal atomic-openshift-node[8577]: W0919 12:21:28.615392 8577 proxier.go:488] clusterCIDR not specified, unable to distinguish between internal and external traffic

Version-Release number of selected component (if applicable):
openshift v3.7.0-0.126.4
kubernetes v1.7.0+80709908fd

How reproducible:
Always when scaling to this level

Steps to Reproduce:
1. On AWS, install a cluster with 1 master/etcd, 3 infra, and 50 compute (m4.xlarge) nodes.
2. Run cluster-loader (https://github.com/openshift/svt/tree/master/openshift_scalability) with the configuration file below to create 1000 projects with 12,000 pods.

Actual results:
After 950 projects are created and 11,398 pods are running, nodes start going into NotReady and additional pods fail to start.

Expected results:
All pods start successfully.

Additional info:

python cluster-loader.py -f config.yaml -p 5

cluster-loader config:

projects:
  - num: 1000
    basename: svt-1-
    ifexists: delete
    tuning: default
    pods:
      - total: 12
      - num: 100
        image: gcr.io/google_containers/pause-amd64:3.0
        basename: pausepods
        file: content/pod-pause.json
        storage:
          - type: none

quotas:
  - name: default
    file: default

tuningsets:
  - name: default
    pods:
      stepping:
        stepsize: 50
        pause: 60 s
      rate_limit:
        delay: 0 ms
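For anyone re-running this, a rough way to watch for the failure while cluster-loader runs is sketched below; the oc invocations are generic ones suggested here, not taken from the original report.

    # Count NotReady nodes and Pending pods once a minute while the load runs
    while true; do
        date
        echo -n "NotReady nodes: "; oc get nodes --no-headers | grep -cw NotReady
        echo -n "Pending pods:   "; oc get pods --all-namespaces --no-headers | grep -cw Pending
        sleep 60
    done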
Do you have full node logs from the affected nodes? openshift-sdn only writes out the config file to enable NodeReady once it's done initial setup. So it appears that it's taking a while to get everything set up, and we should figure out how long things really are taking. Also, what's the CPU load on the affected nodes during setup?
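For example, something along these lines on an affected node during setup would answer that; these are generic sysstat/journalctl invocations suggested here, not commands from this report:

    # Sample CPU and memory once per second for 5 minutes while the node initializes
    sar -u -r 1 300 -o /tmp/node-setup.sar

    # Bound how long node/SDN setup takes from the node-service messages
    journalctl -u atomic-openshift-node --since "2017-09-19 12:15:00" | grep -Ei 'sdn|NetworkReady|Starting|Started'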
The attached file is the full node log, captured with journalctl > <output>. I'll grab pbench data the next time I set this up, possibly tomorrow.
There are a lot of weird log messages there. It looks like atomic-openshift-node gets restarted quite a few times by something; I'm not sure whether that's ansible or systemd or what. At the 12:21:28 mark, the node has just restarted and the SDN isn't up and running yet; that takes a bit of time. It's still initializing at 12:22:56. Then for some reason openshift-node gets killed by ansible at 12:23:01 and restarted. Then it looks like everything is mostly ready at 12:23:11? So I'm not sure this is a network problem. It looks like something is slowing the node down (in the first service run in the logs, up until 12:20-ish), and then something is restarting openshift-node. Any idea what that is?
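To narrow down what is restarting the service, checks like the ones below might help; this is generic systemd/journalctl usage suggested here, not something already done in this thread:

    # Restart timeline for the node service as seen by systemd
    journalctl -u atomic-openshift-node | grep -E 'Starting|Started|Stopping|Stopped'

    # Rule out the kernel OOM killer taking the process down
    journalctl -k | grep -i 'out of memory'

    # Current unit state, last result, and most recent start time
    systemctl status atomic-openshift-node -l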
Reproduced this today and captured system data right after a node went NotReady:

http://perf-infra.ec2.breakage.org/pbench/results/ip-172-31-17-233/20170927_from890_a/tools-default/ip-172-31-44-27.us-west-2.compute.internal/

sar shows the system (4 vCPU, 16 GB) getting busy in peaks - it has 244 pods running on it, with spikes of about 2.5 cores and 5 GB of memory in use. I grabbed the logs - there are earlier instances, but if you search up from the bottom for NotReady you will find the instance captured here. Note that the node service restarted immediately afterward. Maybe this should belong to the Pod component? Will try to repro tomorrow at loglevel 5 - this one was only at loglevel 2.
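For reference, this is roughly how the node gets bumped to loglevel 5 on a default 3.x RPM install; the sysconfig path and the existing --loglevel=2 value are assumptions about this environment:

    # Raise node logging from 2 to 5 and restart the service to pick it up
    sed -i 's/--loglevel=2/--loglevel=5/' /etc/sysconfig/atomic-openshift-node
    systemctl restart atomic-openshift-node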
Created attachment 1331567 [details]
Node log/system log (loglevel 2)
This may be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1451902
*** This bug has been marked as a duplicate of bug 1451902 ***