Created attachment 1426152 [details]
Docker/atomic-openshift-node + journalctl -xe logs

Description of problem:
When installing OCP on OpenStack, the chance of ending up with infra nodes in the NotReady state is very high.

Version-Release number of selected component (if applicable):
[openshift@master-0 ~]$ oc version
oc v3.10.0-0.27.0
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://master-2.scale-ci.example.com:8443
openshift v3.10.0-0.27.0
kubernetes v1.10.0+b81c8f8

[cloud-user@ansible-host openshift-ansible]$ git describe
openshift-ansible-3.10.0-0.28.0-7-gc1ec797

How reproducible:
5 out of 6 installs so far

Steps to Reproduce:
1. Try a multi-master install of OCP on OpenStack

Actual results:
Infra nodes in NotReady state

[openshift@master-0 ~]$ oc get nodes
NAME                                STATUS     ROLES           AGE       VERSION
app-node-0.scale-ci.example.com     Ready      compute         2h        v1.10.0+b81c8f8
app-node-1.scale-ci.example.com     Ready      compute         2h        v1.10.0+b81c8f8
cns-0.scale-ci.example.com          Ready      compute         2h        v1.10.0+b81c8f8
cns-1.scale-ci.example.com          Ready      compute         2h        v1.10.0+b81c8f8
cns-2.scale-ci.example.com          Ready      compute         2h        v1.10.0+b81c8f8
infra-node-0.scale-ci.example.com   NotReady   compute,infra   2h        v1.10.0+b81c8f8
infra-node-1.scale-ci.example.com   NotReady   compute,infra   2h        v1.10.0+b81c8f8
infra-node-2.scale-ci.example.com   NotReady   compute,infra   2h        v1.10.0+b81c8f8
master-0.scale-ci.example.com       Ready      master          2h        v1.10.0+b81c8f8
master-1.scale-ci.example.com       Ready      master          2h        v1.10.0+b81c8f8
master-2.scale-ci.example.com       Ready      master          2h        v1.10.0+b81c8f8

root@infra-node-2: ~ # journalctl -u atomic-openshift-node | grep cni | tail -n1
Apr 24 11:55:27 infra-node-2.scale-ci.example.com atomic-openshift-node[9400]: E0424 11:55:27.752505 9400 kubelet.go:2125] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized

Expected results:
All nodes in Ready state, no errors in "journalctl -u atomic-openshift-node"

Additional info:
https://bugzilla.redhat.com/show_bug.cgi?id=1568583
However, this bug seems to be different, as there are no lstat issues here.
The only thing that sticks out to me from the logs is that this environment is still pulling the /openshift3/node:v3.10.0 image for the node sync pod, which is https://bugzilla.redhat.com/show_bug.cgi?id=1570257. We need to get that fix in and try again. The old image hasn't been updated in a while, and pulling it also doubles the number of images that have to be downloaded, which slows installation unless all images are pre-cached.
Thanks for looking into this, Scott. I've used openshift-ansible-3.10.0-0.28.0-45-g818ef02 with https://github.com/openshift/openshift-ansible/pull/8122 manually applied. /openshift3/node:v3.10.0 is no longer pulled, but the infra nodes are still NotReady.

root@master-0: ~/openshift-ansible # docker images | grep openshift3 | sed 's|^[^/]*/||'
openshift3/ose-haproxy-router            v3.10.0-0.28.0   87e58469f582   14 hours ago    1.64 GB
openshift3/ose-node                      v3.10.0          89f735b7eb1f   14 hours ago    1.83 GB
openshift3/ose-node                      v3.10.0-0.28.0   89f735b7eb1f   14 hours ago    1.83 GB
openshift3/ose-docker-builder            v3.10.0-0.28.0   0d4065236422   14 hours ago    1.62 GB
openshift3/ose-sti-builder               v3.10.0-0.28.0   c730e08aed33   14 hours ago    1.62 GB
openshift3/ose-deployer                  v3.10.0-0.28.0   99f82e8d33e5   14 hours ago    1.62 GB
openshift3/ose-control-plane             v3.10.0          46d30bb81ad9   16 hours ago    1.62 GB
openshift3/ose-docker-registry           v3.10.0-0.28.0   ed628f3a20b2   16 hours ago    436 MB
openshift3/ose-keepalived-ipfailover     v3.10.0-0.28.0   e6d2e9b399e4   16 hours ago    392 MB
openshift3/ose-web-console               v3.10.0          fddfd2420628   16 hours ago    470 MB
openshift3/ose-web-console               v3.10.0-0.28.0   fddfd2420628   16 hours ago    470 MB
openshift3/ose-service-catalog           v3.10.0          9d848896a8b9   16 hours ago    461 MB
openshift3/prometheus-node-exporter      v3.10.0          685a2be8025c   24 hours ago    223 MB
openshift3/registry-console              v3.10            066b245fbf1a   26 hours ago    231 MB
openshift3/logging-fluentd               v3.10            8a98f9c0384e   26 hours ago    286 MB
openshift3/ose-template-service-broker   v3.10            1c9e686124e9   27 hours ago    284 MB
openshift3/ose-pod                       v3.10.0          2f4c3754b74f   28 hours ago    214 MB
openshift3/ose-pod                       v3.10.0-0.28.0   2f4c3754b74f   28 hours ago    214 MB
openshift3/ruby-20-rhel7                 latest           e5833aa7cf85   15 months ago   443 MB
openshift3/python-33-rhel7               latest           e18350a7786c   15 months ago   521 MB
openshift3/php-55-rhel7                  latest           2d6fbdfafa33   15 months ago   569 MB
openshift3/nodejs-010-rhel7              latest           226d0b1b7987   15 months ago   430 MB
openshift3/perl-516-rhel7                latest           45c996c39407   15 months ago   475 MB
I'm out of ideas. Here's a diff of the SDN pod logs between a working and a non-working node, with the working node on the left: https://www.diffchecker.com/O22hVjQG
Analysis: ovs-vswitchd is not running in the OVS container on an affected node. This causes openshift-sdn to block in ovs-vsctl, since that process waits for the change to be applied and openshift-sdn does not pass "--timeout=X" to ovs-vsctl. The vswitchd is being told to quit by something; it is not exiting abnormally. I currently suspect either an error in the openshift-ansible OVS container babysitting script from https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_sdn/files/sdn-ovs.yaml, or a race condition where the kubelet overlaps execution of the app containers within the pod, causing an old instance that runs "ovs-ctl stop" to actually stop a newly started instance.
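For anyone unfamiliar with the blocking behavior being described, here is a minimal illustration (br0/vxlan0 are placeholder names, not necessarily the exact bridge/port openshift-sdn configures):

# By default ovs-vsctl waits for ovs-vswitchd to apply the change, so it hangs if vswitchd is dead:
ovs-vsctl add-port br0 vxlan0 -- set interface vxlan0 type=vxlan
# With an explicit timeout the command gives up after N seconds instead of blocking forever:
ovs-vsctl --timeout=30 add-port br0 vxlan0 -- set interface vxlan0 type=vxlan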
vswitchd on the infra nodes appears to be getting OOM-killed:

Apr 26 10:53:18 infra-node-0.scale-ci.example.com kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
Apr 26 10:53:18 infra-node-0.scale-ci.example.com kernel: [ 8651]  1001  8651      691      123       6        0          -998 pod
Apr 26 10:53:18 infra-node-0.scale-ci.example.com kernel: [ 8756]     0  8756     2920      227      11        0           999 bash
Apr 26 10:53:18 infra-node-0.scale-ci.example.com kernel: [ 8863]     0  8863    13142      195      28        0           999 monitor
Apr 26 10:53:18 infra-node-0.scale-ci.example.com kernel: [ 8864]     0  8864    13238      586      29        0           999 ovsdb-server
Apr 26 10:53:18 infra-node-0.scale-ci.example.com kernel: [ 8951]     0  8951    14493      753      30        0           999 monitor
Apr 26 10:53:18 infra-node-0.scale-ci.example.com kernel: [ 8952]     0  8952   643321    75418     214        0           999 ovs-vswitchd
Apr 26 10:53:18 infra-node-0.scale-ci.example.com kernel: [ 9310]     0  9310     1090       87       7        0           999 sleep
Apr 26 10:53:18 infra-node-0.scale-ci.example.com kernel: Memory cgroup out of memory: Kill process 9557 (ovs-vswitchd) score 1943 or sacrifice child
Apr 26 10:53:18 infra-node-0.scale-ci.example.com kernel: Killed process 8952 (ovs-vswitchd) total-vm:2573284kB, anon-rss:289176kB, file-rss:12496kB, shmem-rss:0kB
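In case anyone wants to check other nodes for the same symptom, the kernel log can be searched directly on the node (assuming journald is capturing kernel messages, which it does by default on RHEL 7):

# On an affected node, look for memory-cgroup OOM kills of ovs-vswitchd:
journalctl -k | grep -i 'memory cgroup out of memory'
# or, equivalently:
dmesg | grep -i 'out of memory'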
https://github.com/openshift/openshift-ansible/pull/8166
You'd think an OOM kill would show up in 'oc describe pod XXXXX' in the events section, but I guess not...
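For anyone trying to confirm this on their own cluster (the openshift-sdn namespace and ovs pod name below are assumptions based on what the 3.10 openshift_sdn role deploys; the pod name is a placeholder):

oc -n openshift-sdn get pods | grep ovs
oc -n openshift-sdn describe pod ovs-xxxxx
# Check the Events section and the container's "Last State"; since the kernel is killing a
# non-PID-1 process inside the container (ovs-vswitchd), the kill may not surface as a pod
# event or an OOMKilled termination reason at all.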
Related: https://github.com/openshift/origin/pull/19531
(In reply to Dan Williams from comment #11)
> https://github.com/openshift/openshift-ansible/pull/8166

I've successfully installed v3.10.0-0.29.0 with openshift-ansible v3.10.0-0.30.0 + the PR. Infra nodes are Ready now. Thanks.
Final analysis... ovs-vswitchd allocates some internal pthreads based on the number of CPU cores available on the system. The default thread stack size on Fedora/RHEL appears to be 8192K, so right off the bat the vswitch has (#cores * 8192K) allocated. So far, OK. But for some reason vswitchd actually dirties all 8192K of each of those stacks; why, I have no idea. So they all end up as RSS as soon as the vswitch starts. I'll have to pass that over to the OVS team to figure out; filed https://bugzilla.redhat.com/show_bug.cgi?id=1572797 to track that.

Anyway, we can work around the memory usage by restricting the number of threads that ovs-vswitchd creates:

ovs-vsctl set Open_vSwitch . other_config:n-revalidator-threads=X
ovs-vsctl set Open_vSwitch . other_config:n-handler-threads=Y

where X and Y are some suitable numbers like X=2 and Y=5 (the default ratio is ~1:2.5 revalidator:handler), which would result in roughly 64M of RSS devoted to pthread stacks. Still better than the 320MB that 40 cores would use by default.
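A quick way to sanity-check those numbers on a node (ordinary commands, nothing OpenShift-specific; run them in the same context, host or OVS container, where vswitchd runs):

ulimit -s                                   # default per-thread stack size in KB (8192 on RHEL)
nproc                                       # CPU core count the vswitch sizes its thread pools from
ps -o rss=,nlwp=,cmd= -C ovs-vswitchd       # resident memory (KB) and thread count of the running vswitchd
ovs-vsctl get Open_vSwitch . other_config   # confirm any n-*-threads overrides took effect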
Checked with v3.10.0-0.47.0 for both openshift-ansible and atomic-openshift. The environment set up successfully with no OOM kills, and the OVS pod resources have been increased:

"resources": {
    "limits": {
        "cpu": "200m",
        "memory": "400Mi"
    },
    "requests": {
        "cpu": "100m",
        "memory": "300Mi"
    }
},
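For reference, this is the sort of query that pulls out that block (the openshift-sdn namespace and the "ovs" daemonset/label names are assumptions based on what the openshift_sdn role deploys; adjust if your install differs):

oc -n openshift-sdn get daemonset ovs -o jsonpath='{.spec.template.spec.containers[0].resources}'
# or inspect a running pod directly:
oc -n openshift-sdn get pod -l app=ovs -o jsonpath='{.items[0].spec.containers[0].resources}'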
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816