Description of problem:
On my Power9 server, at least once a day (sometimes more than once) I run into an issue where a pod will not deploy, with the following error. For now I have been just rebooting the node to recover:

Warning  FailedCreatePodSandBox  13m  kubelet, c340f1u15.pok.stglabs.ibm.com  Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "aa3d38a120de272975e6a30abddebe47d5fc6efe5605c851b40daf9f3dc99310" network for pod "mycake-2-build": NetworkPlugin cni failed to set up pod "mycake-2-build_local-php" network: failed to send CNI request: Post http://dummy/: dial unix /var/run/openshift-sdn/cni-server.sock: connect: connection refused, failed to clean up sandbox container "aa3d38a120de272975e6a30abddebe47d5fc6efe5605c851b40daf9f3dc99310" network for pod "mycake-2-build": NetworkPlugin cni failed to teardown pod "mycake-2-build_local-php" network: failed to send CNI request: Post http://dummy/: dial unix /var/run/openshift-sdn/cni-server.sock: connect: connection refused]

Version-Release number of selected component (if applicable):
Early drop of 3.10 for Power

How reproducible:
At least once a day.

Steps to Reproduce:
1. Deploy pods
2. Wait
3. At the end of the day, deploy pods again

Actual results:
Pod deployment fails.

Expected results:
Pods deploy without failing.

Additional info:
I decided to look at the openshift-sdn project, and it does show some indication of a problem:

[root@c340f1u15 ~]# oc get all
NAME            READY   STATUS             RESTARTS   AGE
pod/ovs-xdbnd   1/1     Running            7          5d
pod/sdn-4jmrp   0/1     CrashLoopBackOff   682        5d

NAME                 DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/ovs   1         1         1       1            1           <none>          5d
daemonset.apps/sdn   1         1         0       1            0           <none>          5d

NAME                                  DOCKER REPO                                           TAGS    UPDATED
imagestream.image.openshift.io/node   docker-registry.default.svc:5000/openshift-sdn/node   v3.10   5 days ago

[root@c340f1u15 ~]# ov get all
bash: ov: command not found
[root@c340f1u15 ~]# oc logs -f pod/sdn-4jmrp
2018/08/13 13:30:44 socat[4124] E connect(5, AF=1 "/var/run/openshift-sdn/cni-server.sock", 40): Connection refused
User "sa" set.
Context "default/c340f1u15-pok-stglabs-ibm-com:8443/system:admin" modified.
I0813 13:30:45.397639    4101 start_node.go:317] Reading node configuration from /etc/origin/node/node-config.yaml
I0813 13:30:45.400967    4101 feature_gate.go:190] feature gates: map[RotateKubeletClientCertificate:true RotateKubeletServerCertificate:true]
I0813 13:30:45.401090    4101 start_node.go:467] Starting node networking c340f1u15.pok.stglabs.ibm.com (v3.10.23)
W0813 13:30:45.401245    4101 server.go:195] WARNING: all flags other than --config, --write-config-to, and --cleanup are deprecated. Please begin using a config file ASAP.
I0813 13:30:45.401273    4101 feature_gate.go:226] feature gates: &{{} map[RotateKubeletClientCertificate:true RotateKubeletServerCertificate:true]}
I0813 13:30:45.402639    4101 transport.go:160] Refreshing client certificate from store
I0813 13:30:45.402660    4101 certificate_store.go:131] Loading cert/key pair from "/etc/origin/node/certificates/kubelet-client-current.pem".
I0813 13:30:45.404115    4101 node.go:150] Initializing SDN node of type "redhat/openshift-ovs-subnet" with configured hostname "c340f1u15.pok.stglabs.ibm.com" (IP ""), iptables sync period "30s"
I0813 13:30:45.408112    4101 node.go:292] Starting openshift-sdn network plugin
I0813 13:30:45.480535    4101 ovs.go:162] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
I0813 13:30:45.480564    4101 sdn_controller.go:139] [SDN setup] full SDN setup required (plugin is not setup)
I0813 13:31:15.518223    4101 ovs.go:162] Error executing ovs-vsctl: 2018-08-13T13:31:15Z|00002|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
F0813 13:31:15.518269    4101 network.go:46] SDN node startup failed: node SDN setup failed: signal: alarm clock
If you are having trouble recreating it let me know if I can gather any data from my system. I have this happen at least once a day.
Created attachment 1484152 [details]
sosreport output requested by Jeffrey Young

Even though the problem reported is an SDN issue after deployment, this sosreport is for an issue seen during deployment. If the two issues are not the same, let me know whether I should open a new BZ for the deployment-time issue.
Since the upload directly to BZ did not work, I put it up at dropbox.redhat.com. The sosreport file is available as 1615375-sosreport-MKumar.1615375-20180917132659.tar.xz.
Manoj, can you get logs from the OVS pod? It looks like OVS is crashing. You might have to get the dead pod's logs manually with "docker ps -a" and "docker logs <container id>".
[root@c340f1u15 ~]# oc project openshift-sdn
Already on project "openshift-sdn" on server "https://c340f1u15.pok.stglabs.ibm.com:8443".
[root@c340f1u15 ~]# oc get all
NAME            READY   STATUS             RESTARTS   AGE
pod/ovs-xdbnd   1/1     Running            22         42d
pod/sdn-7w59v   0/1     CrashLoopBackOff   6519       36d

NAME                 DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/ovs   1         1         1       1            1           <none>          42d
daemonset.apps/sdn   1         1         0       1            0           <none>          42d

NAME                                  DOCKER REPO                                           TAGS    UPDATED
imagestream.image.openshift.io/node   docker-registry.default.svc:5000/openshift-sdn/node   v3.10   6 weeks ago

[root@c340f1u15 ~]# oc logs -f pod/sdn-7w59v
2018/09/19 13:50:15 socat[24007] E connect(5, AF=1 "/var/run/openshift-sdn/cni-server.sock", 40): Connection refused
User "sa" set.
Context "default/c340f1u15-pok-stglabs-ibm-com:8443/system:admin" modified.
I0919 13:50:16.414339   23979 start_node.go:317] Reading node configuration from /etc/origin/node/node-config.yaml
I0919 13:50:16.418367   23979 feature_gate.go:190] feature gates: map[RotateKubeletClientCertificate:true RotateKubeletServerCertificate:true]
I0919 13:50:16.418515   23979 start_node.go:467] Starting node networking c340f1u15.pok.stglabs.ibm.com (v3.10.23)
W0919 13:50:16.418701   23979 server.go:195] WARNING: all flags other than --config, --write-config-to, and --cleanup are deprecated. Please begin using a config file ASAP.
I0919 13:50:16.418740   23979 feature_gate.go:226] feature gates: &{{} map[RotateKubeletClientCertificate:true RotateKubeletServerCertificate:true]}
I0919 13:50:16.420413   23979 transport.go:160] Refreshing client certificate from store
I0919 13:50:16.420442   23979 certificate_store.go:131] Loading cert/key pair from "/etc/origin/node/certificates/kubelet-client-current.pem".
I0919 13:50:16.422192   23979 node.go:150] Initializing SDN node of type "redhat/openshift-ovs-subnet" with configured hostname "c340f1u15.pok.stglabs.ibm.com" (IP ""), iptables sync period "30s"
I0919 13:50:16.425894   23979 node.go:292] Starting openshift-sdn network plugin
I0919 13:50:16.496517   23979 ovs.go:162] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
I0919 13:50:16.496551   23979 sdn_controller.go:139] [SDN setup] full SDN setup required (plugin is not setup)
I0919 13:50:46.539067   23979 ovs.go:162] Error executing ovs-vsctl: 2018-09-19T13:50:46Z|00002|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
F0919 13:50:46.539111   23979 network.go:46] SDN node startup failed: node SDN setup failed: signal: alarm clock
It looks like openvswitch is hung. Can you find out why?
Is this cluster deployed to bare metal or PowerVM?
Bare metal, Power9. How do I find out why openvswitch is hung?
Look at the logs in /var/log/openvswitch/ovs-vswitchd.log, as well as the output from dmesg, etc. Is it possible to get a login to a system where this happens?
I did not find a /var/log/openvswitch directory on the host. Here is the output from the openshift-sdn project:

[root@c340f1u15 containers]# oc get all
NAME            READY   STATUS             RESTARTS   AGE
pod/ovs-xdbnd   1/1     Running            27         54d
pod/sdn-7w59v   0/1     CrashLoopBackOff   8000       48d

NAME                 DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/ovs   1         1         0       1            0           <none>          54d
daemonset.apps/sdn   1         1         0       1            0           <none>          54d

NAME                                  DOCKER REPO                                           TAGS    UPDATED
imagestream.image.openshift.io/node   docker-registry.default.svc:5000/openshift-sdn/node   v3.10   7 weeks ago

So I grabbed all the references to sdn-7w59v in /var/lib/docker:
The /var/log/openvswitch directory will be in one of the pods (in this case, I think it will be in pod/ovs-xdbnd).
Created attachment 1489059 [details]
grepping for sdn-7w59v

Here is the output of grepping for 7w59v in /var/lib/docker.
[root@c340f1u15 containers]# oc rsh pod/ovs-xdbnd
error: unable to upgrade connection: pod does not exist
Try "oc rsh ovs-xdbnd" — I don't remember the exact format for specifying the pod.
Same problem:

[root@c340f1u15 containers]# oc rsh ovs-xdbnd
error: unable to upgrade connection: pod does not exist
The failure and WARN messages in comment 1 are the same as those in comment 2 of https://bugzilla.redhat.com/show_bug.cgi?id=1633892
If it is the same issue, could I request that the fix for 1633892 be back-ported to 3.10 and made available on ppc64le as well?
That backport is already in progress. When using `oc rsh`, you first have to do `oc get -n openshift-sdn pods` to get valid pod names. Then you can give the current one back to `rsh`.
Same issue. Is the networking problem limiting the ability to rsh?

[root@c340f1u15 containers]# oc get -n openshift-sdn pods
NAME        READY   STATUS             RESTARTS   AGE
ovs-xdbnd   1/1     Running            27         54d
sdn-7w59v   0/1     CrashLoopBackOff   8000       49d
[root@c340f1u15 containers]# oc rsh ovs-xdbnd
error: unable to upgrade connection: pod does not exist
Who knows. You'll have to do some manual digging then. Go to the node in question, find the docker process that is running ovs, and use docker exec. Do you know the general structure of how Kubernetes runs its pods?
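For reference, a sketch of that manual digging. It assumes the kubelet's standard k8s_<container>_<pod>_<namespace>_<uid>_<attempt> container-naming scheme, and the container name "openvswitch" is an assumption — check real `docker ps -a` output on your node. The sample text stands in for live output so the filter itself can be tried anywhere:

```shell
# Stand-in for: docker ps -a --format '{{.ID}} {{.Names}}'
# (the second name below is a hypothetical OVS container; IDs are made up)
sample='404f7f39f8bd k8s_api_master-api-localhost.localdomain_kube-system_a2828903_87
aabbccddeeff k8s_openvswitch_ovs-xdbnd_openshift-sdn_0123abcd_4'

# Pick the first container whose name says it belongs to the ovs pod.
cid=$(printf '%s\n' "$sample" | awk '$2 ~ /^k8s_openvswitch_ovs-/ {print $1; exit}')
echo "$cid"

# With the real ID in hand, on the node you would then run:
#   docker logs "$cid" > /tmp/ovs-crash.log
#   docker exec -it "$cid" /bin/sh
```

Capturing the logs to a file before any reboot matters here, since rebooting the node discards the dead container.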
Cannot locate the process.

[root@c340f1u15 ~]# docker ps | grep -i ovs
[root@c340f1u15 ~]# docker ps | grep -i openv
[root@c340f1u15 ~]# docker ps
CONTAINER ID   IMAGE                                                COMMAND                  CREATED        STATUS        PORTS   NAMES
404f7f39f8bd   051b56f67075                                         "/bin/bash -c '#!/..."   27 hours ago   Up 27 hours           k8s_api_master-api-localhost.localdomain_kube-system_a2828903fa025af361ea430fe51e9ff6_87
aeac7b707f48   82ebb2916975                                         "/bin/sh -c '#!/bi..."   33 hours ago   Up 33 hours           k8s_etcd_master-etcd-localhost.localdomain_kube-system_76ecc08e792faa61bf0d3c64199e554d_22
0db1fda29b5b   051b56f67075                                         "/bin/bash -c '#!/..."   33 hours ago   Up 33 hours           k8s_controllers_master-controllers-localhost.localdomain_kube-system_40b8aff16e502487dbf6d7aa689350c1_22
bcd28a73db6a   asolo-ocp-deployer:443/openshift3/ose-pod:v3.10.23   "/usr/bin/pod"           33 hours ago   Up 33 hours           k8s_POD_master-etcd-localhost.localdomain_kube-system_76ecc08e792faa61bf0d3c64199e554d_22
23b82200160f   asolo-ocp-deployer:443/openshift3/ose-pod:v3.10.23   "/usr/bin/pod"           33 hours ago   Up 33 hours           k8s_POD_master-controllers-localhost.localdomain_kube-system_40b8aff16e502487dbf6d7aa689350c1_22
f4d88815fd92   asolo-ocp-deployer:443/openshift3/ose-pod:v3.10.23   "/usr/bin/pod"           33 hours ago   Up 33 hours           k8s_POD_master-api-localhost.localdomain_kube-system_a2828903fa025af361ea430fe51e9ff6_22
If this is really the case, and you're not just on the wrong node, then your cluster is very, very, very, very, very broken, and it's not the network's fault. Can you check very closely for that pod? If "oc get pods" lists it as a running pod, but you don't see it in docker, then your cluster is deeply corrupted. Make sure you check the pod's NodeIP and are looking on the correct node in the cluster.
Yes, there is only one node in this cluster. Whenever I get into this state (once a day) I have been rebooting to recover.

[root@c340f1u15 ~]# oc get pods --all-namespaces -o wide
NAMESPACE               NAME                                               READY   STATUS             RESTARTS   AGE   IP             NODE
default                 docker-registry-1-qjgv7                            1/1     Running            27         55d   10.128.1.113   c340f1u15.pok.stglabs.ibm.com
default                 registry-console-1-vfcs6                           1/1     Running            28         55d   10.128.1.124   c340f1u15.pok.stglabs.ibm.com
default                 router-1-bdb2h                                     1/1     Running            21         55d   9.114.181.52   c340f1u15.pok.stglabs.ibm.com
kube-system             master-api-c340f1u15.pok.stglabs.ibm.com           1/1     Running            21         55d   9.114.181.52   c340f1u15.pok.stglabs.ibm.com
kube-system             master-controllers-c340f1u15.pok.stglabs.ibm.com   1/1     Running            21         55d   9.114.181.52   c340f1u15.pok.stglabs.ibm.com
kube-system             master-etcd-c340f1u15.pok.stglabs.ibm.com          1/1     Running            21         55d   9.114.181.52   c340f1u15.pok.stglabs.ibm.com
lacme                   acmeair-1-7dmbj                                    1/1     Running            7          31d   10.128.1.114   c340f1u15.pok.stglabs.ibm.com
lacme                   acmeair-1-build                                    0/1     Completed          0          31d   10.128.0.244   c340f1u15.pok.stglabs.ibm.com
local-acme-node         acmeairwithmongo-1-build                           0/1     Completed          0          31d   10.128.0.249   c340f1u15.pok.stglabs.ibm.com
local-acme-node         acmeairwithmongo-1-svpql                           1/1     Running            7          31d   10.128.1.115   c340f1u15.pok.stglabs.ibm.com
local-acme-node         mongodb-1-gdgd4                                    1/1     Running            7          31d   10.128.1.116   c340f1u15.pok.stglabs.ibm.com
local-ruby              myrub-1-build                                      0/1     Completed          0          11d   10.128.1.61    c340f1u15.pok.stglabs.ibm.com
local-ruby              myrub-2-bsgq6                                      1/1     Running            4          11d   10.128.1.117   c340f1u15.pok.stglabs.ibm.com
local-ruby              myrub-2-build                                      0/1     Completed          0          11d   10.128.1.64    c340f1u15.pok.stglabs.ibm.com
local-storage           local-volume-provisioner-k949l                     0/1     CrashLoopBackOff   9031       35d   10.128.1.125   c340f1u15.pok.stglabs.ibm.com
lpark                   mongodb-1-x8wxm                                    1/1     Running            7          31d   10.128.1.122   c340f1u15.pok.stglabs.ibm.com
lpark                   restify-1-build                                    0/1     Completed          0          31d   10.128.0.255   c340f1u15.pok.stglabs.ibm.com
lpark                   restify-1-qg7gb                                    1/1     Running            7          31d   10.128.1.119   c340f1u15.pok.stglabs.ibm.com
nginx-demo              nginx-ex-1-build                                   0/1     Completed          0          7d    10.128.1.110   c340f1u15.pok.stglabs.ibm.com
nginx-demo              nginx-ex-1-wcgrm                                   1/1     Running            1          7d    10.128.1.120   c340f1u15.pok.stglabs.ibm.com
openshift-node          sync-b47mf                                         1/1     Running            20         55d   9.114.181.52   c340f1u15.pok.stglabs.ibm.com
openshift-sdn           ovs-xdbnd                                          1/1     Running            27         55d   9.114.181.52   c340f1u15.pok.stglabs.ibm.com
openshift-sdn           sdn-7w59v                                          0/1     CrashLoopBackOff   8000       49d   9.114.181.52   c340f1u15.pok.stglabs.ibm.com
openshift-web-console   webconsole-6dff45c68d-whrtw                        1/1     Running            27         55d   10.128.1.121   c340f1u15.pok.stglabs.ibm.com

[root@c340f1u15 ~]# docker ps
CONTAINER ID   IMAGE                                                COMMAND                  CREATED        STATUS        PORTS   NAMES
404f7f39f8bd   051b56f67075                                         "/bin/bash -c '#!/..."   28 hours ago   Up 28 hours           k8s_api_master-api-localhost.localdomain_kube-system_a2828903fa025af361ea430fe51e9ff6_87
aeac7b707f48   82ebb2916975                                         "/bin/sh -c '#!/bi..."   34 hours ago   Up 34 hours           k8s_etcd_master-etcd-localhost.localdomain_kube-system_76ecc08e792faa61bf0d3c64199e554d_22
0db1fda29b5b   051b56f67075                                         "/bin/bash -c '#!/..."   34 hours ago   Up 34 hours           k8s_controllers_master-controllers-localhost.localdomain_kube-system_40b8aff16e502487dbf6d7aa689350c1_22
bcd28a73db6a   asolo-ocp-deployer:443/openshift3/ose-pod:v3.10.23   "/usr/bin/pod"           34 hours ago   Up 34 hours           k8s_POD_master-etcd-localhost.localdomain_kube-system_76ecc08e792faa61bf0d3c64199e554d_22
23b82200160f   asolo-ocp-deployer:443/openshift3/ose-pod:v3.10.23   "/usr/bin/pod"           34 hours ago   Up 34 hours           k8s_POD_master-controllers-localhost.localdomain_kube-system_40b8aff16e502487dbf6d7aa689350c1_22
f4d88815fd92   asolo-ocp-deployer:443/openshift3/ose-pod:v3.10.23   "/usr/bin/pod"           34 hours ago   Up 34 hours           k8s_POD_master-api-localhost.localdomain_kube-system_a2828903fa025af361ea430fe51e9ff6_22
Well, the next time your node is in a bad state, you need to figure out why the ovs pod crashed. You will have to use docker ps -a to find it. If you reboot it, the logs will be lost. It's time to delete and recreate your cluster, fwiw. It will make debugging easier.
It is in the bad state now, and I have left it in that state to debug. If the logs are long gone, I can reboot, and it will take me another day to get back into that state.

I can't delete this cluster: it is the only fully functional cluster I have for demonstration purposes (together with Red Hat) at an upcoming conference. The others I tried to set up fail with: https://bugzilla.redhat.com/show_bug.cgi?id=1625817

If the ppc64le version GAs today, I can try to use the public registry, as opposed to the private-registry method, on a new server.
From Weibin's note yesterday, it looks like a fix for 1633892 is available. Can you get me a version of that on ppc64le to try? https://bugzilla.redhat.com/show_bug.cgi?id=1633892 https://github.com/openshift/openshift-ansible/pull/10280
I deleted my cluster and re-created it at 3.10.72. I am still seeing the issue.

[root@c340f1u15 ~]# oc get all
NAME            READY   STATUS             RESTARTS   AGE
pod/ovs-jwvwd   1/1     Running            0          3d
pod/sdn-5qm95   0/1     CrashLoopBackOff   731        3d

[root@c340f1u15 ~]# oc logs -f pod/sdn-5qm95
2018/12/11 14:18:40 socat[20194] E connect(5, AF=1 "/var/run/openshift-sdn/cni-server.sock", 40): Connection refused
User "sa" set.
Context "default/c340f1u15-pok-stglabs-ibm-com:8443/system:admin" modified.
I1211 14:18:41.832696   20154 start_node.go:317] Reading node configuration from /etc/origin/node/node-config.yaml
I1211 14:18:41.844785   20154 feature_gate.go:190] feature gates: map[RotateKubeletServerCertificate:true RotateKubeletClientCertificate:true]
I1211 14:18:41.844922   20154 start_node.go:467] Starting node networking c340f1u15.pok.stglabs.ibm.com (v3.10.72)
W1211 14:18:41.845079   20154 server.go:195] WARNING: all flags other than --config, --write-config-to, and --cleanup are deprecated. Please begin using a config file ASAP.
I1211 14:18:41.845107   20154 feature_gate.go:226] feature gates: &{{} map[RotateKubeletClientCertificate:true RotateKubeletServerCertificate:true]}
I1211 14:18:41.847970   20154 transport.go:160] Refreshing client certificate from store
I1211 14:18:41.848906   20154 certificate_store.go:131] Loading cert/key pair from "/etc/origin/node/certificates/kubelet-client-current.pem".
I1211 14:18:41.853063   20154 node.go:150] Initializing SDN node of type "redhat/openshift-ovs-subnet" with configured hostname "c340f1u15.pok.stglabs.ibm.com" (IP ""), iptables sync period "30s"
I1211 14:18:41.902234   20154 node.go:292] Starting openshift-sdn network plugin
I1211 14:18:42.033860   20154 ovs.go:166] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
I1211 14:18:42.033892   20154 sdn_controller.go:139] [SDN setup] full SDN setup required (plugin is not setup)
I1211 14:19:12.075047   20154 ovs.go:166] Error executing ovs-vsctl: 2018-12-11T14:19:12Z|00002|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
F1211 14:19:12.075125   20154 network.go:46] SDN node startup failed: node SDN setup failed: signal: alarm clock
[root@c340f1u15 ~]# rpm -aq | grep openshif
atomic-openshift-clients-3.10.72-1.git.0.3cb2fdc.el7.ppc64le
openshift-ansible-roles-3.10.73-1.git.0.8b65cea.el7.noarch
openshift-ansible-playbooks-3.10.73-1.git.0.8b65cea.el7.noarch
atomic-openshift-excluder-3.10.72-1.git.0.3cb2fdc.el7.noarch
atomic-openshift-node-3.10.72-1.git.0.3cb2fdc.el7.ppc64le
openshift-ansible-docs-3.10.73-1.git.0.8b65cea.el7.noarch
atomic-openshift-docker-excluder-3.10.72-1.git.0.3cb2fdc.el7.noarch
atomic-openshift-3.10.72-1.git.0.3cb2fdc.el7.ppc64le
openshift-ansible-3.10.73-1.git.0.8b65cea.el7.noarch
atomic-openshift-hyperkube-3.10.72-1.git.0.3cb2fdc.el7.ppc64le
Can you get the logs from the ovs pod? "oc rsh -n openshift-sdn ovs-jwvwd cat '/var/log/openvswitch/*.log'"
BTW, this is my only cluster on Power9.

sh-4.2# cat ovsdb-server.log
2018-12-07T21:41:44.099Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovsdb-server.log
2018-12-07T21:41:44.210Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.9.0
2018-12-07T21:41:54.219Z|00003|memory|INFO|8128 kB peak resident set size after 10.1 seconds
2018-12-07T21:41:54.219Z|00004|memory|INFO|cells:217 json-caches:1 monitors:1 sessions:2
sh-4.2# cat ovs-vswitchd.log
2018-12-07T21:41:44.567Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log
2018-12-07T21:41:44.581Z|00002|ovs_numa|INFO|Discovered 64 CPU cores on NUMA node 0
2018-12-07T21:41:44.581Z|00003|ovs_numa|INFO|Discovered 1 NUMA nodes and 64 CPU cores
2018-12-07T21:41:44.581Z|00004|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2018-12-07T21:41:44.582Z|00005|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connected
2018-12-07T21:41:44.587Z|00006|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.9.0
2018-12-07T21:41:45.430Z|00007|ofproto_dpif|INFO|system@ovs-system: Datapath supports recirculation
2018-12-07T21:41:45.430Z|00008|ofproto_dpif|INFO|system@ovs-system: VLAN header stack length probed as 2
2018-12-07T21:41:45.430Z|00009|ofproto_dpif|INFO|system@ovs-system: MPLS label stack length probed as 1
2018-12-07T21:41:45.430Z|00010|ofproto_dpif|INFO|system@ovs-system: Datapath supports truncate action
2018-12-07T21:41:45.430Z|00011|ofproto_dpif|INFO|system@ovs-system: Datapath supports unique flow ids
2018-12-07T21:41:45.430Z|00012|ofproto_dpif|INFO|system@ovs-system: Datapath does not support clone action
2018-12-07T21:41:45.430Z|00013|ofproto_dpif|INFO|system@ovs-system: Max sample nesting level probed as 10
2018-12-07T21:41:45.430Z|00014|ofproto_dpif|INFO|system@ovs-system: Datapath supports eventmask in conntrack action
2018-12-07T21:41:45.430Z|00015|ofproto_dpif|INFO|system@ovs-system: Datapath does not support ct_clear action
2018-12-07T21:41:45.430Z|00016|ofproto_dpif|INFO|system@ovs-system: Datapath supports ct_state
2018-12-07T21:41:45.430Z|00017|ofproto_dpif|INFO|system@ovs-system: Datapath supports ct_zone
2018-12-07T21:41:45.430Z|00018|ofproto_dpif|INFO|system@ovs-system: Datapath supports ct_mark
2018-12-07T21:41:45.430Z|00019|ofproto_dpif|INFO|system@ovs-system: Datapath supports ct_label
2018-12-07T21:41:45.430Z|00020|ofproto_dpif|INFO|system@ovs-system: Datapath supports ct_state_nat
2018-12-07T21:41:45.430Z|00021|ofproto_dpif|INFO|system@ovs-system: Datapath supports ct_orig_tuple
2018-12-07T21:41:45.430Z|00022|ofproto_dpif|INFO|system@ovs-system: Datapath supports ct_orig_tuple6
2018-12-07T21:41:45.444Z|00023|bridge|INFO|bridge br0: added interface br0 on port 65534
2018-12-07T21:41:45.444Z|00024|bridge|INFO|bridge br0: using datapath ID 00008e1d8fca3a4e
2018-12-07T21:41:45.444Z|00025|connmgr|INFO|br0: added service controller "punix:/var/run/openvswitch/br0.mgmt"
2018-12-07T21:41:45.475Z|00026|bridge|INFO|bridge br0: added interface vxlan0 on port 1
2018-12-07T21:41:45.578Z|00027|bridge|INFO|bridge br0: added interface tun0 on port 2
2018-12-07T21:41:45.663Z|00028|connmgr|INFO|br0<->unix#2: 42 flow_mods in the last 0 s (42 adds)
2018-12-07T21:41:45.668Z|00029|connmgr|INFO|br0<->unix#4: 1 flow_mods in the last 0 s (1 adds)
2018-12-07T21:41:45.797Z|00030|connmgr|INFO|br0<->unix#9: 4 flow_mods in the last 0 s (4 adds)
2018-12-07T21:41:54.588Z|00031|memory|INFO|140096 kB peak resident set size after 10.0 seconds
2018-12-07T21:41:54.588Z|00032|memory|INFO|handlers:10 ports:3 revalidators:4 rules:51 udpif keys:1
2018-12-07T21:43:44.171Z|00033|connmgr|INFO|br0<->unix#17: 4 flow_mods in the last 0 s (4 adds)
2018-12-07T21:43:47.710Z|00034|bridge|INFO|bridge br0: added interface vethf59aad27 on port 3
2018-12-07T21:43:47.710Z|00001|ofproto_dpif_upcall(handler2)|INFO|received packet on unassociated datapath port 4
2018-12-07T21:43:47.764Z|00035|connmgr|INFO|br0<->unix#22: 4 flow_mods in the last 0 s (4 adds)
2018-12-07T21:43:47.789Z|00036|connmgr|INFO|br0<->unix#24: 2 flow_mods in the last 0 s (2 deletes)
2018-12-07T21:43:50.360Z|00037|connmgr|INFO|br0<->unix#26: 2 flow_mods in the last 0 s (2 adds)
2018-12-07T21:44:04.962Z|00038|bridge|INFO|bridge br0: added interface veth54a2f53f on port 4
2018-12-07T21:44:05.064Z|00039|connmgr|INFO|br0<->unix#28: 4 flow_mods in the last 0 s (4 adds)
2018-12-07T21:44:05.090Z|00040|connmgr|INFO|br0<->unix#30: 2 flow_mods in the last 0 s (2 deletes)
2018-12-07T21:44:09.141Z|00041|connmgr|INFO|br0<->unix#32: 2 flow_mods in the last 0 s (2 adds)
2018-12-07T21:44:11.663Z|00042|bridge|INFO|bridge br0: added interface vethf7cf67e9 on port 5
2018-12-07T21:44:11.760Z|00043|connmgr|INFO|br0<->unix#34: 4 flow_mods in the last 0 s (4 adds)
2018-12-07T21:44:11.782Z|00044|connmgr|INFO|br0<->unix#36: 2 flow_mods in the last 0 s (2 deletes)
2018-12-07T21:44:12.263Z|00045|bridge|INFO|bridge br0: added interface veth9c5fc93d on port 6
2018-12-07T21:44:12.364Z|00046|connmgr|INFO|br0<->unix#38: 4 flow_mods in the last 0 s (4 adds)
2018-12-07T21:44:12.391Z|00047|connmgr|INFO|br0<->unix#40: 2 flow_mods in the last 0 s (2 deletes)
2018-12-07T21:44:12.395Z|00048|connmgr|INFO|br0<->unix#42: 2 flow_mods in the last 0 s (2 deletes)
2018-12-07T21:44:12.464Z|00049|connmgr|INFO|br0<->unix#44: 4 flow_mods in the last 0 s (4 deletes)
2018-12-07T21:44:12.493Z|00050|bridge|INFO|bridge br0: deleted interface vethf59aad27 on port 3
2018-12-07T21:44:17.971Z|00051|connmgr|INFO|br0<->unix#48: 2 flow_mods in the last 0 s (2 adds)
2018-12-07T21:44:18.638Z|00052|bridge|INFO|bridge br0: added interface vethc9382a95 on port 7
2018-12-07T21:44:18.665Z|00053|connmgr|INFO|br0<->unix#50: 4 flow_mods in the last 0 s (4 adds)
2018-12-07T21:44:18.775Z|00054|connmgr|INFO|br0<->unix#52: 2 flow_mods in the last 0 s (2 deletes)
2018-12-07T21:44:21.163Z|00055|bridge|INFO|bridge br0: added interface vethccb8dd9b on port 8
2018-12-07T21:44:21.265Z|00056|connmgr|INFO|br0<->unix#54: 4 flow_mods in the last 0 s (4 adds)
2018-12-07T21:44:21.288Z|00057|connmgr|INFO|br0<->unix#56: 2 flow_mods in the last 0 s (2 deletes)
2018-12-07T21:44:24.360Z|00058|connmgr|INFO|br0<->unix#58: 2 flow_mods in the last 0 s (2 deletes)
2018-12-07T21:44:24.376Z|00059|connmgr|INFO|br0<->unix#60: 4 flow_mods in the last 0 s (4 deletes)
2018-12-07T21:44:24.460Z|00060|bridge|INFO|bridge br0: deleted interface veth54a2f53f on port 4
2018-12-07T21:44:32.026Z|00061|connmgr|INFO|br0<->unix#62: 2 flow_mods in the last 0 s (2 deletes)
2018-12-07T21:44:32.066Z|00062|connmgr|INFO|br0<->unix#64: 4 flow_mods in the last 0 s (4 deletes)
2018-12-07T21:44:32.093Z|00063|bridge|INFO|bridge br0: deleted interface veth9c5fc93d on port 6
2018-12-07T21:49:56.184Z|00064|connmgr|INFO|br0<->unix#91: 2 flow_mods in the last 0 s (2 adds)
2018-12-07T21:49:58.451Z|00065|bridge|INFO|bridge br0: added interface veth1951bb32 on port 9
2018-12-07T21:49:58.464Z|00066|connmgr|INFO|br0<->unix#93: 4 flow_mods in the last 0 s (4 adds)
2018-12-07T21:49:58.488Z|00067|connmgr|INFO|br0<->unix#95: 2 flow_mods in the last 0 s (2 deletes)
2018-12-07T21:51:20.260Z|00068|connmgr|INFO|br0<->unix#103: 2 flow_mods in the last 0 s (2 deletes)
2018-12-07T21:51:20.275Z|00069|connmgr|INFO|br0<->unix#105: 4 flow_mods in the last 0 s (4 deletes)
2018-12-07T21:51:20.363Z|00070|bridge|INFO|bridge br0: deleted interface veth1951bb32 on port 9
2018-12-07T21:51:20.462Z|00071|bridge|INFO|bridge br0: added interface vethccd6a2db on port 10
2018-12-07T21:51:20.472Z|00072|connmgr|INFO|br0<->unix#107: 4 flow_mods in the last 0 s (4 adds)
2018-12-07T21:51:20.576Z|00073|connmgr|INFO|br0<->unix#109: 2 flow_mods in the last 0 s (2 deletes)
2018-12-07T21:51:25.862Z|00074|bridge|INFO|bridge br0: added interface veth411f8817 on port 11
2018-12-07T21:51:25.960Z|00075|connmgr|INFO|br0<->unix#111: 4 flow_mods in the last 0 s (4 adds)
2018-12-07T21:51:25.979Z|00076|connmgr|INFO|br0<->unix#113: 2 flow_mods in the last 0 s (2 deletes)
2018-12-07T21:51:29.869Z|00077|connmgr|INFO|br0<->unix#115: 2 flow_mods in the last 0 s (2 deletes)
2018-12-07T21:51:29.890Z|00078|connmgr|INFO|br0<->unix#117: 4 flow_mods in the last 0 s (4 deletes)
2018-12-07T21:51:29.928Z|00079|bridge|INFO|bridge br0: deleted interface vethccd6a2db on port 10
2018-12-08T17:37:42.960Z|00002|daemon_unix(monitor)|INFO|pid 33984 died, killed (Killed), exiting
sh-4.2#
Casey: I tried this on 3.11 as well, and I can recreate it. It might not have anything to do with the OpenShift level, but rather with OVS on Power9. Did you find anything in the logs?
I found another workaround. Instead of rebooting the node, if I delete both the OVS and SDN pods in the openshift-sdn namespace, the services come back fine:

oc delete pod/ovs-7spft
oc delete pod/sdn-gcckt
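A slightly more repeatable version of that workaround: the pod-name suffixes change on every restart, so it helps to derive the names rather than hard-code them. This is a sketch — the sample variable stands in for live `oc -n openshift-sdn get pods -o name` output, and the delete commands are only printed, not run:

```shell
# Stand-in for: oc -n openshift-sdn get pods -o name
sample='pod/ovs-7spft
pod/sdn-gcckt'

# Select the ovs and sdn pods by name prefix.
targets=$(printf '%s\n' "$sample" | grep -E '^pod/(ovs|sdn)-')

# Print the delete commands; drop the echo to run them for real.
# The daemonsets recreate both pods automatically.
for p in $targets; do
    echo oc -n openshift-sdn delete "$p"
done
```

Deleting the pods rather than rebooting also preserves the node's kernel and docker state for debugging.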
I installed OpenShift on an AC922 (also Power9) system this week, and now this issue is very bothersome. The sdn pod crashes several times a day. Can we get some attention on this problem? I have raised the severity to 'urgent', as this will be deployed at a client site in a couple of weeks.
(In reply to Manoj Kumar from comment #35)
> I found another work-around. Instead of rebooting the node, if I delete both
> the OVS and SDN pods in the openshift-sdn workspace, then the services come
> back fine.
>
> oc delete pod/ovs-7spft
> oc delete pod/sdn-gcckt

Manoj,

We were facing the same issue yesterday (01.24.2019) after the upgrade of OpenShift (on RHEL 7.6) from 3.9.33 to 3.10.89 over the last weekend. When I deleted the OVS and SDN pods, the apps started working for some time and then failed again after a few hours. When we checked the journalctl logs (journalctl | grep -i OOM-kill), they reported out-of-memory kills, and the technician from Red Hat said the node had run out of memory. After increasing the memory on the node, we haven't seen the issue so far, and it has been 24 hours. We are monitoring the situation and will update if I see any issues.

Thanks,
Harsha Polavarapu.
Can we get clarification on whether the problem is actually always an OOM kill of the OVS container? We may need to bump the expected memory limits of the OVS container to match how POWER9 uses memory. If it's some other crash of the OVS container, then we should reassign this bug to the OVS team.
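One way to answer that on the node is to check the kernel log for an OOM-kill entry naming the OVS daemons around the crash time. A sketch — the sample line below is hypothetical (the exact message format varies by kernel); on a real node you would feed `dmesg -T` or `journalctl -k` through the same filter:

```shell
# Hypothetical stand-in for: dmesg -T | tail (around the time the OVS pod died)
sample='[Mon Jan 28 10:00:00 2019] Memory cgroup out of memory: Kill process 33984 (ovs-vswitchd) score 1000 or sacrifice child'

# If ovs-vswitchd or ovsdb-server shows up in an OOM-kill line, it is a
# memory-limit problem rather than an OVS crash, and the fix is raising
# the container limits; otherwise the OVS team should look at it.
if printf '%s\n' "$sample" | grep -qiE 'out of memory.*(ovs-vswitchd|ovsdb-server)'; then
    echo "ovs was OOM-killed"
else
    echo "no OOM kill involving ovs found"
fi
```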
I spent a good part of this week trying to validate Harsha's assertion. To me it is not clear that the OOM kill is related to OVS at all, at least on Power9. At times I find that the OVS container dies without a recent oom-kill, and in /var/log/messages I get no indication that OVS is involved.
(In reply to Manoj Kumar from comment #39)
> I spent a good part of this week trying to validate Harsha's assertion. To
> me it is not clear that OOM kill is related to OVS at all, at least on
> Power9.
>
> At times I find that the OVS container dies, without a recent oom-kill. In
> /var/log/messages I get no indication that OVS is involved.

Manoj,

The root cause of the issue is systemd-journal taking more memory over time. Red Hat support acknowledged that there is a memory leak, which in turn was killing the OVS and SDN pods, since their memory is not guaranteed. We set the values below so that memory is guaranteed for the OVS and SDN pods, and we started recycling systemd-journal daily at midnight.

# oc project openshift-sdn
# oc edit ds sdn

Change the limits to equal the requested resources:

  limits:
    cpu: 300m
    memory: 1000Mi
  requests:
    cpu: 300m
    memory: 1000Mi

# oc project openshift-sdn
# oc edit ds ovs

Change the limits to equal the requested resources:

  limits:
    cpu: 200m
    memory: 400Mi
  requests:
    cpu: 200m
    memory: 400Mi

Today Red Hat released the bugfix below for the issue and asked us to update systemd (yum upgrade systemd):

https://access.redhat.com/errata/RHSA-2019:0201

Hopefully this resolves the issue.

Thanks,
Harsha Polavarapu.
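For reference, the interactive `oc edit` change above can also be scripted as a strategic-merge patch. This is a sketch: the container name "openvswitch" is an assumption — verify it against `oc -n openshift-sdn get ds ovs -o yaml` before applying — and the values mirror the ones quoted above. Setting limits equal to requests is what gives the pod Guaranteed QoS, so the OOM killer evicts it last:

```shell
# Write out the patch for the ovs daemonset (limits == requests => Guaranteed QoS).
# The container name "openvswitch" is an assumption; match it to your daemonset.
cat > ovs-limits.json <<'EOF'
{"spec": {"template": {"spec": {"containers": [{"name": "openvswitch",
  "resources": {"requests": {"cpu": "200m", "memory": "400Mi"},
                "limits":   {"cpu": "200m", "memory": "400Mi"}}}]}}}}
EOF

# On the cluster you would then run:
echo 'apply with: oc -n openshift-sdn patch ds ovs --patch "$(cat ovs-limits.json)"'
```

The same pattern applies to the sdn daemonset with the 300m/1000Mi values.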
I'm able to see this behavior on a Power9 machine (I can't duplicate it on a Power8). As you can see below, almost all of the pods lose network and are restarted 2-4 times an hour. It seems that kube-service-catalog and openshift-monitoring are particularly sensitive. I can provide access to this system.

[ibm-p9b-26 ~]# oc get pods --all-namespaces
NAMESPACE                           NAME                                                       READY   STATUS      RESTARTS   AGE
default                             docker-registry-1-h8tpt                                    1/1     Running     32         15h
default                             registry-console-1-xww6v                                   1/1     Running     32         15h
default                             router-1-4l2sd                                             1/1     Running     0          15h
kube-service-catalog                apiserver-nl4hw                                            1/1     Running     32         15h
kube-service-catalog                controller-manager-crd77                                   1/1     Running     158        15h
kube-system                         master-api-ibm-p9b-26.khw.lab.eng.bos.redhat.com           1/1     Running     0          15h
kube-system                         master-controllers-ibm-p9b-26.khw.lab.eng.bos.redhat.com   1/1     Running     0          15h
kube-system                         master-etcd-ibm-p9b-26.khw.lab.eng.bos.redhat.com          1/1     Running     0          15h
local-storage                       local-volume-provisioner-wlb9q                             1/1     Running     64         15h
openshift-ansible-service-broker    asb-1-ll56b                                                1/1     Running     63         15h
openshift-console                   console-7456d9547b-rgpdg                                   1/1     Running     32         15h
openshift-monitoring                alertmanager-main-0                                        3/3     Running     97         15h
openshift-monitoring                alertmanager-main-1                                        3/3     Running     96         15h
openshift-monitoring                alertmanager-main-2                                        3/3     Running     96         15h
openshift-monitoring                cluster-monitoring-operator-6f5fbd6f8b-k5x64               1/1     Running     132        15h
openshift-monitoring                grafana-857fc848bf-jddjl                                   2/2     Running     64         15h
openshift-monitoring                kube-state-metrics-6b4f6c5c5b-2n5sr                        3/3     Running     96         15h
openshift-monitoring                node-exporter-5vwc4                                        2/2     Running     0          15h
openshift-monitoring                prometheus-k8s-0                                           4/4     Running     130        15h
openshift-monitoring                prometheus-k8s-1                                           4/4     Running     130        15h
openshift-monitoring                prometheus-operator-7855c8646b-89knw                       1/1     Running     32         15h
openshift-node                      sync-s78xp                                                 1/1     Running     0          15h
openshift-sdn                       ovs-ndl6f                                                  1/1     Running     32         15h
openshift-sdn                       sdn-dd4j8                                                  1/1     Running     32         15h
openshift-template-service-broker   apiserver-v65jx                                            1/1     Running     59         15h
openshift-web-console               webconsole-7f7f679596-7kkcr                                1/1     Running     33         15h
operator-lifecycle-manager          catalog-operator-54f866c7d5-ltk2f                          1/1     Running     62         15h
operator-lifecycle-manager          olm-operator-55fd8f9bbb-fd7qj                              1/1     Running     63         15h
test                                dancer-mysql-example-1-4t5k2                               1/1     Running     62         15h
test                                dancer-mysql-example-1-build                               0/1     Completed   0          15h
test                                database-1-p42nh                                           1/1     Running     64         15h
test                                httpd-example-1-5wdf5                                      1/1     Running     61         15h
test                                httpd-example-1-build                                      0/1     Completed   0          15h

43m 43m 1 prometheus-operator-7855c8646b-89knw.15802a40ee4c7b24 Pod Warning NetworkFailed openshift-sdn, ibm-p9b-26.khw.lab.eng.bos.redhat.com The pod's network interface has been lost and the pod will be stopped.
I hadn't yet made the changes that Harsha ^^^ suggested. Once I made those, this cluster seemed to stabilize. I will continue to monitor. Thanks!
The cluster didn't stay stable for long; I'm back to seeing consistent network failures across projects, with ovs/sdn crashes twice an hour.

This machine is an all-in-one node:
128 cpu
62GB memory
50GB disk (using about 56% now)
rhel-alt 7.6 (Power9 has to use rhel-alt)

Are there any specific logs I can provide? I can also provide access to this system (inside the Red Hat VPN).
This PR: https://github.com/openshift/openshift-ansible/pull/11150
from this bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1669311
fixed the issue on my cluster. OK to close?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0405
Running the following resolved my issues:

# kubeadm init --pod-network-cidr=10.244.0.0/16

For flannel as the CNI, the apiserver needs the --pod-network-cidr=... argument set to match the overlay network.
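For reference, the config-file equivalent of that flag — a sketch assuming the kubeadm v1beta1 ClusterConfiguration schema (verify field names with `kubeadm config print init-defaults` on your kubeadm version before using):

```shell
# Write a kubeadm config whose podSubnet matches flannel's default overlay.
cat > kubeadm-flannel.yaml <<'EOF'
apiVersion: kubeadm.k8s.io/v1beta1
kind: ClusterConfiguration
networking:
  podSubnet: 10.244.0.0/16
EOF

# On the node you would then run:
echo "init with: kubeadm init --config kubeadm-flannel.yaml"
```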