Bug 1615375 - Failed to create pod, due to CNI issues
Summary: Failed to create pod, due to CNI issues
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.10.0
Hardware: ppc64le
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.10.z
Assignee: Casey Callendrello
QA Contact: Meng Bo
URL:
Whiteboard:
Depends On:
Blocks: 1479956
 
Reported: 2018-08-13 13:04 UTC by Manoj Kumar
Modified: 2019-07-12 18:55 UTC
CC List: 15 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-03-14 02:15:34 UTC
Target Upstream Version:
Embargoed:


Attachments
sosreport output requested by Jeffrey Young (33 bytes, text/plain)
2018-09-17 20:28 UTC, Manoj Kumar
grepping for sdn-7w59v (3.71 MB, text/plain)
2018-10-01 13:55 UTC, Manoj Kumar


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:0405 0 None None None 2019-03-14 02:15:41 UTC

Description Manoj Kumar 2018-08-13 13:04:58 UTC
Description of problem:

On my Power9 server, at-least once a day, sometimes more than once I run into an issue where a pod will not deploy with the following error. For now I have been just rebooting the node to recover:  Warning  FailedCreatePodSandBox  13m                kubelet, c340f1u15.pok.stglabs.ibm.com  Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "aa3d38a120de272975e6a30abddebe47d5fc6efe5605c851b40daf9f3dc99310" network for pod "mycake-2-build": NetworkPlugin cni failed to set up pod "mycake-2-build_local-php" network: failed to send CNI request: Post http://dummy/: dial unix /var/run/openshift-sdn/cni-server.sock: connect: connection refused, failed to clean up sandbox container "aa3d38a120de272975e6a30abddebe47d5fc6efe5605c851b40daf9f3dc99310" network for pod "mycake-2-build": NetworkPlugin cni failed to teardown pod "mycake-2-build_local-php" network: failed to send CNI request: Post http://dummy/: dial unix /var/run/openshift-sdn/cni-server.sock: connect: connection refused]
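The failing call in the error above is a POST to the openshift-sdn CNI server over a Unix socket. A quick check of whether that socket even exists on the node (a sketch; the path is taken from the error message above):

```shell
# Check for the CNI server socket the kubelet error points at (run on the node).
# test -S succeeds only for an existing socket file; "missing" matches the
# connection-refused symptom when the cni-server is down.
test -S /var/run/openshift-sdn/cni-server.sock && echo present || echo missing
```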

Version-Release number of selected component (if applicable):

Early drop of 3.10 for Power

How reproducible:

Once a day at the least.


Steps to Reproduce:
1. Deploy pods
2. Wait
3. At end of the day deploy pods again

Actual results:

Pod deploy will fail

Expected results:

Pods will deploy without failing.

Additional info:

Comment 1 Manoj Kumar 2018-08-13 13:34:41 UTC
I decided to look at the openshift-sdn project, and it does show some indication of a problem:

[root@c340f1u15 ~]# oc get all
NAME            READY     STATUS             RESTARTS   AGE
pod/ovs-xdbnd   1/1       Running            7          5d
pod/sdn-4jmrp   0/1       CrashLoopBackOff   682        5d

NAME                 DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/ovs   1         1         1         1            1           <none>          5d
daemonset.apps/sdn   1         1         0         1            0           <none>          5d

NAME                                  DOCKER REPO                                           TAGS      UPDATED
imagestream.image.openshift.io/node   docker-registry.default.svc:5000/openshift-sdn/node   v3.10     5 days ago
[root@c340f1u15 ~]# oc logs -f pod/sdn-4jmrp 
2018/08/13 13:30:44 socat[4124] E connect(5, AF=1 "/var/run/openshift-sdn/cni-server.sock", 40): Connection refused
User "sa" set.
Context "default/c340f1u15-pok-stglabs-ibm-com:8443/system:admin" modified.
I0813 13:30:45.397639    4101 start_node.go:317] Reading node configuration from /etc/origin/node/node-config.yaml
I0813 13:30:45.400967    4101 feature_gate.go:190] feature gates: map[RotateKubeletClientCertificate:true RotateKubeletServerCertificate:true]
I0813 13:30:45.401090    4101 start_node.go:467] Starting node networking c340f1u15.pok.stglabs.ibm.com (v3.10.23)
W0813 13:30:45.401245    4101 server.go:195] WARNING: all flags other than --config, --write-config-to, and --cleanup are deprecated. Please begin using a config file ASAP.
I0813 13:30:45.401273    4101 feature_gate.go:226] feature gates: &{{} map[RotateKubeletClientCertificate:true RotateKubeletServerCertificate:true]}
I0813 13:30:45.402639    4101 transport.go:160] Refreshing client certificate from store
I0813 13:30:45.402660    4101 certificate_store.go:131] Loading cert/key pair from "/etc/origin/node/certificates/kubelet-client-current.pem".
I0813 13:30:45.404115    4101 node.go:150] Initializing SDN node of type "redhat/openshift-ovs-subnet" with configured hostname "c340f1u15.pok.stglabs.ibm.com" (IP ""), iptables sync period "30s"
I0813 13:30:45.408112    4101 node.go:292] Starting openshift-sdn network plugin
I0813 13:30:45.480535    4101 ovs.go:162] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
I0813 13:30:45.480564    4101 sdn_controller.go:139] [SDN setup] full SDN setup required (plugin is not setup)
I0813 13:31:15.518223    4101 ovs.go:162] Error executing ovs-vsctl: 2018-08-13T13:31:15Z|00002|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
F0813 13:31:15.518269    4101 network.go:46] SDN node startup failed: node SDN setup failed: signal: alarm clock

Comment 2 Manoj Kumar 2018-08-22 19:44:34 UTC
If you are having trouble recreating it let me know if I can gather any data from my system. I have this happen at least once a day.

Comment 3 Manoj Kumar 2018-09-17 20:28:29 UTC
Created attachment 1484152 [details]
sosreport output requested by Jeffrey Young

Even though the problem reported is an SDN issue after deployment, this sosreport is for an issue seen during deployment. If the two issues are not the same, let me know whether I should open a new BZ for the deployment-time issue.

Comment 4 Manoj Kumar 2018-09-17 21:13:26 UTC
Since the upload directly to BZ did not work, put it up at dropbox.redhat.com

The sosreport file is available at 1615375-sosreport-MKumar.1615375-20180917132659.tar.xz

Comment 5 Casey Callendrello 2018-09-19 09:16:07 UTC
Manoj,
Can you get logs from the OVS pod? It looks like OVS is crashing.

You might have to get the dead pod manually with "docker ps -a", "docker logs <container id>"

Comment 6 Manoj Kumar 2018-09-19 13:54:20 UTC
[root@c340f1u15 ~]# oc project openshift-sdn
Already on project "openshift-sdn" on server "https://c340f1u15.pok.stglabs.ibm.com:8443".
[root@c340f1u15 ~]# oc get all
NAME            READY     STATUS             RESTARTS   AGE
pod/ovs-xdbnd   1/1       Running            22         42d
pod/sdn-7w59v   0/1       CrashLoopBackOff   6519       36d

NAME                 DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/ovs   1         1         1         1            1           <none>          42d
daemonset.apps/sdn   1         1         0         1            0           <none>          42d

NAME                                  DOCKER REPO                                           TAGS      UPDATED
imagestream.image.openshift.io/node   docker-registry.default.svc:5000/openshift-sdn/node   v3.10     6 weeks ago
[root@c340f1u15 ~]# oc logs -f pod/sdn-7w59v 
2018/09/19 13:50:15 socat[24007] E connect(5, AF=1 "/var/run/openshift-sdn/cni-server.sock", 40): Connection refused
User "sa" set.
Context "default/c340f1u15-pok-stglabs-ibm-com:8443/system:admin" modified.
I0919 13:50:16.414339   23979 start_node.go:317] Reading node configuration from /etc/origin/node/node-config.yaml
I0919 13:50:16.418367   23979 feature_gate.go:190] feature gates: map[RotateKubeletClientCertificate:true RotateKubeletServerCertificate:true]
I0919 13:50:16.418515   23979 start_node.go:467] Starting node networking c340f1u15.pok.stglabs.ibm.com (v3.10.23)
W0919 13:50:16.418701   23979 server.go:195] WARNING: all flags other than --config, --write-config-to, and --cleanup are deprecated. Please begin using a config file ASAP.
I0919 13:50:16.418740   23979 feature_gate.go:226] feature gates: &{{} map[RotateKubeletClientCertificate:true RotateKubeletServerCertificate:true]}
I0919 13:50:16.420413   23979 transport.go:160] Refreshing client certificate from store
I0919 13:50:16.420442   23979 certificate_store.go:131] Loading cert/key pair from "/etc/origin/node/certificates/kubelet-client-current.pem".
I0919 13:50:16.422192   23979 node.go:150] Initializing SDN node of type "redhat/openshift-ovs-subnet" with configured hostname "c340f1u15.pok.stglabs.ibm.com" (IP ""), iptables sync period "30s"
I0919 13:50:16.425894   23979 node.go:292] Starting openshift-sdn network plugin
I0919 13:50:16.496517   23979 ovs.go:162] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
I0919 13:50:16.496551   23979 sdn_controller.go:139] [SDN setup] full SDN setup required (plugin is not setup)
I0919 13:50:46.539067   23979 ovs.go:162] Error executing ovs-vsctl: 2018-09-19T13:50:46Z|00002|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
F0919 13:50:46.539111   23979 network.go:46] SDN node startup failed: node SDN setup failed: signal: alarm clock

Comment 7 Casey Callendrello 2018-09-19 14:02:26 UTC
It looks like openvswitch is hung. Can you find out why?

Comment 8 Barry Donahue 2018-09-21 10:44:55 UTC
Is this cluster deployed to bare metal or PowerVM?

Comment 9 Manoj Kumar 2018-09-21 13:08:50 UTC
Bare Metal, Power 9.

How do I find out why openvswitch is hung?

Comment 10 Aaron Conole 2018-10-01 13:17:47 UTC
Look at the logs from /var/log/openvswitch/ovs-vswitchd.log as well as the output from dmesg, etc.
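A minimal collection sketch for the above, assuming the containerized OVS writes its logs under a host-mounted /var/log and using the ovs pod name seen earlier in this bug (adjust to whatever `oc get -n openshift-sdn pods` reports):

```shell
# Pull the vswitchd log out of the OVS pod:
oc exec -n openshift-sdn ovs-xdbnd -- cat /var/log/openvswitch/ovs-vswitchd.log

# Kernel-side clues about a hung openvswitch often land in dmesg:
dmesg -T | grep -iE 'ovs|openvswitch|hung' | tail -n 20
```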

Is it possible to get a login to a system where this happens?

Comment 11 Manoj Kumar 2018-10-01 13:53:06 UTC
I did not find a /var/log/openvswitch directory on the host. 

Here is the output from the openshift-sdn project:

[root@c340f1u15 containers]# oc get all
NAME            READY     STATUS             RESTARTS   AGE
pod/ovs-xdbnd   1/1       Running            27         54d
pod/sdn-7w59v   0/1       CrashLoopBackOff   8000       48d

NAME                 DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/ovs   1         1         0         1            0           <none>          54d
daemonset.apps/sdn   1         1         0         1            0           <none>          54d

NAME                                  DOCKER REPO                                           TAGS      UPDATED
imagestream.image.openshift.io/node   docker-registry.default.svc:5000/openshift-sdn/node   v3.10     7 weeks ago


So I grabbed all the references to sdn-7w59v in /var/lib/docker.

Comment 12 Aaron Conole 2018-10-01 13:55:39 UTC
The /var/log/openvswitch directory will be in one of the pods (in this case, I think it will be in pod/ovs-xdbnd).

Comment 13 Manoj Kumar 2018-10-01 13:55:43 UTC
Created attachment 1489059 [details]
grepping for sdn-7w59v

Here is the output of grep 7w59v in /var/lib/docker

Comment 14 Manoj Kumar 2018-10-01 14:02:33 UTC
[root@c340f1u15 containers]# oc rsh pod/ovs-xdbnd 
error: unable to upgrade connection: pod does not exist

Comment 15 Aaron Conole 2018-10-01 15:53:51 UTC
try 

  oc rsh ovs-xdbnd

I don't know the format for specifying the pod.

Comment 16 Manoj Kumar 2018-10-01 16:00:59 UTC
Same problem:

[root@c340f1u15 containers]# oc rsh ovs-xdbnd 
error: unable to upgrade connection: pod does not exist

Comment 17 Weibin Liang 2018-10-01 16:02:18 UTC
The failure and WARN messages in Comment 1 are the same as those in comment 2 of https://bugzilla.redhat.com/show_bug.cgi?id=1633892

Comment 18 Manoj Kumar 2018-10-01 16:07:40 UTC
If it is the same issue, could I request that the fix for 1633892 be back-ported to 3.10 and made available on ppc64le as well?

Comment 19 Casey Callendrello 2018-10-01 16:24:23 UTC
That backport is already in progress.

When using `oc rsh`, you first have to do `oc get -n openshift-sdn pods` to get valid pod names. Then you can give the current one back to `rsh`.

Comment 20 Manoj Kumar 2018-10-01 16:29:17 UTC
Same issue. Is the networking problem limiting the ability to rsh?

[root@c340f1u15 containers]# oc get -n openshift-sdn pods
NAME        READY     STATUS             RESTARTS   AGE
ovs-xdbnd   1/1       Running            27         54d
sdn-7w59v   0/1       CrashLoopBackOff   8000       49d
[root@c340f1u15 containers]# oc rsh ovs-xdbnd  
error: unable to upgrade connection: pod does not exist

Comment 21 Casey Callendrello 2018-10-01 16:31:35 UTC
Who knows.

You'll have to do some manual digging then. Go to the node in question, find the docker process that is running ovs, and use docker exec. Do you know the general structure of how Kubernetes runs its pods?
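For reference, a sketch of that manual digging. The kubelet names its docker containers k8s_<container>_<pod>_<namespace>_<pod-uid>_<restart-count>, so the OVS container, dead or alive, can be found by namespace (the container name below is a hypothetical example):

```shell
# List all containers, including exited ones, and filter on the SDN namespace:
docker ps -a --format '{{.ID}} {{.Names}}' | grep '_openshift-sdn_'

# With a container ID in hand:
#   docker logs <container-id>
#   docker exec -it <container-id> /bin/sh

# The pod name is the third underscore-delimited field of the container name:
name='k8s_openvswitch_ovs-xdbnd_openshift-sdn_0123abcd_27'  # hypothetical example
echo "$name" | cut -d_ -f3
```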

Comment 22 Manoj Kumar 2018-10-02 13:18:20 UTC
Cannot locate the process.

[root@c340f1u15 ~]# docker ps |grep -i ovs
[root@c340f1u15 ~]# docker ps |grep -i openv
[root@c340f1u15 ~]# docker ps
CONTAINER ID        IMAGE                                                COMMAND                  CREATED             STATUS              PORTS               NAMES
404f7f39f8bd        051b56f67075                                         "/bin/bash -c '#!/..."   27 hours ago        Up 27 hours                             k8s_api_master-api-localhost.localdomain_kube-system_a2828903fa025af361ea430fe51e9ff6_87
aeac7b707f48        82ebb2916975                                         "/bin/sh -c '#!/bi..."   33 hours ago        Up 33 hours                             k8s_etcd_master-etcd-localhost.localdomain_kube-system_76ecc08e792faa61bf0d3c64199e554d_22
0db1fda29b5b        051b56f67075                                         "/bin/bash -c '#!/..."   33 hours ago        Up 33 hours                             k8s_controllers_master-controllers-localhost.localdomain_kube-system_40b8aff16e502487dbf6d7aa689350c1_22
bcd28a73db6a        asolo-ocp-deployer:443/openshift3/ose-pod:v3.10.23   "/usr/bin/pod"           33 hours ago        Up 33 hours                             k8s_POD_master-etcd-localhost.localdomain_kube-system_76ecc08e792faa61bf0d3c64199e554d_22
23b82200160f        asolo-ocp-deployer:443/openshift3/ose-pod:v3.10.23   "/usr/bin/pod"           33 hours ago        Up 33 hours                             k8s_POD_master-controllers-localhost.localdomain_kube-system_40b8aff16e502487dbf6d7aa689350c1_22
f4d88815fd92        asolo-ocp-deployer:443/openshift3/ose-pod:v3.10.23   "/usr/bin/pod"           33 hours ago        Up 33 hours                             k8s_POD_master-api-localhost.localdomain_kube-system_a2828903fa025af361ea430fe51e9ff6_22

Comment 23 Casey Callendrello 2018-10-02 13:28:20 UTC
If this is really the case, and you're not just on the wrong node, then your cluster is very, very, very, very, very broken, and it's not the network's fault.

Can you check very closely for that pod? If "oc get pods" lists it as a running pod, but you don't see it in docker... then your cluster is deeply corrupted.

Make sure you check the pod's NodeIP and are looking on the correct node in the cluster.

Comment 24 Manoj Kumar 2018-10-02 13:44:32 UTC
Yes, there is only one node in this cluster. Whenever I get into this state (once a day) I have been rebooting to recover.

[root@c340f1u15 ~]# oc get pods --all-namespaces -o wide 
NAMESPACE               NAME                                               READY     STATUS             RESTARTS   AGE       IP             NODE
default                 docker-registry-1-qjgv7                            1/1       Running            27         55d       10.128.1.113   c340f1u15.pok.stglabs.ibm.com
default                 registry-console-1-vfcs6                           1/1       Running            28         55d       10.128.1.124   c340f1u15.pok.stglabs.ibm.com
default                 router-1-bdb2h                                     1/1       Running            21         55d       9.114.181.52   c340f1u15.pok.stglabs.ibm.com
kube-system             master-api-c340f1u15.pok.stglabs.ibm.com           1/1       Running            21         55d       9.114.181.52   c340f1u15.pok.stglabs.ibm.com
kube-system             master-controllers-c340f1u15.pok.stglabs.ibm.com   1/1       Running            21         55d       9.114.181.52   c340f1u15.pok.stglabs.ibm.com
kube-system             master-etcd-c340f1u15.pok.stglabs.ibm.com          1/1       Running            21         55d       9.114.181.52   c340f1u15.pok.stglabs.ibm.com
lacme                   acmeair-1-7dmbj                                    1/1       Running            7          31d       10.128.1.114   c340f1u15.pok.stglabs.ibm.com
lacme                   acmeair-1-build                                    0/1       Completed          0          31d       10.128.0.244   c340f1u15.pok.stglabs.ibm.com
local-acme-node         acmeairwithmongo-1-build                           0/1       Completed          0          31d       10.128.0.249   c340f1u15.pok.stglabs.ibm.com
local-acme-node         acmeairwithmongo-1-svpql                           1/1       Running            7          31d       10.128.1.115   c340f1u15.pok.stglabs.ibm.com
local-acme-node         mongodb-1-gdgd4                                    1/1       Running            7          31d       10.128.1.116   c340f1u15.pok.stglabs.ibm.com
local-ruby              myrub-1-build                                      0/1       Completed          0          11d       10.128.1.61    c340f1u15.pok.stglabs.ibm.com
local-ruby              myrub-2-bsgq6                                      1/1       Running            4          11d       10.128.1.117   c340f1u15.pok.stglabs.ibm.com
local-ruby              myrub-2-build                                      0/1       Completed          0          11d       10.128.1.64    c340f1u15.pok.stglabs.ibm.com
local-storage           local-volume-provisioner-k949l                     0/1       CrashLoopBackOff   9031       35d       10.128.1.125   c340f1u15.pok.stglabs.ibm.com
lpark                   mongodb-1-x8wxm                                    1/1       Running            7          31d       10.128.1.122   c340f1u15.pok.stglabs.ibm.com
lpark                   restify-1-build                                    0/1       Completed          0          31d       10.128.0.255   c340f1u15.pok.stglabs.ibm.com
lpark                   restify-1-qg7gb                                    1/1       Running            7          31d       10.128.1.119   c340f1u15.pok.stglabs.ibm.com
nginx-demo              nginx-ex-1-build                                   0/1       Completed          0          7d        10.128.1.110   c340f1u15.pok.stglabs.ibm.com
nginx-demo              nginx-ex-1-wcgrm                                   1/1       Running            1          7d        10.128.1.120   c340f1u15.pok.stglabs.ibm.com
openshift-node          sync-b47mf                                         1/1       Running            20         55d       9.114.181.52   c340f1u15.pok.stglabs.ibm.com
openshift-sdn           ovs-xdbnd                                          1/1       Running            27         55d       9.114.181.52   c340f1u15.pok.stglabs.ibm.com
openshift-sdn           sdn-7w59v                                          0/1       CrashLoopBackOff   8000       49d       9.114.181.52   c340f1u15.pok.stglabs.ibm.com
openshift-web-console   webconsole-6dff45c68d-whrtw                        1/1       Running            27         55d       10.128.1.121   c340f1u15.pok.stglabs.ibm.com
[root@c340f1u15 ~]# docker ps
CONTAINER ID        IMAGE                                                COMMAND                  CREATED             STATUS              PORTS               NAMES
404f7f39f8bd        051b56f67075                                         "/bin/bash -c '#!/..."   28 hours ago        Up 28 hours                             k8s_api_master-api-localhost.localdomain_kube-system_a2828903fa025af361ea430fe51e9ff6_87
aeac7b707f48        82ebb2916975                                         "/bin/sh -c '#!/bi..."   34 hours ago        Up 34 hours                             k8s_etcd_master-etcd-localhost.localdomain_kube-system_76ecc08e792faa61bf0d3c64199e554d_22
0db1fda29b5b        051b56f67075                                         "/bin/bash -c '#!/..."   34 hours ago        Up 34 hours                             k8s_controllers_master-controllers-localhost.localdomain_kube-system_40b8aff16e502487dbf6d7aa689350c1_22
bcd28a73db6a        asolo-ocp-deployer:443/openshift3/ose-pod:v3.10.23   "/usr/bin/pod"           34 hours ago        Up 34 hours                             k8s_POD_master-etcd-localhost.localdomain_kube-system_76ecc08e792faa61bf0d3c64199e554d_22
23b82200160f        asolo-ocp-deployer:443/openshift3/ose-pod:v3.10.23   "/usr/bin/pod"           34 hours ago        Up 34 hours                             k8s_POD_master-controllers-localhost.localdomain_kube-system_40b8aff16e502487dbf6d7aa689350c1_22
f4d88815fd92        asolo-ocp-deployer:443/openshift3/ose-pod:v3.10.23   "/usr/bin/pod"           34 hours ago        Up 34 hours                             k8s_POD_master-api-localhost.localdomain_kube-system_a2828903fa025af361ea430fe51e9ff6_22

Comment 25 Casey Callendrello 2018-10-02 14:04:31 UTC
Well, the next time your node is in a bad state, you need to figure out why the ovs pod crashed. You will have to use docker ps -a to find it. If you reboot it, the logs will be lost.

It's time to delete and recreate your cluster, fwiw. It will make debugging easier.

Comment 26 Manoj Kumar 2018-10-02 14:13:32 UTC
It is in the bad state now, and I have left it in that state to debug. If the logs are long gone, I can reboot and it will take me another day to get into that state.

I can't delete this cluster: it is the only fully functional cluster I have for demonstration purposes (together with Red Hat) at an upcoming conference. The others I tried to set up fail with:

https://bugzilla.redhat.com/show_bug.cgi?id=1625817

If the ppc64le version GAs today, I can try to use the public registry as opposed to the private registry method on a new server.

Comment 27 Manoj Kumar 2018-10-02 15:21:18 UTC
From Weibin's note yesterday, it looks like a fix for 1633892 is available. Can you get me a version of that on ppc64le to try?

https://bugzilla.redhat.com/show_bug.cgi?id=1633892
https://github.com/openshift/openshift-ansible/pull/10280

Comment 30 Manoj Kumar 2018-12-11 14:23:57 UTC
I deleted my cluster and re-created it at 3.10.72. I am still seeing the issue.

[root@c340f1u15 ~]# oc get all
NAME            READY     STATUS             RESTARTS   AGE
pod/ovs-jwvwd   1/1       Running            0          3d
pod/sdn-5qm95   0/1       CrashLoopBackOff   731        3d

[root@c340f1u15 ~]# oc logs -f pod/sdn-5qm95
2018/12/11 14:18:40 socat[20194] E connect(5, AF=1 "/var/run/openshift-sdn/cni-server.sock", 40): Connection refused
User "sa" set.
Context "default/c340f1u15-pok-stglabs-ibm-com:8443/system:admin" modified.
I1211 14:18:41.832696   20154 start_node.go:317] Reading node configuration from /etc/origin/node/node-config.yaml
I1211 14:18:41.844785   20154 feature_gate.go:190] feature gates: map[RotateKubeletServerCertificate:true RotateKubeletClientCertificate:true]
I1211 14:18:41.844922   20154 start_node.go:467] Starting node networking c340f1u15.pok.stglabs.ibm.com (v3.10.72)
W1211 14:18:41.845079   20154 server.go:195] WARNING: all flags other than --config, --write-config-to, and --cleanup are deprecated. Please begin using a config file ASAP.
I1211 14:18:41.845107   20154 feature_gate.go:226] feature gates: &{{} map[RotateKubeletClientCertificate:true RotateKubeletServerCertificate:true]}
I1211 14:18:41.847970   20154 transport.go:160] Refreshing client certificate from store
I1211 14:18:41.848906   20154 certificate_store.go:131] Loading cert/key pair from "/etc/origin/node/certificates/kubelet-client-current.pem".
I1211 14:18:41.853063   20154 node.go:150] Initializing SDN node of type "redhat/openshift-ovs-subnet" with configured hostname "c340f1u15.pok.stglabs.ibm.com" (IP ""), iptables sync period "30s"
I1211 14:18:41.902234   20154 node.go:292] Starting openshift-sdn network plugin
I1211 14:18:42.033860   20154 ovs.go:166] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
I1211 14:18:42.033892   20154 sdn_controller.go:139] [SDN setup] full SDN setup required (plugin is not setup)
I1211 14:19:12.075047   20154 ovs.go:166] Error executing ovs-vsctl: 2018-12-11T14:19:12Z|00002|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
F1211 14:19:12.075125   20154 network.go:46] SDN node startup failed: node SDN setup failed: signal: alarm clock

Comment 31 Manoj Kumar 2018-12-11 14:24:50 UTC
[root@c340f1u15 ~]# rpm -aq |grep openshif
atomic-openshift-clients-3.10.72-1.git.0.3cb2fdc.el7.ppc64le
openshift-ansible-roles-3.10.73-1.git.0.8b65cea.el7.noarch
openshift-ansible-playbooks-3.10.73-1.git.0.8b65cea.el7.noarch
atomic-openshift-excluder-3.10.72-1.git.0.3cb2fdc.el7.noarch
atomic-openshift-node-3.10.72-1.git.0.3cb2fdc.el7.ppc64le
openshift-ansible-docs-3.10.73-1.git.0.8b65cea.el7.noarch
atomic-openshift-docker-excluder-3.10.72-1.git.0.3cb2fdc.el7.noarch
atomic-openshift-3.10.72-1.git.0.3cb2fdc.el7.ppc64le
openshift-ansible-3.10.73-1.git.0.8b65cea.el7.noarch
atomic-openshift-hyperkube-3.10.72-1.git.0.3cb2fdc.el7.ppc64le

Comment 32 Casey Callendrello 2018-12-11 16:21:38 UTC
Can you get the logs from the ovs pod?

"oc rsh -n openshift-sdn ovs-jwvwd cat '/var/log/openvswitch/*.log'"
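One caveat with that command: `oc rsh`/`oc exec` run the given argv directly in the container, without a shell, so the quoted glob reaches `cat` as a literal filename. Wrapping the command in a shell inside the pod lets the pattern expand there (a sketch, same pod name):

```shell
# The sh -c wrapper makes the container's shell expand *.log before cat runs:
oc exec -n openshift-sdn ovs-jwvwd -- sh -c 'cat /var/log/openvswitch/*.log'
```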

Comment 33 Manoj Kumar 2018-12-11 16:33:09 UTC
BTW, this is my only cluster on Power9.

sh-4.2# cat ovsdb-server.log 
2018-12-07T21:41:44.099Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovsdb-server.log
2018-12-07T21:41:44.210Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.9.0
2018-12-07T21:41:54.219Z|00003|memory|INFO|8128 kB peak resident set size after 10.1 seconds
2018-12-07T21:41:54.219Z|00004|memory|INFO|cells:217 json-caches:1 monitors:1 sessions:2


sh-4.2# cat ovs-vswitchd.log 
2018-12-07T21:41:44.567Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log
2018-12-07T21:41:44.581Z|00002|ovs_numa|INFO|Discovered 64 CPU cores on NUMA node 0
2018-12-07T21:41:44.581Z|00003|ovs_numa|INFO|Discovered 1 NUMA nodes and 64 CPU cores
2018-12-07T21:41:44.581Z|00004|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2018-12-07T21:41:44.582Z|00005|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connected
2018-12-07T21:41:44.587Z|00006|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.9.0
2018-12-07T21:41:45.430Z|00007|ofproto_dpif|INFO|system@ovs-system: Datapath supports recirculation
2018-12-07T21:41:45.430Z|00008|ofproto_dpif|INFO|system@ovs-system: VLAN header stack length probed as 2
2018-12-07T21:41:45.430Z|00009|ofproto_dpif|INFO|system@ovs-system: MPLS label stack length probed as 1
2018-12-07T21:41:45.430Z|00010|ofproto_dpif|INFO|system@ovs-system: Datapath supports truncate action
2018-12-07T21:41:45.430Z|00011|ofproto_dpif|INFO|system@ovs-system: Datapath supports unique flow ids
2018-12-07T21:41:45.430Z|00012|ofproto_dpif|INFO|system@ovs-system: Datapath does not support clone action
2018-12-07T21:41:45.430Z|00013|ofproto_dpif|INFO|system@ovs-system: Max sample nesting level probed as 10
2018-12-07T21:41:45.430Z|00014|ofproto_dpif|INFO|system@ovs-system: Datapath supports eventmask in conntrack action
2018-12-07T21:41:45.430Z|00015|ofproto_dpif|INFO|system@ovs-system: Datapath does not support ct_clear action
2018-12-07T21:41:45.430Z|00016|ofproto_dpif|INFO|system@ovs-system: Datapath supports ct_state
2018-12-07T21:41:45.430Z|00017|ofproto_dpif|INFO|system@ovs-system: Datapath supports ct_zone
2018-12-07T21:41:45.430Z|00018|ofproto_dpif|INFO|system@ovs-system: Datapath supports ct_mark
2018-12-07T21:41:45.430Z|00019|ofproto_dpif|INFO|system@ovs-system: Datapath supports ct_label
2018-12-07T21:41:45.430Z|00020|ofproto_dpif|INFO|system@ovs-system: Datapath supports ct_state_nat
2018-12-07T21:41:45.430Z|00021|ofproto_dpif|INFO|system@ovs-system: Datapath supports ct_orig_tuple
2018-12-07T21:41:45.430Z|00022|ofproto_dpif|INFO|system@ovs-system: Datapath supports ct_orig_tuple6
2018-12-07T21:41:45.444Z|00023|bridge|INFO|bridge br0: added interface br0 on port 65534
2018-12-07T21:41:45.444Z|00024|bridge|INFO|bridge br0: using datapath ID 00008e1d8fca3a4e
2018-12-07T21:41:45.444Z|00025|connmgr|INFO|br0: added service controller "punix:/var/run/openvswitch/br0.mgmt"
2018-12-07T21:41:45.475Z|00026|bridge|INFO|bridge br0: added interface vxlan0 on port 1
2018-12-07T21:41:45.578Z|00027|bridge|INFO|bridge br0: added interface tun0 on port 2
2018-12-07T21:41:45.663Z|00028|connmgr|INFO|br0<->unix#2: 42 flow_mods in the last 0 s (42 adds)
2018-12-07T21:41:45.668Z|00029|connmgr|INFO|br0<->unix#4: 1 flow_mods in the last 0 s (1 adds)
2018-12-07T21:41:45.797Z|00030|connmgr|INFO|br0<->unix#9: 4 flow_mods in the last 0 s (4 adds)
2018-12-07T21:41:54.588Z|00031|memory|INFO|140096 kB peak resident set size after 10.0 seconds
2018-12-07T21:41:54.588Z|00032|memory|INFO|handlers:10 ports:3 revalidators:4 rules:51 udpif keys:1
2018-12-07T21:43:44.171Z|00033|connmgr|INFO|br0<->unix#17: 4 flow_mods in the last 0 s (4 adds)
2018-12-07T21:43:47.710Z|00034|bridge|INFO|bridge br0: added interface vethf59aad27 on port 3
2018-12-07T21:43:47.710Z|00001|ofproto_dpif_upcall(handler2)|INFO|received packet on unassociated datapath port 4
2018-12-07T21:43:47.764Z|00035|connmgr|INFO|br0<->unix#22: 4 flow_mods in the last 0 s (4 adds)
2018-12-07T21:43:47.789Z|00036|connmgr|INFO|br0<->unix#24: 2 flow_mods in the last 0 s (2 deletes)
2018-12-07T21:43:50.360Z|00037|connmgr|INFO|br0<->unix#26: 2 flow_mods in the last 0 s (2 adds)
2018-12-07T21:44:04.962Z|00038|bridge|INFO|bridge br0: added interface veth54a2f53f on port 4
2018-12-07T21:44:05.064Z|00039|connmgr|INFO|br0<->unix#28: 4 flow_mods in the last 0 s (4 adds)
2018-12-07T21:44:05.090Z|00040|connmgr|INFO|br0<->unix#30: 2 flow_mods in the last 0 s (2 deletes)
2018-12-07T21:44:09.141Z|00041|connmgr|INFO|br0<->unix#32: 2 flow_mods in the last 0 s (2 adds)
2018-12-07T21:44:11.663Z|00042|bridge|INFO|bridge br0: added interface vethf7cf67e9 on port 5
2018-12-07T21:44:11.760Z|00043|connmgr|INFO|br0<->unix#34: 4 flow_mods in the last 0 s (4 adds)
2018-12-07T21:44:11.782Z|00044|connmgr|INFO|br0<->unix#36: 2 flow_mods in the last 0 s (2 deletes)
2018-12-07T21:44:12.263Z|00045|bridge|INFO|bridge br0: added interface veth9c5fc93d on port 6
2018-12-07T21:44:12.364Z|00046|connmgr|INFO|br0<->unix#38: 4 flow_mods in the last 0 s (4 adds)
2018-12-07T21:44:12.391Z|00047|connmgr|INFO|br0<->unix#40: 2 flow_mods in the last 0 s (2 deletes)
2018-12-07T21:44:12.395Z|00048|connmgr|INFO|br0<->unix#42: 2 flow_mods in the last 0 s (2 deletes)
2018-12-07T21:44:12.464Z|00049|connmgr|INFO|br0<->unix#44: 4 flow_mods in the last 0 s (4 deletes)
2018-12-07T21:44:12.493Z|00050|bridge|INFO|bridge br0: deleted interface vethf59aad27 on port 3
2018-12-07T21:44:17.971Z|00051|connmgr|INFO|br0<->unix#48: 2 flow_mods in the last 0 s (2 adds)
2018-12-07T21:44:18.638Z|00052|bridge|INFO|bridge br0: added interface vethc9382a95 on port 7
2018-12-07T21:44:18.665Z|00053|connmgr|INFO|br0<->unix#50: 4 flow_mods in the last 0 s (4 adds)
2018-12-07T21:44:18.775Z|00054|connmgr|INFO|br0<->unix#52: 2 flow_mods in the last 0 s (2 deletes)
2018-12-07T21:44:21.163Z|00055|bridge|INFO|bridge br0: added interface vethccb8dd9b on port 8
2018-12-07T21:44:21.265Z|00056|connmgr|INFO|br0<->unix#54: 4 flow_mods in the last 0 s (4 adds)
2018-12-07T21:44:21.288Z|00057|connmgr|INFO|br0<->unix#56: 2 flow_mods in the last 0 s (2 deletes)
2018-12-07T21:44:24.360Z|00058|connmgr|INFO|br0<->unix#58: 2 flow_mods in the last 0 s (2 deletes)
2018-12-07T21:44:24.376Z|00059|connmgr|INFO|br0<->unix#60: 4 flow_mods in the last 0 s (4 deletes)
2018-12-07T21:44:24.460Z|00060|bridge|INFO|bridge br0: deleted interface veth54a2f53f on port 4
2018-12-07T21:44:32.026Z|00061|connmgr|INFO|br0<->unix#62: 2 flow_mods in the last 0 s (2 deletes)
2018-12-07T21:44:32.066Z|00062|connmgr|INFO|br0<->unix#64: 4 flow_mods in the last 0 s (4 deletes)
2018-12-07T21:44:32.093Z|00063|bridge|INFO|bridge br0: deleted interface veth9c5fc93d on port 6
2018-12-07T21:49:56.184Z|00064|connmgr|INFO|br0<->unix#91: 2 flow_mods in the last 0 s (2 adds)
2018-12-07T21:49:58.451Z|00065|bridge|INFO|bridge br0: added interface veth1951bb32 on port 9
2018-12-07T21:49:58.464Z|00066|connmgr|INFO|br0<->unix#93: 4 flow_mods in the last 0 s (4 adds)
2018-12-07T21:49:58.488Z|00067|connmgr|INFO|br0<->unix#95: 2 flow_mods in the last 0 s (2 deletes)
2018-12-07T21:51:20.260Z|00068|connmgr|INFO|br0<->unix#103: 2 flow_mods in the last 0 s (2 deletes)
2018-12-07T21:51:20.275Z|00069|connmgr|INFO|br0<->unix#105: 4 flow_mods in the last 0 s (4 deletes)
2018-12-07T21:51:20.363Z|00070|bridge|INFO|bridge br0: deleted interface veth1951bb32 on port 9
2018-12-07T21:51:20.462Z|00071|bridge|INFO|bridge br0: added interface vethccd6a2db on port 10
2018-12-07T21:51:20.472Z|00072|connmgr|INFO|br0<->unix#107: 4 flow_mods in the last 0 s (4 adds)
2018-12-07T21:51:20.576Z|00073|connmgr|INFO|br0<->unix#109: 2 flow_mods in the last 0 s (2 deletes)
2018-12-07T21:51:25.862Z|00074|bridge|INFO|bridge br0: added interface veth411f8817 on port 11
2018-12-07T21:51:25.960Z|00075|connmgr|INFO|br0<->unix#111: 4 flow_mods in the last 0 s (4 adds)
2018-12-07T21:51:25.979Z|00076|connmgr|INFO|br0<->unix#113: 2 flow_mods in the last 0 s (2 deletes)
2018-12-07T21:51:29.869Z|00077|connmgr|INFO|br0<->unix#115: 2 flow_mods in the last 0 s (2 deletes)
2018-12-07T21:51:29.890Z|00078|connmgr|INFO|br0<->unix#117: 4 flow_mods in the last 0 s (4 deletes)
2018-12-07T21:51:29.928Z|00079|bridge|INFO|bridge br0: deleted interface vethccd6a2db on port 10
2018-12-08T17:37:42.960Z|00002|daemon_unix(monitor)|INFO|pid 33984 died, killed (Killed), exiting
sh-4.2#

Comment 34 Manoj Kumar 2019-01-08 16:17:05 UTC
Casey: I tried this on 3.11 as well and can recreate it. It may not be tied to the OpenShift level at all, but rather to OVS on Power9. Did you find anything in the logs?

Comment 35 Manoj Kumar 2019-01-10 20:42:45 UTC
I found another work-around. Instead of rebooting the node, if I delete both the OVS and SDN pods in the openshift-sdn namespace, the services come back fine.

oc delete pod/ovs-7spft
oc delete pod/sdn-gcckt
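The pod name suffixes above (7spft, gcckt) are generated per node, so on another node they must be looked up first. A minimal sketch of the same workaround, assuming admin access and the default openshift-sdn namespace; the node hostname is a placeholder:

```shell
# Hypothetical node name; substitute the affected node's hostname.
NODE=c340f1u15.pok.stglabs.ibm.com
# List the ovs/sdn pods scheduled on that node, then delete them so their
# daemonsets recreate them (equivalent to the two deletes above).
oc -n openshift-sdn get pods -o wide | awk -v n="$NODE" '$0 ~ n {print $1}' \
  | xargs -r oc -n openshift-sdn delete pod
```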

Comment 36 Manoj Kumar 2019-01-24 15:45:10 UTC
I installed OpenShift on an AC922 (also Power9) system this week, and now this issue is very bothersome. The sdn pod crashes several times a day.
Can we get some attention on this problem? I have raised the severity to 'urgent', as this will be deployed at a client site in a couple of weeks.

Comment 37 Harsha Polavarapu 2019-01-24 15:57:58 UTC
(In reply to Manoj Kumar from comment #35)
> I found another work-around. Instead of rebooting the node, if I delete both
> the OVS and SDN pods in the openshift-sdn workspace, then the services come
> back fine.
> 
> oc delete pod/ovs-7spft
> oc delete pod/sdn-gcckt

Manoj,

We hit the same issue yesterday (01.24.2019) after upgrading OpenShift (on RHEL 7.6) from 3.9.33 to 3.10.89 over the last weekend. When I deleted the OVS and SDN pods, the apps started working for some time but failed again after a few hours. When we checked the journalctl logs, they reported out-of-memory kills (journalctl | grep -i OOM-kill), and the Red Hat technician confirmed the node had run out of memory. Since increasing the memory on the node we have not seen the issue, and it has been 24 hours now. We are monitoring the situation and will update if I see any issues.
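The journal check mentioned above can be sketched as a one-line filter; it is shown here against a sample kernel log line, since on a real node the input would come from `journalctl -k`:

```shell
# Sample line standing in for real `journalctl -k` output on an affected node.
sample='Jan 24 03:12:01 node1 kernel: Out of memory: Kill process 3398 (ovs-vswitchd)'
# Count OOM-killer hits; a non-zero count means the kernel killed a process.
printf '%s\n' "$sample" | grep -ci 'out of memory\|oom-kill'
```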

Thanks,
Harsha Polavarapu.

Comment 38 Dan Williams 2019-02-01 22:29:08 UTC
Can we get clarification if the problem is actually always OOM kill of the OVS container?  We may need to bump the expected memory limits of the OVS container to match how POWER9 uses memory.

If it's some other crash of the OVS container in the SDN pod then we should reassign this bug to the OVS team.

Comment 39 Manoj Kumar 2019-02-01 22:41:48 UTC
I spent a good part of this week trying to validate Harsha's assertion. To me it is not clear that the OOM kill is related to OVS at all, at least on Power9.

At times I find that the OVS container dies without any recent oom-kill, and /var/log/messages gives no indication that OVS is involved.

Comment 40 Harsha Polavarapu 2019-02-02 00:43:59 UTC
(In reply to Manoj Kumar from comment #39)
> I spent a good part of this week trying to validate Harsha's assertion. To
> me it is not clear that OOM kill is related to OVS at all, at least on
> Power9.
> 
> At times I find that the OVS container dies, without a recent oom-kill. In
> /var/log/messages I get no indication that OVS is involved.

Manoj,

The root cause of the issue is systemd-journal consuming more memory over time. Red Hat support acknowledged a memory leak, which in turn was killing the OVS and SDN pods since their memory is not guaranteed. We set the values below so that memory is guaranteed for the OVS and SDN pods, and we now recycle systemd-journal daily at midnight.

# oc project openshift-sdn 

# oc edit ds sdn       

Change the limits to equal the requests

        resources:
          limits:
            cpu: 300m
            memory: 1000Mi
          requests:
            cpu: 300m
            memory: 1000Mi

# oc project openshift-sdn 

# oc edit ds ovs 

Change the limits to equal the requests

        resources:
          limits:
            cpu: 200m
            memory: 400Mi
          requests:
            cpu: 200m
            memory: 400Mi
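The two edits above can also be applied non-interactively with strategic-merge patches. A sketch, assuming the containers in these daemonsets are named "sdn" and "openvswitch" respectively (the names are assumptions; confirm them with `oc -n openshift-sdn get ds sdn ovs -o yaml` first):

```shell
# Patch the sdn daemonset so requests equal limits (values as above).
oc -n openshift-sdn patch ds sdn -p \
  '{"spec":{"template":{"spec":{"containers":[{"name":"sdn","resources":{"requests":{"cpu":"300m","memory":"1000Mi"},"limits":{"cpu":"300m","memory":"1000Mi"}}}]}}}}'
# Same for the ovs daemonset; "openvswitch" as the container name is an assumption.
oc -n openshift-sdn patch ds ovs -p \
  '{"spec":{"template":{"spec":{"containers":[{"name":"openvswitch","resources":{"requests":{"cpu":"200m","memory":"400Mi"},"limits":{"cpu":"200m","memory":"400Mi"}}}]}}}}'
```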

Today Red Hat released the bugfix below for the issue and asked us to update systemd (yum upgrade systemd).
https://access.redhat.com/errata/RHSA-2019:0201

Hopefully this resolves the issue.

Thanks,
Harsha Polavarapu.

Comment 41 Jeff Young 2019-02-04 13:36:37 UTC
I'm able to see this behavior on a Power9 machine (I can't duplicate it on a Power8).

As you can see below, almost all of the pods lose networking and are restarted 2-4 times an hour. kube-service-catalog and openshift-monitoring seem particularly sensitive.
I can provide access to this system.

[ibm-p9b-26 ~]# oc get pods --all-namespaces
NAMESPACE                           NAME                                                       READY     STATUS      RESTARTS   AGE
default                             docker-registry-1-h8tpt                                    1/1       Running     32         15h
default                             registry-console-1-xww6v                                   1/1       Running     32         15h
default                             router-1-4l2sd                                             1/1       Running     0          15h
kube-service-catalog                apiserver-nl4hw                                            1/1       Running     32         15h
kube-service-catalog                controller-manager-crd77                                   1/1       Running     158        15h
kube-system                         master-api-ibm-p9b-26.khw.lab.eng.bos.redhat.com           1/1       Running     0          15h
kube-system                         master-controllers-ibm-p9b-26.khw.lab.eng.bos.redhat.com   1/1       Running     0          15h
kube-system                         master-etcd-ibm-p9b-26.khw.lab.eng.bos.redhat.com          1/1       Running     0          15h
local-storage                       local-volume-provisioner-wlb9q                             1/1       Running     64         15h
openshift-ansible-service-broker    asb-1-ll56b                                                1/1       Running     63         15h
openshift-console                   console-7456d9547b-rgpdg                                   1/1       Running     32         15h
openshift-monitoring                alertmanager-main-0                                        3/3       Running     97         15h
openshift-monitoring                alertmanager-main-1                                        3/3       Running     96         15h
openshift-monitoring                alertmanager-main-2                                        3/3       Running     96         15h
openshift-monitoring                cluster-monitoring-operator-6f5fbd6f8b-k5x64               1/1       Running     132        15h
openshift-monitoring                grafana-857fc848bf-jddjl                                   2/2       Running     64         15h
openshift-monitoring                kube-state-metrics-6b4f6c5c5b-2n5sr                        3/3       Running     96         15h
openshift-monitoring                node-exporter-5vwc4                                        2/2       Running     0          15h
openshift-monitoring                prometheus-k8s-0                                           4/4       Running     130        15h
openshift-monitoring                prometheus-k8s-1                                           4/4       Running     130        15h
openshift-monitoring                prometheus-operator-7855c8646b-89knw                       1/1       Running     32         15h
openshift-node                      sync-s78xp                                                 1/1       Running     0          15h
openshift-sdn                       ovs-ndl6f                                                  1/1       Running     32         15h
openshift-sdn                       sdn-dd4j8                                                  1/1       Running     32         15h
openshift-template-service-broker   apiserver-v65jx                                            1/1       Running     59         15h
openshift-web-console               webconsole-7f7f679596-7kkcr                                1/1       Running     33         15h
operator-lifecycle-manager          catalog-operator-54f866c7d5-ltk2f                          1/1       Running     62         15h
operator-lifecycle-manager          olm-operator-55fd8f9bbb-fd7qj                              1/1       Running     63         15h
test                                dancer-mysql-example-1-4t5k2                               1/1       Running     62         15h
test                                dancer-mysql-example-1-build                               0/1       Completed   0          15h
test                                database-1-p42nh                                           1/1       Running     64         15h
test                                httpd-example-1-5wdf5                                      1/1       Running     61         15h
test                                httpd-example-1-build                                      0/1       Completed   0          15h


43m 43m 1 prometheus-operator-7855c8646b-89knw.15802a40ee4c7b24  Pod  Warning   NetworkFailed    openshift-sdn, ibm-p9b-26.khw.lab.eng.bos.redhat.com   The pod's network interface has been lost and the pod will be stopped.

Comment 42 Jeff Young 2019-02-04 14:01:25 UTC
I hadn't made the changes that Harsha ^^^ suggested. Once I made them, this cluster seemed to stabilize. I will continue to monitor.
thanks!

Comment 43 Jeff Young 2019-02-04 16:28:41 UTC
The cluster didn't stay stable for long; I'm back to seeing consistent network failures across projects, with ovs/sdn crashes twice an hour.

This machine is an all-in-one node:
128 cpu
62GB memory
50GB disk  (using about 56% now)
rhel-alt 7.6 (power9 has to use rhel-alt)

Are there any specific logs I can provide? Alternatively, I can provide access to this system (inside the Red Hat VPN).

Comment 44 Jeff Young 2019-02-13 18:30:04 UTC
This PR: https://github.com/openshift/openshift-ansible/pull/11150

From this bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1669311

Fixed the issue on my cluster.

Ok to close?

Comment 46 errata-xmlrpc 2019-03-14 02:15:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0405

Comment 47 devopsweb333 2019-07-12 18:55:07 UTC
Running the following resolved my issue:
# kubeadm init --pod-network-cidr=10.244.0.0/16

When using flannel as the CNI, kubeadm init needs the --pod-network-cidr=... argument set to match flannel's overlay network.
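For reference, the CIDR passed to kubeadm has to match flannel's own configuration. Flannel's default net-conf.json (from the kube-flannel ConfigMap in the upstream manifest) is:

```json
{
  "Network": "10.244.0.0/16",
  "Backend": {
    "Type": "vxlan"
  }
}
```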

