Bug 1733574 - Containers get stuck in "Container Creating" status with Error deleting network: failed to send CNI request: Post http://dummy/: dial unix /var/run/openshift-sdn/cni-server.sock: connect: connection refused in events logs [NEEDINFO]
Keywords:
Status: NEW
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.11.0
Hardware: Unspecified
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.11.z
Assignee: Alexander Constantinescu
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-07-26 14:58 UTC by Joel Rosental R.
Modified: 2020-02-18 08:03 UTC
CC List: 21 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-12-17 08:25:32 UTC
Target Upstream Version:
csekar: needinfo? (aconstan)



Description Joel Rosental R. 2019-07-26 14:58:29 UTC
Description of problem:
Containers do not start on some nodes and get stuck in "Container Creating" status. The events log shows the following error:

Jun 19 13:18:58 ocp2-prnode13 atomic-openshift-node[27308]: I0619 13:18:58.809102   27308 kuberuntime_manager.go:403] No ready sandbox for pod "logging-fluentd-5fzqh_logging(46961f09-8e78-11e9-b367-005056980be3)" can be found. Need to start a new one
Jun 19 13:18:58 ocp2-prnode13 atomic-openshift-node[27308]: W0619 13:18:58.817480   27308 cni.go:243] CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "f0153c63fa3da35f14e914e9c5bb35352e71987892d32649c575833ec27dd040"
Jun 19 13:18:58 ocp2-prnode13 atomic-openshift-node[27308]: E0619 13:18:58.831641   27308 cni.go:280] Error deleting network: failed to send CNI request: Post http://dummy/: dial unix /var/run/openshift-sdn/cni-server.sock: connect: connection refused
Jun 19 13:18:58 ocp2-prnode13 atomic-openshift-node[27308]: E0619 13:18:58.835453   27308 remote_runtime.go:115] StopPodSandbox "f0153c63fa3da35f14e914e9c5bb35352e71987892d32649c575833ec27dd040" from runtime service failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod "logging-fluentd-5fzqh_logging" network: failed to send CNI request: Post http://dummy/: dial unix /var/run/openshift-sdn/cni-server.sock: connect: connection refused
Jun 19 13:18:58 ocp2-prnode13 atomic-openshift-node[27308]: E0619 13:18:58.835548   27308 kuberuntime_manager.go:799] Failed to stop sandbox {"docker" "f0153c63fa3da35f14e914e9c5bb35352e71987892d32649c575833ec27dd040"}
Jun 19 13:18:58 ocp2-prnode13 atomic-openshift-node[27308]: E0619 13:18:58.835639   27308 kuberuntime_manager.go:594] killPodWithSyncResult failed: failed to "KillPodSandbox" for "46961f09-8e78-11e9-b367-005056980be3" with KillPodSandboxError: "rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod \"logging-fluentd-5fzqh_logging\" network: failed to send CNI request: Post http://dummy/: dial unix /var/run/openshift-sdn/cni-server.sock: connect: connection refused"
Jun 19 13:18:58 ocp2-prnode13 atomic-openshift-node[27308]: E0619 13:18:58.835699   27308 pod_workers.go:186] Error syncing pod 46961f09-8e78-11e9-b367-005056980be3 ("logging-fluentd-5fzqh_logging(46961f09-8e78-11e9-b367-005056980be3)"), skipping: failed to "KillPodSandbox" for "46961f09-8e78-11e9-b367-005056980be3" with KillPodSandboxError: "rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod \"logging-fluentd-5fzqh_logging\" network: failed to send CNI request: Post http://dummy/: dial unix /var/run/openshift-sdn/cni-server.sock: connect: connection refused"

$ ls -l /etc/cni/net.d/
total 4
-rw-r--r--. 1 root root 83 Jul 15 09:59 80-openshift-network.conf

$ cat /etc/cni/net.d/80-openshift-network.conf
{
  "cniVersion": "0.2.0",
  "name": "openshift-sdn",
  "type": "openshift-sdn"
}
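The config above parses as valid CNI JSON. If file corruption were suspected on an affected node, it can be checked quickly with a JSON validator (this is just an illustrative sanity check, not a step from the case):

```shell
# Validate the CNI config file; python3 -m json.tool exits
# non-zero and prints the parse error if the JSON is malformed.
python3 -m json.tool /etc/cni/net.d/80-openshift-network.conf
```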

$ ls -l /var/lib/cni/networks/openshift-sdn/
total 24
-rw-r--r--. 1 root root 64 May 29 12:30 10.129.6.116
-rw-r--r--. 1 root root 64 Jul 15 09:59 10.129.6.117
-rw-r--r--. 1 root root 64 Jul 15 11:01 10.129.6.119
-rw-r--r--. 1 root root 64 Jul 15 11:05 10.129.6.120
-rw-r--r--. 1 root root 64 May 22 19:29 10.129.6.27
-rw-r--r--. 1 root root 12 Jul 15 11:05 last_reserved_ip.0

Checking sar statistics, there does not appear to have been any compute resource starvation.

Both the sdn-* and ovs-* pods for all nodes are up and running in the openshift-sdn namespace.

Deleting the ovs and sdn pods results in the ovs pods being re-created, while the sdn pod gets stuck in "Terminating" status with no logs produced.

Logs for the ovs-* and sdn-* pods, taken about 30 minutes after restarting them:

Starting ovsdb-server [  OK  ]
Configuring Open vSwitch system IDs [  OK  ]
Starting ovs-vswitchd [  OK  ]
Enabling remote OVSDB managers [  OK  ]
==> /var/log/openvswitch/ovs-vswitchd.log <==
2019-07-19T15:53:08.987Z|00020|ofproto_dpif|INFO|system@ovs-system: Datapath supports ct_state_nat
2019-07-19T15:53:08.987Z|00021|ofproto_dpif|INFO|system@ovs-system: Datapath supports ct_orig_tuple
2019-07-19T15:53:08.987Z|00022|ofproto_dpif|INFO|system@ovs-system: Datapath supports ct_orig_tuple6
2019-07-19T15:53:09.020Z|00023|bridge|INFO|bridge br0: added interface tun0 on port 2
2019-07-19T15:53:09.021Z|00024|bridge|INFO|bridge br0: added interface vxlan0 on port 1
2019-07-19T15:53:09.021Z|00025|bridge|INFO|bridge br0: added interface br0 on port 65534
2019-07-19T15:53:09.021Z|00026|bridge|INFO|bridge br0: added interface veth37d49786 on port 3
2019-07-19T15:53:09.021Z|00027|bridge|INFO|bridge br0: using datapath ID 0000b6e23344c44b
2019-07-19T15:53:09.021Z|00028|connmgr|INFO|br0: added service controller "punix:/var/run/openvswitch/br0.mgmt"
2019-07-19T15:53:09.025Z|00029|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.9.0
==> /var/log/openvswitch/ovsdb-server.log <==
2019-06-13T20:56:28.843Z|00023|timeval|WARN|context switches: 0 voluntary, 3 involuntary
2019-06-13T20:56:28.868Z|00024|coverage|INFO|Skipping details of duplicate event coverage for hash=f9afb7d2
2019-06-13T20:58:52.759Z|00002|daemon_unix(monitor)|INFO|pid 2760 died, exit status 0, exiting
2019-06-13T20:58:58.646Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovsdb-server.log
2019-06-13T20:58:58.801Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.9.0
2019-06-13T20:59:08.809Z|00003|memory|INFO|2616 kB peak resident set size after 10.2 seconds
2019-06-13T20:59:08.809Z|00004|memory|INFO|cells:276 json-caches:1 monitors:1 sessions:1
2019-07-19T15:53:01.227Z|00002|daemon_unix(monitor)|INFO|pid 6544 died, exit status 0, exiting
2019-07-19T15:53:07.824Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovsdb-server.log
2019-07-19T15:53:08.003Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.9.0
2019-07-19T15:53:18.012Z|00003|memory|INFO|2616 kB peak resident set size after 10.2 seconds
2019-07-19T15:53:18.013Z|00004|memory|INFO|cells:276 json-caches:1 monitors:1 sessions:1
==> /var/log/openvswitch/ovs-vswitchd.log <==
2019-07-19T15:53:19.028Z|00030|memory|INFO|92252 kB peak resident set size after 10.2 seconds
2019-07-19T15:53:19.029Z|00031|memory|INFO|handlers:5 ports:4 revalidators:3 rules:4

Restarting the docker service seems to solve the issue temporarily.
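After applying the docker-restart workaround, one way to confirm the node is healthy again is to wait for the CNI server socket to reappear before expecting pods to schedule. A minimal sketch (the helper name and timeout are illustrative, not part of the report):

```shell
#!/bin/sh
# Poll once per second until the given unix socket exists,
# up to a timeout in seconds (default 30).
wait_for_sock() {
  sock=$1
  timeout=${2:-30}
  i=0
  while [ "$i" -lt "$timeout" ]; do
    [ -S "$sock" ] && return 0   # -S: path exists and is a socket
    sleep 1
    i=$((i + 1))
  done
  return 1
}
```

After `systemctl restart docker`, something like `wait_for_sock /var/run/openshift-sdn/cni-server.sock && echo ready` would confirm the sdn pod has re-created its socket.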

Version-Release number of selected component (if applicable):

Docker version: 
- docker-1.13.1-75.git8633870.el7_5.x86_64

OCP 3.11.82

RHEL 7.5

How reproducible:
N/A

Steps to Reproduce:
1. Per the customer, the issue appears intermittently, at random.


Actual results:
Containers are not created and get stuck in "Container Creating" status on the node.

Expected results:
Containers should be created.

Comment 18 Alexander Constantinescu 2019-09-05 08:50:52 UTC
Hi 

I will have a look right now. 

Could someone please provide me with a contact point in case I need further information? ssadhale@redhat.com?

Best regards,
Alexander

Comment 26 Alexander Constantinescu 2019-09-12 08:27:56 UTC
Hi

Regarding comment #24: did you check which process is using the socket when the SDN boots? Also, I have not seen any SDN logs attached since comment #20; could you, or someone else, please provide them so I can build a better picture of what is happening?

Concerning the permissions: nothing strikes me as incorrect. I will ask Dan to have a look later today; he might have more insight.

In general, I am also wondering whether something is being restarted frequently by manual actions before this behavior appears. People have mentioned that restarting docker solves the issue temporarily; does anything else happen between the docker restart and the pods failing that has not been mentioned?
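For the socket question above, a quick way to see on the node whether the path exists and which process (if any) holds it is something like the following (tool availability varies by install; `ss` and `fuser` are the usual candidates on RHEL 7):

```shell
# Report whether the CNI server socket exists and who is listening on it.
SOCK=${SOCK:-/var/run/openshift-sdn/cni-server.sock}
if [ -S "$SOCK" ]; then
  # ss -xlp: listening unix sockets with owning processes;
  # fall back to fuser if ss shows nothing for this path.
  ss -xlp | grep -F "$SOCK" || fuser -v "$SOCK"
else
  echo "socket missing: $SOCK"
fi
```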

Thanks in advance!
Alexander

Comment 42 Alexander Constantinescu 2019-12-17 08:25:32 UTC
Hi

I took the liberty of closing this bug, as all customer cases are either "CLOSED" or "waiting on Customer". Case 02503861 does not appear to be related to this issue, so I unlinked it from this bug.

/Alex

