Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1865743

Summary: Some pods are stuck in ContainerCreating and some sdn pods are in CrashLoopBackOff
Product: OpenShift Container Platform
Component: Networking
Networking sub component: openshift-sdn
Version: 4.6
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Keywords: Reopened, TestBlocker, Upgrades
Reporter: huirwang
Assignee: Aniket Bhat <anbhat>
QA Contact: huirwang
CC: aconstan, anbhat, andbartl, apjagtap, apurty, braander, jkaur, jnordell, mifiedle, openshift-bugs-escalate, pamoedom, rgregory, rheinzma, scuppett, simore, sople, srengan, walters, wking, zzhao
Type: Bug
Last Closed: 2021-02-24 15:14:01 UTC
Bug Depends On: 1859230, 1868448    
Bug Blocks: 1914910    
Attachments: crash sdn pods logs

Description huirwang 2020-08-04 03:45:04 UTC
Description of problem:
Some pods are stuck in ContainerCreating and some sdn pods are in CrashLoopBackOff

How reproducible:
Not sure

Steps to Reproduce:
It happened on a shared cluster on the vSphere platform; not sure what triggered the issue.

Some pods are stuck in ContainerCreating:

oc get pods -n openshift-multus
NAME                                READY   STATUS              RESTARTS   AGE
multus-24g65                        1/1     Running             0          18h
multus-7l66p                        1/1     Running             72         18h
multus-admission-controller-2x4pq   2/2     Running             0          18h
multus-admission-controller-96b26   0/2     ContainerCreating   0          18h
multus-admission-controller-p2n5d   2/2     Running             0          18h
multus-bhqqm                        1/1     Running             0          17h
multus-g7nwj                        1/1     Running             0          19h
multus-gnwmx                        1/1     Running             0          19h
multus-mz4s8                        1/1     Running             71         19h
network-metrics-daemon-8zlzc        2/2     Running             0          18h
network-metrics-daemon-9vw8d        2/2     Running             0          19h
network-metrics-daemon-c8s5p        0/2     ContainerCreating   0          18h
network-metrics-daemon-dm9gm        2/2     Running             0          17h
network-metrics-daemon-jtscd        0/2     ContainerCreating   0          19h
network-metrics-daemon-n8tkn        2/2     Running             0          19h


oc describe pod multus-admission-controller-96b26 -n openshift-multus

Snippet:
Events:
  Type     Reason                  Age                    From                               Message
  ----     ------                  ----                   ----                               -------
  Warning  FailedCreatePodSandBox  3m38s (x411 over 11h)  kubelet, qeci-6860-sh6bz-master-0  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_multus-admission-controller-96b26_openshift-multus_43cbd9e5-58da-4192-a747-b679155a54f3_0(9bdfa003c7fad5a06cb4b6f65d40fd31c473ac1004a68c94da2051f886d9f0b3): Multus: [openshift-multus/multus-admission-controller-96b26]: PollImmediate error waiting for ReadinessIndicatorFile: timed out waiting for the condition
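For context: Multus defers pod sandbox creation until the cluster's default network plugin has written its readiness-indicator file, so the timeout above suggests openshift-sdn never produced a working CNI config on that node. A minimal way to check from the node, assuming openshift-sdn writes its config under /etc/cni/net.d (the exact path is an assumption):

```
# Hedged check: does the default CNI config (Multus' readiness indicator) exist on the node?
oc debug node/qeci-6860-sh6bz-master-0 -- chroot /host ls -l /etc/cni/net.d/
```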
  
  
Some sdn pods are in CrashLoopBackOff:
oc get pods -n openshift-sdn
NAME                   READY   STATUS             RESTARTS   AGE
ovs-2kqf4              1/1     Running            0          18h
ovs-5dpgk              1/1     Running            0          19h
ovs-9rnbh              1/1     Running            0          19h
ovs-kvhgx              1/1     Running            0          19h
ovs-nf56d              1/1     Running            0          19h
ovs-v6fqg              1/1     Running            0          19h
sdn-6jp86              0/1     CrashLoopBackOff   124        19h
sdn-controller-6cx8d   1/1     Running            5          19h
sdn-controller-bxxvk   1/1     Running            0          19h
sdn-controller-f7jtb   1/1     Running            0          19h
sdn-cs5t9              1/1     Running            0          19h
sdn-jm5nk              1/1     Running            0          19h
sdn-knpsl              0/1     CrashLoopBackOff   124        19h
sdn-metrics-4zv7d      1/1     Running            0          19h
sdn-metrics-678n6      1/1     Running            0          19h
sdn-metrics-hlpbm      1/1     Running            0          15h
sdn-metrics-jjmf6      1/1     Running            0          15h
sdn-metrics-k664l      1/1     Running            0          14h
sdn-metrics-mwslc      1/1     Running            0          15h
sdn-rl5vs              1/1     Running            0          19h
sdn-tk4r9              1/1     Running            1          18h


oc describe pod sdn-knpsl -n openshift-sdn

Reason:  Error
Message: _08_03_07_22_49.360591215/kube-proxy-config.yaml for changes
I0804 02:35:30.806807 1846009 node.go:150] Initializing SDN node "qeci-6860-sh6bz-master-0" (10.0.0.217) of type "redhat/openshift-ovs-networkpolicy"
I0804 02:35:30.812695 1846009 cmd.go:159] Starting node networking (v0.0.0-alpha.0-187-g46264714)
I0804 02:35:30.812770 1846009 node.go:338] Starting openshift-sdn network plugin
I0804 02:35:30.913808 1846009 sdn_controller.go:139] [SDN setup] full SDN setup required (Link not found)
I0804 02:36:00.949825 1846009 ovs.go:180] Error executing ovs-vsctl: 2020-08-04T02:36:00Z|00002|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
I0804 02:36:31.495879 1846009 ovs.go:180] Error executing ovs-vsctl: 2020-08-04T02:36:31Z|00002|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
I0804 02:36:32.023735 1846009 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: br0 is not a bridge or a socket
I0804 02:36:32.530003 1846009 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: br0 is not a bridge or a socket
I0804 02:36:33.160610 1846009 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: br0 is not a bridge or a socket
I0804 02:36:33.948002 1846009 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: br0 is not a bridge or a socket
I0804 02:36:34.931858 1846009 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: br0 is not a bridge or a socket
I0804 02:36:36.158832 1846009 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: br0 is not a bridge or a socket
I0804 02:36:37.691514 1846009 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: br0 is not a bridge or a socket
I0804 02:36:39.604701 1846009 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: br0 is not a bridge or a socket
I0804 02:36:41.996748 1846009 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: br0 is not a bridge or a socket
I0804 02:36:44.984127 1846009 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: br0 is not a bridge or a socket
F0804 02:36:44.984164 1846009 cmd.go:111] Failed to start sdn: node SDN setup failed: timed out waiting for the condition
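The signal 14 (Alarm clock, i.e. SIGALRM) warnings mean ovs-vsctl gave up after its timeout without a response from the OVS daemons, and the "br0 is not a bridge" errors follow from the bridge never being created, so the SDN node setup times out. A hedged diagnostic sketch for confirming the OVS side on the affected node (the placeholder pod name must be replaced with the ovs pod actually scheduled there):

```
# Find the ovs DaemonSet pod running on the affected node
oc get pods -n openshift-sdn -o wide | grep qeci-6860-sh6bz-master-0
# Ask ovsdb-server for its bridges; br0 should appear once SDN setup succeeds
oc exec -n openshift-sdn <ovs-pod-on-that-node> -- ovs-vsctl list-br
```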

Comment 2 huirwang 2020-08-04 03:54:30 UTC
Created attachment 1710253 [details]
crash sdn pods logs

Comment 23 Colin Walters 2020-12-23 17:35:40 UTC
Yeah, this is a mess; we probably need to either allocate a static uid/gid for openvswitch or switch it to systemd-sysusers.
(Both have backwards-compatibility issues with existing systems, but we can probably work around that with a systemd pre-start unit that does a chown -R.)
See:
https://github.com/coreos/rpm-ostree/issues/49

A short-term workaround openvswitch could apply is something like:

```
ExecStartPre=/bin/chown -R openvswitch:openvswitch /var/lib/openvswitch
```

Modern systemd versions also support dynamic users (http://0pointer.net/blog/dynamic-users-with-systemd.html); I think that's available in RHEL 8, so that may be a good choice instead.
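A minimal sketch of what that dynamic-user approach could look like as a systemd drop-in; the drop-in path and the choice of unit are assumptions, not the shipped openvswitch packaging:

```
# /etc/systemd/system/ovsdb-server.service.d/dynamic-user.conf  (hypothetical drop-in)
[Service]
# Allocate a transient uid/gid at service start instead of a static openvswitch user
DynamicUser=yes
# Have systemd create /var/lib/openvswitch and chown it to the dynamic user
StateDirectory=openvswitch
```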

Comment 24 Brandon Anderson 2020-12-23 19:19:32 UTC
Related case 02789256 reported that the operators came back to a ready state without any changes, and provided a must-gather captured before this occurred:

https://attachments.access.redhat.com/hydra/rest/cases/02789256/attachments/aefdf051-24b2-4c00-bfc0-65000413da31?usePresignedUrl=true

Comment 25 Tim Rozet 2021-01-04 22:36:29 UTC
*** Bug 1909260 has been marked as a duplicate of this bug. ***

Comment 33 zhaozhanqi 2021-01-27 07:02:41 UTC
This issue is fixed in version 4.6.14; the planned release date is 2021-Feb-01.

Comment 36 errata-xmlrpc 2021-02-24 15:14:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

Comment 37 Red Hat Bugzilla 2023-09-15 00:46:01 UTC
The needinfo request[s] on this closed bug have been removed, as they have been unresolved for 500 days.