Bug 1865743 - Some pods are stuck in ContainerCreating and some sdn pods are in CrashLoopBackOff
Summary: Some pods are stuck in ContainerCreating and some sdn pods are in CrashLoopBackOff
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.7.0
Assignee: Aniket Bhat
QA Contact: huirwang
URL:
Whiteboard:
Duplicates: 1909260
Depends On: 1859230 1868448
Blocks: 1914910
 
Reported: 2020-08-04 03:45 UTC by huirwang
Modified: 2023-09-15 00:46 UTC
CC: 20 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:14:01 UTC
Target Upstream Version:
Embargoed:


Attachments
crash sdn pods logs (5.81 KB, application/gzip), attached 2020-08-04 03:54 UTC by huirwang


Links
Github openshift/machine-config-operator pull 2316 (closed): Bug 1865743: Add a prestart line to change ownership of openvswitch dir (last updated 2021-02-17 13:10:12 UTC)
Red Hat Product Errata RHSA-2020:5633 (last updated 2021-02-24 15:14:42 UTC)

Description huirwang 2020-08-04 03:45:04 UTC
Description of problem:
Some pods are stuck in ContainerCreating and some sdn pods are in CrashLoopBackOff

How reproducible:
Not sure

Steps to Reproduce:
It happened on a shared cluster on the vSphere platform; it is not clear what triggered the issue.

Some pods are in ContainerCreating

oc get pods -n openshift-multus
NAME                                READY   STATUS              RESTARTS   AGE
multus-24g65                        1/1     Running             0          18h
multus-7l66p                        1/1     Running             72         18h
multus-admission-controller-2x4pq   2/2     Running             0          18h
multus-admission-controller-96b26   0/2     ContainerCreating   0          18h
multus-admission-controller-p2n5d   2/2     Running             0          18h
multus-bhqqm                        1/1     Running             0          17h
multus-g7nwj                        1/1     Running             0          19h
multus-gnwmx                        1/1     Running             0          19h
multus-mz4s8                        1/1     Running             71         19h
network-metrics-daemon-8zlzc        2/2     Running             0          18h
network-metrics-daemon-9vw8d        2/2     Running             0          19h
network-metrics-daemon-c8s5p        0/2     ContainerCreating   0          18h
network-metrics-daemon-dm9gm        2/2     Running             0          17h
network-metrics-daemon-jtscd        0/2     ContainerCreating   0          19h
network-metrics-daemon-n8tkn        2/2     Running             0          19h


 oc describe pod multus-admission-controller-96b26 -n openshift-multus
 
Events snippet:
Events:
  Type     Reason                  Age                    From                               Message
  ----     ------                  ----                   ----                               -------
  Warning  FailedCreatePodSandBox  3m38s (x411 over 11h)  kubelet, qeci-6860-sh6bz-master-0  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_multus-admission-controller-96b26_openshift-multus_43cbd9e5-58da-4192-a747-b679155a54f3_0(9bdfa003c7fad5a06cb4b6f65d40fd31c473ac1004a68c94da2051f886d9f0b3): Multus: [openshift-multus/multus-admission-controller-96b26]: PollImmediate error waiting for ReadinessIndicatorFile: timed out waiting for the condition
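The FailedCreatePodSandBox events above mean multus timed out waiting for its ReadinessIndicatorFile, i.e. the CNI configuration that openshift-sdn writes once it is running on the node. A minimal check, as a sketch only (the CNI config directory path is an assumption and may differ between releases):

```
# Sketch only: verify whether openshift-sdn has written its CNI config on the node
# named in the event above. The directory below is an assumption; use the
# readiness-indicator path actually configured for multus on this cluster.
oc debug node/qeci-6860-sh6bz-master-0 -- chroot /host ls -l /var/run/multus/cni/net.d/
```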
  
  
Some sdn pods are in CrashLoopBackOff
oc get pods -n openshift-sdn
NAME                   READY   STATUS             RESTARTS   AGE
ovs-2kqf4              1/1     Running            0          18h
ovs-5dpgk              1/1     Running            0          19h
ovs-9rnbh              1/1     Running            0          19h
ovs-kvhgx              1/1     Running            0          19h
ovs-nf56d              1/1     Running            0          19h
ovs-v6fqg              1/1     Running            0          19h
sdn-6jp86              0/1     CrashLoopBackOff   124        19h
sdn-controller-6cx8d   1/1     Running            5          19h
sdn-controller-bxxvk   1/1     Running            0          19h
sdn-controller-f7jtb   1/1     Running            0          19h
sdn-cs5t9              1/1     Running            0          19h
sdn-jm5nk              1/1     Running            0          19h
sdn-knpsl              0/1     CrashLoopBackOff   124        19h
sdn-metrics-4zv7d      1/1     Running            0          19h
sdn-metrics-678n6      1/1     Running            0          19h
sdn-metrics-hlpbm      1/1     Running            0          15h
sdn-metrics-jjmf6      1/1     Running            0          15h
sdn-metrics-k664l      1/1     Running            0          14h
sdn-metrics-mwslc      1/1     Running            0          15h
sdn-rl5vs              1/1     Running            0          19h
sdn-tk4r9              1/1     Running            1          18h


 oc describe pod sdn-knpsl -n openshift-sdn
 
    Reason:    Error
      Message:   _08_03_07_22_49.360591215/kube-proxy-config.yaml for changes
I0804 02:35:30.806807 1846009 node.go:150] Initializing SDN node "qeci-6860-sh6bz-master-0" (10.0.0.217) of type "redhat/openshift-ovs-networkpolicy"
I0804 02:35:30.812695 1846009 cmd.go:159] Starting node networking (v0.0.0-alpha.0-187-g46264714)
I0804 02:35:30.812770 1846009 node.go:338] Starting openshift-sdn network plugin
I0804 02:35:30.913808 1846009 sdn_controller.go:139] [SDN setup] full SDN setup required (Link not found)
I0804 02:36:00.949825 1846009 ovs.go:180] Error executing ovs-vsctl: 2020-08-04T02:36:00Z|00002|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
I0804 02:36:31.495879 1846009 ovs.go:180] Error executing ovs-vsctl: 2020-08-04T02:36:31Z|00002|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
I0804 02:36:32.023735 1846009 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: br0 is not a bridge or a socket
I0804 02:36:32.530003 1846009 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: br0 is not a bridge or a socket
I0804 02:36:33.160610 1846009 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: br0 is not a bridge or a socket
I0804 02:36:33.948002 1846009 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: br0 is not a bridge or a socket
I0804 02:36:34.931858 1846009 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: br0 is not a bridge or a socket
I0804 02:36:36.158832 1846009 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: br0 is not a bridge or a socket
I0804 02:36:37.691514 1846009 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: br0 is not a bridge or a socket
I0804 02:36:39.604701 1846009 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: br0 is not a bridge or a socket
I0804 02:36:41.996748 1846009 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: br0 is not a bridge or a socket
I0804 02:36:44.984127 1846009 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: br0 is not a bridge or a socket
F0804 02:36:44.984164 1846009 cmd.go:111] Failed to start sdn: node SDN setup failed: timed out waiting for the condition
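
The repeated "br0 is not a bridge or a socket" errors suggest OVS itself is not healthy on this node, which is why SDN setup times out. A hedged diagnostic sketch (the pod name is a placeholder, not one of the pods listed above; the directory list is an assumption):

```
# Sketch only: check OVS health from the ovs pod scheduled on the affected node.
# <ovs-pod-on-node> is a placeholder for that pod's name.
oc -n openshift-sdn exec <ovs-pod-on-node> -- ovs-vsctl show
oc -n openshift-sdn exec <ovs-pod-on-node> -- ovs-vsctl br-exists br0; echo $?
# Ownership of the OVS directories on the host later comes up as the suspect
# (see comment 23); the exact directory list here is an assumption.
oc debug node/qeci-6860-sh6bz-master-0 -- chroot /host ls -ld /etc/openvswitch /var/lib/openvswitch /run/openvswitch
```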

Comment 2 huirwang 2020-08-04 03:54:30 UTC
Created attachment 1710253 [details]
crash sdn pods logs

Comment 23 Colin Walters 2020-12-23 17:35:40 UTC
Yeah, this is a mess; we probably need to either allocate a static uid/gid for openvswitch or switch it to systemd-sysusers.
(Both of them have backwards compatibility issues with existing systems, but we can probably hack around that with a systemd pre-start unit that does a chown -R)
See:
https://github.com/coreos/rpm-ostree/issues/49
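
For the systemd-sysusers direction, a sketch of what the declaration could look like (the numeric id is invented for illustration; a real change would need an id reserved in the distribution's static range):

```
# Sketch only: the uid/gid value 300 is made up, not a reserved id.
cat > /usr/lib/sysusers.d/openvswitch.conf <<'EOF'
u openvswitch 300 "Open vSwitch daemon" /var/lib/openvswitch
EOF
systemd-sysusers
```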

A short-term workaround on the openvswitch side would be:

```
ExecStartPre=/bin/chown -R openvswitch:openvswitch /var/lib/openvswitch
```
or so.
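
On OpenShift nodes that would most likely be delivered as a systemd drop-in via a MachineConfig. A rough sketch of that shape, assuming the unit name ovs-vswitchd.service and only the directory above (neither is confirmed to match the merged fix):

```
# Sketch only: unit name, drop-in path, and directory list are assumptions.
mkdir -p /etc/systemd/system/ovs-vswitchd.service.d
cat > /etc/systemd/system/ovs-vswitchd.service.d/10-chown-ovs-dirs.conf <<'EOF'
[Service]
ExecStartPre=/bin/chown -R openvswitch:openvswitch /var/lib/openvswitch
EOF
systemctl daemon-reload
```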

Modern systemd versions support dynamic users (http://0pointer.net/blog/dynamic-users-with-systemd.html); I think that's available in RHEL 8, so that may be a good choice instead.

Comment 24 Brandon Anderson 2020-12-23 19:19:32 UTC
Related case 02789256 reported that the operators came back into a ready state without any changes, and provided a must-gather taken before this occurred:

https://attachments.access.redhat.com/hydra/rest/cases/02789256/attachments/aefdf051-24b2-4c00-bfc0-65000413da31?usePresignedUrl=true

Comment 25 Tim Rozet 2021-01-04 22:36:29 UTC
*** Bug 1909260 has been marked as a duplicate of this bug. ***

Comment 33 zhaozhanqi 2021-01-27 07:02:41 UTC
This issue is fixed in 4.6.14; the planned release date is 2021-Feb-01.

Comment 36 errata-xmlrpc 2021-02-24 15:14:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

Comment 37 Red Hat Bugzilla 2023-09-15 00:46:01 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

