Description of problem:
Some pods are stuck in ContainerCreating and some sdn pods are in CrashLoopBackOff.

How reproducible:
Not sure

Steps to Reproduce:
It happened in the shared cluster on the vSphere platform; not sure what triggered the issue.

Some pods are in ContainerCreating:

$ oc get pods -n openshift-multus
NAME                                READY   STATUS              RESTARTS   AGE
multus-24g65                        1/1     Running             0          18h
multus-7l66p                        1/1     Running             72         18h
multus-admission-controller-2x4pq   2/2     Running             0          18h
multus-admission-controller-96b26   0/2     ContainerCreating   0          18h
multus-admission-controller-p2n5d   2/2     Running             0          18h
multus-bhqqm                        1/1     Running             0          17h
multus-g7nwj                        1/1     Running             0          19h
multus-gnwmx                        1/1     Running             0          19h
multus-mz4s8                        1/1     Running             71         19h
network-metrics-daemon-8zlzc        2/2     Running             0          18h
network-metrics-daemon-9vw8d        2/2     Running             0          19h
network-metrics-daemon-c8s5p        0/2     ContainerCreating   0          18h
network-metrics-daemon-dm9gm        2/2     Running             0          17h
network-metrics-daemon-jtscd        0/2     ContainerCreating   0          19h
network-metrics-daemon-n8tkn        2/2     Running             0          19h

$ oc describe pod multus-admission-controller-96b26 -n openshift-multus (snippet)
Events:
  Type     Reason                  Age                    From                               Message
  ----     ------                  ----                   ----                               -------
  Warning  FailedCreatePodSandBox  3m38s (x411 over 11h)  kubelet, qeci-6860-sh6bz-master-0  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_multus-admission-controller-96b26_openshift-multus_43cbd9e5-58da-4192-a747-b679155a54f3_0(9bdfa003c7fad5a06cb4b6f65d40fd31c473ac1004a68c94da2051f886d9f0b3): Multus: [openshift-multus/multus-admission-controller-96b26]: PollImmediate error waiting for ReadinessIndicatorFile: timed out waiting for the condition

Some sdn pods are in CrashLoopBackOff:

$ oc get pods -n openshift-sdn
NAME                   READY   STATUS             RESTARTS   AGE
ovs-2kqf4              1/1     Running            0          18h
ovs-5dpgk              1/1     Running            0          19h
ovs-9rnbh              1/1     Running            0          19h
ovs-kvhgx              1/1     Running            0          19h
ovs-nf56d              1/1     Running            0          19h
ovs-v6fqg              1/1     Running            0          19h
sdn-6jp86              0/1     CrashLoopBackOff   124        19h
sdn-controller-6cx8d   1/1     Running            5          19h
sdn-controller-bxxvk   1/1     Running            0          19h
sdn-controller-f7jtb   1/1     Running            0          19h
sdn-cs5t9              1/1     Running            0          19h
sdn-jm5nk              1/1     Running            0          19h
sdn-knpsl              0/1     CrashLoopBackOff   124        19h
sdn-metrics-4zv7d      1/1     Running            0          19h
sdn-metrics-678n6      1/1     Running            0          19h
sdn-metrics-hlpbm      1/1     Running            0          15h
sdn-metrics-jjmf6      1/1     Running            0          15h
sdn-metrics-k664l      1/1     Running            0          14h
sdn-metrics-mwslc      1/1     Running            0          15h
sdn-rl5vs              1/1     Running            0          19h
sdn-tk4r9              1/1     Running            1          18h

$ oc describe pod sdn-knpsl -n openshift-sdn (snippet)
    Reason:   Error
    Message:  _08_03_07_22_49.360591215/kube-proxy-config.yaml for changes
I0804 02:35:30.806807 1846009 node.go:150] Initializing SDN node "qeci-6860-sh6bz-master-0" (10.0.0.217) of type "redhat/openshift-ovs-networkpolicy"
I0804 02:35:30.812695 1846009 cmd.go:159] Starting node networking (v0.0.0-alpha.0-187-g46264714)
I0804 02:35:30.812770 1846009 node.go:338] Starting openshift-sdn network plugin
I0804 02:35:30.913808 1846009 sdn_controller.go:139] [SDN setup] full SDN setup required (Link not found)
I0804 02:36:00.949825 1846009 ovs.go:180] Error executing ovs-vsctl: 2020-08-04T02:36:00Z|00002|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
I0804 02:36:31.495879 1846009 ovs.go:180] Error executing ovs-vsctl: 2020-08-04T02:36:31Z|00002|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
I0804 02:36:32.023735 1846009 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: br0 is not a bridge or a socket
I0804 02:36:32.530003 1846009 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: br0 is not a bridge or a socket
I0804 02:36:33.160610 1846009 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: br0 is not a bridge or a socket
I0804 02:36:33.948002 1846009 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: br0 is not a bridge or a socket
I0804 02:36:34.931858 1846009 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: br0 is not a bridge or a socket
I0804 02:36:36.158832 1846009 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: br0 is not a bridge or a socket
I0804 02:36:37.691514 1846009 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: br0 is not a bridge or a socket
I0804 02:36:39.604701 1846009 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: br0 is not a bridge or a socket
I0804 02:36:41.996748 1846009 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: br0 is not a bridge or a socket
I0804 02:36:44.984127 1846009 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: br0 is not a bridge or a socket
F0804 02:36:44.984164 1846009 cmd.go:111] Failed to start sdn: node SDN setup failed: timed out waiting for the condition
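For anyone hitting the same symptom: the ContainerCreating pods are blocked because Multus is waiting for the SDN CNI config (the ReadinessIndicatorFile) that the crash-looping sdn pod never writes. A minimal diagnostic sketch is below; it is not from the original report, and the node name and the CNI config paths are assumptions based on common OpenShift 4 defaults:

```
# Check from the affected node whether the CNI config Multus waits on has been
# written, and whether the br0 bridge the sdn pod complains about exists.
# Node name and paths are assumed, not confirmed in this bug.
oc debug node/qeci-6860-sh6bz-master-0 -- chroot /host sh -c \
  'ls -l /etc/cni/net.d/ /var/run/multus/cni/net.d/ 2>/dev/null; ip link show br0'
```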
Created attachment 1710253: logs from the crashing sdn pods
Is this the same issue as https://bugzilla.redhat.com/show_bug.cgi?id=1858498 or https://bugzilla.redhat.com/show_bug.cgi?id=1853889?
Yeah, this is a mess. We probably need to either allocate a static uid/gid for openvswitch, or switch it to systemd-sysusers. (Both have backwards-compatibility issues with existing systems, but we can probably hack around that with a systemd pre-start unit that does a chown -R.) See: https://github.com/coreos/rpm-ostree/issues/49

A short-term workaround openvswitch could do is:

```
ExecStartPre=/bin/chown -R openvswitch:openvswitch /var/lib/openvswitch
```

or so.

Modern systemd versions support http://0pointer.net/blog/dynamic-users-with-systemd.html - I think that's available in RHEL8, so that may be a good choice instead.
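To illustrate the pre-start idea, a drop-in like the following could carry the chown until the packaging is fixed. This is only a sketch: the unit name (ovsdb-server.service) and the drop-in path are assumptions, not something confirmed in this bug, so adjust them to whichever unit actually owns the Open vSwitch state directory on the host:

```
# /etc/systemd/system/ovsdb-server.service.d/10-chown-state.conf  (assumed path)
[Service]
# Re-own the state directory in case the openvswitch uid/gid changed across an
# OS update, which is the root cause discussed in this comment.
ExecStartPre=/bin/chown -R openvswitch:openvswitch /var/lib/openvswitch
```

If the DynamicUser route from the blog post above is taken instead, the same drop-in could presumably set DynamicUser=yes plus StateDirectory=openvswitch and let systemd manage the ownership, at the cost of migrating existing installs.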
Related case 02789256 reported that the operators came back into a ready state without any changes, and provided a must-gather taken before that happened: https://attachments.access.redhat.com/hydra/rest/cases/02789256/attachments/aefdf051-24b2-4c00-bfc0-65000413da31?usePresignedUrl=true
*** Bug 1909260 has been marked as a duplicate of this bug. ***
Marking this as a blocker for testing the supported 500 pods-per-node limit on OVN (ref: https://docs.openshift.com/container-platform/4.6/scalability_and_performance/planning-your-environment-according-to-object-maximums.html#cluster-maximums-major-releases_object-limits).
This issue is fixed in version 4.6.14; the planned release date is 2021-Feb-01.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days