Bug 1895024

Summary: 4.6.1 Install Fails with ovs Pods in CrashLoopBackOff
Product: OpenShift Container Platform
Reporter: Chris Keller <ckeller>
Component: Networking
Assignee: Juan Luis de Sousa-Valadas <jdesousa>
Networking sub component: openshift-sdn
QA Contact: Ross Brattain <rbrattai>
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: high
CC: aaustin, aconstan, anbhat, bugzilla, fshaikh, jdesousa, jkho, jocolema, maupadhy, mjahangi, mwasher, openshift-bugs-escalate, pchavva, rbost, rheinzma, rjamadar, simore, sjayapra, skanakal, sthaha, xialhuan, ychoukse, zzhao
Version: 4.6.z
Target Milestone: ---
Target Release: 4.7.0
Last Closed: 2021-01-20 13:50:56 UTC
Type: Bug

Description Chris Keller 2020-11-05 15:25:51 UTC
Description of problem:

When installing 4.6.1 on VMware/UPI, the installation fails because the SDN is not properly deployed. OVS pods remain in CrashLoopBackOff state.


Version-Release number of selected component (if applicable):

This bug is for a new 4.6.1 installation. However, I have observed the same behavior with upgrades from 4.5.15->4.6.1, 4.5.16->4.6.1, and 4.6.1->4.6.2. In the upgrade case, the failure occurs when the MCO updates and reboots the RHCOS nodes.


How reproducible:

Every time.


Steps to Reproduce:

1. Attempt a 4.6.1 installation on VMware. Links to install-config.yaml and manifests are below.

Actual results:

Installation fails because OVS pods are in CrashLoopBackOff state.


Expected results:

Installation succeeds. 

Additional info:

When the cluster is bootstrapping, master nodes remain in a NotReady state:

$ oc get nodes
NAME                                           STATUS     ROLES    AGE     VERSION
master0.acm-managed-2.ocp.pwc.umbrella.local   NotReady   master   8m37s   v1.19.0+d59ce34
master1.acm-managed-2.ocp.pwc.umbrella.local   NotReady   master   6m38s   v1.19.0+d59ce34
master2.acm-managed-2.ocp.pwc.umbrella.local   NotReady   master   7m9s    v1.19.0+d59ce34

Looking at pods in the openshift-sdn namespace we observe:

$ oc get pods -n openshift-sdn
NAME                   READY   STATUS             RESTARTS   AGE
ovs-m9jn5              0/1     CrashLoopBackOff   6          8m41s
ovs-shlcl              0/1     CrashLoopBackOff   6          7m13s
ovs-wgxsf              0/1     CrashLoopBackOff   6          7m43s
sdn-controller-9gfc9   1/1     Running            0          7m13s
sdn-controller-px5sh   1/1     Running            0          7m43s
sdn-controller-zc4s5   1/1     Running            0          8m41s
sdn-d22qs              0/2     CrashLoopBackOff   10         7m43s
sdn-hxspf              0/2     CrashLoopBackOff   10         7m12s
sdn-jpfg9              0/2     CrashLoopBackOff   10         8m40s

SDN pods are crashing because OVS pods are crashing. Looking at the logs for the OVS pods, we see:

$ oc logs -n openshift-sdn ovs-m9jn5
openvswitch is running in container
Starting ovsdb-server ... failed!
ovsdb-server: /var/run/openvswitch/ovsdb-server.pid: pidfile check failed (No such process), aborting
rm: cannot remove '/var/run/openvswitch/ovs-vswitchd.pid': No such file or directory

If I remove the file /var/run/openvswitch/ovsdb-server.pid from the RHCOS node associated with the pod in CrashLoopBackOff state, the pod eventually starts and the issue is resolved.
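
In shell terms, the workaround amounts to something like the sketch below. The node name is just an example from the output above, and the `app=ovs` pod label is an assumption about the openshift-sdn ovs DaemonSet; note that Comment 3 below reports this approach can leave duplicate copies of ovsdb-server running.

~~~
# Sketch of the stale-pidfile workaround described above
NODE=master0.acm-managed-2.ocp.pwc.umbrella.local   # example node

# Remove the stale pidfile on the affected RHCOS node
ssh core@"${NODE}" 'sudo rm -f /var/run/openvswitch/ovsdb-server.pid'

# Optionally force the crash-looping ovs pod on that node to restart
# (label app=ovs is an assumption)
oc -n openshift-sdn delete pod -l app=ovs --field-selector spec.nodeName="${NODE}"
~~~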

A must-gather could not be completed given the state of the cluster. An sosreport for one of the masters and other pertinent info is available in the Google Drive link below.

Comment 3 Andrew Austin 2020-11-18 16:25:45 UTC
I am also running into this issue upgrading a UPI baremetal environment from 4.5.18 to 4.6.4 in the fast channel.

When a node reboots from RHCOS 4.5 to 4.6 as part of the upgrade, the ovs pod on the node enters a CrashLoopBackOff state because it is trying to launch OVS inside the container, just as in Chris's environment.

I noticed that the ovs-configuration systemd unit was not starting to create the /var/run/ovs-config-executed file because ovs-vswitchd was not starting. ovs-vswitchd was not starting because the ovs-* pod was already trying to run openvswitch inside the container.

I tried Chris's workaround of deleting the PID file, but that results in conflicting copies of ovsdb-server running on the node. I was able to work around the issue more cleanly with the procedure below (a consolidated shell sketch follows the steps). This must be done after every node reboot.

1. SSH into node
2. touch /var/run/ovs-config-executed
3. Delete the ovs-* pod on the node and verify that it restarts and logs something like "openvswitch is running in systemd"
4. systemctl start ovs-configuration (this starts all dependent ovs services as well)
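
Consolidated as a shell sketch (assuming SSH access as the core user, an oc session with cluster-admin, and that the ovs DaemonSet pods carry the label app=ovs):

~~~
# Per-reboot workaround from the steps above; NODE is a placeholder.
NODE=<affected-node-name>

# Steps 1-2: mark host-side OVS configuration as executed so the ovs pod
# defers to the systemd-managed OVS instead of launching OVS in the container
ssh core@"${NODE}" 'sudo touch /var/run/ovs-config-executed'

# Step 3: delete the ovs-* pod on that node; its logs should then report
# something like "openvswitch is running in systemd"
oc -n openshift-sdn delete pod -l app=ovs --field-selector spec.nodeName="${NODE}"

# Step 4: start ovs-configuration, which pulls in the dependent OVS services
ssh core@"${NODE}" 'sudo systemctl start ovs-configuration'
~~~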

Comment 4 Robert Bost 2020-12-01 22:56:10 UTC
Linking customer case 02810779 that saw this issue.

Comment 6 Robert Bost 2020-12-04 22:35:29 UTC
The customer in 02810779 continues to have this problem after a reboot of a node. Below is the state of the services after a reboot. If we manually restart openvswitch.service and ovs-configuration.service, all is well until the next reboot. 

* ovs-configuration.service - Configures OVS with proper host networking configuration
   Loaded: loaded (/etc/systemd/system/ovs-configuration.service; enabled; vendor preset: disabled)
   Active: inactive (dead)
Dec 03 06:39:57 localhost systemd[1]: Dependency failed for Configures OVS with proper host networking configuration.
Dec 03 06:39:57 localhost systemd[1]: ovs-configuration.service: Job ovs-configuration.service/start failed with result 'dependency'.

* openvswitch.service - Open vSwitch
   Loaded: loaded (/usr/lib/systemd/system/openvswitch.service; enabled; vendor preset: disabled)
   Active: inactive (dead)
Dec 03 06:39:57 localhost systemd[1]: Dependency failed for Open vSwitch.
Dec 03 06:39:57 localhost systemd[1]: openvswitch.service: Job openvswitch.service/start failed with result 'dependency'.

* ovs-vswitchd.service - Open vSwitch Forwarding Unit
   Loaded: loaded (/usr/lib/systemd/system/ovs-vswitchd.service; static; vendor preset: disabled)
  Drop-In: /etc/systemd/system/ovs-vswitchd.service.d
           `-10-ovs-vswitchd-restart.conf
   Active: inactive (dead)
Dec 03 06:39:57 localhost systemd[1]: Dependency failed for Open vSwitch Forwarding Unit.
Dec 03 06:39:57 localhost systemd[1]: ovs-vswitchd.service: Job ovs-vswitchd.service/start failed with result 'dependency'.

* ovsdb-server.service - Open vSwitch Database Unit
   Loaded: loaded (/usr/lib/systemd/system/ovsdb-server.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/ovsdb-server.service.d
           `-10-ovsdb-restart.conf
   Active: active (running) since Thu 2020-12-03 06:40:30 UTC; 18h ago


(-> signifies "Requires" in service unit below)
- ovs-configuration.service -> openvswitch.service
- openvswitch.service -> ovsdb-server.service 
- openvswitch.service -> ovs-vswitchd.service
- ovs-vswitchd.service -> ovsdb-server.service
- ovsdb-server.service is the only one without a Requires directive
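
This chain can be confirmed on a node with systemctl, for example (a sketch, run as root on the RHCOS node):

~~~
# Print the Requires/After relationships for each unit in the chain
for unit in ovs-configuration.service openvswitch.service \
            ovs-vswitchd.service ovsdb-server.service; do
  echo "== ${unit}"
  systemctl show -p Requires -p After "${unit}"
done

# Or walk the dependency tree from the top-level unit
systemctl list-dependencies ovs-configuration.service
~~~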
 

Note that ovsdb-server is the only service in the group that is running. BUT, note the timestamp: it became active AFTER the services that depend on it had already failed. I would have expected the Restart directive on ovs-vswitchd.service to kick in, but I don't fully understand why it did not.

Unfortunately, we don't have logs from Dec 03 06:00 to review (the journal was truncated), but we have another customer in case 02816167 whom we're asking to gather more information.

Comment 10 Robert Bost 2020-12-08 01:01:58 UTC
I see two customers (both linked to this bug) running on vSphere that hit this problem, where the openvswitch and ovs-configuration services require a manual restart after a reboot. This is the only workaround at the moment.
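
For reference, the manual workaround boils down to something like this sketch, run as root on the affected node after it has finished booting:

~~~
# Restart the host OVS stack, then re-run the one-shot configuration unit
systemctl restart openvswitch.service
systemctl start ovs-configuration.service

# Check that neither unit still reports a failed/dependency result
systemctl --no-pager --full status openvswitch.service ovs-configuration.service
~~~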

Is there any way we can make these OVS-related services more resilient to a timeout?

Log Analysis:

In the sosreports for both nodes that I asked customers to reboot, I see the following behavior:

Dec 07 08:50:16 localhost systemd[1]: Starting Open vSwitch Database Unit...
...
...
Dec 07 08:51:46 localhost systemd[1]: sssd.service: Start operation timed out. Terminating.
Dec 07 08:51:46 localhost systemd[1]: ovsdb-server.service: Start operation timed out. Terminating.
Dec 07 08:51:46 localhost systemd[1]: ovsdb-server.service: Failed with result 'timeout'.
Dec 07 08:51:46 localhost systemd[1]: Failed to start Open vSwitch Database Unit.
Dec 07 08:51:46 localhost systemd[1]: Dependency failed for Open vSwitch.
Dec 07 08:51:46 localhost systemd[1]: Dependency failed for Configures OVS with proper host networking configuration.
Dec 07 08:51:46 localhost systemd[1]: ovs-configuration.service: Job ovs-configuration.service/start failed with result 'dependency'.
Dec 07 08:51:46 localhost systemd[1]: openvswitch.service: Job openvswitch.service/start failed with result 'dependency'.
Dec 07 08:51:46 localhost systemd[1]: Dependency failed for Open vSwitch Forwarding Unit.
Dec 07 08:51:46 localhost systemd[1]: ovs-vswitchd.service: Job ovs-vswitchd.service/start failed with result 'dependency'.
Dec 07 08:51:46 localhost systemd[1]: ovsdb-server.service: Consumed 82ms CPU time
Dec 07 08:51:46 localhost systemd[1]: ovs-delete-transient-ports.service: Starting requested but asserts failed.
Dec 07 08:51:46 localhost systemd[1]: Assertion failed for Open vSwitch Delete Transient Ports.

After these services time out initially, only ovsdb-server recovers:

Dec 07 08:51:47 localhost systemd[1]: ovsdb-server.service: Service RestartSec=100ms expired, scheduling restart.
Dec 07 08:51:47 localhost systemd[1]: ovsdb-server.service: Scheduled restart job, restart counter is at 1.
Dec 07 08:52:20 localhost ovs-ctl[1405]: Starting ovsdb-server.
Dec 07 08:52:20 localhost systemd[1]: Started Open vSwitch Database Unit.

Comment 11 Robert Bost 2020-12-08 01:03:53 UTC
Both customers also have low entropy (under 1,000) reported in the sosreports and I can see crng init doesn't complete until >1 minute after boot. It looks like this is a contributor to the slow boot and eventual timeout of OVS-related services.
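
Both symptoms can be checked quickly on a node, e.g. (a sketch; the exact crng log wording can vary by kernel version):

~~~
# Available entropy; values well below ~1000 indicate a starved pool
cat /proc/sys/kernel/random/entropy_avail

# When the kernel reported CRNG initialization relative to boot
journalctl -k -b 0 | grep -i 'crng init'
~~~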

However, as mentioned in my last comment, I'd still expect the services to be more resilient to this.

Comment 19 Eva Huang 2020-12-16 23:06:20 UTC
Hi Dev team
I have flagged the bug as a "customer escalation", as the customer has raised concerns about the progress. It would be appreciated if the priority could be raised given the situation. Thank you.
Eva Huang, Escalation Manager

Comment 21 Eva Huang 2020-12-17 06:32:45 UTC
Hi Dev team

Are you working through the Christmas and New Year Period? Please advise. Thank you.

Best regards
Eva Huang
Escalation Manager

Comment 32 Muhammad Selim Jahangir 2020-12-18 05:24:15 UTC
Hi 
I have a customer (02816539) who is hitting the same issue.

Regards
selim

Comment 34 Robert Bost 2020-12-18 17:02:33 UTC
I've seen a few people commenting on this bug who are touching /var/run/ovs-config-executed. I don't think anyone should be doing this, as the file indicates that ovs-configuration.service ran successfully, which it clearly hasn't if you had to touch the file!

@Juan Luis, can you confirm that users should *not* touch /var/run/ovs-config-executed?

I created https://access.redhat.com/solutions/5635011 last week and it's linked to this BZ. Also, note https://bugzilla.redhat.com/show_bug.cgi?id=1895024#c10 where I mention a workaround that helped 2 of the 3 linked customers at the time.

Comment 35 Ross Brattain 2020-12-18 18:17:30 UTC
Looks like the current timeouts are

openvswitch.service
TimeoutStartUSec=infinity
ovs-vswitchd.service
TimeoutStartUSec=5min
ovsdb-server.service
TimeoutStartUSec=1min 30s
ovs-configuration.service
TimeoutStartUSec=infinity
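
(For reference, these values can be read back per unit with something like the sketch below.)

~~~
# Query TimeoutStartUSec for each OVS-related unit on the node
for unit in openvswitch.service ovs-vswitchd.service \
            ovsdb-server.service ovs-configuration.service; do
  echo -n "${unit}: "
  systemctl show -p TimeoutStartUSec --value "${unit}"
done
~~~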


We might be able to just bump up the start timeout for now.

cat > /etc/systemd/system/ovsdb-server.service.d/20-ovsdb-timeout.conf <<'EOF'
[Service]
TimeoutStartSec=20min
EOF
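
The drop-in only takes effect after the unit files are reloaded; a sketch of applying and verifying it:

~~~
# Reload unit files so the drop-in is picked up, then confirm the new timeout
systemctl daemon-reload
systemctl show -p TimeoutStartUSec ovsdb-server.service   # expect 20min
~~~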

Comment 36 Juan Luis de Sousa-Valadas 2020-12-21 17:55:56 UTC
Robert,
> I've seen a few people commenting on this bug who are touching /var/run/ovs-config-executed. I don't think anyone should be doing this, as the file indicates that ovs-configuration.service ran successfully, which it clearly hasn't if you had to touch the file!

If you look at the script, all it does is clean up a bunch of state; if that state isn't there, there's nothing to clean up first. So it's not necessarily wrong to touch the file if you are aware of what you're skipping and understand the consequences of not running it, but as a rule of thumb it's not a good idea, and the better workaround is to run the script or start the ovs-configuration service manually.

Ross,
About increasing the timeout to 20 minutes or something equally large, I'm worried because I don't know what unintended side effects we might face.

Chris,
I believe you can reproduce this easily. Can you reproduce it every time, or only during the first boot of the installation? If you can reproduce it, is there any chance you could give me access to a cluster with this issue? I need cluster-admin.

Comment 37 Chris Keller 2020-12-21 18:04:00 UTC
Juan,

Yes I can reproduce this every time I do an installation. I will ping you on chat.

Comment 50 Juan Luis de Sousa-Valadas 2021-01-19 15:19:45 UTC
Hello,
The last version reported to fail is 4.6.9, and we're unable to reproduce this with 4.6.11. Could all affected people please try to reproduce with 4.6.11? If you cannot, I'll move this to the OVS or RHCOS people so they can figure out what in the changelog fixed it.
Thanks,
Juan

Comment 51 Robert Heinzmann 2021-01-20 09:51:22 UTC
I ran a quick test with my reproducer:

With 4.6.9 I am able to reproduce the issue. The core reason is slow entropy accumulation on the VMs (no rdrand extension).

~~~
[stack@osp16amd ocp-test4]$ oc get node
NAME                 STATUS     ROLES    AGE     VERSION
ocp-gm466-master-0   NotReady   master   7m18s   v1.19.0+7070803
ocp-gm466-master-1   NotReady   master   7m16s   v1.19.0+7070803
ocp-gm466-master-2   NotReady   master   7m20s   v1.19.0+7070803

[core@ocp-gm466-bootstrap ~]$ cat /proc/sys/kernel/random/entropy_avail
152
[core@ocp-gm466-bootstrap ~]$ cat /proc/cpuinfo | grep rdrand | wc -l 
0
[core@ocp-gm466-bootstrap ~]$ uname -a
Linux ocp-gm466-bootstrap 4.18.0-193.29.1.el8_2.x86_64 #1 SMP Thu Oct 22 10:09:53 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
~~~

The reason is again a timeout of the ovsdb-server service:

~~~
[core@ocp-gm466-master-0 ~]$ sudo journalctl -u ovsdb-server
Jan 20 09:29:47 ocp-gm466-master-0 systemd[1]: ovsdb-server.service: Failed with result 'timeout'.
~~~

I retried with 4.6.12, and there the cluster deploys. It looks like 4.6.12 picks up the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1884682 (kernel 4.18.0-193.40.1.el8_2.x86_64), which improves entropy:

~~~
[stack@osp16amd ocp-test4]$ oc get node
NAME                 STATUS   ROLES    AGE   VERSION
ocp-8cfxz-master-0   Ready    master   10m   v1.19.0+9c69bdc
ocp-8cfxz-master-1   Ready    master   10m   v1.19.0+9c69bdc
ocp-8cfxz-master-2   Ready    master   10m   v1.19.0+9c69bdc
[core@ocp-8cfxz-master-2 ~]$ cat /proc/sys/kernel/random/entropy_avail
2040
[core@ocp-8cfxz-master-2 ~]$ cat /proc/cpuinfo | grep rdrand | wc -l 
0
[core@ocp-8cfxz-master-2 ~]$ uname -a
Linux ocp-8cfxz-master-2 4.18.0-193.40.1.el8_2.x86_64 #1 SMP Wed Jan 6 10:54:57 EST 2021 x86_64 x86_64 x86_64 GNU/Linux
~~~

So the entropy trigger for this issue is gone; however, I still consider the missing dependency from ovs-configuration.service to the kubelet the root cause, as various conditions can trigger this timeout or start failure.
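
For illustration only, the dependency described here would look roughly like the hypothetical drop-in below; it is not shipped, and Comment 52 explains why this approach was rejected.

~~~
# HYPOTHETICAL drop-in making kubelet wait for host OVS configuration.
# Shown only to illustrate the proposal; not a supported or shipped change.
cat > /etc/systemd/system/kubelet.service.d/10-after-ovs-configuration.conf <<'EOF'
[Unit]
Requires=ovs-configuration.service
After=ovs-configuration.service
EOF
systemctl daemon-reload
~~~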

Comment 52 Juan Luis de Sousa-Valadas 2021-01-20 13:50:32 UTC
Hello Robert,
First of all, thank you for your excellent analysis; we very rarely see one this thorough.

Regarding:
> However it seems that the kubelet does not have a hard systemd dependency to the ovs-configuration.service, allowing the kubelet to start while OVS is not yet initialized, causing all this mess. So the external OVS stack should be more stable.

and

> So the entropy reason for this issue is gone, however I still consider the missing dependency from ovs-configuration.service to kubelet the root cause as various reasons can trigger this timeout or start issue.

That wouldn't actually solve the problem. The ovs-configuration service is a one-shot service, and if ovsdb-server fails to start, the ovs-configuration service won't start at all (in newer systemd releases we could set Restart=on-abnormal so that wouldn't be an issue, but we cannot do that in RHEL). So if it failed to start and we had kubelet.service requiring ovs-configuration.service, the kubelet wouldn't start at all.

We could make it a hard dependency on OVS, but then the OVS pod would fail to start, and my expectation is that everything else would work while the network operator still reported Degraded, so the installation would still fail. Also, not having the kubelet start is a big issue because it massively affects the debuggability of the node: oc debug node/<node> and oc adm must-gather wouldn't work, among other things. At least with a kubelet running on a broken node we can run hostNetwork: false pods.

Regarding what actually changed, that can't be it: 4.6.12 ships machine-os 46.82.202101131942-0 (Red Hat Enterprise Linux CoreOS), which ships kernel 4.18.0-193.37.1.el8_2, which includes the fix in BZ#1884682.