Bug 1887040 - [upgrade] ovs pod crash for rhel worker when upgrading from 4.5 to 4.6
Summary: [upgrade] ovs pod crash for rhel worker when upgrading from 4.5 to 4.6
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.7.0
Assignee: Tim Rozet
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks: 1887607
 
Reported: 2020-10-10 09:28 UTC by zhaozhanqi
Modified: 2021-02-24 15:25 UTC
CC List: 8 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 1887607
Environment:
Last Closed: 2021-02-24 15:24:43 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift/machine-config-operator pull 2154 (closed): Bug 1887040: OVS config: check if OVS is installed (last updated 2021-02-18 01:34:41 UTC)
Red Hat Product Errata RHSA-2020:5633 (last updated 2021-02-24 15:25:21 UTC)

Description zhaozhanqi 2020-10-10 09:28:49 UTC
Description of problem:
Upgrade from 4.5 to 4.6 on a cluster with RHEL workers and the SDN plugin.

The ovs pod crashed with the following error:

oc logs ovs-r4sd8 -n openshift-sdn
openvswitch is running in systemd
id: openvswitch: no such user


Version-Release number of selected component (if applicable):

4.5.0-0.nightly-2020-10-08-190330  --> 4.6.0-rc.2

How reproducible:
always

Steps to Reproduce:
1. Upgrade the cluster from 4.5 to 4.6 with RHEL workers and the SDN plugin.

Actual results:
The ovs pod on the RHEL worker crashed with the following logs:
oc logs ovs-r4sd8 -n openshift-sdn
openvswitch is running in systemd
id: openvswitch: no such user


This blocked the upgrade process.



Expected results:


Additional info:
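The openvswitch system user is normally created when the openvswitch RPM is installed, so on a RHEL 7 worker that does not yet have openvswitch2.13 any lookup of that user fails, which is where the second log line comes from. A few standard commands to confirm the missing prerequisite on an affected node (a diagnostic sketch; outputs will vary per node):

rpm -q openvswitch2.13      # reports the package is not installed on the 4.5 RHEL worker
getent passwd openvswitch   # prints nothing, since the RPM that would create the user is absent
id openvswitch              # fails with "id: openvswitch: no such user", as in the pod log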

Comment 1 zhaozhanqi 2020-10-10 10:22:24 UTC
This issue happens because the RHEL worker has not yet been upgraded to 4.6 and the openvswitch2.13 package is not installed.
When I hit the crashed ovs pod, I upgraded the RHEL worker to 4.6. After the RHEL worker upgrade finished, the rest of the cluster's workers could continue upgrading, and the cluster eventually upgraded successfully.

Is there a way to avoid the ovs pod crash before the RHEL worker is upgraded? If not, we should at least tell customers about this situation: if the ovs pod crashes on a RHEL worker during an upgrade to 4.6, it is expected, and upgrading the RHEL worker resolves the issue.

Comment 2 zhaozhanqi 2020-10-12 01:49:44 UTC
From the documentation: https://docs.openshift.com/container-platform/4.5/updating/updating-cluster-rhel-compute.html#rhel-compute-updating_updating-cluster-rhel-compute

 >> After you update your cluster, you must update the Red Hat Enterprise Linux (RHEL) compute machines in your cluster

It says to upgrade the cluster first and then upgrade the RHEL workers. If that is the order, this is an issue.

Comment 3 Tim Rozet 2020-10-12 14:35:51 UTC
It looks like openvswitch is upgraded by a playbook after the cluster upgrade. That is the order of operations for a UPI install, so moving this to the installer team.

Comment 4 Scott Dodson 2020-10-12 17:15:50 UTC
We're going to have to make sure that the OVS pods in 4.6 maintain compatibility until OVS can be installed on the RHEL workers as part of the RHEL worker upgrade playbooks. I assume the reason this works on RHCOS is that OVS was already installed in RHCOS 4.5, whereas that wasn't done for RHEL 7 workers.

Comment 5 Tim Rozet 2020-10-12 22:45:56 UTC
Zhanqi, can you please provide the systemd journal from one of your nodes, or access to a setup? If openvswitch wasn't installed, I don't see how ovs-configuration.service would have executed and written /var/run/ovs-config-executed, which we use to determine whether OVS is running in systemd.
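A minimal sketch of how the ovs pod's startup logic might key off that flag file (the flow and names here are assumptions; only /var/run/ovs-config-executed and the two log lines are taken from this bug):

# Hypothetical reconstruction, not the literal openshift-sdn entrypoint.
if [ -f /var/run/ovs-config-executed ]; then
  echo "openvswitch is running in systemd"
  # Later steps assume the host openvswitch RPM is present, for example
  # resolving the openvswitch user for file ownership; on a 4.5 RHEL
  # worker without the RPM this is what fails:
  id openvswitch            # -> "id: openvswitch: no such user"
else
  # Otherwise the pod would start the OVS daemons inside the container
  # (omitted in this sketch).
  :
fi

This would explain why both messages appear together on a not-yet-upgraded RHEL worker: the flag file says OVS runs in systemd, but the user and tools the openvswitch RPM would provide are missing.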

Comment 7 Scott Dodson 2020-10-13 01:23:06 UTC
I just went through this, and what I ran into is that the upgrade halts as soon as the dns, network, and machine-config daemonsets start rolling out. One of the RHEL workers goes NotReady and none of the daemonsets complete their rollouts. I then attempted to run the upgrade playbooks; however, the NotReady host was never accessible over SSH, even when connecting from another host in the cluster, though it did respond to ICMP. I assume this is because of a DNS failure or something along those lines.

I'll look into this a bit more tomorrow, but assuming that we're only ever down at most one worker node, this seems like something that can be worked in via 4.6.z, though it should be high priority.

Comment 9 zhaozhanqi 2020-10-13 08:29:52 UTC
From the RHEL node:

The ovs-configuration service is still started even though openvswitch.service is not found:

sh-4.2# systemctl cat ovs-configuration
# /etc/systemd/system/ovs-configuration.service
[Unit]
Description=Configures OVS with proper host networking configuration
# Removal of this file signals firstboot completion
ConditionPathExists=!/etc/ignition-machine-config-encapsulated.json
# This service is used to move a physical NIC into OVS and reconfigure OVS to use the host IP
Requires=openvswitch.service
Wants=NetworkManager-wait-online.service
After=NetworkManager-wait-online.service openvswitch.service network.service
Before=network-online.target kubelet.service crio.service node-valid-hostname.service

[Service]
# Need oneshot to delay kubelet
Type=oneshot
ExecStart=/usr/local/bin/configure-ovs.sh OpenShiftSDN
StandardOutput=journal+console
StandardError=journal+console

[Install]
WantedBy=network-online.target
sh-4.2# systemctl status openvswitch.service
Unit openvswitch.service could not be found.
sh-4.2# systemctl status ovs-configuration
● ovs-configuration.service - Configures OVS with proper host networking configuration
   Loaded: loaded (/etc/systemd/system/ovs-configuration.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Tue 2020-10-13 05:42:04 UTC; 2h 46min ago
  Process: 1031 ExecStart=/usr/local/bin/configure-ovs.sh OpenShiftSDN (code=exited, status=127)
 Main PID: 1031 (code=exited, status=127)

Oct 13 05:42:04 ip-10-0-55-205.us-east-2.compute.internal configure-ovs.sh[1031]: + nmcli connection show ovs-if-phys0
Oct 13 05:42:04 ip-10-0-55-205.us-east-2.compute.internal configure-ovs.sh[1031]: + nmcli connection show ovs-port-br-ex
Oct 13 05:42:04 ip-10-0-55-205.us-east-2.compute.internal configure-ovs.sh[1031]: + nmcli connection show ovs-if-br-ex
Oct 13 05:42:04 ip-10-0-55-205.us-east-2.compute.internal configure-ovs.sh[1031]: + nmcli connection show br-ex
Oct 13 05:42:04 ip-10-0-55-205.us-east-2.compute.internal configure-ovs.sh[1031]: + ovs-vsctl --timeout=30 --if-exists del-br br-int -- --if-exists del-br br-local -- --if-exists del-br br-ex
Oct 13 05:42:04 ip-10-0-55-205.us-east-2.compute.internal systemd[1]: ovs-configuration.service: main process exited, code=exited, status=127/n/a
Oct 13 05:42:04 ip-10-0-55-205.us-east-2.compute.internal configure-ovs.sh[1031]: /usr/local/bin/configure-ovs.sh: line 230: ovs-vsctl: command not found
Oct 13 05:42:04 ip-10-0-55-205.us-east-2.compute.internal systemd[1]: Failed to start Configures OVS with proper host networking configuration.
Oct 13 05:42:04 ip-10-0-55-205.us-east-2.compute.internal systemd[1]: Unit ovs-configuration.service entered failed state.
Oct 13 05:42:04 ip-10-0-55-205.us-east-2.compute.internal systemd[1]: ovs-configuration.service failed.

Comment 10 Tim Rozet 2020-10-13 20:02:47 UTC
Turns out this is a systemd bug. The ovs-configuration service will not run if you try to systemctl start it (because of the missing openvswitch unit); however, if you reboot the node, it starts anyway. Filed:
https://bugzilla.redhat.com/show_bug.cgi?id=1888017

Instead of waiting for systemd to fix the bug and backport the fix, we can add a workaround in configure-ovs.sh to check whether openvswitch is installed.
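A minimal sketch of that kind of guard near the top of configure-ovs.sh, assuming an rpm query is used to detect the package (the actual change in machine-config-operator pull 2154 may differ in wording and placement):

# Hypothetical guard: bail out cleanly when openvswitch is not installed,
# instead of failing later with "ovs-vsctl: command not found" (status 127).
if ! rpm -qa | grep -q "^openvswitch"; then
  echo "openvswitch is not installed, skipping OVS configuration"
  exit 0
fi

With a guard like this, the service exits successfully on a 4.5 RHEL worker that has not yet run the 4.6 worker playbooks, and OVS configuration can proceed once the worker playbooks install openvswitch2.13.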

Comment 12 Scott Dodson 2020-10-21 18:31:01 UTC
This is only really testable on the transition from 4.5 to 4.6, as 4.6 RHEL workers would have openvswitch installed. I've tested an upgrade from 4.5 to 4.6 with the backported patch and it was successful. I'm going to leave this bug ON_QA so that QE can take another look if they wish, but I'm going to move forward with ensuring that the 4.6 backport can merge by overriding the bugzilla/valid-bug flag on the dependent bug.

Comment 13 zhaozhanqi 2020-10-22 08:48:25 UTC
Tried an upgrade from 4.5 to 4.7: the cluster with RHEL workers upgraded successfully and this issue was not reproduced. Since there is no 4.7 puddle repo in http://download.eng.bos.redhat.com/rcm-guest/puddles/RHAOS/AtomicOpenShift/ yet, I still cannot do the RHEL worker upgrade, but the original issue should be fixed.

Moving this bug to 'verified'.

Comment 16 errata-xmlrpc 2021-02-24 15:24:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

