Description of problem:
Upgrade from 4.5 to 4.6 with RHEL workers and the SDN plugin. The ovs pod crashed:

oc logs ovs-r4sd8 -n openshift-sdn
openvswitch is running in systemd
id: openvswitch: no such user

Version-Release number of selected component (if applicable):
4.5.0-0.nightly-2020-10-08-190330 --> 4.6.0-rc.2

How reproducible:
Always

Steps to Reproduce:
1. Upgrade the cluster from 4.5 to 4.6 with RHEL workers and the SDN plugin.

Actual results:
The RHEL worker ovs pod crashed with the logs shown above, and this blocked the upgrade process.

Expected results:
The upgrade completes without the ovs pod crashing.

Additional info:
This issue happens because the RHEL worker has not yet been upgraded to 4.6 and the openvswitch2.13 package is not installed. When I hit the ovs pod crash, I upgraded the RHEL worker to 4.6; after that upgrade finished, the rest of the cluster's workers could continue upgrading and the cluster eventually upgraded successfully. Is there a way to avoid the ovs pod crash before the RHEL worker is upgraded? If not, we should at least tell customers about this situation: if the ovs pod crashes on a RHEL worker during the upgrade to 4.6, it is expected, and upgrading the RHEL worker resolves it.
From the documentation, https://docs.openshift.com/container-platform/4.5/updating/updating-cluster-rhel-compute.html#rhel-compute-updating_updating-cluster-rhel-compute:
>> After you update your cluster, you must update the Red Hat Enterprise Linux (RHEL) compute machines in your cluster
So the cluster is upgraded first, and the RHEL workers afterwards. If that is the documented order, this is an issue.
It looks like openvswitch is upgraded by a playbook after the cluster upgrade. This is the order of operations for a UPI install, so moving this to the installer team.
We're going to have to make sure that the OVS pods in 4.6 maintain compatibility until OVS can be installed on the RHEL workers as part of the RHEL worker upgrade playbooks. I assume the reason this works on RHCOS is that OVS was actually installed in RHCOS 4.5, whereas that wasn't done for RHEL 7 workers.
Zhanqi, can you please provide the systemd journal from one of your nodes, or provide a setup? If openvswitch wasn't installed, I don't see how ovs-configuration.service would have executed and written /var/run/ovs-config-executed, which we use to determine whether OVS is running in systemd.
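For context, a minimal sketch of the check described above, assuming the marker path /var/run/ovs-config-executed from this comment (the real pod entrypoint logic may differ):

```shell
#!/bin/sh
# Sketch of the decision the 4.6 ovs pod makes (assumption: the marker
# path and messages below follow this comment and the reported pod log;
# the actual entrypoint may differ).
ovs_mode() {
  marker="${1:-/var/run/ovs-config-executed}"
  if [ -f "$marker" ]; then
    # ovs-configuration.service ran on the host, so defer to systemd.
    echo "openvswitch is running in systemd"
  else
    # No marker: the pod must manage ovsdb-server/ovs-vswitchd itself.
    echo "starting openvswitch in the pod"
  fi
}
```

This explains why the pod printed "openvswitch is running in systemd" even though the host had no openvswitch package: the marker file existed despite ovs-configuration.service having failed.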
I just went through this, and what I ran into is that the upgrade halts as soon as the dns, network, and machine-config daemonsets roll out. One of the RHEL workers goes NotReady, and none of the daemonsets complete their rollouts. I then attempted to run the upgrade playbooks; however, the NotReady host was never accessible over SSH, even when connecting from another host in the cluster, though it did respond to ICMP. I assume this is because of DNS failure or something along those lines. I'll look into this a bit more tomorrow, but assuming we're only ever down at most one worker node, this seems like it can be worked in via 4.6.z. It should be high priority.
From the RHEL node: the ovs-configuration service is still started even though openvswitch.service is not found.

sh-4.2# systemctl cat ovs-configuration
# /etc/systemd/system/ovs-configuration.service
[Unit]
Description=Configures OVS with proper host networking configuration
# Removal of this file signals firstboot completion
ConditionPathExists=!/etc/ignition-machine-config-encapsulated.json
# This service is used to move a physical NIC into OVS and reconfigure OVS to use the host IP
Requires=openvswitch.service
Wants=NetworkManager-wait-online.service
After=NetworkManager-wait-online.service openvswitch.service network.service
Before=network-online.target kubelet.service crio.service node-valid-hostname.service

[Service]
# Need oneshot to delay kubelet
Type=oneshot
ExecStart=/usr/local/bin/configure-ovs.sh OpenShiftSDN
StandardOutput=journal+console
StandardError=journal+console

[Install]
WantedBy=network-online.target

sh-4.2# systemctl status openvswitch.service
Unit openvswitch.service could not be found.
sh-4.2# systemctl status ovs-configuration
● ovs-configuration.service - Configures OVS with proper host networking configuration
   Loaded: loaded (/etc/systemd/system/ovs-configuration.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Tue 2020-10-13 05:42:04 UTC; 2h 46min ago
  Process: 1031 ExecStart=/usr/local/bin/configure-ovs.sh OpenShiftSDN (code=exited, status=127)
 Main PID: 1031 (code=exited, status=127)

Oct 13 05:42:04 ip-10-0-55-205.us-east-2.compute.internal configure-ovs.sh[1031]: + nmcli connection show ovs-if-phys0
Oct 13 05:42:04 ip-10-0-55-205.us-east-2.compute.internal configure-ovs.sh[1031]: + nmcli connection show ovs-port-br-ex
Oct 13 05:42:04 ip-10-0-55-205.us-east-2.compute.internal configure-ovs.sh[1031]: + nmcli connection show ovs-if-br-ex
Oct 13 05:42:04 ip-10-0-55-205.us-east-2.compute.internal configure-ovs.sh[1031]: + nmcli connection show br-ex
Oct 13 05:42:04 ip-10-0-55-205.us-east-2.compute.internal configure-ovs.sh[1031]: + ovs-vsctl --timeout=30 --if-exists del-br br-int -- --if-exists del-br br-local -- --if-exists del-br br-ex
Oct 13 05:42:04 ip-10-0-55-205.us-east-2.compute.internal systemd[1]: ovs-configuration.service: main process exited, code=exited, status=127/n/a
Oct 13 05:42:04 ip-10-0-55-205.us-east-2.compute.internal configure-ovs.sh[1031]: /usr/local/bin/configure-ovs.sh: line 230: ovs-vsctl: command not found
Oct 13 05:42:04 ip-10-0-55-205.us-east-2.compute.internal systemd[1]: Failed to start Configures OVS with proper host networking configuration.
Oct 13 05:42:04 ip-10-0-55-205.us-east-2.compute.internal systemd[1]: Unit ovs-configuration.service entered failed state.
Oct 13 05:42:04 ip-10-0-55-205.us-east-2.compute.internal systemd[1]: ovs-configuration.service failed.
Turns out this is a systemd bug. The ovs-configuration service will not run if you try to `systemctl start` it (because of the missing openvswitch unit); however, if you reboot the node, it starts anyway. Filed: https://bugzilla.redhat.com/show_bug.cgi?id=1888017. Instead of waiting for systemd to fix the bug and backport it, we can add a workaround in configure-ovs.sh to check whether openvswitch is installed.
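A hedged sketch of such a workaround; the function name and message below are illustrative, and the actual patch to configure-ovs.sh may differ. The point is to exit cleanly instead of failing with status 127 when ovs-vsctl is absent:

```shell
#!/bin/sh
# Illustrative guard for configure-ovs.sh (assumption: function name and
# message are hypothetical; the real patch may be structured differently).
# Returns 1 when OVS tools are missing so the caller can skip configuration
# gracefully rather than dying with "ovs-vsctl: command not found" (127).
check_ovs_installed() {
  if ! command -v ovs-vsctl >/dev/null 2>&1; then
    echo "openvswitch is not installed, skipping OVS configuration"
    return 1
  fi
  return 0
}
```

On a 4.5 RHEL 7 worker that has not yet run the 4.6 upgrade playbooks, this would let the service exit without entering a failed state.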
This is only really testable on the transition from 4.5 to 4.6, as 4.6 RHEL workers would already have openvswitch installed. I've tested an upgrade from 4.5 (with the backported patch) to 4.6 and it was successful. I'm going to leave this bug ON_QA so that QE can take another look if they wish, but I'm going to move forward with ensuring that the 4.6 backport can merge by overriding the bugzilla/valid-bug flag on the dependent bug.
Tried to upgrade from 4.5 to 4.7: the cluster with RHEL workers upgraded successfully and this issue was not reproduced. Since there is currently no 4.7 puddle repo in http://download.eng.bos.redhat.com/rcm-guest/puddles/RHAOS/AtomicOpenShift/ , I still cannot do the RHEL worker upgrade, but the original issue should be fixed. Moving this bug to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633