2077052 – RHEL 8.6 bump in RHCOS is preventing Azure nodes from (re)booting

Keywords:
Status:	CLOSED DUPLICATE of bug 2078866
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.11
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Andreas Karis
QA Contact:	Anurag saxena
Docs Contact:
URL:
Whiteboard:
Depends On:	2077605
Blocks:

Reported:	2022-04-20 14:49 UTC by Stephen Benjamin
Modified:	2022-06-01 10:04 UTC
CC List:	8 users
Fixed In Version:
Doc Text:
Clone Of:
Environment:
Last Closed:	2022-05-03 13:54:04 UTC
Target Upstream Version:
Embargoed:

Attachments
Second boot (after force restarting) (352.42 KB, text/plain) 2022-04-20 16:24 UTC, Stephen Benjamin

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift machine-config-operator pull 3120	0	None	open	Bug 2078866: configure-ovs: avoid restarting NetworkManager	2022-04-27 16:01:03 UTC

Description Stephen Benjamin 2022-04-20 14:49:46 UTC

Azure is having significant problems since the re-introduction of RHEL 8.6 content.

See:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/aggregated-azure-ovn-upgrade-4.11-micro-release-openshift-release-analysis-aggregator/1516440543658250240

It appears that some nodes never come back after being rebooted into the new version of RHCOS, e.g.:

    Node ci-op-l990y20q-99831-k7z6j-master-1 went unready at 2022-04-19T16:53:18Z, never became ready again


I am working on collecting more data, including serial logs, and will provide it as soon as I have it.


Unfinished Jobs

    periodic-ci-openshift-release-master-ci-4.11-e2e-azure-ovn-upgrade/1516440534875377664
    periodic-ci-openshift-release-master-ci-4.11-e2e-azure-ovn-upgrade/1516440538209849344
    periodic-ci-openshift-release-master-ci-4.11-e2e-azure-ovn-upgrade/1516440539908542464
    periodic-ci-openshift-release-master-ci-4.11-e2e-azure-ovn-upgrade/1516440540718043136
    periodic-ci-openshift-release-master-ci-4.11-e2e-azure-ovn-upgrade/1516440542404153344

Comment 1 Stephen Benjamin 2022-04-20 16:15:05 UTC

Created attachment 1873842
First boot log from worker upgrading to RHCOS based on 8.6

Comment 2 Stephen Benjamin 2022-04-20 16:23:23 UTC

On the first boot after upgrading to 411.86.202204190939-0, the node does get a DHCP lease:

[    7.476859] NetworkManager[809]: <info>  [1650470349.9631] dhcp4 (eth0): state changed new lease, address=10.0.128.4


Later on, it runs configure-ovs.sh, which removes an OVS nmconnection file, reverts the OVS configuration, and then reloads NetworkManager by turning networking off:

[   58.840163] configure-ovs.sh[1733]: Removed nmconnection file /etc/NetworkManager/system-connections/ovs-port-phys0.nmconnection
[   58.840671] configure-ovs.sh[1733]: + nm_config_changed=1
[   58.841126] configure-ovs.sh[1733]: + ovs-vsctl --timeout=30 --if-exists del-br br-ex
[   58.927356] configure-ovs.sh[1733]: + '[' -d /sys/class/net/br-ex1 ']'
[   58.930344] configure-ovs.sh[1733]: + echo 'OVS configuration successfully reverted'
[   58.933182] configure-ovs.sh[1733]: OVS configuration successfully reverted
[   58.933761] configure-ovs.sh[1733]: + reload_nm
[   58.934275] configure-ovs.sh[1733]: + '[' 1 -eq 0 ']'
[   58.934796] configure-ovs.sh[1733]: + nm_config_changed=0
[   58.935893] configure-ovs.sh[1733]: + echo 'Reloading NetworkManager after configuration changes...'
[   58.936924] configure-ovs.sh[1733]: Reloading NetworkManager after configuration changes...
[   58.937889] configure-ovs.sh[1733]: + nmcli network off
[   58.957462] configure-ovs.sh[1733]: + echo 'Waiting for devices to disconnect...'
[   58.960596] configure-ovs.sh[1733]: Waiting for devices to disconnect...
[   58.964345] configure-ovs.sh[1733]: + timeout 60 bash -c 'while nmcli -g DEVICE,STATE d | grep -v :unmanaged; do sleep 5; done'
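
For anyone following the trace: the last command is a wait loop that keeps polling for as long as grep still finds a device whose state is not unmanaged, until the 60-second timeout expires. The sketch below reproduces only the grep filter on made-up DEVICE:STATE lines, so it runs without NetworkManager. The helper name and the sample values are assumptions for illustration, not taken from configure-ovs.sh.

```shell
# Hypothetical helper mirroring the grep filter in the trace above.
# Reads DEVICE:STATE lines on stdin, prints those that are not
# unmanaged, and returns 0 if any were printed, 1 if none remain.
devices_still_managed() {
  grep -v :unmanaged
}

# Made-up sample: eth0 is still connected, so the wait loop would keep polling.
if printf '%s\n' 'eth0:connected' 'lo:unmanaged' | devices_still_managed; then
  echo 'loop continues'
fi

# Made-up sample: every device is unmanaged, so the wait loop would exit.
if ! printf '%s\n' 'eth0:unmanaged' 'lo:unmanaged' | devices_still_managed; then
  echo 'loop exits'
fi
```

In other words, the loop only ends early once every device reports unmanaged; otherwise timeout stops it after 60 seconds.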



After configure-ovs.sh runs nmcli network off, the host never obtains a DHCP lease again. However, if I force reboot the host:

      $ az vm restart --force --name ci-op-547k206c-99831-fdm9f-worker-centralus3-h8mdx --resource-group ci-op-547k206c-99831-fdm9f-rg --subscription 72e3a972-58b0-4afc-bd4f-da89b39ccebd

It does reboot and come back up into RHCOS 411.86.202204190939-0 just fine (see second boot log), and the host becomes ready.

So something seems wrong with first boot in 411.86.202204190939-0 on OVN.
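
To check other serial logs for the same pattern, something like the following could help. It is a rough sketch, not a tool used in this investigation: the function name and the three result strings are made up, and it assumes the log lines look like the excerpts quoted above (a configure-ovs.sh line containing "nmcli network off", and NetworkManager lines containing "dhcp4 (...): state changed new lease").

```shell
# Hypothetical helper: report whether a boot log shows a new DHCP lease
# after configure-ovs.sh turned networking off.
# Reads the files given as arguments, or stdin when none are given.
lease_after_network_off() {
  awk '
    # Remember that configure-ovs.sh turned networking off.
    /configure-ovs\.sh.*nmcli network off/ { off = 1; next }
    # Only leases logged after that point count.
    off && /dhcp4 \(.*\): state changed new lease/ { found = 1 }
    END {
      if (!off)       print "no-network-off"
      else if (found) print "lease-renewed"
      else            print "no-lease-after-off"
    }
  ' "$@"
}

# Two lines copied from the first-boot excerpt in this comment.
printf '%s\n' \
  '[    7.476859] NetworkManager[809]: <info>  [1650470349.9631] dhcp4 (eth0): state changed new lease, address=10.0.128.4' \
  '[   58.937889] configure-ovs.sh[1733]: + nmcli network off' \
  | lease_after_network_off
```

On those two lines it prints no-lease-after-off, since the only lease appears before networking was turned off. A log file can also be passed as an argument instead of piping it in.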

Comment 3 Stephen Benjamin 2022-04-20 16:24:04 UTC

Created attachment 1873843
Second boot (after force restarting)

Comment 4 Stephen Benjamin 2022-04-20 16:25:10 UTC

Moving to OVN for them to have a look.

Comment 8 Andreas Karis 2022-05-03 13:54:04 UTC

I'm marking this as a duplicate of 2078866 as the problems are similar enough and the solution is the same.

*** This bug has been marked as a duplicate of bug 2078866 ***
