Bug 2077052 - RHEL 8.6 bump in RHCOS is preventing Azure nodes from (re)booting
Keywords:
Status: CLOSED DUPLICATE of bug 2078866
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Andreas Karis
QA Contact: Anurag saxena
URL:
Whiteboard:
Depends On: 2077605
Blocks:
Reported: 2022-04-20 14:49 UTC by Stephen Benjamin
Modified: 2022-06-01 10:04 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-05-03 13:54:04 UTC
Target Upstream Version:
Embargoed:


Attachments
Second boot (after force restarting) (352.42 KB, text/plain)
2022-04-20 16:24 UTC, Stephen Benjamin


Links
System: GitHub
ID: openshift machine-config-operator pull 3120
Private: 0
Priority: None
Status: open
Summary: Bug 2078866: configure-ovs: avoid restarting NetworkManager
Last Updated: 2022-04-27 16:01:03 UTC

Description Stephen Benjamin 2022-04-20 14:49:46 UTC
Azure has been having significant problems since the reintroduction of RHEL 8.6 content.

See:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/aggregated-azure-ovn-upgrade-4.11-micro-release-openshift-release-analysis-aggregator/1516440543658250240

It appears that some nodes never come back after being rebooted into the new version of RHCOS, e.g.:

    Node ci-op-l990y20q-99831-k7z6j-master-1 went unready at 2022-04-19T16:53:18Z, never became ready again


I am working on collecting more data, including serial logs, and will provide it as soon as I have it.


Unfinished Jobs

    periodic-ci-openshift-release-master-ci-4.11-e2e-azure-ovn-upgrade/1516440534875377664
    periodic-ci-openshift-release-master-ci-4.11-e2e-azure-ovn-upgrade/1516440538209849344
    periodic-ci-openshift-release-master-ci-4.11-e2e-azure-ovn-upgrade/1516440539908542464
    periodic-ci-openshift-release-master-ci-4.11-e2e-azure-ovn-upgrade/1516440540718043136
    periodic-ci-openshift-release-master-ci-4.11-e2e-azure-ovn-upgrade/1516440542404153344

Comment 1 Stephen Benjamin 2022-04-20 16:15:05 UTC
Created attachment 1873842
First boot log from worker upgrading to RHCOS based on 8.6

Comment 2 Stephen Benjamin 2022-04-20 16:23:23 UTC
From the first boot after upgrading to 411.86.202204190939-0:

* It does get a DHCP lease:

[    7.476859] NetworkManager[809]: <info>  [1650470349.9631] dhcp4 (eth0): state changed new lease, address=10.0.128.4


Later on, it runs configure-ovs.sh, which reverts the OVS configuration and then reloads NetworkManager, starting with "nmcli network off":

[   58.840163] configure-ovs.sh[1733]: Removed nmconnection file /etc/NetworkManager/system-connections/ovs-port-phys0.nmconnection
[   58.840671] configure-ovs.sh[1733]: + nm_config_changed=1
[   58.841126] configure-ovs.sh[1733]: + ovs-vsctl --timeout=30 --if-exists del-br br-ex
[   58.927356] configure-ovs.sh[1733]: + '[' -d /sys/class/net/br-ex1 ']'
[   58.930344] configure-ovs.sh[1733]: + echo 'OVS configuration successfully reverted'
[   58.933182] configure-ovs.sh[1733]: OVS configuration successfully reverted
[   58.933761] configure-ovs.sh[1733]: + reload_nm
[   58.934275] configure-ovs.sh[1733]: + '[' 1 -eq 0 ']'
[   58.934796] configure-ovs.sh[1733]: + nm_config_changed=0
[   58.935893] configure-ovs.sh[1733]: + echo 'Reloading NetworkManager after configuration changes...'
[   58.936924] configure-ovs.sh[1733]: Reloading NetworkManager after configuration changes...
[   58.937889] configure-ovs.sh[1733]: + nmcli network off
[   58.957462] configure-ovs.sh[1733]: + echo 'Waiting for devices to disconnect...'
[   58.960596] configure-ovs.sh[1733]: Waiting for devices to disconnect...
[   58.964345] configure-ovs.sh[1733]: + timeout 60 bash -c 'while nmcli -g DEVICE,STATE d | grep -v :unmanaged; do sleep 5; done'
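
For anyone reading the last traced command: the loop keeps sleeping as long as "nmcli -g DEVICE,STATE d" lists at least one device whose state is not "unmanaged", and "timeout 60" bounds the whole wait. A minimal sketch of just that condition follows. The helper name devices_still_managed and the sample device lines are made up for illustration and are fed in via printf so the snippet runs without NetworkManager.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of configure-ovs.sh): reads DEVICE:STATE lines
# on stdin and returns 0 if any line is NOT in the "unmanaged" state.
# This mirrors the exit status of `grep -v :unmanaged` in the traced loop.
devices_still_managed() {
    grep -v ':unmanaged' >/dev/null
}

# Made-up sample: eth0 is still connected, so the loop would keep waiting.
printf 'eth0:connected\nlo:unmanaged\n' | devices_still_managed \
    && echo "still waiting"

# Made-up sample: every device is unmanaged, so the loop would exit.
printf 'eth0:unmanaged\nlo:unmanaged\n' | devices_still_managed \
    || echo "all devices unmanaged"
```

The first pipeline prints "still waiting" and the second prints "all devices unmanaged". Note that the real loop also echoes the remaining device lines to the log, since it does not discard grep's output.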



After this, the host never pulls a DHCP lease again. However, if I force reboot the host:

      $ az vm restart --force --name ci-op-547k206c-99831-fdm9f-worker-centralus3-h8mdx --resource-group ci-op-547k206c-99831-fdm9f-rg --subscription 72e3a972-58b0-4afc-bd4f-da89b39ccebd

It does reboot and come back up into RHCOS 411.86.202204190939-0 just fine (see second boot log), and the host becomes ready.

So something seems wrong with first boot in 411.86.202204190939-0 on OVN.
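
To make the "never pulls a DHCP lease again" check repeatable on other serial logs, something like the following could be used. This is only a sketch: the function name leases_after_network_off is hypothetical, and it assumes the log contains the same "nmcli network off" trace line and "dhcp4 ... state changed new lease" message format as the excerpts above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count DHCPv4 "new lease" messages that appear after the
# LAST "nmcli network off" trace line in a saved serial console log.
# Prints 0 if no lease follows it (or if the trace line never appears).
leases_after_network_off() {
    awk '
        /nmcli network off/ { count = 0; seen = 1; next }   # restart the count
        seen && /dhcp4 .*state changed new lease/ { count++ }
        END { print count + 0 }
    ' "$1"
}
```

Called as, for example, "leases_after_network_off first-boot.log" (the file name is a placeholder for a saved copy of the serial log), a result of 0 would match the behaviour described for the first boot. It does not by itself explain why the lease is not renewed.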

Comment 3 Stephen Benjamin 2022-04-20 16:24:04 UTC
Created attachment 1873843
Second boot (after force restarting)

Comment 4 Stephen Benjamin 2022-04-20 16:25:10 UTC
Moving to OVN for them to have a look.

Comment 8 Andreas Karis 2022-05-03 13:54:04 UTC
I'm marking this as a duplicate of 2078866 as the problems are similar enough and the solution is the same.

*** This bug has been marked as a duplicate of bug 2078866 ***

