Bug 2077052

Summary: RHEL 8.6 bump in RHCOS is preventing Azure nodes from (re)booting
Product: OpenShift Container Platform Reporter: Stephen Benjamin <stbenjam>
Component: Networking Assignee: Andreas Karis <akaris>
Networking sub component: ovn-kubernetes QA Contact: Anurag saxena <anusaxen>
Status: CLOSED DUPLICATE Docs Contact:
Severity: high    
Priority: unspecified CC: akaris, bleanhar, dornelas, jligon, jschinta, miabbott, mrussell, nstielau
Version: 4.11   
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-05-03 13:54:04 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 2077605    
Bug Blocks:    
Attachments:
Description Flags
Second boot (after force restarting) none

Description Stephen Benjamin 2022-04-20 14:49:46 UTC
Azure has been having significant problems since the reintroduction of RHEL 8.6 content.

See:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/aggregated-azure-ovn-upgrade-4.11-micro-release-openshift-release-analysis-aggregator/1516440543658250240

It appears that some nodes never come back after being rebooted into the new version of RHCOS, e.g.:

    Node ci-op-l990y20q-99831-k7z6j-master-1 went unready at 2022-04-19T16:53:18Z, never became ready again


I am working on collecting more data, including serial logs, and will provide it as soon as I have it.


Unfinished Jobs

    periodic-ci-openshift-release-master-ci-4.11-e2e-azure-ovn-upgrade/1516440534875377664
    periodic-ci-openshift-release-master-ci-4.11-e2e-azure-ovn-upgrade/1516440538209849344
    periodic-ci-openshift-release-master-ci-4.11-e2e-azure-ovn-upgrade/1516440539908542464
    periodic-ci-openshift-release-master-ci-4.11-e2e-azure-ovn-upgrade/1516440540718043136
    periodic-ci-openshift-release-master-ci-4.11-e2e-azure-ovn-upgrade/1516440542404153344

Comment 1 Stephen Benjamin 2022-04-20 16:15:05 UTC
Created attachment 1873842 [details]
First boot log from worker upgrading to RHCOS based on 8.6

Comment 2 Stephen Benjamin 2022-04-20 16:23:23 UTC
From the first boot after upgrading to 411.86.202204190939-0

* It does get a DHCP lease:

[    7.476859] NetworkManager[809]: <info>  [1650470349.9631] dhcp4 (eth0): state changed new lease, address=10.0.128.4


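Later on, configure-ovs.sh runs. It reverts the OVS configuration (removes the ovs-port-phys0 connection file and deletes br-ex), then reloads NetworkManager by turning networking off and waiting for devices to disconnect: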

[   58.840163] configure-ovs.sh[1733]: Removed nmconnection file /etc/NetworkManager/system-connections/ovs-port-phys0.nmconnection
[   58.840671] configure-ovs.sh[1733]: + nm_config_changed=1
[   58.841126] configure-ovs.sh[1733]: + ovs-vsctl --timeout=30 --if-exists del-br br-ex
[   58.927356] configure-ovs.sh[1733]: + '[' -d /sys/class/net/br-ex1 ']'
[   58.930344] configure-ovs.sh[1733]: + echo 'OVS configuration successfully reverted'
[   58.933182] configure-ovs.sh[1733]: OVS configuration successfully reverted
[   58.933761] configure-ovs.sh[1733]: + reload_nm
[   58.934275] configure-ovs.sh[1733]: + '[' 1 -eq 0 ']'
[   58.934796] configure-ovs.sh[1733]: + nm_config_changed=0
[   58.935893] configure-ovs.sh[1733]: + echo 'Reloading NetworkManager after configuration changes...'
[   58.936924] configure-ovs.sh[1733]: Reloading NetworkManager after configuration changes...
[   58.937889] configure-ovs.sh[1733]: + nmcli network off
[   58.957462] configure-ovs.sh[1733]: + echo 'Waiting for devices to disconnect...'
[   58.960596] configure-ovs.sh[1733]: Waiting for devices to disconnect...
[   58.964345] configure-ovs.sh[1733]: + timeout 60 bash -c 'while nmcli -g DEVICE,STATE d | grep -v :unmanaged; do sleep 5; done'
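
For readers unfamiliar with the last traced command, here is a minimal sketch of the condition that wait loop polls. `nmcli -g DEVICE,STATE d` prints one `DEVICE:STATE` pair per line, and `grep -v :unmanaged` succeeds while at least one device is in any other state, so the loop sleeps in 5-second steps (bounded by `timeout 60`) until every device is unmanaged. The helper name `devices_pending` and the sample device lists are made up for illustration and are not taken from the affected host.

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the loop condition from the trace:
# exits 0 while at least one DEVICE:STATE line is NOT "unmanaged",
# exits 1 once every listed device is unmanaged.
devices_pending() {
  grep -v ':unmanaged'
}

# Illustrative samples only (assumed shape of `nmcli -g DEVICE,STATE d` output).
before='eth0:connected
lo:unmanaged'
after='eth0:unmanaged
lo:unmanaged'

if printf '%s\n' "$before" | devices_pending >/dev/null; then
  echo "before: still waiting"
fi
if ! printf '%s\n' "$after" | devices_pending >/dev/null; then
  echo "after: all devices unmanaged, loop would exit"
fi
```

The trace above ends at this wait, so the sketch only covers the polling condition and says nothing about what the script does afterwards.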



After this, the host never pulls a DHCP lease again. However, if I force reboot the host:

      $ az vm restart --force --name ci-op-547k206c-99831-fdm9f-worker-centralus3-h8mdx --resource-group ci-op-547k206c-99831-fdm9f-rg --subscription 72e3a972-58b0-4afc-bd4f-da89b39ccebd

It does reboot and come back up into RHCOS 411.86.202204190939-0 just fine (see second boot log), and the host becomes ready.

So something seems wrong with the first boot into 411.86.202204190939-0 on OVN.
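
A rough way to check other boot logs for the same pattern is to look for a DHCP "new lease" message after the first `nmcli network off` line. The sketch below assumes the log lines look like the two quoted in this comment. The helper name `lease_after_network_off` is hypothetical, and the match patterns may need adjusting for other log formats.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads a boot log on stdin and reports whether a
# dhcp4 "new lease" line appears after the first "nmcli network off" line.
lease_after_network_off() {
  awk '
    /nmcli network off/        { off = 1; next }   # remember networking was turned off
    off && /dhcp4 .*new lease/ { found = 1 }       # lease seen after that point
    END { print (found ? "lease-after-off" : "no-lease-after-off") }
  '
}

# Two lines copied from the first boot log excerpt above.
log='[    7.476859] NetworkManager[809]: <info>  [1650470349.9631] dhcp4 (eth0): state changed new lease, address=10.0.128.4
[   58.937889] configure-ovs.sh[1733]: + nmcli network off'

printf '%s\n' "$log" | lease_after_network_off
```

To check a saved serial log, redirect the file into the function, for example `lease_after_network_off < first-boot.log` (the file name is a placeholder).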

Comment 3 Stephen Benjamin 2022-04-20 16:24:04 UTC
Created attachment 1873843 [details]
Second boot (after force restarting)

Comment 4 Stephen Benjamin 2022-04-20 16:25:10 UTC
Moving to OVN for them to have a look.

Comment 8 Andreas Karis 2022-05-03 13:54:04 UTC
I'm marking this as a duplicate of 2078866 as the problems are similar enough and the solution is the same.

*** This bug has been marked as a duplicate of bug 2078866 ***