Bug 2052411
Summary: | Minor update is failing due packets lost | |||
---|---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Eran Kuris <ekuris> | |
Component: | openstack-tripleo-heat-templates | Assignee: | Sofer Athlan-Guyot <sathlang> | |
Status: | CLOSED ERRATA | QA Contact: | Jason Grosso <jgrosso> | |
Severity: | urgent | Docs Contact: | ||
Priority: | urgent | |||
Version: | 16.1 (Train) | CC: | ccamacho, jamsmith, jgrosso, jpretori, mburns, mciecier, ramishra, rheslop, sathlang, spower | |
Target Milestone: | z8 | Keywords: | Triaged | |
Target Release: | 16.1 (Train on RHEL 8.2) | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | openstack-tripleo-heat-templates-11.3.2-1.20220114223345.el8ost | Doc Type: | Enhancement | |
Doc Text: |
As of this release, the Red Hat supported method of updating OVN is aligned to the upstream OVN upgrade steps.
|
Story Points: | --- | |
Clone Of: | ||||
: | 2052576 (view as bug list) | Environment: | ||
Last Closed: | 2022-03-24 11:03:13 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | 2061319 | |||
Bug Blocks: | 2015920, 2052576 |
Description
Eran Kuris
2022-02-09 08:36:25 UTC
Hi,

to test the above review, you need to have this patch in tripleo-upgrade: https://review.opendev.org/c/openstack/tripleo-upgrade/+/829394

When using infrared downstream, this is what needs to be added to the update job:

OOO_UPGRADE_PLUGIN_GERRIT_CHANGE = 829394

This is because the tht review needs a new step during the update to work.

This is the 16.1 manifestation of https://bugzilla.redhat.com/show_bug.cgi?id=2050154, which was filed for 16.2.

(In reply to Sofer Athlan-Guyot from comment #2)
Do we have a patch for the OSP16.1 branch?

Hi,

so the difference between your job and the update job is that we don't modify the kernel argument list during the update and you do, by adding the tsx=on parameter.

Then you hit a bug in tripleo-ansible, as described in https://bugzilla.redhat.com/show_bug.cgi?id=2061319. Basically, the compute node reboots during the update.

This shouldn't happen, as modifying kernel parameters during an update shouldn't reboot nodes.

The fact that it doesn't recover is because the instance isn't set to restart automatically after the hypervisor's reboot:

[stack@undercloud-0 ~]$ openstack --os-cloud overcloud server list
+--------------------------------------+---------------------+---------+--------------------------------------------------+-----------------------------+-----------------------+
| ID                                   | Name                | Status  | Networks                                         | Image                       | Flavor                |
+--------------------------------------+---------------------+---------+--------------------------------------------------+-----------------------------+-----------------------+
| a5610a6e-27bb-4b57-86ff-5c21225d346e | instance_7c2a0e6555 | SHUTOFF | internal_net_7c2a0e6555=192.168.0.50, 10.0.0.205 | upgrade_workload_7c2a0e6555 | v1-512M-5G-7c2a0e6555 |
+--------------------------------------+---------------------+---------+--------------------------------------------------+-----------------------------+-----------------------+

Starting the VM again restores the ping:

openstack --os-cloud overcloud server start instance_7c2a0e6555

[stack@undercloud-0 ~]$ ping 10.0.0.205
PING 10.0.0.205 (10.0.0.205) 56(84) bytes of data.
64 bytes from 10.0.0.205: icmp_seq=1 ttl=63 time=2.59 ms
^C
--- 10.0.0.205 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 2.589/2.589/2.589/0.000 ms

Bottom line: the OVN update is working, but an OSP compute node was rebooted.

Removing the tsx=on parameter during the update should solve your issue, or else deploy with tsx=on.

Note that the tsx flag shouldn't be needed anymore, but https://bugzilla.redhat.com/show_bug.cgi?id=2002346 hasn't been backported to 16.1, so you should deploy with the tsx flag already set.

(In reply to Sofer Athlan-Guyot from comment #12)
Following our IRC chat, we agreed that tsx should be set to on, and that the problem reproduces because of a missing backport: https://bugzilla.redhat.com/show_bug.cgi?id=2061319

So, following my testing, the update process passes when we do not set any kernel argument. If we do set a kernel argument, the issue reproduces, and that is because of another Nova issue: https://bugzilla.redhat.com/show_bug.cgi?id=2061319

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 16.1.8 bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:0986
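
For anyone hitting the same symptom, the two checks discussed in the comments above (whether a node was booted with tsx=on, and whether instances were left in SHUTOFF after a compute node reboot) could be sketched roughly as follows. This is only an illustration, not part of the fix: the function names kernel_has_tsx_on and start_shutoff_instances are hypothetical, the cloud name "overcloud" is taken from the commands in this report, and the listing only covers the current project.

```shell
#!/usr/bin/env bash
# Sketch only; both helper names below are hypothetical, not part of any tool.

# Return 0 if the kernel command line contains tsx=on.
# Run this on the compute node; the file argument defaults to /proc/cmdline.
kernel_has_tsx_on() {
    local cmdline_file="${1:-/proc/cmdline}"
    grep -qw 'tsx=on' "$cmdline_file"
}

# Start every instance that is reported as SHUTOFF.
# Assumes a cloud named "overcloud" unless another name is passed.
start_shutoff_instances() {
    local cloud="${1:-overcloud}"
    local id
    # "-f value -c ID" prints one bare instance ID per line.
    openstack --os-cloud "$cloud" server list --status SHUTOFF -f value -c ID |
    while read -r id; do
        [ -n "$id" ] || continue
        echo "starting $id"
        # Redirect stdin so the client cannot consume the remaining IDs.
        openstack --os-cloud "$cloud" server start "$id" </dev/null
    done
}
```

A possible use is to run kernel_has_tsx_on on each compute node before the update, and start_shutoff_instances overcloud from the undercloud afterwards. Note that the second helper starts every SHUTOFF instance it can see, including ones that were stopped on purpose, so review the list first.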