Description of problem:
The minor update from the latest_cdn puddle (RHOS-16.1-RHEL-8-20211126.n.1) to the latest z2 candidate (RHOS-16.1-RHEL-8-20220203.n.1) is failing with about 75% packet loss.

2022-02-08 14:05:08.529 | TASK [tripleo-upgrade : stop l3 agent connectivity check] **********************
2022-02-08 14:05:08.532 | task path: /home/rhos-ci/jenkins/workspace/DFG-network-networking-ovn-update-16.1_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/infrared/plugins/tripleo-upgrade/infrared_plugin/roles/tripleo-upgrade/tasks/common/l3_agent_connectivity_check_stop_script.yml:2
2022-02-08 14:05:08.534 | Tuesday 08 February 2022 14:05:08 +0000 (0:23:23.283) 2:01:00.898 ******
2022-02-08 14:05:08.876 | fatal: [undercloud-0]: FAILED! => {
2022-02-08 14:05:08.878 |     "changed": true,
2022-02-08 14:05:08.881 |     "cmd": "source /home/stack/overcloudrc\n/home/stack/l3_agent_stop_ping.sh 0\n",
2022-02-08 14:05:08.883 |     "delta": "0:00:00.072742",
2022-02-08 14:05:08.886 |     "end": "2022-02-08 14:05:08.850054",
2022-02-08 14:05:08.888 |     "rc": 1,
2022-02-08 14:05:08.891 |     "start": "2022-02-08 14:05:08.777312"
2022-02-08 14:05:08.893 | }
2022-02-08 14:05:08.896 |
2022-02-08 14:05:08.898 | STDOUT:
2022-02-08 14:05:08.901 |
2022-02-08 14:05:08.903 | 1374 packets transmitted, 337 received, +636 errors, 75.4731% packet loss, time 1685ms
2022-02-08 14:05:08.905 | rtt min/avg/max/mdev = 0.459/0.870/7.533/0.605 ms, pipe 4
2022-02-08 14:05:08.908 | Ping loss higher than 0 seconds detected (765 seconds)
2022-02-08 14:05:08.910 |
2022-02-08 14:05:08.912 |
2022-02-08 14:05:08.915 | MSG:
2022-02-08 14:05:08.917 |
2022-02-08 14:05:08.919 | non-zero return code
2022-02-08 14:05:08.921 |
2022-02-08 14:05:08.924 | PLAY RECAP *********************************************************************
2022-02-08 14:05:08.926 | undercloud-0 : ok=100 changed=38 unreachable=0 failed=1 skipped=85 rescued=0 ignored=7

Version-Release number of selected component (if applicable):

How reproducible:
100%

Steps to Reproduce:
1. Run the automated minor update job.

Actual results:

Expected results:

Additional info:
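
For readers checking the figure in the log, the reported loss follows directly from the transmitted and received counts in the ping summary line. The snippet below is only an illustrative sketch, not part of the CI job: the function name `packet_loss_percent` and the regular expression are assumptions made here, and only the summary line itself is taken from the log above.

```python
import re

# Matches the "N packets transmitted, M received" part of a ping summary line.
# The pattern is an assumption based on the single line seen in the job log.
SUMMARY_RE = re.compile(r"(?P<tx>\d+) packets transmitted, (?P<rx>\d+) received")


def packet_loss_percent(summary: str) -> float:
    """Return the packet loss percentage computed from a ping summary line."""
    match = SUMMARY_RE.search(summary)
    if match is None:
        raise ValueError("no ping summary found")
    tx = int(match.group("tx"))
    rx = int(match.group("rx"))
    if tx == 0:
        raise ValueError("no packets transmitted")
    # Loss is the share of transmitted packets that never got a reply.
    return 100.0 * (tx - rx) / tx


# Summary line copied from the job log in the description.
line = ("1374 packets transmitted, 337 received, +636 errors, "
        "75.4731% packet loss, time 1685ms")
print(f"{packet_loss_percent(line):.4f}% packet loss")
```

With 1374 packets sent and 337 received, 1037 packets were lost, which is the roughly 75% reported by the job.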
Hi,

to test the above review you need to have this patch in tripleo-upgrade: https://review.opendev.org/c/openstack/tripleo-upgrade/+/829394

When using infrared downstream, this is what needs to be added to the update job:

OOO_UPGRADE_PLUGIN_GERRIT_CHANGE = 829394

This is because the THT review needs a new step during the update to work.
This is the 16.1 manifestation of https://bugzilla.redhat.com/show_bug.cgi?id=2050154 for 16.2.
(In reply to Sofer Athlan-Guyot from comment #2)
> Hi,
>
> to test the above review you need to have that patch in Tripleo-upgrade
> https://review.opendev.org/c/openstack/tripleo-upgrade/+/829394
>
> When using infrared downstream this is what is needed to the update job:
>
> OOO_UPGRADE_PLUGIN_GERRIT_CHANGE = 829394
>
> This is so because the tht review need a new step during the update to work.

Do we have a patch for the OSP16.1 branch?
Hi,

so the difference between your job and the update job is that we don't modify the kernel argument list during the update and you do, by adding the tsx=on parameter.

You then hit a bug in tripleo-ansible, described in https://bugzilla.redhat.com/show_bug.cgi?id=2061319. Basically, the compute node reboots during the update.

This shouldn't happen, as modifying kernel parameters during an update shouldn't reboot nodes.

The ping doesn't recover because the instance isn't set to start again automatically after the hypervisor reboots:

[stack@undercloud-0 ~]$ openstack --os-cloud overcloud server list
+--------------------------------------+---------------------+---------+--------------------------------------------------+-----------------------------+-----------------------+
| ID                                   | Name                | Status  | Networks                                         | Image                       | Flavor                |
+--------------------------------------+---------------------+---------+--------------------------------------------------+-----------------------------+-----------------------+
| a5610a6e-27bb-4b57-86ff-5c21225d346e | instance_7c2a0e6555 | SHUTOFF | internal_net_7c2a0e6555=192.168.0.50, 10.0.0.205 | upgrade_workload_7c2a0e6555 | v1-512M-5G-7c2a0e6555 |
+--------------------------------------+---------------------+---------+--------------------------------------------------+-----------------------------+-----------------------+

Starting the VM again restores the ping:

openstack --os-cloud overcloud server start instance_7c2a0e6555

[stack@undercloud-0 ~]$ ping 10.0.0.205
PING 10.0.0.205 (10.0.0.205) 56(84) bytes of data.
64 bytes from 10.0.0.205: icmp_seq=1 ttl=63 time=2.59 ms
^C
--- 10.0.0.205 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 2.589/2.589/2.589/0.000 ms

Bottom line: the OVN update is working, but an OSP compute node gets rebooted.

Removing the tsx=on parameter during the update should solve your issue, or else deploy with tsx=on from the start.

Note that the tsx flag shouldn't be needed anymore, but the fix in https://bugzilla.redhat.com/show_bug.cgi?id=2002346 hasn't been backported to 16.1, so you should deploy with the tsx flag already set.
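
The manual recovery described in this comment (spot the SHUTOFF instance, then start it) could be scripted along the following lines. This is a hedged sketch rather than anything used in the job: it assumes the server list was saved with `openstack --os-cloud overcloud server list -f json` and that the JSON keys match the table headers shown above ("Name", "Status"). The helper name `start_commands` and the second sample entry are hypothetical. The function only builds the command strings and does not run them.

```python
import json
from typing import List


def start_commands(server_list_json: str, cloud: str = "overcloud") -> List[str]:
    """Build one 'server start' command per instance reported as SHUTOFF."""
    servers = json.loads(server_list_json)
    return [
        f"openstack --os-cloud {cloud} server start {server['Name']}"
        for server in servers
        if server.get("Status") == "SHUTOFF"
    ]


# First entry mirrors the instance from the comment above; the second entry
# is a made-up ACTIVE instance, included only to show that it is skipped.
sample = json.dumps([
    {"ID": "a5610a6e-27bb-4b57-86ff-5c21225d346e",
     "Name": "instance_7c2a0e6555",
     "Status": "SHUTOFF"},
    {"ID": "hypothetical-id",
     "Name": "instance_example",
     "Status": "ACTIVE"},
])

for command in start_commands(sample):
    print(command)
```

Printing the commands instead of executing them keeps the sketch safe to run anywhere; on an undercloud the resulting lines could be reviewed and then run by hand.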
(In reply to Sofer Athlan-Guyot from comment #12)
> Hi,
>
> so the difference between your job and update job is that we don't
> modify the kernel argument list during update and you do, by adding
> the tsx=on parameter.
>
> Then you hit a bug in the tripleo-ansible as described in
> https://bugzilla.redhat.com/show_bug.cgi?id=2061319. Basically, the
> compute node reboot during the udpate.
>
> This shouldn't happen as modifying kernel parameter during update
> shouldn't reboot nodes.
>
> The fact that it doesn't recover is because isn't set to auto reboot
> after the hypervizor's reboot:
>
> [stack@undercloud-0 ~]$ openstack --os-cloud overcloud server list
> +--------------------------------------+---------------------+---------+--------------------------------------------------+-----------------------------+-----------------------+
> | ID                                   | Name                | Status  | Networks                                         | Image                       | Flavor                |
> +--------------------------------------+---------------------+---------+--------------------------------------------------+-----------------------------+-----------------------+
> | a5610a6e-27bb-4b57-86ff-5c21225d346e | instance_7c2a0e6555 | SHUTOFF | internal_net_7c2a0e6555=192.168.0.50, 10.0.0.205 | upgrade_workload_7c2a0e6555 | v1-512M-5G-7c2a0e6555 |
> +--------------------------------------+---------------------+---------+--------------------------------------------------+-----------------------------+-----------------------+
>
> Starting the vm back solve the ping.
>
> openstack --os-cloud overcloud server start instance_7c2a0e6555
>
> [stack@undercloud-0 ~]$ ping 10.0.0.205
> PING 10.0.0.205 (10.0.0.205) 56(84) bytes of data.
> 64 bytes from 10.0.0.205: icmp_seq=1 ttl=63 time=2.59 ms
> ^C
> --- 10.0.0.205 ping statistics ---
> 1 packets transmitted, 1 received, 0% packet loss, time 0ms
> rtt min/avg/max/mdev = 2.589/2.589/2.589/0.000 ms
>
> Bottom line, ovn update is working, but we have a OSP compute node
> rebooted.
>
> Removing the tsx=on parameter during update should solve you issue, or
> else deploying with tsx=on.
>
> Note that tsx flag shouldn't be needed anymore, but this
> https://bugzilla.redhat.com/show_bug.cgi?id=2002346 hasn't been
> backported to 16.1 so you should deploy with tsx flag set already.

Following our IRC chat, we agreed that tsx should be set to on, and that the problem reproduces because of a missing backport: https://bugzilla.redhat.com/show_bug.cgi?id=2061319
Following my testing, the update process passes when we do not set any kernel args. If we do set kernel args, the issue reproduces, and that is because of another Nova issue: https://bugzilla.redhat.com/show_bug.cgi?id=2061319
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 16.1.8 bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:0986