Bug 2052411 - Minor update is failing due packets lost
Summary: Minor update is failing due packets lost
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 16.1 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: z8
Target Release: 16.1 (Train on RHEL 8.2)
Assignee: Sofer Athlan-Guyot
QA Contact: Jason Grosso
URL:
Whiteboard:
Depends On: 2061319
Blocks: 2015920 2052576
Reported: 2022-02-09 08:36 UTC by Eran Kuris
Modified: 2022-03-24 11:03 UTC

Fixed In Version: openstack-tripleo-heat-templates-11.3.2-1.20220114223345.el8ost
Doc Type: Enhancement
Doc Text:
As of this release, the Red Hat supported method of updating OVN is aligned with the upstream OVN upgrade steps.
Clone Of:
: 2052576 (view as bug list)
Environment:
Last Closed: 2022-03-24 11:03:13 UTC
Target Upstream Version:


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 829393 0 None MERGED Update of OVN controllers as an external task. 2022-03-03 13:59:54 UTC
Red Hat Issue Tracker OSP-12563 0 None None None 2022-02-09 08:43:30 UTC
Red Hat Issue Tracker UPG-4981 0 None None None 2022-02-09 11:51:37 UTC
Red Hat Product Errata RHBA-2022:0986 0 None None None 2022-03-24 11:03:32 UTC

Description Eran Kuris 2022-02-09 08:36:25 UTC
Description of problem:
The minor update from the latest_cdn puddle (RHOS-16.1-RHEL-8-20211126.n.1)
to the latest z2 candidate (RHOS-16.1-RHEL-8-20220203.n.1)
fails with about 75% packet loss.


2022-02-08 14:05:08.529 | TASK [tripleo-upgrade : stop l3 agent connectivity check] **********************
2022-02-08 14:05:08.532 | task path: /home/rhos-ci/jenkins/workspace/DFG-network-networking-ovn-update-16.1_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/infrared/plugins/tripleo-upgrade/infrared_plugin/roles/tripleo-upgrade/tasks/common/l3_agent_connectivity_check_stop_script.yml:2
2022-02-08 14:05:08.534 | Tuesday 08 February 2022  14:05:08 +0000 (0:23:23.283)       2:01:00.898 ****** 
2022-02-08 14:05:08.876 | fatal: [undercloud-0]: FAILED! => {
2022-02-08 14:05:08.878 |     "changed": true,
2022-02-08 14:05:08.881 |     "cmd": "source /home/stack/overcloudrc\n/home/stack/l3_agent_stop_ping.sh 0\n",
2022-02-08 14:05:08.883 |     "delta": "0:00:00.072742",
2022-02-08 14:05:08.886 |     "end": "2022-02-08 14:05:08.850054",
2022-02-08 14:05:08.888 |     "rc": 1,
2022-02-08 14:05:08.891 |     "start": "2022-02-08 14:05:08.777312"
2022-02-08 14:05:08.893 | }
2022-02-08 14:05:08.896 | 
2022-02-08 14:05:08.898 | STDOUT:
2022-02-08 14:05:08.901 | 
2022-02-08 14:05:08.903 | 1374 packets transmitted, 337 received, +636 errors, 75.4731% packet loss, time 1685ms
2022-02-08 14:05:08.905 | rtt min/avg/max/mdev = 0.459/0.870/7.533/0.605 ms, pipe 4
2022-02-08 14:05:08.908 | Ping loss higher than 0 seconds detected (765 seconds)
2022-02-08 14:05:08.910 | 
2022-02-08 14:05:08.912 | 
2022-02-08 14:05:08.915 | MSG:
2022-02-08 14:05:08.917 | 
2022-02-08 14:05:08.919 | non-zero return code
2022-02-08 14:05:08.921 | 
2022-02-08 14:05:08.924 | PLAY RECAP *********************************************************************
2022-02-08 14:05:08.926 | undercloud-0               : ok=100  changed=38   unreachable=0    failed=1    skipped=85   rescued=0    ignored=7   
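
For reference, the 75.4731% figure in the log follows directly from the transmitted and received counts on the ping summary line. The following is a minimal Python sketch, assuming the summary line format shown in the log above. The helper name `packet_loss_percent` is made up for illustration and is not part of `l3_agent_stop_ping.sh`.

```python
import re

# Hypothetical helper (not part of l3_agent_stop_ping.sh): recompute the
# loss percentage from a ping summary line like the one in the log above.
SUMMARY_RE = re.compile(r"(\d+) packets transmitted, (\d+) received")


def packet_loss_percent(summary_line: str) -> float:
    """Return the packet loss in percent from a ping summary line."""
    match = SUMMARY_RE.search(summary_line)
    if match is None:
        raise ValueError("not a ping summary line")
    transmitted, received = (int(group) for group in match.groups())
    if transmitted == 0:
        raise ValueError("no packets transmitted")
    return 100.0 * (transmitted - received) / transmitted


# Summary line copied from the failing job output.
line = ("1374 packets transmitted, 337 received, +636 errors, "
        "75.4731% packet loss, time 1685ms")
loss = packet_loss_percent(line)
print(f"{loss:.4f}% packet loss")
```

In other words, 1037 of the 1374 probes sent during the update went unanswered.
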
Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1. Run the automated minor update job.

Actual results:


Expected results:


Additional info:

Comment 2 Sofer Athlan-Guyot 2022-02-22 10:37:22 UTC
Hi,

To test the above review, you also need this patch in tripleo-upgrade: https://review.opendev.org/c/openstack/tripleo-upgrade/+/829394

When using infrared downstream, this is what needs to be added to the update job:

  OOO_UPGRADE_PLUGIN_GERRIT_CHANGE = 829394

This is needed because the tht review requires a new step during the update in order to work.
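
For anyone who wants to fetch that tripleo-upgrade change by hand rather than through the job parameter, a Gerrit change number maps to a fetch ref of the form `refs/changes/<last two digits>/<change>/<patchset>`. The following is a minimal Python sketch of that mapping. The function name is made up for illustration, and the patchset number is a placeholder because the comment does not say which patchset was used. How infrared itself consumes `OOO_UPGRADE_PLUGIN_GERRIT_CHANGE` is not shown here.

```python
def gerrit_change_ref(change: int, patchset: int) -> str:
    """Build the Gerrit fetch ref refs/changes/<NN>/<change>/<patchset>.

    <NN> is the last two digits of the change number, zero-padded.
    """
    if change <= 0 or patchset <= 0:
        raise ValueError("change and patchset must be positive")
    return f"refs/changes/{change % 100:02d}/{change}/{patchset}"


# The patchset number below is a placeholder (assumption); the comment
# does not say which patchset of change 829394 was tested.
ref = gerrit_change_ref(829394, patchset=1)
print(ref)
```

The resulting ref can be passed to `git fetch` against the tripleo-upgrade repository on review.opendev.org.
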

Comment 3 Sofer Athlan-Guyot 2022-02-22 13:48:19 UTC
This is the 16.1 manifestation of https://bugzilla.redhat.com/show_bug.cgi?id=2050154 for 16.2.

Comment 8 Eran Kuris 2022-03-01 13:27:06 UTC
(In reply to Sofer Athlan-Guyot from comment #2)

Do we have a patch for the OSP 16.1 branch?

Comment 12 Sofer Athlan-Guyot 2022-03-07 12:04:28 UTC
Hi,

The difference between your job and the update job is that we don't
modify the kernel argument list during the update and you do, by adding
the tsx=on parameter.

You then hit a bug in tripleo-ansible, as described in
https://bugzilla.redhat.com/show_bug.cgi?id=2061319. Basically, the
compute node reboots during the update.

This shouldn't happen, as modifying a kernel parameter during an update
shouldn't reboot nodes.

The ping doesn't recover because the instance isn't set to restart
automatically after the hypervisor reboots:

   [stack@undercloud-0 ~]$ openstack --os-cloud overcloud server list 
   +--------------------------------------+---------------------+---------+--------------------------------------------------+-----------------------------+-----------------------+ 
   | ID                                   | Name                | Status  | Networks                                         | Image                       | Flavor                | 
   +--------------------------------------+---------------------+---------+--------------------------------------------------+-----------------------------+-----------------------+ 
   | a5610a6e-27bb-4b57-86ff-5c21225d346e | instance_7c2a0e6555 | SHUTOFF | internal_net_7c2a0e6555=192.168.0.50, 10.0.0.205 | upgrade_workload_7c2a0e6555 | v1-512M-5G-7c2a0e6555 | 
   +--------------------------------------+---------------------+---------+--------------------------------------------------+-----------------------------+-----------------------+ 

Starting the VM again restores the ping.

   openstack --os-cloud overcloud server start instance_7c2a0e6555
   
   [stack@undercloud-0 ~]$ ping 10.0.0.205 
   PING 10.0.0.205 (10.0.0.205) 56(84) bytes of data. 
   64 bytes from 10.0.0.205: icmp_seq=1 ttl=63 time=2.59 ms
   ^C 
   --- 10.0.0.205 ping statistics --- 
   1 packets transmitted, 1 received, 0% packet loss, time 0ms
   rtt min/avg/max/mdev = 2.589/2.589/2.589/0.000 ms
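
The manual recovery above (find the instance in SHUTOFF, then start it) can be sketched in Python as follows. This is only an illustration under stated assumptions: it expects the JSON that `openstack server list -f json` prints, with `Name` and `Status` keys matching the column headers in the table above, and it only builds the `server start` command lines without running them. The function names and the second, ACTIVE sample entry are made up.

```python
import json
from typing import List


def shutoff_server_names(server_list_json: str) -> List[str]:
    """Return the names of servers whose Status is SHUTOFF."""
    servers = json.loads(server_list_json)
    return [s["Name"] for s in servers if s.get("Status") == "SHUTOFF"]


def start_commands(names: List[str], cloud: str = "overcloud") -> List[List[str]]:
    """Build (but do not run) one 'server start' command per server."""
    return [
        ["openstack", "--os-cloud", cloud, "server", "start", name]
        for name in names
    ]


# First entry mirrors the instance from the table above; the second one
# is a made-up ACTIVE instance, added only to show the filtering.
sample = json.dumps([
    {"ID": "a5610a6e-27bb-4b57-86ff-5c21225d346e",
     "Name": "instance_7c2a0e6555", "Status": "SHUTOFF"},
    {"ID": "00000000-0000-0000-0000-000000000000",
     "Name": "instance_example", "Status": "ACTIVE"},
])

for command in start_commands(shutoff_server_names(sample)):
    print(" ".join(command))
```

Each printed command could be run by hand or passed to `subprocess.run` on the undercloud.
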

Bottom line: the OVN update is working, but an OSP compute node was
rebooted.

Removing the tsx=on parameter during the update should solve your issue,
or else deploying with tsx=on from the start.

Note that the tsx flag shouldn't be needed anymore, but
https://bugzilla.redhat.com/show_bug.cgi?id=2002346 hasn't been
backported to 16.1, so you should deploy with the tsx flag already set.
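
Put differently, the kernel argument list should be the same at deployment time and at update time. The following is a minimal Python sketch of that check. The function name is made up for illustration, and it only compares two argument strings; it does not read any TripleO template or node state.

```python
from typing import Set


def kernel_args_diff(deployed: str, update: str) -> Set[str]:
    """Return the kernel args present in only one of the two lists.

    An empty result means the update does not change the kernel
    argument list.
    """
    return set(deployed.split()) ^ set(update.split())


# Case from this bug: deployed without tsx=on, then updated with it.
print(kernel_args_diff("", "tsx=on"))

# Suggested setup: tsx=on is already set at deployment time.
print(kernel_args_diff("tsx=on", "tsx=on"))
```

A non-empty result flags an update that changes kernel arguments, which is the situation that ran into bug 2061319 here.
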

Comment 13 Eran Kuris 2022-03-07 13:17:27 UTC
(In reply to Sofer Athlan-Guyot from comment #12)

Following our IRC chat, we agreed that tsx should be set to on, and that the problem reproduces because of a missing backport:
https://bugzilla.redhat.com/show_bug.cgi?id=2061319

Comment 14 Eran Kuris 2022-03-09 11:46:37 UTC
Following my testing, the update process passes when we do not set any kernel args.
If we do set kernel args, the issue reproduces, and that is because of another Nova issue: https://bugzilla.redhat.com/show_bug.cgi?id=2061319

Comment 23 errata-xmlrpc 2022-03-24 11:03:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1.8 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:0986

