2052411 – Minor update is failing due packets lost

Summary: Minor update is failing due packets lost

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openstack-tripleo-heat-templates
Sub Component:
Version:	16.1 (Train)
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	z8
Target Release:	16.1 (Train on RHEL 8.2)
Assignee:	Sofer Athlan-Guyot
QA Contact:	Jason Grosso
Docs Contact:
URL:
Whiteboard:
Depends On:	2061319
Blocks:	2015920 2052576

Reported:	2022-02-09 08:36 UTC by Eran Kuris
Modified:	2022-03-24 11:03 UTC
CC List:	10 users
Fixed In Version:	openstack-tripleo-heat-templates-11.3.2-1.20220114223345.el8ost
Doc Type:	Enhancement
Doc Text:	As of this release, the Red Hat supported method of updating OVN is aligned with the upstream OVN upgrade steps.
Clone Of:
Clones:	2052576
Environment:
Last Closed:	2022-03-24 11:03:13 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
OpenStack gerrit	829393	None	MERGED	Update of OVN controllers as an external task.	2022-03-03 13:59:54 UTC
Red Hat Issue Tracker	OSP-12563	None	None	None	2022-02-09 08:43:30 UTC
Red Hat Issue Tracker	UPG-4981	None	None	None	2022-02-09 11:51:37 UTC
Red Hat Product Errata	RHBA-2022:0986	None	None	None	2022-03-24 11:03:32 UTC

Description Eran Kuris 2022-02-09 08:36:25 UTC

Description of problem:
A minor update from the latest_cdn puddle (RHOS-16.1-RHEL-8-20211126.n.1)
to the latest z2 candidate (RHOS-16.1-RHEL-8-20220203.n.1)
fails with about 75% packet loss.


2022-02-08 14:05:08.529 | TASK [tripleo-upgrade : stop l3 agent connectivity check] **********************
2022-02-08 14:05:08.532 | task path: /home/rhos-ci/jenkins/workspace/DFG-network-networking-ovn-update-16.1_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/infrared/plugins/tripleo-upgrade/infrared_plugin/roles/tripleo-upgrade/tasks/common/l3_agent_connectivity_check_stop_script.yml:2
2022-02-08 14:05:08.534 | Tuesday 08 February 2022  14:05:08 +0000 (0:23:23.283)       2:01:00.898 ****** 
2022-02-08 14:05:08.876 | fatal: [undercloud-0]: FAILED! => {
2022-02-08 14:05:08.878 |     "changed": true,
2022-02-08 14:05:08.881 |     "cmd": "source /home/stack/overcloudrc\n/home/stack/l3_agent_stop_ping.sh 0\n",
2022-02-08 14:05:08.883 |     "delta": "0:00:00.072742",
2022-02-08 14:05:08.886 |     "end": "2022-02-08 14:05:08.850054",
2022-02-08 14:05:08.888 |     "rc": 1,
2022-02-08 14:05:08.891 |     "start": "2022-02-08 14:05:08.777312"
2022-02-08 14:05:08.893 | }
2022-02-08 14:05:08.896 | 
2022-02-08 14:05:08.898 | STDOUT:
2022-02-08 14:05:08.901 | 
2022-02-08 14:05:08.903 | 1374 packets transmitted, 337 received, +636 errors, 75.4731% packet loss, time 1685ms
2022-02-08 14:05:08.905 | rtt min/avg/max/mdev = 0.459/0.870/7.533/0.605 ms, pipe 4
2022-02-08 14:05:08.908 | Ping loss higher than 0 seconds detected (765 seconds)
2022-02-08 14:05:08.910 | 
2022-02-08 14:05:08.912 | 
2022-02-08 14:05:08.915 | MSG:
2022-02-08 14:05:08.917 | 
2022-02-08 14:05:08.919 | non-zero return code
2022-02-08 14:05:08.921 | 
2022-02-08 14:05:08.924 | PLAY RECAP *********************************************************************
2022-02-08 14:05:08.926 | undercloud-0               : ok=100  changed=38   unreachable=0    failed=1    skipped=85   rescued=0    ignored=7   
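
As a sanity check on the figure in the STDOUT above, the following minimal sketch recomputes the loss percentage from the transmitted and received counts in the ping summary line. The function name `ping_loss_percent` is hypothetical and is not part of tripleo-upgrade or of `l3_agent_stop_ping.sh`; only the summary line itself comes from the log.

```python
import re

# Summary line copied from the failing task's STDOUT in the log above.
SUMMARY = ("1374 packets transmitted, 337 received, +636 errors, "
           "75.4731% packet loss, time 1685ms")


def ping_loss_percent(summary: str) -> float:
    """Recompute the loss percentage from the transmitted/received counts.

    Hypothetical helper for illustration only.
    """
    match = re.search(r"(\d+) packets transmitted, (\d+) received", summary)
    if match is None:
        raise ValueError("not a ping summary line")
    transmitted = int(match.group(1))
    received = int(match.group(2))
    # Lost packets are those transmitted but never answered.
    return 100.0 * (transmitted - received) / transmitted


loss = ping_loss_percent(SUMMARY)
print(f"{loss:.4f}% packet loss")
```

With 1374 packets sent and 337 answered, 1037 packets were lost, which agrees with the percentage that ping itself reported.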
Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1. Run the automated minor update job.

Actual results:


Expected results:


Additional info:

Comment 2 Sofer Athlan-Guyot 2022-02-22 10:37:22 UTC

Hi,

To test the above review, you also need this patch in tripleo-upgrade: https://review.opendev.org/c/openstack/tripleo-upgrade/+/829394

When using infrared downstream, this is what needs to be added to the update job:

  OOO_UPGRADE_PLUGIN_GERRIT_CHANGE = 829394

This is needed because the THT review requires a new step during the update to work.

Comment 3 Sofer Athlan-Guyot 2022-02-22 13:48:19 UTC

This is the 16.1 manifestation of https://bugzilla.redhat.com/show_bug.cgi?id=2050154 for 16.2.

Comment 8 Eran Kuris 2022-03-01 13:27:06 UTC

(In reply to Sofer Athlan-Guyot from comment #2)

Do we have a patch for the OSP 16.1 branch?

Comment 12 Sofer Athlan-Guyot 2022-03-07 12:04:28 UTC

Hi,

so the difference between your job and the update job is that we don't
modify the kernel argument list during the update and you do, by adding
the tsx=on parameter.

Then you hit a bug in tripleo-ansible, as described in
https://bugzilla.redhat.com/show_bug.cgi?id=2061319. Basically, the
compute node reboots during the update.

This shouldn't happen, as modifying kernel parameters during an update
shouldn't reboot nodes.

The connectivity doesn't recover because the instance isn't set to
start automatically after the hypervisor's reboot:

   [stack@undercloud-0 ~]$ openstack --os-cloud overcloud server list 
   +--------------------------------------+---------------------+---------+--------------------------------------------------+-----------------------------+-----------------------+ 
   | ID                                   | Name                | Status  | Networks                                         | Image                       | Flavor                | 
   +--------------------------------------+---------------------+---------+--------------------------------------------------+-----------------------------+-----------------------+ 
   | a5610a6e-27bb-4b57-86ff-5c21225d346e | instance_7c2a0e6555 | SHUTOFF | internal_net_7c2a0e6555=192.168.0.50, 10.0.0.205 | upgrade_workload_7c2a0e6555 | v1-512M-5G-7c2a0e6555 | 
   +--------------------------------------+---------------------+---------+--------------------------------------------------+-----------------------------+-----------------------+ 

Starting the VM again restores the ping.

   openstack --os-cloud overcloud server start instance_7c2a0e6555
   
   [stack@undercloud-0 ~]$ ping 10.0.0.205 
   PING 10.0.0.205 (10.0.0.205) 56(84) bytes of data. 
   64 bytes from 10.0.0.205: icmp_seq=1 ttl=63 time=2.59 ms
   ^C 
   --- 10.0.0.205 ping statistics --- 
   1 packets transmitted, 1 received, 0% packet loss, time 0ms
   rtt min/avg/max/mdev = 2.589/2.589/2.589/0.000 ms

Bottom line: the OVN update is working, but an OSP compute node was
rebooted.

Removing the tsx=on parameter during the update should solve your issue, or
else deploying with tsx=on from the start.

Note that the tsx flag shouldn't be needed anymore, but the change tracked in
https://bugzilla.redhat.com/show_bug.cgi?id=2002346 hasn't been
backported to 16.1, so you should deploy with the tsx flag already set.
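
The manual recovery in this comment (list the servers, find the one in SHUTOFF, start it) could be scripted roughly as below. This is only a sketch under stated assumptions: it expects the JSON produced by `openstack server list -f json`, assumes the JSON keys match the column headers of the table above (`Name`, `Status`), and uses hypothetical helper names `shutoff_servers` and `start_command`. It only builds the `server start` command shown in this comment and does not run it.

```python
import json
from typing import List


def shutoff_servers(server_list_json: str) -> List[str]:
    """Return the names of servers whose Status is SHUTOFF.

    Assumes input shaped like `openstack server list -f json`, with
    "Name" and "Status" keys as in the table columns (an assumption).
    """
    servers = json.loads(server_list_json)
    return [s["Name"] for s in servers if s.get("Status") == "SHUTOFF"]


def start_command(name: str, cloud: str = "overcloud") -> List[str]:
    """Build the same `server start` command used manually in the comment."""
    return ["openstack", "--os-cloud", cloud, "server", "start", name]


# Sample data using the values from the table in this comment.
SAMPLE = json.dumps([
    {
        "ID": "a5610a6e-27bb-4b57-86ff-5c21225d346e",
        "Name": "instance_7c2a0e6555",
        "Status": "SHUTOFF",
    }
])

for server_name in shutoff_servers(SAMPLE):
    # Print the command instead of executing it; run it by hand or via
    # subprocess.run() once the list of instances has been reviewed.
    print(" ".join(start_command(server_name)))
```

Printing rather than executing keeps the sketch safe to run anywhere; whether instances should be restarted automatically after a hypervisor reboot is a deployment decision outside the scope of this example.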

Comment 13 Eran Kuris 2022-03-07 13:17:27 UTC

(In reply to Sofer Athlan-Guyot from comment #12)

Following our IRC chat, we agreed that tsx should be set to on, and that the problem reproduces because of a missing backport:
https://bugzilla.redhat.com/show_bug.cgi?id=2061319

Comment 14 Eran Kuris 2022-03-09 11:46:37 UTC

Following my testing, the update process passes when we do not set any kernel arguments.
If we do set kernel arguments, the issue reproduces, and that is because of another Nova issue: https://bugzilla.redhat.com/show_bug.cgi?id=2061319
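
One way to tell in advance whether an update would change the kernel argument list on a node is to compare the arguments the node booted with against the arguments the update is going to request. The sketch below illustrates that comparison only; the helper name `kernel_args_to_add` and the sample command lines are hypothetical, and on a real node the booted arguments would be read from `/proc/cmdline`.

```python
from typing import List


def kernel_args_to_add(current_cmdline: str, desired_args: str) -> List[str]:
    """Return the desired kernel arguments missing from the current cmdline.

    A non-empty result means the update would modify the kernel argument
    list, which is the situation discussed in this bug. Hypothetical helper.
    """
    current = set(current_cmdline.split())
    return [arg for arg in desired_args.split() if arg not in current]


# Hypothetical command lines, for illustration only.
deployed_without_tsx = "ro console=ttyS0"
deployed_with_tsx = "ro console=ttyS0 tsx=on"

# Node deployed without tsx=on: the update would add a kernel argument.
print(kernel_args_to_add(deployed_without_tsx, "tsx=on"))

# Node deployed with tsx=on already set: nothing changes during the update.
print(kernel_args_to_add(deployed_with_tsx, "tsx=on"))
```

An empty result corresponds to the passing case in this comment, where the update does not touch the kernel arguments.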

Comment 23 errata-xmlrpc 2022-03-24 11:03:13 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1.8 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:0986
