Bug 2052411

Summary:	Minor update is failing due to packet loss
Product:	Red Hat OpenStack	Reporter:	Eran Kuris <ekuris>
Component:	openstack-tripleo-heat-templates	Assignee:	Sofer Athlan-Guyot <sathlang>
Status:	CLOSED ERRATA	QA Contact:	Jason Grosso <jgrosso>
Severity:	urgent	Docs Contact:
Priority:	urgent
Version:	16.1 (Train)	CC:	ccamacho, jamsmith, jgrosso, jpretori, mburns, mciecier, ramishra, rheslop, sathlang, spower
Target Milestone:	z8	Keywords:	Triaged
Target Release:	16.1 (Train on RHEL 8.2)
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	openstack-tripleo-heat-templates-11.3.2-1.20220114223345.el8ost	Doc Type:	Enhancement
Doc Text:	As of this release, the Red Hat supported method of updating OVN is aligned with the upstream OVN upgrade steps.	Story Points:	---
Clone Of:
Clones:	2052576 (view as bug list)		Environment:
Last Closed:	2022-03-24 11:03:13 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	2061319
Bug Blocks:	2015920, 2052576

Description Eran Kuris 2022-02-09 08:36:25 UTC

Description of problem:
A minor update from the latest_cdn puddle, RHOS-16.1-RHEL-8-20211126.n.1,
to the latest z2 candidate, RHOS-16.1-RHEL-8-20220203.n.1,
is failing due to 75% packet loss.


2022-02-08 14:05:08.529 | TASK [tripleo-upgrade : stop l3 agent connectivity check] **********************
2022-02-08 14:05:08.532 | task path: /home/rhos-ci/jenkins/workspace/DFG-network-networking-ovn-update-16.1_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/infrared/plugins/tripleo-upgrade/infrared_plugin/roles/tripleo-upgrade/tasks/common/l3_agent_connectivity_check_stop_script.yml:2
2022-02-08 14:05:08.534 | Tuesday 08 February 2022  14:05:08 +0000 (0:23:23.283)       2:01:00.898 ****** 
2022-02-08 14:05:08.876 | fatal: [undercloud-0]: FAILED! => {
2022-02-08 14:05:08.878 |     "changed": true,
2022-02-08 14:05:08.881 |     "cmd": "source /home/stack/overcloudrc\n/home/stack/l3_agent_stop_ping.sh 0\n",
2022-02-08 14:05:08.883 |     "delta": "0:00:00.072742",
2022-02-08 14:05:08.886 |     "end": "2022-02-08 14:05:08.850054",
2022-02-08 14:05:08.888 |     "rc": 1,
2022-02-08 14:05:08.891 |     "start": "2022-02-08 14:05:08.777312"
2022-02-08 14:05:08.893 | }
2022-02-08 14:05:08.896 | 
2022-02-08 14:05:08.898 | STDOUT:
2022-02-08 14:05:08.901 | 
2022-02-08 14:05:08.903 | 1374 packets transmitted, 337 received, +636 errors, 75.4731% packet loss, time 1685ms
2022-02-08 14:05:08.905 | rtt min/avg/max/mdev = 0.459/0.870/7.533/0.605 ms, pipe 4
2022-02-08 14:05:08.908 | Ping loss higher than 0 seconds detected (765 seconds)
2022-02-08 14:05:08.910 | 
2022-02-08 14:05:08.912 | 
2022-02-08 14:05:08.915 | MSG:
2022-02-08 14:05:08.917 | 
2022-02-08 14:05:08.919 | non-zero return code
2022-02-08 14:05:08.921 | 
2022-02-08 14:05:08.924 | PLAY RECAP *********************************************************************
2022-02-08 14:05:08.926 | undercloud-0               : ok=100  changed=38   unreachable=0    failed=1    skipped=85   rescued=0    ignored=7   
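
The 75% figure can be cross-checked against the counters in the ping summary line. A minimal sketch, assuming only the summary format shown in the STDOUT above; the helper name packet_loss_percent is made up for illustration and is not part of tripleo-upgrade:

```python
import re

# Summary line copied from the failed task's STDOUT above.
SUMMARY = ("1374 packets transmitted, 337 received, +636 errors, "
           "75.4731% packet loss, time 1685ms")


def packet_loss_percent(summary: str) -> float:
    """Recompute the loss percentage from the transmitted/received counters."""
    match = re.search(r"(\d+) packets transmitted, (\d+) received", summary)
    if match is None:
        raise ValueError("not a ping summary line")
    sent, received = int(match.group(1)), int(match.group(2))
    # Lost packets are the ones transmitted but never answered.
    return 100.0 * (sent - received) / sent


loss = packet_loss_percent(SUMMARY)
print(f"{loss:.4f}% packet loss")  # 75.4731% packet loss
```

So 1037 of the 1374 probes went unanswered, which matches the percentage ping itself reported.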
Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1. Run the automated minor update job.

Actual results:


Expected results:


Additional info:

Comment 2 Sofer Athlan-Guyot 2022-02-22 10:37:22 UTC

Hi,

To test the above review, you also need this patch in tripleo-upgrade: https://review.opendev.org/c/openstack/tripleo-upgrade/+/829394

When using infrared downstream, the update job needs this setting:

  OOO_UPGRADE_PLUGIN_GERRIT_CHANGE = 829394

This is because the THT review needs a new step during the update in order to work.

Comment 3 Sofer Athlan-Guyot 2022-02-22 13:48:19 UTC

This is the 16.1 manifestation of https://bugzilla.redhat.com/show_bug.cgi?id=2050154 for 16.2.

Comment 8 Eran Kuris 2022-03-01 13:27:06 UTC

(In reply to Sofer Athlan-Guyot from comment #2)
> Hi,
> 
> To test the above review, you also need this patch in tripleo-upgrade:
> https://review.opendev.org/c/openstack/tripleo-upgrade/+/829394
> 
> When using infrared downstream, the update job needs this setting:
> 
>   OOO_UPGRADE_PLUGIN_GERRIT_CHANGE = 829394
> 
> This is because the THT review needs a new step during the update in order to work.

Do we have a patch for the OSP 16.1 branch?

Comment 12 Sofer Athlan-Guyot 2022-03-07 12:04:28 UTC

Hi,

So the difference between your job and the update job is that we don't
modify the kernel argument list during the update and you do, by adding
the tsx=on parameter.

Then you hit a bug in tripleo-ansible, as described in
https://bugzilla.redhat.com/show_bug.cgi?id=2061319. Basically, the
compute node reboots during the update.

This shouldn't happen, as modifying a kernel parameter during an update
shouldn't reboot nodes.

Connectivity doesn't recover because the instance isn't set to restart
automatically after the hypervisor's reboot:

   [stack@undercloud-0 ~]$ openstack --os-cloud overcloud server list 
   +--------------------------------------+---------------------+---------+--------------------------------------------------+-----------------------------+-----------------------+ 
   | ID                                   | Name                | Status  | Networks                                         | Image                       | Flavor                | 
   +--------------------------------------+---------------------+---------+--------------------------------------------------+-----------------------------+-----------------------+ 
   | a5610a6e-27bb-4b57-86ff-5c21225d346e | instance_7c2a0e6555 | SHUTOFF | internal_net_7c2a0e6555=192.168.0.50, 10.0.0.205 | upgrade_workload_7c2a0e6555 | v1-512M-5G-7c2a0e6555 | 
   +--------------------------------------+---------------------+---------+--------------------------------------------------+-----------------------------+-----------------------+ 

Starting the VM again restores the ping:

   openstack --os-cloud overcloud server start instance_7c2a0e6555
   
   [stack@undercloud-0 ~]$ ping 10.0.0.205 
   PING 10.0.0.205 (10.0.0.205) 56(84) bytes of data. 
   64 bytes from 10.0.0.205: icmp_seq=1 ttl=63 time=2.59 ms
   ^C 
   --- 10.0.0.205 ping statistics --- 
   1 packets transmitted, 1 received, 0% packet loss, time 0ms
   rtt min/avg/max/mdev = 2.589/2.589/2.589/0.000 ms
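
The manual check above (look for SHUTOFF instances, then start them) could be scripted roughly as follows. This is only a sketch: it assumes the JSON form of `openstack server list` (`-f json`), whose rows carry the same "Name" and "Status" columns as the table above, and the helper names are made up for illustration.

```python
import json
import subprocess


def shutoff_servers(servers):
    """Return the names of servers whose Status is SHUTOFF."""
    return [s["Name"] for s in servers if s.get("Status") == "SHUTOFF"]


def list_servers(cloud="overcloud"):
    """Run `openstack server list -f json` and parse it (needs the CLI installed)."""
    result = subprocess.run(
        ["openstack", "--os-cloud", cloud, "server", "list", "-f", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


# Sample row mirroring the table above, so the filter can be tried offline.
sample = [{
    "ID": "a5610a6e-27bb-4b57-86ff-5c21225d346e",
    "Name": "instance_7c2a0e6555",
    "Status": "SHUTOFF",
}]
print(shutoff_servers(sample))  # ['instance_7c2a0e6555']
```

On the undercloud, each name returned by shutoff_servers(list_servers()) could then be passed to `openstack --os-cloud overcloud server start`. Nova's resume_guests_state_on_host_boot option is the usual knob for restarting guests after a host reboot; whether it was set in this environment is not stated in this bug.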

Bottom line: the OVN update is working, but an OSP compute node was
rebooted.

Removing the tsx=on parameter during the update should solve your issue, or
else deploy with tsx=on.

Note that the tsx flag shouldn't be needed anymore, but the fix for
https://bugzilla.redhat.com/show_bug.cgi?id=2002346 hasn't been
backported to 16.1, so you should deploy with the tsx flag already set.
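
To tell in advance whether an update would change the kernel argument list (and so risk the reboot described above), the running command line can be compared with the wanted arguments. A minimal sketch; the sample command line is hypothetical and missing_kernel_args is a made-up helper, not a TripleO API:

```python
def missing_kernel_args(current_cmdline: str, wanted: str) -> list:
    """Return the wanted kernel arguments that are absent from the running cmdline."""
    current = set(current_cmdline.split())
    return sorted(set(wanted.split()) - current)


# Hypothetical cmdline of a compute node that was deployed without tsx=on.
deployed = "BOOT_IMAGE=/vmlinuz root=/dev/vda1 ro console=ttyS0"

print(missing_kernel_args(deployed, "tsx=on"))              # ['tsx=on']
print(missing_kernel_args(deployed + " tsx=on", "tsx=on"))  # []
```

On a real node the first argument would come from reading /proc/cmdline. An empty result means the update adds no kernel arguments, which is the case when the overcloud was already deployed with tsx=on.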

Comment 13 Eran Kuris 2022-03-07 13:17:27 UTC

(In reply to Sofer Athlan-Guyot from comment #12)
> Hi,
> 
> So the difference between your job and the update job is that we don't
> modify the kernel argument list during the update and you do, by adding
> the tsx=on parameter.
> 
> Then you hit a bug in tripleo-ansible, as described in
> https://bugzilla.redhat.com/show_bug.cgi?id=2061319. Basically, the
> compute node reboots during the update.
> 
> This shouldn't happen, as modifying a kernel parameter during an update
> shouldn't reboot nodes.
> 
> Connectivity doesn't recover because the instance isn't set to restart
> automatically after the hypervisor's reboot:
> 
>    [stack@undercloud-0 ~]$ openstack --os-cloud overcloud server list
>    +--------------------------------------+---------------------+---------+--------------------------------------------------+-----------------------------+-----------------------+
>    | ID                                   | Name                | Status  | Networks                                         | Image                       | Flavor                |
>    +--------------------------------------+---------------------+---------+--------------------------------------------------+-----------------------------+-----------------------+
>    | a5610a6e-27bb-4b57-86ff-5c21225d346e | instance_7c2a0e6555 | SHUTOFF | internal_net_7c2a0e6555=192.168.0.50, 10.0.0.205 | upgrade_workload_7c2a0e6555 | v1-512M-5G-7c2a0e6555 |
>    +--------------------------------------+---------------------+---------+--------------------------------------------------+-----------------------------+-----------------------+
> 
> Starting the VM again restores the ping:
> 
>    openstack --os-cloud overcloud server start instance_7c2a0e6555
>    
>    [stack@undercloud-0 ~]$ ping 10.0.0.205 
>    PING 10.0.0.205 (10.0.0.205) 56(84) bytes of data. 
>    64 bytes from 10.0.0.205: icmp_seq=1 ttl=63 time=2.59 ms
>    ^C 
>    --- 10.0.0.205 ping statistics --- 
>    1 packets transmitted, 1 received, 0% packet loss, time 0ms
>    rtt min/avg/max/mdev = 2.589/2.589/2.589/0.000 ms
> 
> Bottom line: the OVN update is working, but an OSP compute node was
> rebooted.
> 
> Removing the tsx=on parameter during the update should solve your issue, or
> else deploy with tsx=on.
> 
> Note that the tsx flag shouldn't be needed anymore, but the fix for
> https://bugzilla.redhat.com/show_bug.cgi?id=2002346 hasn't been
> backported to 16.1, so you should deploy with the tsx flag already set.

Following our IRC chat, we agreed that tsx should be set to on, and that the problem reproduces because of a missing backport:
https://bugzilla.redhat.com/show_bug.cgi?id=2061319

Comment 14 Eran Kuris 2022-03-09 11:46:37 UTC

According to my testing, the update process passes when we don't set any kernel args.
If we do set kernel args, the issue reproduces, and that's because of another Nova issue: https://bugzilla.redhat.com/show_bug.cgi?id=2061319

Comment 23 errata-xmlrpc 2022-03-24 11:03:13 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1.8 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:0986