Bug 1786468 - [updates] packet loss after l3 connectivity check fails the overcloud update
Summary: [updates] packet loss after l3 connectivity check fails the overcloud update
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-openvswitch
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Sofer Athlan-Guyot
QA Contact: Ronnie Rasouli
URL:
Whiteboard:
Depends On:
Blocks: 1753533
 
Reported: 2019-12-25 14:09 UTC by Ronnie Rasouli
Modified: 2020-04-01 11:20 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-04-01 11:20:53 UTC
Target Upstream Version:
Embargoed:



Description Ronnie Rasouli 2019-12-25 14:09:28 UTC
Description of problem:

The update job from GA to latest (IPv4 with OVN) fails at the L3 stop step: the overcloud update fails due to a high packet loss rate.

STDOUT:

7476 packets transmitted, 5887 received, +1559 errors, 21% packet loss, time 7481190ms
rtt min/avg/max/mdev = 0.635/1.126/20.496/0.738 ms, pipe 4
Ping loss higher than 1% detected
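
For reference, a minimal sketch of the kind of check applied to that output; the file name, regex and exit codes here are illustrative assumptions, not the actual validation code:

# Parse the ping summary line and enforce the <1% loss threshold.
# Illustrative only: file name, regex and exit codes are assumptions.
import re
import sys

THRESHOLD = 1.0  # percent

def packet_loss(summary_path):
    # Matches e.g. "7476 packets transmitted, 5887 received, +1559 errors, 21% packet loss, time 7481190ms"
    pattern = re.compile(r'(\d+(?:\.\d+)?)% packet loss')
    with open(summary_path) as fh:
        for line in fh:
            match = pattern.search(line)
            if match:
                return float(match.group(1))
    raise ValueError('no ping summary line found in %s' % summary_path)

if __name__ == '__main__':
    loss = packet_loss(sys.argv[1])
    if loss > THRESHOLD:
        print('Ping loss higher than %d%% detected (%s%%)' % (THRESHOLD, loss))
        sys.exit(1)
    print('Ping loss %s%% is within the threshold' % loss)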


Version-Release number of selected component (if applicable):
core_puddle: 2018-06-21.2

puppet-ovn-12.4.0-3.el7ost.noarch
openstack-neutron-openvswitch-12.1.0-2.el7ost.noarch
openstack-neutron-common-12.1.0-2.el7ost.noarch


How reproducible:
Most likely

Steps to Reproduce:
1. Deploy RHOS13 GA
2. Update the undercloud
3. Update the overcloud
4. Run the ping test after the L3 stop step

Actual results:

There is 21% packet loss.

Expected results:
Less than 1% packet loss.



Additional info:

Comment 2 Roman Safronov 2019-12-29 07:48:13 UTC
The OVN OSP13 minor update job also fails for the same reason, e.g.
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-update-13_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/77/consoleFull

Note: in this job the minor update was done from the latest z-release (z9 or 2019-11-04.1) to the latest good puddle (passed_phase1 or 2019-12-13.1).

Comment 5 Sofer Athlan-Guyot 2020-02-26 16:49:30 UTC
Hi,

so in the last test run I've seen the following timeline:

 - ping loss starts at [1582667231.961964] From 10.0.0.13 icmp_seq=4261 Destination Host Unreachable, which corresponds to 2020-02-25T21:47:11+00:00
 - ping loss ends at 2020-02-25T22:30:37+00:00

That is a ~43 minute cut. We basically ping a FIP from the undercloud (ping_results_202002251536.log); see the sanity check below.
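
As a sanity check on those numbers (a throwaway snippet, not part of the job), the bracketed value in the ping log is a Unix timestamp and the distance to the end of the loss is indeed about 43 minutes:

# Convert the ping -D style timestamp and measure the cut. Illustrative only.
from datetime import datetime, timezone

start = datetime.fromtimestamp(1582667231.961964, tz=timezone.utc)
end = datetime(2020, 2, 25, 22, 30, 37, tzinfo=timezone.utc)

print(start.isoformat())  # 2020-02-25T21:47:11.961964+00:00
print(end - start)        # 0:43:25.038036 -> the ~43 min cut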

For the update we have the following sequence of events (the bootstrap node is controller-0; from overcloud_update_run_Controller.log):

 - start update controller-2: 2020-02-25 15:36:43 | included: /var/lib/mistral/00f26568-f469-46ef-a698-e8df2200fc47/update_steps_tasks.yaml for controller-2
 - start update paunch reconfig controller-2: 2020-02-25 15:55:51 | included: /var/lib/mistral/00f26568-f469-46ef-a698-e8df2200fc47/common_deploy_steps_tasks.yaml for controller-2
 - start update controller-1: 2020-02-25 16:13:28 | included: /var/lib/mistral/00f26568-f469-46ef-a698-e8df2200fc47/update_steps_tasks.yaml for controller-1
 - start update paunch reconfig controller-1: 2020-02-25 16:32:19 | included: /var/lib/mistral/00f26568-f469-46ef-a698-e8df2200fc47/common_deploy_steps_tasks.yaml for controller-1
 - start update controller-0: 2020-02-25 16:49:40 | included: /var/lib/mistral/00f26568-f469-46ef-a698-e8df2200fc47/update_steps_tasks.yaml for controller-0
 - start update paunch reconfig controller-0: 2020-02-25 17:12:55 | included: /var/lib/mistral/00f26568-f469-46ef-a698-e8df2200fc47/common_deploy_steps_tasks.yaml for controller-0

Given we need to add +5h to those timestamps, the cut started around the end of the controller-1 update paunch reconfig / the beginning of the controller-0 update, as illustrated below.
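
To make that alignment explicit (the +5h offset itself is my reading of the logs, so treat it as an assumption), shifting the ansible timestamps puts the start of the ping cut between the end of the controller-1 paunch reconfig and the start of the controller-0 update:

# Shift the ansible log timestamps by +5h and compare with the ping cut.
# Illustrative arithmetic only; the offset is an assumption.
from datetime import datetime, timedelta

OFFSET = timedelta(hours=5)

ansible_events = {
    'ctl-1 paunch reconfig start': datetime(2020, 2, 25, 16, 32, 19),
    'ctl-0 update start':          datetime(2020, 2, 25, 16, 49, 40),
    'ctl-0 paunch reconfig start': datetime(2020, 2, 25, 17, 12, 55),
}
ping_cut_start = datetime(2020, 2, 25, 21, 47, 11)

for name, stamp in ansible_events.items():
    print(name, '->', (stamp + OFFSET).time())
print('ping cut start ->', ping_cut_start.time())
# ctl-1 paunch reconfig start -> 21:32:19, ctl-0 update start -> 21:49:40:
# the ping cut start (21:47:11) falls between the two.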

This seems to correspond to the following (but maybe unrelated, neutron expertise is needed here), from /var/log/openvswitch/ovs-vswitchd.log:

On ctl-2:
2020-02-25T21:10:17.017Z|00182|bridge|INFO|bridge br-ex: deleted interface patch-provnet-52bb1150-b45a-44cc-8bc3-17c34bb6c196-to-br-int on port 3
2020-02-25T22:30:37.943Z|00203|bridge|INFO|bridge br-ex: added interface patch-provnet-52bb1150-b45a-44cc-8bc3-17c34bb6c196-to-br-int on port 4

On ctl-1:
2020-02-25T21:46:31.154Z|00182|bridge|INFO|bridge br-ex: deleted interface patch-provnet-52bb1150-b45a-44cc-8bc3-17c34bb6c196-to-br-int on port 3
2020-02-25T22:30:33.209Z|00203|bridge|INFO|bridge br-ex: added interface patch-provnet-52bb1150-b45a-44cc-8bc3-17c34bb6c196-to-br-int on port 4

On ctl-0:
2020-02-25T22:30:34.180Z|00170|bridge|INFO|bridge br-ex: deleted interface patch-provnet-52bb1150-b45a-44cc-8bc3-17c34bb6c196-to-br-int on port 2
2020-02-25T22:39:54.247Z|00195|bridge|INFO|bridge br-ex: added interface patch-provnet-52bb1150-b45a-44cc-8bc3-17c34bb6c196-to-br-int on port 3

So the behavior on ctl-2 is "strange": its br-ex patch port is gone for 1h20, starting after its paunch reconfig and before the paunch reconfig of ctl-1, and it comes back around the same time as the ctl-1 br-ex re-attach (just before its paunch reconfig, so at the end of the update tasks).

The intersection of the br-ex cut on ctl-2 with the br-ex cut on ctl-1 seems to match the ping cut; a rough check is below.
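
A rough way to see that (again just arithmetic on the timestamps quoted above): the overlap of the two br-ex down windows is within a minute of the observed ping cut.

# Intersect the br-ex "patch port deleted .. re-added" windows of ctl-2 and
# ctl-1 and compare with the ping cut. Timestamps copied from the logs above.
from datetime import datetime

ctl2_down = (datetime(2020, 2, 25, 21, 10, 17), datetime(2020, 2, 25, 22, 30, 37))
ctl1_down = (datetime(2020, 2, 25, 21, 46, 31), datetime(2020, 2, 25, 22, 30, 33))
ping_cut  = (datetime(2020, 2, 25, 21, 47, 11), datetime(2020, 2, 25, 22, 30, 37))

overlap = (max(ctl2_down[0], ctl1_down[0]), min(ctl2_down[1], ctl1_down[1]))
print(overlap)   # 21:46:31 -> 22:30:33
print(ping_cut)  # 21:47:11 -> 22:30:37, i.e. the same window to within ~40s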

Need help to debug further.

