Description of problem:
Testing the update job from GA to latest IPv4 with OVN: the L3 stop step fails the overcloud deployment due to a high packet loss rate.

STDOUT:
7476 packets transmitted, 5887 received, +1559 errors, 21% packet loss, time 7481190ms
rtt min/avg/max/mdev = 0.635/1.126/20.496/0.738 ms, pipe 4
Ping loss higher than 1% detected

Version-Release number of selected component (if applicable):
core_puddle: 2018-06-21.2
puppet-ovn-12.4.0-3.el7ost.noarch
openstack-neutron-openvswitch-12.1.0-2.el7ost.noarch
openstack-neutron-common-12.1.0-2.el7ost.noarch

How reproducible:
Most likely

Steps to Reproduce:
1. Deploy RHOS13 GA
2. Update the undercloud
3. Update the overcloud
4. Run the ping test after l3 stop

Actual results:
21% packet loss

Expected results:
Less than 1% packet loss

Additional info:
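For context on the failure criterion: the test fails the run as soon as the reported ping loss exceeds 1%. A minimal sketch of such a check (hypothetical; this is not the actual job script, and the file name below is only an example) could look like:

import re
import sys

LOSS_THRESHOLD = 1.0  # percent; anything above this fails the run

def parse_loss(summary_line):
    # Expects the ping summary line, e.g.
    # "7476 packets transmitted, 5887 received, +1559 errors, 21% packet loss, time 7481190ms"
    match = re.search(r"([\d.]+)% packet loss", summary_line)
    if not match:
        raise ValueError("no packet loss figure in: %s" % summary_line)
    return float(match.group(1))

if __name__ == "__main__":
    # e.g. python check_ping_loss.py ping_results_202002251536.log
    with open(sys.argv[1]) as log:
        losses = [parse_loss(line) for line in log if "packet loss" in line]
    if any(loss > LOSS_THRESHOLD for loss in losses):
        print("Ping loss higher than %d%% detected" % LOSS_THRESHOLD)
        sys.exit(1)
    print("Ping loss within threshold")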
The OVN OSP13 minor update job also fails for the same reason, e.g. https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-update-13_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/77/consoleFull

Note: in that job the minor update was done from the latest z-release (z9, i.e. 2019-11-04.1) to the latest good puddle (passed_phase1, i.e. 2019-12-13.1).
Hi, in the last test run I saw the following timeline:

- ping loss starts at [1582667231.961964] From 10.0.0.13 icmp_seq=4261 Destination Host Unreachable, which is 2020-02-25T21:47:11+00:00
- ping loss ends at 2020-02-25T22:30:37+00:00, i.e. a 43 min cut ...

We basically ping a FIP from the undercloud (ping_results_202002251536.log).

For the update we have this sequence of events (the bootstrap node is controller-0; times from overcloud_update_run_Controller.log):

- start update controller-2: 2020-02-25 15:36:43 | included: /var/lib/mistral/00f26568-f469-46ef-a698-e8df2200fc47/update_steps_tasks.yaml for controller-2
- start paunch reconfig controller-2: 2020-02-25 15:55:51 | included: /var/lib/mistral/00f26568-f469-46ef-a698-e8df2200fc47/common_deploy_steps_tasks.yaml for controller-2
- start update controller-1: 2020-02-25 16:13:28 | included: /var/lib/mistral/00f26568-f469-46ef-a698-e8df2200fc47/update_steps_tasks.yaml for controller-1
- start paunch reconfig controller-1: 2020-02-25 16:32:19 | included: /var/lib/mistral/00f26568-f469-46ef-a698-e8df2200fc47/common_deploy_steps_tasks.yaml for controller-1
- start update controller-0: 2020-02-25 16:49:40 | included: /var/lib/mistral/00f26568-f469-46ef-a698-e8df2200fc47/update_steps_tasks.yaml for controller-0
- start paunch reconfig controller-0: 2020-02-25 17:12:55 | included: /var/lib/mistral/00f26568-f469-46ef-a698-e8df2200fc47/common_deploy_steps_tasks.yaml for controller-0

Given we need to add +5h to those timestamps, the cut started around the end of the controller-1 paunch reconfig / the beginning of the controller-0 update.

This seems to correspond to the following events (but maybe unrelated, need neutron expertise here) in /var/log/openvswitch/ovs-vswitchd.log:

On ctl-2:
2020-02-25T21:10:17.017Z|00182|bridge|INFO|bridge br-ex: deleted interface patch-provnet-52bb1150-b45a-44cc-8bc3-17c34bb6c196-to-br-int on port 3
2020-02-25T22:30:37.943Z|00203|bridge|INFO|bridge br-ex: added interface patch-provnet-52bb1150-b45a-44cc-8bc3-17c34bb6c196-to-br-int on port 4

On ctl-1:
2020-02-25T21:46:31.154Z|00182|bridge|INFO|bridge br-ex: deleted interface patch-provnet-52bb1150-b45a-44cc-8bc3-17c34bb6c196-to-br-int on port 3
2020-02-25T22:30:33.209Z|00203|bridge|INFO|bridge br-ex: added interface patch-provnet-52bb1150-b45a-44cc-8bc3-17c34bb6c196-to-br-int on port 4

On ctl-0:
2020-02-25T22:30:34.180Z|00170|bridge|INFO|bridge br-ex: deleted interface patch-provnet-52bb1150-b45a-44cc-8bc3-17c34bb6c196-to-br-int on port 2
2020-02-25T22:39:54.247Z|00195|bridge|INFO|bridge br-ex: added interface patch-provnet-52bb1150-b45a-44cc-8bc3-17c34bb6c196-to-br-int on port 3

So the behaviour on ctl-2 is "strange": its br-ex patch port stays detached for 1h20, starting after its own paunch reconfig and before the paunch reconfig of ctl-1, and it comes back around the same time as the ctl-1 br-ex re-attach (just before its paunch reconfig, so at the end of the update tasks). The intersection of the br-ex cut on ctl-2 with the br-ex cut on ctl-1 seems to match the ping cut. A small sketch of that overlap follows below.

Need help to debug this further.
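To make the overlap argument concrete, here is a small sketch (timestamps copied from the logs quoted above; the script itself is only an illustration, not something that exists in the job) that intersects the br-ex cut windows on ctl-2 and ctl-1 and compares the result with the ping cut:

from datetime import datetime, timezone

def ts(s):
    # Parse the UTC timestamps quoted above, e.g. "2020-02-25T21:10:17"
    return datetime.fromisoformat(s).replace(tzinfo=timezone.utc)

# br-ex patch-port cut windows (deleted -> added) from ovs-vswitchd.log
cuts = {
    "ctl-2": (ts("2020-02-25T21:10:17"), ts("2020-02-25T22:30:37")),
    "ctl-1": (ts("2020-02-25T21:46:31"), ts("2020-02-25T22:30:33")),
    "ctl-0": (ts("2020-02-25T22:30:34"), ts("2020-02-25T22:39:54")),
}

# Ping outage window from ping_results_202002251536.log
ping_cut = (ts("2020-02-25T21:47:11"), ts("2020-02-25T22:30:37"))

# Intersection of the ctl-2 and ctl-1 cut windows
start = max(cuts["ctl-2"][0], cuts["ctl-1"][0])
end = min(cuts["ctl-2"][1], cuts["ctl-1"][1])
print("ctl-2/ctl-1 br-ex cut overlap: %s -> %s" % (start, end))
print("ping cut:                      %s -> %s" % ping_cut)
# The overlap (21:46:31 -> 22:30:33) matches the ping cut
# (21:47:11 -> 22:30:37) to within a minute, as noted above.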