Bug 1786468
| Summary: | [updates] packet loss after l3 connectivity check fails the overcloud update | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Ronnie Rasouli <rrasouli> |
| Component: | rhosp-openvswitch | Assignee: | Sofer Athlan-Guyot <sathlang> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Ronnie Rasouli <rrasouli> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 13.0 (Queens) | CC: | dalvarez, jlibosva, jpretori, mburns, njohnston, rsafrono |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-04-01 11:20:53 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1753533 | | |
Description
Ronnie Rasouli
2019-12-25 14:09:28 UTC
The OVN OSP13 minor update job also fails for the same reason, e.g. https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-update-13_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/77/consoleFull

Note: in this job the minor update was done from the latest z-release (z9, or 2019-11-04.1) to the latest good puddle (passed_phase1, or 2019-12-13.1).

Hi, so in the last test run I've seen this timeline:

- ping loss starts at [1582667231.961964] From 10.0.0.13 icmp_seq=4261 Destination Host Unreachable, which is 2020-02-25T21:47:11+00:00
- ping loss ends at 2020-02-25T22:30:37+00:00

which is a 43 min cut. We basically ping a FIP from the undercloud (ping_results_202002251536.log).

For the update we have this sequence of events (the bootstrap node is controller-0; from overcloud_update_run_Controller.log):

- start update controller-2: 2020-02-25 15:36:43 | included: /var/lib/mistral/00f26568-f469-46ef-a698-e8df2200fc47/update_steps_tasks.yaml for controller-2
- start paunch reconfig controller-2: 2020-02-25 15:55:51 | included: /var/lib/mistral/00f26568-f469-46ef-a698-e8df2200fc47/common_deploy_steps_tasks.yaml for controller-2
- start update controller-1: 2020-02-25 16:13:28 | included: /var/lib/mistral/00f26568-f469-46ef-a698-e8df2200fc47/update_steps_tasks.yaml for controller-1
- start paunch reconfig controller-1: 2020-02-25 16:32:19 | included: /var/lib/mistral/00f26568-f469-46ef-a698-e8df2200fc47/common_deploy_steps_tasks.yaml for controller-1
- start update controller-0: 2020-02-25 16:49:40 | included: /var/lib/mistral/00f26568-f469-46ef-a698-e8df2200fc47/update_steps_tasks.yaml for controller-0
- start paunch reconfig controller-0: 2020-02-25 17:12:55 | included: /var/lib/mistral/00f26568-f469-46ef-a698-e8df2200fc47/common_deploy_steps_tasks.yaml for controller-0

Given that we need to add +5h to these timestamps, the cut started around the end of controller-1's paunch reconfig and the beginning of controller-0's update.

This seems to correspond to the following (but it may be unrelated; neutron expertise is needed here), in /var/log/openvswitch/ovs-vswitchd.log:

On ctl-2:

  2020-02-25T21:10:17.017Z|00182|bridge|INFO|bridge br-ex: deleted interface patch-provnet-52bb1150-b45a-44cc-8bc3-17c34bb6c196-to-br-int on port 3
  2020-02-25T22:30:37.943Z|00203|bridge|INFO|bridge br-ex: added interface patch-provnet-52bb1150-b45a-44cc-8bc3-17c34bb6c196-to-br-int on port 4

On ctl-1:

  2020-02-25T21:46:31.154Z|00182|bridge|INFO|bridge br-ex: deleted interface patch-provnet-52bb1150-b45a-44cc-8bc3-17c34bb6c196-to-br-int on port 3
  2020-02-25T22:30:33.209Z|00203|bridge|INFO|bridge br-ex: added interface patch-provnet-52bb1150-b45a-44cc-8bc3-17c34bb6c196-to-br-int on port 4

On ctl-0:

  2020-02-25T22:30:34.180Z|00170|bridge|INFO|bridge br-ex: deleted interface patch-provnet-52bb1150-b45a-44cc-8bc3-17c34bb6c196-to-br-int on port 2
  2020-02-25T22:39:54.247Z|00195|bridge|INFO|bridge br-ex: added interface patch-provnet-52bb1150-b45a-44cc-8bc3-17c34bb6c196-to-br-int on port 3

So the behavior on ctl-2 is "strange": its br-ex patch port stays detached for 1h20. It is deleted after ctl-2's own paunch reconfig and before the paunch reconfig of ctl-1, and it re-attaches at around the same time as ctl-1's br-ex (just before its paunch reconfig, so at the end of the update tasks).

The intersection of the br-ex cut on ctl-2 with the br-ex cut on ctl-1 seems to match the ping cut. Need help to debug further.
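The timeline arithmetic in the comment (the epoch-stamped ping output, the 43 min cut, and the +5h local-to-UTC offset on the update log) can be cross-checked with a small sketch. All timestamp values are taken from the comment itself; the assumption that the `[1582667231.961964]` prefix is a Unix timestamp (as printed by `ping -D`) is mine:

```python
from datetime import datetime, timedelta, timezone

# The ping log line starts with a Unix timestamp (ping -D style), e.g.
# [1582667231.961964] From 10.0.0.13 icmp_seq=4261 Destination Host Unreachable
loss_start = datetime.fromtimestamp(1582667231.961964, tz=timezone.utc)
print(loss_start.isoformat(timespec="seconds"))  # 2020-02-25T21:47:11+00:00

# End of the cut, as read from the ping log in the comment.
loss_end = datetime.fromisoformat("2020-02-25T22:30:37+00:00")
print(loss_end - loss_start)  # about 43 minutes

# The overcloud update log uses local time, 5 hours behind UTC, so the
# start of controller-0's update (16:49:40 local) shifted by +5h lands
# roughly 2.5 minutes after the ping cut begins at 21:47:11 UTC.
ctl0_update_start_local = datetime(2020, 2, 25, 16, 49, 40)
print(ctl0_update_start_local + timedelta(hours=5))  # 2020-02-25 21:49:40
```

This supports the reading that the cut starts between the end of controller-1's paunch reconfig (16:32:19 local, i.e. 21:32:19 UTC) and the start of controller-0's update.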