Bug 1764203

Summary: FFU OSP10->13(ovs 2.11) failed on FFU Controller stage. No ping to default gw. OVS is down.
Product: Red Hat OpenStack
Component: openvswitch
Version: 13.0 (Queens)
Reporter: Roman Safronov <rsafrono>
Assignee: Open vSwitch development team <ovs-team>
QA Contact: Eran Kuris <ekuris>
CC: apevec, chrisw, jlibosva, rhos-maint
Status: CLOSED DUPLICATE
Severity: high
Priority: unspecified
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2019-10-22 14:11:40 UTC

Description Roman Safronov 2019-10-22 13:05:49 UTC
Description of problem:
FFU from latest OSP10 to OSP13z9_plus_2.11 failed. 
Used the following job:
https://rhos-ci-staging-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-upgrades-ffu-ffu-upgrade-10-13_director-rhel-virthost-3cont_1comp-ipv4-vxlan-minimal/10/




TASK [Output for sync deployment ControllerAllNodesValidationDeployment] *******
Tuesday 22 October 2019  08:16:37 -0400 (0:00:00.271)       0:11:37.546 ******* 
fatal: [controller-2]: FAILED! => {
    "msg": [
        {
            "stderr": [
                "[2019-10-22 12:06:05,751] (heat-config) [DEBUG] Running /usr/libexec/heat-config/hooks/script < /var/lib/heat-config/deployed/a29de065-9697-4bc3-89ad-83c97a56f47d.json", 
                "[2019-10-22 12:16:35,902] (heat-config) [INFO] {\"deploy_stdout\": \"Trying to ping default gateway 10.0.0.1...Ping to 10.0.0.1 failed. Retrying...\\nPing to 10.0.0.1 failed
. Retrying...\\nPing to 10.0.0.1 failed. Retrying...\\nPing to 10.0.0.1 failed. Retrying...\\nPing to 10.0.0.1 failed. Retrying...\\nPing to 10.0.0.1 failed. Retrying...\\nPing to 10.0.0.1 f
ailed. Retrying...\\nPing to 10.0.0.1 failed. Retrying...\\nPing to 10.0.0.1 failed. Retrying...\\nPing to 10.0.0.1 failed. Retrying...\\nFAILURE\\n10.0.0.1 is not pingable.\\n\", \"deploy_s
tderr\": \"\", \"deploy_status_code\": 1}", 
                "[2019-10-22 12:16:35,902] (heat-config) [DEBUG] [2019-10-22 12:06:05,787] (heat-config) [INFO] ping_test_ips=172.17.3.20 172.17.4.16 172.17.1.14 172.17.2.16 10.0.0.113", 
                "[2019-10-22 12:06:05,787] (heat-config) [INFO] validate_fqdn=False", 
                "[2019-10-22 12:06:05,787] (heat-config) [INFO] validate_ntp=True", 
                "[2019-10-22 12:06:05,787] (heat-config) [INFO] validate_controllers_icmp=True", 
                "[2019-10-22 12:06:05,787] (heat-config) [INFO] validate_gateways_icmp=True", 
                "[2019-10-22 12:06:05,787] (heat-config) [INFO] deploy_server_id=f9641f04-7a18-43d3-b43f-7b355d974f91", 
                "[2019-10-22 12:06:05,787] (heat-config) [INFO] deploy_action=CREATE", 
                "[2019-10-22 12:06:05,787] (heat-config) [INFO] deploy_stack_id=overcloud-ControllerAllNodesValidationDeployment-4cqttzw5xuq4-2-pse6hldvrfbv/21321991-a26d-42e9-ad31-420dc8db2
f19", 
                "[2019-10-22 12:06:05,787] (heat-config) [INFO] deploy_resource_name=TripleOSoftwareDeployment", 
                "[2019-10-22 12:06:05,787] (heat-config) [INFO] deploy_signal_transport=NO_SIGNAL", 
                "[2019-10-22 12:06:05,788] (heat-config) [DEBUG] Running /var/lib/heat-config/heat-config-script/a29de065-9697-4bc3-89ad-83c97a56f47d", 
                "[2019-10-22 12:16:35,897] (heat-config) [INFO] Trying to ping default gateway 10.0.0.1...Ping to 10.0.0.1 failed. Retrying...", 
                "Ping to 10.0.0.1 failed. Retrying...", 
                "Ping to 10.0.0.1 failed. Retrying...", 
                "Ping to 10.0.0.1 failed. Retrying...", 
                "Ping to 10.0.0.1 failed. Retrying...", 
                "Ping to 10.0.0.1 failed. Retrying...", 
                "Ping to 10.0.0.1 failed. Retrying...", 
                "Ping to 10.0.0.1 failed. Retrying...", 
                "Ping to 10.0.0.1 failed. Retrying...", 
                "Ping to 10.0.0.1 failed. Retrying...", 
                "FAILURE", 
                "10.0.0.1 is not pingable.", 
                "", 
                "[2019-10-22 12:16:35,898] (heat-config) [DEBUG] ", 
                "[2019-10-22 12:16:35,898] (heat-config) [ERROR] Error running /var/lib/heat-config/heat-config-script/a29de065-9697-4bc3-89ad-83c97a56f47d. [1]", 
                "", 
                "", 



NO MORE HOSTS LEFT *************************************************************

PLAY RECAP *********************************************************************
controller-0               : ok=69   changed=23   unreachable=0    failed=1   
controller-1               : ok=69   changed=23   unreachable=0    failed=1   
controller-2               : ok=70   changed=23   unreachable=0    failed=1  





Version-Release number of selected component (if applicable):
FFU from 10.0-RHEL-7/2019-10-10.1  to 13.0-RHEL-7/2019-10-18.1


How reproducible:
Made a single FFU attempt and the issue occurred.

Steps to Reproduce:
1. Run the following job:
https://rhos-ci-staging-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-upgrades-ffu-ffu-upgrade-10-13_director-rhel-virthost-3cont_1comp-ipv4-vxlan-minimal/
(make sure to specify FFU_UPGRADE_TO 13z9_plus_2.11 and IR_PROVISION_HOST; I also had to specify NODE_NAME "qe-generic && tlv", and I recommend setting JOB_TIMEOUT to 660 or even more)
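
A hedged sketch of triggering the job non-interactively through the standard Jenkins buildWithParameters REST endpoint, using the parameters mentioned above. USER:API_TOKEN and the IR_PROVISION_HOST value are placeholders; take the authoritative parameter list from the job configuration page:

# Hypothetical invocation; fill in credentials and the provision host.
JENKINS=https://rhos-ci-staging-jenkins.rhev-ci-vms.eng.rdu2.redhat.com
JOB=DFG-upgrades-ffu-ffu-upgrade-10-13_director-rhel-virthost-3cont_1comp-ipv4-vxlan-minimal
curl -X POST --user "USER:API_TOKEN" \
  "${JENKINS}/job/${JOB}/buildWithParameters" \
  --data-urlencode "FFU_UPGRADE_TO=13z9_plus_2.11" \
  --data-urlencode "IR_PROVISION_HOST=<provision-host>" \
  --data-urlencode "NODE_NAME=qe-generic && tlv" \
  --data-urlencode "JOB_TIMEOUT=660"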

Actual results:
Controllers cannot ping the default gateway (10.0.0.1) during the FFU Controller stage, so the validation deployment fails.
The FFU job failed.
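
For context, the failing validation is essentially a retried ICMP check against the default gateway. A minimal bash sketch of that check, for illustration only (the actual TripleO validation script differs and also checks controller IPs, NTP, etc.):

GW=10.0.0.1
echo "Trying to ping default gateway ${GW}..."
for attempt in $(seq 1 10); do
    # one ICMP echo per attempt, with a 10 second timeout
    if ping -c 1 -W 10 "${GW}" >/dev/null 2>&1; then
        echo "${GW} is pingable."
        exit 0
    fi
    echo "Ping to ${GW} failed. Retrying..."
done
echo "FAILURE"
echo "${GW} is not pingable."
exit 1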


Expected results:
Controllers are able to ping the default gateway during the validation step of the FFU Controller stage.
The FFU job succeeds.

Additional info:


[heat-admin@controller-0 ~]$ sudo systemctl status openvswitch
● openvswitch.service - Open vSwitch
   Loaded: loaded (/usr/lib/systemd/system/openvswitch.service; disabled; vendor preset: disabled)
   Active: inactive (dead) since Tue 2019-10-22 12:01:06 UTC; 48min ago
 Main PID: 2104 (code=exited, status=0/SUCCESS)

Oct 22 09:28:40 controller-0 systemd[1]: Starting Open vSwitch...
Oct 22 09:28:40 controller-0 systemd[1]: Started Open vSwitch.
Oct 22 12:01:06 controller-0 systemd[1]: Stopping Open vSwitch...
Oct 22 12:01:06 controller-0 systemd[1]: Stopped Open vSwitch.
[heat-admin@controller-0 ~]$ 
[heat-admin@controller-0 ~]$ 
[heat-admin@controller-0 ~]$ rpm -qa | grep openvswitch
openvswitch-selinux-extra-policy-1.0-9.el7fdp.noarch
python-rhosp-openvswitch-2.11-0.1.el7ost.noarch
rhosp-openvswitch-ovn-common-2.11-0.1.el7ost.noarch
openvswitch2.11-2.11.0-21.el7fdp.x86_64
openstack-neutron-openvswitch-12.1.0-2.el7ost.noarch
python-openvswitch2.11-2.11.0-21.el7fdp.x86_64
rhosp-openvswitch-ovn-host-2.11-0.1.el7ost.noarch
rhosp-openvswitch-2.11-0.1.el7ost.noarch
rhosp-openvswitch-ovn-central-2.11-0.1.el7ost.noarch
[heat-admin@controller-0 ~]$ 
[heat-admin@controller-0 ~]$ 
[heat-admin@controller-0 ~]$ ip r
default via 10.0.0.1 dev br-ex 
10.0.0.0/24 dev br-ex proto kernel scope link src 10.0.0.113 
169.254.169.254 via 192.168.24.1 dev eth0 
172.17.1.0/24 dev vlan20 proto kernel scope link src 172.17.1.14 
172.17.2.0/24 dev vlan50 proto kernel scope link src 172.17.2.16 
172.17.3.0/24 dev vlan30 proto kernel scope link src 172.17.3.20 
172.17.4.0/24 dev vlan40 proto kernel scope link src 172.17.4.16 
192.168.24.0/24 dev eth0 proto kernel scope link src 192.168.24.18 
[heat-admin@controller-0 ~]$ 
[heat-admin@controller-0 ~]$ 
[heat-admin@controller-0 ~]$ ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1) 56(84) bytes of data.
From 10.0.0.113 icmp_seq=1 Destination Host Unreachable
From 10.0.0.113 icmp_seq=2 Destination Host Unreachable
From 10.0.0.113 icmp_seq=3 Destination Host Unreachable
From 10.0.0.113 icmp_seq=4 Destination Host Unreachable
From 10.0.0.113 icmp_seq=5 Destination Host Unreachable
From 10.0.0.113 icmp_seq=6 Destination Host Unreachable
From 10.0.0.113 icmp_seq=7 Destination Host Unreachable
From 10.0.0.113 icmp_seq=8 Destination Host Unreachable
^C
--- 10.0.0.1 ping statistics ---
8 packets transmitted, 0 received, +8 errors, 100% packet loss, time 7001ms
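
Since openvswitch.service is inactive while the default route points at br-ex, a hypothetical manual recovery sketch (assuming the only problem is that the service was left stopped during the upgrade) would be:

# start OVS again and confirm br-ex and its ports come back
sudo systemctl start openvswitch
sudo systemctl is-active openvswitch
sudo ovs-vsctl show
# re-test the gateway that the validation pings
ping -c 3 10.0.0.1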

Comment 1 Roman Safronov 2019-10-22 13:13:06 UTC
I will attach logs soon. For now, logs are available directly on titan16.lab.eng.tlv2.redhat.com

Comment 2 Jakub Libosvar 2019-10-22 14:11:40 UTC
There is a known issue where installing the new OVS 2.11 package restarts OVS, which causes dataplane disruption. It should be fixed by bug 1763902.

*** This bug has been marked as a duplicate of bug 1763902 ***
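
A hedged diagnostic sketch to confirm on a controller that the OVS stop at 12:01:06 UTC coincided with the openvswitch2.11 package transaction rather than an explicit administrative stop (standard RHEL 7 tooling; timestamps taken from the systemctl output above):

sudo journalctl -u openvswitch --since "2019-10-22 11:55" --until "2019-10-22 12:10"
sudo grep -i openvswitch /var/log/yum.log
sudo yum history list openvswitch2.11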