Bug 1589684
| Field | Value |
| --- | --- |
| Summary | OSP13-HA minor update: router created before the update doesn't have external connectivity |
| Product | Red Hat OpenStack |
| Component | openstack-tripleo-heat-templates |
| Version | 13.0 (Queens) |
| Target Release | 13.0 (Queens) |
| Target Milestone | ga |
| Status | CLOSED WORKSFORME |
| Severity | urgent |
| Priority | urgent |
| Keywords | Triaged |
| Hardware | Unspecified |
| OS | Unspecified |
| Reporter | Eran Kuris <ekuris> |
| Assignee | Brent Eagles <beagles> |
| QA Contact | Toni Freger <tfreger> |
| CC | amuller, asimonel, bcafarel, beagles, bhaley, chjones, chrisw, ekuris, jschluet, jstransk, kgiusti, majopela, mburns, mcornea, nyechiel, sasha, sclewis, srevivo |
| Doc Type | Known Issue |
| Doc Text | update / upgrade unsupported for the Release Candidate |
| Type | Bug |
| Last Closed | 2018-06-22 17:49:34 UTC |
| Cloned to | 1594367 |
| Bug Blocks | 1594367 |
Description (Eran Kuris, 2018-06-11 07:57:30 UTC)
New objects work well: I created a new router with external access and it works. The problem is with the router that was created before the update process.

I think we may be looking at two issues:

1. Sidecars were restarted at some point during the update! The output of the "docker ps" command provided in the sosreports indicates that the l3-agent's supporting containers for keepalived, haproxy, etc. are running with the new container image (the runtimes also align) and with approximately the same runtime as the agent. This indicates that the pre-update sidecars were killed for some reason. The killing of keepalived containers could result in qrouter namespaces containing duplicate data on multiple hosts, as was reported. This is very serious and we need to find the root cause for the restart.

2. keepalived should have righted itself! Even with the keepalived restart, I would expect that once keepalived was running on all of the controllers again it would have resolved the incorrect IP/routing configurations in the qrouter namespace. This sometimes indicates that the keepalived instances cannot communicate with each other. This could be happening because of some side effect of having invalid network configuration in the router's namespace (e.g. duplicate IPs) or some other issue that is breaking the required networking, e.g. https://bugzilla.redhat.com/show_bug.cgi?id=1590651. While we should identify why this isn't working properly, it is arguably less critical than the sidecar restart.

Unfortunately the data in /var/log/messages in the sosreports starts long after the problem occurred and would have contained key data.

I may have found the root cause. Apparently we stop all containers if we predict that it is necessary to restart the docker daemon. We do this if the docker package is updated or its configuration has changed during the update. This used to be necessary to deal with a docker bug but should no longer be needed. I recommend that we either a.) remove this from the update process, or b.), if that is felt to be too risky, limit the 'docker stop'. It would be good if we can rerun this or any other minor update test job where a router is meant to persist over the update, so we can confirm that this is actually what is happening.

Option b.) from comment 6 should have read: have "docker stop" only run on deployed services and skip the containers neutron started.

After further discussion and investigation: while not stopping the neutron sidecar containers during minor updates may side-step this issue, it is unlikely to be the root cause. On closer inspection, the timing around the apparent failover is suspect:

- 2018-06-10 05:27:48.949 router initially instantiated on controller 0
- 2018-06-10 10:23:44.453 l3 agent on controller 1 is started
- 2018-06-10 11:17:10.459 router transitions to master on controller 1
- 2018-06-10 11:18:00.049 l3 agent on controller 0 is started
- 2018-06-10 11:18:11.165 router transitions to master on controller 0

The router does not transition to backup on controller 1 after it transitions to master on controller 0. It appears that some issue with the system is preventing the keepalived processes from reconciling the system changes.

On controller 1, the openvswitch-agent.log is reporting:

"2018-06-10 11:18:22.575 233154 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-062966ee-311d-47f3-8408-f01dbe5d4f21 - - - - -] Port 'qg-38e39d71-54' has lost its vlan tag '45'!"

This may indicate that something has gone wrong with the br-ext-int bridge on that node. It might be related to the recently fixed os-net-config issue rhbz 1590651 or the neutron issue rhbz 1576286. I have had no luck reproducing this bug; would it be possible for the reporter or someone else to recreate a system upgraded using a similar process, to see if this recurs and we can log in to get a better idea of what is going on?
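(Not part of the original thread.) A minimal diagnostic sketch for the symptoms described in the comment above, assuming the standard neutron L3 HA layout on the controllers: the `<router-id>` placeholder is hypothetical and must be replaced with the affected router's UUID, while the `qg-38e39d71-54` port name is taken from the quoted log line.

```bash
# Hypothetical diagnostic sketch (not from the bug report); assumes the
# default neutron state_path and L3 HA layout on an OSP 13 controller.
ROUTER_ID="<router-id>"   # placeholder: replace with the router UUID

# 1. keepalived state as recorded by neutron on each controller: exactly one
#    controller should report "master", the others "backup".
cat /var/lib/neutron/ha_confs/"${ROUTER_ID}"/state

# 2. If the qrouter namespace on more than one controller holds the gateway
#    and floating IPs, the router is split-brained, matching the
#    duplicate-data symptom described above.
ip netns exec qrouter-"${ROUTER_ID}" ip addr show

# 3. Tag on the external gateway port the OVS agent complained about; an
#    empty tag or 4095 (the agent's "dead VLAN") matches the log message.
ovs-vsctl get Port qg-38e39d71-54 tag
```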
(In reply to Brent Eagles from comment #8)
<.. snip ..>
> Where I have had no luck in reproducing this bug - would it be possible for
> the reporter or someone else to recreate a system upgraded using a similar
> process to see if this recurs and we can log in to get a better idea of what
> is going on?

I can't reproduce it at the moment because the Nova & NFV DEV teams are working on my setup. I will try to get to it when I get my setup back.

It looks like someone else saw this issue: https://bugzilla.redhat.com/show_bug.cgi?id=1592528

It occurred to me that neutron sidecar containers were not in the beta puddle, so this would be an update from agents-without-sidecars to agents-with-sidecars. There are some implications:

- the docker stop on update may or may not have run
- processes like keepalived would definitely have been killed when paunch updated and restarted the agent containers with the new images
- the keepalived process that would have been running under the l3 agent before it was updated would not have been cleanly shut down unless the l3 agent passes SIGTERM to its children (which I do not think it does)

Do we need to support minor updates from releases that have neutron agents-without-sidecars to agents-with-sidecars?

(In reply to Brent Eagles from comment #11)
<.. snip ..>
> Do we need to support minor updates from releases that have neutron
> agents-without-sidecars to agents-with-sidecars?

In short: no. The only release outside of RH that didn't have sidecar containers for neutron agents was the beta release, and we don't "support" (in the full sense of the word) an update from beta to any other build. This massive delta between beta and later builds also means that there's no point in testing updates from beta to current puddles, because there's nothing we can learn from it, and any issues we find may not be relevant.
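(Not part of the original thread.) A small sketch of how one might confirm whether a controller is already running agents-with-sidecars rather than the older beta layout; the `neutron-keepalived-qrouter-<router-id>` container name is an assumption about the OSP 13 naming convention, and `<router-id>` is a placeholder.

```bash
# Hypothetical check (not from the bug report): is keepalived for this router
# a paunch/docker sidecar container, or a child process of the L3 agent?
ROUTER_ID="<router-id>"   # placeholder: replace with the router UUID

# Sidecar mode: a dedicated container per HA router (assumed naming scheme).
docker ps --filter "name=neutron-keepalived-qrouter-${ROUTER_ID}"

# Pre-sidecar (beta) mode: keepalived spawned directly by the L3 agent.
pgrep -af "keepalived.*${ROUTER_ID}"
```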
OK, so if there's nothing we can learn from updating the system from beta to the latest puddle, I am stopping the test that I started to run.

Today I tried to reproduce the issue from RC to the latest puddle and it did not reproduce. I don't know whether that means the issue was solved; I hope this bug is not hiding somewhere.

A fine group of folks met today on the 'OSP 13 GA Update/Upgrade/Reboot Network Issues' daily to discuss two OSP 13 GA blockers, one of which is the subject of this thread: https://bugzilla.redhat.com/show_bug.cgi?id=1589684 - "minor update: router created before the update doesn't have external connectivity".

The conclusion on the call was that the bug must have been caused by a network interruption during the update, which was resolved by the os-net-config fix [1]. No one can reproduce this issue when updating from more recent puddles (newer than beta), including Eran, who reported this bug originally. There were two remaining loose ends that we finalized on the call:

1) What if a network interruption happens during the update for some other reason, and we're back to a place where routers break again? As you can see from Damien's earlier reply on this thread, Damien and others simulated all sorts of failures and couldn't spot any issues related to HA routers. This gives me sufficient confidence that there are no hidden underlying issues.

2) We know that Director minor updates restart the Docker service and all containers when Docker is updated or needs to be reconfigured. This is done one controller at a time, and because we run keepalived (which drives each HA router) in a container, the operation incurs a needless HA router failover, causing a few seconds of downtime to your VMs' floating IPs. This is a regression.

Putting (1) and (2) together, the decision was to close this RHBZ and open bug 1594367 to more narrowly capture the issue discussed in (2). We also discussed whether this specific issue is a release blocker on its own, and the decision was to explicitly target it to z-stream. Note that when we fix it in z-stream, the update will be driven by the TripleO code shipped in that z-stream, not by the code we shipped in GA, which is what allows us to fix this issue in z-stream in the first place.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1590651
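(Not part of the original thread.) A hedged sketch of one way the needless HA router failover described in (2) could be observed while a controller's Docker service is restarted during the minor update; `<router-id>` and `<floating-ip>` are placeholders, and the neutron CLI shown is the deprecated-but-still-shipped Queens client.

```bash
# Hypothetical observation sketch (not from the bug report).
ROUTER_ID="<router-id>"   # placeholder: replace with the router UUID
FIP="<floating-ip>"       # placeholder: a floating IP behind the HA router

# Watch which L3 agent reports the router as active; the ha_state column
# should flip when the keepalived container on the active controller stops.
neutron l3-agent-list-hosting-router "${ROUTER_ID}"

# From outside the overcloud, a timestamped ping against the floating IP makes
# the few seconds of failover downtime visible (-D is iputils' timestamp flag).
ping -D "${FIP}"
```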