Description of problem:

I created a router with VMs before running the minor update of an OSP13 SR-IOV deployment. The minor update finished without any errors, but afterwards the router had no external connectivity in either direction. I noticed that two controllers report the router in active state:

neutron l3-agent-list-hosting-router Router_eNet
neutron CLI is deprecated and will be removed in the future. Use OpenStack CLI instead.
+--------------------------------------+--------------------------+----------------+-------+----------+
| id                                   | host                     | admin_state_up | alive | ha_state |
+--------------------------------------+--------------------------+----------------+-------+----------+
| 68592610-7eb2-4869-b320-6bb0b5bd4680 | controller-1.localdomain | True           | :-)   | active   |
| 8f10ff7c-0deb-4311-a742-d133f1b31bf9 | controller-2.localdomain | True           | :-)   | standby  |
| 1054a7b8-f771-4041-b431-2e481365d560 | controller-0.localdomain | True           | :-)   | active   |
+--------------------------------------+--------------------------+----------------+-------+----------+

Controllers 0 and 1 are both active and both hold the external IP addresses:

[root@controller-1 ~]# ip netns exec qrouter-90413d3e-2dc1-4240-87c3-edfef4671020 ip a
115: qg-38e39d71-54: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether fa:16:3e:f9:58:87 brd ff:ff:ff:ff:ff:ff
    inet 10.35.166.84/24 scope global qg-38e39d71-54
       valid_lft forever preferred_lft forever
    inet 10.35.166.87/32 scope global qg-38e39d71-54
       valid_lft forever preferred_lft forever
    inet 10.35.166.90/32 scope global qg-38e39d71-54
       valid_lft forever preferred_lft forever
    inet 10.35.166.92/32 scope global qg-38e39d71-54
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fef9:5887/64 scope link nodad
       valid_lft forever preferred_lft forever

controller-2 (standby):

[root@controller-2 ~]# ip netns exec qrouter-90413d3e-2dc1-4240-87c3-edfef4671020 ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
113: ha-8c8c54db-50: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether fa:16:3e:81:65:3f brd ff:ff:ff:ff:ff:ff
    inet 169.254.192.5/18 brd 169.254.255.255 scope global ha-8c8c54db-50
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe81:653f/64 scope link
       valid_lft forever preferred_lft forever
114: qr-afae9da2-61: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether fa:16:3e:60:5a:61 brd ff:ff:ff:ff:ff:ff
115: qg-38e39d71-54: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether fa:16:3e:f9:58:87 brd ff:ff:ff:ff:ff:ff
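For reference, a quick way to see the split-brain from outside the namespaces is to check which controllers hold the external VIP. This is a sketch; it assumes the standard TripleO "heat-admin" overcloud user and that the controllers are reachable by these short names:

    ROUTER_ID=90413d3e-2dc1-4240-87c3-edfef4671020
    # On a healthy HA router exactly one node should print addresses on qg-*.
    for host in controller-0 controller-1 controller-2; do
        echo "== $host =="
        ssh heat-admin@$host sudo ip netns exec qrouter-$ROUTER_ID \
            ip -4 addr show dev qg-38e39d71-54
    done

Here both controller-0 and controller-1 print the 10.35.166.x addresses, matching the two "active" rows in the table above.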
From /var/log/messages on controller-0:

Jun 10 11:18:27 controller-0 dockerd-current: time="2018-06-10T11:18:27.003985012Z" level=error msg="Handler for POST /v1.26/containers/neutron-keepalived-qrouter-90413d3e-2dc1-4240-87c3-edfef4671020/stop returned error: No such container: neutron-keepalived-qrouter-90413d3e-2dc1-4240-87c3-edfef4671020"
Jun 10 11:18:27 controller-0 dockerd-current: time="2018-06-10T11:18:27.004385092Z" level=error msg="Handler for POST /v1.26/containers/neutron-keepalived-qrouter-90413d3e-2dc1-4240-87c3-edfef4671020/stop returned error: No such container: neutron-keepalived-qrouter-90413d3e-2dc1-4240-87c3-edfef4671020"
Jun 10 11:18:27 controller-0 dockerd-current: time="2018-06-10T11:18:27.090846376Z" level=error msg="Handler for DELETE /v1.26/containers/neutron-keepalived-qrouter-90413d3e-2dc1-4240-87c3-edfef4671020?force=1 returned error: No such container: neutron-keepalived-qrouter-90413d3e-2dc1-4240-87c3-edfef4671020"
Jun 10 11:18:27 controller-0 dockerd-current: time="2018-06-10T11:18:27.091241153Z" level=error msg="Handler for DELETE /v1.26/containers/neutron-keepalived-qrouter-90413d3e-2dc1-4240-87c3-edfef4671020 returned error: No such container: neutron-keepalived-qrouter-90413d3e-2dc1-4240-87c3-edfef4671020"

(overcloud) [root@controller-0 ~]# openstack router list
+--------------------------------------+-------------+--------+-------+-------------+------+----------------------------------+
| ID                                   | Name        | Status | State | Distributed | HA   | Project                          |
+--------------------------------------+-------------+--------+-------+-------------+------+----------------------------------+
| 90413d3e-2dc1-4240-87c3-edfef4671020 | Router_eNet | ACTIVE | UP    | False       | True | ad804165fc554a2299bf1c4c761f1374 |
+--------------------------------------+-------------+--------+-------+-------------+------+----------------------------------+

Version-Release number of selected component (if applicable):
OSP13 beta to puddle 2018-06-08.3

Steps to Reproduce:

1. Deploy OSP13 beta.
2. Run the Tempest SR-IOV tests.
3. Add PF, VF and normal ports and instances:

wget http://file.tlv.redhat.com/~ekuris/custom_ci_image/rhel-guest-image-7.4-191.x86_64.qcow2
sudo yum -y install libguestfs-tools
sudo yum -y install libvirt && sudo systemctl start libvirtd
virt-customize -a rhel-guest-image-7.4-191.x86_64.qcow2 --root-password password:123456
virt-edit -a rhel-guest-image-7.4-191.x86_64.qcow2 -e 's/^disable_root: 1/disable_root: 0/' /etc/cloud/cloud.cfg
virt-edit -a rhel-guest-image-7.4-191.x86_64.qcow2 -e 's/^ssh_pwauth:\s+0/ssh_pwauth: 1/' /etc/cloud/cloud.cfg
openstack image create --container-format bare --disk-format qcow2 --public --file rhel-guest-image-7.4-191.x86_64.qcow2 rhel74
openstack network create net-64-1
openstack subnet create --subnet-range 10.0.1.0/24 --network net-64-1 --dhcp subnet_4_1
openstack router create Router_eNet
openstack router add subnet Router_eNet subnet_4_1
openstack router set --external-gateway nova Router_eNet
openstack flavor create --public m1.medium --id 3 --ram 1024 --disk 10 --vcpus 1
openstack port create --network net-64-1 --vnic-type direct direct_sriov
openstack port create --network net-64-1 --vnic-type direct-physical PF_sriov
openstack port create --network net-64-1 normal
openstack server create --flavor 3 --image rhel74 --nic port-id=direct_sriov VF
openstack server create --flavor 3 --image rhel74 --nic port-id=PF_sriov PF
openstack server create --flavor 3 --image rhel74 --nic port-id=normal Normal
openstack floating ip create nova
openstack floating ip create nova
openstack floating ip create nova
openstack server add floating ip VF <fip>
openstack server add floating ip PF <fip>
openstack server add floating ip Normal <fip>
openstack security group rule create --protocol icmp --ingress --prefix 0.0.0.0/0 <sec-id>
openstack security group rule create --protocol tcp --ingress --prefix 0.0.0.0/0 <sec-id>

4. sudo rhos-release 13 -p 2018-06-08.3
5. openstack undercloud upgrade | tee undercloud_upgrade.log
6. Add docker-registry.engineering.redhat.com to /etc/sysconfig/docker (sudo vi /etc/sysconfig/docker).
7. sudo systemctl restart docker.service
8. source stackrc
9. openstack overcloud container image prepare --namespace docker-registry.engineering.redhat.com/rhosp13 --tag 2018-06-08.3 --prefix openstack- -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/host-config-and-reboot.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/services-docker/neutron-sriov.yaml -e /home/stack/ospd-13-vlan-multiple-nic-sriov-hybrid-ha/network-environment.yaml -r /home/stack/ospd-13-vlan-multiple-nic-sriov-hybrid-ha/roles_data.yaml --output-images-file /home/stack/update-container-images.yaml
10. sudo openstack overcloud container image upload --config-file ~/update-container-images.yaml --verbose
11. openstack overcloud container image prepare --namespace 192.168.24.1:8787/rhosp13 --tag 2018-06-08.3 --prefix openstack- -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/host-config-and-reboot.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/services-docker/neutron-sriov.yaml -e /home/stack/ospd-13-vlan-multiple-nic-sriov-hybrid-ha/network-environment.yaml -r /home/stack/ospd-13-vlan-multiple-nic-sriov-hybrid-ha/roles_data.yaml --output-env-file ~/update-container-params.yaml
12. ansible -i /usr/bin/tripleo-ansible-inventory overcloud -m shell -b -a "yum localinstall -y http://rhos-release.virt.bos.redhat.com/repos/rhos-release/rhos-release-latest.noarch.rpm; rhos-release 13 -p 2018-06-08.3"
13. openstack overcloud update prepare --templates -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/host-config-and-reboot.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/services-docker/neutron-sriov.yaml -e /home/stack/ospd-13-vlan-multiple-nic-sriov-hybrid-ha/network-environment.yaml -r /home/stack/ospd-13-vlan-multiple-nic-sriov-hybrid-ha/roles_data.yaml --container-registry-file ~/update-container-params.yaml
14. openstack overcloud update run --nodes Controller
15. openstack overcloud update run --nodes computesriov-0
16. openstack overcloud update run --nodes computesriov-1
17. openstack overcloud update converge --templates -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/host-config-and-reboot.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/services-docker/neutron-sriov.yaml -e /home/stack/ospd-13-vlan-multiple-nic-sriov-hybrid-ha/network-environment.yaml -r /home/stack/ospd-13-vlan-multiple-nic-sriov-hybrid-ha/roles_data.yaml -e ~/update-container-params.yaml | tee update_converge.log

THT files:
https://code.engineering.redhat.com/gerrit/gitweb?p=Neutron-QE.git;a=tree;f=BM_heat_template/ospd-13-vlan-multiple-nic-sriov-hybrid-ha;h=19d6eb0ceb99a4d79cb30d72f4aac093bbe0d041;hb=refs/heads/master

Additional info:
sos-report attached.
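After the converge in step 17, a quick connectivity sanity check (a sketch; <fip> stands for the floating IPs allocated in step 3, as elsewhere in this report, and 8.8.8.8 is just an illustrative external target):

    # External connectivity to the pre-update router's VMs:
    ping -c 4 <fip>                      # repeat for the VF, PF and Normal instances
    # And from inside a VM, outbound through the router (root login was
    # enabled via virt-customize in step 3):
    ssh root@<fip> 'ping -c 4 8.8.8.8'

In this run, connectivity to and through Router_eNet fails after the update, while a freshly created router works (see the next comment).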
A new object works well: I created a new router with external access after the update and it works fine. The problem is only with the router that was created before the update process.
I think we may be looking at two issues:

1. Sidecars were restarted at some point during the update! The "docker ps" output in the sosreports shows that the l3 agent's supporting containers (keepalived, haproxy, etc.) are running with the new container image and have approximately the same uptime as the agent itself. This indicates that the pre-update sidecars were killed for some reason. Killing the keepalived containers could leave qrouter namespaces with duplicate data on multiple hosts, as was reported. This is very serious and we need to find the root cause of the restart.

2. keepalived should have righted itself! Even with the keepalived restart, I would expect that once keepalived was running again on all of the controllers it would have resolved the incorrect IP/routing configuration in the qrouter namespace. That it did not usually indicates the keepalived instances cannot communicate with each other. This could be a side effect of the invalid network configuration in the router's namespace (e.g. duplicate IPs) or some other issue breaking the required networking, e.g. https://bugzilla.redhat.com/show_bug.cgi?id=1590651. While we should identify why this isn't working properly, it is arguably less critical than the sidecar restart.

Unfortunately, the data in /var/log/messages in the sosreports starts long after the problem occurred; the missing portion would have contained key data.
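For reference, a sketch of the kind of check used here. The sidecar names follow the neutron-keepalived-qrouter-* pattern visible in the log above; the agent container name neutron_l3_agent is an assumption:

    # Compare image and uptime of the l3 agent and its sidecars; sidecars
    # running the new image with agent-like uptimes imply they were restarted.
    docker ps --format '{{.Names}}\t{{.Image}}\t{{.RunningFor}}' \
        | grep -E 'neutron_l3_agent|neutron-(keepalived|haproxy)-qrouter'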
I may have found the root cause. Apparently we stop all containers if we predict that it will be necessary to restart the docker daemon. We do this if the docker package is updated or its configuration has changed during the update. This used to be necessary to deal with a docker bug but should no longer be needed. I recommend that we either a.) remove this from the update process, or, if that is felt to be too risky, b.) have the 'docker stop'. It would be good if we could rerun this or any other minor update test job where a router is meant to persist over the update, so we can confirm that this is actually what is happening.
Option b.) from comment 6 should have read: have "docker stop" run only on deployed services and skip the containers that neutron started.
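A hypothetical sketch of the corrected option b), stopping only the deployment-managed containers and skipping the agent-started sidecars (the name patterns are assumptions based on the sidecar names seen earlier in this report):

    # Stop all containers except neutron's keepalived/haproxy/dnsmasq
    # sidecars before restarting the docker daemon.
    docker ps --format '{{.Names}}' \
        | grep -vE '^neutron-(keepalived|haproxy|dnsmasq)-' \
        | xargs -r docker stop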
After further discussion and investigation: while not stopping the neutron sidecar containers during minor updates may side-step this issue, it is unlikely to be the root cause.

On closer inspection, the timing around the apparent failover is suspect:

- 2018-06-10 05:27:48.949  router initially instantiated on controller-0
- 2018-06-10 10:23:44.453  l3 agent on controller-1 is started
- 2018-06-10 11:17:10.459  router transitions to master on controller-1
- 2018-06-10 11:18:00.049  l3 agent on controller-0 is started
- 2018-06-10 11:18:11.165  router transitions to master on controller-0

The router does not transition to backup on controller-1 after it transitions to master on controller-0. Some issue with the system appears to be preventing the keepalived processes from reconciling the change.

On controller-1, openvswitch-agent.log is reporting:

"2018-06-10 11:18:22.575 233154 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-062966ee-311d-47f3-8408-f01dbe5d4f21 - - - - -] Port 'qg-38e39d71-54' has lost its vlan tag '45'!"

This may indicate that something has gone wrong with the br-ext-int bridge on that node, possibly related to the recently fixed os-net-config issue rhbz 1590651 or neutron issue rhbz 1576286.

Since I have had no luck reproducing this bug: would it be possible for the reporter or someone else to recreate a system updated using a similar process, so that if it recurs we can log in and get a better idea of what is going on?
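For anyone hitting this, a sketch of how to confirm the lost tag on the affected node (assuming the qg- port is plugged into br-int, where the l2 agent manages the tag):

    # An empty "tag" column on the port matches the agent log message above.
    ovs-vsctl list port qg-38e39d71-54 | grep '^tag'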
(In reply to Brent Eagles from comment #8)
<.. snip ..>
> would it be possible for
> the reporter or someone else to recreate a system updated using a similar
> process, so that if it recurs we can log in and get a better idea of what
> is going on?

I can't reproduce it at the moment because Nova & NFV DEV are working on my setup. I will try to get to it when I get my setup back.
It looks like someone else saw this issue: https://bugzilla.redhat.com/show_bug.cgi?id=1592528
It occurred to me that neutron sidecar containers were not in the beta puddle, so this would be an update from agents-without-sidecars to agents-with-sidecars. There are some implications:

- The "docker stop" on update may or may not have run.
- Processes like keepalived would definitely have been killed when paunch updated and restarted the agent containers with the new images.
- The keepalived process that would have been running inside the l3 agent container before the update would not have been cleanly shut down unless the l3 agent passes SIGTERM to its children (which I do not think it does).

Do we need to support minor updates from releases that have neutron agents-without-sidecars to agents-with-sidecars?
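A sketch of how to check the third point on an updated controller (purely illustrative):

    # keepalived processes that survived the agent container replacement
    # would show PPID 1 and an elapsed time predating the update.
    ps -eo pid,ppid,etime,cmd | grep '[k]eepalived'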
(In reply to Brent Eagles from comment #11) <.. snip ..> > > Do we need to support minor updates from releases that have neutron > agents-without-sidecars to agents-with-sidecars? In short: No. The only release outside of RH that didn't have sidecar containers for neutron agents was the beta release, and we don't "support" (in the full sense of the word) an update from beta to any other build. This massive delta between beta and later builds also means that there's no point in testing updates from beta to current puddles, because there's nothing we can learn from it, and issues we find may not be relevant.
OK, so if there's nothing we can learn from updating the system from beta to the latest puddle, I am stopping the test that I started to run. Today I tried to reproduce the issue updating from RC to the latest puddle, and it did not reproduce. I don't know whether that means the issue was solved; I hope this bug is not hiding somewhere.
A fine group of folks met today on the 'OSP 13 GA Update/Upgrade/Reboot Network Issues' daily call to discuss two OSP 13 GA blockers. One of them is the subject of this thread: https://bugzilla.redhat.com/show_bug.cgi?id=1589684 - "minor update: router created before the update doesn't have external connectivity".

The conclusion on the call was that the bug must have been caused by a network interruption during the update, which was resolved by the os-net-config fix [1]. No one can reproduce this issue when updating from more recent puddles (newer than beta), including Eran, who reported this bug originally.

There were two remaining loose ends that we finalized on the call:

1) What if a network interruption happens during the update for some other reason, and we're back to a place where routers break again? As you can see from Damien's earlier reply on this thread, Damien and others simulated all sorts of failures and couldn't spot any issues related to HA routers. This gives me sufficient confidence with respect to any hidden underlying issues.

2) We know that Director minor updates restart the Docker service and all containers if Docker is updated or needs to be reconfigured. This is done one controller at a time, and because we run keepalived (which drives each HA router) in a container, this operation incurs a needless HA router failover, causing a few seconds of downtime for your VMs' floating IPs. This is a regression.

Putting (1) and (2) together, the decision was to close this RHBZ and open bug 1594367 to capture the issue discussed in (2) more narrowly. We also discussed whether that specific issue is a release blocker on its own, and the decision was to explicitly target it to z-stream. Note that when we fix it in z-stream, the update will be driven by the TripleO code shipped in that z-stream, not by the code we shipped in GA, which is what allows us to fix this issue in z-stream in the first place.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1590651
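For anyone verifying the z-stream fix for bug 1594367, a sketch of how to measure the failover downtime described in (2): run a timestamped ping against a VM's floating IP (the address here is illustrative, taken from the description above) while "openstack overcloud update run --nodes Controller" is in progress, and look for the gap in replies:

    # -D prints a Unix timestamp per reply; a gap of a few seconds marks
    # the HA router failover window.
    ping -D -i 0.5 10.35.166.87 | tee fip_ping.log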