Description of problem:
-----------------------
During the upgrade from RHOS-10 to RHOS-11, workloads on oc became inaccessible after the major-upgrade-composable-steps step. The issue seems to be the same as described in bz1499201.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
openstack-tripleo-heat-templates-6.2.12-2.el7ost.noarch
puppet-tripleo-6.5.10-3.el7ost.noarch

Steps to Reproduce:
-------------------
1. Upgrade RHOS-9 to RHOS-10
2. Launch VMs on oc
3. Start upgrade to RHOS-11
As I checked on a testing environment, all nova and neutron services from the control plane were "duplicated". Services registered on hosts like controller-{0,1,2} were down and services on controller-{0,1,2}.localdomain were up. Because of that, e.g. HA routers were still scheduled to L3 agents which were down, so the routers weren't configured on those nodes at all. After manually moving a router to the "new" L3 agents, its FIP was accessible again.
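The "duplicated" hosts described above can be spotted mechanically. The following is only a minimal sketch: find_duplicated_hosts is a hypothetical helper, and it assumes you have already extracted the host names (one per line, for example from the host column of the agent list) and feed them on stdin. It prints each short host name that also appears with a domain suffix.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of any OpenStack tool).
# Reads host names, one per line, from stdin and prints "short fqdn"
# for every short name that also shows up as short.<domain>.
find_duplicated_hosts() {
    awk '
        { hosts[$0] = 1 }
        END {
            for (h in hosts) {
                n = index(h, ".")            # position of the first dot, 0 if none
                if (n > 0) {
                    short = substr(h, 1, n - 1)
                    if (short in hosts) print short " " h
                }
            }
        }' | sort
}
```

Any pair printed here is a candidate for the situation in this bug: the short name is the old, dead registration and the fqdn is the new one.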
This is because the neutron::host parameter changes during the osp10/11 upgrade, due to a change in its default at deployment time. In osp9 it was undef, in osp10 we prevent it from changing, and in osp11 it takes the new default. See bz#1499201 for more.
Hey Sofer, from our Sep 6th daily meeting we were speaking about this BZ. Can you update it with the information we have about this BZ to document it and move it to a docs fix?
Hi,

the patches here are only a POC and should not be used. Given that the maintenance window for osp11 is coming to an end, that this issue usually doesn't happen in production environments (where host is usually the fqdn), and that we have a workaround, the urgency of this bug may be lowered.

This is the same symptom as in bz#1499201. The host configuration change in neutron.conf makes the agents change their "uuid" (the host parameter). So the old ones were:

    /etc/neutron/neutron.conf/DEFAULT/host = foo

and the new ones are:

    /etc/neutron/neutron.conf/DEFAULT/host = foo.bar

The floating IPs are attached to the l3 agent with uuid foo, making them unreachable.

The fix for bz#1499201 cannot work here, but the workaround can. I repeat it here for clarity:

------8<-------- workaround start

This is how you can bring everything back to a working state:

    ssh undercloud
    . overcloudrc
    curl -o reschedule-l3-routers.sh https://bugzilla.redhat.com/attachment.cgi?id=1421308
    bash -x ./reschedule-l3-routers.sh

After a little while (between one and two minutes) everything should come back alive. You can check with a ping test. Checking the state of a particular router is done like this:

    ssh undercloud
    . overcloudrc
    neutron router-list
    # pick one and then:
    neutron l3-agent-list-hosting-router 903195f0-c361-46a4-8b71-9a9b9bde572c
    +--------------------------------------+--------------------------+----------------+-------+----------+
    | id                                   | host                     | admin_state_up | alive | ha_state |
    +--------------------------------------+--------------------------+----------------+-------+----------+
    | 54ffd13f-ab05-4b5f-a884-a5016dcdd512 | controller-1.localdomain | True           | :-)   | standby  |
    | 17311ec7-2db0-440d-922d-06bc633cc2a8 | controller-2.localdomain | True           | :-)   | standby  |
    | 3174da98-564f-4449-a2c3-704d799f6558 | controller-0.localdomain | True           | :-)   | active   |
    +--------------------------------------+--------------------------+----------------+-------+----------+

You may have all three in standby at first. Not to worry, one will come back to active, and during that time the ping (and everything else) should work.

When everything has settled, you can clean up the dead l3 agents:

    ssh undercloud
    . overcloudrc
    curl -o cleanup-non-alive-agents.sh https://bugzilla.redhat.com/attachment.cgi?id=1421315
    bash -x ./cleanup-non-alive-agents.sh

------8<-------- workaround end

You can check beforehand whether you are going to hit this bug by looking at the current host parameter in neutron (and you should do it for nova as well):

    grep '^host' /etc/neutron/neutron.conf

If you have a fqdn as the host parameter then you should be fine. Otherwise you will hit this issue.

The best course of action then would be to sync with eng, but here's an outline of what should be done *before* the upgrade: change the host parameter and restart neutron on all three controllers, then apply the workaround above. You will have a small cut in connectivity, but the maintenance will be short. Then you can upgrade as usual.
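The pre-upgrade check above can be scripted. This is only a sketch under stated assumptions: get_host_param, host_is_fqdn and check_conf are hypothetical helper names, "contains a dot" is used as a simplified stand-in for "is a fqdn", and the first uncommented host line is taken regardless of which section of the file it sits in.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of neutron or tripleo).

# Print the value of the first uncommented "host" option in a
# neutron.conf-style file. Accepts both "host=foo" and "host = foo".
get_host_param() {
    local conf="$1"
    sed -n 's/^host[[:space:]]*=[[:space:]]*//p' "$conf" | head -n 1
}

# Simplification: any value containing a dot is treated as a fqdn.
host_is_fqdn() {
    case "$1" in
        *.*) return 0 ;;
        *)   return 1 ;;
    esac
}

# Print "fqdn", "short" or "unset" for the given config file.
check_conf() {
    local conf="$1" host
    host="$(get_host_param "$conf")"
    if [ -z "$host" ]; then
        echo "unset"
    elif host_is_fqdn "$host"; then
        echo "fqdn"
    else
        echo "short"
    fi
}
```

On a controller this would be called as check_conf /etc/neutron/neutron.conf. A result of "short" corresponds to the case described above where the issue is expected, and "unset" means the file itself gives no answer, so the host names in the agent list are worth checking instead.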
Hi, as osp11 has been EOL since May 18, it's hard to justify spending time to solve this one. I'm closing it, especially since there are workarounds. Please don't hesitate to re-open it if I missed something here.