Bug 1596571 - [UPGRADES][9->10->11] Workload not accessible after major-upgrade-composable-steps
Keywords:
Status: CLOSED EOL
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: puppet-tripleo
Version: 11.0 (Ocata)
Hardware: All
OS: All
Priority: medium
Severity: medium
Target Milestone: zstream
Target Release: 11.0 (Ocata)
Assignee: Sofer Athlan-Guyot
QA Contact: Yurii Prokulevych
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-06-29 09:30 UTC by Yurii Prokulevych
Modified: 2018-09-12 22:18 UTC (History)
7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-09-12 22:18:35 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 555732 0 None ABANDONED Make host param for nova/neutron/ceilometer immutable during upgrade 2021-01-20 17:19:26 UTC
OpenStack gerrit 579183 0 None MERGED Making immutable config setting when using <_IMMUTABLE_>. 2021-01-20 17:19:27 UTC
Red Hat Bugzilla 1499201 0 urgent CLOSED OSP9 -> OSP10: workloads created before upgrade are not reachable anymore after rebooting controller nodes 2022-08-02 18:03:19 UTC

Internal Links: 1499201

Description Yurii Prokulevych 2018-06-29 09:30:07 UTC
Description of problem:
-----------------------
During upgrading RHOS-10 to RHOS-11, workload on oc became inaccessible after major-upgrade-composable-steps step.

The issue seems to be the same as described in bz1499201


Version-Release number of selected component (if applicable):
-------------------------------------------------------------
openstack-tripleo-heat-templates-6.2.12-2.el7ost.noarch
puppet-tripleo-6.5.10-3.el7ost.noarch

Steps to Reproduce:
-------------------
1. Upgrade RHOS-9 to RHOS-10
2. Launch VMs on oc
3. Start upgrade to RHOS-11

Comment 1 Slawek Kaplonski 2018-06-29 09:44:33 UTC
As I checked on the testing environment, all nova and neutron services from the control plane were "duplicated": services on hosts like controller-{0,1,2} were down, while services on controller-{0,1,2}.localdomain were up.
Because of that, HA routers were scheduled to L3 agents which were down, so those routers weren't configured on the nodes at all.
After manually moving a router to the "new" L3 agents its FIP was accessible again.

Comment 2 Sofer Athlan-Guyot 2018-06-29 13:28:09 UTC
This is because the neutron::host parameter changes during the osp10/11 upgrade, due to a change in its default during deployment.  In osp9 it was undef, in osp10 we prevented it from changing, and in osp11 it takes the new default.  See bz#1499201 for more.
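The gerrit patches linked above try to make the host parameter immutable across the upgrade. Conceptually, an immutable setting keeps whatever value is already on disk instead of applying the new default; a minimal sketch of that idea, using a temp file and names that are purely illustrative (not the real puppet-tripleo implementation):

```shell
#!/bin/sh
# Sketch: an "immutable" setting keeps whatever host= value is already
# on disk instead of applying the new default.  Paths and names here
# are illustrative, not the real puppet-tripleo implementation.
conf=$(mktemp)
printf '[DEFAULT]\nhost=controller-0\n' > "$conf"   # pre-upgrade value

new_default="controller-0.localdomain"              # post-upgrade default

current=$(sed -n 's/^host=//p' "$conf" | head -n1)
if [ -n "$current" ]; then
    host="$current"        # immutable: keep the agent's old identity
else
    host="$new_default"    # fresh deploy: take the new default
fi
echo "host=$host"
rm -f "$conf"
```

With a pre-existing value the agents keep their old identity, so routers and FIPs stay bound to live agents.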

Comment 3 Carlos Camacho 2018-09-07 11:18:40 UTC
Hey Sofer, in our Sep 6th daily meeting we spoke about this BZ. Can you update it with the information we have, so we can document it and move it to a docs fix?

Comment 4 Sofer Athlan-Guyot 2018-09-07 12:43:53 UTC
Hi,

The patches here are only a POC and should not be used. Given that the maintenance window for osp11 is coming to an end, that this issue usually doesn't happen in production environments (where host is usually the fqdn), and that we have a workaround, the urgency of this bug may be lowered.

So this is the same symptom as in bz#1499201.  The host configuration change in neutron.conf makes the agents change their "uuid" (the host parameter).

The old value was:

  /etc/neutron/neutron.conf/DEFAULT/host = foo

and the new one is:

  /etc/neutron/neutron.conf/DEFAULT/host = foo.bar

The floating ips are still attached to the l3 agent with uuid foo, which makes them unreachable.
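The duplication is visible in the agent list: each host shows up twice, once under the short name (dead, but still owning the routers) and once under the FQDN (alive). A sketch of spotting the pairs from captured output; the listing below is illustrative sample data, and a live run would feed real `neutron agent-list` output instead:

```shell
#!/bin/sh
# Sketch: find hosts that appear both as a short name and as an FQDN
# in captured, illustrative `neutron agent-list`-style output
# (columns: host, alive).
agents='controller-0 xxx
controller-0.localdomain :-)
controller-1 xxx
controller-1.localdomain :-)'

dups=$(printf '%s\n' "$agents" | awk '
    { sub(/\..*/, "", $1); seen[$1]++ }          # strip domain, count
    END { for (h in seen) if (seen[h] > 1) print h " is duplicated" }
')
echo "$dups"
```

Each duplicated host means an old dead agent is still holding resources that should move to its FQDN twin.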

The fix for bz#1499201 cannot work here, but the workaround can.

I repeat it here for clarity:

------8<-------- workaround start

This is how you can bring everything back working:

   ssh undercloud
   . overcloudrc
   curl -o reschedule-l3-routers.sh https://bugzilla.redhat.com/attachment.cgi?id=1421308  
   bash -x ./reschedule-l3-routers.sh

After a little while (between one and two minutes) everything should come back alive.

One can verify with a ping test; checking the state of a particular router is done like this:

  ssh undercloud
  . overcloudrc

  neutron router-list
  # pick one and then:
  neutron l3-agent-list-hosting-router 903195f0-c361-46a4-8b71-9a9b9bde572c
+--------------------------------------+--------------------------+----------------+-------+----------+
| id                                   | host                     | admin_state_up | alive | ha_state |
+--------------------------------------+--------------------------+----------------+-------+----------+
| 54ffd13f-ab05-4b5f-a884-a5016dcdd512 | controller-1.localdomain | True           | :-)   | standby  |
| 17311ec7-2db0-440d-922d-06bc633cc2a8 | controller-2.localdomain | True           | :-)   | standby  |
| 3174da98-564f-4449-a2c3-704d799f6558 | controller-0.localdomain | True           | :-)   | active   |
+--------------------------------------+--------------------------+----------------+-------+----------+

You may see all three in standby at first; not to worry, one will come back to active, and in the meantime the ping (and everything else) should work.


When everything has settled, you can clean up the dead l3 agents:

  ssh undercloud
  . overcloudrc

  curl -o cleanup-non-alive-agents.sh https://bugzilla.redhat.com/attachment.cgi?id=1421315
  bash -x ./cleanup-non-alive-agents.sh


------8<-------- workaround end
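For reference, the cleanup step boils down to deleting every agent whose alive column shows `xxx`. A minimal sketch of that selection against captured, illustrative output; a live run would pipe real `neutron agent-list` output and replace `echo` with `neutron agent-delete "$id"`:

```shell
#!/bin/sh
# Sketch: extract the ids of dead agents (alive column = xxx) from
# illustrative `neutron agent-list`-style output (columns: id, host,
# alive).  A real run would call `neutron agent-delete "$id"` instead
# of echo.
listing='54ffd13f-ab05-4b5f-a884-a5016dcdd512 controller-1 xxx
17311ec7-2db0-440d-922d-06bc633cc2a8 controller-2.localdomain :-)
3174da98-564f-4449-a2c3-704d799f6558 controller-0 xxx'

dead=$(printf '%s\n' "$listing" | awk '$3 == "xxx" { print $1 }')
for id in $dead; do
    echo "would delete agent $id"
done
```

Only delete agents once the routers have been rescheduled, otherwise you remove agents that still own resources.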

One can check beforehand whether they are going to hit this bug by looking at the current host parameter in neutron (and should do the same for nova):


 grep '^host=' /etc/neutron/neutron.conf

If you have an fqdn as the host parameter then you should be fine; otherwise you will hit this issue.
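The check can be made explicit with a simple heuristic: a value containing a dot is treated as an FQDN. A sketch, using a temp file standing in for /etc/neutron/neutron.conf (do the same against /etc/nova/nova.conf):

```shell
#!/bin/sh
# Sketch: warn when host= is a short name (no dot), the condition that
# triggers this bug.  A temp file stands in for
# /etc/neutron/neutron.conf; run the same check on /etc/nova/nova.conf.
conf=$(mktemp)
printf '[DEFAULT]\nhost=controller-0\n' > "$conf"

h=$(sed -n 's/^host=//p' "$conf" | head -n1)
case "$h" in
    *.*) verdict="ok: $h is an FQDN" ;;
    *)   verdict="at risk: host=$h is not an FQDN" ;;
esac
echo "$verdict"
rm -f "$conf"
```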

The best course of action then would be to sync with eng, but here's an outline of what should be done *before* the upgrade.

Change the host parameter and restart neutron on all three controllers, then apply the workaround above.  You will have a small cut in connectivity, but the maintenance window will be short.
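A sketch of that pre-upgrade step on one controller; the temp file stands in for /etc/neutron/neutron.conf, the `.localdomain` suffix is illustrative (use your real domain), and a real run would follow up by restarting the neutron services on that node:

```shell
#!/bin/sh
# Sketch: rewrite host= to the FQDN in a copy of neutron.conf.  A real
# run would target /etc/neutron/neutron.conf on every controller, use
# the node's actual domain, and then restart the neutron services.
conf=$(mktemp)
printf '[DEFAULT]\nhost=controller-0\n' > "$conf"

short=$(sed -n 's/^host=//p' "$conf" | head -n1)
sed -i "s/^host=.*/host=${short}.localdomain/" "$conf"

newhost=$(sed -n 's/^host=//p' "$conf" | head -n1)
echo "host is now $newhost"
rm -f "$conf"
```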

Then you can upgrade as usual.

Comment 5 Sofer Athlan-Guyot 2018-09-12 22:18:35 UTC
Hi,

As osp11 has been EOL since May 18, it's hard to justify spending time on solving this one.  I'm closing it, especially since there are workarounds.

Please don't hesitate to re-open it if I missed something here.

