Bug 1596571 - [UPGRADES][9->10->11] Workload not accessible after major-upgrade-composable-steps
Keywords:
Status: CLOSED EOL
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: puppet-tripleo
Version: 11.0 (Ocata)
Hardware: All
OS: All
Priority: medium
Severity: medium
Target Milestone: zstream
Target Release: 11.0 (Ocata)
Assignee: Sofer Athlan-Guyot
QA Contact: Yurii Prokulevych
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-06-29 09:30 UTC by Yurii Prokulevych
Modified: 2018-09-12 22:18 UTC (History)
7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-09-12 22:18:35 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 555732 0 None ABANDONED Make host param for nova/neutron/ceilometer immutable during upgrade 2021-01-20 17:19:26 UTC
OpenStack gerrit 579183 0 None MERGED Making immutable config setting when using <_IMMUTABLE_>. 2021-01-20 17:19:27 UTC
Red Hat Bugzilla 1499201 0 urgent CLOSED OSP9 -> OSP10: workloads created before upgrade are not reachable anymore after rebooting controller nodes 2022-08-02 18:03:19 UTC

Internal Links: 1499201

Description Yurii Prokulevych 2018-06-29 09:30:07 UTC
Description of problem:
-----------------------
During upgrading RHOS-10 to RHOS-11, workload on oc became inaccessible after major-upgrade-composable-steps step.

The issue seems to be the same as described in bz1499201


Version-Release number of selected component (if applicable):
-------------------------------------------------------------
openstack-tripleo-heat-templates-6.2.12-2.el7ost.noarch
puppet-tripleo-6.5.10-3.el7ost.noarch

Steps to Reproduce:
-------------------
1. Upgrade RHOS-9 to RHOS-10
2. Launch VMs on oc
3. Start upgrade to RHOS-11

Comment 1 Slawek Kaplonski 2018-06-29 09:44:33 UTC
As I checked on the testing environment, all nova and neutron services from the control plane were "duplicated": services on hosts like controller-{0,1,2} were down, while services on controller-{0,1,2}.localdomain were up.
Because of that, HA routers were scheduled to L3 agents which were down, so those routers weren't configured on the nodes at all.
After manually moving a router to the "new" L3 agents its FIP was accessible again.

Comment 2 Sofer Athlan-Guyot 2018-06-29 13:28:09 UTC
This is because the neutron::host parameter changes during the osp10/11 upgrade, due to a change in its default during deployment.  In osp9 it was undef, in osp10 we prevented it from changing, and in osp11 it takes the new default.  See bz#1499201 for more.
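The gerrit patches linked above try to make the host parameter immutable across the upgrade. Conceptually, an immutable setting keeps whatever value is already on disk instead of applying the new default; a minimal sketch of that idea, using a temp file and names that are purely illustrative (not the real puppet-tripleo implementation):

```shell
#!/bin/sh
# Sketch: an "immutable" setting keeps whatever host= value is already
# on disk instead of applying the new default.  Paths and names here
# are illustrative, not the real puppet-tripleo implementation.
conf=$(mktemp)
printf '[DEFAULT]\nhost=controller-0\n' > "$conf"   # pre-upgrade value

new_default="controller-0.localdomain"              # post-upgrade default

current=$(sed -n 's/^host=//p' "$conf" | head -n1)
if [ -n "$current" ]; then
    host="$current"        # immutable: keep the agent's old identity
else
    host="$new_default"    # fresh deploy: take the new default
fi
echo "host=$host"
rm -f "$conf"
```

With a pre-existing value the agents keep their old identity, so routers and FIPs stay bound to live agents.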

Comment 3 Carlos Camacho 2018-09-07 11:18:40 UTC
Hey Sofer, in our Sep 6th daily meeting we spoke about this BZ. Can you update it with the information we have, so we can document it and move it to a docs fix?

Comment 4 Sofer Athlan-Guyot 2018-09-07 12:43:53 UTC
Hi,

The patches here are only a POC and should not be used. Given that the maintenance window for osp11 is coming to an end, that this issue usually doesn't happen in production environments (where host is usually the fqdn), and that we have a workaround, the urgency of this bug may be lowered.

So this is the same symptom as in bz#1499201.  The host configuration change in neutron.conf makes the agents change their "uuid" (the host parameter).

The old value was:

  /etc/neutron/neutron.conf/DEFAULT/host = foo

and the new one is:

  /etc/neutron/neutron.conf/DEFAULT/host = foo.bar

The floating ips are still attached to the l3 agent with uuid foo, which makes them unreachable.
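The duplication is visible in the agent list: each host shows up twice, once under the short name (dead, but still owning the routers) and once under the FQDN (alive). A sketch of spotting the pairs from captured output; the listing below is illustrative sample data, and a live run would feed real `neutron agent-list` output instead:

```shell
#!/bin/sh
# Sketch: find hosts that appear both as a short name and as an FQDN
# in captured, illustrative `neutron agent-list`-style output
# (columns: host, alive).
agents='controller-0 xxx
controller-0.localdomain :-)
controller-1 xxx
controller-1.localdomain :-)'

dups=$(printf '%s\n' "$agents" | awk '
    { sub(/\..*/, "", $1); seen[$1]++ }          # strip domain, count
    END { for (h in seen) if (seen[h] > 1) print h " is duplicated" }
')
echo "$dups"
```

Each duplicated host means an old dead agent is still holding resources that should move to its FQDN twin.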

The fix for bz#1499201 cannot work here, but the workaround can.

I repeat it here for clarity:

------8<-------- workaround start

This is how you can bring everything back working:

   ssh undercloud
   . overcloudrc
   curl -o reschedule-l3-routers.sh https://bugzilla.redhat.com/attachment.cgi?id=1421308  
   bash -x ./reschedule-l3-routers.sh

After a little while (between one and two minutes) everything should come back alive.

One can verify with a ping test; checking the state of a particular router is done like this:

  ssh undercloud
  . overcloudrc

  neutron router-list
  # pick one and then:
  neutron l3-agent-list-hosting-router 903195f0-c361-46a4-8b71-9a9b9bde572c
+--------------------------------------+--------------------------+----------------+-------+----------+
| id                                   | host                     | admin_state_up | alive | ha_state |
+--------------------------------------+--------------------------+----------------+-------+----------+
| 54ffd13f-ab05-4b5f-a884-a5016dcdd512 | controller-1.localdomain | True           | :-)   | standby  |
| 17311ec7-2db0-440d-922d-06bc633cc2a8 | controller-2.localdomain | True           | :-)   | standby  |
| 3174da98-564f-4449-a2c3-704d799f6558 | controller-0.localdomain | True           | :-)   | active   |
+--------------------------------------+--------------------------+----------------+-------+----------+

You may see all three in standby at first; not to worry, one will come back to active, and in the meantime the ping (and everything else) should work.


When everything has settled, you can clean up the dead l3 agents:

  ssh undercloud
  . overcloudrc

  curl -o cleanup-non-alive-agents.sh https://bugzilla.redhat.com/attachment.cgi?id=1421315
  bash -x ./cleanup-non-alive-agents.sh


------8<-------- workaround end
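For reference, the cleanup step boils down to deleting every agent whose alive column shows `xxx`. A minimal sketch of that selection against captured, illustrative output; a live run would pipe real `neutron agent-list` output and replace `echo` with `neutron agent-delete "$id"`:

```shell
#!/bin/sh
# Sketch: extract the ids of dead agents (alive column = xxx) from
# illustrative `neutron agent-list`-style output (columns: id, host,
# alive).  A real run would call `neutron agent-delete "$id"` instead
# of echo.
listing='54ffd13f-ab05-4b5f-a884-a5016dcdd512 controller-1 xxx
17311ec7-2db0-440d-922d-06bc633cc2a8 controller-2.localdomain :-)
3174da98-564f-4449-a2c3-704d799f6558 controller-0 xxx'

dead=$(printf '%s\n' "$listing" | awk '$3 == "xxx" { print $1 }')
for id in $dead; do
    echo "would delete agent $id"
done
```

Only delete agents once the routers have been rescheduled, otherwise you remove agents that still own resources.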

One can check beforehand whether they are going to hit this bug by looking at the current host parameter in neutron (and should do the same for nova):


 grep '^host=' /etc/neutron/neutron.conf

If you have an fqdn as the host parameter then you should be fine; otherwise you will hit this issue.
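The check can be made explicit with a simple heuristic: a value containing a dot is treated as an FQDN. A sketch, using a temp file standing in for /etc/neutron/neutron.conf (do the same against /etc/nova/nova.conf):

```shell
#!/bin/sh
# Sketch: warn when host= is a short name (no dot), the condition that
# triggers this bug.  A temp file stands in for
# /etc/neutron/neutron.conf; run the same check on /etc/nova/nova.conf.
conf=$(mktemp)
printf '[DEFAULT]\nhost=controller-0\n' > "$conf"

h=$(sed -n 's/^host=//p' "$conf" | head -n1)
case "$h" in
    *.*) verdict="ok: $h is an FQDN" ;;
    *)   verdict="at risk: host=$h is not an FQDN" ;;
esac
echo "$verdict"
rm -f "$conf"
```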

The best course of action then would be to sync with eng, but here's an outline of what should be done *before* the upgrade.

Change the host parameter and restart neutron on all three controllers, then apply the workaround above.  You will have a small cut in connectivity, but the maintenance window will be short.
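A sketch of that pre-upgrade step on one controller; the temp file stands in for /etc/neutron/neutron.conf, the `.localdomain` suffix is illustrative (use your real domain), and a real run would follow up by restarting the neutron services on that node:

```shell
#!/bin/sh
# Sketch: rewrite host= to the FQDN in a copy of neutron.conf.  A real
# run would target /etc/neutron/neutron.conf on every controller, use
# the node's actual domain, and then restart the neutron services.
conf=$(mktemp)
printf '[DEFAULT]\nhost=controller-0\n' > "$conf"

short=$(sed -n 's/^host=//p' "$conf" | head -n1)
sed -i "s/^host=.*/host=${short}.localdomain/" "$conf"

newhost=$(sed -n 's/^host=//p' "$conf" | head -n1)
echo "host is now $newhost"
rm -f "$conf"
```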

Then you can upgrade as usual.

Comment 5 Sofer Athlan-Guyot 2018-09-12 22:18:35 UTC
Hi,

As osp11 has been EOL since May 18, it's hard to justify spending time on solving this one.  I'm closing it, especially since there are workarounds.

Please don't hesitate to re-open it if I missed something here.

