Bug 1499201 - OSP9 -> OSP10: workloads created before upgrade are not reachable anymore after rebooting controller nodes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: puppet-tripleo
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: async
Target Release: 10.0 (Newton)
Assignee: Sofer Athlan-Guyot
QA Contact: Yurii Prokulevych
URL:
Whiteboard:
Depends On:
Blocks: 1434621
Reported: 2017-10-06 11:21 UTC by Marius Cornea
Modified: 2022-08-02 17:52 UTC (History)

Fixed In Version: puppet-tripleo-5.6.8-7.el7ost.noarch,openstack-tripleo-heat-templates-5.3.10-5.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-06-27 23:30:46 UTC
Target Upstream Version:
Embargoed:


Attachments
Workaround to get l3ha routers rescheduled (330 bytes, application/x-shellscript)
2018-04-13 10:12 UTC, Miguel Angel Ajo
Script to cleanup dead agents (on the wrong host id) (186 bytes, application/x-shellscript)
2018-04-13 10:34 UTC, Miguel Angel Ajo


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1771324 0 None None None 2018-05-15 11:04:17 UTC
OpenStack gerrit 560855 0 None MERGED [NEWTON ONLY] Adjust NeutronAllowL3AgentFailover to new default. 2021-02-12 11:22:01 UTC
OpenStack gerrit 562542 0 None MERGED [Newton only] Do not overwrite current {neutron,nova}::host value. 2021-02-12 11:22:01 UTC
OpenStack gerrit 568552 0 None MERGED [Newton Only] Fix getting live value of nova/neutron during update. 2021-02-12 11:22:01 UTC
Red Hat Bugzilla 1474092 0 medium CLOSED host=localhost in /etc/nova/nova.conf and /etc/neutron/neutron.conf on the compute nodes 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1596571 0 medium CLOSED [UPGRADES][9->10->11] Workload not accessible after major-upgrade-composable-steps 2021-02-22 00:41:40 UTC
Red Hat Issue Tracker OSP-4717 0 None None None 2022-08-02 17:52:39 UTC
Red Hat Issue Tracker UPG-3071 0 None None None 2021-09-09 12:42:44 UTC
Red Hat Product Errata RHBA-2018:2101 0 None None None 2018-06-27 23:32:36 UTC

Internal Links: 1474092 1596571

Description Marius Cornea 2017-10-06 11:21:05 UTC
Description of problem:
After upgrading from OSP9 to OSP10, workloads created before the upgrade are no longer reachable once the controller nodes have been rebooted.



Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-5.3.0-6.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy latest OSP9
2. Launch workloads
3. Upgrade to OSP10

Actual results:
Workloads are not reachable anymore.

Expected results:
Workloads are reachable.

Additional info:

It looks like in OSP10 the services register with the domain name appended to the host name, whereas in OSP9 the domain name was not there:

+--------------------------------------+------------------------+----------------+-------+----------+
| id                                   | host                   | admin_state_up | alive | ha_state |
+--------------------------------------+------------------------+----------------+-------+----------+
| ec13520a-dcc2-4b34-bbfc-4a6c76466379 | overcloud-controller-2 | True           | xxx   | standby  |
| 87172b40-265c-4b24-a44f-ae7c5f2bb116 | overcloud-controller-0 | True           | xxx   | standby  |
| d6af24ef-2b49-4477-923d-b29bc7e13e86 | overcloud-controller-1 | True           | xxx   | active   |
+--------------------------------------+------------------------+----------------+-------+----------+
[stack@undercloud-0 ~]$ neutron l3-agent-list-hosting-router 19a4da15-3135-4099-861f-8a9b34815f56
/usr/lib/python2.7/site-packages/requests/packages/urllib3/connection.py:303: SubjectAltNameWarning: C



neutron agent-list
+--------------------------------------+--------------------+------------------------------------+-------------------+-------+----------------+---------------------------+
| id                                   | agent_type         | host                               | availability_zone | alive | admin_state_up | binary                    |
+--------------------------------------+--------------------+------------------------------------+-------------------+-------+----------------+---------------------------+
| 0d87b463-d27c-4b90-b43c-4420d367a0bb | Open vSwitch agent | overcloud-controller-2.localdomain |                   | :-)   | True           | neutron-openvswitch-agent |
| 1f21aad1-0688-41df-94fc-afffbc6ad639 | Metadata agent     | overcloud-controller-1             |                   | xxx   | True           | neutron-metadata-agent    |
| 22c57edf-8015-4172-acdd-5ca30fe9d2fd | L3 agent           | overcloud-controller-1.localdomain | nova              | :-)   | True           | neutron-l3-agent          |
| 2552feaf-1429-495b-a25f-1a492e5a6668 | Metadata agent     | overcloud-controller-2             |                   | xxx   | True           | neutron-metadata-agent    |
| 340db352-474c-4a31-a62a-e9a0f4406bd1 | DHCP agent         | overcloud-controller-0             | nova              | xxx   | True           | neutron-dhcp-agent        |
| 53b444b0-abbc-4825-b1ad-8622c77aa36e | L3 agent           | overcloud-controller-0.localdomain | nova              | :-)   | True           | neutron-l3-agent          |
| 54caa3f7-53ec-4f27-9252-a774b78c06c9 | Open vSwitch agent | overcloud-controller-1.localdomain |                   | :-)   | True           | neutron-openvswitch-agent |
| 6caf3e06-c3f5-4e50-99fa-c4f6ae4bdbb5 | DHCP agent         | overcloud-controller-2             | nova              | xxx   | True           | neutron-dhcp-agent        |
| 6e362eb0-678b-434f-b2bd-746107610114 | DHCP agent         | overcloud-controller-1.localdomain | nova              | :-)   | True           | neutron-dhcp-agent        |
| 82286c37-7c71-446a-a4b1-73647834944f | Metadata agent     | overcloud-controller-0             |                   | xxx   | True           | neutron-metadata-agent    |
| 83117afc-c8f7-4b5d-b9d5-859f960c677c | Metadata agent     | overcloud-controller-0.localdomain |                   | :-)   | True           | neutron-metadata-agent    |
| 844ef54d-69db-4728-b203-869136ef4368 | Open vSwitch agent | overcloud-controller-1             |                   | xxx   | True           | neutron-openvswitch-agent |
| 84c43451-4890-4448-a746-f4cab94cc767 | Open vSwitch agent | overcloud-controller-2             |                   | xxx   | True           | neutron-openvswitch-agent |
| 85330667-84b7-4bbf-93be-9dadd0736eea | Open vSwitch agent | overcloud-controller-0.localdomain |                   | :-)   | True           | neutron-openvswitch-agent |
| 87172b40-265c-4b24-a44f-ae7c5f2bb116 | L3 agent           | overcloud-controller-0             | nova              | xxx   | True           | neutron-l3-agent          |
| 88ea82ef-22e8-46dd-850b-5f34efd83bf5 | Metadata agent     | overcloud-controller-2.localdomain |                   | :-)   | True           | neutron-metadata-agent    |
| 8b30b03a-9c32-4c03-bf44-2ac1fd4492fe | DHCP agent         | overcloud-controller-0.localdomain | nova              | :-)   | True           | neutron-dhcp-agent        |
| c448cc63-29d8-4a41-a71d-97e499958aef | Metadata agent     | overcloud-controller-1.localdomain |                   | :-)   | True           | neutron-metadata-agent    |
| d17095af-7799-4024-abbf-b7c01efee452 | DHCP agent         | overcloud-controller-1             | nova              | xxx   | True           | neutron-dhcp-agent        |
| d1f664e1-6539-41a3-9686-1e828b9258af | Open vSwitch agent | overcloud-controller-0             |                   | xxx   | True           | neutron-openvswitch-agent |
| d5f26fbc-02ab-4866-945c-c798e80de94f | L3 agent           | overcloud-controller-2.localdomain | nova              | :-)   | True           | neutron-l3-agent          |
| d6af24ef-2b49-4477-923d-b29bc7e13e86 | L3 agent           | overcloud-controller-1             | nova              | xxx   | True           | neutron-l3-agent          |
| d71fbc5a-a3da-4eb2-bf76-f06c6130c895 | DHCP agent         | overcloud-controller-2.localdomain | nova              | :-)   | True           | neutron-dhcp-agent        |
| de1c1c12-3e2f-4ebf-9daa-f2b0b3eb3b38 | Open vSwitch agent | overcloud-compute-1.localdomain    |                   | :-)   | True           | neutron-openvswitch-agent |
| ec13520a-dcc2-4b34-bbfc-4a6c76466379 | L3 agent           | overcloud-controller-2             | nova              | xxx   | True           | neutron-l3-agent          |
| f24abfbf-3c42-45cf-9d39-d2eb11feb6e9 | Open vSwitch agent | overcloud-compute-0.localdomain    |                   | :-)   | True           | neutron-openvswitch-agent |
+--------------------------------------+--------------------+------------------------------------+-------------------+-------+----------------+---------------------------+
[stack@undercloud-0 ~]$ 

[stack@undercloud-0 ~]$ openstack compute service list
+-----+------------------+------------------------------------+----------+---------+-------+----------------------------+
|  ID | Binary           | Host                               | Zone     | Status  | State | Updated At                 |
+-----+------------------+------------------------------------+----------+---------+-------+----------------------------+
|   2 | nova-scheduler   | overcloud-controller-0             | internal | enabled | down  | 2017-10-05T23:49:45.000000 |
|   5 | nova-scheduler   | overcloud-controller-2             | internal | enabled | down  | 2017-10-05T23:48:23.000000 |
|   8 | nova-scheduler   | overcloud-controller-1             | internal | enabled | down  | 2017-10-05T23:48:21.000000 |
|  68 | nova-consoleauth | overcloud-controller-2             | internal | enabled | down  | 2017-10-05T23:48:08.000000 |
|  71 | nova-consoleauth | overcloud-controller-1             | internal | enabled | down  | 2017-10-05T23:48:14.000000 |
|  74 | nova-consoleauth | overcloud-controller-0             | internal | enabled | down  | 2017-10-05T23:48:30.000000 |
|  77 | nova-conductor   | overcloud-controller-1             | internal | enabled | down  | 2017-10-05T23:48:30.000000 |
|  86 | nova-conductor   | overcloud-controller-2             | internal | enabled | down  | 2017-10-05T23:48:31.000000 |
|  98 | nova-conductor   | overcloud-controller-0             | internal | enabled | down  | 2017-10-05T23:49:55.000000 |
| 101 | nova-compute     | overcloud-compute-1.localdomain    | nova     | enabled | up    | 2017-10-06T11:00:10.000000 |
| 104 | nova-compute     | overcloud-compute-0.localdomain    | nova     | enabled | up    | 2017-10-06T11:00:06.000000 |
| 105 | nova-consoleauth | overcloud-controller-2.localdomain | internal | enabled | up    | 2017-10-06T11:00:11.000000 |
| 108 | nova-consoleauth | overcloud-controller-1.localdomain | internal | enabled | up    | 2017-10-06T11:00:05.000000 |
| 111 | nova-scheduler   | overcloud-controller-1.localdomain | internal | enabled | up    | 2017-10-06T11:00:10.000000 |
| 114 | nova-scheduler   | overcloud-controller-2.localdomain | internal | enabled | up    | 2017-10-06T11:00:05.000000 |
| 117 | nova-conductor   | overcloud-controller-1.localdomain | internal | enabled | up    | 2017-10-06T11:00:09.000000 |
| 123 | nova-consoleauth | overcloud-controller-0.localdomain | internal | enabled | up    | 2017-10-06T11:00:06.000000 |
| 126 | nova-conductor   | overcloud-controller-2.localdomain | internal | enabled | up    | 2017-10-06T11:00:10.000000 |
| 129 | nova-scheduler   | overcloud-controller-0.localdomain | internal | enabled | up    | 2017-10-06T11:00:04.000000 |
| 132 | nova-conductor   | overcloud-controller-0.localdomain | internal | enabled | up    | 2017-10-06T11:00:06.000000 |
+-----+------------------+------------------------------------+----------+---------+-------+----------------------------+
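
The listings above show each controller agent twice: a dead entry under the short hostname and a live one under the FQDN. A minimal sketch of how that pattern could be detected automatically from the CLI table text follows. The helper names (parse_table, stale_duplicates) are hypothetical and not part of any OpenStack tool, and the sample only contains three rows copied from the agent list above.

```python
from collections import defaultdict


def parse_table(text):
    """Parse a '| a | b |' style CLI table into a list of dicts.

    Border lines ('+---+') are skipped; the first '|' row is the header.
    """
    rows = [
        [cell.strip() for cell in line.strip()[1:-1].split("|")]
        for line in text.splitlines()
        if line.strip().startswith("|")
    ]
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


def stale_duplicates(agents):
    """Return dead agents that share binary and short hostname with a live agent."""
    groups = defaultdict(list)
    for agent in agents:
        short = agent["host"].split(".", 1)[0]
        groups[(agent["binary"], short)].append(agent)
    stale = []
    for members in groups.values():
        if any(a["alive"] == ":-)" for a in members):
            stale.extend(a for a in members if a["alive"] == "xxx")
    return stale


# Three rows taken from the `neutron agent-list` output in this report.
SAMPLE = """
| id                                   | agent_type         | host                               | availability_zone | alive | admin_state_up | binary                    |
| 22c57edf-8015-4172-acdd-5ca30fe9d2fd | L3 agent           | overcloud-controller-1.localdomain | nova              | :-)   | True           | neutron-l3-agent          |
| d6af24ef-2b49-4477-923d-b29bc7e13e86 | L3 agent           | overcloud-controller-1             | nova              | xxx   | True           | neutron-l3-agent          |
| f24abfbf-3c42-45cf-9d39-d2eb11feb6e9 | Open vSwitch agent | overcloud-compute-0.localdomain    |                   | :-)   | True           | neutron-openvswitch-agent |
"""

if __name__ == "__main__":
    for agent in stale_duplicates(parse_table(SAMPLE)):
        print(agent["id"], agent["host"], agent["binary"])
```

Grouping on the short hostname is an assumption that holds for this environment, where the only difference between old and new entries is the ".localdomain" suffix.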

Comment 1 Sofer Athlan-Guyot 2017-10-06 11:30:55 UTC
So the host parameter was added in neutron.conf:

    +host=overcloud-controller-0.localdomain

and in nova.conf:

    +host=overcloud-controller-0.localdomain

I guess that previously the default was used, and that default was the short hostname, not the FQDN.
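
As a rough illustration of that change, the sketch below reads the host option from INI-style text and checks whether a new value is just the old one with a domain appended. It uses Python's configparser as a stand-in; the real services read their configuration through oslo.config, so this is only an approximation, and the function names are hypothetical.

```python
import configparser


def configured_host(conf_text):
    """Return the [DEFAULT] host value from INI-style text, or None if unset."""
    # strict=False tolerates repeated options; interpolation=None keeps '%' literal.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    return parser.defaults().get("host")


def host_changed_to_fqdn(old_host, new_host):
    """True if new_host is old_host with a domain suffix appended."""
    return new_host != old_host and new_host.startswith(old_host + ".")


if __name__ == "__main__":
    after_upgrade = "[DEFAULT]\nhost=overcloud-controller-0.localdomain\n"
    new_host = configured_host(after_upgrade)
    print(new_host, host_changed_to_fqdn("overcloud-controller-0", new_host))
```

When the option is absent, configured_host returns None, which corresponds to the case where the service falls back to its built-in default.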

This effectively changes every service definition.  The first side effect found is that routers created before this change are unreachable after the controllers are rebooted.  Their L3 agents are down:


[stack@undercloud-0 ~]$ neutron l3-agent-list-hosting-router a281c931-d8f2-4a5f-9991-7594bf408cf9
+--------------------------------------+------------------------+----------------+-------+----------+
| id                                   | host                   | admin_state_up | alive | ha_state |
+--------------------------------------+------------------------+----------------+-------+----------+
| ec13520a-dcc2-4b34-bbfc-4a6c76466379 | overcloud-controller-2 | True           | xxx   | standby  |
| 87172b40-265c-4b24-a44f-ae7c5f2bb116 | overcloud-controller-0 | True           | xxx   | standby  |
| d6af24ef-2b49-4477-923d-b29bc7e13e86 | overcloud-controller-1 | True           | xxx   | active   |
+--------------------------------------+------------------------+----------------+-------+----------+

as they are still associated with the old host value.

====> This basically makes the floating IPs unreachable, which is bad.

Note that it seems you need to reboot the controllers for the problem to occur, but I am waiting on Marius to confirm this point.
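
The condition seen in the table above (every hosting agent dead, including the one marked active) could be expressed as a small check like the one below. This is only a sketch: the row dicts mirror the columns of `neutron l3-agent-list-hosting-router`, the function name is hypothetical, and treating "alive and active" as the criterion for a served router is a simplifying assumption.

```python
def router_is_served(hosting_agents):
    """True if at least one agent hosting the router is alive and HA-active."""
    return any(
        agent["alive"] == ":-)" and agent["ha_state"] == "active"
        for agent in hosting_agents
    )


# Rows copied from the l3-agent-list-hosting-router output in this comment.
HOSTING = [
    {"host": "overcloud-controller-2", "alive": "xxx", "ha_state": "standby"},
    {"host": "overcloud-controller-0", "alive": "xxx", "ha_state": "standby"},
    {"host": "overcloud-controller-1", "alive": "xxx", "ha_state": "active"},
]

if __name__ == "__main__":
    print(router_is_served(HOSTING))
```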

Comment 2 Sofer Athlan-Guyot 2017-10-06 17:07:42 UTC
So to summarize: before OSP10, nova and neutron defaulted to
socket.gethostname.  From OSP10 on, we explicitly set this value
to the FQDN[1].

It appears that osp9/rhel7.4 is configured in such a way that
socket.gethostname returns (correctly) controller-X.

But we found one build where osp9 returned the FQDN
controller-X.localdomain.  That build was based on osp9/rhel7.3.  As
it stands, when we log in to such an env, the hostname command (wrongly)
returns the FQDN.

The clouddomain variable on the undercloud is empty in all
environments (working and non-working), so it doesn't seem relevant.
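
To see which of the two cases a given node falls into, something like the following could be run on it. socket.gethostname and socket.getfqdn are standard library calls; the helper names and the dot-based FQDN test are assumptions made for this sketch.

```python
import socket


def short_hostname(name):
    """Return the part of a host name before the first dot."""
    return name.split(".", 1)[0]


def is_fqdn(name):
    """Heuristic: treat a name as an FQDN if it contains a domain part."""
    return "." in name.strip(".")


def describe_local_host():
    """Report what this node returns for the short name and the FQDN."""
    name = socket.gethostname()
    return {
        "gethostname": name,
        "getfqdn": socket.getfqdn(),
        "gethostname_is_fqdn": is_fqdn(name),
    }


if __name__ == "__main__":
    for key, value in describe_local_host().items():
        print(key, "=", value)
```

If gethostname_is_fqdn is already true before the upgrade, the explicit FQDN setting should not change the registered host value on that node.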

TL;DR

An easy way to check for this problem is to run these commands:

[stack@undercloud-0 ~]$ neutron agent-list
+--------------------------------------+--------------------+------------------------------------+-------------------+-------+----------------+---------------------------+
| id                                   | agent_type         | host                               | availability_zone | alive | admin_state_up | binary                    |
+--------------------------------------+--------------------+------------------------------------+-------------------+-------+----------------+---------------------------+
| 0d87b463-d27c-4b90-b43c-4420d367a0bb | Open vSwitch agent | overcloud-controller-2.localdomain |                   | :-)   | True           | neutron-openvswitch-agent |
| 1f21aad1-0688-41df-94fc-afffbc6ad639 | Metadata agent     | overcloud-controller-1             |                   | xxx   | True           | neutron-metadata-agent    |
| 22c57edf-8015-4172-acdd-5ca30fe9d2fd | L3 agent           | overcloud-controller-1.localdomain | nova              | :-)   | True           | neutron-l3-agent          |
| 2552feaf-1429-495b-a25f-1a492e5a6668 | Metadata agent     | overcloud-controller-2             |                   | xxx   | True           | neutron-metadata-agent    |
| 340db352-474c-4a31-a62a-e9a0f4406bd1 | DHCP agent         | overcloud-controller-0             | nova              | xxx   | True           | neutron-dhcp-agent        |
| 53b444b0-abbc-4825-b1ad-8622c77aa36e | L3 agent           | overcloud-controller-0.localdomain | nova              | :-)   | True           | neutron-l3-agent          |
| 54caa3f7-53ec-4f27-9252-a774b78c06c9 | Open vSwitch agent | overcloud-controller-1.localdomain |                   | :-)   | True           | neutron-openvswitch-agent |
| 6caf3e06-c3f5-4e50-99fa-c4f6ae4bdbb5 | DHCP agent         | overcloud-controller-2             | nova              | xxx   | True           | neutron-dhcp-agent        |
| 6e362eb0-678b-434f-b2bd-746107610114 | DHCP agent         | overcloud-controller-1.localdomain | nova              | :-)   | True           | neutron-dhcp-agent        |
| 82286c37-7c71-446a-a4b1-73647834944f | Metadata agent     | overcloud-controller-0             |                   | xxx   | True           | neutron-metadata-agent    |
| 83117afc-c8f7-4b5d-b9d5-859f960c677c | Metadata agent     | overcloud-controller-0.localdomain |                   | :-)   | True           | neutron-metadata-agent    |
| 844ef54d-69db-4728-b203-869136ef4368 | Open vSwitch agent | overcloud-controller-1             |                   | xxx   | True           | neutron-openvswitch-agent |
| 84c43451-4890-4448-a746-f4cab94cc767 | Open vSwitch agent | overcloud-controller-2             |                   | xxx   | True           | neutron-openvswitch-agent |
| 85330667-84b7-4bbf-93be-9dadd0736eea | Open vSwitch agent | overcloud-controller-0.localdomain |                   | :-)   | True           | neutron-openvswitch-agent |
| 87172b40-265c-4b24-a44f-ae7c5f2bb116 | L3 agent           | overcloud-controller-0             | nova              | xxx   | True           | neutron-l3-agent          |
| 88ea82ef-22e8-46dd-850b-5f34efd83bf5 | Metadata agent     | overcloud-controller-2.localdomain |                   | :-)   | True           | neutron-metadata-agent    |
| 8b30b03a-9c32-4c03-bf44-2ac1fd4492fe | DHCP agent         | overcloud-controller-0.localdomain | nova              | :-)   | True           | neutron-dhcp-agent        |
| c448cc63-29d8-4a41-a71d-97e499958aef | Metadata agent     | overcloud-controller-1.localdomain |                   | :-)   | True           | neutron-metadata-agent    |
| d17095af-7799-4024-abbf-b7c01efee452 | DHCP agent         | overcloud-controller-1             | nova              | xxx   | True           | neutron-dhcp-agent        |
| d1f664e1-6539-41a3-9686-1e828b9258af | Open vSwitch agent | overcloud-controller-0             |                   | xxx   | True           | neutron-openvswitch-agent |
| d5f26fbc-02ab-4866-945c-c798e80de94f | L3 agent           | overcloud-controller-2.localdomain | nova              | :-)   | True           | neutron-l3-agent          |
| d6af24ef-2b49-4477-923d-b29bc7e13e86 | L3 agent           | overcloud-controller-1             | nova              | xxx   | True           | neutron-l3-agent          |
| d71fbc5a-a3da-4eb2-bf76-f06c6130c895 | DHCP agent         | overcloud-controller-2.localdomain | nova              | :-)   | True           | neutron-dhcp-agent        |
| de1c1c12-3e2f-4ebf-9daa-f2b0b3eb3b38 | Open vSwitch agent | overcloud-compute-1.localdomain    |                   | :-)   | True           | neutron-openvswitch-agent |
| ec13520a-dcc2-4b34-bbfc-4a6c76466379 | L3 agent           | overcloud-controller-2             | nova              | xxx   | True           | neutron-l3-agent          |
| f24abfbf-3c42-45cf-9d39-d2eb11feb6e9 | Open vSwitch agent | overcloud-compute-0.localdomain    |                   | :-)   | True           | neutron-openvswitch-agent |
+--------------------------------------+--------------------+------------------------------------+-------------------+-------+----------------+---------------------------+
[stack@undercloud-0 ~]$

[stack@undercloud-0 ~]$ openstack compute service list
+-----+------------------+------------------------------------+----------+---------+-------+----------------------------+
|  ID | Binary           | Host                               | Zone     | Status  | State | Updated At                 |
+-----+------------------+------------------------------------+----------+---------+-------+----------------------------+
|   2 | nova-scheduler   | overcloud-controller-0             | internal | enabled | down  | 2017-10-05T23:49:45.000000 |
|   5 | nova-scheduler   | overcloud-controller-2             | internal | enabled | down  | 2017-10-05T23:48:23.000000 |
|   8 | nova-scheduler   | overcloud-controller-1             | internal | enabled | down  | 2017-10-05T23:48:21.000000 |
|  68 | nova-consoleauth | overcloud-controller-2             | internal | enabled | down  | 2017-10-05T23:48:08.000000 |
|  71 | nova-consoleauth | overcloud-controller-1             | internal | enabled | down  | 2017-10-05T23:48:14.000000 |
|  74 | nova-consoleauth | overcloud-controller-0             | internal | enabled | down  | 2017-10-05T23:48:30.000000 |
|  77 | nova-conductor   | overcloud-controller-1             | internal | enabled | down  | 2017-10-05T23:48:30.000000 |
|  86 | nova-conductor   | overcloud-controller-2             | internal | enabled | down  | 2017-10-05T23:48:31.000000 |
|  98 | nova-conductor   | overcloud-controller-0             | internal | enabled | down  | 2017-10-05T23:49:55.000000 |
| 101 | nova-compute     | overcloud-compute-1.localdomain    | nova     | enabled | up    | 2017-10-06T11:00:10.000000 |
| 104 | nova-compute     | overcloud-compute-0.localdomain    | nova     | enabled | up    | 2017-10-06T11:00:06.000000 |
| 105 | nova-consoleauth | overcloud-controller-2.localdomain | internal | enabled | up    | 2017-10-06T11:00:11.000000 |
| 108 | nova-consoleauth | overcloud-controller-1.localdomain | internal | enabled | up    | 2017-10-06T11:00:05.000000 |
| 111 | nova-scheduler   | overcloud-controller-1.localdomain | internal | enabled | up    | 2017-10-06T11:00:10.000000 |
| 114 | nova-scheduler   | overcloud-controller-2.localdomain | internal | enabled | up    | 2017-10-06T11:00:05.000000 |
| 117 | nova-conductor   | overcloud-controller-1.localdomain | internal | enabled | up    | 2017-10-06T11:00:09.000000 |
| 123 | nova-consoleauth | overcloud-controller-0.localdomain | internal | enabled | up    | 2017-10-06T11:00:06.000000 |
| 126 | nova-conductor   | overcloud-controller-2.localdomain | internal | enabled | up    | 2017-10-06T11:00:10.000000 |
| 129 | nova-scheduler   | overcloud-controller-0.localdomain | internal | enabled | up    | 2017-10-06T11:00:04.000000 |
| 132 | nova-conductor   | overcloud-controller-0.localdomain | internal | enabled | up    | 2017-10-06T11:00:06.000000 |
+-----+------------------+------------------------------------+----------+---------+-------+----------------------------+

[1]
https://github.com/openstack/puppet-neutron/commit/c93d5a342d50d820f3922a97b3224be2e9747472
and
https://github.com//openstack/tripleo-heat-templates/commit/056ce2374851e4e96dd3fd822de9da76b35e1eb7

Comment 3 Marius Cornea 2017-10-06 18:11:17 UTC
Tested a minor update from the OSP9 2017-06-01.1 build (overcloud nodes on RHEL 7.3) to the latest OSP9 on RHEL 7.4.

After minor update completed:

[stack@undercloud-0 ~]$ nova service-list
/usr/lib/python2.7/site-packages/keyring/backends/Gnome.py:6: PyGIWarning: GnomeKeyring was imported without specifying a version first. Use gi.require_version('GnomeKeyring', '1.0') before import to ensure that the right version gets loaded.
  from gi.repository import GnomeKeyring
+----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+
| Id | Binary           | Host                     | Zone     | Status  | State | Updated_at                 | Disabled Reason |
+----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+
| 1  | nova-scheduler   | controller-0.localdomain | internal | enabled | up    | 2017-10-06T17:40:18.000000 | -               |
| 4  | nova-consoleauth | controller-0.localdomain | internal | enabled | up    | 2017-10-06T17:40:15.000000 | -               |
| 5  | nova-conductor   | controller-0.localdomain | internal | enabled | up    | 2017-10-06T17:40:21.000000 | -               |
| 6  | nova-compute     | compute-0.localdomain    | nova     | enabled | up    | 2017-10-06T17:40:15.000000 | -               |
+----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+
[stack@undercloud-0 ~]$ neutron agent-list
+--------------------------------------+--------------------+--------------------------+-------------------+-------+----------------+---------------------------+
| id                                   | agent_type         | host                     | availability_zone | alive | admin_state_up | binary                    |
+--------------------------------------+--------------------+--------------------------+-------------------+-------+----------------+---------------------------+
| 190fde25-fcdb-4b22-aaa7-e1cb55444914 | Metadata agent     | controller-0.localdomain |                   | :-)   | True           | neutron-metadata-agent    |
| 50ac64d2-a759-47d8-a525-2d766cbeae04 | Open vSwitch agent | compute-0.localdomain    |                   | :-)   | True           | neutron-openvswitch-agent |
| e7903a51-c80d-40b7-b9c1-6c7d66b46619 | L3 agent           | controller-0.localdomain | nova              | :-)   | True           | neutron-l3-agent          |
| fad33d95-13d6-4bec-bfe4-97d054bedf89 | DHCP agent         | controller-0.localdomain | nova              | :-)   | True           | neutron-dhcp-agent        |
| ffcbef93-5f8f-4822-84de-05488df6bb0b | Open vSwitch agent | controller-0.localdomain |                   | :-)   | True           | neutron-openvswitch-agent |
+--------------------------------------+--------------------+--------------------------+-------------------+-------+----------------+---------------------------+


After reboot:

[stack@undercloud-0 ~]$ nova service-list
/usr/lib/python2.7/site-packages/keyring/backends/Gnome.py:6: PyGIWarning: GnomeKeyring was imported without specifying a version first. Use gi.require_version('GnomeKeyring', '1.0') before import to ensure that the right version gets loaded.
  from gi.repository import GnomeKeyring
+----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+
| Id | Binary           | Host                     | Zone     | Status  | State | Updated_at                 | Disabled Reason |
+----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+
| 1  | nova-scheduler   | controller-0.localdomain | internal | enabled | up    | 2017-10-06T17:46:37.000000 | -               |
| 4  | nova-consoleauth | controller-0.localdomain | internal | enabled | up    | 2017-10-06T17:46:38.000000 | -               |
| 5  | nova-conductor   | controller-0.localdomain | internal | enabled | up    | 2017-10-06T17:46:30.000000 | -               |
| 6  | nova-compute     | compute-0.localdomain    | nova     | enabled | up    | 2017-10-06T17:46:38.000000 | -               |
+----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+
[stack@undercloud-0 ~]$ neutron agent-list
+--------------------------------------+--------------------+--------------------------+-------------------+-------+----------------+---------------------------+
| id                                   | agent_type         | host                     | availability_zone | alive | admin_state_up | binary                    |
+--------------------------------------+--------------------+--------------------------+-------------------+-------+----------------+---------------------------+
| 190fde25-fcdb-4b22-aaa7-e1cb55444914 | Metadata agent     | controller-0.localdomain |                   | :-)   | True           | neutron-metadata-agent    |
| 50ac64d2-a759-47d8-a525-2d766cbeae04 | Open vSwitch agent | compute-0.localdomain    |                   | :-)   | True           | neutron-openvswitch-agent |
| e7903a51-c80d-40b7-b9c1-6c7d66b46619 | L3 agent           | controller-0.localdomain | nova              | :-)   | True           | neutron-l3-agent          |
| fad33d95-13d6-4bec-bfe4-97d054bedf89 | DHCP agent         | controller-0.localdomain | nova              | :-)   | True           | neutron-dhcp-agent        |
| ffcbef93-5f8f-4822-84de-05488df6bb0b | Open vSwitch agent | controller-0.localdomain |                   | :-)   | True           | neutron-openvswitch-agent |
+--------------------------------------+--------------------+--------------------------+-------------------+-------+----------------+---------------------------+



But if we rerun the overcloud deploy command:

[stack@undercloud-0 ~]$ nova service-list
/usr/lib/python2.7/site-packages/keyring/backends/Gnome.py:6: PyGIWarning: GnomeKeyring was imported without specifying a version first. Use gi.require_version('GnomeKeyring', '1.0') before import to ensure that the right version gets loaded.
  from gi.repository import GnomeKeyring
+----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+
| Id | Binary           | Host                     | Zone     | Status  | State | Updated_at                 | Disabled Reason |
+----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+
| 1  | nova-scheduler   | controller-0.localdomain | internal | enabled | up    | 2017-10-06T18:04:51.000000 | -               |
| 4  | nova-consoleauth | controller-0.localdomain | internal | enabled | up    | 2017-10-06T18:04:53.000000 | -               |
| 5  | nova-conductor   | controller-0.localdomain | internal | enabled | up    | 2017-10-06T18:04:56.000000 | -               |
| 6  | nova-compute     | compute-0.localdomain    | nova     | enabled | up    | 2017-10-06T18:04:58.000000 | -               |
+----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+
[stack@undercloud-0 ~]$ neutron service-list
Unknown command [u'service-list']
[stack@undercloud-0 ~]$ neutron agent-list
+--------------------------------------+--------------------+--------------------------+-------------------+-------+----------------+---------------------------+
| id                                   | agent_type         | host                     | availability_zone | alive | admin_state_up | binary                    |
+--------------------------------------+--------------------+--------------------------+-------------------+-------+----------------+---------------------------+
| 190fde25-fcdb-4b22-aaa7-e1cb55444914 | Metadata agent     | controller-0.localdomain |                   | :-)   | True           | neutron-metadata-agent    |
| 50ac64d2-a759-47d8-a525-2d766cbeae04 | Open vSwitch agent | compute-0.localdomain    |                   | :-)   | True           | neutron-openvswitch-agent |
| e7903a51-c80d-40b7-b9c1-6c7d66b46619 | L3 agent           | controller-0.localdomain | nova              | :-)   | True           | neutron-l3-agent          |
| fad33d95-13d6-4bec-bfe4-97d054bedf89 | DHCP agent         | controller-0.localdomain | nova              | :-)   | True           | neutron-dhcp-agent        |
| ffcbef93-5f8f-4822-84de-05488df6bb0b | Open vSwitch agent | controller-0.localdomain |                   | :-)   | True           | neutron-openvswitch-agent |
+--------------------------------------+--------------------+--------------------------+-------------------+-------+----------------+---------------------------+

[stack@undercloud-0 ~]$ nova service-list
/usr/lib/python2.7/site-packages/keyring/backends/Gnome.py:6: PyGIWarning: GnomeKeyring was imported without specifying a version first. Use gi.require_version('GnomeKeyring', '1.0') before import to ensure that the right version gets loaded.
  from gi.repository import GnomeKeyring
+----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+
| Id | Binary           | Host                     | Zone     | Status  | State | Updated_at                 | Disabled Reason |
+----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+
| 1  | nova-scheduler   | controller-0.localdomain | internal | enabled | up    | 2017-10-06T18:09:19.000000 | -               |
| 4  | nova-consoleauth | controller-0.localdomain | internal | enabled | up    | 2017-10-06T18:09:17.000000 | -               |
| 5  | nova-conductor   | controller-0.localdomain | internal | enabled | up    | 2017-10-06T18:09:12.000000 | -               |
| 6  | nova-compute     | compute-0.localdomain    | nova     | enabled | up    | 2017-10-06T18:09:20.000000 | -               |
+----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+
[stack@undercloud-0 ~]$ neutron agent-list
+--------------------------------------+--------------------+--------------------------+-------------------+-------+----------------+---------------------------+
| id                                   | agent_type         | host                     | availability_zone | alive | admin_state_up | binary                    |
+--------------------------------------+--------------------+--------------------------+-------------------+-------+----------------+---------------------------+
| 190fde25-fcdb-4b22-aaa7-e1cb55444914 | Metadata agent     | controller-0.localdomain |                   | :-)   | True           | neutron-metadata-agent    |
| 50ac64d2-a759-47d8-a525-2d766cbeae04 | Open vSwitch agent | compute-0.localdomain    |                   | :-)   | True           | neutron-openvswitch-agent |
| e7903a51-c80d-40b7-b9c1-6c7d66b46619 | L3 agent           | controller-0.localdomain | nova              | :-)   | True           | neutron-l3-agent          |
| fad33d95-13d6-4bec-bfe4-97d054bedf89 | DHCP agent         | controller-0.localdomain | nova              | :-)   | True           | neutron-dhcp-agent        |
| ffcbef93-5f8f-4822-84de-05488df6bb0b | Open vSwitch agent | controller-0.localdomain |                   | :-)   | True           | neutron-openvswitch-agent |
+--------------------------------------+--------------------+--------------------------+-------------------+-------+----------------+---------------------------+

[root@controller-0 heat-admin]# python
Python 2.7.5 (default, May  3 2017, 07:55:04) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-14)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import socket
>>> socket.gethostname
<built-in function gethostname>
>>> socket.gethostname()
'controller-0.localdomain'

Comment 4 Sofer Athlan-Guyot 2018-04-11 19:15:34 UTC
Hi,

changing

   allow_automatic_l3agent_failover=False 

to 

   allow_automatic_l3agent_failover=True

should prevent the issue as the "new" l3 agent would take over the previous workload.

We already have 

   allow_automatic_dhcp_failover = true

(it's commented, but it's the default value)

I am testing that workaround; it should only need a restart of neutron-server on each controller.
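
A minimal sketch, assuming neutron.conf can be read as plain INI text, of how one might confirm which value neutron-server sees for the two options mentioned above. The helper name `effective_bool` and the sample text are hypothetical; the default for each option is passed in by the caller rather than read from neutron itself.

```python
import configparser


def effective_bool(conf_text, option, default):
    """Return the boolean value of `option` in [DEFAULT], or `default` if unset."""
    # strict=False tolerates repeated options; interpolation=None keeps '%' literal.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(conf_text)
    raw = parser.defaults().get(option.lower())
    if raw is None:
        # A commented-out line counts as unset, so the caller's default applies.
        return default
    return raw.strip().lower() in ("true", "1", "yes", "on")


# Hypothetical excerpt mirroring the settings quoted above.
SAMPLE = """\
[DEFAULT]
allow_automatic_l3agent_failover=False
#allow_automatic_dhcp_failover = true
"""

l3_failover = effective_bool(SAMPLE, "allow_automatic_l3agent_failover", False)
dhcp_failover = effective_bool(SAMPLE, "allow_automatic_dhcp_failover", True)
print(l3_failover, dhcp_failover)
```

On a real controller the text would come from /etc/neutron/neutron.conf instead of the sample string.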

Comment 6 Sofer Athlan-Guyot 2018-04-13 08:51:22 UTC
Hi,

So the test was unsuccessful. Next steps are:
 1. getting more help from networking;
 2. trying this https://review.openstack.org/#/q/I8f075a5ad869ef0dc72a700dcb7be0b6efca787a which strives to never change the host id.

TL;DR

Even with

   allow_automatic_l3agent_failover=True

configured in neutron.conf on the three controllers, the router stays
on the failed l3 agents:

  neutron l3-agent-list-hosting-router 903195f0-c361-46a4-8b71-9a9b9bde572c
  +--------------------------------------+--------------+----------------+-------+----------+
  | id                                   | host         | admin_state_up | alive | ha_state |
  +--------------------------------------+--------------+----------------+-------+----------+
  | e5267f02-5b5f-44ec-ab0d-ae0c2fa42b6f | controller-0 | True           | xxx   | standby  |
  | c17f7b3a-22c4-4b5c-ba66-ad5ab85bd1ee | controller-1 | True           | xxx   | active   |
  | d839c597-f68e-4c18-b6e6-6ef7f44e643f | controller-2 | True           | xxx   | standby  |
  +--------------------------------------+--------------+----------------+-------+----------+

They are not migrated to the live agents:
  
  neutron agent-list | grep L3
  | 17311ec7-2db0-440d-922d-06bc633cc2a8 | L3 agent           | controller-2.localdomain | nova              | :-)   | True           | neutron-l3-agent          |
  | 3174da98-564f-4449-a2c3-704d799f6558 | L3 agent           | controller-0.localdomain | nova              | :-)   | True           | neutron-l3-agent          |
  | 54ffd13f-ab05-4b5f-a884-a5016dcdd512 | L3 agent           | controller-1.localdomain | nova              | :-)   | True           | neutron-l3-agent          |
  | c17f7b3a-22c4-4b5c-ba66-ad5ab85bd1ee | L3 agent           | controller-1             | nova              | xxx   | True           | neutron-l3-agent          |
  | d839c597-f68e-4c18-b6e6-6ef7f44e643f | L3 agent           | controller-2             | nova              | xxx   | True           | neutron-l3-agent          |
  | e5267f02-5b5f-44ec-ab0d-ae0c2fa42b6f | L3 agent           | controller-0             | nova              | xxx   | True           | neutron-l3-agent          |
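
A minimal sketch of how the listing above can be read mechanically: it pairs each dead agent with a live agent of the same type whose host has the same short name, which is the short-name versus FQDN split visible in the table. The function names are hypothetical, and the parser assumes the seven-column layout shown in this report.

```python
def parse_agents(table_text):
    """Parse `neutron agent-list` rows into dicts; border and header lines are skipped."""
    agents = []
    for line in table_text.splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) != 7 or cells[0] == "id":
            continue
        agents.append({"id": cells[0], "type": cells[1],
                       "host": cells[2], "alive": cells[4] == ":-)"})
    return agents


def renamed_pairs(agents):
    """Return (dead_host, live_host) pairs that name the same node differently."""
    pairs = []
    for dead in (a for a in agents if not a["alive"]):
        for live in (a for a in agents if a["alive"]):
            same_type = live["type"] == dead["type"]
            same_short = live["host"].split(".")[0] == dead["host"].split(".")[0]
            if same_type and same_short and live["host"] != dead["host"]:
                pairs.append((dead["host"], live["host"]))
    return sorted(pairs)


# Two rows taken from the listing above.
SAMPLE = """
| 3174da98-564f-4449-a2c3-704d799f6558 | L3 agent | controller-0.localdomain | nova | :-) | True | neutron-l3-agent |
| e5267f02-5b5f-44ec-ab0d-ae0c2fa42b6f | L3 agent | controller-0             | nova | xxx | True | neutron-l3-agent |
"""
print(renamed_pairs(parse_agents(SAMPLE)))
```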


The puppet log during the converge shows that puppet did its job:

  Apr 12 21:50:32 controller-1 os-collect-config[2860]: Notice: /Stage[main]/Neutron::Server/Neutron_config[DEFAULT/allow_automatic_l3agent_failover]/value: value changed ['False'] to ['True'] 
  Apr 12 21:50:32 controller-1 os-collect-config[2860]: Notice: /Stage[main]/Neutron::Server/Neutron_config[DEFAULT/api_workers]/value: value changed ['0'] to ['4']
  Apr 12 21:50:32 controller-1 os-collect-config[2860]: Notice: /Stage[main]/Neutron::Server/Oslo::Middleware[neutron_config]/Neutron_config[oslo_middleware/enable_proxy_headers_parsing]/ensure: created
  Apr 12 21:50:32 controller-1 os-collect-config[2860]: Notice: /Stage[main]/Neutron::Server/Neutron_config[DEFAULT/router_distributed]/ensure: created
  Apr 12 21:50:32 controller-1 os-collect-config[2860]: Notice: /Stage[main]/Neutron::Db/Oslo::Db[neutron_config]/Neutron_config[database/db_max_retries]/ensure: created
  Apr 12 21:50:32 controller-1 os-collect-config[2860]: Notice: /Stage[main]/Neutron::Db/Oslo::Db[neutron_config]/Neutron_config[database/connection]/value: value changed '[old secret redacted]' to '[new secret redacted]'
  Apr 12 21:50:32 controller-1 os-collect-config[2860]: Notice: /Stage[main]/Neutron::Server/Neutron_api_config[filter:authtoken/admin_tenant_name]/ensure: removed
  Apr 12 21:50:32 controller-1 os-collect-config[2860]: Notice: /Stage[main]/Neutron::Deps/Anchor[neutron::config::end]: Triggered 'refresh' from 22 events
  Apr 12 21:50:32 controller-1 os-collect-config[2860]: Notice: /Stage[main]/Neutron::Deps/Anchor[neutron::service::begin]: Triggered 'refresh' from 1 events
  Apr 12 21:50:32 controller-1 os-collect-config[2860]: Notice: /Stage[main]/Neutron::Server/Service[neutron-server]: Triggered 'refresh' from 1 events
  Apr 12 21:50:32 controller-1 os-collect-config[2860]: Notice: /Stage[main]/Neutron::Agents::Ml2::Ovs/Service[neutron-ovs-agent-service]: Triggered 'refresh' from 1 events
  Apr 12 21:50:32 controller-1 os-collect-config[2860]: Notice: /Stage[main]/Neutron::Agents::Dhcp/Service[neutron-dhcp-service]: Triggered 'refresh' from 1 events
  Apr 12 21:50:32 controller-1 os-collect-config[2860]: Notice: /Stage[main]/Neutron::Agents::L3/Service[neutron-l3]: Triggered 'refresh' from 1 events
  Apr 12 21:50:32 controller-1 os-collect-config[2860]: Notice: /Stage[main]/Neutron::Agents::Metadata/Service[neutron-metadata]: Triggered 'refresh' from 1 events
  Apr 12 21:50:32 controller-1 os-collect-config[2860]: Notice: /Stage[main]/Neutron::Deps/Anchor[neutron::service::end]: Triggered 'refresh' from 5 events

but in all three servers we can see:

  2018-04-12 21:54:30.669 82738 ERROR neutron.db.agentschedulers_db [req-67cd6a0d-ca1c-42c7-9656-dd30bd335979 - - - - -] Exception encountered during router rescheduling.                                           
  2018-04-12 21:54:30.669 82738 ERROR neutron.db.agentschedulers_db Traceback (most recent call last):     
  2018-04-12 21:54:30.669 82738 ERROR neutron.db.agentschedulers_db File "/usr/lib/python2.7/site-packages/neutron/db/agentschedulers_db.py", line 215, in reschedule_resources_from_down_agents 

which corresponds to:

        context = ncontext.get_admin_context()
        try:
            down_bindings = get_down_bindings(context, agent_dead_limit)

            agents_back_online = set()
            for binding in down_bindings:
                binding_agent_id = getattr(binding, agent_id_attr)
                binding_resource_id = getattr(binding, resource_id_attr)
                if binding_agent_id in agents_back_online:

it fails because:
2018-04-12 21:54:30.669 82738 ERROR neutron.db.agentschedulers_db DBConnectionError: (pymysql.err.OperationalError) 2003, "Can't connect to MySQL server on '172.17.1.11' ([Errno 111] ECONNREFUSED)"

It must be related to the pacemaker service being restarted at that time.

So it seems that upon restart it tries to do the right thing, fails,
and doesn't try again.

Restarting all the neutron-servers post upgrade doesn't reschedule the
router on the new l3 agents either.  It keeps saying:


  Checking if agent starts up and giving it additional 0:00:00 agent_starting_up /usr/lib/python2.7/site-packages/neutron/db/agentschedulers_db.py:309

and does nothing even though it detects there is an issue:

    WARNING neutron.db.agents_db [req-d7916013-93df-44c0-9d58-2af3cdcd26f4 - - - - -] Agent healthcheck: found 12 dead agents out of 25:
                Type       Last heartbeat host
      Metadata agent  2018-04-12 21:46:33 controller-1
  Open vSwitch agent  2018-04-12 21:45:25 controller-2
  Open vSwitch agent  2018-04-12 21:46:54 controller-0
          DHCP agent  2018-04-12 21:46:42 controller-0
          DHCP agent  2018-04-12 21:46:29 controller-1
      Metadata agent  2018-04-12 21:45:19 controller-2
  Open vSwitch agent  2018-04-12 21:46:13 controller-1
            L3 agent  2018-04-12 21:46:38 controller-1
      Metadata agent  2018-04-12 21:47:06 controller-0
            L3 agent  2018-04-12 21:45:17 controller-2
          DHCP agent  2018-04-12 21:45:15 controller-2
            L3 agent  2018-04-12 21:46:39 controller-0

Comment 7 Sofer Athlan-Guyot 2018-04-13 08:54:30 UTC
Adding that bug as it seems related.

Comment 8 Miguel Angel Ajo 2018-04-13 10:12:49 UTC
Created attachment 1421308 [details]
Workaround to get l3ha routers rescheduled

Comment 9 Miguel Angel Ajo 2018-04-13 10:14:09 UTC
The automatic router failover mechanism only works for non-l3ha routers. Using the attached script, it's possible to force neutron to clean the l3ha schedulings and schedule the routers to the new hosts.

Comment 10 Miguel Angel Ajo 2018-04-13 10:17:23 UTC
The host= parameter should not be changed.

If this happened by admin intervention, that should not be done.

If this happened because the upgrade mechanism did that, there was a bug related to this, and I believe it was being addressed.

Comment 11 Miguel Angel Ajo 2018-04-13 10:34:53 UTC
Created attachment 1421315 [details]
Script to cleanup dead agents (on the wrong host id)

Comment 12 Sofer Athlan-Guyot 2018-04-13 11:39:11 UTC
(In reply to Miguel Angel Ajo from comment #10)
Thanks a lot ajo for the workaround here.

For this to work we need the review https://review.openstack.org/560855 applied before the converge step.  If that's not the case, then you have to manually set:

  allow_automatic_l3agent_failover=True

in neutron.conf of each controller and restart the neutron-server.

Even with the patch applied, floating IP (fip) reachability is cut after the converge.  This is how you can bring everything back to a working state:

   ssh undercloud
   . overcloudrc
   curl -o reschedule-l3-routers.sh 'https://bugzilla.redhat.com/attachment.cgi?id=1421308'
   bash -x ./reschedule-l3-routers.sh

After a little while (between one and two minutes) everything should come back alive.

One can check with a ping test. Checking the state of a particular router is done like this:

  ssh undercloud
  . overcloudrc

  neutron router-list
  # pick one and then:
  neutron l3-agent-list-hosting-router 903195f0-c361-46a4-8b71-9a9b9bde572c
+--------------------------------------+--------------------------+----------------+-------+----------+
| id                                   | host                     | admin_state_up | alive | ha_state |
+--------------------------------------+--------------------------+----------------+-------+----------+
| 54ffd13f-ab05-4b5f-a884-a5016dcdd512 | controller-1.localdomain | True           | :-)   | standby  |
| 17311ec7-2db0-440d-922d-06bc633cc2a8 | controller-2.localdomain | True           | :-)   | standby  |
| 3174da98-564f-4449-a2c3-704d799f6558 | controller-0.localdomain | True           | :-)   | active   |
+--------------------------------------+--------------------------+----------------+-------+----------+

You may see all three in standby at first; not to worry, one of them will come back to active, and during that time the ping (and everything else) should work.
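
A minimal sketch, assuming the five-column output of `neutron l3-agent-list-hosting-router` shown in this report, of how the check described above could be scripted. The function names and the status labels ("stale", "settling", "ok", "unexpected") are hypothetical.

```python
def hosting_agents(table_text):
    """Parse the hosting-router table into dicts; border and header lines are skipped."""
    rows = []
    for line in table_text.splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) != 5 or cells[0] == "id":
            continue
        rows.append({"host": cells[1], "alive": cells[3] == ":-)",
                     "ha_state": cells[4]})
    return rows


def router_status(rows):
    """Classify a router from the agents currently hosting it."""
    if not rows or not all(row["alive"] for row in rows):
        return "stale"        # still bound to at least one dead agent
    active = sum(1 for row in rows if row["ha_state"] == "active")
    if active == 0:
        return "settling"     # all standby, as can happen right after rescheduling
    if active == 1:
        return "ok"
    return "unexpected"       # more than one active agent


# Rows taken from the two listings in this report.
BEFORE = """
| e5267f02-5b5f-44ec-ab0d-ae0c2fa42b6f | controller-0 | True | xxx | standby |
| c17f7b3a-22c4-4b5c-ba66-ad5ab85bd1ee | controller-1 | True | xxx | active  |
"""
AFTER = """
| 54ffd13f-ab05-4b5f-a884-a5016dcdd512 | controller-1.localdomain | True | :-) | standby |
| 3174da98-564f-4449-a2c3-704d799f6558 | controller-0.localdomain | True | :-) | active  |
"""
print(router_status(hosting_agents(BEFORE)), router_status(hosting_agents(AFTER)))
```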


When everything has settled, you can clean up the dead l3 agents:

  ssh undercloud
  . overcloudrc

  curl -o cleanup-non-alive-agents.sh 'https://bugzilla.redhat.com/attachment.cgi?id=1421315'
  bash -x ./cleanup-non-alive-agents.sh

Comment 13 Sofer Athlan-Guyot 2018-04-13 11:40:02 UTC
(In reply to Miguel Angel Ajo from comment #10)
> The host= parameter should not be changed.
> 
> If this happened by admin intervention, that should not be done.
> 
> If this happened because the upgrade mechanism did that, there was a bug
> related to this, and I beleive it was being addressed.

So this is happening because we changed how that parameter is set during deployment.

In OSP9 and before, that parameter was unset, so it got whatever socket.gethostname was returning.  That means that changes in /etc/hosts, dhcp, cloud-init, and certainly other things could make that function return either the hostname or the fqdn.

But it seems to have been pretty consistent in returning the hostname with rhel-7.5.

In OSP10 we explicitly set that parameter to whatever "facter fqdn" is returning, which is most of the time an fqdn (here again some misconfiguration of /etc/hosts and so on could change that).
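
A minimal sketch of the difference described above. `socket.gethostname()` is the fallback named in this comment; `socket.getfqdn()` is Python's own lookup and is used here only as an approximation of what "facter fqdn" reports. The helper names are hypothetical.

```python
import socket


def short_name(host):
    """Return the part of a host name before the first dot."""
    return host.split(".", 1)[0]


def host_value_changed(previous, current):
    """True when the host string differs although both values name the same node."""
    return previous != current and short_name(previous) == short_name(current)


# What an unset host= falls back to, according to the comment above.
implicit_host = socket.gethostname()
# Python's FQDN lookup; an assumption standing in for "facter fqdn".
python_fqdn = socket.getfqdn()

print(implicit_host, python_fqdn, host_value_changed(implicit_host, python_fqdn))
print(host_value_changed("controller-0", "controller-0.localdomain"))
```

When the check returns True for a node, agents registered under the old string and agents registered under the new one are treated as different hosts, which matches the duplicated agent entries seen earlier in this report.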

So no admin intervention and no upgrade mechanism, and yes, that parameter should never change.  That's why the final fix should be to make sure that puppet never changes it.  There is a WIP to implement that [1].

[1] https://review.openstack.org/#/c/561079/1

Comment 14 Sofer Athlan-Guyot 2018-04-18 13:54:37 UTC
Adding current master review ... should clone this bz to all releases.

Comment 15 Sofer Athlan-Guyot 2018-04-26 13:30:55 UTC
The last review is not strictly necessary for newton.

Comment 19 Homero Pawlowski 2018-05-29 20:58:54 UTC
We're on the "z8" release of RHOSP10.

Stack updates currently fail with:

Could not retrieve fact='current_nova_host', resolution='<anonymous>': uninitialized constant Tempfile

[heat-admin@compute-0 ~]$ rpm -qa | grep tripleo
puppet-tripleo-5.6.8-6.el7ost.noarch

Looks like we're missing patch: https://review.openstack.org/568552 (which is included in the tracker for this BZ) 

The patch needs to be pushed out ASAP as all stack updates will fail without it.

Comment 20 Homero Pawlowski 2018-05-29 21:00:25 UTC
Confirmed that stack update succeeds after manually applying https://review.openstack.org/568552 to all Overcloud nodes.

Comment 25 errata-xmlrpc 2018-06-27 23:30:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2101

