Description of problem:

OSP9 -> OSP10: workloads created before the upgrade are no longer reachable after rebooting the controller nodes.

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-5.3.0-6.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy latest OSP9
2. Launch workloads
3. Upgrade to OSP10

Actual results:
Workloads are not reachable anymore.

Expected results:
Workloads are reachable.

Additional info:

It looks like in OSP10 the services got the domain name appended, while it was not there in OSP9:

+--------------------------------------+------------------------+----------------+-------+----------+
| id                                   | host                   | admin_state_up | alive | ha_state |
+--------------------------------------+------------------------+----------------+-------+----------+
| ec13520a-dcc2-4b34-bbfc-4a6c76466379 | overcloud-controller-2 | True           | xxx   | standby  |
| 87172b40-265c-4b24-a44f-ae7c5f2bb116 | overcloud-controller-0 | True           | xxx   | standby  |
| d6af24ef-2b49-4477-923d-b29bc7e13e86 | overcloud-controller-1 | True           | xxx   | active   |
+--------------------------------------+------------------------+----------------+-------+----------+

[stack@undercloud-0 ~]$ neutron l3-agent-list-hosting-router 19a4da15-3135-4099-861f-8a9b34815f56
/usr/lib/python2.7/site-packages/requests/packages/urllib3/connection.py:303: SubjectAltNameWarning: C

neutron agent-list
+--------------------------------------+--------------------+------------------------------------+-------------------+-------+----------------+---------------------------+
| id                                   | agent_type         | host                               | availability_zone | alive | admin_state_up | binary                    |
+--------------------------------------+--------------------+------------------------------------+-------------------+-------+----------------+---------------------------+
| 0d87b463-d27c-4b90-b43c-4420d367a0bb | Open vSwitch agent | overcloud-controller-2.localdomain |                   | :-)   | True           | neutron-openvswitch-agent |
| 1f21aad1-0688-41df-94fc-afffbc6ad639 | Metadata agent     | overcloud-controller-1             |                   | xxx   | True           | neutron-metadata-agent    |
| 22c57edf-8015-4172-acdd-5ca30fe9d2fd | L3 agent           | overcloud-controller-1.localdomain | nova              | :-)   | True           | neutron-l3-agent          |
| 2552feaf-1429-495b-a25f-1a492e5a6668 | Metadata agent     | overcloud-controller-2             |                   | xxx   | True           | neutron-metadata-agent    |
| 340db352-474c-4a31-a62a-e9a0f4406bd1 | DHCP agent         | overcloud-controller-0             | nova              | xxx   | True           | neutron-dhcp-agent        |
| 53b444b0-abbc-4825-b1ad-8622c77aa36e | L3 agent           | overcloud-controller-0.localdomain | nova              | :-)   | True           | neutron-l3-agent          |
| 54caa3f7-53ec-4f27-9252-a774b78c06c9 | Open vSwitch agent | overcloud-controller-1.localdomain |                   | :-)   | True           | neutron-openvswitch-agent |
| 6caf3e06-c3f5-4e50-99fa-c4f6ae4bdbb5 | DHCP agent         | overcloud-controller-2             | nova              | xxx   | True           | neutron-dhcp-agent        |
| 6e362eb0-678b-434f-b2bd-746107610114 | DHCP agent         | overcloud-controller-1.localdomain | nova              | :-)   | True           | neutron-dhcp-agent        |
| 82286c37-7c71-446a-a4b1-73647834944f | Metadata agent     | overcloud-controller-0             |                   | xxx   | True           | neutron-metadata-agent    |
| 83117afc-c8f7-4b5d-b9d5-859f960c677c | Metadata agent     | overcloud-controller-0.localdomain |                   | :-)   | True           | neutron-metadata-agent    |
| 844ef54d-69db-4728-b203-869136ef4368 | Open vSwitch agent | overcloud-controller-1             |                   | xxx   | True           | neutron-openvswitch-agent |
| 84c43451-4890-4448-a746-f4cab94cc767 | Open vSwitch agent | overcloud-controller-2             |                   | xxx   | True           | neutron-openvswitch-agent |
| 85330667-84b7-4bbf-93be-9dadd0736eea | Open vSwitch agent | overcloud-controller-0.localdomain |                   | :-)   | True           | neutron-openvswitch-agent |
| 87172b40-265c-4b24-a44f-ae7c5f2bb116 | L3 agent           | overcloud-controller-0             | nova              | xxx   | True           | neutron-l3-agent          |
| 88ea82ef-22e8-46dd-850b-5f34efd83bf5 | Metadata agent     | overcloud-controller-2.localdomain |                   | :-)   | True           | neutron-metadata-agent    |
| 8b30b03a-9c32-4c03-bf44-2ac1fd4492fe | DHCP agent         | overcloud-controller-0.localdomain | nova              | :-)   | True           | neutron-dhcp-agent        |
| c448cc63-29d8-4a41-a71d-97e499958aef | Metadata agent     | overcloud-controller-1.localdomain |                   | :-)   | True           | neutron-metadata-agent    |
| d17095af-7799-4024-abbf-b7c01efee452 | DHCP agent         | overcloud-controller-1             | nova              | xxx   | True           | neutron-dhcp-agent        |
| d1f664e1-6539-41a3-9686-1e828b9258af | Open vSwitch agent | overcloud-controller-0             |                   | xxx   | True           | neutron-openvswitch-agent |
| d5f26fbc-02ab-4866-945c-c798e80de94f | L3 agent           | overcloud-controller-2.localdomain | nova              | :-)   | True           | neutron-l3-agent          |
| d6af24ef-2b49-4477-923d-b29bc7e13e86 | L3 agent           | overcloud-controller-1             | nova              | xxx   | True           | neutron-l3-agent          |
| d71fbc5a-a3da-4eb2-bf76-f06c6130c895 | DHCP agent         | overcloud-controller-2.localdomain | nova              | :-)   | True           | neutron-dhcp-agent        |
| de1c1c12-3e2f-4ebf-9daa-f2b0b3eb3b38 | Open vSwitch agent | overcloud-compute-1.localdomain    |                   | :-)   | True           | neutron-openvswitch-agent |
| ec13520a-dcc2-4b34-bbfc-4a6c76466379 | L3 agent           | overcloud-controller-2             | nova              | xxx   | True           | neutron-l3-agent          |
| f24abfbf-3c42-45cf-9d39-d2eb11feb6e9 | Open vSwitch agent | overcloud-compute-0.localdomain    |                   | :-)   | True           | neutron-openvswitch-agent |
+--------------------------------------+--------------------+------------------------------------+-------------------+-------+----------------+---------------------------+

[stack@undercloud-0 ~]$ openstack compute service list
+-----+------------------+------------------------------------+----------+---------+-------+----------------------------+
| ID  | Binary           | Host                               | Zone     | Status  | State | Updated At                 |
+-----+------------------+------------------------------------+----------+---------+-------+----------------------------+
| 2   | nova-scheduler   | overcloud-controller-0             | internal | enabled | down  | 2017-10-05T23:49:45.000000 |
| 5   | nova-scheduler   | overcloud-controller-2             | internal | enabled | down  | 2017-10-05T23:48:23.000000 |
| 8   | nova-scheduler   | overcloud-controller-1             | internal | enabled | down  | 2017-10-05T23:48:21.000000 |
| 68  | nova-consoleauth | overcloud-controller-2             | internal | enabled | down  | 2017-10-05T23:48:08.000000 |
| 71  | nova-consoleauth | overcloud-controller-1             | internal | enabled | down  | 2017-10-05T23:48:14.000000 |
| 74  | nova-consoleauth | overcloud-controller-0             | internal | enabled | down  | 2017-10-05T23:48:30.000000 |
| 77  | nova-conductor   | overcloud-controller-1             | internal | enabled | down  | 2017-10-05T23:48:30.000000 |
| 86  | nova-conductor   | overcloud-controller-2             | internal | enabled | down  | 2017-10-05T23:48:31.000000 |
| 98  | nova-conductor   | overcloud-controller-0             | internal | enabled | down  | 2017-10-05T23:49:55.000000 |
| 101 | nova-compute     | overcloud-compute-1.localdomain    | nova     | enabled | up    | 2017-10-06T11:00:10.000000 |
| 104 | nova-compute     | overcloud-compute-0.localdomain    | nova     | enabled | up    | 2017-10-06T11:00:06.000000 |
| 105 | nova-consoleauth | overcloud-controller-2.localdomain | internal | enabled | up    | 2017-10-06T11:00:11.000000 |
| 108 | nova-consoleauth | overcloud-controller-1.localdomain | internal | enabled | up    | 2017-10-06T11:00:05.000000 |
| 111 | nova-scheduler   | overcloud-controller-1.localdomain | internal | enabled | up    | 2017-10-06T11:00:10.000000 |
| 114 | nova-scheduler   | overcloud-controller-2.localdomain | internal | enabled | up    | 2017-10-06T11:00:05.000000 |
| 117 | nova-conductor   | overcloud-controller-1.localdomain | internal | enabled | up    | 2017-10-06T11:00:09.000000 |
| 123 | nova-consoleauth | overcloud-controller-0.localdomain | internal | enabled | up    | 2017-10-06T11:00:06.000000 |
| 126 | nova-conductor   | overcloud-controller-2.localdomain | internal | enabled | up    | 2017-10-06T11:00:10.000000 |
| 129 | nova-scheduler   | overcloud-controller-0.localdomain | internal | enabled | up    | 2017-10-06T11:00:04.000000 |
| 132 | nova-conductor   | overcloud-controller-0.localdomain | internal | enabled | up    | 2017-10-06T11:00:06.000000 |
+-----+------------------+------------------------------------+----------+---------+-------+----------------------------+
So the host parameter was added in neutron.conf:

+host=overcloud-controller-0.localdomain

and in nova.conf:

+host=overcloud-controller-0.localdomain

I guess that before, the default was used, which was the short hostname, not the FQDN. We basically changed all the service definitions.

The first side effect found is that routers created before this change become unreachable after a reboot of the controllers. Their L3 agents are down:

[stack@undercloud-0 ~]$ neutron l3-agent-list-hosting-router a281c931-d8f2-4a5f-9991-7594bf408cf9
+--------------------------------------+------------------------+----------------+-------+----------+
| id                                   | host                   | admin_state_up | alive | ha_state |
+--------------------------------------+------------------------+----------------+-------+----------+
| ec13520a-dcc2-4b34-bbfc-4a6c76466379 | overcloud-controller-2 | True           | xxx   | standby  |
| 87172b40-265c-4b24-a44f-ae7c5f2bb116 | overcloud-controller-0 | True           | xxx   | standby  |
| d6af24ef-2b49-4477-923d-b29bc7e13e86 | overcloud-controller-1 | True           | xxx   | active   |
+--------------------------------------+------------------------+----------------+-------+----------+

as they are still associated with the old host parameters.

====> This basically makes the floating IPs unreachable, which is bad.

Note that it seems you need to reboot the controllers for the problem to occur; waiting on Marius to confirm this point.
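The mechanics behind this can be sketched with a toy model (hypothetical data structures, not the real Neutron schema): agents are effectively identified by their reported "host" string, so when that string changes from the short name to the FQDN, a brand-new agent row appears while the old row, which still owns the router bindings, goes dead.

```python
# Toy model: Neutron's agent table is effectively keyed on (host, agent_type).
agents = {}

def report_state(host, agent_type):
    # An agent reporting in under a new host string creates a new row
    # instead of updating the existing one.
    agents[(host, agent_type)] = {"alive": True}

def mark_stale(current_hosts):
    # Agents whose host string no longer reports in show up as dead (xxx).
    for (host, agent_type), row in agents.items():
        row["alive"] = host in current_hosts

# Before the upgrade the agent reported the short hostname...
report_state("overcloud-controller-0", "L3 agent")
# ...after the upgrade it reports the FQDN: a second, distinct row appears.
report_state("overcloud-controller-0.localdomain", "L3 agent")
mark_stale({"overcloud-controller-0.localdomain"})

print(agents[("overcloud-controller-0", "L3 agent")]["alive"])              # False
print(agents[("overcloud-controller-0.localdomain", "L3 agent")]["alive"])  # True
```

This matches the agent-list output above: one dead and one alive entry per controller and agent type, with only the host column differing.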
So, to summarize: before OSP10, nova and neutron defaulted to socket.gethostname(). From OSP10 on, we explicitly set this value to the FQDN [1].

It appears that OSP9/RHEL 7.4 is configured in such a way that socket.gethostname() (correctly) returns controller-X. But we found one build where OSP9 returned the FQDN controller-X.localdomain. That build was based on OSP9/RHEL 7.3. As it is, when we log in to such an env, the hostname command (wrongly) returns the FQDN.

The CloudDomain variable on the undercloud is empty in all environments (working and non-working), so it doesn't seem relevant.

TL;DR: An easy way to check this problem is to run these commands:

[stack@undercloud-0 ~]$ neutron agent-list
+--------------------------------------+--------------------+------------------------------------+-------------------+-------+----------------+---------------------------+
| id                                   | agent_type         | host                               | availability_zone | alive | admin_state_up | binary                    |
+--------------------------------------+--------------------+------------------------------------+-------------------+-------+----------------+---------------------------+
| 0d87b463-d27c-4b90-b43c-4420d367a0bb | Open vSwitch agent | overcloud-controller-2.localdomain |                   | :-)   | True           | neutron-openvswitch-agent |
| 1f21aad1-0688-41df-94fc-afffbc6ad639 | Metadata agent     | overcloud-controller-1             |                   | xxx   | True           | neutron-metadata-agent    |
| 22c57edf-8015-4172-acdd-5ca30fe9d2fd | L3 agent           | overcloud-controller-1.localdomain | nova              | :-)   | True           | neutron-l3-agent          |
| 2552feaf-1429-495b-a25f-1a492e5a6668 | Metadata agent     | overcloud-controller-2             |                   | xxx   | True           | neutron-metadata-agent    |
| 340db352-474c-4a31-a62a-e9a0f4406bd1 | DHCP agent         | overcloud-controller-0             | nova              | xxx   | True           | neutron-dhcp-agent        |
| 53b444b0-abbc-4825-b1ad-8622c77aa36e | L3 agent           | overcloud-controller-0.localdomain | nova              | :-)   | True           | neutron-l3-agent          |
| 54caa3f7-53ec-4f27-9252-a774b78c06c9 | Open vSwitch agent | overcloud-controller-1.localdomain |                   | :-)   | True           | neutron-openvswitch-agent |
| 6caf3e06-c3f5-4e50-99fa-c4f6ae4bdbb5 | DHCP agent         | overcloud-controller-2             | nova              | xxx   | True           | neutron-dhcp-agent        |
| 6e362eb0-678b-434f-b2bd-746107610114 | DHCP agent         | overcloud-controller-1.localdomain | nova              | :-)   | True           | neutron-dhcp-agent        |
| 82286c37-7c71-446a-a4b1-73647834944f | Metadata agent     | overcloud-controller-0             |                   | xxx   | True           | neutron-metadata-agent    |
| 83117afc-c8f7-4b5d-b9d5-859f960c677c | Metadata agent     | overcloud-controller-0.localdomain |                   | :-)   | True           | neutron-metadata-agent    |
| 844ef54d-69db-4728-b203-869136ef4368 | Open vSwitch agent | overcloud-controller-1             |                   | xxx   | True           | neutron-openvswitch-agent |
| 84c43451-4890-4448-a746-f4cab94cc767 | Open vSwitch agent | overcloud-controller-2             |                   | xxx   | True           | neutron-openvswitch-agent |
| 85330667-84b7-4bbf-93be-9dadd0736eea | Open vSwitch agent | overcloud-controller-0.localdomain |                   | :-)   | True           | neutron-openvswitch-agent |
| 87172b40-265c-4b24-a44f-ae7c5f2bb116 | L3 agent           | overcloud-controller-0             | nova              | xxx   | True           | neutron-l3-agent          |
| 88ea82ef-22e8-46dd-850b-5f34efd83bf5 | Metadata agent     | overcloud-controller-2.localdomain |                   | :-)   | True           | neutron-metadata-agent    |
| 8b30b03a-9c32-4c03-bf44-2ac1fd4492fe | DHCP agent         | overcloud-controller-0.localdomain | nova              | :-)   | True           | neutron-dhcp-agent        |
| c448cc63-29d8-4a41-a71d-97e499958aef | Metadata agent     | overcloud-controller-1.localdomain |                   | :-)   | True           | neutron-metadata-agent    |
| d17095af-7799-4024-abbf-b7c01efee452 | DHCP agent         | overcloud-controller-1             | nova              | xxx   | True           | neutron-dhcp-agent        |
| d1f664e1-6539-41a3-9686-1e828b9258af | Open vSwitch agent | overcloud-controller-0             |                   | xxx   | True           | neutron-openvswitch-agent |
| d5f26fbc-02ab-4866-945c-c798e80de94f | L3 agent           | overcloud-controller-2.localdomain | nova              | :-)   | True           | neutron-l3-agent          |
| d6af24ef-2b49-4477-923d-b29bc7e13e86 | L3 agent           | overcloud-controller-1             | nova              | xxx   | True           | neutron-l3-agent          |
| d71fbc5a-a3da-4eb2-bf76-f06c6130c895 | DHCP agent         | overcloud-controller-2.localdomain | nova              | :-)   | True           | neutron-dhcp-agent        |
| de1c1c12-3e2f-4ebf-9daa-f2b0b3eb3b38 | Open vSwitch agent | overcloud-compute-1.localdomain    |                   | :-)   | True           | neutron-openvswitch-agent |
| ec13520a-dcc2-4b34-bbfc-4a6c76466379 | L3 agent           | overcloud-controller-2             | nova              | xxx   | True           | neutron-l3-agent          |
| f24abfbf-3c42-45cf-9d39-d2eb11feb6e9 | Open vSwitch agent | overcloud-compute-0.localdomain    |                   | :-)   | True           | neutron-openvswitch-agent |
+--------------------------------------+--------------------+------------------------------------+-------------------+-------+----------------+---------------------------+

[stack@undercloud-0 ~]$ openstack compute service list
+-----+------------------+------------------------------------+----------+---------+-------+----------------------------+
| ID  | Binary           | Host                               | Zone     | Status  | State | Updated At                 |
+-----+------------------+------------------------------------+----------+---------+-------+----------------------------+
| 2   | nova-scheduler   | overcloud-controller-0             | internal | enabled | down  | 2017-10-05T23:49:45.000000 |
| 5   | nova-scheduler   | overcloud-controller-2             | internal | enabled | down  | 2017-10-05T23:48:23.000000 |
| 8   | nova-scheduler   | overcloud-controller-1             | internal | enabled | down  | 2017-10-05T23:48:21.000000 |
| 68  | nova-consoleauth | overcloud-controller-2             | internal | enabled | down  | 2017-10-05T23:48:08.000000 |
| 71  | nova-consoleauth | overcloud-controller-1             | internal | enabled | down  | 2017-10-05T23:48:14.000000 |
| 74  | nova-consoleauth | overcloud-controller-0             | internal | enabled | down  | 2017-10-05T23:48:30.000000 |
| 77  | nova-conductor   | overcloud-controller-1             | internal | enabled | down  | 2017-10-05T23:48:30.000000 |
| 86  | nova-conductor   | overcloud-controller-2             | internal | enabled | down  | 2017-10-05T23:48:31.000000 |
| 98  | nova-conductor   | overcloud-controller-0             | internal | enabled | down  | 2017-10-05T23:49:55.000000 |
| 101 | nova-compute     | overcloud-compute-1.localdomain    | nova     | enabled | up    | 2017-10-06T11:00:10.000000 |
| 104 | nova-compute     | overcloud-compute-0.localdomain    | nova     | enabled | up    | 2017-10-06T11:00:06.000000 |
| 105 | nova-consoleauth | overcloud-controller-2.localdomain | internal | enabled | up    | 2017-10-06T11:00:11.000000 |
| 108 | nova-consoleauth | overcloud-controller-1.localdomain | internal | enabled | up    | 2017-10-06T11:00:05.000000 |
| 111 | nova-scheduler   | overcloud-controller-1.localdomain | internal | enabled | up    | 2017-10-06T11:00:10.000000 |
| 114 | nova-scheduler   | overcloud-controller-2.localdomain | internal | enabled | up    | 2017-10-06T11:00:05.000000 |
| 117 | nova-conductor   | overcloud-controller-1.localdomain | internal | enabled | up    | 2017-10-06T11:00:09.000000 |
| 123 | nova-consoleauth | overcloud-controller-0.localdomain | internal | enabled | up    | 2017-10-06T11:00:06.000000 |
| 126 | nova-conductor   | overcloud-controller-2.localdomain | internal | enabled | up    | 2017-10-06T11:00:10.000000 |
| 129 | nova-scheduler   | overcloud-controller-0.localdomain | internal | enabled | up    | 2017-10-06T11:00:04.000000 |
| 132 | nova-conductor   | overcloud-controller-0.localdomain | internal | enabled | up    | 2017-10-06T11:00:06.000000 |
+-----+------------------+------------------------------------+----------+---------+-------+----------------------------+

[1] https://github.com/openstack/puppet-neutron/commit/c93d5a342d50d820f3922a97b3224be2e9747472 and https://github.com/openstack/tripleo-heat-templates/commit/056ce2374851e4e96dd3fd822de9da76b35e1eb7
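The symptom in both listings above is the same: the same short hostname appears twice, once bare and once with the domain suffix. A small helper (illustrative only, not part of any client library) can flag affected hosts from the host column of that output:

```python
def split_hosts(hosts):
    """Return short hostnames that are registered both bare and as an FQDN.

    `hosts` is the host column of `neutron agent-list` or
    `openstack compute service list` output.
    """
    shorts = {h for h in hosts if "." not in h}
    fqdns = {h.split(".", 1)[0] for h in hosts if "." in h}
    return sorted(shorts & fqdns)

hosts = [
    "overcloud-controller-0",
    "overcloud-controller-0.localdomain",
    "overcloud-compute-0.localdomain",
]
print(split_hosts(hosts))  # ['overcloud-controller-0']
```

An empty result means no host is split between a short and a fully-qualified registration, i.e. the environment does not exhibit this bug.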
Minor update of an OSP9 2017-06-01.1 build (overcloud nodes on RHEL 7.3) to latest OSP9, RHEL 7.4.

After the minor update completed:

[stack@undercloud-0 ~]$ nova service-list
/usr/lib/python2.7/site-packages/keyring/backends/Gnome.py:6: PyGIWarning: GnomeKeyring was imported without specifying a version first. Use gi.require_version('GnomeKeyring', '1.0') before import to ensure that the right version gets loaded.
  from gi.repository import GnomeKeyring
+----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+
| Id | Binary           | Host                     | Zone     | Status  | State | Updated_at                 | Disabled Reason |
+----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+
| 1  | nova-scheduler   | controller-0.localdomain | internal | enabled | up    | 2017-10-06T17:40:18.000000 | -               |
| 4  | nova-consoleauth | controller-0.localdomain | internal | enabled | up    | 2017-10-06T17:40:15.000000 | -               |
| 5  | nova-conductor   | controller-0.localdomain | internal | enabled | up    | 2017-10-06T17:40:21.000000 | -               |
| 6  | nova-compute     | compute-0.localdomain    | nova     | enabled | up    | 2017-10-06T17:40:15.000000 | -               |
+----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+

[stack@undercloud-0 ~]$ neutron agent-list
+--------------------------------------+--------------------+--------------------------+-------------------+-------+----------------+---------------------------+
| id                                   | agent_type         | host                     | availability_zone | alive | admin_state_up | binary                    |
+--------------------------------------+--------------------+--------------------------+-------------------+-------+----------------+---------------------------+
| 190fde25-fcdb-4b22-aaa7-e1cb55444914 | Metadata agent     | controller-0.localdomain |                   | :-)   | True           | neutron-metadata-agent    |
| 50ac64d2-a759-47d8-a525-2d766cbeae04 | Open vSwitch agent | compute-0.localdomain    |                   | :-)   | True           | neutron-openvswitch-agent |
| e7903a51-c80d-40b7-b9c1-6c7d66b46619 | L3 agent           | controller-0.localdomain | nova              | :-)   | True           | neutron-l3-agent          |
| fad33d95-13d6-4bec-bfe4-97d054bedf89 | DHCP agent         | controller-0.localdomain | nova              | :-)   | True           | neutron-dhcp-agent        |
| ffcbef93-5f8f-4822-84de-05488df6bb0b | Open vSwitch agent | controller-0.localdomain |                   | :-)   | True           | neutron-openvswitch-agent |
+--------------------------------------+--------------------+--------------------------+-------------------+-------+----------------+---------------------------+

After reboot:

[stack@undercloud-0 ~]$ nova service-list
+----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+
| Id | Binary           | Host                     | Zone     | Status  | State | Updated_at                 | Disabled Reason |
+----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+
| 1  | nova-scheduler   | controller-0.localdomain | internal | enabled | up    | 2017-10-06T17:46:37.000000 | -               |
| 4  | nova-consoleauth | controller-0.localdomain | internal | enabled | up    | 2017-10-06T17:46:38.000000 | -               |
| 5  | nova-conductor   | controller-0.localdomain | internal | enabled | up    | 2017-10-06T17:46:30.000000 | -               |
| 6  | nova-compute     | compute-0.localdomain    | nova     | enabled | up    | 2017-10-06T17:46:38.000000 | -               |
+----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+

[stack@undercloud-0 ~]$ neutron agent-list
+--------------------------------------+--------------------+--------------------------+-------------------+-------+----------------+---------------------------+
| id                                   | agent_type         | host                     | availability_zone | alive | admin_state_up | binary                    |
+--------------------------------------+--------------------+--------------------------+-------------------+-------+----------------+---------------------------+
| 190fde25-fcdb-4b22-aaa7-e1cb55444914 | Metadata agent     | controller-0.localdomain |                   | :-)   | True           | neutron-metadata-agent    |
| 50ac64d2-a759-47d8-a525-2d766cbeae04 | Open vSwitch agent | compute-0.localdomain    |                   | :-)   | True           | neutron-openvswitch-agent |
| e7903a51-c80d-40b7-b9c1-6c7d66b46619 | L3 agent           | controller-0.localdomain | nova              | :-)   | True           | neutron-l3-agent          |
| fad33d95-13d6-4bec-bfe4-97d054bedf89 | DHCP agent         | controller-0.localdomain | nova              | :-)   | True           | neutron-dhcp-agent        |
| ffcbef93-5f8f-4822-84de-05488df6bb0b | Open vSwitch agent | controller-0.localdomain |                   | :-)   | True           | neutron-openvswitch-agent |
+--------------------------------------+--------------------+--------------------------+-------------------+-------+----------------+---------------------------+

But if we rerun the overcloud deploy command:

[stack@undercloud-0 ~]$ nova service-list
+----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+
| Id | Binary           | Host                     | Zone     | Status  | State | Updated_at                 | Disabled Reason |
+----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+
| 1  | nova-scheduler   | controller-0.localdomain | internal | enabled | up    | 2017-10-06T18:04:51.000000 | -               |
| 4  | nova-consoleauth | controller-0.localdomain | internal | enabled | up    | 2017-10-06T18:04:53.000000 | -               |
| 5  | nova-conductor   | controller-0.localdomain | internal | enabled | up    | 2017-10-06T18:04:56.000000 | -               |
| 6  | nova-compute     | compute-0.localdomain    | nova     | enabled | up    | 2017-10-06T18:04:58.000000 | -               |
+----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+

[stack@undercloud-0 ~]$ neutron service-list
Unknown command [u'service-list']

[stack@undercloud-0 ~]$ neutron agent-list
+--------------------------------------+--------------------+--------------------------+-------------------+-------+----------------+---------------------------+
| id                                   | agent_type         | host                     | availability_zone | alive | admin_state_up | binary                    |
+--------------------------------------+--------------------+--------------------------+-------------------+-------+----------------+---------------------------+
| 190fde25-fcdb-4b22-aaa7-e1cb55444914 | Metadata agent     | controller-0.localdomain |                   | :-)   | True           | neutron-metadata-agent    |
| 50ac64d2-a759-47d8-a525-2d766cbeae04 | Open vSwitch agent | compute-0.localdomain    |                   | :-)   | True           | neutron-openvswitch-agent |
| e7903a51-c80d-40b7-b9c1-6c7d66b46619 | L3 agent           | controller-0.localdomain | nova              | :-)   | True           | neutron-l3-agent          |
| fad33d95-13d6-4bec-bfe4-97d054bedf89 | DHCP agent         | controller-0.localdomain | nova              | :-)   | True           | neutron-dhcp-agent        |
| ffcbef93-5f8f-4822-84de-05488df6bb0b | Open vSwitch agent | controller-0.localdomain |                   | :-)   | True           | neutron-openvswitch-agent |
+--------------------------------------+--------------------+--------------------------+-------------------+-------+----------------+---------------------------+

[stack@undercloud-0 ~]$ nova service-list
+----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+
| Id | Binary           | Host                     | Zone     | Status  | State | Updated_at                 | Disabled Reason |
+----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+
| 1  | nova-scheduler   | controller-0.localdomain | internal | enabled | up    | 2017-10-06T18:09:19.000000 | -               |
| 4  | nova-consoleauth | controller-0.localdomain | internal | enabled | up    | 2017-10-06T18:09:17.000000 | -               |
| 5  | nova-conductor   | controller-0.localdomain | internal | enabled | up    | 2017-10-06T18:09:12.000000 | -               |
| 6  | nova-compute     | compute-0.localdomain    | nova     | enabled | up    | 2017-10-06T18:09:20.000000 | -               |
+----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+

[stack@undercloud-0 ~]$ neutron agent-list
+--------------------------------------+--------------------+--------------------------+-------------------+-------+----------------+---------------------------+
| id                                   | agent_type         | host                     | availability_zone | alive | admin_state_up | binary                    |
+--------------------------------------+--------------------+--------------------------+-------------------+-------+----------------+---------------------------+
| 190fde25-fcdb-4b22-aaa7-e1cb55444914 | Metadata agent     | controller-0.localdomain |                   | :-)   | True           | neutron-metadata-agent    |
| 50ac64d2-a759-47d8-a525-2d766cbeae04 | Open vSwitch agent | compute-0.localdomain    |                   | :-)   | True           | neutron-openvswitch-agent |
| e7903a51-c80d-40b7-b9c1-6c7d66b46619 | L3 agent           | controller-0.localdomain | nova              | :-)   | True           | neutron-l3-agent          |
| fad33d95-13d6-4bec-bfe4-97d054bedf89 | DHCP agent         | controller-0.localdomain | nova              | :-)   | True           | neutron-dhcp-agent        |
| ffcbef93-5f8f-4822-84de-05488df6bb0b | Open vSwitch agent | controller-0.localdomain |                   | :-)   | True           | neutron-openvswitch-agent |
+--------------------------------------+--------------------+--------------------------+-------------------+-------+----------------+---------------------------+

[root@controller-0 heat-admin]# python
Python 2.7.5 (default, May 3 2017, 07:55:04)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-14)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import socket
>>> socket.gethostname
<built-in function gethostname>
>>> socket.gethostname()
'controller-0.localdomain'
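The interpreter session above is the crux: socket.gethostname() returns whatever the kernel hostname is set to, and on this build that is already the FQDN. Any caller that needs the short name must strip the domain itself, as in this sketch (the helper name is mine, not anything nova or neutron actually ships):

```python
import socket

def short_hostname():
    # socket.gethostname() may return either "controller-0" or
    # "controller-0.localdomain" depending on how the kernel hostname was
    # set; splitting on the first dot normalizes both cases.
    return socket.gethostname().split(".", 1)[0]

print(short_hostname())  # e.g. "controller-0" on the broken build
```

This is exactly the ambiguity the OSP10 templates sidestep by explicitly setting host= to the FQDN, at the cost of changing the identity of already-registered services.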
Hi, changing allow_automatic_l3agent_failover=False to allow_automatic_l3agent_failover=True should prevent the issue, as the "new" L3 agents would take over the previous workload. We already have allow_automatic_dhcp_failover = true (it's commented out, but it's the default value). To test that workaround, it should only be necessary to restart neutron-server on each controller.
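A minimal sketch of applying that config change programmatically, using Python's configparser on a temporary copy of the file (on a real controller the file is /etc/neutron/neutron.conf and neutron-server must be restarted afterwards; the temp path here is purely illustrative):

```python
import configparser
import os
import tempfile

# Start from the pre-workaround state of the option.
conf = configparser.ConfigParser()
conf["DEFAULT"] = {"allow_automatic_l3agent_failover": "False"}

# The workaround: flip the flag to True.
conf["DEFAULT"]["allow_automatic_l3agent_failover"] = "True"

# Write out a neutron.conf-style file and read it back to verify.
path = os.path.join(tempfile.mkdtemp(), "neutron.conf")
with open(path, "w") as f:
    conf.write(f)

check = configparser.ConfigParser()
check.read(path)
print(check["DEFAULT"]["allow_automatic_l3agent_failover"])  # True
```

In practice on TripleO nodes one would set this through the heat templates / puppet rather than editing the file by hand, so puppet doesn't revert it on the next converge.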
Hi, so the test was unsuccessful. Next step are: 1. getting more help from networking; 2. Trying this https://review.openstack.org/#/q/I8f075a5ad869ef0dc72a700dcb7be0b6efca787a which strive to never change the host id. TL;DR Even with allow_automatic_l3agent_failover=True configured in neutron.conf in the three controllers, the router stay on the failed l3 agents: neutron l3-agent-list-hosting-router 903195f0-c361-46a4-8b71-9a9b9bde572c +--------------------------------------+--------------+----------------+-------+----------+ | id | host | admin_state_up | alive | ha_state | +--------------------------------------+--------------+----------------+-------+----------+ | e5267f02-5b5f-44ec-ab0d-ae0c2fa42b6f | controller-0 | True | xxx | standby | | c17f7b3a-22c4-4b5c-ba66-ad5ab85bd1ee | controller-1 | True | xxx | active | | d839c597-f68e-4c18-b6e6-6ef7f44e643f | controller-2 | True | xxx | standby | +--------------------------------------+--------------+----------------+-------+----------+ There are not migrated to the live agent: neutron agent-list | grep L3 | 17311ec7-2db0-440d-922d-06bc633cc2a8 | L3 agent | controller-2.localdomain | nova | :-) | True | neutron-l3-agent | | 3174da98-564f-4449-a2c3-704d799f6558 | L3 agent | controller-0.localdomain | nova | :-) | True | neutron-l3-agent | | 54ffd13f-ab05-4b5f-a884-a5016dcdd512 | L3 agent | controller-1.localdomain | nova | :-) | True | neutron-l3-agent | | c17f7b3a-22c4-4b5c-ba66-ad5ab85bd1ee | L3 agent | controller-1 | nova | xxx | True | neutron-l3-agent | | d839c597-f68e-4c18-b6e6-6ef7f44e643f | L3 agent | controller-2 | nova | xxx | True | neutron-l3-agent | | e5267f02-5b5f-44ec-ab0d-ae0c2fa42b6f | L3 agent | controller-0 | nova | xxx | True | neutron-l3-agent | The puppet log during the converge show that puppet did its job: Apr 12 21:50:32 controller-1 os-collect-config[2860]: Notice: /Stage[main]/Neutron::Server/Neutron_config[DEFAULT/allow_automatic_l3agent_failover]/value: value changed ['False'] to 
['True']
Apr 12 21:50:32 controller-1 os-collect-config[2860]: Notice: /Stage[main]/Neutron::Server/Neutron_config[DEFAULT/api_workers]/value: value changed ['0'] to ['4']
Apr 12 21:50:32 controller-1 os-collect-config[2860]: Notice: /Stage[main]/Neutron::Server/Oslo::Middleware[neutron_config]/Neutron_config[oslo_middleware/enable_proxy_headers_parsing]/ensure: created
Apr 12 21:50:32 controller-1 os-collect-config[2860]: Notice: /Stage[main]/Neutron::Server/Neutron_config[DEFAULT/router_distributed]/ensure: created
Apr 12 21:50:32 controller-1 os-collect-config[2860]: Notice: /Stage[main]/Neutron::Db/Oslo::Db[neutron_config]/Neutron_config[database/db_max_retries]/ensure: created
Apr 12 21:50:32 controller-1 os-collect-config[2860]: Notice: /Stage[main]/Neutron::Db/Oslo::Db[neutron_config]/Neutron_config[database/connection]/value: value changed '[old secret redacted]' to '[new secret redacted]'
Apr 12 21:50:32 controller-1 os-collect-config[2860]: Notice: /Stage[main]/Neutron::Server/Neutron_api_config[filter:authtoken/admin_tenant_name]/ensure: removed
Apr 12 21:50:32 controller-1 os-collect-config[2860]: Notice: /Stage[main]/Neutron::Deps/Anchor[neutron::config::end]: Triggered 'refresh' from 22 events
Apr 12 21:50:32 controller-1 os-collect-config[2860]: Notice: /Stage[main]/Neutron::Deps/Anchor[neutron::service::begin]: Triggered 'refresh' from 1 events
Apr 12 21:50:32 controller-1 os-collect-config[2860]: Notice: /Stage[main]/Neutron::Server/Service[neutron-server]: Triggered 'refresh' from 1 events
Apr 12 21:50:32 controller-1 os-collect-config[2860]: Notice: /Stage[main]/Neutron::Agents::Ml2::Ovs/Service[neutron-ovs-agent-service]: Triggered 'refresh' from 1 events
Apr 12 21:50:32 controller-1 os-collect-config[2860]: Notice: /Stage[main]/Neutron::Agents::Dhcp/Service[neutron-dhcp-service]: Triggered 'refresh' from 1 events
Apr 12 21:50:32 controller-1 os-collect-config[2860]: Notice: /Stage[main]/Neutron::Agents::L3/Service[neutron-l3]: Triggered 'refresh' from 1 events
Apr 12 21:50:32 controller-1 os-collect-config[2860]: Notice: /Stage[main]/Neutron::Agents::Metadata/Service[neutron-metadata]: Triggered 'refresh' from 1 events
Apr 12 21:50:32 controller-1 os-collect-config[2860]: Notice: /Stage[main]/Neutron::Deps/Anchor[neutron::service::end]: Triggered 'refresh' from 5 events

But on all three servers we can see:

2018-04-12 21:54:30.669 82738 ERROR neutron.db.agentschedulers_db [req-67cd6a0d-ca1c-42c7-9656-dd30bd335979 - - - - -] Exception encountered during router rescheduling.
2018-04-12 21:54:30.669 82738 ERROR neutron.db.agentschedulers_db Traceback (most recent call last):
2018-04-12 21:54:30.669 82738 ERROR neutron.db.agentschedulers_db   File "/usr/lib/python2.7/site-packages/neutron/db/agentschedulers_db.py", line 215, in reschedule_resources_from_down_agents

which corresponds to:

    context = ncontext.get_admin_context()
    try:
        down_bindings = get_down_bindings(context, agent_dead_limit)
        agents_back_online = set()
        for binding in down_bindings:
            binding_agent_id = getattr(binding, agent_id_attr)
            binding_resource_id = getattr(binding, resource_id_attr)
            if binding_agent_id in agents_back_online:

It fails because:

2018-04-12 21:54:30.669 82738 ERROR neutron.db.agentschedulers_db DBConnectionError: (pymysql.err.OperationalError) 2003, "Can't connect to MySQL server on '172.17.1.11' ([Errno 111] ECONNREFUSED)"

It must be related to the pacemaker service being restarted at that time. So it seems that upon restart it tries to do the right thing, fails, and does not try again. Restarting all the neutron-servers post upgrade does not reschedule the routers on the new l3 agents either.

It keeps saying:

Checking if agent starts up and giving it additional 0:00:00 agent_starting_up /usr/lib/python2.7/site-packages/neutron/db/agentschedulers_db.py:309

and does nothing, even though it detects there is an issue:

WARNING neutron.db.agents_db [req-d7916013-93df-44c0-9d58-2af3cdcd26f4 - - - - -] Agent healthcheck: found 12 dead agents out of 25:
                Type       Last heartbeat  host
      Metadata agent  2018-04-12 21:46:33  controller-1
  Open vSwitch agent  2018-04-12 21:45:25  controller-2
  Open vSwitch agent  2018-04-12 21:46:54  controller-0
          DHCP agent  2018-04-12 21:46:42  controller-0
          DHCP agent  2018-04-12 21:46:29  controller-1
      Metadata agent  2018-04-12 21:45:19  controller-2
  Open vSwitch agent  2018-04-12 21:46:13  controller-1
            L3 agent  2018-04-12 21:46:38  controller-1
      Metadata agent  2018-04-12 21:47:06  controller-0
            L3 agent  2018-04-12 21:45:17  controller-2
          DHCP agent  2018-04-12 21:45:15  controller-2
            L3 agent  2018-04-12 21:46:39  controller-0
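For reference, the healthcheck above flags an agent as dead once its last heartbeat is older than the agent_down_time config option (75 seconds by default in Neutron). The check below is a minimal illustrative sketch of that logic, not Neutron's actual code; the function name is made up here:

```python
from datetime import datetime, timedelta

def is_agent_dead(last_heartbeat, now, agent_down_time=75):
    """Return True if the agent's last heartbeat is older than
    agent_down_time seconds (mirrors Neutron's liveness rule)."""
    return now - last_heartbeat > timedelta(seconds=agent_down_time)

# At 21:54:30, the metadata agent on controller-1 last reported at
# 21:46:33 (about 8 minutes earlier), so it is reported as dead.
now = datetime(2018, 4, 12, 21, 54, 30)
print(is_agent_dead(datetime(2018, 4, 12, 21, 46, 33), now))  # True
```

This is why all the pre-upgrade agent rows (registered under the short hostname, which no agent reports under anymore) stay permanently "dead" in the healthcheck output.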
Adding that bug as it seems related.
Created attachment 1421308 [details] Workaround to get l3ha routers rescheduled
The automatic router failover mechanism only works for non-l3ha routers. Using the attached script, it is possible to force Neutron to clean the l3ha schedulings and reschedule to new hosts.
The host= parameter should not be changed. If this happened by admin intervention, that should not be done. If this happened because the upgrade mechanism did it, there was a bug related to this, and I believe it was being addressed.
Created attachment 1421315 [details] Script to cleanup dead agents (on the wrong host id)
(In reply to Miguel Angel Ajo from comment #10)

Thanks a lot ajo for the workaround here.

For this to work we need the review https://review.openstack.org/560855 applied before the converge step. If that's not the case, you have to manually set:

    allow_automatic_l3agent_failover=True

in neutron.conf on each controller and restart the neutron-server.

Even with the patch applied, after the converge we have a cut in FIP reachability. This is how you can bring everything back working:

    ssh undercloud
    . overcloudrc
    curl -o reschedule-l3-routers.sh https://bugzilla.redhat.com/attachment.cgi?id=1421308
    bash -x ./reschedule-l3-routers.sh

After a little while (between one and two minutes) everything should come back alive. One can verify with a ping test; checking the state of a particular router is done like this:

    ssh undercloud
    . overcloudrc
    neutron router-list
    # pick one, then:
    neutron l3-agent-list-hosting-router 903195f0-c361-46a4-8b71-9a9b9bde572c
    +--------------------------------------+--------------------------+----------------+-------+----------+
    | id                                   | host                     | admin_state_up | alive | ha_state |
    +--------------------------------------+--------------------------+----------------+-------+----------+
    | 54ffd13f-ab05-4b5f-a884-a5016dcdd512 | controller-1.localdomain | True           | :-)   | standby  |
    | 17311ec7-2db0-440d-922d-06bc633cc2a8 | controller-2.localdomain | True           | :-)   | standby  |
    | 3174da98-564f-4449-a2c3-704d799f6558 | controller-0.localdomain | True           | :-)   | active   |
    +--------------------------------------+--------------------------+----------------+-------+----------+

You may have all three in standby at first; not to worry, one will come back to active, and during that time the ping (and everything else) should work.

When everything has settled, you can clean up the dead l3 agents:

    ssh undercloud
    . overcloudrc
    curl -o cleanup-non-alive-agents.sh https://bugzilla.redhat.com/attachment.cgi?id=1421315
    bash -x ./cleanup-non-alive-agents.sh
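Conceptually, the rescheduling workaround drops the HA bindings that point at the dead (short-hostname) agents and fills them back up from the alive (.localdomain) agents. The Python below is an illustrative model of that idea only, not the attached script; the function name and data shapes are made up here:

```python
def reschedule_ha_router(bindings, alive_agents, max_agents=3):
    """Drop bindings that point at dead agents, then bind the router
    to alive agents until max_agents bindings exist (illustrative
    model of the l3ha rescheduling workaround, not the real script)."""
    kept = [agent for agent in bindings if agent in alive_agents]
    for agent in alive_agents:
        if len(kept) >= max_agents:
            break
        if agent not in kept:
            kept.append(agent)
    return kept

# After the upgrade, the router is bound to the short-hostname agents,
# which are all dead; the .localdomain agents are the live ones.
bindings = ["controller-0", "controller-1", "controller-2"]
alive = ["controller-0.localdomain", "controller-1.localdomain",
         "controller-2.localdomain"]
print(reschedule_ha_router(bindings, alive))
# -> ['controller-0.localdomain', 'controller-1.localdomain',
#     'controller-2.localdomain']
```

In the real deployment the add/remove steps go through the Neutron API (l3-agent-router-remove / l3-agent-router-add), which is why the routers briefly show all bindings in standby before keepalived elects a new active.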
(In reply to Miguel Angel Ajo from comment #10)
> The host= parameter should not be changed.
>
> If this happened by admin intervention, that should not be done.
>
> If this happened because the upgrade mechanism did it, there was a bug
> related to this, and I believe it was being addressed.

So this is happening because we changed how that parameter is set during deployment.

In OSP9 and before, the parameter was unset, so it got whatever socket.gethostname() was returning. That means that changes in /etc/hosts, DHCP, cloud-init, and certainly other things could make that function return either the short hostname or the FQDN. But it seems to have been fairly consistent in returning the short hostname with RHEL 7.5.

In OSP10 we explicitly set that parameter to whatever "facter fqdn" returns, which is most of the time an FQDN (here again, misconfiguration of /etc/hosts and so on could change that).

So: no admin intervention, no upgrade mechanism, and yes, that parameter should never change. That's why the final fix should be to make sure that puppet never changes it. There is a WIP to implement that [1].

[1] https://review.openstack.org/#/c/561079/1
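The resolution order described above can be sketched in Python: with host unset, the agent falls back to socket.gethostname(), while OSP10 passes in an explicit FQDN. The helper below is a hypothetical illustration of that fallback, not Neutron's actual code:

```python
import socket

def effective_agent_host(configured_host=None):
    """Return the identity an agent registers under: the explicit
    host option if set, otherwise whatever socket.gethostname()
    yields on this machine (illustrative, not Neutron code)."""
    return configured_host or socket.gethostname()

# OSP10: host set from "facter fqdn" -> usually the FQDN
print(effective_agent_host("overcloud-controller-0.localdomain"))
# OSP9: host unset -> falls back to this machine's hostname,
# which on RHEL 7.5 was consistently the short name
print(effective_agent_host())
```

Because agent rows are keyed by this string, flipping from "overcloud-controller-0" to "overcloud-controller-0.localdomain" makes every agent re-register as a brand-new agent, leaving the old rows (and the router schedulings bound to them) permanently dead.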
Adding current master review ... we should clone this bz to all releases.
The last review is not strictly necessary for newton
We're on the "z8" release of RHOSP10. Stack updates currently fail with: Could not retrieve fact='current_nova_host', resolution='<anonymous>': uninitialized constant Tempfile [heat-admin@compute-0 ~]$ rpm -qa | grep tripleo puppet-tripleo-5.6.8-6.el7ost.noarch Looks like we're missing patch: https://review.openstack.org/568552 (which is included in the tracker for this BZ) The patch needs to be pushed out ASAP as all stack updates will fail without it.
Confirmed that stack update succeeds after manually applying https://review.openstack.org/568552 to all Overcloud nodes.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:2101