Bug 1600178
Summary: Neutron routers become unavailable after rebooting networker nodes post minor update
Product: Red Hat OpenStack
Component: puppet-tripleo
Version: 10.0 (Newton)
Target Release: 10.0 (Newton)
Target Milestone: z10
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Hardware: Unspecified
OS: Unspecified
Keywords: Triaged, ZStream
Reporter: Marius Cornea <mcornea>
Assignee: Sofer Athlan-Guyot <sathlang>
QA Contact: Amit Ugol <augol>
CC: aguetta, amuller, bhaley, ccamacho, dbecker, gkumar, jamsmith, jfrancoa, jhardee, jjoyce, jpretori, jschluet, kiyyappa, majopela, mburns, mcornea, morazi, ojanas, pcaruana, pmorey, sathlang, slinaber, sputhenp, tvignaud
Fixed In Version: puppet-tripleo-5.6.8-17.el7ost
Doc Type: Bug Fix
Doc Text:
Previously, Neutron routers became unavailable after a minor update when the Networker role was in use. This was caused by a change in the way the Neutron server node on the overcloud (the host parameter) was identified: the value could be overwritten. As a result, the old L3 agent and its attached floating IPs became unavailable.
With this update, the Neutron host parameter does not change on nodes implementing the Networker role during the update, and the floating IPs remain available.
Last Closed: 2018-11-26 18:00:29 UTC
Type: Bug
Description by Marius Cornea, 2018-07-11 15:26:43 UTC:
Created attachment 1458140 [details]
sosreport-networker-0
Created attachment 1458141 [details]
sosreport-networker-1
Some notes: the issue didn't reproduce on a fresh latest OSP10 deployment.

The odd thing I see in the l3-agent log on networker-1 is:

    2018-07-11 15:07:47.266 3722 DEBUG neutron.agent.l3.agent [req-f1f57f60-e241-43fb-b06b-cebc838896f7 - - - - -] Starting fullsync periodic_sync_routers_task periodic_sync_routers_task
    [...]
    2018-07-11 15:07:53.127 3722 DEBUG neutron.agent.l3.agent [req-f1f57f60-e241-43fb-b06b-cebc838896f7 - - - - -] periodic_sync_routers_task successfully completed fetch_and_sync_all_routers

But there was no work done, and no message back to the server notifying it of the current state of routers. For example, right before the restart there was:

    2018-07-11 14:56:12.411 249422 DEBUG neutron.agent.l3.ha [-] Updating server with HA routers states {'461ca428-e6de-4fc2-b571-507a96476a83': 'active', 'd8074e7e-b108-4c62-aa95-b49e72867562': 'active', '7c3dd049-cc73-43cf-9e8d-649da3ecacea': 'active', '03ca67c8-eece-44ee-9a51-fd7adff0d85f': 'active', 'edccfa87-1ea6-4937-a0ef-2f4c0d8551fe': 'active', '788fcbae-6bb4-4458-9298-81dadefe8dc5': 'active', 'a8b24af1-3d7a-4ff5-baf0-f8577ed77a23': 'active', '448cdab2-5bbe-41d7-a2e9-5ad60596b107': 'active', 'f4c26283-f4a9-4a65-be11-5b77cf330286': 'active', '7b9920fb-5299-4388-b74c-fe7ab6d068a4': 'active', '7b35e807-52a5-49cf-96f6-facf88456488': 'active', '6fc34db3-4b90-4556-98c2-d5366c0685d1': 'active', '75da816e-ac8b-4c45-b8fa-6f6f6158e6af': 'active'} notify_server

I didn't see an l3-agent log in the networker-0 sosreport to correlate things. It's as if there were no routers returned in the full sync call.

I reproduced the issue (it only shows up after a minor update).
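The "Updating server with HA routers states" line above carries a plain Python dict of router UUIDs to HA states, so it can be pulled out of a log line and summarized mechanically. A minimal sketch (the log line is shortened here to two routers; ast.literal_eval is used because the payload is printed as a Python literal, not JSON):

```python
import ast
import re

# A shortened copy of the neutron.agent.l3.ha log line quoted above.
log_line = (
    "2018-07-11 14:56:12.411 249422 DEBUG neutron.agent.l3.ha [-] "
    "Updating server with HA routers states "
    "{'461ca428-e6de-4fc2-b571-507a96476a83': 'active', "
    "'d8074e7e-b108-4c62-aa95-b49e72867562': 'active'} notify_server"
)

def ha_states(line):
    """Extract the router-UUID -> ha_state mapping from a log line."""
    match = re.search(r"(\{.*\})", line)
    # The dict is printed as a Python literal, so literal_eval parses it.
    return ast.literal_eval(match.group(1)) if match else {}

states = ha_states(log_line)
print(len(states), "routers reported,",
      sum(1 for s in states.values() if s == "active"), "active")
```

Comparing the router count in this line before and after the restart would show the empty full-sync result described above.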
After rebooting the networker-1 node:

[stack@undercloud-0 ~]$ neutron l3-agent-list-hosting-router 4c373aff-7755-40c5-80b7-289475fd9008
+--------------------------------------+-------------------------+----------------+-------+----------+
| id                                   | host                    | admin_state_up | alive | ha_state |
+--------------------------------------+-------------------------+----------------+-------+----------+
| 3d07375f-3e6a-41bb-9712-f814c5008807 | networker-0.localdomain | True           | :-)   | active   |
| 61e270cb-586c-49df-b925-2d9d58bbe70d | networker-1.localdomain | True           | xxx   | standby  |
+--------------------------------------+-------------------------+----------------+-------+----------+

Attaching /var/log/neutron from networker-1.

Created attachment 1458253 [details]
neutron.tar.gz
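The table above can be evaluated mechanically: an HA router stays reachable as long as at least one alive agent reports ha_state "active". A minimal sketch, with the two rows hardcoded from the listing above rather than fetched from the Neutron API:

```python
# Rows condensed from the l3-agent-list-hosting-router output above:
# (host, alive, ha_state). In the CLI listing, ":-)" means alive and
# "xxx" means the agent is down.
agents = [
    ("networker-0.localdomain", True,  "active"),
    ("networker-1.localdomain", False, "standby"),
]

def router_is_served(agents):
    """True if at least one alive agent holds the 'active' HA state."""
    return any(alive and state == "active" for _, alive, state in agents)

print(router_is_served(agents))  # True: networker-0 still serves the router
```

Here the router is still served because networker-0's agent is alive and active; only the rebooted networker-1 agent is down.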
Note: this only seems to be happening when you have separate Networker nodes. I wasn't able to reproduce it with monolithic controllers.

OK, so what happens is that after the update and reboot the hostname for the agents changes (see that the networker-1.localdomain agents are down while the networker-1 agents are up):

[stack@undercloud-0 ~]$ neutron agent-list
+--------------------------------------+--------------------+-------------------------+-------------------+-------+----------------+---------------------------+
| id                                   | agent_type         | host                    | availability_zone | alive | admin_state_up | binary                    |
+--------------------------------------+--------------------+-------------------------+-------------------+-------+----------------+---------------------------+
| 03e37225-5d81-4acc-87dd-287afdec9e09 | Metadata agent     | networker-1.localdomain |                   | xxx   | True           | neutron-metadata-agent    |
| 37d9c9a1-99ab-4744-aa78-5b3948743670 | Open vSwitch agent | networker-0.localdomain |                   | :-)   | True           | neutron-openvswitch-agent |
| 3c591ad5-8a2d-40db-9eb6-2d3a1251e201 | DHCP agent         | networker-0.localdomain | nova              | :-)   | True           | neutron-dhcp-agent        |
| 402cb673-47a8-43d7-b81c-69330c8d4e45 | Open vSwitch agent | compute-0.localdomain   |                   | :-)   | True           | neutron-openvswitch-agent |
| 496f9a91-5b51-486f-81a1-31b526ccbf65 | L3 agent           | networker-0.localdomain | nova              | :-)   | True           | neutron-l3-agent          |
| 55da449a-3d94-4165-89fc-499025050f25 | DHCP agent         | networker-1             | nova              | :-)   | True           | neutron-dhcp-agent        |
| 88e7f6aa-3520-4c52-8f63-f3dd3fd08172 | Metadata agent     | networker-0.localdomain |                   | :-)   | True           | neutron-metadata-agent    |
| 8a0c92ed-cdc7-4701-9fbf-86f3a4589fe9 | L3 agent           | networker-1.localdomain | nova              | xxx   | True           | neutron-l3-agent          |
| 94292a2f-d8e4-480a-bb21-e8df37f97154 | Open vSwitch agent | compute-4.localdomain   |                   | :-)   | True           | neutron-openvswitch-agent |
| 9816c1be-f4f9-46e3-b905-f8efaacb54b1 | Open vSwitch agent | networker-1.localdomain |                   | xxx   | True           | neutron-openvswitch-agent |
| 9be3b9ba-5cb1-42b9-a1de-f8ce1748d444 | Open vSwitch agent | networker-1             |                   | :-)   | True           | neutron-openvswitch-agent |
| bb3a5435-e9e2-4c01-b717-989d28c14486 | Open vSwitch agent | compute-2.localdomain   |                   | :-)   | True           | neutron-openvswitch-agent |
| d4783830-25d9-4a4e-955d-55e5670c5bb5 | DHCP agent         | networker-1.localdomain | nova              | xxx   | True           | neutron-dhcp-agent        |
| df386d89-d052-4052-8e23-2abe8ea03f4f | L3 agent           | networker-1             | nova              | :-)   | True           | neutron-l3-agent          |
| e24a5ce2-a0ee-4733-93ec-6c191190782f | Open vSwitch agent | compute-1.localdomain   |                   | :-)   | True           | neutron-openvswitch-agent |
| e27638e7-2b3c-4ba7-a293-3a79395ec47b | Metadata agent     | networker-1             |                   | :-)   | True           | neutron-metadata-agent    |
| e295f9e6-6180-4531-985e-786e61f2b4d8 | Open vSwitch agent | compute-3.localdomain   |                   | :-)   | True           | neutron-openvswitch-agent |
+--------------------------------------+--------------------+-------------------------+-------------------+-------+----------------+---------------------------+

On networker-1 the host parameter in neutron.conf now holds the short name:

[root@networker-1 heat-admin]# grep -v ^# /etc/neutron/neutron.conf | grep -v ^$
[DEFAULT]
auth_strategy=keystone
core_plugin=ml2
service_plugins=router,qos,trunk
allow_overlapping_ips=True
host=networker-1
global_physnet_mtu=1496
dhcp_agents_per_network=2
debug=True
log_dir=/var/log/neutron
rpc_backend=rabbit
control_exchange=neutron
[agent]
root_helper=sudo neutron-rootwrap /etc/neutron/rootwrap.conf
[cors]
[cors.subdomain]
[database]
[keystone_authtoken]
[matchmaker_redis]
[nova]
[oslo_concurrency]
lock_path=$state_path/lock
[oslo_messaging_amqp]
[oslo_messaging_notifications]
[oslo_messaging_rabbit]
rabbit_hosts=172.17.1.17:5672,172.17.1.18:5672,172.17.1.27:5672
rabbit_use_ssl=False
rabbit_userid=guest
rabbit_password=8rabVQE2unyVyvb3uwmhBXPuV
rabbit_ha_queues=True
heartbeat_timeout_threshold=60
[oslo_messaging_zmq]
[oslo_middleware]
[oslo_policy]
[qos]
[quotas]
[ssl]

Adding DFG:Upgrades as I think this is related to BZ#1499201.

Moving Networking DFG to observer / secondary DFG.
If the hostname changes on the agents, that explains the problem, and it seems to be more Upgrades DFG related. Marius, do you think that RHBZ 1499201 might not have been resolved in all cases?

(In reply to Assaf Muller from comment #11)
> Moving Networking DFG to observer / secondary DFG. If the hostname changes
> on the agents that explains the problem, and seems to be more Upgrades DFG
> related. Marius do you think that RHBZ 1499201 might not have been resolved
> in all cases?

Yes, I could only spot this issue when using a Networker role, so I suspect this is affecting only deployments involving custom roles (not the monolithic controllers).

Hi,

So we definitely have the host change:

    Jul 11 01:10:16 networker-0 os-collect-config[3201]: [2018-07-11 01:10:15,354] (heat-config) [DEBUG] [2018-07-11 01:09:53,868] (heat-config) [DEBUG] Running FACTER_heat_outputs_path="/var/run/heat-config/heat-config-puppet/c0d7893b-22e4-4e0f-9099-fbe275ccb2d7" FACTER_fqdn="networker-0.localdomain" FACTER_deploy_config_name="NetworkerDeployment_Step3" puppet apply --detailed-exitcodes --logdest console --modulepath /etc/puppet/modules:/opt/stack/puppet-modules:/usr/share/openstack-puppet/modules --debug --logdest /var/log/puppet/heat-debug.log /var/lib/heat-config/heat-config-puppet/c0d7893b-22e4-4e0f-9099-fbe275ccb2d7.pp
    /Stage[main]/Neutron/Neutron_config[DEFAULT/host]/value: value changed ['networker-0.localdomain'] to ['networker-0']
    Jul 11 01:10:16 networker-0 os-collect-config[3201]: Debug: Loading facts from /usr/share/openstack-puppet/modules/tripleo/lib/facter/current_config_hosts.rb

So current_config_hosts.rb is loaded, and it output networker-0 as a result of its "introspection", which is unexpected. I will need to deploy the same env to see what is happening in this context.
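The flip recorded in the puppet log above (DEFAULT/host changing from 'networker-0.localdomain' to 'networker-0') can be reverted by writing the old FQDN back into neutron.conf and restarting the Neutron services on the node. A minimal sketch of that edit using Python's standard configparser; it operates on a throwaway sample file rather than the real /etc/neutron/neutron.conf, and the restart step is not shown:

```python
import configparser
import os
import tempfile

# Sample of the relevant part of neutron.conf, holding the short
# hostname that puppet wrote during the minor update.
sample = "[DEFAULT]\ncore_plugin=ml2\nhost=networker-1\n"

def restore_host(path, fqdn):
    """Rewrite DEFAULT/host back to the previous fully qualified name."""
    # interpolation=None: values like $state_path must pass through verbatim.
    cfg = configparser.ConfigParser(interpolation=None)
    cfg.read(path)
    cfg["DEFAULT"]["host"] = fqdn
    with open(path, "w") as fh:
        cfg.write(fh)

# Work on a temporary copy; on a real node this would target
# /etc/neutron/neutron.conf followed by a Neutron service restart.
path = os.path.join(tempfile.mkdtemp(), "neutron.conf")
with open(path, "w") as fh:
    fh.write(sample)

restore_host(path, "networker-1.localdomain")

check = configparser.ConfigParser(interpolation=None)
check.read(path)
print(check["DEFAULT"]["host"])  # networker-1.localdomain
```

Note that configparser normalizes whitespace and drops comments when rewriting, so on a production file a targeted sed or crudini edit may be preferable; this sketch only illustrates the change itself.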
As a workaround, the workaround scripts in https://bugzilla.redhat.com/show_bug.cgi?id=1499201 should work here as well (as a temporary fix):
- change the host value in neutron.conf back to the previous one and restart Neutron on the Networker role
- use "Workaround to get l3ha routers rescheduled" to get the new value in

Hi,

So the networker replacement test doesn't seem to use the overcloud stack update command but some form of added templates to a deploy command. First, I wonder if this change of testing protocol invalidates the patch, as we're still not sure whether it works for update. Second, I need logs, or better yet an environment, to determine whether the problem is similar to what we saw during update. Lastly, I would like to know which version of RHEL we're using for the OSP10 deployment.

Thanks,

Hi,

So here's the rpm build. If it's delivered as a hotfix, it has to be installed on all the overcloud nodes. One way to do that is:

------8<-------
RPM_PATH=[path to the rpm downloaded on the undercloud]
. ~/stackrc
openstack server list -f json > ~/server.json
jq -r '.[] | .Networks' ~/server.json | cut -d= -f2 > ~/ips.txt

for ip in $(cat ~/ips.txt); do scp $RPM_PATH heat-admin@${ip}: ; done
for ip in $(cat ~/ips.txt); do ssh heat-admin@${ip} yum install -y ./$(basename $RPM_PATH) ; done
------>8-------

There are others; it's just given as an example.

(In reply to Sofer Athlan-Guyot from comment #27)

Hi,

Mistake in the example script (missing sudo), sorry:

------8<-------
RPM_PATH=[path to the rpm downloaded on the undercloud]
. ~/stackrc
openstack server list -f json > ~/server.json
jq -r '.[] | .Networks' ~/server.json | cut -d= -f2 > ~/ips.txt

for ip in $(cat ~/ips.txt); do scp $RPM_PATH heat-admin@${ip}: ; done
for ip in $(cat ~/ips.txt); do ssh heat-admin@${ip} sudo yum install -y ./$(basename $RPM_PATH) ; done
------>8-------

> Hi,
> so here's the rpm build.
> If it's delivered as an hotfix, it has to be installed on all the overcloud
> nodes.
> One way to do that is: [...]
> there are others, it's just given as an example.

*** Bug 1638303 has been marked as a duplicate of this bug. ***

Moving this to verified, as Marius has already tested the patch and the issue was solved when shipped on the client side.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3673

Hi,

Please check comment [1], especially the part on the undercloud configuration needed for scale-out nodes to get a host parameter with the FQDN [2]. The fact that cloud-init set it to the short name indicates that the undercloud may not be configured properly:

    2019-04-02 14:31:43,422 - cc_set_hostname.py[DEBUG]: Setting the hostname to m1pl-st-comp0-12 (m1pl-st-comp0-12)
    2019-04-02 14:31:43,422 - util.py[DEBUG]: Running command ['hostnamectl', 'set-hostname', 'm1pl-st-comp0-12'] with allowed return codes [0] (shell=False, capture=True)

So first check the undercloud configuration and make sure it matches what is described in [2]. Please report your findings. For better tracking I think it would be better to open a new bz, like "scale out compute node have short name host parameter, blocking ffu".

Thanks,

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1657692#c21
[2] https://access.redhat.com/solutions/2089051
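The cloud-init log above shows the node being handed a short hostname. A quick sanity check can confirm whether a name is fully qualified before it ends up as the Neutron host parameter; a minimal sketch (the is_fqdn helper is hypothetical, not part of any OpenStack tooling):

```python
def is_fqdn(hostname):
    """True when the hostname carries a domain part, e.g. node.localdomain."""
    # Hypothetical helper: a fully qualified name has a non-empty label
    # before the first dot and a non-empty domain after it.
    head, _, domain = hostname.partition(".")
    return bool(head) and bool(domain)

# The short name set by cloud-init in the log above fails the check,
# while the FQDN form expected for the host parameter passes.
print(is_fqdn("m1pl-st-comp0-12"))         # False
print(is_fqdn("networker-0.localdomain"))  # True
```

Running such a check against `hostnamectl --static` output on each scale-out node would flag the misconfiguration described in [2] before the deploy proceeds.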