Description of problem:

After running the OSP10 minor update procedure and rebooting the networker nodes, the Neutron routers created before the minor update are not available anymore. According to the /var/log/neutron/l3-agent.log files, it appears that the neutron l3 agent process cannot access the pid files in /var/lib/neutron/external/pids/ due to permission errors:

[stack@undercloud-0 ~]$ neutron l3-agent-list-hosting-router f4c26283-f4a9-4a65-be11-5b77cf330286
+--------------------------------------+-------------------------+----------------+-------+----------+
| id                                   | host                    | admin_state_up | alive | ha_state |
+--------------------------------------+-------------------------+----------------+-------+----------+
| 9d13f3a5-199d-43a5-a49e-8a65f0188b05 | networker-0.localdomain | True           | xxx   | standby  |
| b56c201a-2443-478d-ab92-f5f9c6792515 | networker-1.localdomain | True           | xxx   | standby  |
+--------------------------------------+-------------------------+----------------+-------+----------+

[root@networker-1 heat-admin]# grep f4c26283-f4a9-4a65-be11-5b77cf330286 /var/log/neutron/l3-agent.log
2018-07-11 14:55:56.276 249422 DEBUG neutron.agent.l3.ha [-] Handling notification for router f4c26283-f4a9-4a65-be11-5b77cf330286, state master enqueue /usr/lib/python2.7/site-packages/neutron/agent/l3/ha.py:88
2018-07-11 14:55:56.277 249422 INFO neutron.agent.l3.ha [-] Router f4c26283-f4a9-4a65-be11-5b77cf330286 transitioned to master
2018-07-11 14:55:56.278 249422 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-f4c26283-f4a9-4a65-be11-5b77cf330286', 'sysctl', '-w', 'net.ipv6.conf.qg-28e948c5-80.accept_ra=2'] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:105
2018-07-11 14:55:56.329 249422 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-f4c26283-f4a9-4a65-be11-5b77cf330286', 'sysctl', '-w', 'net.ipv6.conf.qg-28e948c5-80.forwarding=1'] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:105
2018-07-11 14:55:56.356 249422 DEBUG neutron.agent.l3.ha [-] Spawning metadata proxy for router f4c26283-f4a9-4a65-be11-5b77cf330286 _update_metadata_proxy /usr/lib/python2.7/site-packages/neutron/agent/l3/ha.py:194
2018-07-11 14:55:56.356 249422 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/external/pids/f4c26283-f4a9-4a65-be11-5b77cf330286.pid get_value_from_file /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:267
2018-07-11 14:55:56.357 249422 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/external/pids/f4c26283-f4a9-4a65-be11-5b77cf330286.pid get_value_from_file /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:267
    pidfile /var/lib/neutron/external/pids/f4c26283-f4a9-4a65-be11-5b77cf330286.pid
    http-request add-header X-Neutron-Router-ID f4c26283-f4a9-4a65-be11-5b77cf330286
2018-07-11 14:55:56.361 249422 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-f4c26283-f4a9-4a65-be11-5b77cf330286', 'haproxy', '-f', '/var/lib/neutron/ns-metadata-proxy/f4c26283-f4a9-4a65-be11-5b77cf330286.conf'] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:105
2018-07-11 14:55:56.405 249422 DEBUG neutron.agent.l3.router_info [-] Spawning radvd daemon in router device: f4c26283-f4a9-4a65-be11-5b77cf330286 enable_radvd /usr/lib/python2.7/site-packages/neutron/agent/l3/router_info.py:492
2018-07-11 14:55:56.406 249422 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/external/pids/f4c26283-f4a9-4a65-be11-5b77cf330286.pid.radvd get_value_from_file /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:267
2018-07-11 14:55:56.406 249422 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/external/pids/f4c26283-f4a9-4a65-be11-5b77cf330286.pid.radvd get_value_from_file /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:267
2018-07-11 14:55:56.407 249422 DEBUG neutron.agent.linux.external_process [-] No process started for f4c26283-f4a9-4a65-be11-5b77cf330286 disable /usr/lib/python2.7/site-packages/neutron/agent/linux/external_process.py:123
2018-07-11 14:55:56.407 249422 DEBUG neutron.agent.linux.ra [-] radvd disabled for router f4c26283-f4a9-4a65-be11-5b77cf330286 disable /usr/lib/python2.7/site-packages/neutron/agent/linux/ra.py:192
2018-07-11 14:56:12.411 249422 DEBUG neutron.agent.l3.ha [-] Updating server with HA routers states {'461ca428-e6de-4fc2-b571-507a96476a83': 'active', 'd8074e7e-b108-4c62-aa95-b49e72867562': 'active', '7c3dd049-cc73-43cf-9e8d-649da3ecacea': 'active', '03ca67c8-eece-44ee-9a51-fd7adff0d85f': 'active', 'edccfa87-1ea6-4937-a0ef-2f4c0d8551fe': 'active', '788fcbae-6bb4-4458-9298-81dadefe8dc5': 'active', 'a8b24af1-3d7a-4ff5-baf0-f8577ed77a23': 'active', '448cdab2-5bbe-41d7-a2e9-5ad60596b107': 'active', 'f4c26283-f4a9-4a65-be11-5b77cf330286': 'active', '7b9920fb-5299-4388-b74c-fe7ab6d068a4': 'active', '7b35e807-52a5-49cf-96f6-facf88456488': 'active', '6fc34db3-4b90-4556-98c2-d5366c0685d1': 'active', '75da816e-ac8b-4c45-b8fa-6f6f6158e6af': 'active'} notify_server /usr/lib/python2.7/site-packages/neutron/agent/l3/ha.py:215

[root@networker-1 heat-admin]# ls -l /var/lib/neutron/external/pids/
total 52
-rw-r--r--. 1 root root 7 Jul 11 14:55 03ca67c8-eece-44ee-9a51-fd7adff0d85f.pid
-rw-r--r--. 1 root root 7 Jul 11 14:55 448cdab2-5bbe-41d7-a2e9-5ad60596b107.pid
-rw-r--r--. 1 root root 7 Jul 11 14:55 461ca428-e6de-4fc2-b571-507a96476a83.pid
-rw-r--r--. 1 root root 7 Jul 11 14:55 6fc34db3-4b90-4556-98c2-d5366c0685d1.pid
-rw-r--r--. 1 root root 7 Jul 11 14:55 75da816e-ac8b-4c45-b8fa-6f6f6158e6af.pid
-rw-r--r--. 1 root root 7 Jul 11 14:55 788fcbae-6bb4-4458-9298-81dadefe8dc5.pid
-rw-r--r--. 1 root root 7 Jul 11 14:55 7b35e807-52a5-49cf-96f6-facf88456488.pid
-rw-r--r--. 1 root root 7 Jul 11 14:55 7b9920fb-5299-4388-b74c-fe7ab6d068a4.pid
-rw-r--r--. 1 root root 7 Jul 11 14:55 7c3dd049-cc73-43cf-9e8d-649da3ecacea.pid
-rw-r--r--. 1 root root 7 Jul 11 14:55 a8b24af1-3d7a-4ff5-baf0-f8577ed77a23.pid
-rw-r--r--. 1 root root 7 Jul 11 14:55 d8074e7e-b108-4c62-aa95-b49e72867562.pid
-rw-r--r--. 1 root root 7 Jul 11 14:55 edccfa87-1ea6-4937-a0ef-2f4c0d8551fe.pid
-rw-r--r--. 1 root root 7 Jul 11 14:55 f4c26283-f4a9-4a65-be11-5b77cf330286.pid

Version-Release number of selected component (if applicable):

How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP10 z4 with 3 controllers + 2 computes + 3 ceph osd nodes + 2 networker nodes
2. Run the OSP10 minor update procedure, then reboot the overcloud nodes
3. Check the status of the neutron routers created before the update

Actual results:
All l3 agents are in ha_state 'standby'.

Expected results:
There is one active l3 agent per router.

Additional info:
Attaching sosreports from networker nodes.
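Since the symptom is that every l3 agent reports ha_state 'standby', affected routers can be spotted by scanning the agent table for an active row. This is a sketch only: the helper name is hypothetical and the awk column position is assumed from the table layout shown above.

```shell
# Hypothetical helper: read the table printed by
#   neutron l3-agent-list-hosting-router <router-id>
# on stdin and exit 0 only if at least one agent row has ha_state "active".
# The ha_state column position (6th pipe-separated field) is assumed from
# the table layout shown in the description above.
has_active_agent() {
  awk -F'|' 'NF >= 6 { s = $6; gsub(/ /, "", s); if (s == "active") found = 1 }
             END { exit !found }'
}
```

Usage sketch against a live cloud (router id from the report above):

```shell
neutron l3-agent-list-hosting-router f4c26283-f4a9-4a65-be11-5b77cf330286 \
  | has_active_agent || echo "router has no active l3 agent"
```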
Created attachment 1458140 [details] sosreport-networker-0
Created attachment 1458141 [details] sosreport-networker-1
Some notes: the issue didn't reproduce on a fresh deployment of the latest OSP10.
So the odd thing I see in the l3-agent log on networker-1 is:

2018-07-11 15:07:47.266 3722 DEBUG neutron.agent.l3.agent [req-f1f57f60-e241-43fb-b06b-cebc838896f7 - - - - -] Starting fullsync periodic_sync_routers_task periodic_sync_routers_task
[...]
2018-07-11 15:07:53.127 3722 DEBUG neutron.agent.l3.agent [req-f1f57f60-e241-43fb-b06b-cebc838896f7 - - - - -] periodic_sync_routers_task successfully completed fetch_and_sync_all_routers

But there was no work done, and no message back to the server notifying it of the current state of routers. For example, right before the restart there was:

2018-07-11 14:56:12.411 249422 DEBUG neutron.agent.l3.ha [-] Updating server with HA routers states {'461ca428-e6de-4fc2-b571-507a96476a83': 'active', 'd8074e7e-b108-4c62-aa95-b49e72867562': 'active', '7c3dd049-cc73-43cf-9e8d-649da3ecacea': 'active', '03ca67c8-eece-44ee-9a51-fd7adff0d85f': 'active', 'edccfa87-1ea6-4937-a0ef-2f4c0d8551fe': 'active', '788fcbae-6bb4-4458-9298-81dadefe8dc5': 'active', 'a8b24af1-3d7a-4ff5-baf0-f8577ed77a23': 'active', '448cdab2-5bbe-41d7-a2e9-5ad60596b107': 'active', 'f4c26283-f4a9-4a65-be11-5b77cf330286': 'active', '7b9920fb-5299-4388-b74c-fe7ab6d068a4': 'active', '7b35e807-52a5-49cf-96f6-facf88456488': 'active', '6fc34db3-4b90-4556-98c2-d5366c0685d1': 'active', '75da816e-ac8b-4c45-b8fa-6f6f6158e6af': 'active'} notify_server

I didn't see an l3-agent log in the networker-0 sosreport to correlate things. It's as if there were no routers returned in the full sync call.
I reproduced the issue (it only shows up after a minor update). After rebooting the networker-1 node:

[stack@undercloud-0 ~]$ neutron l3-agent-list-hosting-router 4c373aff-7755-40c5-80b7-289475fd9008
+--------------------------------------+-------------------------+----------------+-------+----------+
| id                                   | host                    | admin_state_up | alive | ha_state |
+--------------------------------------+-------------------------+----------------+-------+----------+
| 3d07375f-3e6a-41bb-9712-f814c5008807 | networker-0.localdomain | True           | :-)   | active   |
| 61e270cb-586c-49df-b925-2d9d58bbe70d | networker-1.localdomain | True           | xxx   | standby  |
+--------------------------------------+-------------------------+----------------+-------+----------+

Attaching /var/log/neutron from networker-1.
Created attachment 1458253 [details] neutron.tar.gz
Note: this only seems to be happening when you have separate Networker nodes. I wasn't able to reproduce it with monolithic controllers.
OK, so what happens is that after the update and reboot, the hostname for the agents changes (note that the networker-1.localdomain agents are down while the networker-1 agents are up):

[stack@undercloud-0 ~]$ neutron agent-list
+--------------------------------------+--------------------+-------------------------+-------------------+-------+----------------+---------------------------+
| id                                   | agent_type         | host                    | availability_zone | alive | admin_state_up | binary                    |
+--------------------------------------+--------------------+-------------------------+-------------------+-------+----------------+---------------------------+
| 03e37225-5d81-4acc-87dd-287afdec9e09 | Metadata agent     | networker-1.localdomain |                   | xxx   | True           | neutron-metadata-agent    |
| 37d9c9a1-99ab-4744-aa78-5b3948743670 | Open vSwitch agent | networker-0.localdomain |                   | :-)   | True           | neutron-openvswitch-agent |
| 3c591ad5-8a2d-40db-9eb6-2d3a1251e201 | DHCP agent         | networker-0.localdomain | nova              | :-)   | True           | neutron-dhcp-agent        |
| 402cb673-47a8-43d7-b81c-69330c8d4e45 | Open vSwitch agent | compute-0.localdomain   |                   | :-)   | True           | neutron-openvswitch-agent |
| 496f9a91-5b51-486f-81a1-31b526ccbf65 | L3 agent           | networker-0.localdomain | nova              | :-)   | True           | neutron-l3-agent          |
| 55da449a-3d94-4165-89fc-499025050f25 | DHCP agent         | networker-1             | nova              | :-)   | True           | neutron-dhcp-agent        |
| 88e7f6aa-3520-4c52-8f63-f3dd3fd08172 | Metadata agent     | networker-0.localdomain |                   | :-)   | True           | neutron-metadata-agent    |
| 8a0c92ed-cdc7-4701-9fbf-86f3a4589fe9 | L3 agent           | networker-1.localdomain | nova              | xxx   | True           | neutron-l3-agent          |
| 94292a2f-d8e4-480a-bb21-e8df37f97154 | Open vSwitch agent | compute-4.localdomain   |                   | :-)   | True           | neutron-openvswitch-agent |
| 9816c1be-f4f9-46e3-b905-f8efaacb54b1 | Open vSwitch agent | networker-1.localdomain |                   | xxx   | True           | neutron-openvswitch-agent |
| 9be3b9ba-5cb1-42b9-a1de-f8ce1748d444 | Open vSwitch agent | networker-1             |                   | :-)   | True           | neutron-openvswitch-agent |
| bb3a5435-e9e2-4c01-b717-989d28c14486 | Open vSwitch agent | compute-2.localdomain   |                   | :-)   | True           | neutron-openvswitch-agent |
| d4783830-25d9-4a4e-955d-55e5670c5bb5 | DHCP agent         | networker-1.localdomain | nova              | xxx   | True           | neutron-dhcp-agent        |
| df386d89-d052-4052-8e23-2abe8ea03f4f | L3 agent           | networker-1             | nova              | :-)   | True           | neutron-l3-agent          |
| e24a5ce2-a0ee-4733-93ec-6c191190782f | Open vSwitch agent | compute-1.localdomain   |                   | :-)   | True           | neutron-openvswitch-agent |
| e27638e7-2b3c-4ba7-a293-3a79395ec47b | Metadata agent     | networker-1             |                   | :-)   | True           | neutron-metadata-agent    |
| e295f9e6-6180-4531-985e-786e61f2b4d8 | Open vSwitch agent | compute-3.localdomain   |                   | :-)   | True           | neutron-openvswitch-agent |
+--------------------------------------+--------------------+-------------------------+-------------------+-------+----------------+---------------------------+
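The duplicated hosts in the listing above (networker-1 registered both as a short name and as networker-1.localdomain) can also be detected mechanically. A sketch, with the host column position and the .localdomain suffix assumed from the table above:

```shell
# Hypothetical helper: read the table printed by `neutron agent-list` on
# stdin and print each host value that appears both as a short name and as
# the matching .localdomain FQDN -- the symptom shown in the listing above.
# The host column position (4th pipe-separated field) is an assumption.
dup_hosts() {
  awk -F'|' 'NF >= 8 { h = $4; gsub(/ /, "", h)
                       if (h != "" && h != "host") seen[h] = 1 }
             END { for (h in seen) if ((h ".localdomain") in seen) print h }'
}

# Usage sketch:  neutron agent-list | dup_hosts
```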
[root@networker-1 heat-admin]# grep -v ^# /etc/neutron/neutron.conf | grep -v ^$
[DEFAULT]
auth_strategy=keystone
core_plugin=ml2
service_plugins=router,qos,trunk
allow_overlapping_ips=True
host=networker-1
global_physnet_mtu=1496
dhcp_agents_per_network=2
debug=True
log_dir=/var/log/neutron
rpc_backend=rabbit
control_exchange=neutron
[agent]
root_helper=sudo neutron-rootwrap /etc/neutron/rootwrap.conf
[cors]
[cors.subdomain]
[database]
[keystone_authtoken]
[matchmaker_redis]
[nova]
[oslo_concurrency]
lock_path=$state_path/lock
[oslo_messaging_amqp]
[oslo_messaging_notifications]
[oslo_messaging_rabbit]
rabbit_hosts=172.17.1.17:5672,172.17.1.18:5672,172.17.1.27:5672
rabbit_use_ssl=False
rabbit_userid=guest
rabbit_password=8rabVQE2unyVyvb3uwmhBXPuV
rabbit_ha_queues=True
heartbeat_timeout_threshold=60
[oslo_messaging_zmq]
[oslo_middleware]
[oslo_policy]
[qos]
[quotas]
[ssl]
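The host=networker-1 value above is the short name. As a diagnostic aid, a sketch (the config path and the expectation that host= should match `hostname -f` are taken from the discussion in this bug) that extracts the configured host value so it can be compared against the node's FQDN:

```shell
# Hypothetical diagnostic helper: print the host= value from a
# neutron.conf-style file ($1 = path to the config file).
conf_host() {
  awk -F= '/^host[[:space:]]*=/ { gsub(/ /, "", $2); print $2; exit }' "$1"
}

# Usage sketch on a networker node:
#   [ "$(conf_host /etc/neutron/neutron.conf)" = "$(hostname -f)" ] \
#     || echo "host= in neutron.conf no longer matches the FQDN"
```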
Adding DFG:Upgrades as I think this is related to BZ#1499201
Moving Networking DFG to observer / secondary DFG. If the hostname changes on the agents that explains the problem, and seems to be more Upgrades DFG related. Marius do you think that RHBZ 1499201 might not have been resolved in all cases?
(In reply to Assaf Muller from comment #11)
> Moving Networking DFG to observer / secondary DFG. If the hostname changes
> on the agents that explains the problem, and seems to be more Upgrades DFG
> related. Marius do you think that RHBZ 1499201 might not have been resolved
> in all cases?

Yes, I could only spot this issue when using a Networker role, so I suspect it affects only deployments involving custom roles (not the monolithic controllers).
Hi,

so we definitely have the host change:

Jul 11 01:10:16 networker-0 os-collect-config[3201]: [2018-07-11 01:10:15,354] (heat-config) [DEBUG] [2018-07-11 01:09:53,868] (heat-config) [DEBUG] Running FACTER_heat_outputs_path="/var/run/heat-config/heat-config-puppet/c0d7893b-22e4-4e0f-9099-fbe275ccb2d7" FACTER_fqdn="networker-0.localdomain" FACTER_deploy_config_name="NetworkerDeployment_Step3" puppet apply --detailed-exitcodes --logdest console --modulepath /etc/puppet/modules:/opt/stack/puppet-modules:/usr/share/openstack-puppet/modules --debug --logdest /var/log/puppet/heat-debug.log /var/lib/heat-config/heat-config-puppet/c0d7893b-22e4-4e0f-9099-fbe275ccb2d7.pp

/Stage[main]/Neutron/Neutron_config[DEFAULT/host]/value: value changed ['networker-0.localdomain'] to ['networker-0']

Jul 11 01:10:16 networker-0 os-collect-config[3201]: Debug: Loading facts from /usr/share/openstack-puppet/modules/tripleo/lib/facter/current_config_hosts.rb

So current_config_hosts.rb is loaded, and it outputs networker-0 as the result of its "introspection", which is unexpected. I will need to deploy the same env to see what is happening in this context.

As a temporary fix, the workaround scripts in https://bugzilla.redhat.com/show_bug.cgi?id=1499201 should work here as well:
- change the host value in neutron.conf back to the previous one and restart neutron on the networker role
- use "Workaround to get l3ha routers rescheduled" to get the new value in.
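The first workaround step (putting the previous host value back into neutron.conf) can be sketched as a small helper. The config path, the value to restore, and the helper name are assumptions to be checked per node, and the neutron agents still need a restart afterwards:

```shell
# Hypothetical helper for the workaround above: rewrite the host= line of a
# neutron.conf-style file in place.
#   $1 = path to the config file
#   $2 = host value to restore (e.g. the pre-update FQDN)
set_neutron_host() {
  sed -i "s/^host[[:space:]]*=.*/host=$2/" "$1"
}

# Usage sketch, as root on the affected node, then restart the agents:
#   set_neutron_host /etc/neutron/neutron.conf networker-1.localdomain
```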
Hi,

so the networker replacement test doesn't seem to use the overcloud stack update command, but rather some form of added templates passed to a deploy command. So first, I wonder whether this change of testing protocol invalidates the patch, as we're still not sure that it works for update. Then I need logs, or better yet an environment, to be able to determine if the problem is similar to what we saw during update. Lastly, I would like to know which version of RHEL we're using for the OSP10 deployment.

Thanks,
Hi,

so here's the rpm build. If it's delivered as a hotfix, it has to be installed on all the overcloud nodes. One way to do that is:

------8<-------
RPM_PATH=[path to the rpm downloaded on the undercloud]
. ~/stackrc
openstack server list -f json > ~/server.json
jq -r '.[] | .Networks' ~/server.json | cut -d= -f2 > ~/ips.txt

for ip in $(cat ~/ips.txt); do scp $RPM_PATH heat-admin@${ip}: ; done
for ip in $(cat ~/ips.txt); do ssh heat-admin@${ip} yum install -y ./$(basename $RPM_PATH) ; done
------>8-------

There are other ways; this is just given as an example.
(In reply to Sofer Athlan-Guyot from comment #27)

Hi,

Mistake in the example script (missing sudo), sorry:

------8<-------
RPM_PATH=[path to the rpm downloaded on the undercloud]
. ~/stackrc
openstack server list -f json > ~/server.json
jq -r '.[] | .Networks' ~/server.json | cut -d= -f2 > ~/ips.txt

for ip in $(cat ~/ips.txt); do scp $RPM_PATH heat-admin@${ip}: ; done
for ip in $(cat ~/ips.txt); do ssh heat-admin@${ip} sudo yum install -y ./$(basename $RPM_PATH) ; done
------>8-------
*** Bug 1638303 has been marked as a duplicate of this bug. ***
Moving this to verified, as Marius has already tested the patch and the issue was solved when it shipped on the client side.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:3673
Hi,

please check comment [1], especially the part about the undercloud configuration needed so that scale-out nodes get a host parameter with the FQDN [2]. The fact that cloud-init sets it to the short name indicates that the undercloud may not be configured properly:

2019-04-02 14:31:43,422 - cc_set_hostname.py[DEBUG]: Setting the hostname to m1pl-st-comp0-12 (m1pl-st-comp0-12)
2019-04-02 14:31:43,422 - util.py[DEBUG]: Running command ['hostnamectl', 'set-hostname', 'm1pl-st-comp0-12'] with allowed return codes [0] (shell=False, capture=True)

So first check the undercloud configuration and make sure it matches what is described in [2]. Please report your findings. For better tracking, I think it would be better to open a new bz, something like "scale out compute node has short name host parameter, blocking ffu".

Thanks,

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1657692#c21
[2] https://access.redhat.com/solutions/2089051