Bug 1264690
| Summary: | neutron service is not running after a reboot | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | bigswitch <rhosp-bugs-internal> |
| Component: | rhosp-director | Assignee: | Hugh Brock <hbrock> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | yeylon <yeylon> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 7.0 (Kilo) | CC: | abeekhof, amuller, chrisw, dblack, ekuris, fdinitto, jguiditt, mburns, michele, morazi, nyechiel, rhel-osp-director-maint, rhosp-bugs-internal, srevivo, tfreger, yeylon |
| Target Milestone: | --- | Keywords: | ZStream |
| Target Release: | 8.0 (Liberty) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-02-28 07:46:22 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description (bigswitch, 2015-09-20 18:51:21 UTC)
Changed the title/description: the neutron server is not running on two controller nodes. After starting those services, the namespace shows up in `ip netns`; however, the IP configuration is still messed up on controller-0.

On controller-0:

```
[root@overcloud-controller-0 heat-admin]# ip netns exec qdhcp-0e4dc72c-343f-49e9-98cc-a77e9311c280 ifconfig
ns-025443dd-f7: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.1.1.3  netmask 255.255.255.0  broadcast 10.1.1.255
        inet6 fe80::f816:3eff:fe35:3d15  prefixlen 64  scopeid 0x20<link>
        ether fa:16:3e:35:3d:15  txqueuelen 1000  (Ethernet)
        RX packets 2370  bytes 142350 (139.0 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 11  bytes 774 (774.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
```

On controller-1:

```
[root@overcloud-controller-1 heat-admin]# ip netns exec qdhcp-0e4dc72c-343f-49e9-98cc-a77e9311c280 ifconfig
lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 0  (Local Loopback)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

ns-064f2a04-a3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.1.1.4  netmask 255.255.255.0  broadcast 10.1.1.255
        inet6 fe80::f816:3eff:fef9:986  prefixlen 64  scopeid 0x20<link>
        ether fa:16:3e:f9:09:86  txqueuelen 1000  (Ethernet)
        RX packets 107  bytes 6588 (6.4 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 107  bytes 4806 (4.6 KiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
```

On controller-2:

```
[root@overcloud-controller-2 heat-admin]# ip netns exec qdhcp-0e4dc72c-343f-49e9-98cc-a77e9311c280 ifconfig
lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 0  (Local Loopback)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

ns-68de1b0f-cd: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.1.1.2  netmask 255.255.255.0  broadcast 10.1.1.255
        inet6 fe80::f816:3eff:fe4e:b72  prefixlen 64  scopeid 0x20<link>
        ether fa:16:3e:4e:0b:72  txqueuelen 1000  (Ethernet)
        RX packets 106  bytes 6528 (6.3 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 8  bytes 648 (648.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
```

---

What was triggering the reboot of all nodes? Do we have any status information from Pacemaker during this interaction?

---

Some questions. First, if this is related to the BZ you list above, is it the same setup with a later problem? If so, there could be additional complications with neutron due to the rabbit failure. Next, how did you restart the services after the reboot: `pcs resource cleanup <name>`? If not, what did you use? Can you give us the output of `pcs status` and `pcs constraint`, as well as the crm_report?

---

I attempted to recover the setup by rebooting all three controller nodes after bugzilla 1264688; it is the same setup with a later problem. I did a `systemctl start neutron-server` on the controller nodes.

---

Re-adding needinfo. Can you provide the `pcs status` and `pcs constraint` output? Also, just a note: systemctl should not be used. You should be using `pcs resource cleanup <name>` and then `pcs resource start <name>`.
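For reference, a minimal sketch of that pcs-based recovery flow, using `neutron-server-clone` as the resource name (it appears in the `pcs resource show` output later in this report); any other Pacemaker-managed service would be handled the same way:

```bash
# Supported way to recover a failed Pacemaker-managed OpenStack service.
# Do NOT start the unit with systemctl; Pacemaker will not track it.

# Clear the recorded failure state for the resource:
pcs resource cleanup neutron-server-clone

# Ask Pacemaker to start the resource on all eligible nodes:
pcs resource start neutron-server-clone

# Confirm the clone set reports "Started" on every controller:
pcs status
```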
---

```
Sep 20 14:17:03 overcloud-controller-1 pengine[3163]: warning: unpack_rsc_op_failure: Processing failed op stop for openstack-cinder-volume on overcloud-controller-2: OCF_TIMEOUT (198)
Sep 20 14:17:03 overcloud-controller-1 pengine[3163]: warning: unpack_rsc_op_failure: Processing failed op stop for openstack-cinder-volume on overcloud-controller-2: OCF_TIMEOUT (198)
```

Failed stops with no fencing == unsupportable.

Whatever else is going on, there are a bunch of missing constraints, as covered in bug #1257414.

---

(In reply to Andrew Beekhof from comment #9)
> Whatever else is going on, there are a bunch of missing constraints as
> covered in bug #1257414

Andrew, thanks for checking. The root problems are:

1) missing constraints, as described in bug #1257414
2) missing fencing configuration

#1 specifically will break any stop/restart actions. You can apply those fixes manually as described in that bugzilla; they will also be part of the new OSPd update/release.
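To make the shape of those two fixes concrete: the authoritative constraint list is in bug #1257414, and the fencing device parameters below (IPMI address, credentials) are placeholders for whatever matches the actual hardware, so treat this strictly as an illustration:

```bash
# 1) Missing ordering constraints (see bug #1257414 for the full,
#    authoritative set). Illustrative example: make neutron-server
#    start only after rabbitmq is up (and stop before it goes down):
pcs constraint order start rabbitmq-clone then neutron-server-clone

# 2) Missing fencing: define a STONITH device per controller and
#    enable fencing cluster-wide. All fence_ipmilan parameters here
#    are placeholders.
pcs stonith create controller-0-ipmi fence_ipmilan \
    pcmk_host_list="overcloud-controller-0" \
    ipaddr="192.0.2.201" login="admin" passwd="changeme" lanplus="true"
pcs property set stonith-enabled=true
```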
---

Did the following steps:

```
[heat-admin@overcloud-controller-1 ~]$ sudo pcs resource cleanup
All resources/stonith devices successfully cleaned up
```

Restarted neutron-l3-agent:

```
[heat-admin@overcloud-controller-1 ~]$ sudo systemctl status neutron-l3-agent
neutron-l3-agent.service - Cluster Controlled neutron-l3-agent
   Loaded: loaded (/usr/lib/systemd/system/neutron-l3-agent.service; disabled)
  Drop-In: /run/systemd/system/neutron-l3-agent.service.d
           └─50-pacemaker.conf
   Active: failed (Result: exit-code) since Wed 2015-09-30 18:35:24 EDT; 1min 10s ago
 Main PID: 17916 (code=exited, status=1/FAILURE)

Sep 30 18:30:17 overcloud-controller-1.localdomain systemd[1]: Started Cluster Controlled neutron-l3-agent.
Sep 30 18:35:24 overcloud-controller-1.localdomain systemd[1]: neutron-l3-agent.service: main process exited, code=exited, status=1/FAILURE
Sep 30 18:35:24 overcloud-controller-1.localdomain systemd[1]: Unit neutron-l3-agent.service entered failed state.
```

```
[heat-admin@overcloud-controller-1 ~]$ sudo su
[heat-admin@overcloud-controller-1 ~]$ sudo pcs resource show
 ip-172.17.0.11	(ocf::heartbeat:IPaddr2):	Started
 ip-192.0.2.12	(ocf::heartbeat:IPaddr2):	Started
 Clone Set: haproxy-clone [haproxy]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 ip-172.18.0.10	(ocf::heartbeat:IPaddr2):	Started
 ip-10.17.66.11	(ocf::heartbeat:IPaddr2):	Started
 ip-172.17.0.10	(ocf::heartbeat:IPaddr2):	Started
 Master/Slave Set: galera-master [galera]
     Masters: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 ip-172.19.0.10	(ocf::heartbeat:IPaddr2):	Started
 Master/Slave Set: redis-master [redis]
     Masters: [ overcloud-controller-2 ]
     Slaves: [ overcloud-controller-0 overcloud-controller-1 ]
 Clone Set: mongod-clone [mongod]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: memcached-clone [memcached]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-nova-scheduler-clone [openstack-nova-scheduler]
     openstack-nova-scheduler	(systemd:openstack-nova-scheduler):	FAILED
     Started: [ overcloud-controller-0 overcloud-controller-1 ]
 Clone Set: neutron-l3-agent-clone [neutron-l3-agent]
     Started: [ overcloud-controller-0 overcloud-controller-2 ]
     Stopped: [ overcloud-controller-1 ]
 Clone Set: openstack-ceilometer-alarm-notifier-clone [openstack-ceilometer-alarm-notifier]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-heat-engine-clone [openstack-heat-engine]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-api-clone [openstack-ceilometer-api]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-metadata-agent-clone [neutron-metadata-agent]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-ovs-cleanup-clone [neutron-ovs-cleanup]
     Started: [ overcloud-controller-1 overcloud-controller-2 ]
     Stopped: [ overcloud-controller-0 ]
 Clone Set: neutron-netns-cleanup-clone [neutron-netns-cleanup]
     Started: [ overcloud-controller-1 overcloud-controller-2 ]
     Stopped: [ overcloud-controller-0 ]
 Clone Set: openstack-heat-api-clone [openstack-heat-api]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-cinder-scheduler-clone [openstack-cinder-scheduler]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-nova-api-clone [openstack-nova-api]
     openstack-nova-api	(systemd:openstack-nova-api):	FAILED
     Stopped: [ overcloud-controller-0 overcloud-controller-1 ]
 Clone Set: openstack-heat-api-cloudwatch-clone [openstack-heat-api-cloudwatch]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-collector-clone [openstack-ceilometer-collector]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-keystone-clone [openstack-keystone]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-nova-consoleauth-clone [openstack-nova-consoleauth]
     openstack-nova-consoleauth	(systemd:openstack-nova-consoleauth):	FAILED
     openstack-nova-consoleauth	(systemd:openstack-nova-consoleauth):	FAILED
     openstack-nova-consoleauth	(systemd:openstack-nova-consoleauth):	FAILED
 Clone Set: openstack-glance-registry-clone [openstack-glance-registry]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-notification-clone [openstack-ceilometer-notification]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-cinder-api-clone [openstack-cinder-api]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-dhcp-agent-clone [neutron-dhcp-agent]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-glance-api-clone [openstack-glance-api]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-openvswitch-agent-clone [neutron-openvswitch-agent]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-nova-novncproxy-clone [openstack-nova-novncproxy]
     openstack-nova-novncproxy	(systemd:openstack-nova-novncproxy):	FAILED
     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: delay-clone [delay]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-server-clone [neutron-server]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: httpd-clone [httpd]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-central-clone [openstack-ceilometer-central]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-alarm-evaluator-clone [openstack-ceilometer-alarm-evaluator]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-heat-api-cfn-clone [openstack-heat-api-cfn]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 openstack-cinder-volume	(systemd:openstack-cinder-volume):	Stopped
 Clone Set: openstack-nova-conductor-clone [openstack-nova-conductor]
     openstack-nova-conductor	(systemd:openstack-nova-conductor):	FAILED
     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]
```
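As an aside, the diagnostics requested in the earlier comments could be collected roughly as follows; the crm_report time window and output path here are examples, not values from this bug:

```bash
# Collect the cluster diagnostics requested above.
pcs status                      # overall node and resource state
pcs constraint --full           # ordering/colocation constraints, with IDs

# Bundle cluster logs and state since a given time into an archive
# (timestamp and destination are examples):
crm_report -f "2015-09-20 00:00:00" /tmp/bz1264690-report
```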
The neutron-l3-agent log at the time of the failure:

```
2015-09-30 18:34:23.839 17916 DEBUG oslo_messaging._drivers.amqpdriver [req-102987a4-c9a1-4412-9975-32bf95a20daf ] MSG_ID is 5d8139a41dd740f5b6adbc6084480f2e _send /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py:311
2015-09-30 18:34:23.839 17916 DEBUG oslo_messaging._drivers.amqp [req-102987a4-c9a1-4412-9975-32bf95a20daf ] UNIQUE_ID is f9c244d7ae21450d91817cec3cc05b1c. _add_unique_id /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqp.py:258
2015-09-30 18:35:18.646 17916 DEBUG oslo_concurrency.lockutils [-] Lock "_check_child_processes" acquired by "_check_child_processes" :: waited 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:444
2015-09-30 18:35:18.646 17916 DEBUG oslo_concurrency.lockutils [-] Lock "_check_child_processes" released by "_check_child_processes" :: held 0.001s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:456
2015-09-30 18:35:24.510 17916 CRITICAL neutron [req-102987a4-c9a1-4412-9975-32bf95a20daf ] MessagingTimeout: Timed out waiting for a reply to message ID 5d8139a41dd740f5b6adbc6084480f2e
2015-09-30 18:35:24.510 17916 TRACE neutron Traceback (most recent call last):
2015-09-30 18:35:24.510 17916 TRACE neutron   File "/usr/bin/neutron-l3-agent", line 10, in <module>
2015-09-30 18:35:24.510 17916 TRACE neutron     sys.exit(main())
2015-09-30 18:35:24.510 17916 TRACE neutron   File "/usr/lib/python2.7/site-packages/neutron/cmd/eventlet/agents/l3.py", line 17, in main
2015-09-30 18:35:24.510 17916 TRACE neutron     l3_agent.main()
2015-09-30 18:35:24.510 17916 TRACE neutron   File "/usr/lib/python2.7/site-packages/neutron/agent/l3_agent.py", line 53, in main
2015-09-30 18:35:24.510 17916 TRACE neutron     manager=manager)
2015-09-30 18:35:24.510 17916 TRACE neutron   File "/usr/lib/python2.7/site-packages/neutron/service.py", line 264, in create
2015-09-30 18:35:24.510 17916 TRACE neutron     periodic_fuzzy_delay=periodic_fuzzy_delay)
2015-09-30 18:35:24.510 17916 TRACE neutron   File "/usr/lib/python2.7/site-packages/neutron/service.py", line 197, in __init__
2015-09-30 18:35:24.510 17916 TRACE neutron     self.manager = manager_class(host=host, *args, **kwargs)
2015-09-30 18:35:24.510 17916 TRACE neutron   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 548, in __init__
2015-09-30 18:35:24.510 17916 TRACE neutron     super(L3NATAgentWithStateReport, self).__init__(host=host, conf=conf)
2015-09-30 18:35:24.510 17916 TRACE neutron   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 208, in __init__
2015-09-30 18:35:24.510 17916 TRACE neutron     continue
2015-09-30 18:35:24.510 17916 TRACE neutron   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 85, in __exit__
2015-09-30 18:35:24.510 17916 TRACE neutron     six.reraise(self.type_, self.value, self.tb)
2015-09-30 18:35:24.510 17916 TRACE neutron   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 188, in __init__
2015-09-30 18:35:24.510 17916 TRACE neutron     self.plugin_rpc.get_service_plugin_list(self.context))
2015-09-30 18:35:24.510 17916 TRACE neutron   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 124, in get_service_plugin_list
2015-09-30 18:35:24.510 17916 TRACE neutron     return cctxt.call(context, 'get_service_plugin_list')
2015-09-30 18:35:24.510 17916 TRACE neutron   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 156, in call
2015-09-30 18:35:24.510 17916 TRACE neutron     retry=self.retry)
2015-09-30 18:35:24.510 17916 TRACE neutron   File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 90, in _send
2015-09-30 18:35:24.510 17916 TRACE neutron     timeout=timeout, retry=retry)
2015-09-30 18:35:24.510 17916 TRACE neutron   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 350, in send
2015-09-30 18:35:24.510 17916 TRACE neutron     retry=retry)
2015-09-30 18:35:24.510 17916 TRACE neutron   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 339, in _send
2015-09-30 18:35:24.510 17916 TRACE neutron     result = self._waiter.wait(msg_id, timeout)
2015-09-30 18:35:24.510 17916 TRACE neutron   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 243, in wait
2015-09-30 18:35:24.510 17916 TRACE neutron     message = self.waiters.get(msg_id, timeout=timeout)
2015-09-30 18:35:24.510 17916 TRACE neutron   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 149, in get
2015-09-30 18:35:24.510 17916 TRACE neutron     'to message ID %s' % msg_id)
2015-09-30 18:35:24.510 17916 TRACE neutron MessagingTimeout: Timed out waiting for a reply to message ID 5d8139a41dd740f5b6adbc6084480f2e
2015-09-30 18:35:24.510 17916 TRACE neutron
```

---

We still need you to manually apply the fixes from the bug mentioned in comment #10 and configure fencing. The message you see is potentially related to the fact that neutron (and other services) are missing a start order dependency on rabbitmq: on restart, rabbitmq can be shut down before neutron has finished stopping. Please apply the fixes, configure fencing, and clean up those failed services.

---

Hi, does the 7.1 GA code have the fixes? Thanks

---

(In reply to bigswitch from comment #13)
> Hi,
>
> Does 7.1 GA code has the fixes?
>
> Thanks

You need to have:

fence-agents-4.0.11-13.el7_1.1 (or greater)
pacemaker-1.1.12-22.el7_1.4.x86_64 (or greater)
resource-agents-3.9.5-40.el7_1.5.x86_64 (or greater)

Those updates shipped after 7.1 GA, but they are available in the normal update channels. Did you configure fencing and apply the fixes from comment #10?
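A quick way to compare the installed builds against those minimums might look like this (a sketch; run on each controller):

```bash
# List the installed versions of the relevant packages:
rpm -qa | grep -E '^(fence-agents|pacemaker|resource-agents)'

# If anything is older than the minimums above, update from the
# normal channels:
yum update fence-agents\* pacemaker resource-agents
```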
"/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 339, in _send 2015-09-30 18:35:24.510 17916 TRACE neutron result = self._waiter.wait(msg_id, timeout) 2015-09-30 18:35:24.510 17916 TRACE neutron File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 243, in wait 2015-09-30 18:35:24.510 17916 TRACE neutron message = self.waiters.get(msg_id, timeout=timeout) 2015-09-30 18:35:24.510 17916 TRACE neutron File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 149, in get 2015-09-30 18:35:24.510 17916 TRACE neutron 'to message ID %s' % msg_id) 2015-09-30 18:35:24.510 17916 TRACE neutron MessagingTimeout: Timed out waiting for a reply to message ID 5d8139a41dd740f5b6adbc6084480f2e 2015-09-30 18:35:24.510 17916 TRACE neutron We still need you to apply manually the fixes from bug mentioned in comment #10 and configure fencing. The message you see is potentially related to the fact that neutron (and other services) are missing a start order dependency on rabbitmq. On restart rabbitmq can be shutdown before neutron has done stopping. Please apply the fixes, configure fencing, cleanup those failed services. Hi, Does 7.1 GA code has the fixes? Thanks (In reply to bigswitch from comment #13) > Hi, > > Does 7.1 GA code has the fixes? > > Thanks you need to have: fence-agents-4.0.11-13.el7_1.1 (or greater) pacemaker-1.1.12-22.el7_1.4.x86_64 (or greater) resource-agents-3.9.5-40.el7_1.5.x86_64 (or greater) those updates have shipped after 7.1GA but they are available in the normal update channels. Did you configure fencing and apply the fixes from comment #10? Just a note, because I suspect that the question was actually OSPd 7.1, not RHEL 7.1 In the OSPd 7.1 image: pacemaker-1.1.12-22.el7_1.4 resource-agents-3.9.5-40.el7_1.9 fence-agents-4.0.11-13.el7_1.2 This meets or exceeds the minimums in comment 15 saw the issue in ospd ans osp7 puddle: 2015-10-16.2 This should be fixed in OSP 7.3 and 8.0. Please re-open if you find the behavior again. The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days |