Bug 1690955 - nova compute service doesn't start after overcloud reboot
Summary: nova compute service doesn't start after overcloud reboot
Keywords:
Status: CLOSED DUPLICATE of bug 1592528
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: RHOS Maint
QA Contact: Sasha Smolyak
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-03-20 14:35 UTC by Victor Voronkov
Modified: 2019-06-06 12:19 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-06 12:19:38 UTC
Target Upstream Version:
Embargoed:



Description Victor Voronkov 2019-03-20 14:35:06 UTC
Description of problem:
A sequence of scaling up compute-1, rebooting the overcloud, scaling down compute-0, and rebooting again causes the nova-compute service on compute-1 not to start after the second reboot.

Version-Release number of selected component (if applicable):


How reproducible:
Rerun the Jenkins job.

Steps to Reproduce (a rough command sketch follows the list):
1. Scale up compute-1
2. Reboot the overcloud nodes
3. Scale down compute-0
4. Reboot the overcloud nodes again
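
For context, a rough sketch of what these steps look like as OSP 13 director commands; the environment file, stack name, and node ID below are illustrative placeholders, not taken from this report:

# On the undercloud: scale up by raising ComputeCount and re-running the deploy
(undercloud) $ openstack overcloud deploy --templates \
    -e ~/templates/node-count.yaml      # assumed env file that sets ComputeCount
# Reboot the overcloud nodes, e.g. on each node:
$ sudo reboot
# Scale down by removing compute-0 from the stack
(undercloud) $ openstack overcloud node delete --stack overcloud <compute-0-nova-uuid>
# Reboot again and check whether nova-compute comes back on compute-1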

Actual results:
The nova-compute service on compute-1 fails to start.

Expected results:
The nova-compute service should start normally after the second reboot.

Additional info:

from nova-compute.log on compute-1:

2019-03-16 01:56:31.878 1 ERROR oslo.messaging._drivers.impl_rabbit [req-996c9093-ddda-481e-a65c-5fe10e446cbc - - - - -] [c15e8746-ce5f-4644-a1fd-1b32bf9fabd1] AMQP server on controller-0.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds. Client port: None: timeout: timed out
2019-03-16 01:56:32.892 1 INFO oslo.messaging._drivers.impl_rabbit [req-996c9093-ddda-481e-a65c-5fe10e446cbc - - - - -] [c15e8746-ce5f-4644-a1fd-1b32bf9fabd1] Reconnected to AMQP server on controller-0.internalapi.localdomain:5672 via [amqp] client with port 36268.
2019-03-16 01:58:32.894 1 ERROR oslo.messaging._drivers.impl_rabbit [req-996c9093-ddda-481e-a65c-5fe10e446cbc - - - - -] [c15e8746-ce5f-4644-a1fd-1b32bf9fabd1] AMQP server on controller-0.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds. Client port: None: timeout: timed out
2019-03-16 01:58:33.911 1 INFO oslo.messaging._drivers.impl_rabbit [req-996c9093-ddda-481e-a65c-5fe10e446cbc - - - - -] [c15e8746-ce5f-4644-a1fd-1b32bf9fabd1] Reconnected to AMQP server on controller-0.internalapi.localdomain:5672 via [amqp] client with port 36278.
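
(A hypothetical way to confirm the symptom; these commands and paths are assumptions, not part of the original report:)

# With overcloud credentials, check whether the compute service is reported up
$ source ~/overcloudrc
$ openstack compute service list --service nova-compute
# On compute-1, check the container state and tail the log quoted above
$ sudo docker ps -a --filter name=nova_compute
$ sudo tail -n 50 /var/log/containers/nova/nova-compute.log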

In the rabbitmq log on controller-0 we see that, after the controllers and then compute-1 were rebooted, rabbitmq was restarted again at 01:46 (probably by Pacemaker?), followed later by errors such as:

=ERROR REPORT==== 16-Mar-2019::01:56:30 ===
Discarding message {'$gen_cast',{deliver,{delivery,false,true,<29642.28738.0>,{basic_message,{resource,<<"/">>,exchange,<<"q-server-resource-versions_fanout">>},[<<>>],{content,60,{'P_basic',<<"application/json">>,<<"utf-8">>,[],2,0,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined},<<248,0,16,97,112,112,108,105,99,97,116,105,111,110,47,106,115,111,110,5,117,116,102,45,56,0,0,0,0,2,0>>,rabbit_framing_amqp_0_9_1,[<<"{\"oslo.message\": \"{\\"_context_domain\\": null, \\"_context_request_id\\": \\"req-491a737c-388e-43bc-b548-0f5d5d965db3\\", \\"_context_global_request_id\\": null, \\"_context_auth_token\\": null, \\"_context_resource_uuid\\": null, \\"_context_tenant_name\\": null, \\"_context_user\\": null, \\"_context_user_id\\": null, \\"_context_show_deleted\\": false, \\"_context_is_admin\\": true, \\"version\\": \\"1.0\\", \\"_context_project_domain\\": null, \\"_context_timestamp\\": \\"2019-03-16 01:39:49.059025\\", \\"method\\": \\"report_agent_resource_versions\\", \\"_context_project\\": null, \\"_context_roles\\": [], \\"args\\": {\\"version_map\\": {\\"Subnet\\": \\"1.0\\", \\"Network\\": \\"1.0\\", \\"SubPort\\": \\"1.0\\", \\"SecurityGroup\\": \\"1.0\\", \\"SecurityGroupRule\\": \\"1.0\\", \\"Trunk\\": \\"1.1\\", \\"QosPolicy\\": \\"1.7\\", \\"Port\\": \\"1.1\\", \\"Log\\": \\"1.0\\"}, \\"agent_type\\": \\"Open vSwitch agent\\", \\"agent_host\\": \\"controller-1.localdomain\\"}, \\"_unique_id\\": \\"7233c509dd1a4d6cb8ab79583a060843\\", \\"_context_tenant_id\\": null, \\"_context_is_admin_project\\": true, \\"_context_project_name\\": null, \\"_context_user_identity\\": \\"- - - - -\\", \\"_context_tenant\\": null, \\"_context_project_id\\": null, \\"_context_read_only\\": false, \\"_context_user_domain\\": null, \\"_context_user_name\\": null}\", \"oslo.version\": \"2.0\"}">>]},<<124,165,169,40,69,139,221,178,23,12,115,15,163,108,192,137>>,true},1,flow},false}} from <0.1081.0> to <0.1836.0> in an old incarnation (1) of this node (2)


=WARNING REPORT==== 16-Mar-2019::01:56:31 ===
closing AMQP connection <0.7675.0> (172.17.1.23:36206 -> 172.17.1.16:5672 - nova-compute:1:c15e8746-ce5f-4644-a1fd-1b32bf9fabd1, vhost: '/', user: 'guest'):
client unexpectedly closed TCP connection
...
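
(A hypothetical way to check on controller-0 whether Pacemaker restarted the rabbitmq bundle and whether the cluster re-formed; commands are assumed, not taken from this report:)

# Pacemaker view of the rabbitmq bundle
$ sudo pcs status | grep -B 1 -A 5 rabbitmq
# Cluster status from inside the rabbitmq bundle container
$ sudo docker exec $(sudo docker ps -q --filter name=rabbitmq-bundle) rabbitmqctl cluster_status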


from openvswitch-agent.log on compute-1:

2019-03-16 01:56:35.797 7530 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [-] Failed reporting state!: MessagingTimeout: Timed out waiting for a reply to message ID bc0ede5cb23944b79a75b6d09c090759
2019-03-16 01:56:35.797 7530 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent Traceback (most recent call last):
2019-03-16 01:56:35.797 7530 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 319, in _report_state
2019-03-16 01:56:35.797 7530 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     True)
2019-03-16 01:56:35.797 7530 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/neutron/agent/rpc.py", line 93, in report_state
2019-03-16 01:56:35.797 7530 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     return method(context, 'report_state', **kwargs)
2019-03-16 01:56:35.797 7530 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 174, in call
2019-03-16 01:56:35.797 7530 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     retry=self.retry)
2019-03-16 01:56:35.797 7530 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 131, in _send
2019-03-16 01:56:35.797 7530 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     timeout=timeout, retry=retry)
2019-03-16 01:56:35.797 7530 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 559, in send
2019-03-16 01:56:35.797 7530 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     retry=retry)
2019-03-16 01:56:35.797 7530 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 548, in _send
2019-03-16 01:56:35.797 7530 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     result = self._waiter.wait(msg_id, timeout)
2019-03-16 01:56:35.797 7530 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 440, in wait
2019-03-16 01:56:35.797 7530 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     message = self.waiters.get(msg_id, timeout=timeout)
2019-03-16 01:56:35.797 7530 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 328, in get
2019-03-16 01:56:35.797 7530 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     'to message ID %s' % msg_id)
2019-03-16 01:56:35.797 7530 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent MessagingTimeout: Timed out waiting for a reply to message ID bc0ede5cb23944b79a75b6d09c090759
2019-03-16 01:56:35.797 7530 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent
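
(The agent is failing to report state over the same AMQP transport; a hypothetical server-side check, with an assumed hostname:)

$ source ~/overcloudrc
$ openstack network agent list --host compute-1.localdomain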

Comment 3 Chris Jones 2019-06-06 12:19:38 UTC

*** This bug has been marked as a duplicate of bug 1592528 ***

