2080952 – nova-compute service fails to come up after controller reboot

Summary: nova-compute service fails to come up after controller reboot

Keywords:
Status:	CLOSED DUPLICATE of bug 2115383
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	rabbitmq-server
Sub Component:
Version:	17.0 (Wallaby)
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Peter Lemenkov
QA Contact:	dabarzil
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2022-05-02 12:52 UTC by Ketan Mehta
Modified:	2023-12-02 04:25 UTC
CC List:	17 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2022-08-18 07:17:34 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Issue Tracker	OSP-14975	0	None	None	None	2022-05-02 13:18:04 UTC

Description Ketan Mehta 2022-05-02 12:52:09 UTC

Description of problem:

If the pcs cluster is stopped on all nodes and the controllers are then rebooted, the cluster and all its services come up.

However, the nova-compute services fail to come up, either on all nodes or on a random subset of nodes.

The error log suggests that nova-compute is unable to connect to AMQP; the connection attempt fails with ConnectionRefusedError.

~~~
2022-05-02 10:02:57.704 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-05-02 10:02:57.710 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 8.0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-05-02 10:02:57.716 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-05-02 10:02:57.723 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-05-02 10:02:57.729 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 8.0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-05-02 10:02:59.716 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-05-02 10:02:59.723 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-05-02 10:02:59.729 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 6.0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-05-02 10:03:05.734 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-05-02 10:03:05.750 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
~~~
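
To summarize a longer nova-compute log of this kind, the retry lines can be parsed programmatically. The following is a minimal sketch only: the regular expression is an assumption derived from the sample lines above, and `summarize_retries` and `SAMPLE` are hypothetical names, not part of nova or oslo.messaging.

```python
import re
from collections import Counter

# Pattern inferred from the sample log lines in this report (an assumption,
# not a documented oslo.messaging log format).
RETRY_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) \d+ ERROR "
    r"oslo\.messaging\._drivers\.impl_rabbit .*"
    r"\[Errno (?P<errno>\d+)\] (?P<name>\w+) "
    r"\(retrying in (?P<delay>[\d.]+) seconds\)"
)


def summarize_retries(lines):
    """Count connection failures per error name and collect retry delays."""
    errors = Counter()
    delays = []
    for line in lines:
        match = RETRY_RE.search(line)
        if match is None:
            continue  # not an impl_rabbit retry line
        errors[match.group("name")] += 1
        delays.append(float(match.group("delay")))
    return errors, delays


# Two lines copied from the log excerpt above.
SAMPLE = [
    "2022-05-02 10:02:57.704 2 ERROR oslo.messaging._drivers.impl_rabbit [-] "
    "Connection failed: [Errno 111] ECONNREFUSED (retrying in 0 seconds): "
    "ConnectionRefusedError: [Errno 111] ECONNREFUSED",
    "2022-05-02 10:02:57.710 2 ERROR oslo.messaging._drivers.impl_rabbit [-] "
    "Connection failed: [Errno 111] ECONNREFUSED (retrying in 8.0 seconds): "
    "ConnectionRefusedError: [Errno 111] ECONNREFUSED",
]

errors, delays = summarize_retries(SAMPLE)
print(dict(errors), delays)
```

A summary that shows only ECONNREFUSED, with no timeouts or authentication errors, would be consistent with nothing accepting connections at the configured AMQP address at that time.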

Alternatively, the nova_compute container sometimes remains in an unhealthy state, writes no logs, and stays down until it is restarted, which fixes the issue.

Version-Release number of selected component (if applicable):

# rpm -qa |grep -i nova
openstack-nova-common-23.2.1-0.20220317230946.9609ae0.el9ost.noarch
openstack-nova-compute-23.2.1-0.20220317230946.9609ae0.el9ost.noarch
openstack-nova-migration-23.2.1-0.20220317230946.9609ae0.el9ost.noarch

[root@controller-2 /]# rpm -qa |grep -i rabbit
rabbitmq-server-3.9.10-2.el9ost.x86_64

How reproducible:


Steps to Reproduce:
1. Stop pcs cluster on all nodes (pcs cluster stop --all)
2. Reboot the controllers
3. Check the compute service status (openstack compute service list)

Actual results:

nova-compute service remains down on all or some nodes.

Expected results:

nova-compute service should come up.

Additional info:
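
ECONNREFUSED means the TCP connection attempt was actively rejected, which generally indicates that nothing was listening at the target address and port, rather than packets being silently dropped. A quick way to tell these cases apart from a compute node is a plain TCP probe. The sketch below is illustrative only: `probe_tcp` is a hypothetical helper, and the host and port to test must come from the deployment's own transport configuration.

```python
import errno
import socket


def probe_tcp(host, port, timeout=3.0):
    """Classify a TCP connection attempt.

    Returns "open" if the connection succeeds, "refused" if it is actively
    rejected (ECONNREFUSED), "timeout" if no answer arrives in time, or
    "error: <reason>" for any other socket-level failure.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        # Covers name-resolution failures, unreachable networks, and so on.
        return f"error: {errno.errorcode.get(exc.errno, exc.errno)}"
```

Calling `probe_tcp` against each controller's messaging address shortly after the reboot would show whether the port is "refused" (no listener yet) or "timeout" (more suggestive of filtering or routing). RabbitMQ's default AMQP port is 5672, but the port actually used in this environment is an assumption to verify.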

Comment 2 Artom Lifshitz 2022-05-02 17:00:15 UTC

Hi Ketan,

I'm not sure where to redirect you, but I don't think DFG:Compute (aka the Nova team) can help here. The issue is clearly that nova-compute cannot connect to the message queue, but why this is happening is anyone's guess. Could be a network issue, a VIP issue, a firewall issue, I'm not sure. Maybe try DFG:PIDONE for initial triage? I'm going to close this as NOTABUG because there is no Nova bug here, and because I want to get this off our triage list. When you re-open this, please change the component and/or the Internal Whiteboard field to a more appropriate team.

Thanks in advance.

Comment 3 Ketan Mehta 2022-05-04 06:06:31 UTC

Thanks Artom, I'll reopen it with PIDONE for rabbitmq for the initial triage.

Comment 4 Luca Miccini 2022-05-04 07:18:05 UTC

Hey Ketan, do you have the logs/sosreports somewhere?

Comment 14 Red Hat Bugzilla 2023-12-02 04:25:44 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days
