Bug 2080952 - nova-compute service fails to come up after controller reboot
Summary: nova-compute service fails to come up after controller reboot
Keywords:
Status: CLOSED DUPLICATE of bug 2115383
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rabbitmq-server
Version: 17.0 (Wallaby)
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Peter Lemenkov
QA Contact: dabarzil
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-05-02 12:52 UTC by Ketan Mehta
Modified: 2023-12-02 04:25 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-18 07:17:34 UTC
Target Upstream Version:
Embargoed:



Links
System                 ID         Private  Priority  Status  Summary  Last Updated
Red Hat Issue Tracker  OSP-14975  0        None      None    None     2022-05-02 13:18:04 UTC

Description Ketan Mehta 2022-05-02 12:52:09 UTC
Description of problem:

If the pcs cluster is stopped on all nodes and the controllers are then rebooted, the cluster and all its services come up.

However, the nova-compute services fail to come up either on all nodes or on a random number of nodes.

The error log suggests that nova-compute is unable to connect to AMQP and that the connection attempt fails with ConnectionRefusedError.

~~~
2022-05-02 10:02:57.704 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-05-02 10:02:57.710 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 8.0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-05-02 10:02:57.716 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-05-02 10:02:57.723 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-05-02 10:02:57.729 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 8.0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-05-02 10:02:59.716 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-05-02 10:02:59.723 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-05-02 10:02:59.729 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 6.0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-05-02 10:03:05.734 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-05-02 10:03:05.750 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
~~~
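
Errno 111 means that nothing accepted the TCP connection at the broker address, so a quick triage step is to probe the AMQP port from an affected compute node. This is a minimal sketch, assuming bash and the default AMQP port 5672; the function name amqp_port_open and the example host are placeholders, and the real address should be taken from transport_url in nova.conf:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit 0) if a TCP connection to HOST:PORT opens.
# Assumption: PORT defaults to 5672, the standard AMQP port; the deployment
# may use a different address, see transport_url in nova.conf.
amqp_port_open() {
    local host=$1
    local port=${2:-5672}
    # bash's /dev/tcp pseudo-device opens a TCP socket; timeout bounds the wait.
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# Example (the host is a placeholder):
#   amqp_port_open 172.17.1.10 && echo "AMQP reachable" || echo "AMQP refused or timed out"
```

A refused probe while the rabbitmq resource is reported as started would point at the listener address or a firewall rather than at nova-compute itself.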

Alternatively, the nova_compute container sometimes remains in an unhealthy state without writing any logs, and stays down until it is restarted, which fixes the issue.

Version-Release number of selected component (if applicable):

# rpm -qa |grep -i nova
openstack-nova-common-23.2.1-0.20220317230946.9609ae0.el9ost.noarch
openstack-nova-compute-23.2.1-0.20220317230946.9609ae0.el9ost.noarch
openstack-nova-migration-23.2.1-0.20220317230946.9609ae0.el9ost.noarch

[root@controller-2 /]# rpm -qa |grep -i rabbit
rabbitmq-server-3.9.10-2.el9ost.x86_64

How reproducible:


Steps to Reproduce:
1. Stop pcs cluster on all nodes (pcs cluster stop --all)
2. Reboot the controllers
3. Check the compute service status (openstack compute service list)
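
Step 3 can be scripted when many compute nodes are involved. This is a minimal sketch, assuming the value formatter of the openstack client prints Binary, Host and State as whitespace-separated columns; the function name down_computes is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the hosts whose nova-compute service is "down".
# Reads "Binary Host State" lines on stdin, for example from:
#   openstack compute service list -f value -c Binary -c Host -c State
down_computes() {
    awk '$1 == "nova-compute" && $3 == "down" { print $2 }'
}

# Example:
#   openstack compute service list -f value -c Binary -c Host -c State | down_computes
```

Empty output means every nova-compute service is reported as up.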

Actual results:

The nova-compute service remains down on all or some of the nodes.

Expected results:

nova-compute service should come up.

Additional info:

Comment 2 Artom Lifshitz 2022-05-02 17:00:15 UTC
Hi Ketan,

I'm not sure where to redirect you, but I don't think DFG:Compute (aka the Nova team) can help here. The issue is clearly that nova-compute cannot connect to the message queue, but why this is happening is anyone's guess. Could be a network issue, a VIP issue, a firewall issue, I'm not sure. Maybe try DFG:PIDONE for initial triage? I'm going to close this as NOTABUG because there is no Nova bug here, and because I want to get this off our triage list. When you re-open this, please change the component and/or the Internal Whiteboard field to a more appropriate team.

Thanks in advance.

Comment 3 Ketan Mehta 2022-05-04 06:06:31 UTC
Thanks Artom, I'll reopen it with PIDONE for rabbitmq for the initial triage.

Comment 4 Luca Miccini 2022-05-04 07:18:05 UTC
Hey Ketan, do you have the logs/sosreports somewhere?

Comment 14 Red Hat Bugzilla 2023-12-02 04:25:44 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

