Bug 2080952 - nova-compute service fails to come up after controller reboot [NEEDINFO]
Summary: nova-compute service fails to come up after controller reboot
Keywords:
Status: CLOSED DUPLICATE of bug 2115383
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rabbitmq-server
Version: 17.0 (Wallaby)
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Peter Lemenkov
QA Contact: dabarzil
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-05-02 12:52 UTC by Ketan Mehta
Modified: 2023-08-03 15:46 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-18 07:17:34 UTC
Target Upstream Version:
Embargoed:
ykaul: needinfo? (kmehta)
ifrangs: needinfo? (plemenko)


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Issue Tracker | OSP-14975 | 0 | None | None | None | 2022-05-02 13:18:04 UTC

Description Ketan Mehta 2022-05-02 12:52:09 UTC
Description of problem:

If the pcs cluster is stopped on all nodes and the controllers are then rebooted, the cluster and all its services come up.

However, the nova-compute services fail to come up either on all nodes or on a random number of nodes.

The error log suggests that nova-compute is unable to connect to AMQP: every connection attempt fails with ConnectionRefusedError.

~~~
2022-05-02 10:02:57.704 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-05-02 10:02:57.710 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 8.0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-05-02 10:02:57.716 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-05-02 10:02:57.723 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-05-02 10:02:57.729 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 8.0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-05-02 10:02:59.716 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-05-02 10:02:59.723 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-05-02 10:02:59.729 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 6.0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-05-02 10:03:05.734 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-05-02 10:03:05.750 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
~~~
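
For triage, a quick way to tell "connection refused" (host reachable, nothing listening on the port) apart from other network failures is a plain TCP probe from the compute node. The following is only a minimal sketch and not part of the original report: the function name is made up, and it assumes the broker listens on the default AMQP port 5672. The hosts and port that nova-compute actually uses come from its transport_url setting.

```python
import socket


def amqp_port_state(host, port=5672, timeout=3.0):
    """Classify a TCP connection attempt as 'open', 'refused', or 'unreachable'.

    Hypothetical helper for illustration; 5672 is the default AMQP port and
    may differ in a given deployment.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Errno 111 (ECONNREFUSED) on Linux: the host answered, but nothing
        # is listening on that port, which matches the log above.
        return "refused"
    except OSError:
        # Timeouts, no route to host, name resolution failures, and so on.
        return "unreachable"
```

Calling it once per broker host, with the real addresses substituted, would show whether RabbitMQ is simply not listening yet ("refused") or whether the problem sits lower in the network path ("unreachable").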

Alternatively, at times the nova_compute container remains in an unhealthy state, writes no logs, and stays down until it is restarted; the restart fixes the issue.

Version-Release number of selected component (if applicable):

# rpm -qa |grep -i nova
openstack-nova-common-23.2.1-0.20220317230946.9609ae0.el9ost.noarch
openstack-nova-compute-23.2.1-0.20220317230946.9609ae0.el9ost.noarch
openstack-nova-migration-23.2.1-0.20220317230946.9609ae0.el9ost.noarch

[root@controller-2 /]# rpm -qa |grep -i rabbit
rabbitmq-server-3.9.10-2.el9ost.x86_64

How reproducible:


Steps to Reproduce:
1. Stop pcs cluster on all nodes (pcs cluster stop --all)
2. Reboot the controllers
3. Check the compute service status (openstack compute service list)
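
The check in step 3 can be scripted when the reproduction is repeated many times. The sketch below is an assumption-laden illustration, not something from the original report: it expects the JSON produced by `openstack compute service list -f json` and assumes the usual "Binary", "Host", and "State" columns. The function name is hypothetical.

```python
import json


def down_compute_services(service_list_json):
    """Return the hosts whose nova-compute service is not reported as 'up'.

    Assumes the input is the JSON output of
    `openstack compute service list -f json` with "Binary", "Host",
    and "State" keys.
    """
    services = json.loads(service_list_json)
    return [
        svc["Host"]
        for svc in services
        if svc.get("Binary") == "nova-compute" and svc.get("State") != "up"
    ]
```

An empty list after the reboot would mean every nova-compute service came back; any host names returned are the ones still down.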

Actual results:

The nova-compute service remains down on all or some of the nodes.

Expected results:

nova-compute service should come up.

Additional info:

Comment 2 Artom Lifshitz 2022-05-02 17:00:15 UTC
Hi Ketan,

I'm not sure where to redirect you, but I don't think DFG:Compute (aka the Nova team) can help here. The issue is clearly that nova-compute cannot connect to the message queue, but why this is happening is anyone's guess. Could be a network issue, a VIP issue, a firewall issue, I'm not sure. Maybe try DFG:PIDONE for initial triage? I'm going to close this as NOTABUG because there is no Nova bug here, and because I want to get this off our triage list. When you re-open this, please change the component and/or the Internal Whiteboard field to a more appropriate team.

Thanks in advance.

Comment 3 Ketan Mehta 2022-05-04 06:06:31 UTC
Thanks Artom, I'll reopen it with PIDONE for rabbitmq for the initial triage.

Comment 4 Luca Miccini 2022-05-04 07:18:05 UTC
Hey Ketan, do you have the logs/sosreports somewhere?

