Bug 2080952

Summary: nova-compute service fails to come up after controller reboot
Product: Red Hat OpenStack
Reporter: Ketan Mehta <kmehta>
Component: rabbitmq-server
Assignee: Peter Lemenkov <plemenko>
Status: CLOSED DUPLICATE
QA Contact: dabarzil
Severity: high
Docs Contact:
Priority: high
Version: 17.0 (Wallaby)
CC: alifshit, apevec, bdobreli, dasmith, eglynn, eolivare, jeckersb, jhakimra, kchamart, lhh, lmiccini, plemenko, sbauza, sgordon, skaplons, tfreger, vromanso
Target Milestone: ---
Keywords: Reopened, Triaged
Target Release: ---
Flags: ykaul: needinfo? (kmehta), ifrangs: needinfo? (plemenko)
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-08-18 07:17:34 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Ketan Mehta 2022-05-02 12:52:09 UTC
Description of problem:

If the pcs cluster is stopped on all nodes and the controllers are then rebooted, the cluster and all its services come up.

However, the nova-compute services fail to come up either on all nodes or on a random number of nodes.

The error log suggests that nova-compute is unable to connect to AMQP; the request fails with ConnectionRefusedError.

~~~
2022-05-02 10:02:57.704 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-05-02 10:02:57.710 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 8.0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-05-02 10:02:57.716 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-05-02 10:02:57.723 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-05-02 10:02:57.729 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 8.0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-05-02 10:02:59.716 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-05-02 10:02:59.723 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-05-02 10:02:59.729 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 6.0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-05-02 10:03:05.734 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-05-02 10:03:05.750 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
~~~
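
For triage, it can help to check whether the AMQP endpoint actively refuses connections (nothing listening on the port, which is what ECONNREFUSED in the log above indicates) or silently drops them (more typical of a firewall). The following is a minimal sketch, not part of the original report. The function name `probe_amqp` is hypothetical, and port 5672 is assumed as the standard AMQP port; the port actually configured in the deployment's transport_url may differ.

```python
import errno
import socket


def probe_amqp(host: str, port: int = 5672, timeout: float = 3.0) -> str:
    """Classify a single TCP connection attempt to an AMQP endpoint.

    Hypothetical helper for triage; 5672 is an assumed default port.
    Returns "open", "refused", "timeout", or "error: <errno name>".
    """
    try:
        # Only the TCP handshake is tested; no AMQP protocol data is sent.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered with a reset: no process is listening on the port.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout, e.g. packets dropped en route.
        return "timeout"
    except OSError as exc:
        # Any other failure (unreachable network, name resolution, ...).
        return f"error: {errno.errorcode.get(exc.errno, exc.errno)}"
```

Running `probe_amqp("<controller address>")` from an affected compute node, once per controller, would show whether RabbitMQ is listening on each of them at the time nova-compute retries.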

Alternatively, at times the nova_compute container remains in an unhealthy state, reports no logs, and stays down until it is restarted, which fixes the issue.

Version-Release number of selected component (if applicable):

# rpm -qa |grep -i nova
openstack-nova-common-23.2.1-0.20220317230946.9609ae0.el9ost.noarch
openstack-nova-compute-23.2.1-0.20220317230946.9609ae0.el9ost.noarch
openstack-nova-migration-23.2.1-0.20220317230946.9609ae0.el9ost.noarch

[root@controller-2 /]# rpm -qa |grep -i rabbit
rabbitmq-server-3.9.10-2.el9ost.x86_64

How reproducible:


Steps to Reproduce:
1. Stop pcs cluster on all nodes (pcs cluster stop --all)
2. Reboot the controllers
3. Check the compute service status (openstack compute service list)
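
Step 3 can be scripted when the check has to be repeated after each reboot. The sketch below is an illustration, not part of the original report. It assumes the output of `openstack compute service list -f json` is passed in as a string and that each entry carries the `Binary`, `Host`, and `State` columns; the function name `down_compute_hosts` and the host names in the sample data are hypothetical.

```python
import json


def down_compute_hosts(service_list_json: str) -> list[str]:
    """Return the hosts whose nova-compute service is reported as down.

    Expects the JSON text produced by
    `openstack compute service list -f json` (column names assumed).
    """
    services = json.loads(service_list_json)
    return sorted(
        svc["Host"]
        for svc in services
        if svc.get("Binary") == "nova-compute" and svc.get("State") == "down"
    )


# Hypothetical sample data, shaped like the CLI's JSON output.
sample = json.dumps([
    {"Binary": "nova-compute", "Host": "compute-0.example", "State": "down"},
    {"Binary": "nova-compute", "Host": "compute-1.example", "State": "up"},
    {"Binary": "nova-conductor", "Host": "controller-0.example", "State": "down"},
])
print(down_compute_hosts(sample))
```

Only nova-compute entries are considered, so other services that happen to be down do not appear in the result.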

Actual results:

The nova-compute service remains down on all or some of the nodes.

Expected results:

nova-compute service should come up.

Additional info:

Comment 2 Artom Lifshitz 2022-05-02 17:00:15 UTC
Hi Ketan,

I'm not sure where to redirect you, but I don't think DFG:Compute (aka the Nova team) can help here. The issue is clearly that nova-compute cannot connect to the message queue, but why this is happening is anyone's guess. Could be a network issue, a VIP issue, a firewall issue, I'm not sure. Maybe try DFG:PIDONE for initial triage? I'm going to close this as NOTABUG because there is no Nova bug here, and because I want to get this off our triage list. When you re-open this, please change the component and/or the Internal Whiteboard field to a more appropriate team.

Thanks in advance.

Comment 3 Ketan Mehta 2022-05-04 06:06:31 UTC
Thanks Artom, I'll reopen it with PIDONE for rabbitmq for the initial triage.

Comment 4 Luca Miccini 2022-05-04 07:18:05 UTC
Hey Ketan, do you have the logs/sosreports somewhere?