Description of problem:
If the pcs cluster is stopped on all nodes, followed by controller reboot the cluster and all it's services come up.
However, the nova-compute services fail to come up either on all nodes or on a random number of nodes.
The error log suggests that it is unable to AMQP and the request fails with ConnectionRefusedError.
~~~
2022-05-02 10:02:57.704 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-05-02 10:02:57.710 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 8.0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-05-02 10:02:57.716 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-05-02 10:02:57.723 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-05-02 10:02:57.729 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 8.0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-05-02 10:02:59.716 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-05-02 10:02:59.723 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-05-02 10:02:59.729 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 6.0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-05-02 10:03:05.734 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-05-02 10:03:05.750 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
~~~
Alternatively, at times the nova_compute container remains in unhealthy state and does not report any logs and stays down until restarted which fixes the issue.
Version-Release number of selected component (if applicable):
# rpm -qa |grep -i nova
openstack-nova-common-23.2.1-0.20220317230946.9609ae0.el9ost.noarch
openstack-nova-compute-23.2.1-0.20220317230946.9609ae0.el9ost.noarch
openstack-nova-migration-23.2.1-0.20220317230946.9609ae0.el9ost.noarch
# rpm -qa |grep -i rabbit
[root@controller-2 /]# rpm -qa |grep -i rabbit
rabbitmq-server-3.9.10-2.el9ost.x86_64
How reproducible:
Steps to Reproduce:
1. Stop pcs cluster on all nodes (pcs cluster stop --all)
2. Reboot the controllers
3. Check the compute service status (openstack compute service list)
Actual results:
nova-compute service remains down on all/few nodes.
Expected results:
nova-compute service should come up.
Additional info:
Hi Ketan,
I'm not sure where to redirect you, but I don't think DFG:Compute (aka the Nova team) can help here. The issue is clearly that nova-compute cannot connect to the message queue, but why this is happening is anyone's guess. Could be a network issue, a VIP issue, a firewall issue, I'm not sure. Maybe try DFG:PIDONE for initial triage? I'm going to close this as NOTABUG because there is no Nova bug here, and because I want to get this off our triage list. When you re-open this, please change the component and/or the Internal Whiteboard field to a more appropriate team.
Thanks in advance.