Description of problem:
If the pcs cluster is stopped on all nodes and the controllers are then rebooted, the cluster and all its services come up. However, the nova-compute services fail to come up, either on all nodes or on a random number of nodes. The error log suggests that nova-compute is unable to connect to AMQP; the connection attempts fail with ConnectionRefusedError.

~~~
2022-05-02 10:02:57.704 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-05-02 10:02:57.710 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 8.0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-05-02 10:02:57.716 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-05-02 10:02:57.723 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-05-02 10:02:57.729 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 8.0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-05-02 10:02:59.716 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-05-02 10:02:59.723 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-05-02 10:02:59.729 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 6.0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-05-02 10:03:05.734 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-05-02 10:03:05.750 2 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
~~~

Alternatively, at times the nova_compute container remains in an unhealthy state, does not write any logs, and stays down until it is restarted, which fixes the issue.

Version-Release number of selected component (if applicable):
# rpm -qa |grep -i nova
openstack-nova-common-23.2.1-0.20220317230946.9609ae0.el9ost.noarch
openstack-nova-compute-23.2.1-0.20220317230946.9609ae0.el9ost.noarch
openstack-nova-migration-23.2.1-0.20220317230946.9609ae0.el9ost.noarch

[root@controller-2 /]# rpm -qa |grep -i rabbit
rabbitmq-server-3.9.10-2.el9ost.x86_64

How reproducible:

Steps to Reproduce:
1. Stop the pcs cluster on all nodes (pcs cluster stop --all)
2. Reboot the controllers
3. Check the compute service status (openstack compute service list)

Actual results:
The nova-compute service remains down on all or some of the nodes.

Expected results:
The nova-compute service should come up.

Additional info:
Hi Ketan, I'm not sure where to redirect you, but I don't think DFG:Compute (aka the Nova team) can help here. The issue is clearly that nova-compute cannot connect to the message queue, but why this is happening is anyone's guess. Could be a network issue, a VIP issue, a firewall issue, I'm not sure. Maybe try DFG:PIDONE for initial triage? I'm going to close this as NOTABUG because there is no Nova bug here, and because I want to get this off our triage list. When you re-open this, please change the component and/or the Internal Whiteboard field to a more appropriate team. Thanks in advance.
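To help narrow down which of those causes applies, a plain TCP probe run from an affected compute node can show how the broker endpoint responds. The following is a minimal sketch, not something taken from this bug: the function name `probe_amqp` is hypothetical, port 5672 is only the default AMQP port and may differ in a given deployment, and the hosts to check would have to be taken from the `transport_url` in that node's nova.conf.

```python
import errno
import socket
import sys


def probe_amqp(host, port=5672, timeout=3.0):
    """Classify one TCP connection attempt to a broker endpoint.

    Hypothetical helper for illustration; 5672 is assumed to be the
    AMQP port, which may not match the deployment.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The same Errno 111 that oslo.messaging logs above.
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        # Anything else, e.g. name resolution or routing failures.
        return "error: " + errno.errorcode.get(exc.errno, str(exc))


if __name__ == "__main__":
    # Usage: python probe.py <host> [<host> ...]
    for target in sys.argv[1:]:
        print(target, probe_amqp(target))
```

As a rough guide, "refused" usually means the address was reachable but nothing accepted the connection on that port (for example, RabbitMQ not yet listening, or a firewall rule that rejects with a reset), whereas "timeout" more often points at dropped packets or an unreachable address. Neither result is conclusive on its own.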
Thanks Artom, I'll reopen it with PIDONE for rabbitmq for the initial triage.
Hey Ketan, do you have the logs/sosreports somewhere?
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days