Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1942079

Summary: Instances are stuck in scheduling/building state when scheduled on specific compute host
Product: Red Hat OpenStack
Component: openstack-nova
Version: 13.0 (Queens)
Reporter: Alex Stupnikov <astupnik>
Assignee: OSP DFG:Compute <osp-dfg-compute>
QA Contact: OSP DFG:Compute <osp-dfg-compute>
Status: CLOSED NOTABUG
Severity: high
Priority: high
Hardware: x86_64
OS: Linux
CC: alifshit, dasmith, dhruv, eglynn, hberaud, igallagh, jhakimra, kchamart, lmiccini, ltamagno, mwitt, sbauza, sgordon, smooney, vromanso
Flags: ltamagno: needinfo?
Target Milestone: ---
Target Release: ---
Doc Type: If docs needed, set a value
Last Closed: 2021-05-21 15:14:13 UTC
Type: Bug

Description Alex Stupnikov 2021-03-23 15:39:26 UTC
Description of problem:

One of our customers reported a problem with their RHOSP 13 Z10 overcloud: instances are stuck in the scheduling/building state [1] when scheduled on a specific hypervisor (interestingly, from Nova's perspective the hypervisor is not defined).

This problem was reproduced in two different ways:

1. The customer used Horizon to start a number of instances. Instances scheduled on the affected compute got stuck in state [1] (no exceptions).
2. The customer specified the AZ:compute_hostname argument when using the nova CLI to create instances and hit the same problem.
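For context, the AZ:compute_hostname reproducer refers to Nova's forced-host scheduling syntax, where an admin appends the hostname to the availability zone. A sketch of the second reproducer might look like this (image, flavor, and hostname are placeholders, not the customer's values):

```shell
# Force the instance onto a specific compute host by appending the
# hostname to the availability zone (requires admin privileges).
# "overcloud-compute-0" is a placeholder, not the actual host.
openstack server create \
    --image rhel-7 \
    --flavor m1.small \
    --availability-zone nova:overcloud-compute-0 \
    test-forced-host

# On the affected host the server then sits in BUILD/scheduling:
openstack server show test-forced-host \
    -c OS-EXT-STS:vm_state -c OS-EXT-STS:task_state
```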

For the second reproducer we have debug logs from the controllers and the affected compute node, along with debug CLI outputs, attached to the support case (please check the latest set of files). I have tried to understand this problem better, but failed to isolate the root cause.

When I started working on this issue it looked like a problem with nova-scheduler or RabbitMQ: I can see that nova-scheduler selected a compute host to run the instance and nova-conductor created the block device mapping. At the same time, the compute host never received the RPC call to start the instance. So it looks like something is wrong with the RPC layer or the instance's state machine (the instance just hangs in the scheduling state). The problem is that all Nova services on the controller nodes and the nova-compute service were restarted, the rabbitmq resource was restarted as well, and nothing changed.
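The "hangs in scheduling" symptom can be spotted mechanically from the server record: building/scheduling with no host assigned long after creation. A minimal sketch, using the field names from the output in [1] (the ten-minute threshold is my assumption, not a Nova constant):

```python
from datetime import datetime, timedelta, timezone

def is_stuck_scheduling(server, now, threshold=timedelta(minutes=10)):
    """Heuristic: a server is stuck if it has sat in BUILD/scheduling
    with no compute host assigned for longer than `threshold`."""
    created = datetime.strptime(
        server["created"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return (
        server.get("OS-EXT-STS:vm_state") == "building"
        and server.get("OS-EXT-STS:task_state") == "scheduling"
        and server.get("OS-EXT-SRV-ATTR:host") in (None, "None")
        and now - created > threshold
    )

# Example using the values from the table in [1]:
stuck = is_stuck_scheduling(
    {
        "created": "2021-03-12T17:39:06Z",
        "OS-EXT-STS:vm_state": "building",
        "OS-EXT-STS:task_state": "scheduling",
        "OS-EXT-SRV-ATTR:host": None,
    },
    now=datetime(2021, 3, 12, 18, 24, tzinfo=timezone.utc),
)
print(stuck)  # → True
```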

Then I thought the customer was facing bug https://bugs.launchpad.net/nova/+bug/1652335: I didn't know at that point that the problem affected all instances and assumed it was caused by the AZ:compute_hostname argument. But again, I couldn't find anything interesting in the logs, and the number of available pCPUs * 16.0 (the default CPU allocation ratio) is well above the number of used vCPUs.
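The capacity check referenced above is simple arithmetic: with Nova's default cpu_allocation_ratio of 16.0, the schedulable vCPU capacity of a host is pCPUs * 16.0. A quick sketch (the counts here are made-up illustrations, not the customer's numbers):

```python
def has_cpu_capacity(pcpus, used_vcpus, requested_vcpus,
                     cpu_allocation_ratio=16.0):
    """Return True if the host can still accept `requested_vcpus`
    under Nova's CPU overcommit model (capacity = pCPUs * ratio)."""
    capacity = pcpus * cpu_allocation_ratio
    return used_vcpus + requested_vcpus <= capacity

# Illustrative numbers only: a 48-pCPU host with 200 vCPUs in use
# has a capacity of 48 * 16.0 = 768 vCPUs, so 4 more vCPUs fit.
print(has_cpu_capacity(48, 200, 4))  # → True
```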

At this point we are stuck and I am asking for a second look from engineering. I will provide customer-specific information privately.

[1]
(empty and customer-specific fields are removed)
+-------------------------------------+--------------------------------------------------------+
| Field                               | Value                                                  |
+-------------------------------------+--------------------------------------------------------+
| OS-DCF:diskConfig                   | AUTO                                                   |
| OS-EXT-AZ:availability_zone         | nova                                                   |
| OS-EXT-SRV-ATTR:host                | None                                                   |
| OS-EXT-SRV-ATTR:hypervisor_hostname | None                                                   |
| OS-EXT-SRV-ATTR:instance_name       | instance-00001be2                                      |
| OS-EXT-STS:power_state              | NOSTATE                                                |
| OS-EXT-STS:task_state               | scheduling                                             |
| OS-EXT-STS:vm_state                 | building                                               |
| OS-SRV-USG:launched_at              | None                                                   |
| OS-SRV-USG:terminated_at            | None                                                   |
| created                             | 2021-03-12T17:39:06Z                                   |
| progress                            | 0                                                      |
| updated                             | 2021-03-12T18:23:57Z                                   |
| user_id                             | b4670861a73347c38f2f3c662e13d30a                       |
| volumes_attached                    | id='VOL_UUID'                                          |
+-------------------------------------+--------------------------------------------------------+

Comment 25 Artom Lifshitz 2021-03-31 16:22:42 UTC
We discussed this on the DFG:Compute bug triage call today. If there are no (needed) instances on the affected compute host (C1F-OPS-CMPC20/c1f-ops-cmpc20), the simplest way to fix this would be to scale in (remove the compute) and scale out again to re-add it to the environment. Is this something the customer can do?

Comment 26 Alex Stupnikov 2021-04-01 07:08:08 UTC
Unfortunately, there are instances on the affected compute. I am not sure whether the remaining computes have enough resources to host them, or how migration would work with this RPC problem in place. I am wondering what other options we have, and how time-consuming and disruptive they are.

Kind Regards, Alex.

Comment 40 Artom Lifshitz 2021-04-14 21:57:46 UTC
We discussed this some more during bug triage today. Could we also get a placement database dump, please? We need to understand how the compute host renames have affected its resource provider in Placement.
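The concern with a rename is that Placement keeps a resource provider per compute node, so a renamed host (note the C1F-OPS-CMPC20/c1f-ops-cmpc20 casing above) can leave a stale provider behind while the new name has none. A toy sketch of the consistency check we would run against the dump (the function and data shapes are my assumptions, not the actual Placement schema):

```python
def find_rename_mismatches(compute_hostnames, resource_provider_names):
    """Compare nova-compute hostnames against Placement resource
    provider names; report providers matching no current host
    (e.g. left over from a hypervisor rename) and hosts with no
    provider at all. Names are compared case-sensitively, as a
    rename that only changes case would still mismatch."""
    hosts = set(compute_hostnames)
    providers = set(resource_provider_names)
    orphaned = [rp for rp in resource_provider_names if rp not in hosts]
    missing = [h for h in compute_hostnames if h not in providers]
    return orphaned, missing

# Illustrative data: the host was renamed to lowercase, but the
# old uppercase resource provider is still present in Placement.
orphaned, missing = find_rename_mismatches(
    ["c1f-ops-cmpc20"],
    ["C1F-OPS-CMPC20", "c1f-ops-cmpc20"],
)
print(orphaned)  # → ['C1F-OPS-CMPC20']
```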

That being said, if you can convince the customer to drain the node, scale in and scale back out, that would be the safest way to fix this. And by draining first, they shouldn't need extra hardware (assuming the cloud isn't at full capacity).
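Assuming the RPC path still works well enough for migrations, the drain step before scaling in might look like the following sketch. The exact flags should be verified against the deployed Queens-era client (older OpenStackClient releases use `nova live-migration` or `openstack server migrate --live <host>` instead of `--live-migration`), and the hostname is taken from comment 25:

```shell
# Stop the scheduler from placing new instances on the affected compute.
openstack compute service set --disable \
    --disable-reason "draining before scale-in" \
    c1f-ops-cmpc20 nova-compute

# Live-migrate each instance off the host; the scheduler picks targets.
for server in $(openstack server list --all-projects \
                    --host c1f-ops-cmpc20 -f value -c ID); do
    openstack server migrate --live-migration "$server"
done
```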

Comment 53 smooney 2021-05-21 15:14:13 UTC
Now that the customer case is resolved, I'm closing this as NOTABUG, as the root cause was determined to be an unsupported operation. The DB has now been fixed and the configs updated, so I'm closing this to reflect that.