Description of problem:

vm_state and power_state do not display correct info when the hypervisor is shut down.

Version-Release number of selected component (if applicable):


How reproducible:

Consistent.

Steps to Reproduce:
1. Create a VM on a hypervisor
2. Confirm it is running and reachable (ping, ssh, etc.)
3. Shut down the hypervisor
4. Run: openstack server show <instance_id>

Actual results:

| OS-EXT-STS:power_state | Running |
| OS-EXT-STS:vm_state    | active  |

Expected results:

| OS-EXT-STS:power_state | Shutdown |
| OS-EXT-STS:vm_state    | stopped  |

Additional info:

OpenStack version: Ocata
OS: CentOS 7.4/7.5

nova-related rpms:
openstack-nova-scheduler-15.1.2-0.20180508105133.691ffcf.el7.centos.noarch
python2-novaclient-7.1.2-1.el7.noarch
openstack-nova-console-15.1.2-0.20180508105133.691ffcf.el7.centos.noarch
openstack-nova-common-15.1.2-0.20180508105133.691ffcf.el7.centos.noarch
puppet-nova-10.5.1-0.20180428082808.55dee4d.el7.centos.noarch
openstack-nova-novncproxy-15.1.2-0.20180508105133.691ffcf.el7.centos.noarch
python-nova-15.1.2-0.20180508105133.691ffcf.el7.centos.noarch
openstack-nova-conductor-15.1.2-0.20180508105133.691ffcf.el7.centos.noarch
openstack-nova-compute-15.1.2-0.20180508105133.691ffcf.el7.centos.noarch
openstack-nova-cert-15.1.2-0.20180508105133.691ffcf.el7.centos.noarch
openstack-nova-migration-15.1.2-0.20180508105133.691ffcf.el7.centos.noarch
openstack-nova-api-15.1.2-0.20180508105133.691ffcf.el7.centos.noarch
(In reply to kforde from comment #0)
> Steps to Reproduce:
> 1. Create a VM on a hypervisor
> 2. Confirm it is running and reachable (ping, ssh, etc.)
> 3. Shut down the hypervisor

Please can you explain exactly what step 3 means to you, preferably by providing the exact actions you took to achieve it.
(In reply to Matthew Booth from comment #1)
> Please can you explain exactly what step 3 means to you, preferably by
> providing the exact actions you took to achieve it.

Step 3:

ipmitool -U <USERNAME> -P <PASSWRD> -I lanplus -H 10.x.x.x power cycle
(In reply to kforde from comment #2)
> Step 3:
>
> ipmitool -U <USERNAME> -P <PASSWRD> -I lanplus -H 10.x.x.x power cycle

In fact I disable the compute node first ... not sure if that is affecting the status update?
(In reply to kforde from comment #3)
> In fact I disable the compute node first ... not sure if that is affecting
> the status update?

So to be clear:

1. Disable compute service
2. Power down compute host

?
(In reply to Matthew Booth from comment #4)
> So to be clear:
>
> 1. Disable compute service
> 2. Power down compute host

Exactly. Disable, then power down via ipmitool.

Show instances before power down ... shows active/running
Show instances when hypervisor is switched off ... shows active/running
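For reference, the full sequence might look like the following commands (the hostname, instance ID and BMC address are placeholders; the service name assumes a standard deployment):

# 1. Disable the nova-compute service on the target host
openstack compute service set --disable compute-0.example.com nova-compute

# 2. Power down the compute host out-of-band
ipmitool -U <USERNAME> -P <PASSWRD> -I lanplus -H 10.x.x.x power off

# 3. The instance still reports active/Running even though its host is off
openstack server show <instance_id> -c OS-EXT-STS:vm_state -c OS-EXT-STS:power_state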
Assuming the answer to comment 4 is yes, I think this would currently be expected behaviour. There's a periodic task which runs on the compute host which checks that the instance's power state according to the hypervisor is still what we've got in the DB. However, you're not just shutting down the hypervisor, you're shutting down the whole compute host, so that won't run.

Architecturally you'd want a component which notices that the compute host is gone and is prepared to take an action. I don't think we have that in Nova. We do have servicegroup, which will notice that the service is gone, but we don't have a nova component monitoring service status to take action. If I proposed one I suspect it would be rejected as a task for external orchestration. I'll bring it up at the weekly team meeting in case I missed anything, or there's something on the horizon I'm not aware of.

If you can supply a more complete description of the entire use case (i.e. in terms of operational events and intended high-level end-user outcome, not focusing narrowly on how you would like to achieve this) we might be able to sign-post you elsewhere. However, as described I currently expect that we wouldn't address this.
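To illustrate the servicegroup behaviour: once the host is powered off, the service is reported as down while the instance record is untouched. A sketch of what an admin would see (output abbreviated and illustrative):

openstack compute service list --service nova-compute
# ... | nova-compute | compute-0.example.com | ... | enabled | down | ...

openstack server show <instance_id> -c OS-EXT-STS:vm_state
# OS-EXT-STS:vm_state still reports: active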
Hi,

Yes, I was sort of expecting this answer. I just find it strange that when a compute node goes down (power cut, etc.), an end user will see his/her VM as 'running' but not be able to reach it. When a compute node goes down, I would have expected that the DB would be updated for all the VMs on that compute host.

From an operator's point of view, it results in inaccurate information when creating reports on running VMs, or at least we have to add more intelligence to our scripts to account for this (see the sketch below).
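As an illustration of the extra scripting this currently requires, a report might cross-check each ACTIVE instance's host against the compute service state. A rough sketch, assuming admin credentials and standard python-openstackclient output fields:

#!/bin/bash
# Flag "running" VMs whose nova-compute service is down.
down_hosts=$(openstack compute service list --service nova-compute \
  -f value -c Host -c State | awk '$2 == "down" {print $1}')

openstack server list --all-projects --status ACTIVE \
  -f value -c ID -c Name | while read -r id name; do
  # OS-EXT-SRV-ATTR:host is an admin-only field
  host=$(openstack server show "$id" -f value -c OS-EXT-SRV-ATTR:host)
  if grep -qx "$host" <<< "$down_hosts"; then
    echo "WARNING: $name ($id) reports ACTIVE but host $host is down"
  fi
done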
So the API *does* check the service at the time of the call, but it only reports it via host_status, which can only be seen by admin:

<melwitt> looks like it might be this microversion for host_status https://docs.openstack.org/nova/latest/reference/api-microversion-history.html#id14

I'm going to turn this into an RFE to expose this better to users. The best current idea is to report 'unknown' for power state if the host state is also unknown. Fundamentally, if the compute host is down we have no way to discover the hypervisor state, so I think this is the best available information. Do you think that would resolve your use case?
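For reference, host_status was added in compute API microversion 2.16 (the entry in the linked history page), so an admin can request it explicitly:

# With the compute host down, host_status reports UNKNOWN
openstack --os-compute-api-version 2.16 server show <instance_id> -c host_status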
This is an issue that comes back every now and then. I guess having the state of the VMs marked as "unknown" if the nova-compute service is marked as "DOWN" or "unknown" would be a reasonable alternative.
(In reply to David Hill from comment #10)
> This is an issue that comes back every now and then. I guess having the
> state of the VMs marked as "unknown" if the nova-compute service is marked
> as "DOWN" or "unknown" would be a reasonable alternative.

That's been done upstream, though I'm having trouble finding the exact patch.
(In reply to Artom Lifshitz from comment #11)
> That's been done upstream, though I'm having trouble finding the exact patch.

Correction - what's been done [1] is adding a policy rule that exposes the host_status field to everyone, but only when the host_status is UNKNOWN. This is the indication that something is wrong on the host, and that vm_state and power_state are not to be trusted.

[1] https://review.opendev.org/679181
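For operators on a release that includes [1], enabling this would be a policy change. A hypothetical snippet (the rule name is taken from the upstream change and the value is illustrative; verify both against your Nova release before relying on them):

# Let non-admin users see host_status, but only when it is UNKNOWN
cat >> /etc/nova/policy.yaml <<'EOF'
"os_compute_api:servers:show:host_status:unknown-only": "rule:admin_or_owner"
EOF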
And one more piece of info: https://bugzilla.redhat.com/show_bug.cgi?id=1672972 is the downstream RFE for the previous comment's work.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days