Description of problem:

vm_state and power_state do not display correct info when the hypervisor is shut down.

Version-Release number of selected component (if applicable):


How reproducible:

Consistent.

Steps to Reproduce:
1. Create a VM on a hypervisor
2. Confirm it is running and reachable (ping, ssh, etc.)
3. Shut down the hypervisor
4. Run: openstack server show <instance_id>

Actual results:

| OS-EXT-STS:power_state | Running |
| OS-EXT-STS:vm_state    | active  |

Expected results:

| OS-EXT-STS:power_state | Shutdown |
| OS-EXT-STS:vm_state    | stopped  |

Additional info:

OpenStack version: Ocata
OS: CentOS 7.4/7.5

nova-related rpms:
openstack-nova-scheduler-15.1.2-0.20180508105133.691ffcf.el7.centos.noarch
python2-novaclient-7.1.2-1.el7.noarch
openstack-nova-console-15.1.2-0.20180508105133.691ffcf.el7.centos.noarch
openstack-nova-common-15.1.2-0.20180508105133.691ffcf.el7.centos.noarch
puppet-nova-10.5.1-0.20180428082808.55dee4d.el7.centos.noarch
openstack-nova-novncproxy-15.1.2-0.20180508105133.691ffcf.el7.centos.noarch
python-nova-15.1.2-0.20180508105133.691ffcf.el7.centos.noarch
openstack-nova-conductor-15.1.2-0.20180508105133.691ffcf.el7.centos.noarch
openstack-nova-compute-15.1.2-0.20180508105133.691ffcf.el7.centos.noarch
openstack-nova-cert-15.1.2-0.20180508105133.691ffcf.el7.centos.noarch
openstack-nova-migration-15.1.2-0.20180508105133.691ffcf.el7.centos.noarch
openstack-nova-api-15.1.2-0.20180508105133.691ffcf.el7.centos.noarch
(In reply to kforde from comment #0)
> Steps to Reproduce:
> 1. Create a VM on a hypervisor
> 2. Confirm it is running and reachable (ping, ssh, etc.)
> 3. Shut down the hypervisor

Please can you explain exactly what step 3 means to you, preferably by providing the exact actions you took to achieve it.
(In reply to Matthew Booth from comment #1)
> Please can you explain exactly what step 3 means to you, preferably by
> providing the exact actions you took to achieve it.

Step 3:

ipmitool -U <USERNAME> -P <PASSWRD> -I lanplus -H 10.x.x.x power cycle
(In reply to kforde from comment #2)
> Step 3:
>
> ipmitool -U <USERNAME> -P <PASSWRD> -I lanplus -H 10.x.x.x power cycle

In fact I disable the compute node first ... not sure if that is affecting the status update?
(In reply to kforde from comment #3)
> In fact I disable the compute node first ... not sure if that is affecting
> the status update?

So to be clear:

1. Disable compute service
2. Power down compute host

?
(In reply to Matthew Booth from comment #4)
> So to be clear:
>
> 1. Disable compute service
> 2. Power down compute host

Exactly. Disable, then power down via ipmitool.

Show instances before power down ... shows active/running
Show instances when hypervisor is switched off ... shows active/running
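For reference, the full sequence might look like the following commands (the hostname, instance ID and BMC address are placeholders; the service name assumes a standard deployment):

# 1. Disable the nova-compute service on the target host
openstack compute service set --disable compute-0.example.com nova-compute

# 2. Power down the compute host out-of-band
ipmitool -U <USERNAME> -P <PASSWRD> -I lanplus -H 10.x.x.x power off

# 3. The instance still reports active/Running even though its host is off
openstack server show <instance_id> -c OS-EXT-STS:vm_state -c OS-EXT-STS:power_state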
Assuming the answer to comment 4 is yes, I think this would currently be expected behaviour. There's a periodic task which runs on the compute host which checks that the instance's power state according to the hypervisor is still what we've got in the DB. However, you're not just shutting down the hypervisor, you're shutting down the whole compute host, so that won't run.

Architecturally you'd want a component which notices that the compute host is gone and is prepared to take an action. I don't think we have that in Nova. We do have servicegroup, which will notice that the service is gone, but we don't have a nova component monitoring service status to take action. If I proposed one I suspect it would be rejected as a task for external orchestration. I'll bring it up at the weekly team meeting in case I missed anything, or there's something on the horizon I'm not aware of.

If you can supply a more complete description of the entire use case (i.e. in terms of operational events and intended high-level end-user outcome, not focusing narrowly on how you would like to achieve this) we might be able to sign-post you elsewhere. However, as described I currently expect that we wouldn't address this.
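To illustrate the servicegroup behaviour: once the host is powered off, the service is reported as down while the instance record is untouched. A sketch of what an admin would see (output abbreviated and illustrative):

openstack compute service list --service nova-compute
# ... | nova-compute | compute-0.example.com | ... | enabled | down | ...

openstack server show <instance_id> -c OS-EXT-STS:vm_state
# OS-EXT-STS:vm_state still reports: active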
Hi,

Yes, I was sort of expecting this answer. I just find it strange that when a compute node goes down (power cut, etc.), an end user will see his/her VM as 'running' but not be able to reach it. When a compute node goes down, I would have expected that the DB would be updated for all the VMs on that compute host.

From an operator's point of view, it results in inaccurate information when creating reports on running VMs, or at least we have to add more intelligence to our scripts to account for this (see the sketch below).
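As an illustration of the extra scripting this currently requires, a report might cross-check each ACTIVE instance's host against the compute service state. A rough sketch, assuming admin credentials and standard python-openstackclient output fields:

#!/bin/bash
# Flag "running" VMs whose nova-compute service is down.
down_hosts=$(openstack compute service list --service nova-compute \
  -f value -c Host -c State | awk '$2 == "down" {print $1}')

openstack server list --all-projects --status ACTIVE \
  -f value -c ID -c Name | while read -r id name; do
  # OS-EXT-SRV-ATTR:host is an admin-only field
  host=$(openstack server show "$id" -f value -c OS-EXT-SRV-ATTR:host)
  if grep -qx "$host" <<< "$down_hosts"; then
    echo "WARNING: $name ($id) reports ACTIVE but host $host is down"
  fi
done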
So the API *does* check the service at the time of the call, but it only reports it via host_status, which can only be seen by admin:

<melwitt> looks like it might be this microversion for host_status https://docs.openstack.org/nova/latest/reference/api-microversion-history.html#id14

I'm going to turn this into an RFE to expose this better to users. The best current idea is to report 'unknown' for power state if the host state is also unknown. Fundamentally, if the compute host is down we have no way to discover the hypervisor state, so I think this is the best available information. Do you think that would resolve your use case?
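For reference, host_status was added in compute API microversion 2.16 (the entry in the linked history page), so an admin can request it explicitly:

# With the compute host down, host_status reports UNKNOWN
openstack --os-compute-api-version 2.16 server show <instance_id> -c host_status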
This is an issue that comes back every now and then. I guess having the state of the VMs marked as "unknown" if the nova-compute service is marked as "DOWN" or "unknown" would be a reasonable alternative.
(In reply to David Hill from comment #10)
> This is an issue that comes back every now and then. I guess having the
> state of the VMs marked as "unknown" if the nova-compute service is marked
> as "DOWN" or "unknown" would be a reasonable alternative.

That's been done upstream, though I'm having trouble finding the exact patch.
(In reply to Artom Lifshitz from comment #11)
> That's been done upstream, though I'm having trouble finding the exact patch.

Correction - what's been done [1] is adding a policy rule that exposes the host_status field to everyone, but only when the host_status is UNKNOWN. This is the indication that something is wrong on the host, and that vm_state and power_state are not to be trusted.

[1] https://review.opendev.org/679181
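For operators on a release that includes [1], enabling this would be a policy change. A hypothetical snippet (the rule name is taken from the upstream change and the value is illustrative; verify both against your Nova release before relying on them):

# Let non-admin users see host_status, but only when it is UNKNOWN
cat >> /etc/nova/policy.yaml <<'EOF'
"os_compute_api:servers:show:host_status:unknown-only": "rule:admin_or_owner"
EOF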
And one more piece of info: https://bugzilla.redhat.com/show_bug.cgi?id=1672972 is the downstream RFE for the previous comment's work.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days