Bug 1230759

Summary: nova fails to evacuate instance due to invalid shared storage state
Product: Red Hat OpenStack
Component: openstack-nova
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Status: CLOSED INSUFFICIENT_DATA
Severity: high
Priority: medium
Target Milestone: z5
Target Release: 7.0 (Kilo)
Keywords: ZStream
Reporter: Fabio Massimo Di Nitto <fdinitto>
Assignee: Vladik Romanovsky <vromanso>
QA Contact: nlevinki <nlevinki>
CC: berrange, dasmith, eglynn, fdinitto, jschluet, kchamart, rscarazz, sbauza, sferdjao, sgordon, srevivo, vromanso
Doc Type: Bug Fix
Type: Bug
Clones: 1295603
Last Closed: 2017-07-18 13:45:53 UTC
Bug Blocks: 1185030, 1251948, 1261487, 1295603

Description Fabio Massimo Di Nitto 2015-06-11 13:42:20 UTC
Description of problem:

We are testing Instance HA but this can be reproduced without the whole pacemaker setup.

I am using the scratch build I was provided to address another bug:

2015.1.0-9 across the board

We have no shared storage in this setup (yes, it's for testing purposes only), and we configure and invoke nova evacuation without the --on-shared-storage option.
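
For reference, the kind of invocation described here can be sketched as below. This is a minimal illustration, assuming the Kilo-era `nova evacuate` CLI, which takes a server, an optional target host, and an optional `--on-shared-storage` flag. The helper name `build_evacuate_command` and its arguments are hypothetical and not taken from the setup in this report; the function only builds the argument list and executes nothing.

```python
def build_evacuate_command(server, target_host=None, on_shared_storage=False):
    """Return the argv list for a `nova evacuate` call (nothing is executed)."""
    cmd = ["nova", "evacuate"]
    if on_shared_storage:
        # Only pass the flag when the instance disks really live on shared storage.
        cmd.append("--on-shared-storage")
    cmd.append(server)
    if target_host is not None:
        # The target host is optional; without it the scheduler picks one.
        cmd.append(target_host)
    return cmd


# Evacuation as described in this report: no --on-shared-storage flag.
print(build_evacuate_command("test-7"))
```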

One compute node was running 7 instances and we failed it by crashing the kernel.

Of the 7 VMs, 2 failed with the following error:

+--------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Property                             | Value                                                                                                                                                                                          |
+--------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| OS-DCF:diskConfig                    | MANUAL                                                                                                                                                                                         |
| OS-EXT-AZ:availability_zone          | nova                                                                                                                                                                                           |
| OS-EXT-SRV-ATTR:host                 | mrg-09.mpc.lab.eng.bos.redhat.com                                                                                                                                                              |
| OS-EXT-SRV-ATTR:hypervisor_hostname  | mrg-09.mpc.lab.eng.bos.redhat.com                                                                                                                                                              |
| OS-EXT-SRV-ATTR:instance_name        | instance-0000028e                                                                                                                                                                              |
| OS-EXT-STS:power_state               | 1                                                                                                                                                                                              |
| OS-EXT-STS:task_state                | -                                                                                                                                                                                              |
| OS-EXT-STS:vm_state                  | error                                                                                                                                                                                          |
| OS-SRV-USG:launched_at               | 2015-06-11T13:28:38.000000                                                                                                                                                                     |
| OS-SRV-USG:terminated_at             | -                                                                                                                                                                                              |
| accessIPv4                           |                                                                                                                                                                                                |
| accessIPv6                           |                                                                                                                                                                                                |
| config_drive                         |                                                                                                                                                                                                |
| created                              | 2015-06-11T13:13:29Z                                                                                                                                                                           |
| fault                                | {"message": "Invalid state of instance files on shared storage", "code": 500, "details": "  File \"/usr/lib/python2.7/site-packages/nova/compute/manager.py\", line 343, in decorated_function |
|                                      |     return function(self, context, *args, **kwargs)                                                                                                                                            |
|                                      |   File \"/usr/lib/python2.7/site-packages/nova/compute/manager.py\", line 2947, in rebuild_instance                                                                                            |
|                                      |     _(\"Invalid state of instance files on shared\"                                                                                                                                            |
|                                      | ", "created": "2015-06-11T13:34:43Z"}                                                                                                                                                          |
| flavor                               | m1.tiny (1)                                                                                                                                                                                    |
| hostId                               | dc0ee1ecf403c0bc45b5ab410c457032d4c8cb0675c7125ae8fa473a                                                                                                                                       |
| id                                   | e7e4c891-aa27-485d-a408-3b899cf95f26                                                                                                                                                           |
| image                                | cirros (943df9b3-c684-44e3-9ad2-86a11c6c4265)                                                                                                                                                  |
| internal_lan network                 | 192.168.100.218, 10.16.144.83                                                                                                                                                                  |
| key_name                             | -                                                                                                                                                                                              |
| metadata                             | {}                                                                                                                                                                                             |
| name                                 | test-7                                                                                                                                                                                         |
| os-extended-volumes:volumes_attached | []                                                                                                                                                                                             |
| security_groups                      | default                                                                                                                                                                                        |
| status                               | ERROR                                                                                                                                                                                          |
| tenant_id                            | 32bb46c0ef7340db94a58742ac6fe1e7                                                                                                                                                               |
| updated                              | 2015-06-11T13:34:43Z                                                                                                                                                                           |
| user_id                              | a7e7bea4352d498cb1278c233f6dc4a7                                                                                                                                                               |
+--------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

That doesn't really make sense because there is no shared storage.
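
One possible reading of the fault, sketched under assumptions: the traceback points at a check in `rebuild_instance`, and the Kilo-era code there appears to compare the `on_shared_storage` value supplied by the caller with whether the destination host already sees the instance's files on disk, raising when the two disagree. The class and function below are a simplified, hypothetical model of that comparison and not the actual nova code.

```python
class InvalidSharedStorage(Exception):
    """Stand-in for the exception behind the fault message in the report."""


def check_shared_storage(on_shared_storage, instance_files_on_dest):
    """Model of the consistency check; both arguments are booleans.

    on_shared_storage: what the evacuate caller claimed.
    instance_files_on_dest: whether the destination host already sees the
    instance's files on disk (hypothetical input; in nova this would come
    from the virt driver).
    """
    if on_shared_storage != instance_files_on_dest:
        raise InvalidSharedStorage(
            "Invalid state of instance files on shared storage")


# No shared storage claimed and nothing on the destination: passes.
check_shared_storage(False, False)

# No shared storage claimed, but files for the instance are found on the
# destination (for example a leftover directory): the model raises.
try:
    check_shared_storage(False, True)
except InvalidSharedStorage as exc:
    print(exc)
```

Under this reading, the error does not require shared storage to exist; it only requires the flag and the on-disk state seen by the destination host to disagree.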

Comment 3 Fabio Massimo Di Nitto 2015-06-11 15:40:45 UTC
I have also been able to trigger this problem with shared storage.

Raising severity.

Comment 4 Fabio Massimo Di Nitto 2015-06-12 09:01:29 UTC
I have tested the scratch build provided to me here:

http://download.devel.redhat.com/brewroot/work/tasks/7275/9347275/

that is supposed to be 2015.1.0-4 + the fix for #1230237, and I have successfully tested failover and creation of instances for over 5 hours without any glitch.

I can only suspect a regression between .4 and .9 at this point.

Comment 5 Fabio Massimo Di Nitto 2015-06-13 05:26:44 UTC
One extra piece of information that might be useful.

When I first switched from local to shared storage with the .8+patch build, I followed this process:

1) stop nova everywhere
2) wipe /var/lib/nova/instances clean on all nodes
3) mount the NFS export on /var/lib/nova/instances
   (it was already clean)
4) start nova again across the board
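
The ordering of these steps can be sketched as a dry-run plan. This is only an illustration: the node names, the list of nova services, and the NFS export string are hypothetical parameters supplied by the caller, and the commands (`systemctl`, `find ... -delete`, `mount -t nfs`) are assumptions about how each step might be carried out, not what was actually run. The function returns the steps and executes nothing.

```python
INSTANCES_DIR = "/var/lib/nova/instances"


def build_switch_plan(nodes, nova_services, nfs_export):
    """Return an ordered list of (node, argv) steps; nothing is executed."""
    plan = []
    # 1) Stop nova everywhere before touching any instance directory.
    for node in nodes:
        for service in nova_services:
            plan.append((node, ["systemctl", "stop", service]))
    # 2) Wipe the contents of the instances directory on every node.
    for node in nodes:
        plan.append((node, ["find", INSTANCES_DIR, "-mindepth", "1", "-delete"]))
    # 3) Mount the NFS export on the instances directory.
    for node in nodes:
        plan.append((node, ["mount", "-t", "nfs", nfs_export, INSTANCES_DIR]))
    # 4) Start nova again across the board.
    for node in nodes:
        for service in nova_services:
            plan.append((node, ["systemctl", "start", service]))
    return plan
```

Each phase completes on all nodes before the next one begins, so no nova service is running anywhere while any instance directory is being wiped or remounted.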

I recall pretty clearly that the /var/lib/nova/instances/compute_nodes file was NOT there. I was looking for it out of curiosity (since I had seen it on a non-shared-storage installation), and I was interested to see how its contents change with shared storage. I thought that was normal and did not give it any weight.

After rolling back to .4+patch (stop everything, wipe everything, downgrade, start), the file is now there with all the relevant info about the registered compute nodes that can access a given shared storage.
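
To compare the file between the two setups, a small reader along these lines could help. It is a sketch that assumes the file is a JSON object mapping each compute hostname to the timestamp of its last registration, which is how nova's storage-use tracking appears to write it. The function name and the directory argument are illustrative.

```python
import json
import os


def read_registered_compute_nodes(instances_dir="/var/lib/nova/instances"):
    """Return {hostname: timestamp} from compute_nodes, or None if the file is absent."""
    path = os.path.join(instances_dir, "compute_nodes")
    if not os.path.exists(path):
        # Matches the situation described for the .8+patch build.
        return None
    with open(path) as f:
        return json.load(f)
```

Returning `None` for a missing file keeps that case distinguishable from a file that exists but lists no hosts.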

Perhaps that could be part of the reason why we see the problem with shared storage. Maybe it's not relevant at all, but I thought it might be good to know anyway.

Comment 6 Fabio Massimo Di Nitto 2015-06-16 09:02:28 UTC
After a full redeploy with the .10 packages, I have been unable to reproduce this problem (with shared storage).

I am lowering the priority, even though the severity remains unchanged (due to the potential impact on customers).

I suspect that the move from non-shared to shared storage confused the internal state (even though all /var/lib/instances were properly wiped while the services were shut down). On a fresh install the problem does not happen.

Perhaps there is a flag somewhere in the DB that's not updated properly? Just a guess at this point.

Comment 7 Stephen Gordon 2015-07-16 14:22:19 UTC
Hi Fabio, any further re-occurrences of this?

Comment 8 Fabio Massimo Di Nitto 2015-07-16 14:30:34 UTC
I haven't seen it since comment #6 with shared storage. No testing has been done without shared storage.

Comment 14 Stephen Gordon 2017-07-18 13:45:53 UTC
Since we haven't had any reports of this being reproduced since https://bugzilla.redhat.com/show_bug.cgi?id=1230759#c6, where Fabio notes he was not seeing it with the .10 version of the packages, I am closing this. Please re-open if this issue recurs.