Bug 1230759
| Summary: | nova fails to evacuate instance due to invalid shared storage state | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Fabio Massimo Di Nitto <fdinitto> |
| Component: | openstack-nova | Assignee: | Vladik Romanovsky <vromanso> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | nlevinki <nlevinki> |
| Severity: | high | Docs Contact: | |
| Priority: | medium | | |
| Version: | 7.0 (Kilo) | CC: | berrange, dasmith, eglynn, fdinitto, jschluet, kchamart, rscarazz, sbauza, sferdjao, sgordon, srevivo, vromanso |
| Target Milestone: | z5 | Keywords: | ZStream |
| Target Release: | 7.0 (Kilo) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1295603 (view as bug list) | Environment: | |
| Last Closed: | 2017-07-18 13:45:53 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1185030, 1251948, 1261487, 1295603 | | |
I have been able to trigger this problem also with shared storage. Raising severity.

I have tested the scratch build provided to me here: http://download.devel.redhat.com/brewroot/work/tasks/7275/9347275/ which is supposed to be 2015.1.0-4 + the fix for #1230237, and I have successfully tested failover and creation of instances for over 5 hours without any glitch. I can only suspect a regression between .4 and .9 at this point.

One extra piece of information that might be useful. When I first switched from local to shared storage with the .8+patch build I followed this process:

1) stop nova everywhere
2) wipe /var/lib/nova/instances clean on all nodes
3) mount the NFS export on /var/lib/nova/instances (it was already clean)
4) start nova again across the board

I recall pretty clearly that the /var/lib/nova/instances/compute_nodes file was NOT there. I was looking for it out of curiosity (since I had seen it on the non-shared-storage installation) and was interested to see how its contents change with shared storage. I thought that was normal and did not give it any weight. After rolling back to .4+patch (stop everything, wipe everything, downgrade, start), the file is now there with all the relevant info about registered compute nodes that can access a given shared storage (see the sketch after this thread for how that kind of shared-storage detection typically works). Perhaps that could be part of the reason why we see the problem with shared storage. Maybe it's not relevant at all, but I thought it might be good to know anyway.

After a full redeploy with .10 packages, I have been unable to reproduce this problem (with shared storage). I am lowering the priority, even though the severity remains unchanged (due to the potential impact on customers). I suspect that the move from non-shared to shared storage confused some internal state (even though all /var/lib/nova/instances directories were properly wiped while the services were shut down). On a fresh install the problem does not happen. Perhaps there is a flag somewhere in the db that is not updated properly? Just a guess at this point.

Hi Fabio, any further re-occurrences of this?

I haven't seen it since comment #6 with shared storage. No testing has been done without shared storage.

Since we haven't had any reports of this being reproduced since https://bugzilla.redhat.com/show_bug.cgi?id=1230759#c6, where Fabio notes he was not seeing it with the .10 version of the packages, I am closing this. Please re-open if this issue re-occurs.
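For background on the compute_nodes discussion above: whether two compute hosts really share /var/lib/nova/instances is usually established by checking that a file written on one host is visible on the other. The sketch below only illustrates that idea; the helper names and the explicit three-step flow are assumptions for illustration, not nova's actual API.

```python
# Illustrative only: a simplified marker-file probe of the kind that
# shared-storage detection is typically based on. Function names and the
# local/remote split are assumptions, not nova's real implementation.
import os
import uuid

INSTANCES_PATH = "/var/lib/nova/instances"  # default state path assumed here


def create_probe_file(instances_path=INSTANCES_PATH):
    """Run on host A: drop a uniquely named marker file into the
    instances directory and return its name."""
    name = "probe-%s" % uuid.uuid4().hex
    with open(os.path.join(instances_path, name), "w") as f:
        f.write("shared-storage probe\n")
    return name


def probe_file_visible(name, instances_path=INSTANCES_PATH):
    """Run on host B: if the marker written by host A is visible here,
    both hosts are backed by the same (shared) instances directory."""
    return os.path.exists(os.path.join(instances_path, name))


def cleanup_probe_file(name, instances_path=INSTANCES_PATH):
    """Run on host A afterwards to remove the marker."""
    try:
        os.unlink(os.path.join(instances_path, name))
    except OSError:
        pass
```

If stale state from an earlier local-storage deployment survives the switch to NFS (or vice versa), what such a probe reports and what the deployment has recorded can disagree, which is the kind of internal-state confusion suspected later in the thread.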
Description of problem:

We are testing Instance HA, but this can be reproduced without the whole pacemaker setup. I am using the scratch build I was provided to address another bug: 2015.1.0-9 across the board.

We have no shared storage in this setup (yes, it's for testing purposes only) and we configure and invoke nova evacuation without the --on-shared-storage option.

One compute node was running 7 instances and we failed it by crashing the kernel. Of the 7 VMs, 2 failed with the following error:

```
+--------------------------------------+------------------------------------------------------------+
| Property                             | Value                                                        |
+--------------------------------------+------------------------------------------------------------+
| OS-DCF:diskConfig                    | MANUAL |
| OS-EXT-AZ:availability_zone          | nova |
| OS-EXT-SRV-ATTR:host                 | mrg-09.mpc.lab.eng.bos.redhat.com |
| OS-EXT-SRV-ATTR:hypervisor_hostname  | mrg-09.mpc.lab.eng.bos.redhat.com |
| OS-EXT-SRV-ATTR:instance_name        | instance-0000028e |
| OS-EXT-STS:power_state               | 1 |
| OS-EXT-STS:task_state                | - |
| OS-EXT-STS:vm_state                  | error |
| OS-SRV-USG:launched_at               | 2015-06-11T13:28:38.000000 |
| OS-SRV-USG:terminated_at             | - |
| accessIPv4                           | |
| accessIPv6                           | |
| config_drive                         | |
| created                              | 2015-06-11T13:13:29Z |
| fault                                | {"message": "Invalid state of instance files on shared storage", "code": 500, "details": "  File \"/usr/lib/python2.7/site-packages/nova/compute/manager.py\", line 343, in decorated_function |
|                                      |     return function(self, context, *args, **kwargs) |
|                                      |   File \"/usr/lib/python2.7/site-packages/nova/compute/manager.py\", line 2947, in rebuild_instance |
|                                      |     _(\"Invalid state of instance files on shared\" |
|                                      | ", "created": "2015-06-11T13:34:43Z"} |
| flavor                               | m1.tiny (1) |
| hostId                               | dc0ee1ecf403c0bc45b5ab410c457032d4c8cb0675c7125ae8fa473a |
| id                                   | e7e4c891-aa27-485d-a408-3b899cf95f26 |
| image                                | cirros (943df9b3-c684-44e3-9ad2-86a11c6c4265) |
| internal_lan network                 | 192.168.100.218, 10.16.144.83 |
| key_name                             | - |
| metadata                             | {} |
| name                                 | test-7 |
| os-extended-volumes:volumes_attached | [] |
| security_groups                      | default |
| status                               | ERROR |
| tenant_id                            | 32bb46c0ef7340db94a58742ac6fe1e7 |
| updated                              | 2015-06-11T13:34:43Z |
| user_id                              | a7e7bea4352d498cb1278c233f6dc4a7 |
+--------------------------------------+------------------------------------------------------------+
```

That doesn't really make sense because there is no shared storage (see the sketch below for where this error comes from).
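The fault's traceback points at the shared-storage consistency check in rebuild_instance (nova/compute/manager.py). Below is a minimal sketch of that kind of check, assuming Kilo-era behaviour where `nova evacuate` without --on-shared-storage passes on_shared_storage=False; the function and class names are illustrative, not the literal nova code.

```python
# Simplified sketch of the consistency check behind the
# "Invalid state of instance files on shared storage" fault.
# Names are illustrative; this is not the literal nova code.

class InvalidSharedStorage(Exception):
    """Raised when the caller's shared-storage claim does not match reality."""


def rebuild_for_evacuation(instance_files_found_on_dest, on_shared_storage):
    """Decide how to rebuild an evacuated instance on the destination host.

    instance_files_found_on_dest: True if the instance's disk files are
        already visible on the destination host (i.e. they live on storage
        the destination can reach).
    on_shared_storage: the flag the operator passed to the evacuate call
        (False when --on-shared-storage is omitted, as in this report).
    """
    if on_shared_storage != instance_files_found_on_dest:
        # The operator's claim and the detected state disagree: refuse to
        # rebuild rather than guess. This is the error the two failed VMs
        # show above.
        raise InvalidSharedStorage(
            "Invalid state of instance files on shared storage")

    if on_shared_storage:
        # Reuse the existing disks from the shared instances directory.
        return "rebuild reusing existing instance files"
    # No shared storage: recreate the disks from the image on the new host.
    return "rebuild from image"
```

Under that reading, the error means that for 2 of the 7 VMs the destination host's view of where the instance files live disagreed with the on_shared_storage=False flag; why that happened on a deployment with no shared storage is what remained unexplained in this bug.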