Description of problem: We are testing Instance HA, but this can be reproduced without the whole pacemaker setup. I am using the scratch build I was provided to address another bug: 2015.1.0-9 across the board.

We have no shared storage in this setup (yes, it's for testing purposes only), and we configure and invoke nova evacuation without the --on-shared-storage option.

One compute node was running 7 instances and we failed it by crashing the kernel. Of the 7 VMs, 2 failed with the following error:

+--------------------------------------+----------------------------------------------------------------------+
| Property                             | Value                                                                |
+--------------------------------------+----------------------------------------------------------------------+
| OS-DCF:diskConfig                    | MANUAL                                                               |
| OS-EXT-AZ:availability_zone          | nova                                                                 |
| OS-EXT-SRV-ATTR:host                 | mrg-09.mpc.lab.eng.bos.redhat.com                                    |
| OS-EXT-SRV-ATTR:hypervisor_hostname  | mrg-09.mpc.lab.eng.bos.redhat.com                                    |
| OS-EXT-SRV-ATTR:instance_name        | instance-0000028e                                                    |
| OS-EXT-STS:power_state               | 1                                                                    |
| OS-EXT-STS:task_state                | -                                                                    |
| OS-EXT-STS:vm_state                  | error                                                                |
| OS-SRV-USG:launched_at               | 2015-06-11T13:28:38.000000                                           |
| OS-SRV-USG:terminated_at             | -                                                                    |
| accessIPv4                           |                                                                      |
| accessIPv6                           |                                                                      |
| config_drive                         |                                                                      |
| created                              | 2015-06-11T13:13:29Z                                                 |
| fault                                | {"message": "Invalid state of instance files on shared storage",     |
|                                      | "code": 500, "details": "  File \"/usr/lib/python2.7/site-packages/  |
|                                      | nova/compute/manager.py\", line 343, in decorated_function           |
|                                      |     return function(self, context, *args, **kwargs)                  |
|                                      |   File \"/usr/lib/python2.7/site-packages/nova/compute/manager.py\", |
|                                      | line 2947, in rebuild_instance                                       |
|                                      |     _(\"Invalid state of instance files on shared\"                  |
|                                      | ", "created": "2015-06-11T13:34:43Z"}                                |
| flavor                               | m1.tiny (1)                                                          |
| hostId                               | dc0ee1ecf403c0bc45b5ab410c457032d4c8cb0675c7125ae8fa473a             |
| id                                   | e7e4c891-aa27-485d-a408-3b899cf95f26                                 |
| image                                | cirros (943df9b3-c684-44e3-9ad2-86a11c6c4265)                        |
| internal_lan network                 | 192.168.100.218, 10.16.144.83                                        |
| key_name                             | -                                                                    |
| metadata                             | {}                                                                   |
| name                                 | test-7                                                               |
| os-extended-volumes:volumes_attached | []                                                                   |
| security_groups                      | default                                                              |
| status                               | ERROR                                                                |
| tenant_id                            | 32bb46c0ef7340db94a58742ac6fe1e7                                     |
| updated                              | 2015-06-11T13:34:43Z                                                 |
| user_id                              | a7e7bea4352d498cb1278c233f6dc4a7                                     |
+--------------------------------------+----------------------------------------------------------------------+

That error doesn't really make sense, because there is no shared storage.
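For context, the traceback points at a consistency check in rebuild_instance: the --on-shared-storage flag passed to the evacuation must agree with whether the target host can already see the instance's files on disk. A minimal sketch of that logic (names and structure are illustrative, not the exact nova implementation):

```python
# Sketch of the check behind "Invalid state of instance files on shared
# storage", based on the traceback (nova/compute/manager.py,
# rebuild_instance). Hypothetical names, not the real nova code.

class InvalidSharedStorage(Exception):
    """Raised when the evacuation flag and on-disk state disagree."""


def check_instance_files(on_shared_storage, instance_files_on_disk):
    """Evacuation sanity check.

    on_shared_storage:      True if `nova evacuate --on-shared-storage`
                            was used.
    instance_files_on_disk: True if the rebuild target host can already
                            see the instance's disk files (i.e. the
                            instances path really is shared).
    The two must agree; otherwise the rebuild is refused and the
    instance goes to ERROR.
    """
    if on_shared_storage != instance_files_on_disk:
        raise InvalidSharedStorage(
            "Invalid state of instance files on shared storage")
```

In this bug the evacuation ran WITHOUT --on-shared-storage on a setup with no shared storage, so the two sides should have agreed; the check still fired for 2 of the 7 instances.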
I have been able to trigger this problem with shared storage as well. Raising severity.
I have tested the scratch build provided to me here: http://download.devel.redhat.com/brewroot/work/tasks/7275/9347275/ which is supposed to be 2015.1.0-4 plus the fix for #1230237, and I have successfully tested failover and creation of instances for over 5 hours without any glitch. At this point I can only suspect a regression between .4 and .9.
One extra piece of information that might be useful. When I first switched from local to shared storage with the .8+patch build, I followed this process:

1) stop nova everywhere
2) wipe /var/lib/nova/instances clean on all nodes
3) mount the NFS export on /var/lib/nova/instances (it was already clean)
4) start nova again across the board

I recall pretty clearly that the /var/lib/nova/instances/compute_nodes file was NOT there. I was looking for it out of curiosity (since I saw it on the non-shared-storage installation) and I was interested to see how its contents change with shared storage. I thought that was normal and did not give it any weight. After rolling back to .4+patch (stop everything, wipe everything, downgrade, start), the file is now there, with all the relevant info about the registered compute nodes that can access a given shared storage. Perhaps that could be part of the reason why we see the problem with shared storage. Maybe it's not relevant at all, but I thought it might be good to know anyway.
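For reference, the compute_nodes file mentioned above acts as a small registry kept inside the shared instances path, recording which compute hosts can reach that storage. A rough sketch of how such a registry could be maintained (hypothetical helper names and JSON layout; not the actual nova code):

```python
import json
import os
import time

# Hypothetical sketch of a "compute_nodes" registry stored in the shared
# instances path. Nova keeps a similar file so hosts can tell which other
# compute nodes share the same storage; names here are illustrative.

def register_storage_use(storage_path, hostname):
    """Record that `hostname` is using this instances directory."""
    registry_file = os.path.join(storage_path, "compute_nodes")
    nodes = {}
    if os.path.exists(registry_file):
        with open(registry_file) as f:
            nodes = json.load(f)
    nodes[hostname] = time.time()  # last-seen timestamp
    with open(registry_file, "w") as f:
        json.dump(nodes, f)


def get_storage_users(storage_path):
    """Return the hostnames registered against this instances directory."""
    registry_file = os.path.join(storage_path, "compute_nodes")
    if not os.path.exists(registry_file):
        return []
    with open(registry_file) as f:
        return list(json.load(f).keys())
```

If such a file were never written after the storage switch, other hosts would have no record that the failed node's instances path was shared, which might relate to the symptom described above.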
After a full redeploy with the .10 packages, I have been unable to reproduce this problem (with shared storage). I am lowering the priority, even though the severity remains unchanged (due to the potential impact on customers). I suspect that the move from non-shared to shared storage confused the internal state of affairs (even though /var/lib/nova/instances was properly wiped everywhere while the services were shut down). On a fresh install the problem does not happen. Perhaps there is a flag somewhere in the db that's not updated properly? Just a guess at this point.
Hi Fabio, any further re-occurrences of this?
I haven't seen it since comment #6 with shared storage. No testing has been done without shared storage.
Since we have had no reports of this being reproduced since https://bugzilla.redhat.com/show_bug.cgi?id=1230759#c6, where Fabio notes he was no longer seeing it with the .10 version of the packages, I am closing this. Please re-open if this issue re-occurs.