Description of problem:
When deploying several images at the same time we get errors, probably because several processes try to uncompress the same base image at the same time, in the same location, when using shared storage.

How reproducible:
- When we first deploy multiple VMs from a fresh image, many of them fail.

Workaround (but neither space-efficient nor good for performance):
Changing image_cache_subdirectory_name to _base_$my_ip seems to solve the problem (vs. the default _base, which is shared by all compute nodes):
image_cache_subdirectory_name=_base_$my_ip

Actual results:
Either we get errors or we use a lot of CPU/storage.

Expected results:
Implement locking so this is avoided.

Additional info:
The problem seems to be described here: https://lists.launchpad.net/openstack/msg16564.html
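For clarity, the workaround amounts to the nova.conf change below (assuming the option sits in the DEFAULT section on this release); it avoids the race by giving every compute node its own image cache, at the cost of duplicating base images on each host:

    [DEFAULT]
    # default is _base, shared by all compute nodes on the shared storage
    image_cache_subdirectory_name = _base_$my_ip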
What type of shared storage? If NFS, can you make sure that version 4 is being used? v4 is required for locking to work.
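(If it is NFS, forcing v4 on the instances mount would look something like the line below; the server and export names are just placeholders.)

    mount -t nfs -o vers=4 nfs-server:/export/nova /var/lib/nova/instances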
Hi,
It's Ceph, according to the details they provided during the onsite visit.
Regards,
Pablo
Russell,
Is there anything else I should be providing that would help diagnose this?
Thanks!
Pablo
Pádraig,
As the case has been assigned to you, is there any extra information needed from the customer?
Thanks,
Pablo
The references (gerrit review, ML post) point to a shared locks directory that was implemented long ago; that is the solution for this.

Is the config option "lock_path" set? If so, is it set to a value on the same shared storage? If it's not set, it defaults to a subdirectory called "locks" under instances_path. Is this locks directory present on the same shared storage used for the instance storage?
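Concretely, with everything on the shared filesystem the relevant nova.conf settings would look roughly like this (paths are illustrative, and I'm assuming the options live in the DEFAULT section on this release):

    [DEFAULT]
    instances_path = /cloudfs/nova
    # Either leave lock_path unset, so the image-cache locks land in
    # $instances_path/locks, or point it at the shared storage explicitly:
    lock_path = /cloudfs/nova/locks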
Hi Russell,
lock_path is set in nova.conf to /var/lib/nova/tmp.
I'm pointing them to move that directory onto the shared storage.
Regards,
Pablo
(In reply to Pablo Iranzo Gómez from comment #8)
> Hi Rusell,
> lock_path is set at nova.conf at /var/lib/nova/tmp
>
> I'm pointing them to use the folder to be in the shared storage.
>
> Regards,
> Pablo

OK, thanks for checking. If that directory was indeed not on the shared storage, that would explain this problem.

I'm going to close this out for now, but please re-open and contact me directly if there's still a problem after this config fix.
lock_path needs to be set to a specific value for other reasons, mentioned in bug 961557, but I think that's OK, as nova should use a shared lock directory where required. Digging into the logs, this seems to be the case, as we have:

Got file lock "56f350a9c08f513350b6bc8911fb6acb0aa3e852" at /cloudfs/nova/locks/nova-56f350a9c08f513350b6bc8911fb6acb0aa3e852

i.e. /cloudfs/nova/ is the instances path in this case, and nova then uses /cloudfs/nova/locks/... for locking.

Now, there was a problematic POSIX IPC locking implementation introduced recently (already fixed) which could explain this, though that code should never have hit Icehouse. So my hunch at this stage is a general locking logic error in nova, as I've not been able to find reference to any fcntl locking issues with ceph, where support has been in place for a long time: http://tracker.ceph.com/issues/23

Extracting the particular failures from the logs:

2014-07-27 nova.compute.manager [instance: ...] File "/usr/lib/python2.6/site-packages/nova/virt/images.py", line 123, in fetch_to_raw
2014-07-27 nova.compute.manager [instance: ...] ImageUnacceptable: Image 8436fdb2-f688-4eb1-857c-f06c5d07b6be is unacceptable: Converted to raw, but format is now None

2014-07-27 nova.compute.manager [instance: ...] File "/usr/lib/python2.6/site-packages/nova/virt/images.py", line 116, in fetch_to_raw
2014-07-27 nova.compute.manager [instance: ...] ProcessExecutionError: Unexpected error while running command.
2014-07-27 nova.compute.manager [instance: ...] Command: qemu-img convert -O raw /cloudfs/nova/_base/56...52.part /cloudfs/nova/_base/56...52.converted
2014-07-27 nova.compute.manager [instance: ...] Exit code: 1
2014-07-27 nova.compute.manager [instance: ...] Stderr: 'error while reading sector 18284544: Input/output error\n'
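For context, the serialization nova relies on here boils down to an fcntl advisory lock taken on a per-image file under that shared locks directory. A minimal sketch of the pattern (illustrative only, not nova's actual lockutils code; the function and file names are made up):

    import fcntl
    import os

    def hold_image_lock(lock_dir, image_hash):
        """Take an exclusive advisory lock on a per-image lock file.

        This only serializes work across hosts if lock_dir is on shared
        storage whose client honours fcntl/POSIX locks.
        """
        path = os.path.join(lock_dir, 'nova-%s' % image_hash)
        fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
        fcntl.lockf(fd, fcntl.LOCK_EX)   # blocks until no other holder, on any host
        return fd                        # release: fcntl.lockf(fd, fcntl.LOCK_UN); os.close(fd)

If the second compute node's lockf() call does not actually block while the first node still holds the lock, both nodes run the fetch/convert on the same _base file concurrently, which would produce exactly the partial-read/IO errors above.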
Russell,
The customer had the issue with the Beta while using that lock_path; they'll be testing with GA again and will provide feedback.

Pádraig, should we revert lock_path back to the defaults on GA to retest? Is there any estimate on when this issue will be fixed?

Thanks
It would be good to test GA with lock_path set to the default. If there are still issues, then it's worth testing with lock_path set to /cloudfs/nova/locks/. That should not be needed, but if it helps it would indicate that there are locks that are not appropriately annotated within nova.
Digging further, fcntl locking has been supported by the cephfs kernel client for a long time. There have been bugs fixed there recently, though I suspect they're not a factor here. However, the cephfs-fuse client does not currently support fcntl locking, so I presume that is what is being used in this case?

Note that support for fcntl locking has very recently been added to the fuse client: https://github.com/ceph/ceph/commit/a1b2c8ff9 and this will be in the Hammer release. Until then, the workaround of using NFS for the locking is the best solution.

I'm closing this, as there is nothing we can change in Nova to improve the situation here.
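One way to confirm which behaviour the client in use gives you is a quick two-node test with a throwaway script (hypothetical, paths are only examples): run it in "hold" mode on one compute node, then in "probe" mode on another node against the same file on the shared mount.

    import fcntl
    import os
    import sys
    import time

    # usage: python lockcheck.py hold|probe /cloudfs/nova/locks/lockcheck
    mode, path = sys.argv[1], sys.argv[2]
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    if mode == 'hold':
        fcntl.lockf(fd, fcntl.LOCK_EX)
        print 'holding exclusive lock on %s; leave this running' % path
        time.sleep(600)
    else:
        try:
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            print 'lock acquired anyway: locks are NOT effective across hosts'
        except IOError:
            print 'lock is busy: cross-host fcntl locking appears to work'

If the probe on the second node succeeds while the first node is still holding the lock, the locks are only local to each host, which matches the failure mode described in this bug.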