Bug 1114878
| Summary: | Problem deploying multiple VM's with shared image_cached | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Pablo Iranzo Gómez <pablo.iranzo> |
| Component: | openstack-nova | Assignee: | Pádraig Brady <pbrady> |
| Status: | CLOSED CANTFIX | QA Contact: | Toure Dunnon <tdunnon> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 5.0 (RHEL 6) | CC: | dasmith, gfarnum, ndipanov, pablo.iranzo, pbrady, sclewis, sgordon, yeylon |
| Target Milestone: | z3 | Keywords: | Reopened, ZStream |
| Target Release: | 5.0 (RHEL 6) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2014-10-24 22:27:28 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Pablo Iranzo Gómez
2014-07-01 08:03:21 UTC
What type of shared storage? If NFS, can you make sure that version 4 is being used? v4 is required for locking to work.

Hi, it's Ceph, according to the details from the onsite visit.

Regards,
Pablo

Russell, is there anything else I should be providing that would help to diagnose this?

Thanks!
Pablo

Pádraig, as the case has been assigned to you, is there any extra information needed from the customer?

Thanks,
Pablo

The references (gerrit review, ML post) refer to a shared locks directory that was implemented long ago. It's the solution for this. Is the config option "lock_path" set? If so, is it set to a value on the same shared storage? If it's not set, it defaults to a subdirectory called "locks" under instances_path. Is this locks directory present on the same shared storage used for the instance storage?

Hi Russell,

lock_path is set in nova.conf to /var/lib/nova/tmp. I'm pointing them to move that directory onto the shared storage.

Regards,
Pablo

(In reply to Pablo Iranzo Gómez from comment #8)
> Hi Russell,
> lock_path is set in nova.conf to /var/lib/nova/tmp.
>
> I'm pointing them to move that directory onto the shared storage.
>
> Regards,
> Pablo

OK, thanks for checking. If that directory was indeed not on the shared storage, that would explain this problem. I'm going to close this out for now, but please re-open and contact me directly if there's still a problem after this config fix.

lock_path needs to be set to a specific value for other reasons, mentioned in bug 961557, but I think that's OK, as nova should use a shared lock directory where required. Digging into the logs, this seems to be the case, as we have:

```
Got file lock "56f350a9c08f513350b6bc8911fb6acb0aa3e852" at /cloudfs/nova/locks/nova-56f350a9c08f513350b6bc8911fb6acb0aa3e852
```

That is, /cloudfs/nova/ is the instances path in this case, and nova then uses /cloudfs/nova/locks/... for locking. Now, there was a problematic POSIX IPC locking implementation introduced recently (already fixed) which could explain this, though that code should never have hit Icehouse, so my hunch at this stage is a general locking logic error in nova, as I've not been able to find any reference to fcntl locking issues with Ceph, where it has been implemented for a long time: http://tracker.ceph.com/issues/23

Extracting the particular failures from the logs:

```
2014-07-27 nova.compute.manager [instance: ...] File "/usr/lib/python2.6/site-packages/nova/virt/images.py", line 123, in fetch_to_raw
2014-07-27 nova.compute.manager [instance: ...] ImageUnacceptable: Image 8436fdb2-f688-4eb1-857c-f06c5d07b6be is unacceptable: Converted to raw, but format is now None
2014-07-27 nova.compute.manager [instance: ...] File "/usr/lib/python2.6/site-packages/nova/virt/images.py", line 116, in fetch_to_raw
2014-07-27 nova.compute.manager [instance: ...] ProcessExecutionError: Unexpected error while running command.
2014-07-27 nova.compute.manager [instance: ...] Command: qemu-img convert -O raw /cloudfs/nova/_base/56...52.part /cloudfs/nova/_base/56...52.converted
2014-07-27 nova.compute.manager [instance: ...] Exit code: 1
2014-07-27 nova.compute.manager [instance: ...] Stderr: 'error while reading sector 18284544: Input/output error\n'
```
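For context on those two tracebacks: nova's fetch_to_raw downloads the image to a `.part` file, converts it to raw as a `.converted` file, and then verifies the result's format. Below is a minimal sketch of that flow, assuming plain subprocess calls; the helper names and output parsing are illustrative, not the actual nova code:

```python
import subprocess


class ImageUnacceptable(Exception):
    pass


def qemu_img_format(path):
    # Parse "file format: <fmt>" from `qemu-img info`; return None if the
    # file is unreadable or has no recognisable format.
    try:
        out = subprocess.check_output(["qemu-img", "info", path])
    except subprocess.CalledProcessError:
        return None
    for line in out.decode().splitlines():
        if line.startswith("file format:"):
            return line.split(":", 1)[1].strip()
    return None


def fetch_to_raw(image_id, path):
    path_part = "%s.part" % path        # the .part file seen in the logs
    # ... image is downloaded from glance to path_part here ...

    if qemu_img_format(path_part) != "raw":
        staged = "%s.converted" % path  # the .converted file in the logs
        # If the source hits an I/O error mid-read ("error while reading
        # sector ..."), this command fails; nova wraps such failures in
        # ProcessExecutionError -- the first error in the logs.
        subprocess.check_call(
            ["qemu-img", "convert", "-O", "raw", path_part, staged])

        fmt = qemu_img_format(staged)
        if fmt != "raw":
            # The second error in the logs: the converted file is
            # unreadable or truncated, so no format is detected (None).
            raise ImageUnacceptable(
                "Image %s is unacceptable: Converted to raw, but format "
                "is now %s" % (image_id, fmt))
```

Both failure modes are consistent with another host concurrently truncating or rewriting the same cached base image, i.e. with the external locks not actually excluding other hosts.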
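On the locking side, the external locks discussed throughout this thread amount to taking an fcntl lock on a file under lock_path, which only excludes other compute hosts if the shared filesystem enforces fcntl locks between clients. A rough sketch of the mechanism, with illustrative paths and names (not nova's actual lockutils code):

```python
import fcntl
import os
from contextlib import contextmanager

LOCK_PATH = "/cloudfs/nova/locks"  # example: must be on the shared storage


@contextmanager
def external_lock(name):
    # Hold an exclusive fcntl lock on LOCK_PATH/nova-<name>. fcntl locks
    # coordinate across hosts only if the filesystem supports them
    # (NFSv4 does; the cephfs kernel client does; ceph-fuse did not).
    path = os.path.join(LOCK_PATH, "nova-%s" % name)
    with open(path, "w") as f:
        fcntl.lockf(f, fcntl.LOCK_EX)  # blocks until the lock is granted
        try:
            yield
        finally:
            fcntl.lockf(f, fcntl.LOCK_UN)


# Usage: only one host at a time should populate a cached base image, e.g.
# with external_lock("56f350a9c08f513350b6bc8911fb6acb0aa3e852"):
#     fetch_to_raw(image_id, base_path)
```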
Russell, the customer had the issue on Beta while using that lock_path; they'll be testing with GA again and will provide feedback. Pádraig, should we revert lock_path back to defaults on GA to retest? Is there any estimate for when this issue will be fixed?

Thanks

It would be good to test GA with lock_path set to defaults. If there are still issues, then it's worth testing with lock_path set to /cloudfs/nova/locks/. That should not be needed, but it would indicate that there are locks that are not appropriately annotated within nova.

Digging further: fcntl locking has been supported by the cephfs kernel client for a long time. There have been bugs fixed there recently, though I suspect they're not a factor here. However, the ceph-fuse client does not currently support fcntl locking, so I presume that is what is being used in this case? Note that support for fcntl locking has very recently been added to the fuse client (https://github.com/ceph/ceph/commit/a1b2c8ff9) and will be in the hammer release. Until then, the workaround of using NFS for the locking is the best solution.

I'm closing this as there is nothing we can change in Nova to improve the situation here.
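A simple way to check that hypothesis from the deployment is to run a small fcntl lock test on two hosts against the same file on the shared mount. If both hosts acquire the "exclusive" lock at the same time, the client is only doing local lock bookkeeping (as a FUSE client without lock support typically does) and cross-host locking is not working. The path below is illustrative:

```python
import fcntl
import time

PATH = "/cloudfs/nova/locks/locktest"  # any file on the shared mount

with open(PATH, "w") as f:
    fcntl.lockf(f, fcntl.LOCK_EX)  # exclusive fcntl (byte-range) lock
    print("acquired exclusive lock on %s; holding for 60s" % PATH)
    time.sleep(60)  # start the same script on the second host now
    fcntl.lockf(f, fcntl.LOCK_UN)
```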