Bug 1114878 - Problem deploying multiple VMs with shared image cache
Summary: Problem deploying multiple VMs with shared image cache
Keywords:
Status: CLOSED CANTFIX
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 5.0 (RHEL 6)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: z3
Target Release: 5.0 (RHEL 6)
Assignee: Pádraig Brady
QA Contact: Toure Dunnon
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2014-07-01 08:03 UTC by Pablo Iranzo Gómez
Modified: 2019-09-09 16:04 UTC
CC List: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-10-24 22:27:28 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 12024 0 None None None Never
Red Hat Bugzilla 1111295 0 high CLOSED Improve Ceph integration 2021-02-22 00:41:40 UTC

Internal Links: 1111295

Description Pablo Iranzo Gómez 2014-07-01 08:03:21 UTC
Description of problem:

When deploying several instances at the same time, we get errors, probably because several processes try to uncompress the same base image at the same time into the same location when using shared storage.


How reproducible:


- When we first deploy multiple VMs from a fresh image, many fail.

Workaround (but not space- or performance-efficient):

Changing image_cache_subdirectory_name to _base_$my_ip seems to solve the problem (vs. the default _base, which is shared across all compute nodes):

image_cache_subdirectory_name=_base_$my_ip
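
For reference, a minimal nova.conf sketch of that workaround (the section placement is my assumption for this release; $my_ip is interpolated per compute node, so each host ends up with its own base-image cache directory under instances_path):

  [DEFAULT]
  # Hypothetical example: give each compute node its own image cache
  # subdirectory instead of the shared default "_base".
  image_cache_subdirectory_name=_base_$my_ip

The trade-off is the one noted above: each node downloads and converts its own copy of the base image, which costs extra storage and CPU.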
 
Actual results:

Either we get errors, or (with the workaround) we use a lot of CPU/storage.


Expected results:

Implement locking so this is avoided


Additional info:

Problem seems to be described here: https://lists.launchpad.net/openstack/msg16564.html

Comment 2 Russell Bryant 2014-07-02 13:16:10 UTC
What type of shared storage?

If NFS, can you make sure that version 4 is being used?  v4 is required for locking to work.

Comment 3 Pablo Iranzo Gómez 2014-07-02 22:22:09 UTC
Hi,
It's Ceph, according to the details they provided during the onsite visit.

Regards,
Pablo

Comment 4 Pablo Iranzo Gómez 2014-07-07 11:10:03 UTC
Russell,
Is there anything else I should provide to help diagnose this?

Thanks!
Pablo

Comment 5 Pablo Iranzo Gómez 2014-07-10 08:49:52 UTC
Pádraig, as the case has been assigned to you, is there any extra information needed from the customer?

Thanks,
Pablo

Comment 7 Russell Bryant 2014-07-28 18:21:46 UTC
The references (gerrit review, ML post) refer to a shared locks directory that was implemented long ago.  It's the solution for this.

Is the config option "lock_path" set?  If so, is it set to a value on the same shared storage?

If it's not set, it defaults to a subdirectory called "locks" under instances_path.

Is this locks directory present on the same shared storage used for the instance storage?
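
To illustrate (a sketch only; the /cloudfs/nova paths are the ones that appear later in this bug's logs, used purely as an example), the intent is roughly:

  [DEFAULT]
  # Instance storage lives on the shared filesystem
  instances_path=/cloudfs/nova
  # Either leave lock_path unset so locks default to <instances_path>/locks,
  # or point it explicitly at a directory on that same shared storage:
  lock_path=/cloudfs/nova/locks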

Comment 8 Pablo Iranzo Gómez 2014-07-28 18:42:29 UTC
Hi Russell,
lock_path is set in nova.conf to /var/lib/nova/tmp.

I'm pointing them to move that directory onto the shared storage.

Regards,
Pablo

Comment 9 Russell Bryant 2014-07-29 12:57:55 UTC
(In reply to Pablo Iranzo Gómez from comment #8)
> Hi Russell,
> lock_path is set in nova.conf to /var/lib/nova/tmp.
> 
> I'm pointing them to move that directory onto the shared storage.
> 
> Regards,
> Pablo

OK, thanks for checking.  If that directory was indeed not on the shared storage, that would explain this problem.  I'm going to close this out for now, but please re-open and contact me directly if there's still a problem after this config fix.

Comment 10 Pádraig Brady 2014-07-29 17:19:04 UTC
lock_path needs to be set to a specific value for other reasons mentioned in bug 961557,
but I think that's OK, as nova should use a shared lock directory where required.
Digging into the logs, this seems to be the case, as we have:

  Got file lock "56f350a9c08f513350b6bc8911fb6acb0aa3e852"
  at /cloudfs/nova/locks/nova-56f350a9c08f513350b6bc8911fb6acb0aa3e852

I.E. /cloudfs/nova/ is the instances path in this case, and nova then uses
/cloudfs/nova/locks/... for locking.
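
Purely as an illustration of the mechanism (not nova's actual code), this kind of external lock boils down to an fcntl lock on a file under the shared locks directory, roughly:

  # Minimal sketch of cross-host file locking on shared storage
  # (illustrative only; names and paths are hypothetical).
  import fcntl
  import os

  def with_external_lock(lock_dir, name, func):
      """Run func() while holding an exclusive lock on lock_dir/name."""
      path = os.path.join(lock_dir, name)
      with open(path, 'a') as lock_file:
          # Blocks until no other process, on any host sharing the
          # filesystem, holds the lock, provided the filesystem
          # actually implements fcntl locking.
          fcntl.lockf(lock_file, fcntl.LOCK_EX)
          try:
              return func()
          finally:
              fcntl.lockf(lock_file, fcntl.LOCK_UN)

  # e.g. with_external_lock('/cloudfs/nova/locks', 'nova-<hash>', convert_base)
  # where convert_base is a hypothetical function doing the image conversion.

If the filesystem silently ignores or mishandles such locks, two compute nodes can run the base-image conversion concurrently, which matches the failures below.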

A problematic POSIX IPC locking implementation was introduced recently (already fixed)
which could explain this, though that code should never have hit Icehouse. So my hunch
at this stage is a general locking logic error in nova, as I've not been able to find
any reference to fcntl locking issues with Ceph, where fcntl support has been in place
for a long time: http://tracker.ceph.com/issues/23

Extracting the particular failures from the logs....


2014-07-27 nova.compute.manager [instance: ...] File "/usr/lib/python2.6/site-packages/nova/virt/images.py", line 123, in fetch_to_raw
2014-07-27 nova.compute.manager [instance: ...] ImageUnacceptable: Image 8436fdb2-f688-4eb1-857c-f06c5d07b6be is unacceptable: Converted to raw, but format is now None


2014-07-27 nova.compute.manager [instance: ...] File "/usr/lib/python2.6/site-packages/nova/virt/images.py", line 116, in fetch_to_raw
2014-07-27 nova.compute.manager [instance: ...] ProcessExecutionError: Unexpected error while running command.
2014-07-27 nova.compute.manager [instance: ...] Command: qemu-img convert -O raw /cloudfs/nova/_base/56...52.part /cloudfs/nova/_base/56...52.converted
2014-07-27 nova.compute.manager [instance: ...] Exit code: 1
2014-07-27 nova.compute.manager [instance: ...] Stderr: 'error while reading sector 18284544: Input/output error\n'
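
For context (my reading of these errors, not confirmed from the code here): after converting, nova checks the resulting file's format with something along the lines of:

  # If another host is rewriting the same _base file concurrently, this can
  # see a truncated or vanished file and report no usable format, which then
  # surfaces as "Converted to raw, but format is now None".
  qemu-img info /cloudfs/nova/_base/56...52.converted

Both failure modes above are consistent with two hosts racing on the same files in the shared _base directory.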

Comment 11 Pablo Iranzo Gómez 2014-08-11 09:04:42 UTC
Russell,
The customer had the issue with the Beta while using the lock_path; they'll be testing again with GA and provide feedback.

Pádraig, should we revert lock_path back to the defaults on GA to retest? Is there any estimate of when this issue will be fixed?

Thanks

Comment 12 Pádraig Brady 2014-08-11 10:14:08 UTC
It would be good to test GA with lock_path set to defaults.

If there are still issues, then it's worth testing with lock_path set to /cloudfs/nova/locks/

That should not be needed, but it would indicate that there were locks that are not appropriately annotated within nova.

Comment 19 Pádraig Brady 2014-10-24 22:27:28 UTC
Digging further: fcntl locking has been supported by the cephfs kernel client for a long time. There have been bugs fixed there recently, though I suspect they're not a factor here.

However, the cephfs-fuse client does not currently support fcntl locking, so I presume that is what is being used in this case?
Note that support for fcntl locking has very recently been added to the fuse client:
https://github.com/ceph/ceph/commit/a1b2c8ff9
and this will be in the Hammer release.

So until then, the workaround of using NFS for the locking is the best solution.
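
For example (illustrative only; the NFS server name and export path are made up), the lock directory alone can sit on an NFSv4 mount while instances stay on cephfs, with nova pointed at it:

  # Hypothetical NFSv4 mount used only for the lock directory
  mount -t nfs4 nfs-server:/export/nova-locks /var/lib/nova/locks

  # nova.conf
  [DEFAULT]
  lock_path=/var/lib/nova/locks

As noted in comment 2, NFSv4 is required for the locking to work.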

I'm closing this, as there is nothing we can change in Nova to improve the situation here.

