Bug 1114878
| Summary: | Problem deploying multiple VM's with shared image_cached | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Pablo Iranzo Gómez <pablo.iranzo> |
| Component: | openstack-nova | Assignee: | Pádraig Brady <pbrady> |
| Status: | CLOSED CANTFIX | QA Contact: | Toure Dunnon <tdunnon> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 5.0 (RHEL 6) | CC: | dasmith, gfarnum, ndipanov, pablo.iranzo, pbrady, sclewis, sgordon, yeylon |
| Target Milestone: | z3 | Keywords: | Reopened, ZStream |
| Target Release: | 5.0 (RHEL 6) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2014-10-24 22:27:28 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Pablo Iranzo Gómez
2014-07-01 08:03:21 UTC
What type of shared storage? If NFS, can you make sure that version 4 is being used? v4 is required for locking to work.

Hi, it's Ceph, according to the details from the onsite visit.

Regards,
Pablo

Russell, is there anything else I should be providing that would help to diagnose this?

Thanks!
Pablo

Pádraig, as the case has been assigned to you, is there any extra information needed from the customer?

Thanks,
Pablo

The references (gerrit review, ML post) refer to a shared locks directory that was implemented long ago. It's the solution for this. Is the config option "lock_path" set? If so, is it set to a value on the same shared storage? If it's not set, it defaults to a subdirectory called "locks" under instances_path. Is this locks directory present on the same shared storage used for the instance storage?

Hi Russell,

lock_path is set in nova.conf to /var/lib/nova/tmp. I'm pointing them to move that directory onto the shared storage.

Regards,
Pablo

(In reply to Pablo Iranzo Gómez from comment #8)
> Hi Russell,
> lock_path is set in nova.conf to /var/lib/nova/tmp.
>
> I'm pointing them to move that directory onto the shared storage.
>
> Regards,
> Pablo

OK, thanks for checking. If that directory was indeed not on the shared storage, that would explain this problem. I'm going to close this out for now, but please re-open and contact me directly if there's still a problem after this config fix.

lock_path needs to be set to a specific value for other reasons, mentioned in bug 961557, but I think that's OK, as nova should use a shared lock directory where required. Digging into the logs, this seems to be the case, as we have:

```
Got file lock "56f350a9c08f513350b6bc8911fb6acb0aa3e852" at /cloudfs/nova/locks/nova-56f350a9c08f513350b6bc8911fb6acb0aa3e852
```

That is, /cloudfs/nova/ is the instances path in this case, and nova then uses /cloudfs/nova/locks/... for locking. Now, there was a problematic POSIX IPC locking implementation introduced recently (already fixed) which could explain this, though that code should never have hit Icehouse, so my hunch at this stage is a general locking logic error in nova, as I've not been able to find any reference to fcntl locking issues with Ceph, where it has been implemented for a long time: http://tracker.ceph.com/issues/23

Extracting the particular failures from the logs:

```
2014-07-27 nova.compute.manager [instance: ...] File "/usr/lib/python2.6/site-packages/nova/virt/images.py", line 123, in fetch_to_raw
2014-07-27 nova.compute.manager [instance: ...] ImageUnacceptable: Image 8436fdb2-f688-4eb1-857c-f06c5d07b6be is unacceptable: Converted to raw, but format is now None
2014-07-27 nova.compute.manager [instance: ...] File "/usr/lib/python2.6/site-packages/nova/virt/images.py", line 116, in fetch_to_raw
2014-07-27 nova.compute.manager [instance: ...] ProcessExecutionError: Unexpected error while running command.
2014-07-27 nova.compute.manager [instance: ...] Command: qemu-img convert -O raw /cloudfs/nova/_base/56...52.part /cloudfs/nova/_base/56...52.converted
2014-07-27 nova.compute.manager [instance: ...] Exit code: 1
2014-07-27 nova.compute.manager [instance: ...] Stderr: 'error while reading sector 18284544: Input/output error\n'
```
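For context on those two tracebacks: nova's fetch_to_raw downloads the image to a `.part` file, converts it to raw as a `.converted` file, and then verifies the result's format. Below is a minimal sketch of that flow, assuming plain subprocess calls; the helper names and output parsing are illustrative, not the actual nova code:

```python
import subprocess


class ImageUnacceptable(Exception):
    pass


def qemu_img_format(path):
    # Parse "file format: <fmt>" from `qemu-img info`; return None if the
    # file is unreadable or has no recognisable format.
    try:
        out = subprocess.check_output(["qemu-img", "info", path])
    except subprocess.CalledProcessError:
        return None
    for line in out.decode().splitlines():
        if line.startswith("file format:"):
            return line.split(":", 1)[1].strip()
    return None


def fetch_to_raw(image_id, path):
    path_part = "%s.part" % path        # the .part file seen in the logs
    # ... image is downloaded from glance to path_part here ...

    if qemu_img_format(path_part) != "raw":
        staged = "%s.converted" % path  # the .converted file in the logs
        # If the source hits an I/O error mid-read ("error while reading
        # sector ..."), this command fails; nova wraps such failures in
        # ProcessExecutionError -- the first error in the logs.
        subprocess.check_call(
            ["qemu-img", "convert", "-O", "raw", path_part, staged])

        fmt = qemu_img_format(staged)
        if fmt != "raw":
            # The second error in the logs: the converted file is
            # unreadable or truncated, so no format is detected (None).
            raise ImageUnacceptable(
                "Image %s is unacceptable: Converted to raw, but format "
                "is now %s" % (image_id, fmt))
```

Both failure modes are consistent with another host concurrently truncating or rewriting the same cached base image, i.e. with the external locks not actually excluding other hosts.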
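On the locking side, the external locks discussed throughout this thread amount to taking an fcntl lock on a file under lock_path, which only excludes other compute hosts if the shared filesystem enforces fcntl locks between clients. A rough sketch of the mechanism, with illustrative paths and names (not nova's actual lockutils code):

```python
import fcntl
import os
from contextlib import contextmanager

LOCK_PATH = "/cloudfs/nova/locks"  # example: must be on the shared storage


@contextmanager
def external_lock(name):
    # Hold an exclusive fcntl lock on LOCK_PATH/nova-<name>. fcntl locks
    # coordinate across hosts only if the filesystem supports them
    # (NFSv4 does; the cephfs kernel client does; ceph-fuse did not).
    path = os.path.join(LOCK_PATH, "nova-%s" % name)
    with open(path, "w") as f:
        fcntl.lockf(f, fcntl.LOCK_EX)  # blocks until the lock is granted
        try:
            yield
        finally:
            fcntl.lockf(f, fcntl.LOCK_UN)


# Usage: only one host at a time should populate a cached base image, e.g.
# with external_lock("56f350a9c08f513350b6bc8911fb6acb0aa3e852"):
#     fetch_to_raw(image_id, base_path)
```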
Russell, the customer had the issue on Beta while using that lock_path; they'll be testing with GA again and will provide feedback. Pádraig, should we revert lock_path back to defaults on GA to retest? Is there any estimate for when this issue will be fixed?

Thanks

It would be good to test GA with lock_path set to defaults. If there are still issues, then it's worth testing with lock_path set to /cloudfs/nova/locks/. That should not be needed, but it would indicate that there are locks that are not appropriately annotated within nova.

Digging further: fcntl locking has been supported by the cephfs kernel client for a long time. There have been bugs fixed there recently, though I suspect they're not a factor here. However, the ceph-fuse client does not currently support fcntl locking, so I presume that is what is being used in this case? Note that support for fcntl locking has very recently been added to the fuse client (https://github.com/ceph/ceph/commit/a1b2c8ff9) and will be in the hammer release. Until then, the workaround of using NFS for the locking is the best solution.

I'm closing this as there is nothing we can change in Nova to improve the situation here.
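A simple way to check that hypothesis from the deployment is to run a small fcntl lock test on two hosts against the same file on the shared mount. If both hosts acquire the "exclusive" lock at the same time, the client is only doing local lock bookkeeping (as a FUSE client without lock support typically does) and cross-host locking is not working. The path below is illustrative:

```python
import fcntl
import time

PATH = "/cloudfs/nova/locks/locktest"  # any file on the shared mount

with open(PATH, "w") as f:
    fcntl.lockf(f, fcntl.LOCK_EX)  # exclusive fcntl (byte-range) lock
    print("acquired exclusive lock on %s; holding for 60s" % PATH)
    time.sleep(60)  # start the same script on the second host now
    fcntl.lockf(f, fcntl.LOCK_UN)
```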