Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1080618

Summary: VMware: spawning large amounts of VMs concurrently sometimes causes "VMDK lock" error
Product: Red Hat OpenStack
Reporter: Stephen Gordon <sgordon>
Component: openstack-nova
Assignee: Matthew Booth <mbooth>
Status: CLOSED ERRATA
QA Contact: Jaroslav Henner <jhenner>
Severity: medium
Docs Contact:
Priority: medium
Version: 5.0 (RHEL 7)
CC: jhenner, mbooth, mlopes, ndipanov, sclewis, sgordon, yeylon
Target Milestone: rc
Target Release: 5.0 (RHEL 7)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: openstack-nova-2014.1.2-1.el7ost
Doc Type: Bug Fix
Doc Text:
With this update, the VMware driver first copies cached images to a unique temporary directory before moving them into their final location. Consequently, only one process creates the cached copy, and resizing of that image is likewise restricted to a single process. As a result, this fix resolves numerous race conditions found in previous releases that occasionally prevented multiple instances from using the same image.
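The copy-to-temporary-then-rename pattern the doc text describes can be sketched on a plain filesystem. This is illustrative only: the real driver stages files on a vSphere datastore, and the function and parameter names below are hypothetical.

```python
import os
import shutil
import tempfile

def cache_image(src, cache_dir, image_id):
    """Sketch of copy-to-unique-temporary-directory-then-rename.

    Illustrative only; the actual driver works against a datastore,
    not a local filesystem.
    """
    final = os.path.join(cache_dir, image_id)
    if os.path.exists(final):
        return final                           # another process already won
    staging = tempfile.mkdtemp(dir=cache_dir)  # unique temporary directory
    staged = os.path.join(staging, image_id)
    shutil.copy(src, staged)
    try:
        # On POSIX, rename() is atomic, so concurrent racers can never
        # observe a partially written cached image at the final path.
        os.rename(staged, final)
    finally:
        shutil.rmtree(staging, ignore_errors=True)
    return final
```

Because racers stage into distinct temporary directories and only the atomic rename publishes the result, a second caller either sees no cached copy or a complete one, never a half-written file.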
Story Points: ---
Clone Of:
: 1088997 (view as bug list)
Environment:
Last Closed: 2014-09-02 18:21:54 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1055536, 1088997    
Attachments:
  logs (flags: none)
  with_long_caching (flags: none)

Description Stephen Gordon 2014-03-25 19:27:24 UTC
Description of problem:
When using the VMwareVCDriver, spawning large numbers of virtual machines concurrently causes some instances to spawn with ERROR status. The number of machines that fail to build is unpredictable, and sometimes all instances do end up spawning successfully.

The issue can be reproduced by running:

    nova boot --image debian-2.6.32-i686 --flavor 1 --num-instances 32 nameless

The number of instances that triggers the errors differs from environment to environment; start with 30-40. There are two errors seen in the logs that cause the instance spawn failures. The first is the ESX host not finding the image in the NFS datastore (even though it is there; otherwise, other instances could not be spawned). The second is the ESX host being unable to access the VMDK image because it is locked.

Image not found error:

Traceback (most recent call last):
  File "/opt/stack/nova/nova/compute/manager.py", line 1408, in _spawn
    block_device_info)
  File "/opt/stack/nova/nova/virt/vmwareapi/driver.py", line 609, in spawn
    admin_password, network_info, block_device_info)
  File "/opt/stack/nova/nova/virt/vmwareapi/vmops.py", line 440, in spawn
    vmdk_file_size_in_kb, linked_clone)
  File "/opt/stack/nova/nova/virt/vmwareapi/volumeops.py", line 71, in attach_disk_to_vm
    self._session._wait_for_task(instance_uuid, reconfig_task)
  File "/opt/stack/nova/nova/virt/vmwareapi/driver.py", line 795, in _wait_for_task
    ret_val = done.wait()
  File "/usr/local/lib/python2.7/dist-packages/eventlet/event.py", line 116, in wait
    return hubs.get_hub().switch()
  File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 187, in switch
    return self.greenlet.switch()
NovaException: File [ryan-nfs] vmware_base/e8c42ed8-05e7-45bc-90c3-49a34e5a37c6.vmdk was not found

Image locked error:

Traceback (most recent call last):
  File "/opt/stack/nova/nova/compute/manager.py", line 1407, in _spawn
    block_device_info)
  File "/opt/stack/nova/nova/virt/vmwareapi/driver.py", line 623, in spawn
    admin_password, network_info, block_device_info)
  File "/opt/stack/nova/nova/virt/vmwareapi/vmops.py", line 504, in spawn
    root_gb_in_kb, linked_clone)
  File "/opt/stack/nova/nova/virt/vmwareapi/volumeops.py", line 71, in attach_disk_to_vm
    self._session._wait_for_task(instance_uuid, reconfig_task)
  File "/opt/stack/nova/nova/virt/vmwareapi/driver.py", line 900, in _wait_for_task
    ret_val = done.wait()
  File "/usr/local/lib/python2.7/dist-packages/eventlet/event.py", line 116, in wait
    return hubs.get_hub().switch()
  File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 187, in switch
    return self.greenlet.switch()
NovaException: Unable to access file [ryan-nfs] vmware_base/f110bb94-2170-4a3a-ae0d-760f95eb8b47.0.

Environment information:

- 1 datacenter, 1 cluster, 7 hosts
- NFS shared datastore
- was able to spawn 7 instances before errors appeared
- screen log with tracebacks: http://paste.openstack.org/show/47410/

Comment 1 Matthew Booth 2014-04-14 13:33:30 UTC
This is resolved by the following four commits:

https://code.engineering.redhat.com/gerrit/23107
https://code.engineering.redhat.com/gerrit/23108
https://code.engineering.redhat.com/gerrit/23109
https://code.engineering.redhat.com/gerrit/23110

N.B. This is an invasive, non-trivial backport, and I can't run tempest in my setup.

Comment 5 Jaroslav Henner 2014-06-17 14:46:33 UTC
Created attachment 909615 [details]
logs

After twiddling with the cache parameters:
image_cache_manager_interval=10
remove_unused_original_minimum_age_seconds=10

I got the VMs into the ERROR state.
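For context, shortening remove_unused_original_minimum_age_seconds widens the window in which the cache cleaner can delete an image that a concurrent spawn has already checked for. A minimal, hypothetical model of that check-then-use race (names are illustrative, not the driver's API):

```python
import os
import tempfile

# Model: a spawn checks that the cached image exists, the cleaner ages
# it out in between, and the spawn's subsequent use of the file fails
# with the "was not found" error quoted in the description.
cache_dir = tempfile.mkdtemp()
cached = os.path.join(cache_dir, "e8c42ed8-05e7-45bc-90c3-49a34e5a37c6.vmdk")
open(cached, "w").close()

spawn_saw_image = os.path.exists(cached)   # spawn: image is in the cache
os.remove(cached)                          # cleaner: deletes "unused" image
try:
    open(cached)                           # spawn: ...but the use fails
    raced = False
except FileNotFoundError:
    raced = True
```

With a 10-second minimum age, almost any in-flight spawn falls into this window, which is why the shortened parameters make the failure easy to hit.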

Sometimes the output of successive nova list calls flaps between states:
[root@jhenner-node ~(keystone_admin)]# nova list
+--------------------------------------+------------------------------------------+--------+------------+-------------+---------------------------+
| ID                                   | Name                                     | Status | Task State | Power State | Networks                  |
+--------------------------------------+------------------------------------------+--------+------------+-------------+---------------------------+
| 02ba507a-7cd9-4388-a9ae-514ff1bf8474 | bar-02ba507a-7cd9-4388-a9ae-514ff1bf8474 | BUILD  | spawning   | NOSTATE     | novanetwork=192.168.32.25 |
| 03cbfd8e-765a-44ea-a73d-01e934714c48 | bar-03cbfd8e-765a-44ea-a73d-01e934714c48 | BUILD  | spawning   | NOSTATE     | novanetwork=192.168.32.28 |
| 173b57dc-ba49-400e-be35-88ade97c84f5 | bar-173b57dc-ba49-400e-be35-88ade97c84f5 | BUILD  | spawning   | NOSTATE     | novanetwork=192.168.32.23 |
| 384aec93-3f71-4d75-84e9-46a821ba8f17 | bar-384aec93-3f71-4d75-84e9-46a821ba8f17 | BUILD  | spawning   | NOSTATE     | novanetwork=192.168.32.27 |
| 62322322-de21-40b8-892f-0b8abdd8d691 | bar-62322322-de21-40b8-892f-0b8abdd8d691 | BUILD  | spawning   | NOSTATE     | novanetwork=192.168.32.29 |
| 6a20b378-0cc8-4c9f-9b44-1b176c76523f | bar-6a20b378-0cc8-4c9f-9b44-1b176c76523f | ERROR  | -          | NOSTATE     | novanetwork=192.168.32.30 |
| 6c2924df-03de-4dcf-9faf-a3eb0b599787 | bar-6c2924df-03de-4dcf-9faf-a3eb0b599787 | BUILD  | spawning   | NOSTATE     | novanetwork=192.168.32.26 |
| 97bf0053-38c6-4984-bed5-cf60f7a3f6b9 | bar-97bf0053-38c6-4984-bed5-cf60f7a3f6b9 | BUILD  | spawning   | NOSTATE     | novanetwork=192.168.32.24 |
| 9d5038ec-84fb-4777-88db-e6b31a787459 | bar-9d5038ec-84fb-4777-88db-e6b31a787459 | BUILD  | spawning   | NOSTATE     | novanetwork=192.168.32.22 |
| daae00da-1810-4ea3-bbed-228407577d24 | bar-daae00da-1810-4ea3-bbed-228407577d24 | BUILD  | spawning   | NOSTATE     | novanetwork=192.168.32.31 |
+--------------------------------------+------------------------------------------+--------+------------+-------------+---------------------------+
[root@jhenner-node ~(keystone_admin)]# nova list
+--------------------------------------+------------------------------------------+--------+------------+-------------+---------------------------+
| ID                                   | Name                                     | Status | Task State | Power State | Networks                  |
+--------------------------------------+------------------------------------------+--------+------------+-------------+---------------------------+
| 02ba507a-7cd9-4388-a9ae-514ff1bf8474 | bar-02ba507a-7cd9-4388-a9ae-514ff1bf8474 | ERROR  | -          | NOSTATE     | novanetwork=192.168.32.25 |
| 03cbfd8e-765a-44ea-a73d-01e934714c48 | bar-03cbfd8e-765a-44ea-a73d-01e934714c48 | ERROR  | -          | NOSTATE     | novanetwork=192.168.32.28 |
| 173b57dc-ba49-400e-be35-88ade97c84f5 | bar-173b57dc-ba49-400e-be35-88ade97c84f5 | ERROR  | -          | NOSTATE     |                           |
| 384aec93-3f71-4d75-84e9-46a821ba8f17 | bar-384aec93-3f71-4d75-84e9-46a821ba8f17 | ERROR  | -          | NOSTATE     | novanetwork=192.168.32.27 |
| 62322322-de21-40b8-892f-0b8abdd8d691 | bar-62322322-de21-40b8-892f-0b8abdd8d691 | ERROR  | -          | NOSTATE     | novanetwork=192.168.32.29 |
| 6a20b378-0cc8-4c9f-9b44-1b176c76523f | bar-6a20b378-0cc8-4c9f-9b44-1b176c76523f | ERROR  | -          | NOSTATE     | novanetwork=192.168.32.30 |
| 6c2924df-03de-4dcf-9faf-a3eb0b599787 | bar-6c2924df-03de-4dcf-9faf-a3eb0b599787 | ERROR  | -          | NOSTATE     | novanetwork=192.168.32.26 |
| 97bf0053-38c6-4984-bed5-cf60f7a3f6b9 | bar-97bf0053-38c6-4984-bed5-cf60f7a3f6b9 | ACTIVE | -          | Running     | novanetwork=192.168.32.24 |
| 9d5038ec-84fb-4777-88db-e6b31a787459 | bar-9d5038ec-84fb-4777-88db-e6b31a787459 | ERROR  | -          | NOSTATE     | novanetwork=192.168.32.22 |
| daae00da-1810-4ea3-bbed-228407577d24 | bar-daae00da-1810-4ea3-bbed-228407577d24 | ERROR  | -          | NOSTATE     | novanetwork=192.168.32.31 |
+--------------------------------------+------------------------------------------+--------+------------+-------------+---------------------------+
[root@jhenner-node ~(keystone_admin)]# nova list
+--------------------------------------+------------------------------------------+--------+------------+-------------+---------------------------+
| ID                                   | Name                                     | Status | Task State | Power State | Networks                  |
+--------------------------------------+------------------------------------------+--------+------------+-------------+---------------------------+
| 02ba507a-7cd9-4388-a9ae-514ff1bf8474 | bar-02ba507a-7cd9-4388-a9ae-514ff1bf8474 | ERROR  | -          | NOSTATE     | novanetwork=192.168.32.25 |
| 03cbfd8e-765a-44ea-a73d-01e934714c48 | bar-03cbfd8e-765a-44ea-a73d-01e934714c48 | ERROR  | -          | NOSTATE     | novanetwork=192.168.32.28 |
| 173b57dc-ba49-400e-be35-88ade97c84f5 | bar-173b57dc-ba49-400e-be35-88ade97c84f5 | ERROR  | -          | NOSTATE     |                           |
| 384aec93-3f71-4d75-84e9-46a821ba8f17 | bar-384aec93-3f71-4d75-84e9-46a821ba8f17 | ERROR  | -          | NOSTATE     | novanetwork=192.168.32.27 |
| 62322322-de21-40b8-892f-0b8abdd8d691 | bar-62322322-de21-40b8-892f-0b8abdd8d691 | ERROR  | -          | NOSTATE     | novanetwork=192.168.32.29 |
| 6a20b378-0cc8-4c9f-9b44-1b176c76523f | bar-6a20b378-0cc8-4c9f-9b44-1b176c76523f | ERROR  | -          | NOSTATE     | novanetwork=192.168.32.30 |
| 6c2924df-03de-4dcf-9faf-a3eb0b599787 | bar-6c2924df-03de-4dcf-9faf-a3eb0b599787 | ERROR  | -          | NOSTATE     | novanetwork=192.168.32.26 |
| 97bf0053-38c6-4984-bed5-cf60f7a3f6b9 | bar-97bf0053-38c6-4984-bed5-cf60f7a3f6b9 | ACTIVE | -          | Running     | novanetwork=192.168.32.24 |
| 9d5038ec-84fb-4777-88db-e6b31a787459 | bar-9d5038ec-84fb-4777-88db-e6b31a787459 | ERROR  | -          | NOSTATE     | novanetwork=192.168.32.22 |
| daae00da-1810-4ea3-bbed-228407577d24 | bar-daae00da-1810-4ea3-bbed-228407577d24 | ERROR  | -          | NOSTATE     | novanetwork=192.168.32.31 |
+--------------------------------------+------------------------------------------+--------+------------+-------------+---------------------------+
[root@jhenner-node ~(keystone_admin)]#

Comment 6 Jaroslav Henner 2014-06-17 14:53:07 UTC
Created attachment 909629 [details]
with_long_caching

I removed my modifications to the caching times (so the defaults were used), restarted all nova components, and the problem persists:


+--------------------------------------+------------------------------------------+--------+------------+-------------+---------------------------+
| ID                                   | Name                                     | Status | Task State | Power State | Networks                  |
+--------------------------------------+------------------------------------------+--------+------------+-------------+---------------------------+
| 067c969e-ef15-4330-bfde-275eae5be97d | bar-067c969e-ef15-4330-bfde-275eae5be97d | ERROR  | -          | NOSTATE     | novanetwork=192.168.32.11 |
| 10ddc537-e06f-4315-9825-f5eaaf418066 | bar-10ddc537-e06f-4315-9825-f5eaaf418066 | ERROR  | -          | NOSTATE     | novanetwork=192.168.32.3  |
| 223572a5-ce8a-42e2-aeba-33e3bbbec550 | bar-223572a5-ce8a-42e2-aeba-33e3bbbec550 | ACTIVE | -          | Running     | novanetwork=192.168.32.8  |
| 50b805e6-55a9-4050-a6b2-37e3ee58e36f | bar-50b805e6-55a9-4050-a6b2-37e3ee58e36f | ERROR  | -          | NOSTATE     | novanetwork=192.168.32.10 |
| 6a7c7795-331f-4a2d-a47a-7e4c41f8482b | bar-6a7c7795-331f-4a2d-a47a-7e4c41f8482b | ACTIVE | -          | Running     | novanetwork=192.168.32.5  |
| 8138c736-a964-4583-867e-33e12fec4343 | bar-8138c736-a964-4583-867e-33e12fec4343 | ERROR  | -          | NOSTATE     | novanetwork=192.168.32.2  |
| 9daed4a9-67e5-4bd7-9642-c02ca872de75 | bar-9daed4a9-67e5-4bd7-9642-c02ca872de75 | ACTIVE | -          | Running     | novanetwork=192.168.32.4  |
| de1622a6-7ed3-4312-a552-a0b444d3a250 | bar-de1622a6-7ed3-4312-a552-a0b444d3a250 | ACTIVE | -          | Running     | novanetwork=192.168.32.7  |
| f3df82d5-e3fe-4a31-9504-c5d27779bd1e | bar-f3df82d5-e3fe-4a31-9504-c5d27779bd1e | ACTIVE | -          | Running     | novanetwork=192.168.32.9  |
| fad04105-659e-4794-bc89-e70e6191ebdd | bar-fad04105-659e-4794-bc89-e70e6191ebdd | ACTIVE | -          | Running     | novanetwork=192.168.32.6  |
+--------------------------------------+------------------------------------------+--------+------------+-------------+---------------------------+

Comment 8 Matthew Booth 2014-06-23 10:55:37 UTC
(In reply to Jaroslav Henner from comment #6)
> Created attachment 909629 [details]
> with_long_caching
> 
> I removed my modifications to the caching times (so the defaults were used).
> restarted all nova components and the problem persists:
> 
> 
> [nova list output from comment 6 snipped]

I have replicated this.

However, I think this is a different bug to the one which was originally reported. Specifically, I don't see any logged errors when the above occurs.

Given that this is going to have a different upstream bug and resolution, it might be better to move this bug on and create a new one for the new issue.

Comment 9 Matthew Booth 2014-06-23 15:35:38 UTC
(In reply to Matthew Booth from comment #8)
> (In reply to Jaroslav Henner from comment #6)
> > Created attachment 909629 [details]
> > with_long_caching
> > 
> > I removed my modifications to the caching times (so the defaults were used).
> > restarted all nova components and the problem persists:
> > 
> > 
> > [nova list output from comment 6 snipped]
> 
> I have replicated this.
> 
> However, I think this is a different bug to the one which was originally
> reported. Specifically, I don't see any logged errors when the above occurs.
> 
> Given that this is going to have a different upstream bug and resolution, it
> might be better to move this bug on and create a new one for the new issue.

Scratch that: I haven't reproduced it. I see fewer than the requested number of instances being created; however, that is due to quota. Looking into the error logs in more detail.

Comment 10 Matthew Booth 2014-06-23 16:12:48 UTC
Note to self:

Looking in other.log, why is [datastore1] 172.16.0.13_base/326153d2-1226-415a-a194-2ca47ac3c48b/326153d2-1226-415a-a194-2ca47ac3c48b.1.vmdk locked? Do we have a race when extending a file in the image cache? Where is _copy_virtual_disk() called from, and is it safe?

Comment 11 Matthew Booth 2014-06-24 08:41:43 UTC
Ok, this is a different race condition in spawn() in the VMware driver. Specifically, this happens when multiple spawns race to create a resized copy of a cached image.
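One general shape for fixing such a race is to serialize the resized-copy step with a per-image lock, so only one spawn resizes a given cached image at a time. The sketch below assumes in-process threads and illustrative names; it is not the actual patch, just a demonstration of the serialization idea.

```python
import threading
import time

# Per-image locks, created lazily under a guard so two threads cannot
# each manufacture a lock for the same image id.
_locks = {}
_locks_guard = threading.Lock()

def _lock_for(image_id):
    with _locks_guard:
        return _locks.setdefault(image_id, threading.Lock())

def resize_cached_image(image_id, resize_fn):
    with _lock_for(image_id):          # only one resizer per image
        return resize_fn(image_id)

# Demonstrate that concurrent callers are serialized.
concurrent = 0
max_concurrent = 0
counter_lock = threading.Lock()

def fake_resize(image_id):
    global concurrent, max_concurrent
    with counter_lock:
        concurrent += 1
        max_concurrent = max(max_concurrent, concurrent)
    time.sleep(0.01)                   # simulate the slow copy/resize
    with counter_lock:
        concurrent -= 1

threads = [threading.Thread(target=resize_cached_image,
                            args=("img", fake_resize)) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

In a multi-host deployment an in-process lock is not enough on its own; the same one-winner property has to hold across compute nodes sharing the datastore, which is why atomic publication of the resized file matters as well.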

Comment 12 Matthew Booth 2014-06-24 13:25:38 UTC
This is definitely a different bug. I've filed a new upstream bug here:

https://bugs.launchpad.net/nova/+bug/1333587

Fix is up for review here:

https://review.openstack.org/102224

Comment 14 Stephen Gordon 2014-06-24 14:15:37 UTC
Matt, as discussed on IRC let's get the new patch on this bug as well.

Jaroslav, can you provide qa_ack+?

Comment 15 Matthew Booth 2014-07-15 09:57:49 UTC
I've backported the fix upstream and submitted for review.

Comment 16 Matthew Booth 2014-08-04 11:22:43 UTC
I've backported the following 2 commits from stable:

https://code.engineering.redhat.com/gerrit/30265
https://code.engineering.redhat.com/gerrit/30266

These resolve the race when spawning a large number of guests simultaneously.

Comment 17 Lon Hohberger 2014-08-14 19:46:32 UTC
Fix is part of upstream 2014.1.2

Comment 19 Jaroslav Henner 2014-08-18 14:44:36 UTC
It didn't reproduce with either the default or the shortened caching params. All VMs reached ACTIVE.

Comment 21 errata-xmlrpc 2014-09-02 18:21:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1127.html