Bug 1080618
| Field | Value |
|---|---|
| Summary | VMware: spawning large amounts of VMs concurrently sometimes causes "VMDK lock" error |
| Product | Red Hat OpenStack |
| Component | openstack-nova |
| Version | 5.0 (RHEL 7) |
| Status | CLOSED ERRATA |
| Severity | medium |
| Priority | medium |
| Reporter | Stephen Gordon <sgordon> |
| Assignee | Matthew Booth <mbooth> |
| QA Contact | Jaroslav Henner <jhenner> |
| CC | jhenner, mbooth, mlopes, ndipanov, sclewis, sgordon, yeylon |
| Target Milestone | rc |
| Target Release | 5.0 (RHEL 7) |
| Hardware | Unspecified |
| OS | Unspecified |
| Fixed In Version | openstack-nova-2014.1.2-1.el7ost |
| Doc Type | Bug Fix |
| Type | Bug |
| Last Closed | 2014-09-02 18:21:54 UTC |
| Bug Blocks | 1055536, 1088997 (view as bug list) |

Doc Text: With this update, the VMware driver first copies cached images into a unique temporary directory and then moves them into their final location. As a result, only one process creates the cached copy, and resizing that image is likewise restricted to a single process. This resolves numerous race conditions in previous releases that occasionally prevented multiple instances from using the same image.
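The Doc Text above describes a copy-to-a-unique-temporary-directory-then-move scheme. A minimal sketch of that general pattern follows; the function and parameter names are assumptions for illustration, not the actual Nova VMware driver code:

```python
import os
import shutil
import uuid

def cache_image(fetch, cache_dir, image_id):
    """Populate an image cache entry so that concurrent callers are safe.

    The image is first written into a unique temporary directory, then moved
    into place with an atomic rename. Whichever process renames first "wins";
    any loser discards its copy and uses the winner's, so readers never see a
    half-written cache entry. Sketch only, not the real driver code.
    """
    final_path = os.path.join(cache_dir, image_id)
    if os.path.exists(final_path):
        return final_path  # already cached

    # Unique staging area: no other process can observe a partial write here.
    tmp_dir = os.path.join(cache_dir, "tmp-%s" % uuid.uuid4())
    os.makedirs(tmp_dir)
    tmp_path = os.path.join(tmp_dir, image_id)
    try:
        fetch(tmp_path)  # write the image bytes into the staging area
        try:
            os.rename(tmp_path, final_path)  # atomic on the same filesystem
        except OSError:
            pass  # another process won the race; its copy is complete
    finally:
        shutil.rmtree(tmp_dir, ignore_errors=True)
    return final_path
```

The key property is that `final_path` either does not exist or is a complete image; there is no window in which it is visible but partially written.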
This is resolved in the following 4 commits:

- https://code.engineering.redhat.com/gerrit/23107
- https://code.engineering.redhat.com/gerrit/23108
- https://code.engineering.redhat.com/gerrit/23109
- https://code.engineering.redhat.com/gerrit/23110

N.B. This is an invasive, non-trivial backport, and I can't run tempest in my setup.

Created attachment 909615 [details]: logs
After tweaking the cache parameters:

image_cache_manager_interval=10
remove_unused_original_minimum_age_seconds=10

the VMs ended up in ERROR state. Sometimes the output of `nova list` flaps between successive runs:
[root@jhenner-node ~(keystone_admin)]# nova list
+--------------------------------------+------------------------------------------+--------+------------+-------------+---------------------------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+------------------------------------------+--------+------------+-------------+---------------------------+
| 02ba507a-7cd9-4388-a9ae-514ff1bf8474 | bar-02ba507a-7cd9-4388-a9ae-514ff1bf8474 | BUILD | spawning | NOSTATE | novanetwork=192.168.32.25 |
| 03cbfd8e-765a-44ea-a73d-01e934714c48 | bar-03cbfd8e-765a-44ea-a73d-01e934714c48 | BUILD | spawning | NOSTATE | novanetwork=192.168.32.28 |
| 173b57dc-ba49-400e-be35-88ade97c84f5 | bar-173b57dc-ba49-400e-be35-88ade97c84f5 | BUILD | spawning | NOSTATE | novanetwork=192.168.32.23 |
| 384aec93-3f71-4d75-84e9-46a821ba8f17 | bar-384aec93-3f71-4d75-84e9-46a821ba8f17 | BUILD | spawning | NOSTATE | novanetwork=192.168.32.27 |
| 62322322-de21-40b8-892f-0b8abdd8d691 | bar-62322322-de21-40b8-892f-0b8abdd8d691 | BUILD | spawning | NOSTATE | novanetwork=192.168.32.29 |
| 6a20b378-0cc8-4c9f-9b44-1b176c76523f | bar-6a20b378-0cc8-4c9f-9b44-1b176c76523f | ERROR | - | NOSTATE | novanetwork=192.168.32.30 |
| 6c2924df-03de-4dcf-9faf-a3eb0b599787 | bar-6c2924df-03de-4dcf-9faf-a3eb0b599787 | BUILD | spawning | NOSTATE | novanetwork=192.168.32.26 |
| 97bf0053-38c6-4984-bed5-cf60f7a3f6b9 | bar-97bf0053-38c6-4984-bed5-cf60f7a3f6b9 | BUILD | spawning | NOSTATE | novanetwork=192.168.32.24 |
| 9d5038ec-84fb-4777-88db-e6b31a787459 | bar-9d5038ec-84fb-4777-88db-e6b31a787459 | BUILD | spawning | NOSTATE | novanetwork=192.168.32.22 |
| daae00da-1810-4ea3-bbed-228407577d24 | bar-daae00da-1810-4ea3-bbed-228407577d24 | BUILD | spawning | NOSTATE | novanetwork=192.168.32.31 |
+--------------------------------------+------------------------------------------+--------+------------+-------------+---------------------------+
[root@jhenner-node ~(keystone_admin)]# nova list
+--------------------------------------+------------------------------------------+--------+------------+-------------+---------------------------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+------------------------------------------+--------+------------+-------------+---------------------------+
| 02ba507a-7cd9-4388-a9ae-514ff1bf8474 | bar-02ba507a-7cd9-4388-a9ae-514ff1bf8474 | ERROR | - | NOSTATE | novanetwork=192.168.32.25 |
| 03cbfd8e-765a-44ea-a73d-01e934714c48 | bar-03cbfd8e-765a-44ea-a73d-01e934714c48 | ERROR | - | NOSTATE | novanetwork=192.168.32.28 |
| 173b57dc-ba49-400e-be35-88ade97c84f5 | bar-173b57dc-ba49-400e-be35-88ade97c84f5 | ERROR | - | NOSTATE | |
| 384aec93-3f71-4d75-84e9-46a821ba8f17 | bar-384aec93-3f71-4d75-84e9-46a821ba8f17 | ERROR | - | NOSTATE | novanetwork=192.168.32.27 |
| 62322322-de21-40b8-892f-0b8abdd8d691 | bar-62322322-de21-40b8-892f-0b8abdd8d691 | ERROR | - | NOSTATE | novanetwork=192.168.32.29 |
| 6a20b378-0cc8-4c9f-9b44-1b176c76523f | bar-6a20b378-0cc8-4c9f-9b44-1b176c76523f | ERROR | - | NOSTATE | novanetwork=192.168.32.30 |
| 6c2924df-03de-4dcf-9faf-a3eb0b599787 | bar-6c2924df-03de-4dcf-9faf-a3eb0b599787 | ERROR | - | NOSTATE | novanetwork=192.168.32.26 |
| 97bf0053-38c6-4984-bed5-cf60f7a3f6b9 | bar-97bf0053-38c6-4984-bed5-cf60f7a3f6b9 | ACTIVE | - | Running | novanetwork=192.168.32.24 |
| 9d5038ec-84fb-4777-88db-e6b31a787459 | bar-9d5038ec-84fb-4777-88db-e6b31a787459 | ERROR | - | NOSTATE | novanetwork=192.168.32.22 |
| daae00da-1810-4ea3-bbed-228407577d24 | bar-daae00da-1810-4ea3-bbed-228407577d24 | ERROR | - | NOSTATE | novanetwork=192.168.32.31 |
+--------------------------------------+------------------------------------------+--------+------------+-------------+---------------------------+
[root@jhenner-node ~(keystone_admin)]# nova list
+--------------------------------------+------------------------------------------+--------+------------+-------------+---------------------------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+------------------------------------------+--------+------------+-------------+---------------------------+
| 02ba507a-7cd9-4388-a9ae-514ff1bf8474 | bar-02ba507a-7cd9-4388-a9ae-514ff1bf8474 | ERROR | - | NOSTATE | novanetwork=192.168.32.25 |
| 03cbfd8e-765a-44ea-a73d-01e934714c48 | bar-03cbfd8e-765a-44ea-a73d-01e934714c48 | ERROR | - | NOSTATE | novanetwork=192.168.32.28 |
| 173b57dc-ba49-400e-be35-88ade97c84f5 | bar-173b57dc-ba49-400e-be35-88ade97c84f5 | ERROR | - | NOSTATE | |
| 384aec93-3f71-4d75-84e9-46a821ba8f17 | bar-384aec93-3f71-4d75-84e9-46a821ba8f17 | ERROR | - | NOSTATE | novanetwork=192.168.32.27 |
| 62322322-de21-40b8-892f-0b8abdd8d691 | bar-62322322-de21-40b8-892f-0b8abdd8d691 | ERROR | - | NOSTATE | novanetwork=192.168.32.29 |
| 6a20b378-0cc8-4c9f-9b44-1b176c76523f | bar-6a20b378-0cc8-4c9f-9b44-1b176c76523f | ERROR | - | NOSTATE | novanetwork=192.168.32.30 |
| 6c2924df-03de-4dcf-9faf-a3eb0b599787 | bar-6c2924df-03de-4dcf-9faf-a3eb0b599787 | ERROR | - | NOSTATE | novanetwork=192.168.32.26 |
| 97bf0053-38c6-4984-bed5-cf60f7a3f6b9 | bar-97bf0053-38c6-4984-bed5-cf60f7a3f6b9 | ACTIVE | - | Running | novanetwork=192.168.32.24 |
| 9d5038ec-84fb-4777-88db-e6b31a787459 | bar-9d5038ec-84fb-4777-88db-e6b31a787459 | ERROR | - | NOSTATE | novanetwork=192.168.32.22 |
| daae00da-1810-4ea3-bbed-228407577d24 | bar-daae00da-1810-4ea3-bbed-228407577d24 | ERROR | - | NOSTATE | novanetwork=192.168.32.31 |
+--------------------------------------+------------------------------------------+--------+------------+-------------+---------------------------+
[root@jhenner-node ~(keystone_admin)]#
Created attachment 909629 [details]: with_long_caching

I removed my modifications to the caching times (so the defaults were used), restarted all Nova components, and the problem persists:
+--------------------------------------+------------------------------------------+--------+------------+-------------+---------------------------+
| ID                                   | Name                                     | Status | Task State | Power State | Networks                  |
+--------------------------------------+------------------------------------------+--------+------------+-------------+---------------------------+
| 067c969e-ef15-4330-bfde-275eae5be97d | bar-067c969e-ef15-4330-bfde-275eae5be97d | ERROR | - | NOSTATE | novanetwork=192.168.32.11 |
| 10ddc537-e06f-4315-9825-f5eaaf418066 | bar-10ddc537-e06f-4315-9825-f5eaaf418066 | ERROR | - | NOSTATE | novanetwork=192.168.32.3 |
| 223572a5-ce8a-42e2-aeba-33e3bbbec550 | bar-223572a5-ce8a-42e2-aeba-33e3bbbec550 | ACTIVE | - | Running | novanetwork=192.168.32.8 |
| 50b805e6-55a9-4050-a6b2-37e3ee58e36f | bar-50b805e6-55a9-4050-a6b2-37e3ee58e36f | ERROR | - | NOSTATE | novanetwork=192.168.32.10 |
| 6a7c7795-331f-4a2d-a47a-7e4c41f8482b | bar-6a7c7795-331f-4a2d-a47a-7e4c41f8482b | ACTIVE | - | Running | novanetwork=192.168.32.5 |
| 8138c736-a964-4583-867e-33e12fec4343 | bar-8138c736-a964-4583-867e-33e12fec4343 | ERROR | - | NOSTATE | novanetwork=192.168.32.2 |
| 9daed4a9-67e5-4bd7-9642-c02ca872de75 | bar-9daed4a9-67e5-4bd7-9642-c02ca872de75 | ACTIVE | - | Running | novanetwork=192.168.32.4 |
| de1622a6-7ed3-4312-a552-a0b444d3a250 | bar-de1622a6-7ed3-4312-a552-a0b444d3a250 | ACTIVE | - | Running | novanetwork=192.168.32.7 |
| f3df82d5-e3fe-4a31-9504-c5d27779bd1e | bar-f3df82d5-e3fe-4a31-9504-c5d27779bd1e | ACTIVE | - | Running | novanetwork=192.168.32.9 |
| fad04105-659e-4794-bc89-e70e6191ebdd | bar-fad04105-659e-4794-bc89-e70e6191ebdd | ACTIVE | - | Running | novanetwork=192.168.32.6 |
+--------------------------------------+------------------------------------------+--------+------------+-------------+---------------------------+
(In reply to Jaroslav Henner from comment #6)

> I removed my modifications to the caching times (so the defaults were used), restarted all nova components and the problem persists: [quoted `nova list` table snipped]

I have replicated this.

However, I think this is a different bug from the one which was originally reported. Specifically, I don't see any logged errors when the above occurs.

Given that this is going to have a different upstream bug and resolution, it might be better to move this bug on and create a new one for the new issue.

(In reply to Matthew Booth from comment #8)

> I have replicated this. [rest of quote snipped]

Scratch that: I haven't reproduced it. I see fewer than the requested number of instances being created; however, that is due to quota. Looking into the error logs in more detail.

Note to self: looking in other.log, why is [datastore1] 172.16.0.13_base/326153d2-1226-415a-a194-2ca47ac3c48b/326153d2-1226-415a-a194-2ca47ac3c48b.1.vmdk locked? Do we have a race when extending a file in the image cache? Where is _copy_virtual_disk() called from, and is it safe?

OK, this is a different race condition in spawn() in the VMware driver. Specifically, it happens when multiple spawns race to create a resized copy of a cached image. This is definitely a different bug. I've filed a new upstream bug here: https://bugs.launchpad.net/nova/+bug/1333587

The fix is up for review here: https://review.openstack.org/102224

Matt, as discussed on IRC, let's get the new patch on this bug as well. Jaroslav, can you provide qa_ack+?

I've backported the fix upstream and submitted it for review.
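A per-image lock is one common way to serialize the "create a resized copy of a cached image" step that the race above involves. A minimal sketch of that kind of serialization, with entirely hypothetical names (this is not the actual Nova code, which coordinates across hosts rather than just threads):

```python
import threading

# One lock per cached-image key serializes the resize step, so concurrent
# spawns cannot race to create the same resized copy.
_resize_locks = {}
_registry_lock = threading.Lock()

def _lock_for(key):
    """Return the lock for this key, creating it atomically if needed."""
    with _registry_lock:
        return _resize_locks.setdefault(key, threading.Lock())

def get_resized_image(cache, image_id, size_gb, resize_fn):
    """Return the cached resized image, creating it at most once.

    Only the first caller performs the resize; later callers block on the
    per-key lock and then reuse the finished copy.
    """
    key = "%s.%d" % (image_id, size_gb)
    with _lock_for(key):
        if key not in cache:
            cache[key] = resize_fn(image_id, size_gb)
    return cache[key]
```

With this in place, ten concurrent spawns of the same flavor trigger exactly one resize instead of ten racing ones.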
I've backported the following 2 commits from stable:

- https://code.engineering.redhat.com/gerrit/30265
- https://code.engineering.redhat.com/gerrit/30266

These resolve the race when spawning a large number of guests simultaneously.

The fix is part of upstream 2014.1.2.

The issue did not reproduce with either the default or the shortened caching params; all VMs went ACTIVE.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2014-1127.html
Description of problem:

When using the VMwareVCDriver, spawning large numbers of virtual machines concurrently causes some instances to spawn with status ERROR. The number of machines that fail to build is unpredictable, and sometimes all instances do end up spawning successfully.

The issue can be reproduced by running:

nova boot --image debian-2.6.32-i686 --flavor 1 --num-instances 32 nameless

The number of instances that causes the errors differs from environment to environment; start with 30-40.

There are two errors seen in the logs that are causing the instance spawn failures. The first is the ESX host not finding the image in the NFS datastore (even though it is there; otherwise other instances couldn't be spawned). The second is the ESX host not being able to access the VMDK image because it is locked.

Image not found error:

Traceback (most recent call last):
  File "/opt/stack/nova/nova/compute/manager.py", line 1408, in _spawn
    block_device_info)
  File "/opt/stack/nova/nova/virt/vmwareapi/driver.py", line 609, in spawn
    admin_password, network_info, block_device_info)
  File "/opt/stack/nova/nova/virt/vmwareapi/vmops.py", line 440, in spawn
    vmdk_file_size_in_kb, linked_clone)
  File "/opt/stack/nova/nova/virt/vmwareapi/volumeops.py", line 71, in attach_disk_to_vm
    self._session._wait_for_task(instance_uuid, reconfig_task)
  File "/opt/stack/nova/nova/virt/vmwareapi/driver.py", line 795, in _wait_for_task
    ret_val = done.wait()
  File "/usr/local/lib/python2.7/dist-packages/eventlet/event.py", line 116, in wait
    return hubs.get_hub().switch()
  File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 187, in switch
    return self.greenlet.switch()
NovaException: File [ryan-nfs] vmware_base/e8c42ed8-05e7-45bc-90c3-49a34e5a37c6.vmdk was not found

Image locked error:

Traceback (most recent call last):
  File "/opt/stack/nova/nova/compute/manager.py", line 1407, in _spawn
    block_device_info)
  File "/opt/stack/nova/nova/virt/vmwareapi/driver.py", line 623, in spawn
    admin_password, network_info, block_device_info)
  File "/opt/stack/nova/nova/virt/vmwareapi/vmops.py", line 504, in spawn
    root_gb_in_kb, linked_clone)
  File "/opt/stack/nova/nova/virt/vmwareapi/volumeops.py", line 71, in attach_disk_to_vm
    self._session._wait_for_task(instance_uuid, reconfig_task)
  File "/opt/stack/nova/nova/virt/vmwareapi/driver.py", line 900, in _wait_for_task
    ret_val = done.wait()
  File "/usr/local/lib/python2.7/dist-packages/eventlet/event.py", line 116, in wait
    return hubs.get_hub().switch()
  File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 187, in switch
    return self.greenlet.switch()
NovaException: Unable to access file [ryan-nfs] vmware_base/f110bb94-2170-4a3a-ae0d-760f95eb8b47.0.

Environment information:
- 1 datacenter, 1 cluster, 7 hosts
- NFS shared datastore
- was able to spawn 7 instances before errors appeared
- screen log with tracebacks: http://paste.openstack.org/show/47410/
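Both tracebacks are symptoms of an unsynchronized check-then-copy on the shared datastore. The toy model below is entirely hypothetical (it is not the driver code); it uses a generator so that a bad interleaving can be forced deterministically, showing how one spawn can observe another spawn's half-written cache entry as "locked":

```python
class Datastore:
    """Toy stand-in for the shared NFS datastore."""

    def __init__(self):
        self.files = {}  # name -> "partial" or "complete"

    def exists(self, name):
        return name in self.files

    def open_disk(self, name):
        state = self.files.get(name)
        if state is None:
            raise IOError("File %s was not found" % name)  # first traceback
        if state == "partial":
            raise IOError("Unable to access file %s: locked" % name)  # second
        return "disk:" + name

def spawn_steps(ds, name):
    """One cache-then-attach spawn as a generator.

    Yields a label between steps so a test can interleave two spawns at
    will; the check-then-act sequence has no locking, which is the bug.
    """
    if not ds.exists(name):
        yield "checked"
        ds.files[name] = "partial"   # start copying the base image
        yield "copying"
        ds.files[name] = "complete"  # copy finished
    yield "cached"
    yield ds.open_disk(name)         # attach the disk to the new VM
```

Driving two such spawns by hand shows the failure: if spawn B runs its existence check while spawn A's copy is still "partial", B skips the copy and then fails to attach the locked disk, exactly like the second traceback.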