Bug 1694972

Summary: Reading same image on block storage concurrently using vdsm HTTP server may fail one of the requests

| Field | Value |
| --- | --- |
| Product | [oVirt] vdsm |
| Component | Core |
| Version | 4.30.12 |
| Status | CLOSED CURRENTRELEASE |
| Severity | medium |
| Priority | high |
| Reporter | Kobi Hakimi <khakimi> |
| Assignee | Nir Soffer <nsoffer> |
| QA Contact | Nikolai Sednev <nsednev> |
| CC | bugs, derez, frolland, nsoffer, royoung, tnisan |
| Keywords | Triaged |
| Flags | sbonazzo: ovirt-4.5? |
| Hardware | Unspecified |
| OS | Unspecified |
| Target Milestone | ovirt-4.4.1 |
| Target Release | 4.40.21 |
| Fixed In Version | vdsm-4.40.21 |
| oVirt Team | Storage |
| Type | Bug |
| Last Closed | 2020-08-05 06:25:19 UTC |
Description (Kobi Hakimi, 2019-04-02 08:06:03 UTC)
Created attachment 1550916: vdsm log
Created attachment 1550917: engine log
ovirt-engine-4.3.3.1-0.1.el7.noarch
vdsm-4.30.12-1.el7ev.x86_64

I'm not sure I understood the flow here. How are the volume transfer to the engine and the hosted-engine deployment of the 3rd host related? It seems this is not easily reproducible and has a workaround of simply retrying the host deployment. Keeping it open for further investigation.

Reproduced in 4.2.8-10 with ge-6 and ge-7:
ovirt-imageio-proxy-1.4.5-0.el7ev.noarch
ovirt-imageio-proxy-setup-1.4.5-0.el7ev.noarch
ovirt-imageio-common-1.4.5-0.el7ev.x86_64

Nir, can you please look into this? It does not look strictly related to hosted engine; it seems to be a storage operation failing.

Daniel, is this something that was fixed lately?

(In reply to Fred Rolland from comment #8)
> Daniel,
> Is it something that was fixed lately?

Not sure, but it is not related to imageio. It looks like an issue with downloading an OVF that does not exist yet. Can you reproduce with 4.4?

We didn't see this issue lately with 4.4, so please close it as not reproduced. If we see it again we can reopen it ;-)

We can close this, but I think this is an issue in hosted engine: hosted engine is trying to read the OVF disk while the disk is empty. The call failed because the caller asked for:

    'length': 20480

which does not exist on storage, according to:

    MiscFileReadException: Internal file read failure: ('partial data 0 from 20480',)

In this case the client gets "500 Internal server error" and the agent should retry in the next check interval.

When engine uploads the OVF, it updates the OVF disk description. Here is the description from the OVF disks in my setup:

    $ grep OVF_STORE /rhev/data-center/mnt/nfs1\:_export_1/56ecc03c-4bb5-4792-8971-3c51ea924d2e/images/*/*.meta
    /rhev/data-center/mnt/nfs1:_export_1/56ecc03c-4bb5-4792-8971-3c51ea924d2e/images/d6ac0ab5-bd39-4e19-8306-fd3bd36d64ac/adc60bd9-c0f3-4a49-a859-caf6562c79c4.meta:DESCRIPTION={"Updated":true,"Size":23040,"Last Updated":"Wed Jun 17 21:26:32 IDT 2020","Storage Domains":[{"uuid":"56ecc03c-4bb5-4792-8971-3c51ea924d2e"}],"Disk Description":"OVF_STORE"}
    /rhev/data-center/mnt/nfs1:_export_1/56ecc03c-4bb5-4792-8971-3c51ea924d2e/images/d8932fa5-d229-44af-826a-19c4989243dd/f3401def-65a5-438f-a3a6-2f40162ae646.meta:DESCRIPTION={"Updated":true,"Size":23040,"Last Updated":"Wed Jun 17 21:26:32 IDT 2020","Storage Domains":[{"uuid":"56ecc03c-4bb5-4792-8971-3c51ea924d2e"}],"Disk Description":"OVF_STORE"}

So we have:

    Disk ID:      d6ac0ab5-bd39-4e19-8306-fd3bd36d64ac
    Volume ID:    adc60bd9-c0f3-4a49-a859-caf6562c79c4
    Updated:      true
    Size:         23040
    Last Updated: Wed Jun 17 21:26:32 IDT 2020

    Disk ID:      d8932fa5-d229-44af-826a-19c4989243dd
    Volume ID:    f3401def-65a5-438f-a3a6-2f40162ae646
    Updated:      true
    Size:         23040
    Last Updated: Wed Jun 17 21:26:32 IDT 2020

By parsing the disk description we can tell which disk contains the most recent OVF data (both look the same here) and the size of the data, and use the right size in the request (a minimal parsing sketch follows below).

So this looks like 2 issues:
- Getting the wrong size of the OVF volume
- Not handling expected failures
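To make the parsing idea concrete, here is a minimal sketch of reading the DESCRIPTION field from the volume .meta files shown above and picking the most recently updated OVF_STORE volume. This is not engine or vdsm code; the helper names and the .meta paths are assumptions for illustration only.

    import json

    def parse_meta_description(meta_path):
        """Return the DESCRIPTION JSON from a volume .meta file, or None."""
        with open(meta_path) as f:
            for line in f:
                if line.startswith("DESCRIPTION="):
                    return json.loads(line[len("DESCRIPTION="):])
        return None

    def newest_ovf_store(meta_paths):
        """Pick the updated OVF_STORE volume with the latest 'Last Updated',
        returning (last_updated, size, meta_path)."""
        candidates = []
        for path in meta_paths:
            desc = parse_meta_description(path)
            if desc and desc.get("Updated") and \
                    desc.get("Disk Description") == "OVF_STORE":
                candidates.append((desc["Last Updated"], desc["Size"], path))
        # NOTE: 'Last Updated' is a human-readable date string; real code
        # would parse it into a datetime before comparing.
        return max(candidates) if candidates else None

The caller could then use the returned Size (23040 in the example above) instead of a hard-coded length when requesting the OVF data.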
Looking in the vdsm log, I think the issue is in vdsm after all.

1. First uploadImageToStream from ::ffff:10.46.16.247:49588 starting

    2019-04-02 02:07:40,730+0300 INFO (http) [vds.http.Server] Starting request handler for ::ffff:10.46.16.247:49588 (http:294)
    2019-04-02 02:07:40,731+0300 INFO (Thread-7) [vds.http.Server] Request handler for ::ffff:10.46.16.247:49588 started (http:304)
    2019-04-02 02:07:40,733+0300 INFO (Thread-7) [vdsm.api] START uploadImageToStream(methodArgs={'length': 20480, 'fileObj': <socket._fileobject object at 0x7f78a43b9d50>}, callback=<function setCallback at 0x7f78c41082a8>, startEvent=<threading._Event object at 0x7f78c417d1d0>, spUUID='b780e05d-13a5-4b34-8ca6-dcd1e3db6e79', sdUUID='04be3824-bc78-4be0-bcdb-c9397e054cc0', imgUUID='5cdbfac5-25c7-4b3b-a57b-141fee6d728b', volUUID='b45c1a18-2701-4c0a-b1b3-9552792c9309') from=internal, task_id=2b23acec-57ac-4058-82fb-efb75524ad49 (api:48)

2. This request starts task 2b23acec-57ac-4058-82fb-efb75524ad49, running in thread tasks/4

    2019-04-02 02:07:40,908+0300 INFO (tasks/4) [storage.ThreadPool.WorkerThread] START task 2b23acec-57ac-4058-82fb-efb75524ad49 (cmd=<bound method Task.commit of <vdsm.storage.task.Task instance at 0x7f78a41a9518>>, args=None) (threadPool:208)

3. Less than a second later, another uploadImageToStream from ::ffff:10.46.16.247:35434

    2019-04-02 02:07:41,414+0300 INFO (http) [vds.http.Server] Starting request handler for ::ffff:10.46.16.247:35434 (http:294)
    2019-04-02 02:07:41,425+0300 INFO (Thread-8) [vdsm.api] START uploadImageToStream(methodArgs={'length': 20480, 'fileObj': <socket._fileobject object at 0x7f78a43b9a50>}, callback=<function setCallback at 0x7f78c41080c8>, startEvent=<threading._Event object at 0x7f78a428d8d0>, spUUID='b780e05d-13a5-4b34-8ca6-dcd1e3db6e79', sdUUID='04be3824-bc78-4be0-bcdb-c9397e054cc0', imgUUID='5cdbfac5-25c7-4b3b-a57b-141fee6d728b', volUUID='b45c1a18-2701-4c0a-b1b3-9552792c9309') from=internal, task_id=4b79d567-f2e6-4d26-9c35-c7f9579e54fc (api:48)

4. This request starts task 4b79d567-f2e6-4d26-9c35-c7f9579e54fc, running in thread tasks/0

    2019-04-02 02:07:41,493+0300 INFO (tasks/0) [storage.ThreadPool.WorkerThread] START task 4b79d567-f2e6-4d26-9c35-c7f9579e54fc (cmd=<bound method Task.commit of <vdsm.storage.task.Task instance at 0x7f78a423cfc8>>, args=None) (threadPool:208)

Both tasks are reading the volume b45c1a18-2701-4c0a-b1b3-9552792c9309.

5. tasks/4 activates the volume

    2019-04-02 02:07:41,164+0300 INFO (tasks/4) [storage.LVM] Refreshing active lvs: vg=04be3824-bc78-4be0-bcdb-c9397e054cc0 lvs=['b45c1a18-2701-4c0a-b1b3-9552792c9309'] (lvm:1410)

6. tasks/4 finishes reading the image, so it deactivates the volume

    2019-04-02 02:07:41,658+0300 INFO (tasks/4) [storage.LVM] Deactivating lvs: vg=04be3824-bc78-4be0-bcdb-c9397e054cc0 lvs=['b45c1a18-2701-4c0a-b1b3-9552792c9309'] (lvm:1422)

7. tasks/0 found that the volume was active, so it only refreshes the volume

    2019-04-02 02:07:41,723+0300 INFO (tasks/0) [storage.LVM] Refreshing active lvs: vg=04be3824-bc78-4be0-bcdb-c9397e054cc0 lvs=['b45c1a18-2701-4c0a-b1b3-9552792c9309'] (lvm:1

Because this happens only 70 milliseconds after tasks/4 deactivated the volume, tasks/0 did not see the deactivated lv (running an lvm command takes more than 70 milliseconds). So at this point the lv had already been deactivated by thread tasks/4...

8. tasks/0 fails to read from the non-existing device

    2019-04-02 02:07:41,945+0300 ERROR (tasks/0) [storage.ImageSharing] partial data 0 from 20480 (imageSharing:122)
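To make the sequence above concrete, here is a minimal, purely illustrative sketch (not vdsm code) of the same pattern: two readers that each activate, read, and then unconditionally deactivate a shared volume, with no reference counting, so the first reader's deactivation invalidates the second reader's read.

    import threading
    import time

    volume_active = False  # stands in for the LV activation state

    def read_volume(name, read_time):
        global volume_active
        volume_active = True       # "activate" (or refresh if already active)
        time.sleep(read_time)      # simulate copying the image
        if volume_active:
            print(f"{name}: read ok")
        else:
            print(f"{name}: partial data, volume deactivated by the other reader")
        volume_active = False      # unconditional "deactivate", like step 6 above

    t1 = threading.Thread(target=read_volume, args=("tasks/4", 0.1))
    t2 = threading.Thread(target=read_volume, args=("tasks/0", 0.3))
    t1.start()
    time.sleep(0.05)
    t2.start()
    t1.join()
    t2.join()

Here "tasks/0" finds the volume gone because "tasks/4" deactivated it first, which corresponds to the "partial data 0 from 20480" failure in step 8.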
The issue is bad locking: uploadImageToStream and downloadImageFromStream must take a lock on the volume for the entire operation, so that one task cannot deactivate a volume used by another task.

Looking at the code, uploadImageToStream and downloadImageFromStream do take a lock on the image:

    1742     def uploadImageToStream(self, methodArgs, callback, startEvent, sdUUID,
    1743                             imgUUID, volUUID=None):
    1744         """
    1745         Retrieves an image from to a given file the specified method
    1746         and methodArgs.
    1747         """
    1748         while not startEvent.is_set():
    1749             startEvent.wait()
    1750
    1751         img_ns = rm.getNamespace(sc.IMAGE_NAMESPACE, sdUUID)
    1752         with rm.acquireResource(img_ns, imgUUID, rm.SHARED):
    1753             try:
    1754                 img = image.Image(self.poolPath)
    1755                 return img.copyFromImage(methodArgs, sdUUID, imgUUID, volUUID)
    1756             finally:
    1757                 callback()
    1758
    1759     def downloadImageFromStream(self, methodArgs, callback, sdUUID, imgUUID,
    1760                                 volUUID=None):
    1761         """
    1762         Download an image from a stream.
    1763         """
    1764         img_ns = rm.getNamespace(sc.IMAGE_NAMESPACE, sdUUID)
    1765         with rm.acquireResource(img_ns, imgUUID, rm.EXCLUSIVE):
    1766             try:
    1767                 img = image.Image(self.poolPath)
    1768                 return img.copyToImage(methodArgs, sdUUID, imgUUID, volUUID)
    1769             finally:
    1770                 callback()

downloadImageFromStream takes an exclusive lock, so there is no race between upload and download, but uploadImageToStream takes a shared lock, so we can have multiple readers.

The problem is that Image.copyFromImage activates the volume before the copy and deactivates it after the copy:

    985     def copyFromImage(self, methodArgs, sdUUID, imgUUID, volUUID):
    986         domain = sdCache.produce(sdUUID)
    987
    988         vol = self._activateVolumeForImportExport(domain, imgUUID, volUUID)
    989         try:
    990             imageSharing.copyFromImage(vol.getVolumePath(), methodArgs)
    991         finally:
    992             domain.deactivateImage(imgUUID)

So while two tasks are holding the shared lock, they each activate and deactivate the same volume, and one task can deactivate the volume just before the other task tries to read it.

It looks like we can allow only a single reader for uploadImageToStream. If we want to allow more than one reader, we need to implement reference counting to avoid uncontrolled deactivation (a hypothetical sketch of such reference counting is appended at the end of this report).

Moving to 4.4.2 since this is an easy fix.

Works for me on latest ovirt-engine-4.4.1.8-0.7.el8ev.noarch. The reported issue no longer exists.

This bugzilla is included in the oVirt 4.4.1 release, published on July 8th 2020. Since the problem described in this bug report should be resolved in oVirt 4.4.1, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.
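Following up on the reference-counting idea mentioned in the analysis above, here is a minimal, hypothetical sketch of how concurrent readers could share a volume activation so that only the last reader deactivates it. This is not vdsm code; activate_volume() and deactivate_volume() stand in for the real LVM activation calls, and the actual fix may simply take the image lock in exclusive mode instead.

    import threading
    from contextlib import contextmanager

    _refcounts = {}                  # volUUID -> number of current users
    _refcounts_lock = threading.Lock()

    @contextmanager
    def volume_activated(vol_uuid, activate_volume, deactivate_volume):
        """Keep vol_uuid activated for the duration of the block, deactivating
        it only when the last concurrent user leaves the block."""
        with _refcounts_lock:
            _refcounts[vol_uuid] = _refcounts.get(vol_uuid, 0) + 1
            if _refcounts[vol_uuid] == 1:
                activate_volume(vol_uuid)        # first user activates the LV
        try:
            yield
        finally:
            with _refcounts_lock:
                _refcounts[vol_uuid] -= 1
                if _refcounts[vol_uuid] == 0:
                    del _refcounts[vol_uuid]
                    deactivate_volume(vol_uuid)  # last user deactivates the LV

    # Usage sketch: both concurrent uploadImageToStream readers would wrap the
    # copy in the same context, so neither can pull the LV out from under the
    # other:
    #
    #     with volume_activated(volUUID, lvm_activate, lvm_deactivate):
    #         imageSharing.copyFromImage(vol_path, methodArgs)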