Bug 1516689
Summary: Sparsify doesn't work on iSCSI

Product: [oVirt] ovirt-engine
Component: Documentation
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: unspecified
Version: ---
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Reporter: Liran Rotenberg <lrotenbe>
Assignee: Shmuel Melamud <smelamud>
QA Contact: Liran Rotenberg <lrotenbe>
Keywords: Reopened
Flags: rule-engine: ovirt-4.3+, rule-engine: devel_ack+
CC: adahms, amureini, bugs, ishaby, istein, lrotenbe, lsurette, lsvaty, mavital, michal.skrivanek, mlipchuk, obockows, rbalakri, Rhev-m-bugs, smelamud, srevivo, tjelinek, tnisan, ykaul, ylavi
Doc Type: If docs needed, set a value
Story Points: ---
Last Closed: 2018-03-22 07:42:54 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: Virt
Cloudforms Team: ---
Bug Depends On: 1522692
Bug Blocks: 1525242
Description
Liran Rotenberg, 2017-11-23 09:41:47 UTC
it looks a lot like https://bugzilla.redhat.com/show_bug.cgi?id=1469134

Can you please check the XDG_RUNTIME_DIR env variable on the vdsm side for the vdsm user?

# su -p vdsm -c libguestfs-test-tool

Do not do that, as explained in bug 1469134. Also, specify the entities you worked with and roughly when. Attaching a bunch of log files covering a week of random QE testing is not helpful.

Added a shorter version of vdsm.log from host1. Detailed entities and times:

2017-11-22 17:41:10 host_mixed_3
2017-11-23 10:37:38 host_mixed_1

These are examples of two runs. Also, sparsify doesn't work on NFS/Gluster either.

Nothing useful in logs so far. But please do describe your claim that it passes without any sparsification in more detail. How did you determine that it doesn't work? Also note that NFS has limitations (look up the original RFE/doc).

@shmuel, why is there nothing useful in the logs?

(In reply to Michal Skrivanek from comment #7)
> @shmuel, why there's nothing useful in logs?

It looks like virt-sparsify wasn't executed at all. This is very strange and I'd like to investigate it on the host where it's happening. I see the output from execCmd() for other commands that were executed, and I see that the verb was called, the job was run, and the volume was prepared and locked and then unlocked and released, but nothing in the middle.

(In reply to Michal Skrivanek from comment #7)
> But please do describe your claim about the fact that it passes without any
> sparsification in more detail. How did you determine that it doesn't work?

The claim comes from creating a VM and checking the used space on the VM disk. After writing a file to the VM, the used space is checked again (say the first check showed 1.7GB; after writing a 400MB file it is 2.1GB on the second check). After deleting the file and running sparsify (which passes successfully), the used space is checked a third time, and it stays at the second check's value (2.1GB), whereas the expectation is: used space on third check < used space on second check.

right, it works only on NFS V4.2 or iscsi. What was your nfs version? For quick reference, here is the original RFE: https://bugzilla.redhat.com/show_bug.cgi?id=734120

(In reply to Tomas Jelinek from comment #2)
> it looks a lot like https://bugzilla.redhat.com/show_bug.cgi?id=1469134
>
> Can you please check the XDG_RUNTIME_DIR env variable on the vdsm side for
> the vdsm user?

echo $XDG_RUNTIME_DIR
/run/user/0

I unset XDG_RUNTIME_DIR and ran libguestfs-test-tool; the result was that the test finished OK.

(In reply to Tomas Jelinek from comment #10)
> right, it works only on NFS V4.2 or iscsi. What was your nfs version?

It doesn't work on any type: NFS/iSCSI/Gluster. Checking the NFS version:

❯❯❯ mount -v | grep /mnt
yellow-vdsb.qa.lab.tlv.redhat.com:/Compute_NFS on /mnt type nfs4 (rw,relatime,seclabel,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.35.206.45,local_lock=none,addr=10.35.80.5)

Also, I checked whether SELinux has anything to do with it by disabling it, but this is not the issue. Adding the libguestfs-test-tool log.

Created attachment 1360290 [details]
libguestfs-test-tool log file
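The XDG_RUNTIME_DIR check above can be illustrated with a short sketch. This simulates the situation described in comment #2 locally via `os.environ`; the value `/run/user/0` is the problematic one from this bug (see bug 1469134), and the "fix" shown is simply unsetting the variable before running libguestfs-test-tool, as was done here.

```python
# Illustrative sketch of the XDG_RUNTIME_DIR situation from this bug.
# A leaked /run/user/0 in the vdsm user's environment can break libguestfs
# (bug 1469134); the workaround used in this thread was to unset it.
import os

os.environ["XDG_RUNTIME_DIR"] = "/run/user/0"     # the problematic leaked value
print("before:", os.environ.get("XDG_RUNTIME_DIR", "unset"))

os.environ.pop("XDG_RUNTIME_DIR", None)           # the workaround: unset it
print("after:", os.environ.get("XDG_RUNTIME_DIR", "unset"))
```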
(In reply to Liran Rotenberg from comment #11)

I don't see NFS mounted with v4.2 on the host you've been using for testing (virt-nested-vm08.scl.lab.tlv.redhat.com). All NFS filesystems are mounted with v3:

yellow-vdsb.qa.lab.tlv.redhat.com:/Compute_NFS/GE/41/ENV_2/nfs_0 /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_GE_41_ENV__2_nfs__0 nfs rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,soft,nosharecache,proto=tcp,timeo=600,retrans=6,sec=sys,mountaddr=10.35.80.5,mountvers=3,mountport=20048,mountproto=udp,local_lock=none,addr=10.35.80.5 0 0
yellow-vdsb.qa.lab.tlv.redhat.com:/Compute_NFS/GE/41/ENV_2/nfs_1 /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_GE_41_ENV__2_nfs__1 nfs rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,soft,nosharecache,proto=tcp,timeo=600,retrans=6,sec=sys,mountaddr=10.35.80.5,mountvers=3,mountport=20048,mountproto=udp,local_lock=none,addr=10.35.80.5 0 0
yellow-vdsb.qa.lab.tlv.redhat.com:/Compute_NFS/GE/41/ENV_2/nfs_2 /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_GE_41_ENV__2_nfs__2 nfs rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,soft,nosharecache,proto=tcp,timeo=600,retrans=6,sec=sys,mountaddr=10.35.80.5,mountvers=3,mountport=20048,mountproto=udp,local_lock=none,addr=10.35.80.5 0 0
yellow-vdsb.qa.lab.tlv.redhat.com:/Compute_NFS/GE/41/ENV_2/export_domain /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_GE_41_ENV__2_export__domain nfs rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,soft,nosharecache,proto=tcp,timeo=600,retrans=6,sec=sys,mountaddr=10.35.80.5,mountvers=3,mountport=20048,mountproto=udp,local_lock=none,addr=10.35.80.5 0 0

I also don't see in the storage configuration in the Engine that the NFS storage is forced to be mounted with version 4.2. This seems to me to be the root of the problem. I still cannot figure out why the execution of virt-sparsify is not logged, but it is executed, as I have tested.
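The manual NFS-version check done above can be scripted. A minimal sketch, using one of the mount-table entries quoted in this bug as a sample; on a live host you would iterate over the lines of /proc/mounts instead of a hard-coded string:

```python
# Sketch: extract the NFS protocol version from a mount-table entry.
# The entry is a sample copied from this bug (an NFS v3 mount); on a real
# host, read /proc/mounts and filter for the nfs/nfs4 filesystem type.
import re

entry = ("yellow-vdsb.qa.lab.tlv.redhat.com:/Compute_NFS/GE/41/ENV_2/nfs_0 "
         "/rhev/data-center/mnt/... nfs "
         "rw,relatime,vers=3,rsize=1048576,wsize=1048576,sec=sys")

m = re.search(r"vers=([\d.]+)", entry)
vers = m.group(1) if m else None
# Discard pass-through needs NFS 4.2; a v3 mount explains the failure here.
print("NFS version:", vers)
```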
Quick update: Liran and I talked face-to-face about this issue yesterday. It seems we're afflicted by platform bug 1489262, which prevents us from mounting the storage domain with NFS 4.2. Using the not-yet-released nfs-utils package from that bug resolved the mounting issue for us (and then virt-sparsify should be tested; I didn't continue the investigation past the successful mounting). I've opened bug 1522692 to track requiring this nfs-utils fix in oVirt.

@QE Stakeholders: You may want to block this BZ on one of the aforementioned ones; your call.

Another update: I did manage to mount the NFS as v4.2 according to bug 1489262 and checked each scenario: NFS/Gluster/iSCSI. The results are:

Test failing on:

- Engine version 4.1.8.2-0.1.el7, storage type iSCSI. This is the same as the logs already attached to this bug, nothing new (all the logs I already put here are from iSCSI 4.1 testing, with the same results).

- Engine version 4.2.0-0.5.master.el7, storage type NFS (mounted as version 4.2). This time I'm getting an error on the sparsify operation. I can see in the vdsm log:

2017-12-06 11:39:45,846+0200 DEBUG (tasks/9) [storage.ResourceManager] Trying to release resource '00_storage.8333b00c-c6f4-4e9a-8eed-ab743ed953c3' (resourceManager:567)
2017-12-06 11:39:45,846+0200 DEBUG (tasks/9) [storage.ResourceManager] Released resource '00_storage.8333b00c-c6f4-4e9a-8eed-ab743ed953c3' (0 active users) (resourceManager:585)
2017-12-06 11:39:45,846+0200 DEBUG (tasks/9) [storage.ResourceManager] Resource '00_storage.8333b00c-c6f4-4e9a-8eed-ab743ed953c3' is free, finding out if anyone is waiting for it. (resourceManager:591)
2017-12-06 11:39:45,847+0200 DEBUG (tasks/9) [storage.ResourceManager] No one is waiting for resource '00_storage.8333b00c-c6f4-4e9a-8eed-ab743ed953c3', Clearing records. (resourceManager:599)
2017-12-06 11:39:45,847+0200 ERROR (tasks/9) [root] Job 'e44930b3-ec80-4555-818c-6b1758658e2b' failed (jobs:221)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/jobs.py", line 157, in run
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdm/api/sparsify_volume.py", line 52, in _run
    with guarded.context(self._vol_info.locks):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/guarded.py", line 102, in __enter__
    six.reraise(*exc)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/guarded.py", line 93, in __enter__
    lock.acquire()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/volume.py", line 1518, in acquire
    dom.acquireVolumeLease(self._host_id, self._img_id, self._vol_id)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 475, in acquireVolumeLease
    self._domainLock.acquire(hostId, lease)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/clusterlock.py", line 389, in acquire
    % (lease, hostId))
AcquireHostIdFailure: Cannot acquire host id: ("Timeout acquiring host id, cannot acquire Lease(name='c28cd9aa-09c0-4c7b-aa16-61b4b9e01f84', path=u'/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_GE_compute-ge-4_nfs__for__test/8333b00c-c6f4-4e9a-8eed-ab743ed953c3/images/00d7a4be-b95d-476b-ade7-3fd02e834809/c28cd9aa-09c0-4c7b-aa16-61b4b9e01f84.lease', offset=0) (id=1)",)

and from the engine log:

2017-12-06 11:39:49,429+02 INFO [org.ovirt.engine.core.bll.StorageJobCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-15) [dbb67bd7-0a36-4314-bac1-5485022e71ae] Command SparsifyImage id: '941d0d42-0d88-48bf-930a-af45695e1e62': job 'e44930b3-ec80-4555-818c-6b1758658e2b' execution was completed with VDSM job status 'failed'
2017-12-06 11:39:49,440+02 INFO [org.ovirt.engine.core.bll.StorageJobCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-15) [dbb67bd7-0a36-4314-bac1-5485022e71ae] Command SparsifyImage id: '941d0d42-0d88-48bf-930a-af45695e1e62': execution was completed, the command status is 'FAILED'
2017-12-06 11:39:50,465+02 ERROR [org.ovirt.engine.core.bll.storage.disk.SparsifyImageCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-23) [dbb67bd7-0a36-4314-bac1-5485022e71ae] Ending command 'org.ovirt.engine.core.bll.storage.disk.SparsifyImageCommand' with failure.
2017-12-06 11:39:50,499+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-23) [dbb67bd7-0a36-4314-bac1-5485022e71ae] EVENT_ID: USER_SPARSIFY_IMAGE_FINISH_FAILURE(1,327), Failed to sparsify GlanceDisk-903ba23.

I hope this might be helpful. Adding the vdsm and engine logs for the 4.2 engine with NFS.

Created attachment 1363692 [details]
logs for the nfs
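The traceback above passes through vdsm's `guarded.context`, which takes a list of locks and, if a later acquire fails, releases the locks already taken before re-raising. A simplified sketch of that acquire-all-or-release pattern (not vdsm's actual code, which also aggregates release errors):

```python
# Simplified sketch of an acquire-all-or-release lock context, the pattern
# behind vdsm's guarded.context seen in the traceback above.
from contextlib import contextmanager

@contextmanager
def guarded_context(locks):
    taken = []
    try:
        for lock in locks:
            lock.acquire()        # may raise, e.g. AcquireHostIdFailure
            taken.append(lock)
        yield
    finally:
        # Release in reverse order; runs both on success and on a failed acquire.
        for lock in reversed(taken):
            lock.release()
```

With this shape, a timeout acquiring the host id (as in the log) aborts the whole operation cleanly: locks taken so far are released and the exception propagates to the job runner.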
(In reply to Liran Rotenberg from comment #15)
> File "/usr/lib/python2.7/site-packages/vdsm/storage/guarded.py", line 93, in __enter__
>   lock.acquire()
> File "/usr/lib/python2.7/site-packages/vdsm/storage/volume.py", line 1518, in acquire
>   dom.acquireVolumeLease(self._host_id, self._img_id, self._vol_id)
> File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 475, in acquireVolumeLease
>   self._domainLock.acquire(hostId, lease)
> File "/usr/lib/python2.7/site-packages/vdsm/storage/clusterlock.py", line 389, in acquire
>   % (lease, hostId))
> AcquireHostIdFailure: Cannot acquire host id: ("Timeout acquiring host id, cannot acquire Lease(name='c28cd9aa-09c0-4c7b-aa16-61b4b9e01f84', path=u'/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_GE_compute-ge-4_nfs__for__test/8333b00c-c6f4-4e9a-8eed-ab743ed953c3/images/00d7a4be-b95d-476b-ade7-3fd02e834809/c28cd9aa-09c0-4c7b-aa16-61b4b9e01f84.lease', offset=0) (id=1)",)

Seems it cannot get a volume lease. Somebody from the storage team needs to take a look at it to understand why this happens.
(In reply to Shmuel Melamud from comment #17)
> Seems it cannot get a volume lease. Need somebody from storage team to take
> a look on it to understand why this happens.

Tal, could you please?

Maor, you're this week's QE contact, please have a look.

Liran, can you please attach the sanlock logs? They should probably be located in /var/log/sanlock.log.

Created attachment 1367132 [details]
sanlock log
Sanlock log of host_mixed_1 is added.
Based on the sanlock logs, it looks like there is a problem in the storage configuration; every 10 seconds we get a permission-denied error for the 'ids' volume:

2017-12-06 11:39:31 4052 [10234]: s394 open_disk /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_GE_compute-ge-4_nfs__for__test/8333b00c-c6f4-4e9a-8eed-ab743ed953c3/dom_md/ids error -13
...
2017-12-06 11:39:41 4062 [10248]: s395 open_disk /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_GE_compute-ge-4_nfs__for__test/8333b00c-c6f4-4e9a-8eed-ab743ed953c3/dom_md/ids error -13

Can you please take a look at your storage configuration? Maybe it is related to SELinux.

It was a problem with my environment. I checked both with SELinux enforcing and disabled, and it's working fine now. The issue with iSCSI still stands.

*** Bug 1530223 has been marked as a duplicate of this bug. ***

(In reply to Liran Rotenberg from comment #23)
> The issue with iscsi still stands.

As far as I know the configuration of the iSCSI storage domain you are using, every VM disk it provides is a partition on the iSCSI server side. virt-sparsify will not have any effect in such a configuration, because the blocks it frees in the partition are not returned to the storage domain.

virt-sparsify works as follows: it creates a temporary VM, attaches the disk to it and runs it. Inside the VM, a script walks over the filesystem and discards all unused blocks. Whether the discarded blocks are returned to the storage depends on the type of the storage domain.

Idan, is it possible to configure an iSCSI storage domain in such a way that the blocks discarded by virt-sparsify are returned to the storage? Does the iSCSI protocol propagate the discard command to the server?

Short answer: no. Long answer: I wrote about discard in block vs. file storage domains in [1].
When you discard a block in a disk that resides on a block domain, the UNMAP command goes down to the storage array and frees the block from the thinly provisioned LUN in the underlying storage. Since in oVirt we use plain LVM, not thin LVM, in block domains, there is no way to reduce the size of the LV (the disk), only to enlarge it. So if you want to check that the sparsification worked, go to the storage array and check that the size of the thinly provisioned LUN (or one of the LUNs) that this iSCSI domain is built from got smaller.

[1] https://www.ovirt.org/develop/release-management/features/storage/pass-discard-from-guest-to-underlying-storage/#possible-consumers-of-the-discarded-blocks

It can't be fixed on the client side, but we should still document that the feature doesn't work there, or suggest the required server-side changes.

Updated the feature page.

(In reply to Liran Rotenberg from comment #23)
> It was a problem with my environment. I checked both with selinux enforced
> and disabled and it's working fine now.
>
> The issue with iscsi still stands.

Liran, what is the exact error that you are getting with iSCSI (from comment 23)?

(In reply to Yaniv Kaul from comment #29)
> Liran, what is the exact error that you are getting with iSCSI (from comment
> 23)?

The error is that the sparsify feature doesn't do what it's supposed to. In the logs everything is OK and the sparsify passes successfully, but checking whether it actually did what it's supposed to fails. In this scenario we do these steps:

1. Create a new LUN.
2. Add an iSCSI storage domain with the new LUN.
3. Copy a VM template disk to the new storage domain (disk size is about 2.2GB).
4. Create a VM from this template on the new SD.
5. Start the VM.
6. Write a file on it (400MB).
7. Fetch the actual size (around 2.6GB).
8. Delete the file from step 6.
9. Shut down the VM.
10. Sparsify the VM disk.
11. Fetch the actual size.

From step 11 we expect to see that the size is now lower than 2.6GB (in the best case, lower by 400MB).
The actual result, which is the error, is that we get the same size as in step 7. I hope it's more understandable now. This was checked on both XtremIO and NetApp.
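The measurement in steps 7 and 11 compares "actual size" (allocated blocks) rather than apparent size, and sparsification is expected to lower the former without changing the latter. A minimal local illustration of that distinction, with plain files standing in for VM disks (not tied to oVirt):

```python
# Apparent size vs. actual (allocated) size: the distinction behind the
# size checks in steps 7 and 11 above. A sparse file has the same apparent
# size as a fully written one but fewer allocated blocks, which is what
# a successful sparsify is expected to achieve on a disk image.
import os

with open("full.img", "wb") as f:        # fully allocated 4 MiB
    f.write(b"\0" * 4 * 1024 * 1024)

with open("sparse.img", "wb") as f:      # same apparent size, but a hole
    f.truncate(4 * 1024 * 1024)

full, sparse = os.stat("full.img"), os.stat("sparse.img")
print("apparent size:", full.st_size, "vs", sparse.st_size)        # equal
print("allocated blocks:", full.st_blocks, "vs", sparse.st_blocks)  # sparse is smaller
```

On a block (iSCSI) domain backed by plain LVM, as explained above, the LV never shrinks, so this "actual size" check is expected to show no change even after a successful virt-sparsify run.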