+++ This bug is a downstream clone. The original bug is: +++
+++ bug 1477717 +++
======================================================================

Description of problem:

The sanlock initialization sometimes fails without any valid error messages in the sanlock log or any exception message in the vdsm log.

2017-07-24 14:56:32,958+0530 ERROR (tasks/3) [storage.Volume] Unexpected error (volume:1110)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/volume.py", line 1104, in create
    cls.newVolumeLease(metaId, sdUUID, volUUID)
  File "/usr/share/vdsm/storage/volume.py", line 1387, in newVolumeLease
    return cls.manifestClass.newVolumeLease(metaId, sdUUID, volUUID)
  File "/usr/share/vdsm/storage/blockVolume.py", line 319, in newVolumeLease
    sanlock.init_resource(sdUUID, volUUID, [(leasePath, leaseOffset)])
SanlockException: (-202, 'Sanlock resource init failure', 'Sanlock exception')

2017-07-24 14:56:32,958+0530 ERROR (tasks/3) [storage.Image] Unexpected error (image:892)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/image.py", line 882, in copyCollapsed
    srcVolUUID=sc.BLANK_UUID)
  File "/usr/share/vdsm/storage/sd.py", line 758, in createVolume
    initialSize=initialSize)
  File "/usr/share/vdsm/storage/volume.py", line 1112, in create
    (volUUID, e))
VolumeCreationError: Error creating a new volume: (u"Volume creation 5a9a3fb8-9560-4240-aadf-1d2536c8dfe1 failed: (-202, 'Sanlock resource init failure', 'Sanlock exception')",)

According to the vdsm log, this happened four times for the customer; two of the failures occurred while creating the OVF_STORE. Nothing related to the error is logged in the sanlock log. I am unable to find a reason for the failures, as the logs carry no information. We may have to add new log entries in the related area to understand the root cause.

Version-Release number of selected component (if applicable):
vdsm-4.19.24-1.el7ev.x86_64
sanlock-3.4.0-1.el7.x86_64

How reproducible:
Rarely for the customer, while creating disks.
Actual results:
Sometimes disk creation fails with a sanlock error.

Expected results:
Disk creation should work.

Additional info:

(Originally by Nijin Ashok)
I see a bunch of errors about storage connections timing out, and sanlock renewal failures (which makes sense given the connection timeout). Nir - anything we can do from our end? (Originally by Allon Mureinik)
(In reply to nijin ashok from comment #0)
> Description of problem:
>
> The sanlock initialization sometimes fails without any valid error messages
> in sanlock log

Initializing a volume lease is done using sanlock_direct_init. This runs in the calling client process, so the sanlock daemon does not know anything about it and cannot log failures.

> or any exception message in the vdsm log.
>
> 2017-07-24 14:56:32,958+0530 ERROR (tasks/3) [storage.Volume] Unexpected
> error (volume:1110)
> Traceback (most recent call last):
>   File "/usr/share/vdsm/storage/volume.py", line 1104, in create
>     cls.newVolumeLease(metaId, sdUUID, volUUID)
>   File "/usr/share/vdsm/storage/volume.py", line 1387, in newVolumeLease
>     return cls.manifestClass.newVolumeLease(metaId, sdUUID, volUUID)
>   File "/usr/share/vdsm/storage/blockVolume.py", line 319, in newVolumeLease
>     sanlock.init_resource(sdUUID, volUUID, [(leasePath, leaseOffset)])
> SanlockException: (-202, 'Sanlock resource init failure', 'Sanlock
> exception')

This looks like an exception message to me. Maybe you are referring to the unhelpful 'Sanlock exception' text? That was fixed in sanlock 3.5.0, see:
https://pagure.io/sanlock/c/b79bd2ac317427908ced4834cc08a1198ce327a1?branch=master

Error -202 means:

$ git grep '\-202'
src/sanlock_rv.h:#define SANLK_AIO_TIMEOUT -202

which probably means that sanlock hit a timeout accessing storage while initializing the volume lease.

(Originally by Nir Soffer)
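The decoding step above can be sketched in a few lines of Python. This is not vdsm or sanlock code; the lookup table contains only the one constant quoted from src/sanlock_rv.h in this bug, and the helper name is invented for illustration.

```python
# Minimal sketch: map a raw sanlock return value to a readable name.
# Only SANLK_AIO_TIMEOUT is known from this bug; a real table would be
# generated from src/sanlock_rv.h.
SANLOCK_ERRORS = {
    -202: "SANLK_AIO_TIMEOUT",  # i/o timeout accessing storage
}

def describe_sanlock_error(rv):
    """Return a readable name for a sanlock return value, or a fallback."""
    return SANLOCK_ERRORS.get(rv, "unknown sanlock error %d" % rv)

print(describe_sanlock_error(-202))
```

Logging the decoded name next to the numeric code would have made the original vdsm traceback self-explanatory.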
David, do you think we can do anything here, or, given the SANLK_AIO_TIMEOUTs and renewal errors, is this a storage-side issue that must be fixed by the system admin?

(Originally by Nir Soffer)
Initializing new lease areas doesn't retry on i/o timeouts, and I'm not sure that it should. If you can't initialize leases without timeouts, you probably need to fix the environment first. I/O timeouts should not be a normal state on the system, and the i/o timeout handling is designed to allow the system (or storage) to be fixed gracefully. I don't know what sort of advice we give admins about storage that times out.

That being said, we can of course improve the error messages to make clear what happened. Just reporting -202 as an i/o timeout would probably help. The recent improvements in error messages were limited because the project ended up being larger than expected. The more significant improvements are still planned for the next release.

(Originally by David Teigland)
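For illustration only, here is what caller-side retry on an i/o timeout would look like. As the comment above notes, sanlock deliberately does not do this; the helper and its parameters are hypothetical, and only the -202 constant comes from this bug.

```python
import time

SANLK_AIO_TIMEOUT = -202  # from src/sanlock_rv.h, quoted earlier in this bug

def retry_on_timeout(func, attempts=3, delay=0.0):
    """Call func(), retrying on i/o timeout up to `attempts` times.

    `func` is expected to raise an exception whose args[0] is a sanlock
    return value (as python-sanlock exceptions do). Any other error, or
    exhausting the attempts, propagates to the caller.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as e:
            if e.args and e.args[0] == SANLK_AIO_TIMEOUT and attempt < attempts:
                time.sleep(delay)  # give storage a chance to recover
                continue
            raise
```

Retrying here would only mask a misbehaving storage path, which is why treating the timeout as a hard error and fixing the environment is the approach recommended above.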
With sanlock 3.5.0, this error will be reported as:

SanlockException: (-202, 'Sanlock resource init failure', 'IO timeout')

I don't think there is anything left to improve for this error.

(Originally by Nir Soffer)
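A caller can still turn this tuple into an actionable message. The sketch below is hedged: the SanlockException class is a stand-in defined locally (the real one lives in python-sanlock and needs a sanlock setup to raise), and init_volume_lease is a hypothetical helper; only the (-202, 'Sanlock resource init failure', ...) shape comes from the logs in this bug.

```python
class SanlockException(Exception):
    """Stand-in for sanlock.SanlockException: args are (rv, msg, strerror)."""

SANLK_AIO_TIMEOUT = -202  # from src/sanlock_rv.h, quoted earlier in this bug

def init_volume_lease(init_resource, sd_uuid, vol_uuid, disks):
    """Call init_resource, translating an i/o timeout into a clear error."""
    try:
        return init_resource(sd_uuid, vol_uuid, disks)
    except SanlockException as e:
        if e.args and e.args[0] == SANLK_AIO_TIMEOUT:
            # Point the admin at the likely cause instead of a bare -202.
            raise RuntimeError(
                "I/O timeout initializing lease for volume %s - check "
                "storage connectivity" % vol_uuid)
        raise
```

This mirrors the advice in the comments above: the timeout is a storage-side condition, so the message should direct the admin to the storage path rather than to sanlock.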
Can you propose verification steps here? (Originally by Lukas Svaty)
https://gerrit.ovirt.org/#/c/82482/ was merged onto ovirt-4.1 quite some time ago. Should the bug be in NEW state still?
(In reply to Yaniv Kaul from comment #19)
> https://gerrit.ovirt.org/#/c/82482/ was merged onto ovirt-4.1 quite some
> time ago. Should the bug be in NEW state still?

Nope.
CentOS7# yum list sanlock
...
Available Packages
sanlock.x86_64    3.5.0-1.el7    base

RHEL7# yum list sanlock
Loaded plugins: product-id, search-disabled-repos, versionlock
Available Packages
sanlock.x86_64    3.5.0-1.el7    pulp_rhel-7-server-rhv-4-mgmt-agent-rpms
The attached fix is in VDSM, not sanlock, so the component should match; otherwise we can't add this to the ET.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:3139