Bug 1477717
| Field | Value | Field | Value |
|---|---|---|---|
| Summary | Sanlock init failed with unhelpful error message "Sanlock exception" | | |
| Product | Red Hat Enterprise Virtualization Manager | Reporter | nijin ashok <nashok> |
| Component | vdsm | Assignee | Dan Kenigsberg <danken> |
| Status | CLOSED ERRATA | QA Contact | Raz Tamir <ratamir> |
| Severity | high | Docs Contact | |
| Priority | unspecified | | |
| Version | 4.1.4 | CC | aefrat, eedri, gveitmic, lsurette, lsvaty, nsoffer, pbrilla, ratamir, srevivo, teigland, tnisan, ycui, ykaul |
| Target Milestone | ovirt-4.2.0 | Keywords | ZStream |
| Target Release | --- | | |
| Hardware | All | | |
| OS | Linux | | |
| Whiteboard | | | |
| Fixed In Version | vdsm-4.20.4 | Doc Type | If docs needed, set a value |
| Doc Text | | Story Points | --- |
| Clone Of | | | |
| | 1497940 (view as bug list) | Environment | |
| Last Closed | 2018-05-15 17:51:57 UTC | Type | Bug |
| Regression | --- | Mount Type | --- |
| Documentation | --- | CRM | |
| Verified Versions | | Category | --- |
| oVirt Team | Storage | RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- | Target Upstream Version | |
| Embargoed | | | |
| Bug Depends On | | | |
| Bug Blocks | 1497940 | | |
Description
nijin ashok, 2017-08-02 17:19:08 UTC
Description of problem:

The sanlock initialization sometimes fails without any valid error message in
the sanlock log or any exception message in the vdsm log:

    2017-07-24 14:56:32,958+0530 ERROR (tasks/3) [storage.Volume] Unexpected error (volume:1110)
    Traceback (most recent call last):
      File "/usr/share/vdsm/storage/volume.py", line 1104, in create
        cls.newVolumeLease(metaId, sdUUID, volUUID)
      File "/usr/share/vdsm/storage/volume.py", line 1387, in newVolumeLease
        return cls.manifestClass.newVolumeLease(metaId, sdUUID, volUUID)
      File "/usr/share/vdsm/storage/blockVolume.py", line 319, in newVolumeLease
        sanlock.init_resource(sdUUID, volUUID, [(leasePath, leaseOffset)])
    SanlockException: (-202, 'Sanlock resource init failure', 'Sanlock exception')

---

I see a bunch of errors about storage connections timing out, and sanlock
renewal failures (which makes sense given the connection timeout). Nir -
anything we can do from our end?

---

(In reply to nijin ashok from comment #0)

> Description of problem:
>
> The sanlock initialization sometimes fails without any valid error messages
> in sanlock log

Initializing a volume lease is done using sanlock_direct_init. This runs in
the client caller process, so the sanlock daemon does not know anything about
it and cannot log failures (see the caller-side logging sketch after this
comment stream).

> or any exception message in the vdsm log.
>
> 2017-07-24 14:56:32,958+0530 ERROR (tasks/3) [storage.Volume] Unexpected
> error (volume:1110)
> Traceback (most recent call last):
>   File "/usr/share/vdsm/storage/volume.py", line 1104, in create
>     cls.newVolumeLease(metaId, sdUUID, volUUID)
>   File "/usr/share/vdsm/storage/volume.py", line 1387, in newVolumeLease
>     return cls.manifestClass.newVolumeLease(metaId, sdUUID, volUUID)
>   File "/usr/share/vdsm/storage/blockVolume.py", line 319, in newVolumeLease
>     sanlock.init_resource(sdUUID, volUUID, [(leasePath, leaseOffset)])
> SanlockException: (-202, 'Sanlock resource init failure', 'Sanlock
> exception')

This looks like an exception message to me. Maybe you are referring to the
unhelpful 'Sanlock exception' message? This was fixed in sanlock 3.5.0, see:
https://pagure.io/sanlock/c/b79bd2ac317427908ced4834cc08a1198ce327a1?branch=master

Error -202 means:

    $ git grep '\-202'
    src/sanlock_rv.h:#define SANLK_AIO_TIMEOUT -202

which probably means that sanlock hit a timeout accessing storage while
initializing a volume lease (a sketch of this error-code mapping follows the
comment stream below).

David, do you think we can do anything here, or, given the SANLK_AIO_TIMEOUT
and renewal errors, is this a storage-side issue that must be fixed by the
system admin?

---

Initializing new lease areas doesn't retry on I/O timeouts, and I'm not sure
that it should. If you can't initialize leases without timeouts, you probably
need to fix the environment first. I/O timeouts should not be a normal state
on the system, and handling I/O timeouts is designed to allow the system (or
storage) to be fixed gracefully. I don't know what sort of advice we give
admins about storage that times out.

That being said, we can of course improve the error messages to make clear
what happened. Just reporting -202 as an I/O timeout would probably help. The
recent improvements in error messages were limited because the project ended
up being larger than expected. The more significant improvements are still
planned for the next release.

---

With sanlock 3.5.0, this error will be reported as:

    SanlockException: (-202, 'Sanlock resource init failure', 'IO timeout')

I don't think there is anything to improve for this error.

---

Can you propose verification steps here?

---

Hi Nir, I see that the change made here in this bug is the sanlock version
requirement (now I see that sanlock-3.5.0-1 is installed). Is there anything
else to verify here? Any reasonable scenario I can get this specific
SanlockException: (-202, 'Sanlock resource init failure', 'IO timeout')?

---

(In reply to Natalie Gavrielov from comment #19)

> Any reasonable scenario I can get this specific SanlockException: (-202,
> 'Sanlock resource init failure', 'IO timeout')?

Looks verified to me.

---

Following comment 20, moving to verified.
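Since sanlock_direct_init runs in the caller's process, diagnostics for a
failed lease initialization have to come from the caller itself; the sanlock
daemon log stays silent. Below is a minimal, hypothetical sketch of such
caller-side handling (not vdsm's actual code), assuming the python-sanlock
binding shown in the traceback above and that the exception's args tuple is
(rv, message, strerror), as its repr suggests:

```python
import logging

import sanlock

log = logging.getLogger("storage.lease")


def new_volume_lease(sd_uuid, vol_uuid, lease_path, lease_offset):
    """Initialize a volume lease, logging failures on the caller side.

    init_resource() performs sanlock_direct_init() inside this process,
    so the sanlock daemon cannot log anything if it fails.
    """
    log.info("Initializing lease for volume %s at %s:%d",
             vol_uuid, lease_path, lease_offset)
    try:
        sanlock.init_resource(sd_uuid, vol_uuid, [(lease_path, lease_offset)])
    except sanlock.SanlockException as e:
        # e.args is (rv, message, strerror), e.g.
        # (-202, 'Sanlock resource init failure', 'Sanlock exception').
        log.error("Lease init failed for volume %s: %s", vol_uuid, e)
        raise
```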
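And a sketch of the error-code mapping discussed above: on hosts with sanlock
older than 3.5.0 the strerror field is the generic 'Sanlock exception', so a
caller wanting a readable message has to translate the numeric return value
itself. The constant is copied from src/sanlock_rv.h as quoted in the git grep
above; the helper itself is a hypothetical illustration:

```python
SANLK_AIO_TIMEOUT = -202  # src/sanlock_rv.h


def describe_sanlock_error(exc):
    """Return a readable cause for a SanlockException.

    Assumes exc.args is (rv, message, strerror), as the traceback in
    this bug suggests.
    """
    rv = exc.args[0]
    if rv == SANLK_AIO_TIMEOUT:
        # sanlock >= 3.5.0 already reports this as 'IO timeout'.
        return "IO timeout accessing storage while initializing the lease"
    return exc.args[2]
```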
---

Used build: rhvm-4.2.0.2-0.1.el7.noarch

---

The patch was in fact on the vdsm side; changing component.

---

Adding the relevant VDSM patch (the existing one was for 4.1).

---

Since the problem described in this bug report should be resolved in a recent
advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow
the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1489

---

BZ<2>Jira Resync

---

qe_test_coverage is '-' as this bug has no clear scenario for verification
and the issue rarely occurs. Regression tests did not catch it last time, so
qe_test_coverage is not set to '+' either.