Bug 1477717 - Sanlock init failed with unhelpful error message "Sanlock exception"
Summary: Sanlock init failed with unhelpful error message "Sanlock exception"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.1.4
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.2.0
Target Release: ---
Assignee: Dan Kenigsberg
QA Contact: Raz Tamir
URL:
Whiteboard:
Depends On:
Blocks: 1497940
 
Reported: 2017-08-02 17:19 UTC by nijin ashok
Modified: 2020-09-10 11:06 UTC
CC List: 13 users

Fixed In Version: vdsm-4.20.4
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1497940 (view as bug list)
Environment:
Last Closed: 2018-05-15 17:51:57 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2018:1489 0 None None None 2018-05-15 17:53:24 UTC
oVirt gerrit 76224 0 'None' 'MERGED' 'spec: Require sanlock 3.5' 2019-11-26 21:07:21 UTC
oVirt gerrit 82482 0 'None' 'MERGED' 'spec: Require sanlock 3.5 on EL' 2019-11-26 21:07:22 UTC

Description nijin ashok 2017-08-02 17:19:08 UTC
Description of problem:

The sanlock initialization sometimes fails without any valid error message in the sanlock log or any helpful exception message in the vdsm log.

2017-07-24 14:56:32,958+0530 ERROR (tasks/3) [storage.Volume] Unexpected error (volume:1110)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/volume.py", line 1104, in create
    cls.newVolumeLease(metaId, sdUUID, volUUID)
  File "/usr/share/vdsm/storage/volume.py", line 1387, in newVolumeLease
    return cls.manifestClass.newVolumeLease(metaId, sdUUID, volUUID)
  File "/usr/share/vdsm/storage/blockVolume.py", line 319, in newVolumeLease
    sanlock.init_resource(sdUUID, volUUID, [(leasePath, leaseOffset)])
SanlockException: (-202, 'Sanlock resource init failure', 'Sanlock exception')
2017-07-24 14:56:32,958+0530 ERROR (tasks/3) [storage.Image] Unexpected error (image:892)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/image.py", line 882, in copyCollapsed
    srcVolUUID=sc.BLANK_UUID)
  File "/usr/share/vdsm/storage/sd.py", line 758, in createVolume
    initialSize=initialSize)
  File "/usr/share/vdsm/storage/volume.py", line 1112, in create
    (volUUID, e))
VolumeCreationError: Error creating a new volume: (u"Volume creation 5a9a3fb8-9560-4240-aadf-1d2536c8dfe1 failed: (-202, 'Sanlock resource init failure', 'Sanlock exception')",)

As per the vdsm log, this happened four times for the customer, and two of them were while creating the OVF_STORE. Nothing is logged in the sanlock log regarding the error. I am unable to find any reason why this is happening, as the logs don't have any info. Maybe we have to add new log entries in the related area to understand the root cause.
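
For illustration, a minimal sketch of the kind of extra logging meant above. This is not the actual vdsm code; the helper name and logger are made up for the example, and it only wraps the sanlock.init_resource() call shown in the traceback:

    import logging

    import sanlock

    log = logging.getLogger("storage.Volume")

    def init_volume_lease(sd_uuid, vol_uuid, lease_path, lease_offset):
        """Initialize a volume lease, logging the full sanlock error on failure."""
        try:
            sanlock.init_resource(sd_uuid, vol_uuid, [(lease_path, lease_offset)])
        except sanlock.SanlockException as e:
            # Record where the lease lives plus the raw (errno, message,
            # strerror) tuple, so a failure like (-202, ...) can be
            # correlated with storage problems later.
            log.error("Initializing lease for volume %s at %s offset %s failed: %s",
                      vol_uuid, lease_path, lease_offset, e)
            raise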


Version-Release number of selected component (if applicable):

vdsm-4.19.24-1.el7ev.x86_64
sanlock-3.4.0-1.el7.x86_64

How reproducible:

Rarely; it happened for the customer while creating disks.

Actual results:

Sometimes disk creation fails with a sanlock error.

Expected results:

Disk creation should work.

Additional info:

Comment 2 Allon Mureinik 2017-08-03 07:48:22 UTC
I see a bunch of errors about storage connections timing out, and sanlock renewal failures (which make sense given the connection timeouts).

Nir - anything we can do from our end?

Comment 3 Nir Soffer 2017-08-03 18:56:42 UTC
(In reply to nijin ashok from comment #0)
> Description of problem:
> 
> The sanlock initialization sometimes fails without any valid error messages
> in sanlock log 

Initializing a volume lease is done using sanlock_direct_init. This runs in
the calling client process, so the sanlock daemon does not know anything about
it and cannot log failures.

> or any exception message in the vdsm log.  
> 
> 2017-07-24 14:56:32,958+0530 ERROR (tasks/3) [storage.Volume] Unexpected
> error (volume:1110)
> Traceback (most recent call last):
>   File "/usr/share/vdsm/storage/volume.py", line 1104, in create
>     cls.newVolumeLease(metaId, sdUUID, volUUID)
>   File "/usr/share/vdsm/storage/volume.py", line 1387, in newVolumeLease
>     return cls.manifestClass.newVolumeLease(metaId, sdUUID, volUUID)
>   File "/usr/share/vdsm/storage/blockVolume.py", line 319, in newVolumeLease
>     sanlock.init_resource(sdUUID, volUUID, [(leasePath, leaseOffset)])
> SanlockException: (-202, 'Sanlock resource init failure', 'Sanlock
> exception')

This looks like an exception message to me.

Maybe you are referring to the unhelpful 'Sanlock exception' message? This was
fixed in sanlock 3.5.0, see:
https://pagure.io/sanlock/c/b79bd2ac317427908ced4834cc08a1198ce327a1?branch=master

Error -202 means:
$ git grep '\-202'
src/sanlock_rv.h:#define SANLK_AIO_TIMEOUT      -202

This probably means that sanlock had a timeout accessing storage while
initializing the volume lease.
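
To illustrate, here is a minimal hedged sketch of how a caller could translate
that raw code into a readable hint on sanlock versions older than 3.5.0. The
constant is copied from src/sanlock_rv.h as quoted above; the helper itself is
hypothetical and not part of the sanlock or vdsm API:

    # Async I/O to the lease area timed out (src/sanlock_rv.h).
    SANLK_AIO_TIMEOUT = -202

    def explain_sanlock_error(e):
        """Return a human-readable hint for a SanlockException."""
        # Fall back to e.args[0], which holds the same value as in the
        # (-202, ...) tuple shown in the traceback above.
        code = getattr(e, "errno", e.args[0])
        if code == SANLK_AIO_TIMEOUT:
            return "sanlock timed out doing I/O to the lease area (storage problem)"
        return "sanlock error %d" % code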

Comment 4 Nir Soffer 2017-08-03 18:58:10 UTC
David, do you think we can do anything here, or, given the SANLK_AIO_TIMEOUTs and
renewal errors, is this a storage-side issue that must be fixed by the system
admin?

Comment 5 David Teigland 2017-08-03 19:33:50 UTC
Initializing new lease areas doesn't retry on i/o timeouts, and I'm not sure that it should.  If you can't initialize leases without timeouts, you probably need to fix the environment first.  I/O timeouts should not be a normal state on the system, and the handling of i/o timeouts is designed to allow the system (or storage) to be fixed gracefully.  I don't know what sort of advice we give admins about storage that times out.

That being said, we can of course improve the error messages to make clear what happened.  Just reporting -202 as an i/o timeout would probably help.

The recent improvements in error messages were limited because the project ended up being larger than expected.  The more significant improvements are still planned for the next release.

Comment 6 Nir Soffer 2017-08-03 19:43:21 UTC
With sanlock 3.5.0, this error will be reported as:

    SanlockException: (-202, 'Sanlock resource init failure', 'IO timeout')

I don't think there is anything to improve for this error.

Comment 13 Lukas Svaty 2017-08-28 06:21:05 UTC
Can you propose verification steps here?

Comment 19 Natalie Gavrielov 2017-12-27 13:27:40 UTC
Hi Nir,

I see that the change made in this bug is a sanlock version requirement (now I see that sanlock-3.5.0-1 is installed).

Is there anything else to verify here?
Is there any reasonable scenario where I can trigger this specific SanlockException: (-202, 'Sanlock resource init failure', 'IO timeout')?

Comment 20 Nir Soffer 2017-12-27 22:46:40 UTC
(In reply to Natalie Gavrielov from comment #19)
> Any reasonable scenario I can get this specific SanlockException: (-202,
> 'Sanlock resource init failure', 'IO timeout')?

Looks verified to me.

Comment 21 Natalie Gavrielov 2017-12-31 15:56:36 UTC
Following comment 20, moving to verified.
Used build: rhvm-4.2.0.2-0.1.el7.noarch

Comment 22 Pavol Brilla 2018-05-10 08:45:45 UTC
The patch was in fact on the vdsm side; changing the component.

Comment 23 Eyal Edri 2018-05-10 11:01:32 UTC
Adding the relevant VDSM patch (the existing one was for 4.1).

Comment 26 errata-xmlrpc 2018-05-15 17:51:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1489

Comment 27 Franta Kust 2019-05-16 13:06:03 UTC
BZ to Jira Resync

Comment 28 Avihai 2019-08-25 14:33:46 UTC
qe_test_coverage is '-' as this bug has no clear scenario for verification and it rarely occurs.
Regression tests did not catch it last time, so qe_test_coverage is not '+' either.

