Bug 1356667

Summary: libsanlock does not handle EINTR, causing failures in client

Product: Red Hat Enterprise Linux 7
Component: sanlock
Version: 7.4
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: high
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Keywords: ZStream
Whiteboard:
Fixed In Version:

Reporter: Nir Soffer <nsoffer>
Assignee: David Teigland <teigland>
QA Contact: Aharon Canan <acanan>
Docs Contact:
CC: acanan, agk, amureini, bmcclain, cluster-maint, jbrassow, nsoffer, salmy, snagar, teigland

Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1357883 (view as bug list)
Environment:
Last Closed: 2016-09-29 13:58:52 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1356676, 1357883, 1416336

Attachments (Description / Flags):
  sanlock test script / none
  sanlock test script / none

Description Nir Soffer 2016-07-14 15:59:14 UTC
Description of problem:

If a signal is received while a libsanlock API call is blocking, the call
fails with EINTR.

Examples from vdsm (using libsanlock via the python bindings):

jsonrpc.Executor/3::ERROR::2016-06-08 00:03:48,803::task::868::Storage.TaskManager.Task::(_setError) Task=`1bb2fee0-9a56-49e4-890b-3c1b051683f5`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 875, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/logUtils.py", line 50, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 606, in getSpmStatus
    status = self._getSpmStatusInfo(pool)
  File "/usr/share/vdsm/storage/hsm.py", line 600, in _getSpmStatusInfo
    (pool.spmRole,) + pool.getSpmStatus()))
  File "/usr/share/vdsm/storage/sp.py", line 114, in getSpmStatus
    return self._backend.getSpmStatus()
  File "/usr/share/vdsm/storage/spbackends.py", line 430, in getSpmStatus
    lVer, spmId = self.masterDomain.inquireClusterLock()
  File "/usr/share/vdsm/storage/sd.py", line 688, in inquireClusterLock
    return self._manifest.inquireDomainLock()
  File "/usr/share/vdsm/storage/sd.py", line 436, in inquireDomainLock
    return self._domainLock.inquire()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/clusterlock.py", line 327, in inquire
    resource = sanlock.read_resource(self._leasesPath, SDM_LEASE_OFFSET)
SanlockException: (4, 'Sanlock resource read failure', 'Interrupted system call')


jsonrpc.Executor/1::ERROR::2016-07-12 07:40:52,131::task::868::Storage.TaskManager.Task::(_setError) Task=`7cca3210-568f-4449-8890-34a3ee104952`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 875, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/logUtils.py", line 50, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 1161, in attachStorageDomain
    pool.attachSD(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 951, in attachSD
    dom.releaseHostId(self.id)
  File "/usr/share/vdsm/storage/sd.py", line 664, in releaseHostId
    self._manifest.releaseHostId(hostId, async)
  File "/usr/share/vdsm/storage/sd.py", line 403, in releaseHostId
    self._domainLock.releaseHostId(hostId, async, False)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/clusterlock.py", line 266, in releaseHostId
    raise se.ReleaseHostIdFailure(self._sdUUID, e)
ReleaseHostIdFailure: Cannot release host id: (u'3755b6ce-4e6a-405b-b529-c01e25bac03e', SanlockException(4, 'Sanlock lockspace remove failure', 'Interrupted system call'))

Vdsm receives a lot of signals, since it runs many short-lived child
processes. Each time a child process finishes, Vdsm receives a SIGCHLD
signal.

Version-Release number of selected component (if applicable):
Current sanlock version on 7.2
(I got these reports by mail; no information about the specific version is available)

How reproducible:
Rare.

Steps to Reproduce:
1. Any flow that includes sanlock operations (acquire, release, inquire)

Actual results:
The sanlock call fails, failing the operation in vdsm.

Expected results:
Sanlock should handle EINTR for the client.
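
Without such handling, every caller has to wrap each binding call in its own
retry loop. A minimal sketch of that kind of workaround, in the spirit of the
tracebacks above (the helper name and retry policy are hypothetical, not vdsm
code; the errno is taken from the first element of the SanlockException tuple
shown above):

import errno

import sanlock

def read_resource_retrying(path, offset):
    # Hypothetical caller-side workaround: restart the binding call when
    # the underlying syscall was interrupted by a signal (errno 4, EINTR).
    while True:
        try:
            return sanlock.read_resource(path, offset)
        except sanlock.SanlockException as e:
            if e.args[0] != errno.EINTR:
                raise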

Additional info:

This was fixed upstream in:

commit f520991e83d0a05d1670abba4561c3de86a09c5f
Author: David Teigland <teigland>
Date:   Wed Jun 8 12:06:22 2016 -0500

    libsanlock: ignore EINTR
    
    libsanlock calls were returning EINTR if there was a
    signal during send() or recv().  Now just restart the
    syscall on EINTR.
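
The pattern applied by the fix is simply restarting the interrupted call. A
minimal sketch of the same restart-on-EINTR loop in Python (illustration only;
the actual fix is C code inside libsanlock, and the function below is not part
of any real API):

import errno
import socket

def send_retrying(sock, data):
    # Restart send() whenever a signal interrupts it, mirroring what the
    # fix does for libsanlock's send()/recv() calls.
    sent = 0
    while sent < len(data):
        try:
            sent += sock.send(data[sent:])
        except socket.error as e:
            if e.errno != errno.EINTR:
                raise
            # EINTR: a signal arrived mid-send(), so retry the syscall.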

We would like a backport to 7.2.z if possible.

Comment 1 Allon Mureinik 2016-07-14 16:13:02 UTC
Bronce - can you assist with pushing this for 7.2.z?

Comment 2 David Teigland 2016-07-14 16:14:00 UTC
This fix is already in 7.3, but there was no previous 7.3 bz for it. This bz was created after the fact as a formality en route to 7.2.z.

Comment 4 Nir Soffer 2016-07-14 16:36:54 UTC
David, can you explain how this can be tested?

Comment 5 Nir Soffer 2016-07-14 16:38:31 UTC
Aharon, can you ack this?

Comment 7 David Teigland 2016-07-14 18:59:13 UTC
Here's how I verified the libsanlock EINTR patch:

1. fallocate -l 1048576 /tmp/foo

2. losetup -f /tmp/foo

3. sanlock direct init -s foo:0:/dev/loop0:0

4. python ./addls

addls code that Nir sent me:

import os
import signal
import threading

import sanlock

def handle(signo, frame):
    print "received signal: %d" % signo

def interrupt():
    os.kill(os.getpid(), signal.SIGUSR1)

# Install a handler so SIGUSR1 does not kill the process and blocking
# syscalls are interrupted with EINTR instead of being restarted.
signal.signal(signal.SIGUSR1, handle)

path = "/dev/loop0"
open(path, "w").close()

# Deliver SIGUSR1 to this process 0.5 seconds from now, while
# add_lockspace() below is still blocking.
threading.Timer(0.5, interrupt).start()

# With the fix, libsanlock restarts the interrupted call and this completes;
# without it, it fails with "Interrupted system call".
sanlock.add_lockspace("foo", 1, path)


After addls completes successfully, it should print: "received signal: 10",
and 'sanlock status' should show the line: "s foo:1:/dev/loop0:0"
Run 'sanlock rem_lockspace -s foo:1:/dev/loop0:0' to clear it.

If addls returns quickly with "Interrupted system call", then the patch is missing.

Comment 8 Nir Soffer 2016-07-14 19:45:56 UTC
Created attachment 1179943 [details]
sanlock test script

This is a script for testing the fix.

With the sanlock version from git (commit cf37f526de8b74ae41695f1d5f68b208ddac3a89),
the output should be:

$ sudo python sanlock-eintr.py 
Creating lockspace file...
Initializing lockspace...
Adding lockspace...
(received signal: 10)
Checking lockspace...
[{'flags': 0,
  'host_id': 1,
  'lockspace': 'test',
  'offset': 0,
  'path': '/var/tmp/lockspace'}]
Removing lockspace...
(received signal: 10)
OK

With a sanlock version that predates this fix, you will get the exception
seen in the bug description.

Comment 10 Nir Soffer 2016-07-17 12:50:17 UTC
Comment on attachment 1179943 [details]
sanlock test script

Wrong script

Comment 11 Nir Soffer 2016-07-17 12:52:57 UTC
Created attachment 1180756 [details]
sanlock test script

This is the correct test script, see comment 8 for instructions.

Comment 18 Aharon Canan 2016-09-29 13:58:52 UTC
We (RHEV storage QE) executed our automation using the sanlock packages below and did not encounter anything unusual.

[root@lynx09 ~]# rpm -qa |grep sanlock
sanlock-lib-3.4.0-1.el7.x86_64
sanlock-3.4.0-1.el7.x86_64
libvirt-lock-sanlock-2.0.0-10.el7.x86_64
sanlock-python-3.4.0-1.el7.x86_64


Closing.