Bug 1356667
| Summary: | libsanlock does not handle EINTR, causing failures in client | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Nir Soffer <nsoffer> | ||||||
| Component: | sanlock | Assignee: | David Teigland <teigland> | ||||||
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Aharon Canan <acanan> | ||||||
| Severity: | high | Docs Contact: | |||||||
| Priority: | high | ||||||||
| Version: | 7.4 | CC: | acanan, agk, amureini, bmcclain, cluster-maint, jbrassow, nsoffer, salmy, snagar, teigland | ||||||
| Target Milestone: | rc | Keywords: | ZStream | ||||||
| Target Release: | --- | ||||||||
| Hardware: | Unspecified | ||||||||
| OS: | Unspecified | ||||||||
| Whiteboard: | |||||||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |||||||
| Doc Text: | Story Points: | --- | |||||||
| Clone Of: | |||||||||
| : | 1357883 (view as bug list) | Environment: | |||||||
| Last Closed: | 2016-09-29 13:58:52 UTC | Type: | Bug | ||||||
| Regression: | --- | Mount Type: | --- | ||||||
| Documentation: | --- | CRM: | |||||||
| Verified Versions: | Category: | --- | |||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||
| Embargoed: | |||||||||
| Bug Depends On: | |||||||||
| Bug Blocks: | 1356676, 1357883, 1416336 | ||||||||
| Attachments: |
|
||||||||
Bronce - can you assist with pushing this for 7.2.z? This fix is already in 7.3, but there was no previous 7.3 bz for it. This bz was created after the fact as a formality en route to 7.2.z. David, can you explain how this can be tested? Aharon, can you ack this? Here's how I verified the libsanlock EINTR patch:
1. fallocate -l 1048576 /tmp/foo
2. losetup -f /tmp/foo
3. sanlock direct init -s foo:0:/dev/loop0:0
4. python ./addls
addls code that Nir sent me:
import os
import signal
import threading
import sanlock
def handle(signo, frame):
print "received signal: %d" % signo
def interrupt():
os.kill(os.getpid(), signal.SIGUSR1)
signal.signal(signal.SIGUSR1, handle)
path = "/dev/loop0"
open(path, "w").close()
threading.Timer(0.5, interrupt).start()
sanlock.add_lockspace("foo", 1, path)
After addls completes successfully, it should print: "received signal: 10",
and 'sanlock status' should show the line: "s foo:1:/dev/loop0:0"
Run 'sanlock rem_lockspace -s foo:1:/dev/loop0:0' to clear it.
If addls returns quickly with "Interrupted system call", then the patch is missing.
Created attachment 1179943 [details]
sanlock test script
This is a test script for testing the fix.
With sanlock version from git (commit cf37f526de8b74ae41695f1d5f68b208ddac3a89),
the output should be:
$ sudo python sanlock-eintr.py
Creating lockspace file...
Initializing lockspace...
Adding lockspace...
(received signal: 10)
Checking lockspace...
[{'flags': 0,
'host_id': 1,
'lockspace': 'test',
'offset': 0,
'path': '/var/tmp/lockspace'}]
Removing lockspace...
(received signal: 10)
OK
With sanlock version before this fix, you will get the
exception seen in the bug description.
Comment on attachment 1179943 [details]
sanlock test script
Wrong script
Created attachment 1180756 [details] sanlock test script This is the correct test script, see comment 8 for instructions. We (Rhev storage QE) executed our automation using below sanlock pkgs and didn't face anything special. [root@lynx09 ~]# rpm -qa |grep sanlock sanlock-lib-3.4.0-1.el7.x86_64 sanlock-3.4.0-1.el7.x86_64 libvirt-lock-sanlock-2.0.0-10.el7.x86_64 sanlock-python-3.4.0-1.el7.x86_64 Closing. |
Description of problem: If signal is received when libsanlock api is blocking, the call will fail with EINTR. Examples from vdsm (using libsanlock via the python bindings): jsonrpc.Executor/3::ERROR::2016-06-08 00:03:48,803::task::868::Storage.TaskManager.Task::(_setError) Task=`1bb2fee0-9a56-49e4-890b-3c1b051683f5`::Unexpected error Traceback (most recent call last): File "/usr/share/vdsm/storage/task.py", line 875, in _run return fn(*args, **kargs) File "/usr/lib/python2.7/site-packages/vdsm/logUtils.py", line 50, in wrapper res = f(*args, **kwargs) File "/usr/share/vdsm/storage/hsm.py", line 606, in getSpmStatus status = self._getSpmStatusInfo(pool) File "/usr/share/vdsm/storage/hsm.py", line 600, in _getSpmStatusInfo (pool.spmRole,) + pool.getSpmStatus())) File "/usr/share/vdsm/storage/sp.py", line 114, in getSpmStatus return self._backend.getSpmStatus() File "/usr/share/vdsm/storage/spbackends.py", line 430, in getSpmStatus lVer, spmId = self.masterDomain.inquireClusterLock() File "/usr/share/vdsm/storage/sd.py", line 688, in inquireClusterLock return self._manifest.inquireDomainLock() File "/usr/share/vdsm/storage/sd.py", line 436, in inquireDomainLock return self._domainLock.inquire() File "/usr/lib/python2.7/site-packages/vdsm/storage/clusterlock.py", line 327, in inquire resource = sanlock.read_resource(self._leasesPath, SDM_LEASE_OFFSET) SanlockException: (4, 'Sanlock resource read failure', 'Interrupted system call') jsonrpc.Executor/1::ERROR::2016-07-12 07:40:52,131::task::868::Storage.TaskManager.Task::(_setError) Task=`7cca3210-568f-4449-8890-34a3ee104952`::Unexpected error Traceback (most recent call last): File "/usr/share/vdsm/storage/task.py", line 875, in _run return fn(*args, **kargs) File "/usr/lib/python2.7/site-packages/vdsm/logUtils.py", line 50, in wrapper res = f(*args, **kwargs) File "/usr/share/vdsm/storage/hsm.py", line 1161, in attachStorageDomain pool.attachSD(sdUUID) File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 79, in wrapper return method(self, *args, **kwargs) File "/usr/share/vdsm/storage/sp.py", line 951, in attachSD dom.releaseHostId(self.id) File "/usr/share/vdsm/storage/sd.py", line 664, in releaseHostId self._manifest.releaseHostId(hostId, async) File "/usr/share/vdsm/storage/sd.py", line 403, in releaseHostId self._domainLock.releaseHostId(hostId, async, False) File "/usr/lib/python2.7/site-packages/vdsm/storage/clusterlock.py", line 266, in releaseHostId raise se.ReleaseHostIdFailure(self._sdUUID, e) ReleaseHostIdFailure: Cannot release host id: (u'3755b6ce-4e6a-405b-b529-c01e25bac03e', SanlockException(4, 'Sanlock lockspace remove failure', 'Interrupted system call')) Vdsm receive lot of signals, since it runs many short lived child processes. Each time a child process finish, Vdsm receives a SIGCHLD signal. Version-Release number of selected component (if applicable): Current sanlock version on 7.2 (I got these reports by mail, no info available on specific version) How reproducible: Rare. Steps to Reproduce: 1. Any flow that include sanlock operations (acquire, release, inquire) Actual results: Sanlock call fails, failing the operation in vdsm. Expected results: Sanlock should handle EINTR for the client. Additional info: This was fixed upstream in: commit f520991e83d0a05d1670abba4561c3de86a09c5f Author: David Teigland <teigland> Date: Wed Jun 8 12:06:22 2016 -0500 libsanlock: ignore EINTR libsanlock calls were returning EINTR if there was a signal during send() or recv(). Now just restart the syscall on EINTR. We like a backport to 7.2.z if possible.