Bug 1243935 - [BLOCKED] VDSM terminated by sanlock when external process writes to an overloaded storage
Summary: [BLOCKED] VDSM terminated by sanlock when external process writes to an overl...
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.6.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ovirt-3.6.3
Target Release: 3.6.0
Assignee: Allon Mureinik
QA Contact: Aharon Canan
URL:
Whiteboard: storage
Depends On: 1245181
Blocks: 1236075
 
Reported: 2015-07-16 15:40 UTC by Arik
Modified: 2016-03-10 06:21 UTC
CC: 16 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-09-07 21:13:40 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments
vdsm log (3.21 MB, text/plain)
2015-07-20 06:44 UTC, Arik
sanlock log (120.14 KB, text/plain)
2015-07-20 22:04 UTC, Nir Soffer
output of journalctl -b (665.64 KB, text/x-vhdl)
2015-07-20 22:05 UTC, Nir Soffer
sched_setcheduler test (334 bytes, text/x-csrc)
2015-07-20 23:07 UTC, Nir Soffer


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 43817 0 master MERGED v2v: Try to be nicer to other processes 2021-01-02 16:03:44 UTC

Description Arik 2015-07-16 15:40:01 UTC
Description of problem:
VDSM crashes while virt-v2v copies the disks.
It used to work a few weeks ago, but it seems that recent changes in VDSM broke it - so it needs to be adjusted.

Version-Release number of selected component (if applicable):


How reproducible:
100%


Steps to Reproduce:
1. Start importing a VM from VMware

Actual results:
While the disk is being copied, VDSM crashes

Expected results:
The conversion should succeed

Additional info:
The last log before the crash is:
vdsm::68::vds::(sigtermHandler) Received signal 15
A few seconds before that we see the following exception:
Thread-47::ERROR::2015-07-16 18:19:38,377::monitor::366::Storage.Monitor::(_releaseHostId) Error releasing host id 2 for domain f4f54f47-9ccf-4978-a9a7-12a6d89bf94e
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/monitor.py", line 363, in _releaseHostId
    self.domain.releaseHostId(self.hostId, unused=True)
  File "/usr/share/vdsm/storage/sd.py", line 535, in releaseHostId
    self._clusterLock.releaseHostId(hostId, async, unused)
  File "/usr/share/vdsm/storage/clusterlock.py", line 252, in releaseHostId
    raise se.ReleaseHostIdFailure(self._sdUUID, e)
ReleaseHostIdFailure: Cannot release host id: (u'f4f54f47-9ccf-4978-a9a7-12a6d89bf94e', SanlockException(16, 'Sanlock lockspace remove failure', 'Device or resource busy'))

I tried to lock the image using resource-manager (exclusively) but it didn't help.

Moreover, the conversion works fine when invoking virt-v2v from the command line.

Comment 1 Richard W.M. Jones 2015-07-16 16:23:34 UTC
Just copying here what I said in the email:

> Assuming storage = the data domain, then virt-v2v itself doesn't write
> to it, it runs 'qemu-img convert' which writes to it.  I would need to
> take a much closer look to work out what block size qemu-img is using,
> but virt-v2v sets the source block size of the copy to 64 MB.

By which I mean, virt-v2v invokes qemu-img convert with a source disk
which is https, and it sets the https readahead to 64 MB.  This just
means that we don't have to keep going back to the origin vCenter
server, but probably doesn't influence the qemu-img convert block size.

> As far as I know, neither virt-v2v nor qemu-img interact in any way
> with sanlock.

Comment 2 Allon Mureinik 2015-07-19 07:14:45 UTC
Nir, can you take a look please?

Comment 3 Nir Soffer 2015-07-19 18:14:38 UTC
(In reply to Arik from comment #0)
> The last log before the crash is:
> vdsm::68::vds::(sigtermHandler) Received signal 15

This happens when someone stops vdsm. According to Liron, who looked
at the logs, sanlock lost access to the storage, so it fenced vdsm,
which was the spm.

> Few seconds before that we see the following exception:
> ReleaseHostIdFailure: Cannot release host id:
> (u'f4f54f47-9ccf-4978-a9a7-12a6d89bf94e', SanlockException(16, 'Sanlock
> lockspace remove failure', 'Device or resource busy'))

This is expected when stopping the spm host. Not pretty, but this is the
current behavior. This is not a crash, just how an unclean shutdown looks.

Please attach these logs:
/var/log/vdsm/vdsm.log
/var/log/messages
/var/log/sanlock.log

Comment 4 Nir Soffer 2015-07-19 18:40:39 UTC
Arik, can you reproduce this issue when using the attached patch?

Comment 5 Arik 2015-07-20 06:41:43 UTC
Nir, VDSM crashes again but with a different error:
Thread-22::ERROR::2015-07-20 09:31:26,316::monitor::250::Storage.Monitor::(_monitorDomain) Error monitoring domain f4f54f47-9ccf-4978-a9a7-12a6d89bf94e
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/monitor.py", line 246, in _monitorDomain
    self._performDomainSelftest()
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 759, in wrapper
    value = meth(self, *a, **kw)
  File "/usr/share/vdsm/storage/monitor.py", line 313, in _performDomainSelftest
    self.domain.selftest()
  File "/usr/share/vdsm/storage/sdc.py", line 49, in __getattr__
    return getattr(self.getRealDomain(), attrName)
  File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 123, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 142, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/nfsSD.py", line 122, in findDomain
    return NfsStorageDomain(NfsStorageDomain.findDomainPath(sdUUID))
  File "/usr/share/vdsm/storage/fileSD.py", line 194, in __init__
    validateFileSystemFeatures(manifest.sdUUID, manifest.mountpoint)
  File "/usr/share/vdsm/storage/fileSD.py", line 89, in validateFileSystemFeatures
    oop.getProcessPool(sdUUID).directTouch(testFilePath)
  File "/usr/share/vdsm/storage/outOfProcess.py", line 350, in directTouch
    ioproc.touch(path, flags, mode)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 507, in touch
    self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 385, in _sendCommand
    raise Timeout(os.strerror(errno.ETIMEDOUT))
Timeout: Connection timed out

and a few seconds afterwards it gets:
MainThread::DEBUG::2015-07-20 09:31:28,029::vdsm::68::vds::(sigtermHandler) Received signal 15

I figured out that it is related to the amount of free space in the storage. When I freed more space, the import suddenly worked. I reproduced it again by creating a large disk on the storage.
Note that the free space should be sufficient to store the disk of the imported VM (9 GB are available).

Comment 6 Arik 2015-07-20 06:44:41 UTC
Created attachment 1053729 [details]
vdsm log

vdsm log when trying to verify Nir's patch.
Note that at 2015-07-20 09:25:20,318 I got the SanlockException when trying to import the VM without the patch,
and at 2015-07-20 09:31:26,316 I got a timeout which led to VDSM termination when doing the same import with the patch.

Comment 7 Nir Soffer 2015-07-20 06:51:16 UTC
(In reply to Arik from comment #6)
> Created attachment 1053729 [details]
> vdsm log

Please provide the other logs mentioned in comment 3.

Comment 8 Nir Soffer 2015-07-20 06:59:37 UTC
(In reply to Arik from comment #5)
> Nir, VDSM crashes again but with different error:

Vdsm did not crash; it was terminated (received signal 15).

> Thread-22::ERROR::2015-07-20
> 09:31:26,316::monitor::250::Storage.Monitor::(_monitorDomain) Error
> monitoring domain f4f54f47-9ccf-4978-a9a7-12a6d89bf94e
> Traceback (most recent call last):
> Timeout: Connection timed out

ioprocess timed out trying to touch a file - the connection to storage is lost,
or too slow.

> and few seconds afterwards it gets:
> MainThread::DEBUG::2015-07-20 09:31:28,029::vdsm::68::vds::(sigtermHandler)
> Received signal 15

Another hint that the connection to storage is too slow.

> I figured out that it is related to the amount of free space in the storage.
> when I freed more space, the import suddenly worked. I reproduced it again
> by creating a large disk on the storage.
> Note that the free space should be sufficient to store the disk on the
> imported VM (9g are available).

Can you reproduce this with another NFS server? For example, try running your own
server on some machine.

Can you reproduce this with an iSCSI storage domain?
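
For context, the directTouch check in the traceback above boils down to creating a test file with O_DIRECT on the domain mount. A minimal sketch of that idea follows (illustrative only, not vdsm's actual ioprocess code; the default path below is a hypothetical example):

/* Build: cc -o dtouch dtouch.c */
#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    /* Hypothetical test path; vdsm uses its own per-domain test file. */
    const char *path = argc > 1 ? argv[1]
                                : "/rhev/data-center/mnt/example/__DIRECT_IO_TEST__";

    /* On a healthy mount this returns almost immediately; on an overloaded
       NFS server it can block long enough to hit the ioprocess timeout. */
    int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) {
        perror("direct touch failed");
        return 1;
    }
    close(fd);
    printf("direct touch succeeded: %s\n", path);
    return 0;
}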

Comment 9 Michal Skrivanek 2015-07-20 09:11:41 UTC
Nir,
we played with it a bit more, and it seems it really is happening only when NFS servers start to be short on space.
We also tried directly invoking "qemu-img convert" to see whether it has anything to do with v2v, and indeed it reproduced the same issue.
There is a difference between v2v and the regular import flow during these tests, though - the regular import copies from NFS to NFS, whereas v2v copies from the host to NFS. It seemed to be going much faster in the host->NFS case, so I suppose the stress on the NFS side is simply higher and it responds slowly (a simple directory listing of the NFS server from the host took several seconds during qemu-img convert).
Since it does work OK when the server has more space, I think the behavior on other NFS servers or iSCSI is indeed different.

I don't know what we can do about it other than the proposed patch... which doesn't seem to address it completely, but should not hurt anyway.

Since we are seeing a lot of issues in storage monitoring in loaded deployments, I would perhaps only suggest reviewing the values and possibly increasing the defaults to more than 30s for SD monitoring and also for sanlock.

Comment 10 Nir Soffer 2015-07-20 11:25:54 UTC
(In reply to Michal Skrivanek from comment #9)
> Nir,
> we played with it a bit more, and it seems it really is happening only when
> NFS servers start to be short on space.

Can I have access to the system reproducing this issue, and/or the vmware
setup where you import vm from?

> Since we are seeing a lot of issues in storage monitoring in loaded
> deployments - I would perhaps only suggest to review the values and possibly
> increase defaults to more than 30s for SD monitoring and also for sanlock.

We cannot increase the timeouts for sanlock - if writing and reading from
the lease area (done every 10 seconds) fails for 40 seconds, the storage
is considered lost, and sanlock terminates the spm.

The storage must work within these limits, otherwise we cannot use it.

Comment 11 Nir Soffer 2015-07-20 11:26:58 UTC
David, do you think we can change the sanlock configuration to support an overloaded storage server?

Comment 13 David Teigland 2015-07-20 14:29:20 UTC
The sanlock io timeout can be specified when adding a lockspace, see sanlock_add_lockspace_timeout().  Doing that affects all of the other timeouts.
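
For reference, a minimal sketch of adding a lockspace with a non-default io timeout through sanlock_add_lockspace_timeout(), assuming the API declared in sanlock_admin.h; the lockspace name, host id, and lease path below are illustrative, not vdsm's actual configuration:

/* Build: cc -o add_ls add_ls.c -lsanlock */
#include <stdio.h>
#include <string.h>
#include <sanlock.h>
#include <sanlock_admin.h>

int main(void)
{
    struct sanlk_lockspace ls;
    memset(&ls, 0, sizeof(ls));

    /* Illustrative values; vdsm builds these from the domain metadata. */
    strncpy(ls.name, "f4f54f47-9ccf-4978-a9a7-12a6d89bf94e", SANLK_NAME_LEN - 1);
    ls.host_id = 2;                       /* host id seen in the log above */
    strncpy(ls.host_id_disk.path, "/path/to/domain/dom_md/ids", SANLK_PATH_LEN - 1);
    ls.host_id_disk.offset = 0;

    /* 20 instead of the default 10 seconds; as noted, every derived
       timeout (join time, host dead time) stretches accordingly. */
    int rv = sanlock_add_lockspace_timeout(&ls, 0, 20);
    if (rv < 0) {
        fprintf(stderr, "add_lockspace_timeout failed: %d\n", rv);
        return 1;
    }
    return 0;
}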

Comment 14 Nir Soffer 2015-07-20 19:15:37 UTC
(In reply to David Teigland from comment #13)
> The sanlock io timeout can be specified when adding a lockspace, see
> sanlock_add_lockspace_timeout().  Doing that affects all of the other
> timeouts.

Sure, today the time to join a cluster is up to 20 seconds, with timeout 
of 10 seconds. If we change the timeout to 20 seconds, this delay will be
up to 40 seconds - right?

I think the key is not having larger timeouts, but preventing storage
operations (such as importing a vm) from overloading the machine, leading
to sanlock timeouts.

Does sanlock use the highest priority for its io operations? Can we ensure
that sanlock io gets higher priority compared with any other io?

Comment 15 David Teigland 2015-07-20 20:08:09 UTC
> Sure, today the time to join a cluster is up to 20 seconds, with timeout 
> of 10 seconds. If we change the timeout to 20 seconds, this delay will be
> up to 40 seconds - right?

right
 
> I think the key is not having larger timeouts, but preventing storage
> operations (such as importing a vm) from overloading the machine, leading
> to sanlock timeouts.

yes

> Does sanlock use the highest priority for its io operations? can we ensure
> that sanlock io get higher priority compared with any other io?

sanlock does not use io priorities (ioprio_set).  It's been a long time so I don't remember the reasons... it looks like a few of the possible issues would be:

- It only works with the cfq scheduler. Is that used for RHEV storage?
- Would the deadline scheduler be a better solution in any case?
- I think it would only apply to block devices, not nfs.
- I mistakenly thought it would not apply to any aio, but it looks like it would apply to aio reads, assuming all of the previous points made it relevant.
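
For context on ioprio_set(2): glibc provides no wrapper, so it is called via syscall(2) with locally defined constants. A minimal sketch of moving a process into the realtime io class (illustrative only; per the points above it only matters for CFQ on block devices, not nfs):

/* Build: cc -o ioprio ioprio.c */
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* No glibc wrapper for ioprio_set(2), so define the constants locally. */
#define IOPRIO_CLASS_SHIFT      13
#define IOPRIO_PRIO_VALUE(c, d) (((c) << IOPRIO_CLASS_SHIFT) | (d))
#define IOPRIO_WHO_PROCESS      1
#define IOPRIO_CLASS_RT         1

int main(void)
{
    /* Realtime io class, highest level (0). Requires CAP_SYS_ADMIN. */
    int prio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_RT, 0);

    if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0 /* self */, prio) < 0) {
        perror("ioprio_set");
        return 1;
    }
    printf("io priority set to the realtime class\n");
    return 0;
}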

Comment 16 Nir Soffer 2015-07-20 20:53:40 UTC
(In reply to David Teigland from comment #15)
> > Does sanlock use the highest priority for its io operations? can we ensure
> > that sanlock io get higher priority compared with any other io?
> 
> sanlock does not use io priorities (ioprio_set).  It's been a long time so I
> don't remember the reasons... it looks like a few of the possible issues
> would be:
> 
> - It only works with the cfq scheduler. Is that used for RHEV storage?

We use the deadline scheduler for block devices using a udev rule.

> - Would the deadline scheduler be a better solution in any case?

This was performance team advice (according to Federico).

> - I think it would only apply to block devices, not nfs.

This bug is about nfs, in particular an nfs server that becomes very slow when
almost full (see comment 9).

Can we improve the situation for nfs?

Comment 17 David Teigland 2015-07-20 21:04:03 UTC
I don't have enough nfs knowledge to help with that.

Comment 18 Nir Soffer 2015-07-20 22:04:23 UTC
Created attachment 1054043 [details]
sanlock log

Comment 19 Nir Soffer 2015-07-20 22:05:16 UTC
Created attachment 1054046 [details]
output of journalctl -b

Comment 20 Nir Soffer 2015-07-20 22:16:48 UTC
David, can you check sanlock log (attachment 1054043 [details])?

This line is suspicious (selinux?):
2015-07-13 12:03:03+0300 1622 [13612]: set scheduler RR|RESET_ON_FORK priority 99 failed: Operation not permitted

This log is from fedora 22. I don't see this on rhel 7.1 or fedora 21.

Is it possible that sanlock, not using the RR scheduler, is missing deadlines when the machine is overloaded, leading to bogus io timeouts?

Comment 21 Nir Soffer 2015-07-20 23:07:03 UTC
Created attachment 1054080 [details]
sched_setcheduler test

This succeeds when running on rhel 7.1 and fedora 21, but fails on fedora 22.
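
The attachment itself is not inlined here; a sketch of an equivalent test, mirroring the call that fails in the sanlock log above (SCHED_RR with RESET_ON_FORK at priority 99):

/* Build: cc -o schedtest schedtest.c; run as root. */
#define _GNU_SOURCE             /* for SCHED_RESET_ON_FORK */
#include <sched.h>
#include <stdio.h>

int main(void)
{
    struct sched_param param = { .sched_priority = 99 };

    if (sched_setscheduler(0, SCHED_RR | SCHED_RESET_ON_FORK, &param) < 0) {
        perror("set scheduler RR|RESET_ON_FORK priority 99 failed");
        return 1;
    }
    printf("running with SCHED_RR priority 99\n");
    return 0;
}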

Comment 22 Nir Soffer 2015-07-21 19:22:37 UTC
Arik, this issue seems to be caused by sanlock failing to run using the SCHED_RR scheduler.

After rebooting your host (bamba.tlv), I cannot reproduce the error in sanlock, and it is now running using the RR scheduler:

# ps axf -o pid,stat,cmd,class,rtprio
  747 SLsl sanlock daemon -U sanlock - RR      99
  748 S     \_ sanlock daemon -U sanlo TS       -

Can you reproduce the same issue now?

Comment 23 Arik 2015-07-26 15:09:14 UTC
(In reply to Nir Soffer from comment #22)
I'm sure a reboot does not fix it - the host has been rebooted several times and was even reinstalled with a different OS, but the problem still happened (on both rhel and fedora).

It does not reproduce now, but there is a lot of free space on multipass.eng.lab now, and it never reproduced when the storage was not short of free space.

As far as I'm concerned, this bug is not urgent anymore since it doesn't block us and the problem is not related to the import using virt-v2v. IMO, a fix for this bug should be verified by checking that the problem does not happen when the amount of free space in the storage is low and an external process, such as virt-v2v (but possibly others), writes to it.

Comment 24 Nir Soffer 2015-07-26 15:26:49 UTC
The root cause seems to be a failure of sched_setscheduler(2), causing sanlock
to run without real-time scheduling. This can cause sanlock to miss deadlines
when the machine is overloaded with io, which can cause io timeouts.
The sched_setscheduler(2) issue disappeared after a reboot, and we cannot
reproduce this issue now.

This may also be related to low space on the nfs server, which causes the server
to be very slow, but we don't have enough information to tell.

Closing for now, until we have more data.

Comment 25 Arik 2015-07-27 06:23:42 UTC
(In reply to Nir Soffer from comment #24)
I don't think that closing this bug without trying to reproduce it properly is the right thing to do. It will be a serious issue once it happens to users and customers, so we should try to prevent it.

Nir, what additional data do you need in order to reproduce this issue?

Comment 26 Nir Soffer 2015-07-27 10:14:43 UTC
(In reply to Arik from comment #25)
> (In reply to Nir Soffer from comment #24)
> I don't think that closing this bug without trying to reproduce it properly
> is the right thing to do. It will be a serious issue once it happens to
> users and customers, so we should try to prevent it.
> 
> Nir, what additional data do you need in order to reproduce this issue?

This bug cannot be reproduced currently, since the sched_setscheduler() call
does not fail any more. Until it fails again, we have nothing to do here.

Closing it since I'm not going to invest more time in this. Feel free
to take this bug if you want.

Comment 27 Nir Soffer 2015-08-11 19:10:26 UTC
This is not a sanlock issue; removing the "sanlock" category.

Also lowering severity, as termination by sanlock when
losing access to storage is expected behavior.

Comment 28 Yaniv Lavi 2015-09-07 14:46:39 UTC
Should this be on MODIFIED?

Comment 29 Yaniv Lavi 2015-09-07 14:47:07 UTC
Should this bug be closed?

Comment 30 Nir Soffer 2015-09-07 14:51:03 UTC
(In reply to Yaniv Dary from comment #29)
> Should this bug be closed?

It should.

