Bug 1247135 - [RFE] Ensure quality of service for sanlock io when using file-based storage
Summary: [RFE] Ensure quality of service for sanlock io when using file-based storage
Keywords:
Status: CLOSED CANTFIX
Alias: None
Product: vdsm
Classification: oVirt
Component: RFEs
Version: ---
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Nir Soffer
QA Contact: Lukas Svaty
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-07-27 12:15 UTC by Nir Soffer
Modified: 2020-08-11 10:57 UTC
CC List: 17 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-06-23 08:34:55 UTC
oVirt Team: Storage
Embargoed:
ylavi: ovirt-future?
rule-engine: planning_ack?
rule-engine: devel_ack?
rule-engine: testing_ack?


Attachments
sanlock iotimeout in create nfs storage domain (1.36 KB, text/plain)
2020-04-22 13:06 UTC, Amit Bawer
vdsm log for iotimeout in create nfs storage domain (769.72 KB, application/x-xz)
2020-04-22 13:08 UTC, Amit Bawer
OST runs from 28 Apr 2020 with sanlock iotimeout for NFS storage domain creation (1.27 MB, application/gzip)
2020-05-03 07:21 UTC, Amit Bawer

Description Nir Soffer 2015-07-27 12:15:53 UTC
Description of problem:

Sanlock writes and reads 1 MiB to every storage domain, using a 10-second
I/O timeout. If writes/reads to storage are too slow for 40 seconds
(4 retries based on the 10-second timeout), sanlock will fence the SPM
(or, in 3.6, any host holding the domain resource), aborting the current
vdsm operations on this host.

We are fine if storage becomes slow for a while; an I/O operation taking many
minutes or hours will simply be slower. However, because of sanlock, such an
operation running on the SPM (or any host holding a resource on the domain)
will fail when I/O is too slow.

We want to ensure that sanlock I/O has priority over other I/O in
the system.

For block-based storage, we use ionice with ioclass=Idle when doing
I/O-heavy operations like copying and converting images (e.g. using dd
and qemu-img). This helps only when using the cfq scheduler, but we are
using the deadline scheduler, which ignores I/O priority. However, the
deadline scheduler should serve sanlock I/O requests on time, since
they are small. We should probably revisit the need for using ionice.
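
For illustration, a minimal sketch of the kind of invocation described above
(not vdsm's exact command line; the image paths here are made up): with
util-linux ionice, class 3 is the idle class.

    # idle I/O class; honored by cfq, ignored by the deadline scheduler
    ionice -c 3 qemu-img convert -O raw /path/to/src.img /path/to/dst.img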

For file-based storage such as NFS and Gluster, we are not doing
anything to ensure that heavy I/O operations do not delay sanlock I/O.

It seems that we can use cgroups for this, as described in:
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Resource_Management_Guide/sec-prioritizing_network_traffic.html
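
A rough, unverified sketch of the net_prio approach from that document
(assuming the cgroup v1 net_prio controller is mounted under
/sys/fs/cgroup/net_prio; the group name, interface and priority value are
examples only, and we have not checked that this actually helps NFS traffic):

    # put the heavy copy process in a group with a distinct priority for
    # eth0 traffic, then let a prio/tc qdisc deprioritize that traffic
    mkdir /sys/fs/cgroup/net_prio/vdsm-copy
    echo "eth0 1" > /sys/fs/cgroup/net_prio/vdsm-copy/net_prio.ifpriomap
    echo "$COPY_PID" > /sys/fs/cgroup/net_prio/vdsm-copy/tasks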

Version-Release number of selected component (if applicable):
3.6

Comment 1 Nir Soffer 2015-07-27 12:18:48 UTC
David, can you take a look and add missing information?

Comment 2 Nir Soffer 2015-07-27 12:21:30 UTC
Sahina, can you add more details regarding glusterfs?

Comment 3 Nir Soffer 2015-07-27 12:23:05 UTC
This may be related to bug 1243935.

Comment 4 Sahina Bose 2016-12-06 14:13:00 UTC
Vijay, is there a way to prioritize some I/O over other in gluster? Will QOS feature in gluster address this?

Comment 5 Vijay Bellur 2017-01-24 04:32:08 UTC
(In reply to Sahina Bose from comment #4)
> Vijay, is there a way to prioritize some I/O over other in gluster? Will QOS
> feature in gluster address this?

Currently we do not have any internal mechanism in Gluster to prioritize I/O based on client or application. The proposed QoS feature will provide some means, but we have yet to finalize the design.

Comment 6 Doron Fediuck 2018-05-27 13:06:56 UTC
Vijay, any updates?

Comment 7 Vijay Bellur 2018-11-05 09:26:37 UTC
(In reply to Doron Fediuck from comment #6)
> Vijay, any updates?

Unfortunately, not much has changed with respect to QoS in gluster upstream. I will provide an update after some more analysis. Thanks!

Comment 8 Michal Skrivanek 2020-03-19 15:41:23 UTC
We didn't get to this bug for more than 2 years, and it's not being considered for the upcoming 4.4. It's unlikely that it will ever be addressed so I'm suggesting to close it.
If you feel this needs to be addressed and want to work on it please remove cond nack and target accordingly.

Comment 9 Michal Skrivanek 2020-04-01 14:44:44 UTC
ok, closing. Please reopen if still relevant/you want to work on it.

Comment 10 Michal Skrivanek 2020-04-01 14:49:37 UTC
ok, closing. Please reopen if still relevant/you want to work on it.

Comment 11 Amit Bawer 2020-04-22 13:06:46 UTC
Created attachment 1680850 [details]
sanlock iotimeout in create nfs storage domain

Adding vdsm.log and corresponding sanlock client debug_log dump for sanlock iotimeout in writing lockspace for nfs storage domain creation.

Taken from Michal's OST run on 19-Apr-2020.

Comment 12 Amit Bawer 2020-04-22 13:08:06 UTC
Created attachment 1680851 [details]
vdsm log for iotimeout in create nfs storage domain

Adding vdsm.log and corresponding sanlock client debug_log dump for sanlock iotimeout in writing lockspace for nfs storage domain creation.

Taken from Michal's OST run on 19-Apr-2020.

Comment 13 Amit Bawer 2020-05-03 07:21:38 UTC
Created attachment 1684388 [details]
OST runs from 28 Apr 2020 with sanlock iotimeout for NFS storage domain creation

Comment 14 Nir Soffer 2020-05-03 08:35:49 UTC
I found a possibly related issue in sanlock - sanlock tries to use the realtime
scheduler, but it always fails and runs with the normal scheduler. So it may
be delayed by unimportant processes if the system running OST is overloaded.

I posted a fix here:
https://lists.fedorahosted.org/archives/list/sanlock-devel@lists.fedorahosted.org/message/J64ENSHB6Q3AST4H3PDJARDNYLSTLWAF/

I think we need to test this fix on OST, it may make a difference.

Comment 15 Amit Bawer 2020-05-03 08:38:47 UTC
(In reply to Amit Bawer from comment #13)
> Created attachment 1684388 [details]
> OST runs from 28 Apr 2020 with sanlock iotimeout for NFS storage domain
> creation


All 3 dev-Z bootstrap test runs show the same problem with a sanlock I/O timeout during creation of the NFS storage domain lease;
to keep this shorter I am only showing the dev-20 traces here.

What I find confusing in all 3 cases is that creation of the NFS domain with UUID X fails in vdsm.log with a sanlock I/O timeout error at time T,
but sanlock.log on the same host shows an I/O timeout for another domain lockspace Y on another NFS path at the very same time T.

The messages log shows a strange "kernel: FS-Cache: Duplicate cookie detected" message next to time T.
Looking it up, it seems to be a known NFS issue on el8: https://access.redhat.com/solutions/4644911
but no one says whether it could actually end up in a mixup when accessing NFS mounts.


For example:

dev-20 host 0

X=a3a050d6-cd2b-4cb8-a2d4-ca57525cedd0   mount-path=192.168.207.2:/exports/nfs/share2
Y=s2:r3 resource bdb91cce-a018-4661-915b-3214940524f5:SDM:/rhev/data-center/mnt/192.168.207.2:_exports_nfs_share1/bdb91cce-a018-4661-915b-3214940524f5/dom_md/leases:1048576
T=2020-04-26 16:16:54

vdsm.log 


2020-04-26 16:16:22,215+0200 INFO  (jsonrpc/1) [vdsm.api] START createStorageDomain(storageType=1, sdUUID='a3a050d6-cd2b-4cb8-a2d4-ca57525cedd0', domainName='second-nfs', typeSpecificArg='192.168.207.2:/exports/nfs/share2', domClass=1, domVersion='5', block_size=512, max_hosts=250, options=None) from=::ffff:192.168.205.2,54384, flow_id=24f4c415, task_id=760cf549-8781-4447-8f28-fe4c77cddacc (api:48)
...
2020-04-26 16:16:54,196+0200 ERROR (jsonrpc/1) [storage.TaskManager.Task] (Task='760cf549-8781-4447-8f28-fe4c77cddacc') Unexpected error (task:880)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/clusterlock.py", line 250, in initSANLock
    lockspace_name, idsPath, align=alignment, sector=block_size)
sanlock.SanlockException: (-202, 'Sanlock lockspace write failure', 'IO timeout')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 887, in _run
    return fn(*args, **kargs)
  File "<decorator-gen-121>", line 2, in createStorageDomain
  File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 50, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line 2644, in createStorageDomain
    max_hosts=max_hosts)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/nfsSD.py", line 120, in create
    fsd.initSPMlease()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sd.py", line 992, in initSPMlease
    return self._manifest.initDomainLock()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sd.py", line 620, in initDomainLock
    self._domainLock.initLock(self.getDomainLease())
  File "/usr/lib/python3.6/site-packages/vdsm/storage/clusterlock.py", line 308, in initLock
    block_size=self._block_size)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/clusterlock.py", line 260, in initSANLock
    raise se.ClusterLockInitError()
vdsm.storage.exception.ClusterLockInitError: Could not initialize cluster lock: ()
2020-04-26 16:16:54,196+0200 INFO  (jsonrpc/1) [storage.TaskManager.Task] (Task='760cf549-8781-4447-8f28-fe4c77cddacc') aborting: Task is aborted: 'value=Could not initialize cluster lock: () abortedcode=701' (task:1190)
2020-04-26 16:16:54,198+0200 ERROR (jsonrpc/1) [storage.Dispatcher] FINISH createStorageDomain error=Could not initialize cluster lock: () (dispatcher:83)


sanlock.log 


2020-04-26 16:09:52 740 [25762]: sanlock daemon started 3.8.0 host 5697e31f-3e1a-4e3d-a46d-73be5980f527.lago-basic
2020-04-26 16:12:57 925 [25790]: s1 lockspace bdb91cce-a018-4661-915b-3214940524f5:250:/rhev/data-center/mnt/192.168.207.2:_exports_nfs_share1/bdb91cce-a018-4661-915b-3214940524f5/dom_md/ids:0
2020-04-26 16:13:18 946 [25791]: s1:r2 resource bdb91cce-a018-4661-915b-3214940524f5:SDM:/rhev/data-center/mnt/192.168.207.2:_exports_nfs_share1/bdb91cce-a018-4661-915b-3214940524f5/dom_md/leases:1048576
2020-04-26 16:13:19 947 [25762]: s1 host 250 1 925 5697e31f-3e1a-4e3d-a46d-73be5980f527.lago-basic
2020-04-26 16:13:24 952 [25791]: s2 lockspace bdb91cce-a018-4661-915b-3214940524f5:1:/rhev/data-center/mnt/192.168.207.2:_exports_nfs_share1/bdb91cce-a018-4661-915b-3214940524f5/dom_md/ids:0
2020-04-26 16:13:45 973 [25790]: s2:r3 resource bdb91cce-a018-4661-915b-3214940524f5:SDM:/rhev/data-center/mnt/192.168.207.2:_exports_nfs_share1/bdb91cce-a018-4661-915b-3214940524f5/dom_md/leases:1048576
2020-04-26 16:13:45 973 [25762]: s2 host 1 1 952 5697e31f-3e1a-4e3d-a46d-73be5980f527.lago-basic
2020-04-26 16:13:45 973 [25762]: s2 host 250 1 0 5697e31f-3e1a-4e3d-a46d-73be5980f527.lago-basic
2020-04-26 16:14:05 993 [25762]: s2 host 2 1 975 ada599fd-6219-4bb3-90e5-bf0e8e58bea4.lago-basic
2020-04-26 16:16:54 1162 [25790]: worker0 aio timeout WR 0x7f71a8000b60:0x7f71a8000b70:0x7f71b5557000 ioto 10 to_count 1
2020-04-26 16:17:33 1202 [25790]: worker0 aio collect WR 0x7f71a8000b60:0x7f71a8000b70:0x7f71b5557000 result 1048576:0 other free
2020-04-26 16:18:56 1285 [25791]: s3 lockspace 90480be9-523b-4332-9b96-10954ab1b3ce:1:/dev/90480be9-523b-4332-9b96-10954ab1b3ce/ids:0
2020-04-26 16:19:17 1306 [25762]: s3 host 1 1 1285 5697e31f-3e1a-4e3d-a46d-73be5980f527.lago-basic
2020-04-26 16:19:17 1306 [25790]: s3:r4 resource 90480be9-523b-4332-9b96-10954ab1b3ce:SDM:/dev/90480be9-523b-4332-9b96-10954ab1b3ce/leases:1048576
2020-04-26 16:19:31 1319 [25791]: s4 lockspace 90480be9-523b-4332-9b96-10954ab1b3ce:1:/dev/90480be9-523b-4332-9b96-10954ab1b3ce/ids:0


messages


Apr 26 16:15:22 lago-basic-suite-master-host-0 systemd-logind[832]: Session 22 logged out. Waiting for processes to exit.
Apr 26 16:15:22 lago-basic-suite-master-host-0 systemd-logind[832]: Removed session 22.
Apr 26 16:16:20 lago-basic-suite-master-host-0 systemd[1]: Started Session c35 of user root.
Apr 26 16:16:21 lago-basic-suite-master-host-0 systemd[1]: Started Session c36 of user root.
Apr 26 16:16:22 lago-basic-suite-master-host-0 systemd[1]: Started Session c37 of user root.
Apr 26 16:16:22 lago-basic-suite-master-host-0 systemd[1]: Started Session c38 of user root.
Apr 26 16:16:54 lago-basic-suite-master-host-0 systemd[1]: Started Session c39 of user root.
Apr 26 16:16:56 lago-basic-suite-master-host-0 kernel: FS-Cache: Duplicate cookie detected
Apr 26 16:16:56 lago-basic-suite-master-host-0 kernel: FS-Cache: O-cookie c=00000000abc4ace3 [p=00000000102970b8 fl=222 nc=0 na=1]
Apr 26 16:16:56 lago-basic-suite-master-host-0 kernel: FS-Cache: O-cookie d=00000000d6816861 n=000000002acd7ec4
Apr 26 16:16:56 lago-basic-suite-master-host-0 kernel: FS-Cache: O-key=[10] '040002000801c0a8cf02'
Apr 26 16:16:56 lago-basic-suite-master-host-0 kernel: FS-Cache: N-cookie c=0000000063b0020c [p=00000000102970b8 fl=2 nc=0 na=1]
Apr 26 16:16:56 lago-basic-suite-master-host-0 kernel: FS-Cache: N-cookie d=00000000d6816861 n=00000000aeecb256
Apr 26 16:16:56 lago-basic-suite-master-host-0 kernel: FS-Cache: N-key=[10] '040002000801c0a8cf02'

Comment 16 Nir Soffer 2020-05-03 09:02:36 UTC
(In reply to Amit Bawer from comment #15)
> What I find confusing in all 3 cases is that creation of the NFS domain with
> UUID X fails in vdsm.log with a sanlock I/O timeout error at time T,
> but sanlock.log on the same host shows an I/O timeout for another domain
> lockspace Y on another NFS path at the very same time T.

Why is this confusing? It seems like an expected condition. At time T there was
some issue on the host or on the remote storage, causing I/O timeouts.

What is missing in order to understand this issue is a description of the
environment having these failures:

- What is the real hardware?
- What are the hosts (VMs) involved in this? How many levels of virtualization
  are used?
- Where is the storage server running? What else is running on that VM?
- What is the load on the host and storage server when this error occurs?
- Do we have sar output from these systems? It may be useful to understand
  what happened at the time of the timeout.
- What is the actual storage used by the NFS server?
- What are the mount options used for the existing/new storage domains?

The next step is to try to reproduce this issue outside of OST, ideally by
running "sanlock direct init" and measuring the time to create a lockspace.

I think we have 2 possible issues related to this bug:
- The host is overloaded, sanlock is delayed by other processes and misses a
  deadline; this may be improved by the patch mentioned in comment 14.
- The storage server is overloaded by other processes, probably swapping.

We can simulate these issues by running the "stress" command with various
configurations on the host or storage server, or both.

An interesting case is generating a lot of I/O to the same NFS mount. Does it
cause sanlock timeouts?

Comment 17 Amit Bawer 2020-05-03 09:28:31 UTC
Regarding sar dumps of hosts (included in the attachment):

dev 20 had sanlock iotimeout at 

2020-04-26 16:16:54 1162 [25790]: worker0 aio timeout WR 0x7f71a8000b60:0x7f71a8000b70:0x7f71b5557000 ioto 10 to_count 1

$ sar  -f ./sarstats.bin -b | more 
Linux 4.18.0-193.el8.x86_64 (lago-basic-suite-master-host-0) 	04/26/2020 	_x86_64_	(2 CPU)

02:58:15 PM       tps      rtps      wtps   bread/s   bwrtn/s
02:58:16 PM    183.17    114.85     68.32   4863.37   1856.44
04:58:17 PM    147.00    140.00      7.00  15456.00     81.00
04:58:18 PM    265.00    242.00     23.00  11416.00    481.00
04:58:19 PM    199.00    189.00     10.00  23456.00    207.00
04:58:20 PM    339.00    337.00      2.00  13360.00     61.00
04:58:21 PM    170.30    168.32      1.98   7809.90    277.23
04:58:22 PM    150.00    135.00     15.00  29712.00  17352.00
04:58:23 PM    204.00    199.00      5.00  45696.00     40.00
04:58:24 PM    221.00    221.00      0.00  45920.00      0.00
04:58:25 PM     67.00     58.00      9.00   9728.00    456.00
04:58:26 PM     76.00     76.00      0.00  11424.00      0.00
04:58:27 PM    116.00     69.00     47.00  10840.00  36840.00
04:58:28 PM    127.00    126.00      1.00  63280.00     16.00
04:58:29 PM    163.37    159.41      3.96   3247.52    253.47

dev09 had sanlock iotimeout at

2020-04-26 14:03:32 882 [25712]: worker0 aio timeout WR 0x7fb004000b60:0x7fb004000b70:0x7fb01014b000 ioto 10 to_count 1

$ sar  -f ./sarstats.bin -b | more 
Linux 4.18.0-193.el8.x86_64 (lago-basic-suite-master-host-1) 	04/26/2020 	_x86_64_	(2 CPU)

12:49:21 PM       tps      rtps      wtps   bread/s   bwrtn/s
12:49:22 PM    166.00    164.00      2.00  10720.00     32.00
02:49:23 PM    143.00    134.00      9.00  14824.00    144.00
02:49:24 PM    213.00    200.00     13.00  12920.00    380.00
02:49:25 PM    250.00    235.00     15.00   8760.00    381.00
02:49:26 PM    304.95    296.04      8.91  16625.74    170.30


dev18 had sanlock iotimeout at

2020-04-25 21:00:27 1117 [25733]: worker1 aio timeout WR 0x7f155c000b60:0x7f155c000b70:0x7f1570525000 ioto 10 to_count 1

$ sar  -f ./sarstats.bin -b | more
Linux 4.18.0-193.el8.x86_64 (lago-basic-suite-master-host-1) 	04/25/2020 	_x86_64_	(2 CPU)

07:42:24 PM       tps      rtps      wtps   bread/s   bwrtn/s
07:42:25 PM    208.91    184.16     24.75  22201.98    228.71
09:42:27 PM    304.00    301.00      3.00  15952.00     71.00
09:42:28 PM     97.00     97.00      0.00   4528.00      0.00
09:42:29 PM    187.13    187.13      0.00   6788.12      0.00
09:42:30 PM    724.00    660.00     64.00 116824.00  54400.00
09:42:31 PM     34.00     34.00      0.00  10120.00      0.00
09:42:32 PM     27.00     22.00      5.00    848.00     40.00
09:42:33 PM     82.00     82.00      0.00   3560.00      0.00


The sar dumps do not show stats for the timeout time; was the system too hung to record them, or was there some rotation at that point?

Comment 18 Amit Bawer 2020-05-03 11:47:16 UTC
(In reply to Nir Soffer from comment #16)
> An interesting case is generating a lot of I/O to the same NFS mount. Does it
> cause sanlock timeouts?
Seems like it:

nfs server setup 

# exportfs -v
/exports/data 	<world>(sync,wdelay,hide,no_subtree_check,sec=sys,rw,secure,no_root_squash,no_all_squash)

nfs client mount

10.35.18.192:/exports/data on /mnt type nfs4 (rw,relatime,vers=4.2,rsize=524288,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.35.18.54,local_lock=none,addr=10.35.18.192)

Testing sanlock direct init on /mnt/leases from client with no stressing:

# time sanlock direct init -s LS:1:leases:0
init done 0

real	0m0.109s
user	0m0.005s
sys	0m0.006s

Stressing the server with 10K 1GB writes:

# stress -d 10K
stress: info: [10648] dispatching hogs: 0 cpu, 0 io, 0 vm, 10240 hdd


Client would have iotimeout (-202) for writing the leases file:

# time sanlock direct init -s LS:1:leases:0
init done -202

real	0m16.293s
user	0m0.003s
sys	0m0.008s


Stressing only the client causes the direct init command to stall, but sanlock direct returns "init done 0":

# stress -d 10K -t 60 & 
# stress: info: [1357] dispatching hogs: 0 cpu, 0 io, 0 vm, 10240 hdd

# time sanlock direct init -s LS:1:leases:0 
init done 0

real	1m3.757s
user	0m0.004s
sys	0m0.027s

Comment 19 Nir Soffer 2020-05-03 13:10:23 UTC
(In reply to Amit Bawer from comment #17)
> Regarding sar dumps of hosts (included in the attachment):

Looks like we don't have any useful info in these logs, or we don't extract
the info correctly.

> The sar dumps do not show stats for the timeout time; was the system too hung
> to record them, or was there some rotation at that point?

Why do you think the system was hung? If it was, the issue is not in sanlock
or vdsm, and we need to understand why the storage server was not responsive.

Comment 20 Nir Soffer 2020-05-03 13:32:13 UTC
(In reply to Amit Bawer from comment #18)
> (In reply to Nir Soffer from comment #16)
> # time sanlock direct init -s LS:1:leases:0
> init done 0
> 
> real	0m0.109s
> user	0m0.005s
> sys	0m0.006s

This shows that 10 seconds is a reasonable timeout.

> Stressing the server with 10K 1GB writes:
> 
> # stress -d 10K
> stress: info: [10648] dispatching hogs: 0 cpu, 0 io, 0 vm, 10240 hdd

This creates 10k threads, each writing 1G of data. I think this is a bit
too extreme.

> Client would have iotimeout (-202) for writing the leases file:
> 
> # time sanlock direct init -s LS:1:leases:0
> init done -202
> 
> real	0m16.293s
> user	0m0.003s
> sys	0m0.008s

Good, we see that an overloaded server causes timeouts.

It would be interesting to see the minimum load that causes timeouts.

I would try with --hdd 1, 2, 4, 8, 16, 32, 64, 128, and run
"sanlock direct init" 100 times and see how the times are distributed.

> Stressing only the client causes the direct init command to stall, but
> sanlock direct returns "init done 0":
> 
> # stress -d 10K -t 60 & 
> # stress: info: [1357] dispatching hogs: 0 cpu, 0 io, 0 vm, 10240 hdd
> 
> # time sanlock direct init -s LS:1:leases:0 
> init done 0
> 
> real	1m3.757s
> user	0m0.004s
> sys	0m0.027s

Do we see this pattern in the vdsm log? How much time did we wait for sanlock
when it failed with a timeout?

This is again a bit too extreme, but all the I/O was local. stress performs
I/O to files in the current directory. I think it would be more useful
to do:

    cd /mnt
    stress --hdd N  # N from 1 to 128

So we not only stress the host but also create I/O to the same NFS server.

When creating a storage domain, we never have any I/O to the same mount,
but we may have I/O to other mounts, so maybe use 2 mounts:

    mount ... /tmp/domain1
    mount ... /tmp/domain2

    cd /tmp/domain1
    stress --hdd N

And test sanlock with another mount:

    sanlock direct init -s LS:1:/tmp/domain2/ids:0

Comment 21 Amit Bawer 2020-05-03 13:36:26 UTC
(In reply to Nir Soffer from comment #19)
> (In reply to Amit Bawer from comment #17)
> > Regarding sar dumps of hosts (included in the attachment):
> 
> Looks like we don't have any useful info in these logs, or we don't extract
> the info correctly.
> 
> > The sar dumps do not show stats for the timeout time; was the system too
> > hung to record them, or was there some rotation at that point?
> 
> Why do you think the system was hung? If it was, the issue is not in sanlock
> or vdsm, and we need to understand why the storage server was not responsive.

Because of the gap in the sar monitoring for the iotimeout period.
I don't know what the setup is; maybe the storage server is on the same machine but on another VM.
Michal could probably say better.

Comment 22 Nir Soffer 2020-05-03 15:10:50 UTC
May be caused by bug 1830738

Comment 23 Amit Bawer 2020-05-04 09:10:09 UTC
(In reply to Nir Soffer from comment #20)
> (In reply to Amit Bawer from comment #18)
> > (In reply to Nir Soffer from comment #16)
> > # time sanlock direct init -s LS:1:leases:0
> > init done 0
> > 
> > real	0m0.109s
> > user	0m0.005s
> > sys	0m0.006s
> 
> This shows that 10 seconds is a reasonable timeout.
> 
> > Stressing the server with 10K 1GB writes:
> > 
> > # stress -d 10K
> > stress: info: [10648] dispatching hogs: 0 cpu, 0 io, 0 vm, 10240 hdd
> 
> This creates 10k threads, each writing 1G of data. I think this is a bit
> too extreme.
> 
> > Client would have iotimeout (-202) for writing the leases file:
> > 
> > # time sanlock direct init -s LS:1:leases:0
> > init done -202
> > 
> > real	0m16.293s
> > user	0m0.003s
> > sys	0m0.008s
> 
> Good, we see that an overloaded server causes timeouts.
> 
> It would be interesting to see the minimum load that causes timeouts.
> 
> I would try with --hdd 1, 2, 4, 8, 16, 32, 64, 128, and run
> "sanlock direct init" 100 times and see how the times are distributed.

Tested as advised, used two mounts /mnt1 /mnt2
where path of /mnt2 is stressed and /mnt1/ids is the sanlock direct init -s path:

not stressing
running 100 trials
Done 100 trials with 0 hdd hogs
avg time:  0.1917
max time:  1.87
min time:  0.09
iotimeouts: 0

stressing nfs server with 1 1GB writers
stress: info: [28110] dispatching hogs: 0 cpu, 0 io, 0 vm, 1 hdd
running 100 trials
Done 100 trials with 1 hdd hogs
avg time:  0.6773
max time:  12.81
min time:  0.35
iotimeouts: 1

stressing nfs server with 2 1GB writers
stress: info: [28194] dispatching hogs: 0 cpu, 0 io, 0 vm, 2 hdd
running 100 trials
Done 100 trials with 2 hdd hogs
avg time:  0.769
max time:  18.96
min time:  0.22
iotimeouts: 1

stressing nfs server with 4 1GB writers
stress: info: [28292] dispatching hogs: 0 cpu, 0 io, 0 vm, 4 hdd
running 100 trials
Done 100 trials with 4 hdd hogs
avg time:  0.8635
max time:  9.17
min time:  0.40
iotimeouts: 0
stopping
stopping
stressing nfs server with 8 1GB writers
stress: info: [28420] dispatching hogs: 0 cpu, 0 io, 0 vm, 8 hdd
running 100 trials
Done 100 trials with 8 hdd hogs
avg time:  1.0828
max time:  5.21
min time:  0.42
iotimeouts: 0

stressing nfs server with 16 1GB writers
stress: info: [28638] dispatching hogs: 0 cpu, 0 io, 0 vm, 16 hdd
running 100 trials
Done 100 trials with 16 hdd hogs
avg time:  1.6873
max time:  8.46
min time:  0.43
iotimeouts: 0

stressing nfs server with 32 1GB writers
stress: info: [29059] dispatching hogs: 0 cpu, 0 io, 0 vm, 32 hdd
running 100 trials
Done 100 trials with 32 hdd hogs
avg time:  3.0077
max time:  138.84
min time:  0.48
iotimeouts: 1

stressing nfs server with 64 1GB writers
stress: info: [30702] dispatching hogs: 0 cpu, 0 io, 0 vm, 64 hdd
running 100 trials
Done 100 trials with 64 hdd hogs
avg time:  3.5373
max time:  100.75
min time:  0.46
iotimeouts: 1

stressing nfs server with 128 1GB writers
stress: info: [32528] dispatching hogs: 0 cpu, 0 io, 0 vm, 128 hdd
running 100 trials
Done 100 trials with 128 hdd hogs
avg time:  2.5558
max time:  29.80
min time:  0.46
iotimeouts: 1


When timeouts occurred, they were always on the first lockspace write attempt of the test and the following ones made it through;
for example, when using 2 hdd hogs:

init done -202
real 18.96
user 0.00
sys 0.00

init done 0
real 3.35
user 0.00
sys 0.00

init done 0
real 0.75
user 0.00
sys 0.00

init done 0
real 0.58
user 0.00
sys 0.00
...

stress --hdd X is expected to run X writers that keep writing repeatedly, so the load should
be kept on the NFS server during the entire test. I wonder why it times out only once, and on the first
attempt when it happens. Maybe there is some sort of NFS/sanlock write caching?

> 
> > Stressing only the client causes the direct init command to stall, but
> > sanlock direct returns "init done 0":
> > 
> > # stress -d 10K -t 60 & 
> > # stress: info: [1357] dispatching hogs: 0 cpu, 0 io, 0 vm, 10240 hdd
> > 
> > # time sanlock direct init -s LS:1:leases:0 
> > init done 0
> > 
> > real	1m3.757s
> > user	0m0.004s
> > sys	0m0.027s
> 
> Do we see this pattern in the vdsm log? How much time did we wait for sanlock
> when it failed with a timeout?

Not really, just wanted to see if host load itself has something to do with it.

dev20 host0 had timeout for write_lockspace at 2020-04-26 16:16:54

2020-04-26 16:16:54,195+0200 ERROR (jsonrpc/1) [storage.initSANLock] Cannot initialize SANLock for domain a3a050d6-cd2b-4cb8-a2d4-ca57525cedd0 (clusterlock:259)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/clusterlock.py", line 250, in initSANLock
    lockspace_name, idsPath, align=alignment, sector=block_size)
sanlock.SanlockException: (-202, 'Sanlock lockspace write failure', 'IO timeout')
2020-04-26 16:16:54,196+0200 INFO  (jsonrpc/1) [vdsm.api] FINISH createStorageDomain error=Could not initialize cluster lock: () from=::ffff:192.168.205.2,54384, flow_id=24f4c415, task_id=760cf549-8781-4447-8f28-fe4c77cddacc (api:52)
2020-04-26 16:16:54,196+0200 ERROR (jsonrpc/1) [storage.TaskManager.Task] (Task='760cf549-8781-4447-8f28-fe4c77cddacc') Unexpected error (task:880)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/clusterlock.py", line 250, in initSANLock
    lockspace_name, idsPath, align=alignment, sector=block_size)
sanlock.SanlockException: (-202, 'Sanlock lockspace write failure', 'IO timeout')

According to sanlock.log, thread [25790] first started at 2020-04-26 16:12:57 and reported the timeout at 2020-04-26 16:16:54, so 3:57 minutes later.

2020-04-26 16:09:52 740 [25762]: sanlock daemon started 3.8.0 host 5697e31f-3e1a-4e3d-a46d-73be5980f527.lago-basic
2020-04-26 16:12:57 925 [25790]: s1 lockspace bdb91cce-a018-4661-915b-3214940524f5:250:/rhev/data-center/mnt/192.168.207.2:_exports_nfs_share1/bdb91cce-a018-4661-915b-3214940524f5/dom_md/ids:0
2020-04-26 16:13:18 946 [25791]: s1:r2 resource bdb91cce-a018-4661-915b-3214940524f5:SDM:/rhev/data-center/mnt/192.168.207.2:_exports_nfs_share1/bdb91cce-a018-4661-915b-3214940524f5/dom_md/leases:1048576
2020-04-26 16:13:19 947 [25762]: s1 host 250 1 925 5697e31f-3e1a-4e3d-a46d-73be5980f527.lago-basic
2020-04-26 16:13:24 952 [25791]: s2 lockspace bdb91cce-a018-4661-915b-3214940524f5:1:/rhev/data-center/mnt/192.168.207.2:_exports_nfs_share1/bdb91cce-a018-4661-915b-3214940524f5/dom_md/ids:0
2020-04-26 16:13:45 973 [25790]: s2:r3 resource bdb91cce-a018-4661-915b-3214940524f5:SDM:/rhev/data-center/mnt/192.168.207.2:_exports_nfs_share1/bdb91cce-a018-4661-915b-3214940524f5/dom_md/leases:1048576
2020-04-26 16:13:45 973 [25762]: s2 host 1 1 952 5697e31f-3e1a-4e3d-a46d-73be5980f527.lago-basic
2020-04-26 16:13:45 973 [25762]: s2 host 250 1 0 5697e31f-3e1a-4e3d-a46d-73be5980f527.lago-basic
2020-04-26 16:14:05 993 [25762]: s2 host 2 1 975 ada599fd-6219-4bb3-90e5-bf0e8e58bea4.lago-basic
2020-04-26 16:16:54 1162 [25790]: worker0 aio timeout WR 0x7f71a8000b60:0x7f71a8000b70:0x7f71b5557000 ioto 10 to_count 1

Comment 24 Amit Bawer 2020-05-04 09:55:20 UTC
BTW, the "set scheduler RR|RESET_ON_FORK priority 99 failed: Operation not permitted" of 1830738 is not seen in sanlock.log 

2020-04-26 16:09:52 740 [25762]: sanlock daemon started 3.8.0 host 5697e31f-3e1a-4e3d-a46d-73be5980f527.lago-basic
2020-04-26 16:12:57 925 [25790]: s1 lockspace bdb91cce-a018-4661-915b-3214940524f5:250:/rhev/data-center/mnt/192.168.207.2:_exports_nfs_share1/bdb91cce-a018-4661-915b-3214940524f5/dom_md/ids:0

Comment 25 Amit Bawer 2020-05-04 13:14:52 UTC
(In reply to Amit Bawer from comment #23)
> (In reply to Nir Soffer from comment #20)
> > (In reply to Amit Bawer from comment #18)
> > > (In reply to Nir Soffer from comment #16)
> > > # time sanlock direct init -s LS:1:leases:0
> > > init done 0
> > > 
> > > real	0m0.109s
> > > user	0m0.005s
> > > sys	0m0.006s
> > 
> > This shows that 10 seconds is a reasonable timeout.
> > 
> > > Stressing the server with 10K 1GB writes:
> > > 
> > > # stress -d 10K
> > > stress: info: [10648] dispatching hogs: 0 cpu, 0 io, 0 vm, 10240 hdd
> > 
> > This creates 10k threads, each writing 1G of data. I think this is a bit
> > too extreme.
> > 
> > > Client would have iotimeout (-202) for writing the leases file:
> > > 
> > > # time sanlock direct init -s LS:1:leases:0
> > > init done -202
> > > 
> > > real	0m16.293s
> > > user	0m0.003s
> > > sys	0m0.008s
> > 
> > Good, we see that an overloaded server causes timeouts.
> > 
> > It would be interesting to see the minimum load that causes timeouts.
> > 
> > I would try with --hdd 1, 2, 4, 8, 16, 32, 64, 128, and run
> > "sanlock direct init" 100 times and see how the times are distributed.
> 
> Tested as advised, used two mounts /mnt1 /mnt2
> where path of /mnt2 is stressed and /mnt1/ids is the sanlock direct init -s
> path:
> 
> not stressing
> running 100 trials
> Done 100 trials with 0 hdd hogs
> avg time:  0.1917
> max time:  1.87
> min time:  0.09
> iotimeouts: 0
> 
> stressing nfs server with 1 1GB writers
> stress: info: [28110] dispatching hogs: 0 cpu, 0 io, 0 vm, 1 hdd
> running 100 trials
> Done 100 trials with 1 hdd hogs
> avg time:  0.6773
> max time:  12.81
> min time:  0.35
> iotimeouts: 1
> 
> stressing nfs server with 2 1GB writers
> stress: info: [28194] dispatching hogs: 0 cpu, 0 io, 0 vm, 2 hdd
> running 100 trials
> Done 100 trials with 2 hdd hogs
> avg time:  0.769
> max time:  18.96
> min time:  0.22
> iotimeouts: 1
> 
> stressing nfs server with 4 1GB writers
> stress: info: [28292] dispatching hogs: 0 cpu, 0 io, 0 vm, 4 hdd
> running 100 trials
> Done 100 trials with 4 hdd hogs
> avg time:  0.8635
> max time:  9.17
> min time:  0.40
> iotimeouts: 0
> stopping
> stopping
> stressing nfs server with 8 1GB writers
> stress: info: [28420] dispatching hogs: 0 cpu, 0 io, 0 vm, 8 hdd
> running 100 trials
> Done 100 trials with 8 hdd hogs
> avg time:  1.0828
> max time:  5.21
> min time:  0.42
> iotimeouts: 0
> 
> stressing nfs server with 16 1GB writers
> stress: info: [28638] dispatching hogs: 0 cpu, 0 io, 0 vm, 16 hdd
> running 100 trials
> Done 100 trials with 16 hdd hogs
> avg time:  1.6873
> max time:  8.46
> min time:  0.43
> iotimeouts: 0
> 
> stressing nfs server with 32 1GB writers
> stress: info: [29059] dispatching hogs: 0 cpu, 0 io, 0 vm, 32 hdd
> running 100 trials
> Done 100 trials with 32 hdd hogs
> avg time:  3.0077
> max time:  138.84
> min time:  0.48
> iotimeouts: 1
> 
> stressing nfs server with 64 1GB writers
> stress: info: [30702] dispatching hogs: 0 cpu, 0 io, 0 vm, 64 hdd
> running 100 trials
> Done 100 trials with 64 hdd hogs
> avg time:  3.5373
> max time:  100.75
> min time:  0.46
> iotimeouts: 1
> 
> stressing nfs server with 128 1GB writers
> stress: info: [32528] dispatching hogs: 0 cpu, 0 io, 0 vm, 128 hdd
> running 100 trials
> Done 100 trials with 128 hdd hogs
> avg time:  2.5558
> max time:  29.80
> min time:  0.46
> iotimeouts: 1
> 
> 
> When timeouts occurred, they were always on the first lockspace write
> attempt of the test and the following ones made it through;
> for example, when using 2 hdd hogs:
> 
> init done -202
> real 18.96
> user 0.00
> sys 0.00
> 
> init done 0
> real 3.35
> user 0.00
> sys 0.00
> 
> init done 0
> real 0.75
> user 0.00
> sys 0.00
> 
> init done 0
> real 0.58
> user 0.00
> sys 0.00
> ...
> 
> stress --hdd X is expected to run X writers that keep writing repeatedly, so
> the load should be kept on the NFS server during the entire test. I wonder
> why it times out only once, and on the first attempt when it happens. Maybe
> there is some sort of NFS/sanlock write caching?

Running stress --hdd 10 on the server in the path exported as /mnt2

1. Waiting 2 seconds between client writes makes most of them time-out:

# for x in `seq 10`; do  sanlock direct init -s LS:1:/mnt1/ids:0 ; sleep 2 ; done
init done -202
init done -202
init done -202
init done -202
init done -202
init done -202
init done 0
init done -202
init done -202
init done -202

2. A shorter wait of 1 second between writes lets all the other 9 make it after the first one times out:

# for x in `seq 10`; do  sanlock direct init -s LS:1:/mnt1/ids:0 ; sleep 1 ; done
init done -202
init done 0
init done 0
init done 0
init done 0
init done 0
init done 0
init done 0
init done 0
init done 0

I am far from being an NFS expert, but it seems it does cache the server replies (the timeout status is a sanlock decision),
since repeating the same writes within a short period works.

Comment 26 Michal Skrivanek 2020-05-04 18:38:47 UTC
(In reply to Amit Bawer from comment #21)
> I don't know what the setup is; maybe the storage server is on the same
> machine but on another VM.
> Michal could probably say better.

yes, it's always the same machine, different VM (engine VM). Usual OST setup...

Comment 27 Amit Bawer 2020-05-05 11:40:42 UTC
From dev20/host0 vdsm.log


1. Starting task da6ef21e-ef66-4923-8260-99f37df8ef6b for downloadImage job on SD bdb91cce-a018-4661-915b-3214940524f5

2020-04-26 16:14:36,076+0200 INFO  (jsonrpc/7) [vdsm.api] START downloadImage(methodArgs={'headers': {}, 'method': 'http', 'url': 'http://glance.ovirt.org:9292//v1/images/fe03ebf0-2cc9-4e20-bb2f-9d30e7abc0fd'}, spUUID='191a467c-ecf9-4010-a4ac-bd3ef862e893', sdUUID='bdb91cce-a018-4661-915b-3214940524f5', imgUUID='27389daf-258e-406e-86a9-a824faf637dd', volUUID='2974175c-8174-4e9c-b31e-b4f75e42b829') from=::ffff:192.168.205.2,54456, flow_id=e42a259c-9b1b-4cea-b381-a82fc28d57f7, task_id=da6ef21e-ef66-4923-8260-99f37df8ef6b (api:48)
2020-04-26 16:14:36,556+0200 INFO  (jsonrpc/7) [vdsm.api] FINISH downloadImage return=None from=::ffff:192.168.205.2,54456, flow_id=e42a259c-9b1b-4cea-b381-a82fc28d57f7, task_id=da6ef21e-ef66-4923-8260-99f37df8ef6b (api:54)
2020-04-26 16:14:37,840+0200 INFO  (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call Image.download succeeded in 1.77 seconds (__init__:312)
2020-04-26 16:14:37,840+0200 INFO  (tasks/3) [storage.ThreadPool.WorkerThread] START task da6ef21e-ef66-4923-8260-99f37df8ef6b (cmd=<bound method Task.commit of <vdsm.storage.task.Task object at 0x7f9bf005f550>>, args=None) (threadPool:149)
...


2. Starting another task e757f316-e20f-4ab0-aa19-fa52a09f857a for downloadImage job on SD bdb91cce-a018-4661-915b-3214940524f5:

2020-04-26 16:14:41,686+0200 INFO  (jsonrpc/6) [vdsm.api] START downloadImage(methodArgs={'headers': {}, 'method': 'http', 'url': 'http://glance.ovirt.org:9292//v1/images/fe03ebf0-2cc9-4e20-bb2f-9d30e7abc0fd'}, spUUID='191a467c-ecf9-4010-a4ac-bd3ef862e893', sdUUID='bdb91cce-a018-4661-915b-3214940524f5', imgUUID='5cb1beca-63a5-4864-9a2d-41bc9c9c68e8', volUUID='b180b9a2-e1fa-4fd1-ad82-e606ee34a154') from=::ffff:192.168.205.2,54456, flow_id=ce97b813-bb4d-404f-be58-2922c60037f0, task_id=e757f316-e20f-4ab0-aa19-fa52a09f857a (api:48)
2020-04-26 16:14:42,716+0200 INFO  (jsonrpc/6) [vdsm.api] FINISH downloadImage return=None from=::ffff:192.168.205.2,54456, flow_id=ce97b813-bb4d-404f-be58-2922c60037f0, task_id=e757f316-e20f-4ab0-aa19-fa52a09f857a (api:54)
2020-04-26 16:14:43,834+0200 INFO  (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call Image.download succeeded in 2.15 seconds (__init__:312)
2020-04-26 16:14:43,835+0200 INFO  (tasks/6) [storage.ThreadPool.WorkerThread] START task e757f316-e20f-4ab0-aa19-fa52a09f857a (cmd=<bound method Task.commit of <vdsm.storage.task.Task object at 0x7f9bf0146b70>>, args=None) (threadPool:149)
....


3. Attempting to create another SD a3a050d6-cd2b-4cb8-a2d4-ca57525cedd0 on NFS share2 and writing lockspace fails due to sanlock io timeout:

2020-04-26 16:16:22,215+0200 INFO  (jsonrpc/1) [vdsm.api] START createStorageDomain(storageType=1, sdUUID='a3a050d6-cd2b-4cb8-a2d4-ca57525cedd0', domainName='second-nfs', typeSpecificArg='192.168.207.2:/exports/nfs/share2', domClass=1, domVersion='5', block_size=512, max_hosts=250, options=None) from=::ffff:192.168.205.2,54384, flow_id=24f4c415, task_id=760cf549-8781-4447-8f28-fe4c77cddacc (api:48)
...
2020-04-26 16:16:54,195+0200 ERROR (jsonrpc/1) [storage.initSANLock] Cannot initialize SANLock for domain a3a050d6-cd2b-4cb8-a2d4-ca57525cedd0 (clusterlock:259)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/clusterlock.py", line 250, in initSANLock
    lockspace_name, idsPath, align=alignment, sector=block_size)
sanlock.SanlockException: (-202, 'Sanlock lockspace write failure', 'IO timeout')
2020-04-26 16:16:54,196+0200 INFO  (jsonrpc/1) [vdsm.api] FINISH createStorageDomain error=Could not initialize cluster lock: () from=::ffff:192.168.205.2,54384, flow_id=24f4c415, task_id=760cf549-8781-4447-8f28-fe4c77cddacc (api:52)
2020-04-26 16:16:54,196+0200 ERROR (jsonrpc/1) [storage.TaskManager.Task] (Task='760cf549-8781-4447-8f28-fe4c77cddacc') Unexpected error (task:880)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/clusterlock.py", line 250, in initSANLock
    lockspace_name, idsPath, align=alignment, sector=block_size)
sanlock.SanlockException: (-202, 'Sanlock lockspace write failure', 'IO timeout')
...


4. 938 MB Write to NFS share1 for download image task da6ef21e-ef66-4923-8260-99f37df8ef6b has finished:

2020-04-26 16:17:48,825+0200 DEBUG (tasks/3) [root] SUCCESS: <err> = b"223+22550 records in\n223+22550 records out\n938409984 bytes (938 MB, 895 MiB) copied, 189.594 s, 4.9 MB/s\nfork_exec('/usr/bin/curl', 'curl', '-q', '--silent', '--fail', '--show-error', 'http://glance.ovirt.org:9292//v1/images/fe03ebf0-2cc9-4e20-bb2f-9d30e7abc0fd')\nfork_exec('/bin/dd', 'dd', 'bs=2M', 'of=/rhev/data-center/mnt/192.168.207.2:_exports_nfs_share1/bdb91cce-a018-4661-915b-3214940524f5/images/27389daf-258e-406e-86a9-a824faf637dd/2974175c-8174-4e9c-b31e-b4f75e42b829')\n"; <rc> = 0 (commands:224)
...
2020-04-26 16:17:51,077+0200 INFO  (tasks/3) [storage.ThreadPool.WorkerThread] FINISH task da6ef21e-ef66-4923-8260-99f37df8ef6b (threadPool:151)


5. 938 MB Write to NFS share1 for download image task e757f316-e20f-4ab0-aa19-fa52a09f857a has finished:

2020-04-26 16:17:56,556+0200 DEBUG (tasks/6) [root] SUCCESS: <err> = b"220+18621 records in\n220+18621 records out\n938409984 bytes (938 MB, 895 MiB) copied, 191.032 s, 4.9 MB/s\nfork_exec('/usr/bin/curl', 'curl', '-q', '--silent', '--fail', '--show-error', 'http://glance.ovirt.org:9292//v1/images/fe03ebf0-2cc9-4e20-bb2f-9d30e7abc0fd')\nfork_exec('/bin/dd', 'dd', 'bs=2M', 'of=/rhev/data-center/mnt/192.168.207.2:_exports_nfs_share1/bdb91cce-a018-4661-915b-3214940524f5/images/5cb1beca-63a5-4864-9a2d-41bc9c9c68e8/b180b9a2-e1fa-4fd1-ad82-e606ee34a154')\n"; <rc> = 0 (commands:224)
...
2020-04-26 16:17:58,231+0200 INFO  (tasks/6) [storage.ThreadPool.WorkerThread] FINISH task e757f316-e20f-4ab0-aa19-fa52a09f857a (threadPool:151)


This correlates with the scenario of multiple writes to one mount path while a call to sanlock direct init -s on the other mount times out.

Comment 28 Michal Skrivanek 2020-05-05 17:20:14 UTC
hm, so rather than other traffic this could be related to the single OST run itself causing high I/O due to the parallelization? the 1GB downloads are from test_add_glance_images most likely, and that doesn't have a completeness check until test_verify_glance_import in 004_basic_sanity_pytest.py, so they can run concurrently...

If we don't have a solution to make SD creation reliable we can shift things around in OST and e.g. wait for the glance import before adding secondary SDs. That might help a bit, but of course it won't solve the real problem, and it likely won't solve the issue in CI where we run multiple concurrent OSTs on larger machines.

Comment 29 Nir Soffer 2020-05-06 19:28:02 UTC
(In reply to Amit Bawer from comment #27)

Very interesting!

Let's convert the raw data into something more meaningful:

2020-04-26 16:14:36,076  task 1 started
2020-04-26 16:14:41,686  task 2 started
2020-04-26 16:16:22,215  creating storage domain
2020-04-26 16:16:54,195  sanlock timeout in 34 seconds
2020-04-26 16:17:48,825  task 1 finished, 189.594 s, 895 MiB, 4.9 MB/s
2020-04-26 16:17:56,556  task 2 finished, 191.032 s, 895 MiB, 4.9 MB/s

So we wrote 1.75 GiB to storage at an extremely slow rate (4.9 MB/s). We
don't know if this was slow because of the Glance server, bad network,
or very slow storage.

During the download, we had a sanlock timeout writing and reading 1 MiB
to/from storage.

We download using a combination of dd and curl:

    curl -q --silent --fail --show-error src_url | dd bs=2M of=/path

Note that we don't use direct I/O, so what this does is write data to the
host page cache. The kernel flushes data from the page cache whenever it
likes, so what can happen is:

1. page cache becomes too full
2. kernel flushes data to storage
3. sanlock tries to write 1 MiB to storage using direct I/O
4. storage server is slow, so flushing is slow...
5. sanlock I/O is delayed and it fails with a timeout

This is a very old issue that we had in the past with qemu-img. This
is the reason we use direct I/O when copying images with qemu-img, and 
in imageio.
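
As a side note (not something we collected here), one quick way to see this
writeback pressure on the host while such a buffered download runs:

    # Dirty/Writeback counters grow while the page cache holds unflushed data
    watch -n 1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'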

To verify this theory, we need to bring up this setup and stop it
when the host is ready and connected to the NFS server.

Then check the rate at which we can write data to this storage:

    dd if=/dev/zero bs=2M count=512 of=/rhev/data-center/mnt/server:_path/tmp/file

Now we can reproduce by running "sanlock direct init" in a loop with
some concurrent uploads.

If my theory is correct, using:

    dd if=/dev/zero bs=2M count=512 of=/rhev/data-center/mnt/server:_path/tmp/file oflag=direct

Will not reproduce this issue, or maybe only when using a very large
number of dd jobs. We can have only 10 concurrent curl tasks since
we have 10 workers in the storage thread pool.

If this solves the issue, we can update the glance import/export code
to use direct I/O, similar to the way we use dd to upload OVF data.

Comment 30 Amit Bawer 2020-05-07 12:28:02 UTC
(In reply to Nir Soffer from comment #29)

No OST setup to stop, but theory seems to hold for NFS writes test.


Test #1: Running 10 cached IO writes:

# for x in `seq 10`; do (dd if=/dev/zero of=/mnt2/${x} bs=2M count=512 &) ; done 
# for x in `seq 10`; do pkill -USR1 -x dd ; sleep 5 ; done
...
18874368 bytes (19 MB, 18 MiB) copied, 4.25177 s, 4.4 MB/s
9+0 records in
9+0 records out
18874368 bytes (19 MB, 18 MiB) copied, 4.57601 s, 4.1 MB/s
6+0 records in
6+0 records out
12582912 bytes (13 MB, 12 MiB) copied, 4.76733 s, 2.6 MB/s
16+0 records in
16+0 records out
33554432 bytes (34 MB, 32 MiB) copied, 5.08976 s, 6.6 MB/s
7+0 records in
7+0 records out
14680064 bytes (15 MB, 14 MiB) copied, 5.13963 s, 2.9 MB/s
16+0 records in
16+0 records out
33554432 bytes (34 MB, 32 MiB) copied, 5.17378 s, 6.5 MB/s
12+0 records in
12+0 records out
25165824 bytes (25 MB, 24 MiB) copied, 5.2562 s, 4.8 MB/s
12+0 records in
12+0 records out
25165824 bytes (25 MB, 24 MiB) copied, 5.45471 s, 4.6 MB/s
16+0 records in
16+0 records out
33554432 bytes (34 MB, 32 MiB) copied, 5.50963 s, 6.1 MB/s
...

Had 80% sanlock timeout failure rate while writes were in progress:

# for x in `seq 10`; do sanlock direct init -s LS:1:/mnt1/ids:0 ; done
init done -202
init done 0
init done -202
init done 0
init done -202
init done -202
init done -202
init done -202
init done -202
init done -202


Test #2:  Running 10 direct IO writes (oflag=direct):

# for x in `seq 10`; do (dd if=/dev/zero of=/mnt2/${x} bs=2M count=512 oflag=direct &) ; done 
# for x in `seq 10`; do pkill -USR1 -x dd ; sleep 5 ; done 
...
60817408 bytes (61 MB, 58 MiB) copied, 17.2965 s, 3.5 MB/s
30+0 records in
30+0 records out
62914560 bytes (63 MB, 60 MiB) copied, 17.5872 s, 3.6 MB/s
29+0 records in
29+0 records out
60817408 bytes (61 MB, 58 MiB) copied, 17.4189 s, 3.5 MB/s
31+0 records in
31+0 records out
65011712 bytes (65 MB, 62 MiB) copied, 17.3991 s, 3.7 MB/s
31+0 records in
31+0 records out
65011712 bytes (65 MB, 62 MiB) copied, 17.4462 s, 3.7 MB/s
30+0 records in
30+0 records out
62914560 bytes (63 MB, 60 MiB) copied, 17.4899 s, 3.6 MB/s
31+0 records in
31+0 records out
65011712 bytes (65 MB, 62 MiB) copied, 17.5135 s, 3.7 MB/s
31+0 records in
31+0 records out
65011712 bytes (65 MB, 62 MiB) copied, 18.1368 s, 3.6 MB/s
30+0 records in
30+0 records out
62914560 bytes (63 MB, 60 MiB) copied, 17.515 s, 3.6 MB/s
30+0 records in
30+0 records out
62914560 bytes (63 MB, 60 MiB) copied, 17.7085 s, 3.6 MB/s
...

Had no sanlock timeouts while writes were in progress:

# for x in `seq 10`; do sanlock direct init -s LS:1:/mnt1/ids:0 ; done
init done 0
init done 0
init done 0
init done 0
init done 0
init done 0
init done 0
init done 0
init done 0
init done 0

Comment 31 Nir Soffer 2020-05-07 13:49:42 UTC
(In reply to Amit Bawer from comment #30)
> (In reply to Nir Soffer from comment #29)
Nice!

So the next step would be to modify glance code to use direct I/O.

Unfortunately, using direct I/O with the current code will not work, since
images from glance may be unaligned to the storage block size (qcow2 images
are typically unaligned). So we need to rewrite this code.

I think we used the wrong bug for this issue, since the issue is not
QoS for sanlock but a bad implementation of the vdsm curl wrapper, so
moving this to a new bug would be best.

Comment 32 David Teigland 2020-05-07 14:21:20 UTC
If the issue is pagecache writeback, then I wonder if using synchronous i/o would have the same effect as direct i/o?

Comment 33 Nir Soffer 2020-05-07 15:19:44 UTC
(In reply to David Teigland from comment #32)
> If the issue is pagecache writeback, then I wonder if using synchronous i/o
> would have the same effect as direct i/o?

In sanlock? dd?

Comment 34 David Teigland 2020-05-07 15:46:15 UTC
Using dd with oflag=sync (or O_SYNC elsewhere) might prevent the large amount of writeback that seems to delay the sanlock i/o.

Comment 35 Nir Soffer 2020-05-07 15:58:57 UTC
(In reply to David Teigland from comment #34)
> Using dd with oflag=sync (or O_SYNC elsewhere) might prevent the large
> amount of writeback that seems to delay the sanlock i/o.

Right! this can be a minimal fix. Will probably be slower because of the 
additional fsync() every 2M, but may be good enough. We can use bs=8M
to mitigate this.

Amit, do you want to try this?

Comment 36 Amit Bawer 2020-05-07 17:22:16 UTC
(In reply to Nir Soffer from comment #35)
> (In reply to David Teigland from comment #34)
> > Using dd with oflag=sync (or O_SYNC elsewhere) might prevent the large
> > amount of writeback that seems to delay the sanlock i/o.
> 
> Right! this can be a minimal fix. Will probably be slower because of the 
> additional fsync() every 2M, but may be good enough. We can use bs=8M
> to mitigate this.
> 
> Amit, do you want to try this?



# for x in `seq 10`; do (dd if=/dev/zero of=/mnt2/${x} bs=8M count=125 oflag=sync &) ; done 
# for x in `seq 10`; do pkill -USR1 -x dd ; sleep 5 ; done 
...
176160768 bytes (176 MB, 168 MiB) copied, 44.87 s, 3.9 MB/s
25+0 records in
25+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 45.0507 s, 4.7 MB/s
26+0 records in
26+0 records out
218103808 bytes (218 MB, 208 MiB) copied, 46.0356 s, 4.7 MB/s
21+0 records in
21+0 records out
176160768 bytes (176 MB, 168 MiB) copied, 46.0015 s, 3.8 MB/s
25+0 records in
25+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 46.0474 s, 4.6 MB/s
24+0 records in
24+0 records out
201326592 bytes (201 MB, 192 MiB) copied, 45.9168 s, 4.4 MB/s
24+0 records in
24+0 records out
201326592 bytes (201 MB, 192 MiB) copied, 46.0064 s, 4.4 MB/s
25+0 records in
25+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 46.0559 s, 4.6 MB/s
23+0 records in
23+0 records out
192937984 bytes (193 MB, 184 MiB) copied, 46.276 s, 4.2 MB/s
23+0 records in
23+0 records out
192937984 bytes (193 MB, 184 MiB) copied, 46.2688 s, 4.2 MB/s
...


Had no timeouts while sync writes were ongoing:

# for x in `seq 10`; do sanlock direct init -s LS:1:/mnt1/ids:0 ; done
init done 0
init done 0
init done 0
init done 0
init done 0
init done 0
init done 0
init done 0
init done 0
init done 0

Comment 37 Nir Soffer 2020-05-07 17:44:26 UTC
(In reply to Amit Bawer from comment #36)
Looks like we have a minimal and easy fix then.

Comment 38 Nir Soffer 2020-05-07 18:42:23 UTC
But we should use oflag=dsync and not oflag=sync, since sync may be slower,
as it also writes file metadata, which we don't care about until the upload
is completed.

There is also oflag=nocache, which uses POSIX_FADV_DONTNEED to advise the
kernel that the data will not be needed in the near future. This may minimize
the amount of buffered data, which may reduce unpredictable delays.
I don't know if it will help, but it is worth trying.
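
To make this concrete, a hedged sketch of the download pipeline with such a
flag (the URL and destination path are placeholders, not the actual vdsm
change):

    # dsync: O_DSYNC data writes per 8 MiB block, no metadata sync;
    # oflag=nocache could be tried as well to drop cached pages after writing
    curl -q --silent --fail --show-error "$SRC_URL" \
        | dd bs=8M of=/rhev/data-center/mnt/server:_path/image oflag=dsync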

Comment 39 Nir Soffer 2020-05-08 17:53:03 UTC
The glance issue is tracked at bug 1832967.

Comment 40 Nir Soffer 2020-06-17 13:41:39 UTC
I think we can close this. The glance issue was fixed (bug 1832967)
and we don't have any way to ensure QoS for sanlock. Maybe we need
to file separate bugs for the kernel, since this issue may affect other
components.

Comment 41 Avihai 2020-06-18 10:34:57 UTC
(In reply to Nir Soffer from comment #40)
> I think we can close this. The glance issue was fixed (bug 1832967)
> and we don't have any way to ensure QoS for sanlock. Maybe we need
> to file separate bugs for the kernel, since this issue may affect other
> components.

Can you please close this one?

Comment 42 Nir Soffer 2020-06-23 08:34:55 UTC
Closing since we don't have a way to fix this issue.

