Description of problem:

Sanlock writes and reads 1 MiB to every storage domain, using a 10-second I/O timeout. If reads/writes to storage are too slow for 40 seconds (4 retries based on the 10-second timeout), sanlock will fence the SPM (or, in 3.6, any host holding the domain resource), aborting the current vdsm operations on this host.

We are fine with storage becoming slow for a while; an I/O operation taking many minutes or hours will simply be slower. However, because of sanlock, such an operation running on the SPM (or on any host holding a resource on the domain) will fail when I/O is too slow. We want to ensure that sanlock I/O has priority over other I/O in the system.

For block-based storage, we use ionice with ioclass=Idle when doing I/O-heavy operations such as copying and converting images (e.g. using dd and qemu-img). This helps only with the cfq scheduler, but we are using the deadline scheduler, which ignores I/O priority. However, the deadline scheduler should serve sanlock I/O requests on time, since they are small. We should probably revisit the need for ionice.

For file-based storage such as NFS and Gluster, we are not doing anything to ensure that heavy I/O operations do not delay sanlock I/O. It seems that we could use cgroups for this, as described in:
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Resource_Management_Guide/sec-prioritizing_network_traffic.html

Version-Release number of selected component (if applicable):
3.6
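For reference, the ionice usage described above looks roughly like this (a sketch only; the exact vdsm command lines and device names are illustrative, not quoted from vdsm), together with a check of the active scheduler, which decides whether the idle class is honored at all:

# run an image copy in the idle I/O class (only honored by cfq/bfq, ignored by deadline)
ionice -c 3 qemu-img convert -f raw -O raw /path/to/src /path/to/dst
# show the scheduler used by a device; the one in brackets is active
cat /sys/block/sdb/queue/scheduler

Whether cgroup-based I/O throttling (e.g. blkio.throttle.write_bps_device in cgroup v1) can actually help for buffered NFS/Gluster writeback is an open question here; it primarily applies to block devices.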
David, can you take a look and add missing information?
Sahina, can you add more details regarding glusterfs?
This may be related to bug 1243935.
Vijay, is there a way to prioritize some I/O over other in gluster? Will QOS feature in gluster address this?
(In reply to Sahina Bose from comment #4)
> Vijay, is there a way to prioritize some I/O over other in gluster? Will QOS
> feature in gluster address this?

Currently we do not have any internal mechanism in Gluster to prioritize I/O based on client or application. The proposed QoS feature will provide some means, but we are yet to finalize the design.
Vijay, any updates?
(In reply to Doron Fediuck from comment #6)
> Vijay, any updates?

Unfortunately, not much has changed with respect to QoS in gluster upstream. I will provide an update after some more analysis. Thanks!
We didn't get to this bug for more than 2 years, and it's not being considered for the upcoming 4.4. It's unlikely that it will ever be addressed, so I'm suggesting to close it. If you feel this needs to be addressed and you want to work on it, please remove the cond nack and target accordingly.
ok, closing. Please reopen if still relevant/you want to work on it.
Created attachment 1680850 [details]
sanlock iotimeout in create nfs storage domain

Adding vdsm.log and the corresponding sanlock client debug_log dump for a sanlock I/O timeout while writing the lockspace during NFS storage domain creation. Taken from Michal's OST run on 19-Apr-2020.
Created attachment 1680851 [details]
vdsm log for iotimeout in create nfs storage domain

The vdsm.log corresponding to the previous attachment, for the same sanlock I/O timeout while writing the lockspace during NFS storage domain creation. Taken from Michal's OST run on 19-Apr-2020.
Created attachment 1684388 [details]
OST runs from 28 Apr 2020 with sanlock iotimeout for NFS storage domain creation
I found a possibly related issue in sanlock: sanlock tries to use the realtime scheduler, but setting it always fails, so it runs with the normal scheduler. It may therefore be delayed by unimportant processes if the system running OST is overloaded.

I posted a fix here:
https://lists.fedorahosted.org/archives/list/sanlock-devel@lists.fedorahosted.org/message/J64ENSHB6Q3AST4H3PDJARDNYLSTLWAF/

I think we need to test this fix on OST; it may make a difference.
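A quick way to check whether the running sanlock daemon actually got a realtime policy (a sketch; assumes a standard sanlock service is running):

# scheduling policy and realtime priority of the sanlock daemon
chrt -p $(pidof -s sanlock)
ps -o pid,cls,rtprio,comm -C sanlock

If the policy shows SCHED_OTHER (class TS) rather than SCHED_RR, the daemon is running without realtime priority, which is what the fix above addresses.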
(In reply to Amit Bawer from comment #13)
> Created attachment 1684388 [details]
> OST runs from 28 Apr 2020 with sanlock iotimeout for NFS storage domain
> creation

All 3 dev-Z bootstrap test runs show the same problem: a sanlock IO timeout during creation of the NFS storage domain lease. To keep this shorter I am only showing the dev-20 traces here.

What I find confusing in all 3 cases is that the creation of the NFS domain with UUID X fails in vdsm.log with a sanlock IO timeout at time T, but sanlock.log on the same host shows an IO timeout for a different domain lockspace Y on another NFS path at the very same time T.

The messages log shows a strange "kernel: FS-Cache: Duplicate cookie detected" message next to time T. Looking it up, it seems to be a known NFS issue on el8: https://access.redhat.com/solutions/4644911, but it is not clear whether it could actually end up in a mix-up when accessing NFS mounts.

For example, dev-20 host 0:

X = a3a050d6-cd2b-4cb8-a2d4-ca57525cedd0, mount-path = 192.168.207.2:/exports/nfs/share2
Y = s2:r3 resource bdb91cce-a018-4661-915b-3214940524f5:SDM:/rhev/data-center/mnt/192.168.207.2:_exports_nfs_share1/bdb91cce-a018-4661-915b-3214940524f5/dom_md/leases:1048576
T = 2020-04-26 16:16:54

vdsm.log:

2020-04-26 16:16:22,215+0200 INFO (jsonrpc/1) [vdsm.api] START createStorageDomain(storageType=1, sdUUID='a3a050d6-cd2b-4cb8-a2d4-ca57525cedd0', domainName='second-nfs', typeSpecificArg='192.168.207.2:/exports/nfs/share2', domClass=1, domVersion='5', block_size=512, max_hosts=250, options=None) from=::ffff:192.168.205.2,54384, flow_id=24f4c415, task_id=760cf549-8781-4447-8f28-fe4c77cddacc (api:48)
...
2020-04-26 16:16:54,196+0200 ERROR (jsonrpc/1) [storage.TaskManager.Task] (Task='760cf549-8781-4447-8f28-fe4c77cddacc') Unexpected error (task:880)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/clusterlock.py", line 250, in initSANLock
    lockspace_name, idsPath, align=alignment, sector=block_size)
sanlock.SanlockException: (-202, 'Sanlock lockspace write failure', 'IO timeout')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 887, in _run
    return fn(*args, **kargs)
  File "<decorator-gen-121>", line 2, in createStorageDomain
  File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 50, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line 2644, in createStorageDomain
    max_hosts=max_hosts)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/nfsSD.py", line 120, in create
    fsd.initSPMlease()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sd.py", line 992, in initSPMlease
    return self._manifest.initDomainLock()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sd.py", line 620, in initDomainLock
    self._domainLock.initLock(self.getDomainLease())
  File "/usr/lib/python3.6/site-packages/vdsm/storage/clusterlock.py", line 308, in initLock
    block_size=self._block_size)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/clusterlock.py", line 260, in initSANLock
    raise se.ClusterLockInitError()
vdsm.storage.exception.ClusterLockInitError: Could not initialize cluster lock: ()
2020-04-26 16:16:54,196+0200 INFO (jsonrpc/1) [storage.TaskManager.Task] (Task='760cf549-8781-4447-8f28-fe4c77cddacc') aborting: Task is aborted: 'value=Could not initialize cluster lock: () abortedcode=701' (task:1190)
2020-04-26 16:16:54,198+0200 ERROR (jsonrpc/1) [storage.Dispatcher] FINISH createStorageDomain error=Could not initialize cluster lock: () (dispatcher:83)

sanlock.log:

2020-04-26 16:09:52 740 [25762]: sanlock daemon started 3.8.0 host 5697e31f-3e1a-4e3d-a46d-73be5980f527.lago-basic
2020-04-26 16:12:57 925 [25790]: s1 lockspace bdb91cce-a018-4661-915b-3214940524f5:250:/rhev/data-center/mnt/192.168.207.2:_exports_nfs_share1/bdb91cce-a018-4661-915b-3214940524f5/dom_md/ids:0
2020-04-26 16:13:18 946 [25791]: s1:r2 resource bdb91cce-a018-4661-915b-3214940524f5:SDM:/rhev/data-center/mnt/192.168.207.2:_exports_nfs_share1/bdb91cce-a018-4661-915b-3214940524f5/dom_md/leases:1048576
2020-04-26 16:13:19 947 [25762]: s1 host 250 1 925 5697e31f-3e1a-4e3d-a46d-73be5980f527.lago-basic
2020-04-26 16:13:24 952 [25791]: s2 lockspace bdb91cce-a018-4661-915b-3214940524f5:1:/rhev/data-center/mnt/192.168.207.2:_exports_nfs_share1/bdb91cce-a018-4661-915b-3214940524f5/dom_md/ids:0
2020-04-26 16:13:45 973 [25790]: s2:r3 resource bdb91cce-a018-4661-915b-3214940524f5:SDM:/rhev/data-center/mnt/192.168.207.2:_exports_nfs_share1/bdb91cce-a018-4661-915b-3214940524f5/dom_md/leases:1048576
2020-04-26 16:13:45 973 [25762]: s2 host 1 1 952 5697e31f-3e1a-4e3d-a46d-73be5980f527.lago-basic
2020-04-26 16:13:45 973 [25762]: s2 host 250 1 0 5697e31f-3e1a-4e3d-a46d-73be5980f527.lago-basic
2020-04-26 16:14:05 993 [25762]: s2 host 2 1 975 ada599fd-6219-4bb3-90e5-bf0e8e58bea4.lago-basic
2020-04-26 16:16:54 1162 [25790]: worker0 aio timeout WR 0x7f71a8000b60:0x7f71a8000b70:0x7f71b5557000 ioto 10 to_count 1
2020-04-26 16:17:33 1202 [25790]: worker0 aio collect WR 0x7f71a8000b60:0x7f71a8000b70:0x7f71b5557000 result 1048576:0 other free
2020-04-26 16:18:56 1285 [25791]: s3 lockspace 90480be9-523b-4332-9b96-10954ab1b3ce:1:/dev/90480be9-523b-4332-9b96-10954ab1b3ce/ids:0
2020-04-26 16:19:17 1306 [25762]: s3 host 1 1 1285 5697e31f-3e1a-4e3d-a46d-73be5980f527.lago-basic
2020-04-26 16:19:17 1306 [25790]: s3:r4 resource 90480be9-523b-4332-9b96-10954ab1b3ce:SDM:/dev/90480be9-523b-4332-9b96-10954ab1b3ce/leases:1048576
2020-04-26 16:19:31 1319 [25791]: s4 lockspace 90480be9-523b-4332-9b96-10954ab1b3ce:1:/dev/90480be9-523b-4332-9b96-10954ab1b3ce/ids:0

messages:

Apr 26 16:15:22 lago-basic-suite-master-host-0 systemd-logind[832]: Session 22 logged out. Waiting for processes to exit.
Apr 26 16:15:22 lago-basic-suite-master-host-0 systemd-logind[832]: Removed session 22.
Apr 26 16:16:20 lago-basic-suite-master-host-0 systemd[1]: Started Session c35 of user root.
Apr 26 16:16:21 lago-basic-suite-master-host-0 systemd[1]: Started Session c36 of user root.
Apr 26 16:16:22 lago-basic-suite-master-host-0 systemd[1]: Started Session c37 of user root.
Apr 26 16:16:22 lago-basic-suite-master-host-0 systemd[1]: Started Session c38 of user root.
Apr 26 16:16:54 lago-basic-suite-master-host-0 systemd[1]: Started Session c39 of user root.
Apr 26 16:16:56 lago-basic-suite-master-host-0 kernel: FS-Cache: Duplicate cookie detected
Apr 26 16:16:56 lago-basic-suite-master-host-0 kernel: FS-Cache: O-cookie c=00000000abc4ace3 [p=00000000102970b8 fl=222 nc=0 na=1]
Apr 26 16:16:56 lago-basic-suite-master-host-0 kernel: FS-Cache: O-cookie d=00000000d6816861 n=000000002acd7ec4
Apr 26 16:16:56 lago-basic-suite-master-host-0 kernel: FS-Cache: O-key=[10] '040002000801c0a8cf02'
Apr 26 16:16:56 lago-basic-suite-master-host-0 kernel: FS-Cache: N-cookie c=0000000063b0020c [p=00000000102970b8 fl=2 nc=0 na=1]
Apr 26 16:16:56 lago-basic-suite-master-host-0 kernel: FS-Cache: N-cookie d=00000000d6816861 n=00000000aeecb256
Apr 26 16:16:56 lago-basic-suite-master-host-0 kernel: FS-Cache: N-key=[10] '040002000801c0a8cf02'
(In reply to Amit Bawer from comment #15)
> What I find confusing in all 3 cases is that the creation of the NFS domain
> with UUID X fails in vdsm.log with a sanlock IO timeout at time T, but
> sanlock.log on the same host shows an IO timeout for a different domain
> lockspace Y on another NFS path at the very same time T.

Why is this confusing? It looks like an expected condition: at time T there was some issue on the host or on the remote storage, causing I/O timeouts.

What is missing for understanding this issue is a description of the environment having these failures:
- What is the real hardware?
- What are the hosts (VMs) involved? How many levels of virtualization are used?
- Where is the storage server running? What else is running on that VM?
- What is the load on the host and on the storage server when this error occurs?
- Do we have sar output from these systems? It may be useful for understanding what happened at the time of the timeout.
- What is the actual storage used by the NFS server?
- What are the mount options used for the existing/new storage domains?

The next step is to try to reproduce this issue outside of OST, best by running "sanlock direct init" and measuring the time to create a lockspace.

I think we have 2 possible issues related to this bug:
- The host is overloaded, sanlock is delayed by other processes and misses a deadline; this may be improved by the patch mentioned in comment 14.
- The storage server is overloaded by other processes, probably swapping.

We can simulate these issues by running the "stress" command with various configurations on the host or storage server, or both. An interesting case is generating lots of I/O to the same NFS mount. Does it cause sanlock timeouts?
Regarding sar dumps of hosts (included in the attachment):

dev20 had a sanlock iotimeout at:

2020-04-26 16:16:54 1162 [25790]: worker0 aio timeout WR 0x7f71a8000b60:0x7f71a8000b70:0x7f71b5557000 ioto 10 to_count 1

$ sar -f ./sarstats.bin -b | more
Linux 4.18.0-193.el8.x86_64 (lago-basic-suite-master-host-0)  04/26/2020  _x86_64_  (2 CPU)

02:58:15 PM       tps      rtps      wtps   bread/s   bwrtn/s
02:58:16 PM    183.17    114.85     68.32   4863.37   1856.44
04:58:17 PM    147.00    140.00      7.00  15456.00     81.00
04:58:18 PM    265.00    242.00     23.00  11416.00    481.00
04:58:19 PM    199.00    189.00     10.00  23456.00    207.00
04:58:20 PM    339.00    337.00      2.00  13360.00     61.00
04:58:21 PM    170.30    168.32      1.98   7809.90    277.23
04:58:22 PM    150.00    135.00     15.00  29712.00  17352.00
04:58:23 PM    204.00    199.00      5.00  45696.00     40.00
04:58:24 PM    221.00    221.00      0.00  45920.00      0.00
04:58:25 PM     67.00     58.00      9.00   9728.00    456.00
04:58:26 PM     76.00     76.00      0.00  11424.00      0.00
04:58:27 PM    116.00     69.00     47.00  10840.00  36840.00
04:58:28 PM    127.00    126.00      1.00  63280.00     16.00
04:58:29 PM    163.37    159.41      3.96   3247.52    253.47

dev09 had a sanlock iotimeout at:

2020-04-26 14:03:32 882 [25712]: worker0 aio timeout WR 0x7fb004000b60:0x7fb004000b70:0x7fb01014b000 ioto 10 to_count 1

$ sar -f ./sarstats.bin -b | more
Linux 4.18.0-193.el8.x86_64 (lago-basic-suite-master-host-1)  04/26/2020  _x86_64_  (2 CPU)

12:49:21 PM       tps      rtps      wtps   bread/s   bwrtn/s
12:49:22 PM    166.00    164.00      2.00  10720.00     32.00
02:49:23 PM    143.00    134.00      9.00  14824.00    144.00
02:49:24 PM    213.00    200.00     13.00  12920.00    380.00
02:49:25 PM    250.00    235.00     15.00   8760.00    381.00
02:49:26 PM    304.95    296.04      8.91  16625.74    170.30

dev18 had a sanlock iotimeout at:

2020-04-25 21:00:27 1117 [25733]: worker1 aio timeout WR 0x7f155c000b60:0x7f155c000b70:0x7f1570525000 ioto 10 to_count 1

$ sar -f ./sarstats.bin -b | more
Linux 4.18.0-193.el8.x86_64 (lago-basic-suite-master-host-1)  04/25/2020  _x86_64_  (2 CPU)

07:42:24 PM       tps      rtps      wtps   bread/s   bwrtn/s
07:42:25 PM    208.91    184.16     24.75  22201.98    228.71
09:42:27 PM    304.00    301.00      3.00  15952.00     71.00
09:42:28 PM     97.00     97.00      0.00   4528.00      0.00
09:42:29 PM    187.13    187.13      0.00   6788.12      0.00
09:42:30 PM    724.00    660.00     64.00 116824.00  54400.00
09:42:31 PM     34.00     34.00      0.00  10120.00      0.00
09:42:32 PM     27.00     22.00      5.00    848.00     40.00
09:42:33 PM     82.00     82.00      0.00   3560.00      0.00

The sar dumps do not show stats for the timeout time. Was the system too hung to record them, or was there some rotation at that point?
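If the sar binary files actually cover the timeout window, the relevant period can be pulled out directly rather than paging through the whole dump (times here are illustrative, matching the dev20 timeout):

# disk I/O stats only around the time of the sanlock timeout
sar -f ./sarstats.bin -b -s 16:16:00 -e 16:18:00
# per-device stats for the same window
sar -f ./sarstats.bin -d -s 16:16:00 -e 16:18:00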
(In reply to Nir Soffer from comment #16)
> An interesting case is generating lots of I/O to the same NFS mount. Does it
> cause sanlock timeouts?

Seems like it.

NFS server setup:

# exportfs -v
/exports/data  <world>(sync,wdelay,hide,no_subtree_check,sec=sys,rw,secure,no_root_squash,no_all_squash)

NFS client mount:

10.35.18.192:/exports/data on /mnt type nfs4 (rw,relatime,vers=4.2,rsize=524288,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.35.18.54,local_lock=none,addr=10.35.18.192)

Testing "sanlock direct init" on /mnt/leases from the client with no stressing:

# time sanlock direct init -s LS:1:leases:0
init done 0

real    0m0.109s
user    0m0.005s
sys     0m0.006s

Stressing the server with 10K hdd workers, each writing 1GB:

# stress -d 10K
stress: info: [10648] dispatching hogs: 0 cpu, 0 io, 0 vm, 10240 hdd

The client now gets an I/O timeout (-202) when writing the leases file:

# time sanlock direct init -s LS:1:leases:0
init done -202

real    0m16.293s
user    0m0.003s
sys     0m0.008s

Stressing only the client causes the direct init command to stall, but sanlock direct returns "init done 0":

# stress -d 10K -t 60 &
stress: info: [1357] dispatching hogs: 0 cpu, 0 io, 0 vm, 10240 hdd

# time sanlock direct init -s LS:1:leases:0
init done 0

real    1m3.757s
user    0m0.004s
sys     0m0.027s
(In reply to Amit Bawer from comment #17)
> Regarding sar dumps of hosts (included in the attachment):

Looks like we don't have any useful info in these logs, or we don't extract the info correctly.

> The sar dumps do not show stats for the timeout time. Was the system too
> hung to record them, or was there some rotation at that point?

Why do you think the system was hung? If it was, the issue is not in sanlock or vdsm, and we need to understand why the storage server was not responsive.
(In reply to Amit Bawer from comment #18)
> (In reply to Nir Soffer from comment #16)
> # time sanlock direct init -s LS:1:leases:0
> init done 0
>
> real    0m0.109s
> user    0m0.005s
> sys     0m0.006s

This shows that 10 seconds is a reasonable timeout.

> Stressing the server with 10K hdd workers, each writing 1GB:
>
> # stress -d 10K
> stress: info: [10648] dispatching hogs: 0 cpu, 0 io, 0 vm, 10240 hdd

This creates 10k threads, each writing 1G of data. I think this is a bit too extreme.

> The client now gets an I/O timeout (-202) when writing the leases file:
>
> # time sanlock direct init -s LS:1:leases:0
> init done -202
>
> real    0m16.293s
> user    0m0.003s
> sys     0m0.008s

Good, we see that an overloaded server causes timeouts.

It is interesting to see what the minimum load causing timeouts is.

I would try with --hdd 1, 2, 4, 8, 16, 32, 64, 128, run "sanlock direct init" 100 times, and see how the times are distributed.

> Stressing only the client causes the direct init command to stall, but
> sanlock direct returns "init done 0":
>
> # stress -d 10K -t 60 &
> stress: info: [1357] dispatching hogs: 0 cpu, 0 io, 0 vm, 10240 hdd
>
> # time sanlock direct init -s LS:1:leases:0
> init done 0
>
> real    1m3.757s
> user    0m0.004s
> sys     0m0.027s

Do we see this pattern in the vdsm log? How much time did we wait for sanlock when it failed with a timeout?

This is again a bit too extreme, but here all the I/O was local: stress performs I/O to files in the current directory. I think it would be more useful to do:

    cd /mnt
    stress --hdd N    # N from 1 to 128

So we not only stress the host but also create I/O to the same NFS server.

When creating a storage domain, we never have any I/O to the same mount, but we may have I/O to other mounts, so maybe use 2 mounts:

    mount ... /tmp/domain1
    mount ... /tmp/domain2
    cd /tmp/domain1
    stress --hdd N

And test sanlock with the other mount:

    sanlock direct init -s LS:1:/tmp/domain2/ids:0
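One possible way to script the sweep suggested above (a sketch; the mount points, lockspace name, and trial count are illustrative):

#!/bin/bash
# /mnt1 and /mnt2 are two mounts from the same NFS server;
# /mnt2 is stressed, /mnt1/ids is used for the sanlock lockspace writes.
for hogs in 1 2 4 8 16 32 64 128; do
    (cd /mnt2 && stress --hdd "$hogs" &> /dev/null &)
    for i in $(seq 100); do
        /usr/bin/time -f "%e" sanlock direct init -s LS:1:/mnt1/ids:0
    done > "times-$hogs.log" 2>&1
    pkill -x stress
done

The per-run elapsed times and the "init done" return codes end up in times-N.log, from which averages, maxima, and timeout counts can be extracted.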
(In reply to Nir Soffer from comment #19)
> Why do you think the system was hung? If it was, the issue is not in sanlock
> or vdsm, and we need to understand why the storage server was not responsive.

Because of the gap in the sar monitoring covering the iotimeout period.

I don't know what the setup is; maybe the storage server is on the same machine but on another VM. Michal could probably say better.
May be caused by bug 1830738
(In reply to Nir Soffer from comment #20)
> It is interesting to see what the minimum load causing timeouts is.
>
> I would try with --hdd 1, 2, 4, 8, 16, 32, 64, 128, run "sanlock direct
> init" 100 times, and see how the times are distributed.

Tested as advised. Used two mounts, /mnt1 and /mnt2, where /mnt2 is the stressed path and /mnt1/ids is the "sanlock direct init -s" path. Each row below is 100 trials of "sanlock direct init -s LS:1:/mnt1/ids:0" while the NFS server is stressed with N 1GB hdd writers (times in seconds):

 hdd hogs   avg time   max time   min time   iotimeouts
        0     0.1917       1.87       0.09            0
        1     0.6773      12.81       0.35            1
        2     0.769       18.96       0.22            1
        4     0.8635       9.17       0.40            0
        8     1.0828       5.21       0.42            0
       16     1.6873       8.46       0.43            0
       32     3.0077     138.84       0.48            1
       64     3.5373     100.75       0.46            1
      128     2.5558      29.80       0.46            1

When timeouts occurred, they were always on the first lockspace write attempt of the test, and the following attempts made it through. For example, with 2 hdd hogs:

init done -202
real 18.96
user 0.00
sys 0.00

init done 0
real 3.35
user 0.00
sys 0.00

init done 0
real 0.75
user 0.00
sys 0.00

init done 0
real 0.58
user 0.00
sys 0.00
...

stress --hdd X is expected to run X writers continuously, so the load should be kept on the NFS server during the entire test; I wonder why it times out only once, and on the first attempt, when it happens. Maybe there is some sort of NFS/sanlock write caching?

> Do we see this pattern in the vdsm log? How much time did we wait for sanlock
> when it failed with a timeout?

Not really; I just wanted to see if the host load itself has something to do with it.

dev20 host0 had a timeout for write_lockspace at 2020-04-26 16:16:54:

2020-04-26 16:16:54,195+0200 ERROR (jsonrpc/1) [storage.initSANLock] Cannot initialize SANLock for domain a3a050d6-cd2b-4cb8-a2d4-ca57525cedd0 (clusterlock:259)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/clusterlock.py", line 250, in initSANLock
    lockspace_name, idsPath, align=alignment, sector=block_size)
sanlock.SanlockException: (-202, 'Sanlock lockspace write failure', 'IO timeout')
2020-04-26 16:16:54,196+0200 INFO (jsonrpc/1) [vdsm.api] FINISH createStorageDomain error=Could not initialize cluster lock: () from=::ffff:192.168.205.2,54384, flow_id=24f4c415, task_id=760cf549-8781-4447-8f28-fe4c77cddacc (api:52)
2020-04-26 16:16:54,196+0200 ERROR (jsonrpc/1) [storage.TaskManager.Task] (Task='760cf549-8781-4447-8f28-fe4c77cddacc') Unexpected error (task:880)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/clusterlock.py", line 250, in initSANLock
    lockspace_name, idsPath, align=alignment, sector=block_size)
sanlock.SanlockException: (-202, 'Sanlock lockspace write failure', 'IO timeout')

According to sanlock.log, thread [25790] first started on 2020-04-26 16:12:57 and reported the timeout on 2020-04-26 16:16:54, i.e. 3:57 minutes later:

2020-04-26 16:09:52 740 [25762]: sanlock daemon started 3.8.0 host 5697e31f-3e1a-4e3d-a46d-73be5980f527.lago-basic
2020-04-26 16:12:57 925 [25790]: s1 lockspace bdb91cce-a018-4661-915b-3214940524f5:250:/rhev/data-center/mnt/192.168.207.2:_exports_nfs_share1/bdb91cce-a018-4661-915b-3214940524f5/dom_md/ids:0
2020-04-26 16:13:18 946 [25791]: s1:r2 resource bdb91cce-a018-4661-915b-3214940524f5:SDM:/rhev/data-center/mnt/192.168.207.2:_exports_nfs_share1/bdb91cce-a018-4661-915b-3214940524f5/dom_md/leases:1048576
2020-04-26 16:13:19 947 [25762]: s1 host 250 1 925 5697e31f-3e1a-4e3d-a46d-73be5980f527.lago-basic
2020-04-26 16:13:24 952 [25791]: s2 lockspace bdb91cce-a018-4661-915b-3214940524f5:1:/rhev/data-center/mnt/192.168.207.2:_exports_nfs_share1/bdb91cce-a018-4661-915b-3214940524f5/dom_md/ids:0
2020-04-26 16:13:45 973 [25790]: s2:r3 resource bdb91cce-a018-4661-915b-3214940524f5:SDM:/rhev/data-center/mnt/192.168.207.2:_exports_nfs_share1/bdb91cce-a018-4661-915b-3214940524f5/dom_md/leases:1048576
2020-04-26 16:13:45 973 [25762]: s2 host 1 1 952 5697e31f-3e1a-4e3d-a46d-73be5980f527.lago-basic
2020-04-26 16:13:45 973 [25762]: s2 host 250 1 0 5697e31f-3e1a-4e3d-a46d-73be5980f527.lago-basic
2020-04-26 16:14:05 993 [25762]: s2 host 2 1 975 ada599fd-6219-4bb3-90e5-bf0e8e58bea4.lago-basic
2020-04-26 16:16:54 1162 [25790]: worker0 aio timeout WR 0x7f71a8000b60:0x7f71a8000b70:0x7f71b5557000 ioto 10 to_count 1
BTW, the "set scheduler RR|RESET_ON_FORK priority 99 failed: Operation not permitted" message of bug 1830738 is not seen in sanlock.log:

2020-04-26 16:09:52 740 [25762]: sanlock daemon started 3.8.0 host 5697e31f-3e1a-4e3d-a46d-73be5980f527.lago-basic
2020-04-26 16:12:57 925 [25790]: s1 lockspace bdb91cce-a018-4661-915b-3214940524f5:250:/rhev/data-center/mnt/192.168.207.2:_exports_nfs_share1/bdb91cce-a018-4661-915b-3214940524f5/dom_md/ids:0
(In reply to Amit Bawer from comment #23)
> stress --hdd X is expected to run X writers continuously, so the load should
> be kept on the NFS server during the entire test; I wonder why it times out
> only once, and on the first attempt, when it happens. Maybe there is some
> sort of NFS/sanlock write caching?

Running stress --hdd 10 on the server in the path of /mnt2:

1. Waiting 2 seconds between client writes makes most of them time out:

# for x in `seq 10`; do sanlock direct init -s LS:1:/mnt1/ids:0 ; sleep 2 ; done
init done -202
init done -202
init done -202
init done -202
init done -202
init done -202
init done 0
init done -202
init done -202
init done -202

2. A shorter wait of 1 second between writes lets all the other 9 make it through after the first one timed out:

# for x in `seq 10`; do sanlock direct init -s LS:1:/mnt1/ids:0 ; sleep 1 ; done
init done -202
init done 0
init done 0
init done 0
init done 0
init done 0
init done 0
init done 0
init done 0
init done 0

I am far from being an NFS expert, but it seems it does cache the server replies (the timeout status is a sanlock decision), as repeating the same writes within a short period works.
(In reply to Amit Bawer from comment #21)
> I don't know what the setup is; maybe the storage server is on the same
> machine but on another VM. Michal could probably say better.

Yes, it's always the same machine, different VM (the engine VM). Usual OST setup...
From dev20/host0 vdsm.log:

1. Task da6ef21e-ef66-4923-8260-99f37df8ef6b starts for a downloadImage job on SD bdb91cce-a018-4661-915b-3214940524f5:

2020-04-26 16:14:36,076+0200 INFO (jsonrpc/7) [vdsm.api] START downloadImage(methodArgs={'headers': {}, 'method': 'http', 'url': 'http://glance.ovirt.org:9292//v1/images/fe03ebf0-2cc9-4e20-bb2f-9d30e7abc0fd'}, spUUID='191a467c-ecf9-4010-a4ac-bd3ef862e893', sdUUID='bdb91cce-a018-4661-915b-3214940524f5', imgUUID='27389daf-258e-406e-86a9-a824faf637dd', volUUID='2974175c-8174-4e9c-b31e-b4f75e42b829') from=::ffff:192.168.205.2,54456, flow_id=e42a259c-9b1b-4cea-b381-a82fc28d57f7, task_id=da6ef21e-ef66-4923-8260-99f37df8ef6b (api:48)
2020-04-26 16:14:36,556+0200 INFO (jsonrpc/7) [vdsm.api] FINISH downloadImage return=None from=::ffff:192.168.205.2,54456, flow_id=e42a259c-9b1b-4cea-b381-a82fc28d57f7, task_id=da6ef21e-ef66-4923-8260-99f37df8ef6b (api:54)
2020-04-26 16:14:37,840+0200 INFO (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call Image.download succeeded in 1.77 seconds (__init__:312)
2020-04-26 16:14:37,840+0200 INFO (tasks/3) [storage.ThreadPool.WorkerThread] START task da6ef21e-ef66-4923-8260-99f37df8ef6b (cmd=<bound method Task.commit of <vdsm.storage.task.Task object at 0x7f9bf005f550>>, args=None) (threadPool:149)
...

2. Another task, e757f316-e20f-4ab0-aa19-fa52a09f857a, starts for a downloadImage job on SD bdb91cce-a018-4661-915b-3214940524f5:

2020-04-26 16:14:41,686+0200 INFO (jsonrpc/6) [vdsm.api] START downloadImage(methodArgs={'headers': {}, 'method': 'http', 'url': 'http://glance.ovirt.org:9292//v1/images/fe03ebf0-2cc9-4e20-bb2f-9d30e7abc0fd'}, spUUID='191a467c-ecf9-4010-a4ac-bd3ef862e893', sdUUID='bdb91cce-a018-4661-915b-3214940524f5', imgUUID='5cb1beca-63a5-4864-9a2d-41bc9c9c68e8', volUUID='b180b9a2-e1fa-4fd1-ad82-e606ee34a154') from=::ffff:192.168.205.2,54456, flow_id=ce97b813-bb4d-404f-be58-2922c60037f0, task_id=e757f316-e20f-4ab0-aa19-fa52a09f857a (api:48)
2020-04-26 16:14:42,716+0200 INFO (jsonrpc/6) [vdsm.api] FINISH downloadImage return=None from=::ffff:192.168.205.2,54456, flow_id=ce97b813-bb4d-404f-be58-2922c60037f0, task_id=e757f316-e20f-4ab0-aa19-fa52a09f857a (api:54)
2020-04-26 16:14:43,834+0200 INFO (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call Image.download succeeded in 2.15 seconds (__init__:312)
2020-04-26 16:14:43,835+0200 INFO (tasks/6) [storage.ThreadPool.WorkerThread] START task e757f316-e20f-4ab0-aa19-fa52a09f857a (cmd=<bound method Task.commit of <vdsm.storage.task.Task object at 0x7f9bf0146b70>>, args=None) (threadPool:149)
...

3. An attempt to create another SD, a3a050d6-cd2b-4cb8-a2d4-ca57525cedd0, on NFS share2; writing the lockspace fails with a sanlock I/O timeout:

2020-04-26 16:16:22,215+0200 INFO (jsonrpc/1) [vdsm.api] START createStorageDomain(storageType=1, sdUUID='a3a050d6-cd2b-4cb8-a2d4-ca57525cedd0', domainName='second-nfs', typeSpecificArg='192.168.207.2:/exports/nfs/share2', domClass=1, domVersion='5', block_size=512, max_hosts=250, options=None) from=::ffff:192.168.205.2,54384, flow_id=24f4c415, task_id=760cf549-8781-4447-8f28-fe4c77cddacc (api:48)
...
2020-04-26 16:16:54,195+0200 ERROR (jsonrpc/1) [storage.initSANLock] Cannot initialize SANLock for domain a3a050d6-cd2b-4cb8-a2d4-ca57525cedd0 (clusterlock:259)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/clusterlock.py", line 250, in initSANLock
    lockspace_name, idsPath, align=alignment, sector=block_size)
sanlock.SanlockException: (-202, 'Sanlock lockspace write failure', 'IO timeout')
2020-04-26 16:16:54,196+0200 INFO (jsonrpc/1) [vdsm.api] FINISH createStorageDomain error=Could not initialize cluster lock: () from=::ffff:192.168.205.2,54384, flow_id=24f4c415, task_id=760cf549-8781-4447-8f28-fe4c77cddacc (api:52)
2020-04-26 16:16:54,196+0200 ERROR (jsonrpc/1) [storage.TaskManager.Task] (Task='760cf549-8781-4447-8f28-fe4c77cddacc') Unexpected error (task:880)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/clusterlock.py", line 250, in initSANLock
    lockspace_name, idsPath, align=alignment, sector=block_size)
sanlock.SanlockException: (-202, 'Sanlock lockspace write failure', 'IO timeout')
...

4. The 938 MB write to NFS share1 for download image task da6ef21e-ef66-4923-8260-99f37df8ef6b finishes:

2020-04-26 16:17:48,825+0200 DEBUG (tasks/3) [root] SUCCESS: <err> = b"223+22550 records in\n223+22550 records out\n938409984 bytes (938 MB, 895 MiB) copied, 189.594 s, 4.9 MB/s\nfork_exec('/usr/bin/curl', 'curl', '-q', '--silent', '--fail', '--show-error', 'http://glance.ovirt.org:9292//v1/images/fe03ebf0-2cc9-4e20-bb2f-9d30e7abc0fd')\nfork_exec('/bin/dd', 'dd', 'bs=2M', 'of=/rhev/data-center/mnt/192.168.207.2:_exports_nfs_share1/bdb91cce-a018-4661-915b-3214940524f5/images/27389daf-258e-406e-86a9-a824faf637dd/2974175c-8174-4e9c-b31e-b4f75e42b829')\n"; <rc> = 0 (commands:224)
...
2020-04-26 16:17:51,077+0200 INFO (tasks/3) [storage.ThreadPool.WorkerThread] FINISH task da6ef21e-ef66-4923-8260-99f37df8ef6b (threadPool:151)

5. The 938 MB write to NFS share1 for download image task e757f316-e20f-4ab0-aa19-fa52a09f857a finishes:

2020-04-26 16:17:56,556+0200 DEBUG (tasks/6) [root] SUCCESS: <err> = b"220+18621 records in\n220+18621 records out\n938409984 bytes (938 MB, 895 MiB) copied, 191.032 s, 4.9 MB/s\nfork_exec('/usr/bin/curl', 'curl', '-q', '--silent', '--fail', '--show-error', 'http://glance.ovirt.org:9292//v1/images/fe03ebf0-2cc9-4e20-bb2f-9d30e7abc0fd')\nfork_exec('/bin/dd', 'dd', 'bs=2M', 'of=/rhev/data-center/mnt/192.168.207.2:_exports_nfs_share1/bdb91cce-a018-4661-915b-3214940524f5/images/5cb1beca-63a5-4864-9a2d-41bc9c9c68e8/b180b9a2-e1fa-4fd1-ad82-e606ee34a154')\n"; <rc> = 0 (commands:224)
...
2020-04-26 16:17:58,231+0200 INFO (tasks/6) [storage.ThreadPool.WorkerThread] FINISH task e757f316-e20f-4ab0-aa19-fa52a09f857a (threadPool:151)

This correlates with the earlier scenario of multiple writes to one mount path while a "sanlock direct init -s" call against the other mount times out.
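To see whether such buffered downloads actually build up large amounts of dirty pages on the host (which would then be flushed in bursts), something like this can be run on the host while a download is in progress (a simple check, not taken from the OST run):

# watch dirty and writeback page counters during the glance download
watch -n 1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'

A steadily growing Dirty value followed by a large Writeback burst would point at page-cache writeback competing with sanlock's direct I/O.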
Hm, so rather than other traffic, this could be related to the single OST run itself causing high I/O due to the parallelization? The 1GB downloads are most likely from test_add_glance_images, and that doesn't have a completeness check until test_verify_glance_import in 004_basic_sanity_pytest.py, so they can run concurrently...

If we don't have a solution to make SD creation reliable, we can shift things around in OST and e.g. wait for the glance import before adding secondary SDs. That might help a bit, but of course it won't solve the real problem, and it likely won't solve the issue in CI, where we run multiple concurrent OSTs on larger machines.
(In reply to Amit Bawer from comment #27)

Very interesting! Let's convert the raw data to something more meaningful:

2020-04-26 16:14:36,076  task 1 started
2020-04-26 16:14:41,686  task 2 started
2020-04-26 16:16:22,215  creating storage domain
2020-04-26 16:16:54,195  sanlock timeout in 34 seconds
2020-04-26 16:17:48,825  task 1 finished, 189.594 s, 895 MiB, 4.9 MB/s
2020-04-26 16:17:56,556  task 2 finished, 191.032 s, 895 MiB, 4.9 MB/s

So we wrote 1.75 GiB to storage at an extremely slow rate (4.9 MB/s). We don't know if this was slow because of the Glance server, bad network, or very slow storage. During the download, we had a sanlock timeout writing and reading 1 MiB to storage.

We download using a combination of dd and curl:

    curl -q --silent --fail --show-error src_url | dd bs=2M of=/path

Note that we don't use direct I/O, so what this does is write data to the host page cache. The kernel flushes data from the page cache when it likes, so what can happen is:

1. The page cache becomes too full.
2. The kernel flushes data to storage.
3. sanlock tries to write 1 MiB to storage using direct I/O.
4. The storage server is slow, so flushing is slow...
5. sanlock I/O is delayed and fails with a timeout.

This is a very old issue that we had in the past with qemu-img. It is the reason we use direct I/O when copying images with qemu-img, and in imageio.

To verify this theory, we need to bring up this setup and stop it when the host is ready and connected to the NFS server. Then check the rate at which we can write data to this storage:

    dd if=/dev/zero bs=2M count=512 of=/rhev/data-center/mnt/server:_path/tmp/file

Now we can reproduce by running "sanlock direct init" in a loop with some concurrent uploads.

If my theory is correct, using:

    dd if=/dev/zero bs=2M count=512 of=/rhev/data-center/mnt/server:_path/tmp/file oflag=direct

will not reproduce this issue, or maybe only when using a very large number of dd jobs.

We can have only 10 concurrent curl tasks, since we have 10 workers in the storage thread pool.

If this solves the issue, we can update the glance import/export code to use direct I/O, similar to the way we use dd to upload OVF data.
(In reply to Nir Soffer from comment #29)

No OST setup to stop, but the theory seems to hold for the NFS write test.

Test #1: Running 10 cached (buffered) I/O writes:

# for x in `seq 10`; do (dd if=/dev/zero of=/mnt2/${x} bs=2M count=512 &) ; done
# for x in `seq 10`; do pkill -USR1 -x dd ; sleep 5 ; done
...
18874368 bytes (19 MB, 18 MiB) copied, 4.25177 s, 4.4 MB/s
9+0 records in
9+0 records out
18874368 bytes (19 MB, 18 MiB) copied, 4.57601 s, 4.1 MB/s
6+0 records in
6+0 records out
12582912 bytes (13 MB, 12 MiB) copied, 4.76733 s, 2.6 MB/s
16+0 records in
16+0 records out
33554432 bytes (34 MB, 32 MiB) copied, 5.08976 s, 6.6 MB/s
7+0 records in
7+0 records out
14680064 bytes (15 MB, 14 MiB) copied, 5.13963 s, 2.9 MB/s
16+0 records in
16+0 records out
33554432 bytes (34 MB, 32 MiB) copied, 5.17378 s, 6.5 MB/s
12+0 records in
12+0 records out
25165824 bytes (25 MB, 24 MiB) copied, 5.2562 s, 4.8 MB/s
12+0 records in
12+0 records out
25165824 bytes (25 MB, 24 MiB) copied, 5.45471 s, 4.6 MB/s
16+0 records in
16+0 records out
33554432 bytes (34 MB, 32 MiB) copied, 5.50963 s, 6.1 MB/s
...

Had an 80% sanlock timeout failure rate while the writes were in progress:

# for x in `seq 10`; do sanlock direct init -s LS:1:/mnt1/ids:0 ; done
init done -202
init done 0
init done -202
init done 0
init done -202
init done -202
init done -202
init done -202
init done -202
init done -202

Test #2: Running 10 direct I/O writes (oflag=direct):

# for x in `seq 10`; do (dd if=/dev/zero of=/mnt2/${x} bs=2M count=512 oflag=direct &) ; done
# for x in `seq 10`; do pkill -USR1 -x dd ; sleep 5 ; done
...
60817408 bytes (61 MB, 58 MiB) copied, 17.2965 s, 3.5 MB/s
30+0 records in
30+0 records out
62914560 bytes (63 MB, 60 MiB) copied, 17.5872 s, 3.6 MB/s
29+0 records in
29+0 records out
60817408 bytes (61 MB, 58 MiB) copied, 17.4189 s, 3.5 MB/s
31+0 records in
31+0 records out
65011712 bytes (65 MB, 62 MiB) copied, 17.3991 s, 3.7 MB/s
31+0 records in
31+0 records out
65011712 bytes (65 MB, 62 MiB) copied, 17.4462 s, 3.7 MB/s
30+0 records in
30+0 records out
62914560 bytes (63 MB, 60 MiB) copied, 17.4899 s, 3.6 MB/s
31+0 records in
31+0 records out
65011712 bytes (65 MB, 62 MiB) copied, 17.5135 s, 3.7 MB/s
31+0 records in
31+0 records out
65011712 bytes (65 MB, 62 MiB) copied, 18.1368 s, 3.6 MB/s
30+0 records in
30+0 records out
62914560 bytes (63 MB, 60 MiB) copied, 17.515 s, 3.6 MB/s
30+0 records in
30+0 records out
62914560 bytes (63 MB, 60 MiB) copied, 17.7085 s, 3.6 MB/s
...

Had no sanlock timeouts while the writes were in progress:

# for x in `seq 10`; do sanlock direct init -s LS:1:/mnt1/ids:0 ; done
init done 0
init done 0
init done 0
init done 0
init done 0
init done 0
init done 0
init done 0
init done 0
init done 0
(In reply to Amit Bawer from comment #30)

Nice! So the next step would be to modify the glance code to use direct I/O.

Unfortunately, using direct I/O with the current code will not work, since images from glance may be unaligned to the storage block size (qcow2 images are typically unaligned), so we need to rewrite this code.

I think we used the wrong bug for this issue, since the issue is not QoS for sanlock but the bad implementation of the vdsm curl wrapper, so moving this to a new bug would be best.
If the issue is pagecache writeback, then I wonder if using synchronous i/o would have the same effect as direct i/o?
(In reply to David Teigland from comment #32)
> If the issue is pagecache writeback, then I wonder if using synchronous i/o
> would have the same effect as direct i/o?

In sanlock? dd?
Using dd with oflag=sync (or O_SYNC elsewhere) might prevent the large amount of writeback that seems to delay the sanlock i/o.
(In reply to David Teigland from comment #34)
> Using dd with oflag=sync (or O_SYNC elsewhere) might prevent the large
> amount of writeback that seems to delay the sanlock i/o.

Right! This can be a minimal fix. It will probably be slower because of the additional fsync() every 2M, but may be good enough. We can use bs=8M to mitigate this.

Amit, do you want to try this?
(In reply to Nir Soffer from comment #35)
> (In reply to David Teigland from comment #34)
> > Using dd with oflag=sync (or O_SYNC elsewhere) might prevent the large
> > amount of writeback that seems to delay the sanlock i/o.
>
> Right! This can be a minimal fix. It will probably be slower because of the
> additional fsync() every 2M, but may be good enough. We can use bs=8M to
> mitigate this.
>
> Amit, do you want to try this?

# for x in `seq 10`; do (dd if=/dev/zero of=/mnt2/${x} bs=8M count=125 oflag=sync &) ; done
# for x in `seq 10`; do pkill -USR1 -x dd ; sleep 5 ; done
...
176160768 bytes (176 MB, 168 MiB) copied, 44.87 s, 3.9 MB/s
25+0 records in
25+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 45.0507 s, 4.7 MB/s
26+0 records in
26+0 records out
218103808 bytes (218 MB, 208 MiB) copied, 46.0356 s, 4.7 MB/s
21+0 records in
21+0 records out
176160768 bytes (176 MB, 168 MiB) copied, 46.0015 s, 3.8 MB/s
25+0 records in
25+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 46.0474 s, 4.6 MB/s
24+0 records in
24+0 records out
201326592 bytes (201 MB, 192 MiB) copied, 45.9168 s, 4.4 MB/s
24+0 records in
24+0 records out
201326592 bytes (201 MB, 192 MiB) copied, 46.0064 s, 4.4 MB/s
25+0 records in
25+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 46.0559 s, 4.6 MB/s
23+0 records in
23+0 records out
192937984 bytes (193 MB, 184 MiB) copied, 46.276 s, 4.2 MB/s
23+0 records in
23+0 records out
192937984 bytes (193 MB, 184 MiB) copied, 46.2688 s, 4.2 MB/s
...

Had no timeouts while the sync writes were ongoing:

# for x in `seq 10`; do sanlock direct init -s LS:1:/mnt1/ids:0 ; done
init done 0
init done 0
init done 0
init done 0
init done 0
init done 0
init done 0
init done 0
init done 0
init done 0
(In reply to Amit Bawer from comment #36) Looks like we have a minimal and easy fix then.
But we should use oflag=dsync and not oflag=sync, since sync may be slower: it also writes file metadata, which we don't care about until the upload has completed.

There is also oflag=nocache, which uses POSIX_FADV_DONTNEED to advise the kernel that the data will not be needed in the near future. This may minimize the amount of buffered data, which may minimize unpredictable delays. I don't know if it will help, but it is worth trying.
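For illustration, the download pipeline with these flags would look roughly like this (a sketch; the URL and target path are placeholders, and the exact flags adopted by the fix are tracked in the glance bug below):

# synchronous data writes, flushing every 8M block instead of accumulating dirty pages
curl -q --silent --fail --show-error "$SRC_URL" | dd bs=8M of=/rhev/data-center/mnt/server:_path/.../volume oflag=dsync

# alternatively, advise the kernel to drop written data from the page cache as we go
curl -q --silent --fail --show-error "$SRC_URL" | dd bs=8M of=/rhev/data-center/mnt/server:_path/.../volume oflag=nocache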
The glance issue is tracked at bug 1832967.
I think we can close this. The glance issue was fixed (bug 1832967), and we don't have any way to ensure QoS for sanlock. Maybe we need to file separate bugs for the kernel, since this issue may affect other components.
(In reply to Nir Soffer from comment #40)
> I think we can close this. The glance issue was fixed (bug 1832967), and we
> don't have any way to ensure QoS for sanlock. Maybe we need to file separate
> bugs for the kernel, since this issue may affect other components.

Can you please close this one?
Closing since we don't have a way to fix this issue.