Bug 1562369

Summary: Hosts show sanlock renewal errors in /var/log/messages
Product: [oVirt] vdsm
Reporter: oliver.albl
Component: General
Assignee: Idan Shaby <ishaby>
Status: CLOSED CURRENTRELEASE
QA Contact: Elad <ebenahar>
Severity: high
Priority: unspecified
Version: 4.20.18
CC: aperotti, bugs, christian.grundmann, ebenahar, ishaby, nsoffer, oliver.albl, tnisan, ylavi
Target Milestone: ovirt-4.2.6
Flags: oliver.albl: needinfo-
       rule-engine: ovirt-4.2+
       rule-engine: ovirt-4.3+
       ylavi: exception+
Target Release: ---
Hardware: x86_64
OS: Linux
oVirt Team: Storage
Type: Bug
Last Closed: 2018-09-03 15:09:21 UTC
Attachments:
- Logfiles ovirt-engine and fabavmhost030
- /var/log/messages
- Logfiles from SPM and Engine

Description oliver.albl 2018-03-30 12:11:09 UTC
Description of problem:

We run an oVirt 4.2.1 installation with 30 hosts. Hosts show sanlock renewal errors in /var/log/messages. These entries occur about 5 times a day, at different times on different hosts, and there is always more than one host involved.

At the same time we sometimes see qla2xxx "Abort command issued" entries. Along with those abort messages we have encountered loss of paths or of access to LUNs.

The SAN switches and storage systems show no evidence of any problem (not even of path or LUN loss).

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
Stability problems, path and LUN loss

Expected results:
Stable system

Additional info:
CentOS 7.4
oVirt 4.2.1
30 Hosts
2 Storage arrays
all FC

/var/log/messages:
Mar 28 01:32:12 fabavmhost030 sanlock[36519]: 2018-03-28 01:32:12 295696 [135126]: s131 delta_renew read timeout 10 sec offset 0 /dev/d6d3da1f-0f96-42ba-bd70-5393a6ea95f4/ids
Mar 28 01:32:12 fabavmhost030 sanlock[36519]: 2018-03-28 01:32:12 295696 [135126]: s131 renewal error -202 delta_length 10 last_success 295665
Mar 28 01:32:12 fabavmhost030 sanlock[36519]: 2018-03-28 01:32:12 295696 [135126]: s131 delta_renew read timeout 10 sec offset 0 /dev/d6d3da1f-0f96-42ba-bd70-5393a6ea95f4/ids
Mar 28 01:32:12 fabavmhost030 sanlock[36519]: 2018-03-28 01:32:12 295696 [135126]: s131 renewal error -202 delta_length 10 last_success 295665
Mar 28 01:32:32 fabavmhost030 sanlock[36519]: 2018-03-28 01:32:32 295716 [135126]: s131 delta_renew read timeout 10 sec offset 0 /dev/d6d3da1f-0f96-42ba-bd70-5393a6ea95f4/ids
Mar 28 01:32:32 fabavmhost030 sanlock[36519]: 2018-03-28 01:32:32 295716 [135126]: s131 delta_renew read timeout 10 sec offset 0 /dev/d6d3da1f-0f96-42ba-bd70-5393a6ea95f4/ids
Mar 28 01:32:32 fabavmhost030 sanlock[36519]: 2018-03-28 01:32:32 295716 [135126]: s131 renewal error -202 delta_length 20 last_success 295665
Mar 28 01:32:32 fabavmhost030 sanlock[36519]: 2018-03-28 01:32:32 295716 [135126]: s131 renewal error -202 delta_length 20 last_success 295665
Mar 28 01:32:33 fabavmhost030 kernel: qla2xxx [0000:3b:00.1]-801c:16: Abort command issued nexus=16:2:203 --  1 2002.
Mar 28 01:32:33 fabavmhost030 kernel: device-mapper: multipath: Failing path 70:176.
Mar 28 01:32:33 fabavmhost030 multipathd: 36006016071b04500c1be585a57d08a97: sddd - emc_clariion_checker: Read error for WWN 6006016071b04500c1be585a57d08a97.  Sense data are 0x0/0x0/0x0.
Mar 28 01:32:33 fabavmhost030 multipathd: checker failed path 70:176 in map 36006016071b04500c1be585a57d08a97
Mar 28 01:32:33 fabavmhost030 multipathd: 36006016071b04500c1be585a57d08a97: sddd - emc_clariion_checker: Read error for WWN 6006016071b04500c1be585a57d08a97.  Sense data are 0x0/0x0/0x0.
Mar 28 01:32:33 fabavmhost030 multipathd: 36006016071b04500c1be585a57d08a97: remaining active paths: 3
Mar 28 01:32:33 fabavmhost030 multipathd: checker failed path 70:176 in map 36006016071b04500c1be585a57d08a97
Mar 28 01:32:33 fabavmhost030 multipathd: 36006016071b04500c1be585a57d08a97: remaining active paths: 3
Mar 28 01:32:33 fabavmhost030 multipathd: f2c2d59f--d781--4bd8--92de--aa83af30a70b-47765b95--5c9e--422c--8522--6d441b3cc7d5: adding map
Mar 28 01:32:33 fabavmhost030 multipathd: f2c2d59f--d781--4bd8--92de--aa83af30a70b-47765b95--5c9e--422c--8522--6d441b3cc7d5: adding map
Mar 28 01:32:33 fabavmhost030 multipathd: f2c2d59f--d781--4bd8--92de--aa83af30a70b-47765b95--5c9e--422c--8522--6d441b3cc7d5: devmap dm-43 added
Mar 28 01:32:33 fabavmhost030 multipathd: f2c2d59f--d781--4bd8--92de--aa83af30a70b-47765b95--5c9e--422c--8522--6d441b3cc7d5: devmap dm-43 added
Mar 28 01:32:33 fabavmhost030 multipathd: dm-43: remove map (uevent)
Mar 28 01:32:33 fabavmhost030 multipathd: dm-43: devmap not registered, can't remove
Mar 28 01:32:33 fabavmhost030 multipathd: dm-43: remove map (uevent)
Mar 28 01:32:33 fabavmhost030 multipathd: dm-43: remove map (uevent)
Mar 28 01:32:33 fabavmhost030 multipathd: dm-43: devmap not registered, can't remove
Mar 28 01:32:33 fabavmhost030 multipathd: dm-43: remove map (uevent)
Mar 28 01:32:33 fabavmhost030 kernel: qla2xxx [0000:3b:00.0]-801c:7: Abort command issued nexus=7:2:203 --  1 2002.
Mar 28 01:32:38 fabavmhost030 multipathd: 36006016071b04500c1be585a57d08a97: sddd - emc_clariion_checker: Active path is healthy.
Mar 28 01:32:38 fabavmhost030 multipathd: 36006016071b04500c1be585a57d08a97: sddd - emc_clariion_checker: Active path is healthy.
Mar 28 01:32:38 fabavmhost030 kernel: device-mapper: multipath: Reinstating path 70:176.
Mar 28 01:32:38 fabavmhost030 kernel: sd 16:0:2:203: alua: port group 02 state A preferred supports tolUsNA
Mar 28 01:32:38 fabavmhost030 multipathd: 70:176: reinstated
Mar 28 01:32:38 fabavmhost030 multipathd: 36006016071b04500c1be585a57d08a97: remaining active paths: 4
Mar 28 01:32:38 fabavmhost030 multipathd: 70:176: reinstated
Mar 28 01:32:38 fabavmhost030 multipathd: 36006016071b04500c1be585a57d08a97: remaining active paths: 4

Comment 1 Allon Mureinik 2018-04-02 08:19:44 UTC
Can you please attach engine, vdsm, sanlock logs and /var/log/messages?

Comment 2 Nir Soffer 2018-04-02 09:53:02 UTC
(In reply to oliver.albl from comment #0)
Oliver, thanks for reporting this issue.

> Actual results:
> Stability problems, path and lun loss

But this issue seems to be on your side. oVirt cannot make your system stable if
it fails to read from or write to LUNs.

> Expected results:
> Stable system

oVirt is stable, and it logs errors when your system is not.

> /var/log/messages:
> Mar 28 01:32:12 fabavmhost030 sanlock[36519]: 2018-03-28 01:32:12 295696
> [135126]: s131 delta_renew read timeout 10 sec offset 0
> /dev/d6d3da1f-0f96-42ba-bd70-5393a6ea95f4/ids

This means sanlock timed out reading the ids logical volume while renewing its
lease. Based on the other logs, this looks like an issue with your storage.

If you think this is a sanlock issue, please open a sanlock bug.

> Mar 28 01:32:33 fabavmhost030 kernel: qla2xxx [0000:3b:00.1]-801c:16: Abort
> command issued nexus=16:2:203 --  1 2002.

This is an issue reported by the HBA driver. If you think this is a kernel
issue, open a kernel bug.

> Mar 28 01:32:33 fabavmhost030 kernel: device-mapper: multipath: Failing path
> 70:176.
> Mar 28 01:32:33 fabavmhost030 multipathd: 36006016071b04500c1be585a57d08a97:
> sddd - emc_clariion_checker: Read error for WWN
> 6006016071b04500c1be585a57d08a97.  Sense data are 0x0/0x0/0x0.
> Mar 28 01:32:33 fabavmhost030 multipathd: checker failed path 70:176 in map
> 36006016071b04500c1be585a57d08a97
> Mar 28 01:32:33 fabavmhost030 multipathd: 36006016071b04500c1be585a57d08a97:
> sddd - emc_clariion_checker: Read error for WWN
> 6006016071b04500c1be585a57d08a97.  Sense data are 0x0/0x0/0x0.
> Mar 28 01:32:33 fabavmhost030 multipathd: 36006016071b04500c1be585a57d08a97:
> remaining active paths: 3

This is an error from the multipath path checker showing that it cannot access
a path, so the path was moved to the failed state.

This is likely an issue with the underlying subsystem (driver, HBA, switch,
etc.). If you think this is a multipath bug, you can open a multipath bug.

There is nothing in this bug showing an issue with oVirt rather than with the
underlying subsystems.

To complete the investigation, please attach also:

- contents of /etc/multipath.conf
- contents of /etc/multipath/conf.d/*
- output of "multipathd show config"
- information about your storage

Comment 3 Nir Soffer 2018-04-02 09:58:36 UTC
Yaniv, can you explain why this bug needs an exception? We don't have any evidence
that there is a problem in the system yet.

Comment 4 oliver.albl 2018-04-03 06:46:46 UTC
Created attachment 1416605 [details]
Logfiles ovirt-engine and fabavmhost030

Comment 5 oliver.albl 2018-04-03 06:47:14 UTC
I attached engine.log, messages, sanlock.log, the multipath configuration, and the vdsm log from the host reporting a path loss.

We use two Dell/EMC Unity arrays (400F and 450F), connected via two FC switches. Both arrays present more than 20 8 TB LUNs to all hosts. Each array has a unique host LUN ID scope, with one exception: each array presents a dummy LUN 0 to all hosts.

We currently use no_path_retry 12 (oVirt 4.2.2 changed the default from "fail" to 4; the storage vendor recommends 60).

Comment 6 oliver.albl 2018-04-03 13:56:52 UTC
Today almost all hosts reported "sanlock renewal error" and "Abort command issued" messages at the same time, and two hosts reported a path loss. oVirt reported several VMs as not responding (but these were not related to the LUN mentioned by the reported sanlock and abort messages).

Another identically installed oVirt host (a test system, single host in one DC) using the same FC switches and storage arrays does not show any related messages.

Comment 7 oliver.albl 2018-05-11 18:12:09 UTC
The storage array vendor observed "extremely large IO" (32 MB to 64 MB) issued by a host at the time the errors occur.

They recommend limiting the maximum IO size hosts may send to the array, based on the following guidance regarding IO sizes:

IO size < 4KB can be considered very small
IO size between 4KB and  32KB can be considered normal IO size for response time sensitive applications
IO size between 32KB and 256KB can be considered large IO for sequential operations
IO size between 256KB and 1MB can be considered very large IO
IO size > 1MB can be considered extremely large

Our main use case in this environment is automatically creating/running/deleting VMs/snapshots. So possible causes for this IO might be:
- Automatically cloning VMs from snapshots in parallel
- Automatically creating/deleting snapshots
- Automatically suspending VMs (memory dumps)
- Automatically deleting VMs/snapshots (we use thin provisioned LUNs and “discard on delete” on the hypervisor level, so the discard operations after deletion of VMs might also trigger this IO)

On the hosts we see max_sectors_kb = 32767 and discard_max_bytes = 4294966784.

The storage vendor suggests setting max_sectors_kb to a value between 64 and 512. We would give this a try (starting with 512) via multipath.conf.

Do you expect problems in oVirt/VDSM when we reduce max_sectors_kb in multipath.conf? Would you suggest reducing discard_max_bytes as well, and if so, what would be an appropriate setting?

Comment 8 Nir Soffer 2018-05-28 17:41:25 UTC
(In reply to oliver.albl from comment #7)
> Storage array vendor monitored "extremely large IO" (32MB to 64MB) issued by
> a host at the time, the errors occur.
> 
> They recommend limiting maximum IO size hosts may send to the array based on
> the following guidance in regards to IO sizes:
> 
> IO size < 4KB can be considered very small
> IO size between 4KB and  32KB can be considered normal IO size for response
> time sensitive applications
> IO size between 32KB and 256KB can be considered large IO for sequential
> operations
> IO size between 256KB and 1MB can be considered very large IO
> IO size > 1MB can be considered extremely large

We use 1MB read/writes for most storage operations.

I think qemu-img is using 2MB read/writes for operations like converting
images.

sanlock uses 1MB reads for monitoring storage leases.

We are going to switch to 8M read/writes for image upload/download, as this
shows a significant improvement on all storage types we tested.
 
> Our main use case in this environment is automatically
> creating/running/deleting VMs/snapshots. So possible causes for this IO
> might be:
> - Automatically cloning VMs from snapshots in parallel

This uses qemu-img to copy the volume data.

> - Automatically creating/deleting snapshots

This just creates a new logical volume and attaches it to a VM. Very little
IO should be generated.

> - Automatically suspending VMs (memory dumps)

qemu writes memory through a compression utility; maybe this generates big writes?

> - Automatically deleting VMs/snapshots (we use thin provisioned LUNs and
> “discard on delete” on the hypervisor level, so the discard operations after
> deletion of VMs might also trigger this IO)

Zeroing an image uses either dd with a 1M block size (vdsm < 4.20.23) or
blkdiscard --zeroout --step 2M (vdsm >= 4.20.23).

Zeroing images using blkdiscard is a very new feature; see
https://github.com/oVirt/vdsm/commit/b323ece018a360c91c39028928026e7208c4291f for the details.
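
For reference, the two approaches amount to roughly the following shell commands (a sketch only; /dev/VG/LV is a placeholder, and vdsm passes an explicit size rather than writing to the end of the device):

# Old method (vdsm < 4.20.23): overwrite with zeros using dd and direct IO.
dd if=/dev/zero of=/dev/VG/LV bs=1M oflag=direct
# New method (vdsm >= 4.20.23): ask the kernel to zero the device, 2M at a time.
blkdiscard --zeroout --step 2M /dev/VG/LV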

During development of this feature, we found that LIO storage has trouble with
blkdiscard --zeroout without the --step parameter. The issue is described here:
https://github.com/oVirt/vdsm/blob/8eb442944d1013fa068373d83651c8b15d80f16d/lib/vdsm/storage/blkdiscard.py#L29

It is possible that your storage has trouble with BLKZEROUT with 2MB ranges;
in that case both blkdiscard --zeroout and qemu-img convert can cause trouble.

When this issue happens, the kernel logs to /var/log/messages about failing to
send a WRITE_SAME request and about "zeroing manually". Then
/sys/devices/virtual/dm-*/queue/write_same_max_bytes becomes 0, and all future
BLKZEROUT calls fall back to slow manual zeroing.
 
To discard volumes, we use blkdiscard /path/to/volume. This should not generate
big IO, but maybe the storage has trouble with this?

I would try running many discards from the shell and see whether this triggers
the sanlock issues.

> On the hosts we see max_sectors_kb = 32767 and discard_max_bytes =
> 4294966784.
> 
> Storage vendor suggests to set max_sectors_kb to a value between 64 and 512.
> We would give this a try (starting with 512) via multipath.conf.

I don't think it is related to multipath.conf.

Using small sectors will probably harm discard performance. It smells like the
storage vendor should fix the issue on the storage side instead of suggesting
tiny IO on the client side.
 
> Do you expect problems in oVirt/VDSM when we reduce max_sectors_kb in
> multipath.conf? Would you suggest to reduce discard_max_bytes as well and if
> yes, what would be an appropriate setting?

We never changed these settings, so I cannot recommend anything. However, discard
is a new feature and we do not have a lot of experience with it, so it is possible
that it causes trouble with some storage.

You should be able to change values like max_sectors_kb by writing to the
corresponding files under /sys/. One way to modify such values dynamically is a
udev rule that sets the value each time a device is added. I don't know if there
is a nicer way to set this globally.
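
For illustration, a hedged sketch of such a udev rule (the rules-file name, the device match, and the value 512 are assumptions to adapt to your environment):

# Cap max_sectors_kb for SCSI disks via a udev rule.
cat > /etc/udev/rules.d/99-max-sectors.rules <<'EOF'
ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd*", ATTR{queue/max_sectors_kb}="512"
EOF
# Reload udev rules and re-trigger block devices so the limit applies without a reboot.
udevadm control --reload
udevadm trigger --subsystem-match=block --action=change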

Next step - if you did not upload /var/log/messages from the host, please do.

Comment 9 oliver.albl 2018-05-29 14:54:34 UTC
Nir,

I uploaded /var/log/messages from the SPM. The last sanlock renewal timeout occurred today at 12:07 pm on 25 hosts.

Current max_sectors_kb (via udev rules) is 512 (we cannot adjust discard_max_bytes with our 3.10 kernel).

All the best,
Oliver

Comment 10 oliver.albl 2018-05-29 14:56:31 UTC
Created attachment 1445436 [details]
/var/log/messages

Comment 11 Idan Shaby 2018-06-03 12:58:35 UTC
Hi Oliver,

I didn't see any errors in your vdsm logs.
Which version of vdsm are you using?
What is the value of "zero_method" in your SPM configuration (did you change the default?)?

Thanks,
Idan

Comment 12 oliver.albl 2018-06-03 13:27:40 UTC
We are currently using vdsm-4.20.27.1-1.el7.centos.x86_64.

We did not change "zero_method", so I guess we use blkdiscard.

We run our storage domains with "discard after delete" enabled and "wipe after delete" disabled.

Thanks, 
Oliver

Comment 13 Idan Shaby 2018-06-04 08:07:04 UTC
(In reply to oliver.albl from comment #12)
> We are currently using vdsm-4.20.27.1-1.el7.centos.x86_64.
Which contains the patch that changes the default to blkdiscard.

> 
> We did not change "zero_method" so I guess we use blk_discard.
Indeed

> 
> We run our storage domains with "discard after delete" enabled and "wipe
> after delete" disabled.
The fact that you use Discard After Delete (DAD) means that before every call to lvremove on a disk that is about to be removed, the SPM calls blkdiscard on it.
However, discarding an LV is supposed to be a very fast operation. It depends on the LV size, but it is generally a matter of seconds. What disk sizes do you usually use?
The easiest way to check whether DAD is to blame for this issue is to disable it for all of your storage domains and see if you still get these issues. But that would cause your removed disks' space not to be reclaimed, so we prefer that you not try it unless you have no other choice.
I searched for the info logs that come before and after the SPM calls blkdiscard, but I couldn't find them in your vdsm logs. Which of the logs was taken from the SPM? Also, it seems that your vdsm logs and engine log are not synchronized. Can you please attach new logs from the same time?
Once we have the SPM log, we can see how long it took to discard your LVs.
Also, if you don't use Wipe After Delete (and that can also be seen in the logs), you don't even use the new "blkdiscard -z" feature, so I am not concerned about it.

> 
> Thanks, 
> Oliver

Comment 14 oliver.albl 2018-06-04 08:51:31 UTC
Created attachment 1447363 [details]
Logfiles from SPM and Engine

Comment 15 oliver.albl 2018-06-04 08:52:29 UTC
I uploaded /var/log/messages and vdsm.log (from SPM) and engine.log. Last occurrence was today 09:53am.

Thanks,
Oliver

Comment 16 oliver.albl 2018-06-04 09:06:56 UTC
Just to be precise: Last occurrence was today 09:53am CEST.

Comment 17 Idan Shaby 2018-06-04 09:58:17 UTC
Thanks, Oliver.
From looking at the SPM "blkdiscard" calls:
- You had 173 calls to "blkdiscard".
- ~ 75% of them took less than a second.
- ~ 24% took less than 10 seconds.
- 3 calls took more than 10 seconds:
2018-06-04 09:48:01,264+0200 INFO  (tmap/2) [storage.blockdev] Discarded device /dev/5673ab6d-6315-4503-ae6d-098f95aff44b/f197c900-d22c-4784-aa3b-dc638fb0ea66: 10.31 seconds (utils:452)
2018-06-04 09:49:22,028+0200 INFO  (tmap/0) [storage.blockdev] Discarded device /dev/a356e8d6-0901-47fc-921b-12714f2c47e7/9aac238f-6adb-4842-8b13-5dbfa6062209: 15.63 seconds (utils:452)
2018-06-04 09:54:02,930+0200 INFO  (tmap/0) [storage.blockdev] Discarded device /dev/a356e8d6-0901-47fc-921b-12714f2c47e7/c04b490e-ca29-4bf7-9cb9-0a203d0054a4: 27.62 seconds (utils:452)

The last one may be the problematic one.

From /var/log/messages:

Jun  4 09:53:58 fabavmhost081 sanlock[64815]: 2018-06-04 09:53:58 1544442 [110142]: s38 delta_renew read timeout 10 sec offset 0 /dev/a356e8d6-0901-47fc-921b-12714f2c47e7/ids
Jun  4 09:53:58 fabavmhost081 sanlock[64815]: 2018-06-04 09:53:58 1544442 [110142]: s38 renewal error -202 delta_length 10 last_success 1544411
Jun  4 09:53:58 fabavmhost081 sanlock[64815]: 2018-06-04 09:53:58 1544442 [110142]: s38 delta_renew read timeout 10 sec offset 0 /dev/a356e8d6-0901-47fc-921b-12714f2c47e7/ids
Jun  4 09:53:58 fabavmhost081 sanlock[64815]: 2018-06-04 09:53:58 1544442 [110142]: s38 renewal error -202 delta_length 10 last_success 1544411

From the vdsm log:

2018-06-04 09:53:35,304+0200 INFO  (tmap/0) [storage.blockdev] Discarding device /dev/a356e8d6-0901-47fc-921b-12714f2c47e7/c04b490e-ca29-4bf7-9cb9-0a203d0054a4 (blockdev:167)
...
2018-06-04 09:54:02,250+0200 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/a356e8d6-0901-47fc-921b-12714f2c47e7/metadata running next_check=5839112.60 at 0x7f24a81b5390>> delayed by 0.51 seconds (asyncutils:138)
2018-06-04 09:54:02,250+0200 ERROR (check/loop) [storage.Monitor] Error checking path /dev/a356e8d6-0901-47fc-921b-12714f2c47e7/metadata (monitor:498)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/monitor.py", line 496, in _pathChecked
    delay = result.delay()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/check.py", line 391, in delay
    raise exception.MiscFileReadException(self.path, self.rc, self.err)
MiscFileReadException: Internal file read failure: ('/dev/a356e8d6-0901-47fc-921b-12714f2c47e7/metadata', 1, 'Read timeout')
2018-06-04 09:54:02,250+0200 INFO  (check/loop) [storage.Monitor] Domain a356e8d6-0901-47fc-921b-12714f2c47e7 became INVALID (monitor:469)
2018-06-04 09:54:02,930+0200 INFO  (tmap/0) [storage.blockdev] Discarded device /dev/a356e8d6-0901-47fc-921b-12714f2c47e7/c04b490e-ca29-4bf7-9cb9-0a203d0054a4: 27.62 seconds (utils:452)
2018-06-04 09:54:02,930+0200 INFO  (tmap/0) [storage.LVM] Removing LVs (vg=a356e8d6-0901-47fc-921b-12714f2c47e7, lvs=['c04b490e-ca29-4bf7-9cb9-0a203d0054a4']) (lvm:1174)

Nir, could the blkdiscard operation on /dev/a356e8d6-0901-47fc-921b-12714f2c47e7/c04b490e-ca29-4bf7-9cb9-0a203d0054a4 (~190G, if I got it right) have prevented sanlock from reaching /dev/a356e8d6-0901-47fc-921b-12714f2c47e7/ids?

Comment 18 Nir Soffer 2018-06-04 16:31:25 UTC
(In reply to Idan Shaby from comment #17)
> Thanks, Oliver.
> From looking at the SPM "blkdiscard" calls:
> - You had 173 calls to "blkdiscard".
> - ~ 75% of them took less than a second.
> - ~ 24% took less than 10 seconds.
> - 3 calls took more than 10 seconds:

Nice!

> The last one may be the problematic one.
> 
> From /var/log/messages:
> 
> Jun  4 09:53:58 fabavmhost081 sanlock[64815]: 2018-06-04 09:53:58 1544442
> [110142]: s38 delta_renew read timeout 10 sec offset 0
> /dev/a356e8d6-0901-47fc-921b-12714f2c47e7/ids
> Jun  4 09:53:58 fabavmhost081 sanlock[64815]: 2018-06-04 09:53:58 1544442
> [110142]: s38 renewal error -202 delta_length 10 last_success 1544411
> Jun  4 09:53:58 fabavmhost081 sanlock[64815]: 2018-06-04 09:53:58 1544442
> [110142]: s38 delta_renew read timeout 10 sec offset 0
> /dev/a356e8d6-0901-47fc-921b-12714f2c47e7/ids
> Jun  4 09:53:58 fabavmhost081 sanlock[64815]: 2018-06-04 09:53:58 1544442
> [110142]: s38 renewal error -202 delta_length 10 last_success 1544411
> 
> From the vdsm log:
> 
> 2018-06-04 09:53:35,304+0200 INFO  (tmap/0) [storage.blockdev] Discarding
> device
> /dev/a356e8d6-0901-47fc-921b-12714f2c47e7/c04b490e-ca29-4bf7-9cb9-
> 0a203d0054a4 (blockdev:167)
> ...
> 2018-06-04 09:54:02,250+0200 WARN  (check/loop) [storage.asyncutils] Call
> <bound method DirectioChecker._check of <DirectioChecker
> /dev/a356e8d6-0901-47fc-921b-12714f2c47e7/metadata running
> next_check=5839112.60 at 0x7f24a81b5390>> delayed by 0.51 seconds
> (asyncutils:138)

This may not be related, but it is a bad sign - storage checkers should not be
delayed. In the past we have seen delays caused by slow logging. This should be
eliminated in the current release if you are using the default logging configuration.

> 2018-06-04 09:54:02,250+0200 ERROR (check/loop) [storage.Monitor] Error
...
> MiscFileReadException: Internal file read failure:
> ('/dev/a356e8d6-0901-47fc-921b-12714f2c47e7/metadata', 1, 'Read timeout')
> 2018-06-04 09:54:02,250+0200 INFO  (check/loop) [storage.Monitor] Domain
> a356e8d6-0901-47fc-921b-12714f2c47e7 became INVALID (monitor:469)
> 2018-06-04 09:54:02,930+0200 INFO  (tmap/0) [storage.blockdev] Discarded
> device
> /dev/a356e8d6-0901-47fc-921b-12714f2c47e7/c04b490e-ca29-4bf7-9cb9-
> 0a203d0054a4: 27.62 seconds (utils:452)
> 2018-06-04 09:54:02,930+0200 INFO  (tmap/0) [storage.LVM] Removing LVs
> (vg=a356e8d6-0901-47fc-921b-12714f2c47e7,
> lvs=['c04b490e-ca29-4bf7-9cb9-0a203d0054a4']) (lvm:1174)
> 
> Nir, could the blkdiscard operation on
> /dev/a356e8d6-0901-47fc-921b-12714f2c47e7/c04b490e-ca29-4bf7-9cb9-
> 0a203d0054a4 (~190G if I got it right) prevented from sanlock to reach
> /dev/a356e8d6-0901-47fc-921b-12714f2c47e7/ids?

It looks like that based on the logs. This is most likely an issue on the storage
side, which is not able to serve requests while discarding a big LV.

Oliver:
- What kind of storage do you use? Can you describe the setup?
- Are you using the default logging configuration?
- Can you attach /etc/vdsm/logger.conf?
- Where is /var/log/vdsm located? Local disk? NFS?

Idan:
We need input from a kernel block layer developer regarding possible quality of
service issues when discarding large devices. If some storage can have trouble
discarding large devices, we may need to break the discard into many small steps
so that other requests can be served.

Comment 19 oliver.albl 2018-06-04 17:57:36 UTC
We have about 40 hosts connected via FC/SAN to two Dell/EMC Unity arrays (both all flash). Each array presents about 20 8 TB LUNs to all hosts.

We are using the default logging configuration, /var/log/vdsm is located on local disks (SSD).

/etc/vdsm/logger.conf:

# Vdsm logging configuration.

[loggers]
keys=root,vds,storage,virt,ovirt_hosted_engine_ha,ovirt_hosted_engine_ha_config,IOProcess,devel

[handlers]
keys=console,syslog,logfile,logthread

[formatters]
keys=long,simple,none,sysform

[logger_root]
level=INFO
handlers=syslog,logthread
propagate=0

[logger_vds]
level=INFO
handlers=syslog,logthread
qualname=vds
propagate=0

[logger_storage]
level=INFO
handlers=logthread
qualname=storage
propagate=0

[logger_ovirt_hosted_engine_ha]
level=ERROR
handlers=
qualname=ovirt_hosted_engine_ha
propagate=1

[logger_ovirt_hosted_engine_ha_config]
level=ERROR
handlers=
qualname=ovirt_hosted_engine_ha.env.config
propagate=0

[logger_IOProcess]
level=INFO
handlers=logthread
qualname=IOProcess
propagate=0

[logger_virt]
level=INFO
handlers=logthread
qualname=virt
propagate=0

[logger_devel]
level=ERROR
handlers=logthread
qualname=devel
propagate=0

[handler_syslog]
level=WARN
class=handlers.SysLogHandler
formatter=sysform
args=('/dev/log', handlers.SysLogHandler.LOG_USER)

[handler_logthread]
class=vdsm.common.logutils.ThreadedHandler
args=[]
level=DEBUG
target=logfile

[handler_logfile]
class=vdsm.common.logutils.UserGroupEnforcingHandler
args=('vdsm', 'kvm', '/var/log/vdsm/vdsm.log',)
level=DEBUG
formatter=long

[handler_console]
class: StreamHandler
args: []
formatter: none

[formatter_simple]
format: %(asctime)s:%(levelname)s:%(message)s

[formatter_none]
format: %(message)s

[formatter_long]
format: %(asctime)s %(levelname)-5s (%(threadName)s) [%(name)s] %(message)s (%(module)s:%(lineno)d)
class: vdsm.common.logutils.TimezoneFormatter

[formatter_sysform]
format: vdsm[%(process)d]: %(levelname)s %(message)s

Comment 20 Nir Soffer 2018-06-04 18:27:24 UTC
Oliver, did you find anything interesting in the storage logs or health stats
around the time you see sanlock renewal errors in the sanlock logs and vdsm
monitoring errors in the vdsm log?

Also, did you consult the storage vendor about a possible issue with discarding?

Comment 21 Idan Shaby 2018-06-05 06:44:56 UTC
(In reply to Nir Soffer from comment #18)

> Idan:
> We need input from kernel block layer developer regarding possible quality
> of service issues when discarding large devices. If it is possible that some
> storage will have trouble to discard large devices, we may need to break the 
> discard into many small steps, that will allow other requests to be served.

I think that it is a good idea to add a --step of 2M anyway. That may solve issues like this one, and like the one we had when we tested blkdiscard -z. It's the same tool, after all. We just didn't think it would take so long to discard a device (I don't know why it took almost half a minute).
What do you think?

Comment 22 oliver.albl 2018-06-05 07:07:09 UTC
Nir,

When there are large discards, we observe high latency for the specific LUN on many hosts. We are working with our storage vendor in parallel; they have the same suspect: large discards.

All the best,
Oliver

Comment 23 Idan Shaby 2018-06-05 09:07:15 UTC
Maybe 2M is not a good idea. A better idea may be to query discard_max_bytes from sysfs for the specific device we want to discard and use it as the value of --step. That way we won't starve other requests, and we get the least bad performance.
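
A rough shell sketch of that idea (the device path is a placeholder; vdsm would do the equivalent internally):

# Resolve the LV to its dm-N node and read the device's discard_max_bytes.
dev=/dev/VG/LV
dm=$(basename "$(readlink -f "$dev")")
step=$(cat "/sys/block/$dm/queue/discard_max_bytes")
# Discard in chunks of at most discard_max_bytes instead of one huge request.
blkdiscard --step "$step" "$dev"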

From my tests on xtremio, 1T disk, discarding it from the engine takes:
- 33.13s without a step (like we do today).
- 213.15s with a specific step of 2M.
- 82.76s with a step equal to the device's discard_max_bytes value (8M).

Comment 24 Idan Shaby 2018-06-19 09:58:24 UTC
The needinfo flag was accidentally dropped on comment 22.

Comment 25 Nir Soffer 2018-06-19 16:00:04 UTC
(In reply to Idan Shaby from comment #21)
> (In reply to Nir Soffer from comment #18)
> I think that it is a good idea to add a --step of 2M anyway. That may solve
> issues like this, and like we had when we tested blkdiscard -z. It's the
> same tool, after all. It's just that we didn't think that it will take so
> long to discard a device (don't know why it took it almost a half of a
> minute).

Using --step to limit the time we are blocked in the kernel without being able to
abort an operation is a good idea in general, but it is not a solution for this
issue.

The kernel already uses discard_max_bytes (or similar) to send proper requests
to the storage. There may be an issue with some storage reporting wrong values,
or maybe this is an issue with how the storage handles large operations.

We really need input from kernel/storage people on this.

Oliver, did you reproduce the issue by discarding volumes manually? Maybe you want
to experiment on your system with the --step argument?

Comment 26 oliver.albl 2018-06-20 10:29:44 UTC
Nir,

I can reproduce the symptoms of the issue (long latency while reading from the storage domain) for a 200 GB LV on an idle system (short test, 2 minutes duration). I monitor latency using:

while true; do
  time /bin/dd iflag=direct if=/dev/VG/metadata bs=4096 count=1 of=/dev/null
done

With no step parameter, the 4K read sometimes takes up to 4 seconds (the blkdiscard itself takes about 6 seconds):

while true; do
  time blkdiscard /dev/VG/LV
done

With a 2M step parameter, the 4K read latency does not increase (the blkdiscard itself takes about 100 seconds):

while true; do
  time blkdiscard --step 2M /dev/VG/LV
done

Comment 27 Nir Soffer 2018-06-20 17:32:58 UTC
Thanks Oliver, this is very interesting.

I wonder what the biggest step size that avoids this issue is. I guess that a
128M step would be effective and much faster than 2M steps.
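
For example (128M here is only the guess above, not a recommendation), Oliver's test loop could be repeated with a larger step to find the threshold:

while true; do
  time blkdiscard --step 128M /dev/VG/LV
done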

Elad, can we get access to a fast server and storage for testing this?

Comment 28 Elad 2018-06-21 06:07:44 UTC
Provided offline

Comment 29 Idan Shaby 2018-08-01 12:34:42 UTC
The problem seems to be discarding (by calling blkdiscard) without a step parameter. From the commit message of patch 93338 [1]:

When discarding a device using blkdiscard without the 'step' flag, 'step' gets its default value, which is the size of the device. Usually it functions fine, but in busy systems with certain storage arrays, it can cause delays in other parallel IOs to the storage. From the tests that we ran, we found out that the optimal step size is 32M, in terms of total discard time and minimal starvation of other processes that run in parallel. Therefore, to minimize potential starvation and enhance the performance of blkdiscard, we added a step=32M to blkdiscard calls that aim to discard devices.

For more information, please refer to [1].

Tal, can we please retarget this BZ to 4.2.6?
The change is straightforward.

[1] https://gerrit.ovirt.org/#/c/93338/

Comment 30 Elad 2018-08-15 13:19:48 UTC
Hi Idan, 

Can you please provide steps to reproduce? Does it require a specific storage type/backend? 

Thanks

Comment 31 Idan Shaby 2018-08-16 05:17:52 UTC
Hi Elad,

I didn't try to reproduce the bug as it's pretty complicated and wouldn't gain me anything more than running the simulation that I described in the commit message (https://gerrit.ovirt.org/#/c/93338/3//COMMIT_MSG).
Not sure if you guys have environments for stress tests as described in this BZ.
Maybe you can just check that regular flows work fine and that the new blkdiscard step value is used?
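
For example, a hedged way to spot-check this on the SPM while a disk is removed from a storage domain with Discard After Delete enabled (assumes blkdiscard runs as an external process, and relies on the vdsm log format seen earlier in this bug):

# Watch for the blkdiscard child process and its command line (should include --step 32M).
pgrep -af blkdiscard
# Check how long each discard took according to the SPM's vdsm log.
grep 'storage.blockdev' /var/log/vdsm/vdsm.log | grep -i discard | tail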

Comment 32 Elad 2018-08-20 11:31:00 UTC
Verifying based on the latest regression automation executions; no related regression was seen.

vdsm-4.20.37-1.el7ev.x86_64
ovirt-engine-4.2.6.3-0.1.el7ev.noarch