Bug 1672496

| Summary: | lvmraid RAID1: I/O stalls with HDD+SSD | | |
|---|---|---|---|
| Product: | [Community] LVM and device-mapper | Reporter: | Paul Wise (Debian) <pabs3> |
| Component: | device-mapper | Assignee: | LVM and device-mapper development team <lvm-team> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | high | | |
| Version: | unspecified | CC: | agk, heinzm, jbrassow, msnitzer, prajnoha, thornber, zkabelac |
| Target Milestone: | --- | Flags: | rule-engine: lvm-technical-solution? |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-09-06 12:23:33 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |

Comments:

I'm not sure what is happening here and I haven't heard of this reported before. Instead of splitting the raid1, you could try 'writebehind' and 'writemostly' options (see lvmraid(7)).

I switched the HDD to writemostly some months ago and that resolved the issue for me.

(In reply to Paul Wise (Debian) from comment #2)
> I switched the HDD to writemostly some months ago and that resolved the
> issue for me.

Closing as of this comment. FWIW: could be another disk controller issue throttling I/O.

writemostly seems like a workaround rather than a fix; surely RAID1 should work sanely on disks of differing latency without requiring that?
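
For reference, the writemostly/writebehind tuning mentioned above is applied per-PV with lvchange, as described in lvmraid(7). A minimal sketch, assuming the HDD-backed PV is the LUKS mapping /dev/mapper/sdb5_crypt and the mirrored LV is hostname/root as in the layout below; the device path and the write-behind count are illustrative, not taken from the report:

```sh
# Mark the HDD leg write-mostly: normal reads are served from the other
# (SSD) leg, while writes continue to go to both mirror images.
lvchange --writemostly /dev/mapper/sdb5_crypt:y hostname/root

# Optionally bound how many writes may be queued to the write-mostly
# device before writes become synchronous again.
lvchange --writebehind 1024 hostname/root

# Verify: the write-mostly image shows 'w' in its lvs attributes.
lvs -a -o name,attr,devices hostname
```

If an image has already been split off with `--trackchanges`, it can first be merged back with `lvconvert --merge` on the tracked sub-LV (e.g. hostname/root_rimage_N, whichever image was split).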

Description:

I am having issues with LVM lvmraid RAID1. I am using Linux 4.19.16-1 and lvm2 2.03.02-1 on Debian 10 (buster). I have a desktop system with two 500GB disks, a HDD and an SSD. Each disk has a partition containing a LUKS volume; the LUKS volumes contain LVM PVs, which form a single VG, which contains a swap LV and a rootfs LV. The swap LV is backed by the HDD PV and the rootfs LV by both PVs.

I often get random sets of processes going into D state for an extended period. It feels like all I/O is stalled, as the disk light doesn't go on at all. I do not know if these issues are new in the Linux version I use, since I only recently switched to this setup from having just the HDD. Sometimes switching to a different virtual console causes the I/O to start again. Sometimes logging in over SSH causes the I/O to start again. Sometimes I just need to wait some minutes until the I/O starts again.

Recently I split the HDD from the RAID1 array (using `lvconvert --yes --splitmirrors 1 --trackchanges`) and this has completely prevented the I/O stalls for several days. Of course, I don't really want to keep running in this configuration in case the SSD fails.

I'm wondering if the difference in latency between the two devices is causing the Linux mq-deadline I/O scheduler to display suboptimal I/O behaviour. I'm also wondering if there is any better way to investigate this than the naive script below, which dumps kernel stacks for processes in D state. I wanted to make it print only processes that have been in D state for more than 5 seconds, but the stime item in /proc/$PID/stat only seems to report cumulative kernel time rather than time since the last state transition. I tried looking in dmesg, but the warnings about blocked tasks do not seem to appear even when /proc/sys/kernel/hung_task_timeout_secs is set to 5 seconds.

```
# lsblk
NAME                         MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda                            8:0    0 465.8G  0 disk
├─sda1                         8:1    0   243M  0 part
│ └─md127                      9:127  0   242M  0 raid1 /boot
├─sda2                         8:2    0     1K  0 part
└─sda5                         8:5    0 465.5G  0 part
  └─sda5_crypt               253:9    0 465.5G  0 crypt
    ├─hostname-root_rmeta_1  253:4    0     4M  0 lvm
    │ └─hostname-root        253:7    0 449.9G  0 lvm   /
    └─hostname-root_rimage_1 253:6    0 449.9G  0 lvm
      └─hostname-root        253:7    0 449.9G  0 lvm   /
sdb                            8:48   0 465.8G  0 disk
├─sdb1                         8:49   0   243M  0 part
│ └─md127                      9:127  0   242M  0 raid1 /boot
├─sdb2                         8:50   0     1K  0 part
└─sdb5                         8:53   0 465.5G  0 part
  └─sdb5_crypt               253:0    0 465.5G  0 crypt
    ├─hostname-root_rmeta_0  253:1    0     4M  0 lvm
    │ └─hostname-root        253:7    0 449.9G  0 lvm   /
    ├─hostname-root_rimage_0 253:2    0 449.9G  0 lvm
    │ └─hostname-root        253:7    0 449.9G  0 lvm   /
    └─hostname-swap          253:8    0  15.6G  0 lvm   [SWAP]

# lvs -a
  LV              VG       Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  root            hostname rwi-aor--- 449.91g                                    100.00
  [root_rimage_0] hostname iwi-aor--- 449.91g
  [root_rimage_1] hostname iwi-aor--- 449.91g
  [root_rmeta_0]  hostname ewi-aor---   4.00m
  [root_rmeta_1]  hostname ewi-aor---   4.00m
  swap            hostname -wi-ao---- <15.60g

# head /sys/block/sd{a,b}/queue/scheduler
==> /sys/block/sda/queue/scheduler <==
[mq-deadline] none

==> /sys/block/sdb/queue/scheduler <==
[mq-deadline] none
```

```
# cat `which dump-d-state-process-stacks`
#!/bin/bash
# Poll frequently for tasks in D (uninterruptible sleep) state and dump
# their executable path, command line and kernel stack.
while sleep 0.1 ; do
    grep -l State:.D /proc/*/status 2> /dev/null |
    sed 's_/proc/__;s_/status__' |
    xargs -I _ bash -c '
        ret=0
        link=$(readlink /proc/_/exe) || ret=$?
        # Skip kernel threads and tasks that exited between the grep and here.
        if [ $ret -eq 0 ] ; then
            echo START PROCESS -------------------------------------------------
            date
            echo $link
            tr "\0" " " < /proc/_/cmdline
            cat /proc/_/stack
            echo END PROCESS ----------------------------------------------------
        fi
    '
done
```
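
On the "more than 5 seconds in D state" point: since /proc/$PID/stat does not record how long a task has been in its current state, one option is to track that in the monitoring loop itself by remembering when each PID was first seen in D state. A minimal sketch of that idea; the 5-second threshold, 1-second poll interval, and output format are illustrative choices, not from the report:

```
#!/bin/bash
# Only dump tasks that have stayed in D state for at least $threshold
# seconds, by remembering when each PID was first seen in D state.
threshold=5   # seconds in D state before a task gets dumped
interval=1    # polling interval in seconds

# first_seen[pid] = epoch time the PID was first observed in D state
declare -a first_seen

while sleep "$interval" ; do
    now=$(date +%s)
    for status in /proc/[0-9]*/status ; do
        pid=${status#/proc/} ; pid=${pid%/status}
        if grep -q '^State:.D' "$status" 2> /dev/null ; then
            : "${first_seen[$pid]:=$now}"
            if (( now - first_seen[pid] >= threshold )) ; then
                echo "PID $pid in D state for >= ${threshold}s:"
                readlink "/proc/$pid/exe"
                tr '\0' ' ' < "/proc/$pid/cmdline" ; echo
                cat "/proc/$pid/stack" 2> /dev/null
                echo ----------------------------------------------------
            fi
        else
            # Left D state (or exited): forget it so the timer restarts.
            unset "first_seen[$pid]"
        fi
    done
done
```

Because this only polls, a task that leaves and re-enters D state between samples restarts its timer; for stalls lasting minutes, as described above, that limitation should not matter much.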