Bug 1727319 - lvm cache volumes with writethrough operating mode showing non-zero dirty blocks count
Summary: lvm cache volumes with writethrough operating mode showing non-zero dirty blocks count
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kernel
Version: 7.6
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Joe Thornber
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-07-05 14:23 UTC by John Pittman
Modified: 2021-09-06 14:55 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-08 09:04:13 UTC
Target Upstream Version:
Embargoed:


Attachments
Used and dirty blocks over time. (14.31 KB, image/png)
2019-07-11 20:49 UTC, Lucas Vinicius Hartmann

Description John Pittman 2019-07-05 14:23:53 UTC
Description of problem:

Customer is seeing a non-zero dirty block count on writethrough cache volumes.  They have seen it twice now: once after a system crash, and potentially once after a standard reboot (still need to verify).  After the system comes up, the dirty count increases and then eventually drains.

Version-Release number of selected component (if applicable):

kernel-3.10.0-957.28.1.el7
device-mapper-1.02.149-10.el7_6.3
lvm2-2.02.180-10.el7_6.3

How reproducible:

Sporadic

Steps to Reproduce:

Customer has been able to reproduce at least once with a system crash.  I have not been able to reproduce locally.  I was suspicious of the RAID resync, but unable to verify.

Actual results:

Dirty block count is non-zero

Expected results:

As this cache volume is using writethrough operating mode, the dirty block count should remain at 0.

Additional info:

Stack info for one cache vol from sosreport (not while the issue was occurring):

LV              VG              Attr       LSize     Pool    Origin        Data%  Meta%  Move Log Cpy%Sync Convert LV Tags Devices         
[cache]         xnat            Cwi---C---   <49.92g                       99.99  16.21           0.00                     cache_cdata(0)  
[cache_cdata]   xnat            Cwi-ao----   <49.92g                                                                       /dev/md125p2(20)
[cache_cmeta]   xnat            ewi-ao----    40.00m                                                                       /dev/md125p2(10)
cargo           xnat            Cwi-aoC---   <16.00t [cache] [cargo_corig] 99.99  16.21           0.00                     cargo_corig(0)  
[cargo_corig]   xnat            owi-aoC---   <16.00t                                                                       /dev/rbd1(0)    
[lvol0_pmspare] xnat            ewi-------    40.00m                                                                       /dev/md125p2(0) 

md125 : active raid1 nvme1n1[1] nvme0n1[0]
      468719424 blocks super 1.2 [2/2] [UU]
      bitmap: 3/4 pages [12KB], 65536KB chunk

Table and status snip at the time of the issue (system up post crash):

xnat-cargo: 0 34359730176 cache 253:6 253:5 253:7 128 2 metadata2 writethrough smq 0
xnat-cache_cdata: 0 104685568 linear 259:3 165888
xnat-cache_cmeta: 0 81920 linear 259:3 83968
xnat-cargo_corig: 0 34359730176 linear 252:16 8192

xnat-cargo: 0 34359730176 cache 8 2464/10240 128 817854/817856 2242260 4826846 1324932 423883 24945 24943 530616 2 metadata2 writethrough 2 migration_threshold 2048 smq 0 rw -
xnat-cache_cdata: 0 104685568 linear
xnat-cache_cmeta: 0 81920 linear
xnat-cargo_corig: 0 34359730176 linear
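
For readers decoding these status lines: the dirty-block count is the 11th value after the "cache" target keyword (field order assumed from the kernel's dm-cache status documentation: ... read hits, read misses, write hits, write misses, demotions, promotions, dirty ...). A minimal sketch to pull it out for the device above:

# dmsetup status xnat-cargo | awk '{for (i = 1; i <= NF; i++) if ($i == "cache") print "dirty blocks:", $(i + 11)}'

In the status snip above that value is 530616, i.e. roughly 65% of the 817856 cache blocks flagged dirty despite the writethrough mode.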

Comment 9 Joe Thornber 2019-07-09 12:24:16 UTC
For performance reasons the dirty bitset is not updated on disk whilst the target is running.  This would cause
latency, since we'd have to hold every bio that is about to dirty a block until the bitset had been updated.

Instead, we write out the complete dirty bitset on clean shutdown.  This means that if there is a crash we
have to assume all blocks are dirty.  We do not know that the last activation was in writethrough mode; indeed,
there are scenarios where writethrough mode could be enabled while dirty blocks exist.

So seeing dirty blocks after a crash is not a bug.

However, if the dirty blocks are all written back and then a _clean_ reboot is performed, you should see no dirty
blocks after reboot.
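
One way to exercise that clean-shutdown path without a full reboot is a clean deactivate/activate cycle. A minimal sketch, assuming a placeholder mount point and vg/cached_lv names, and that this lvm2 build reports the cache_dirty_blocks field:

# umount /mnt/cached_fs
# lvchange -an vg/cached_lv
# lvchange -ay vg/cached_lv
# lvs -o name,cache_dirty_blocks vg/cached_lv

If the dirty bitset was written out during the clean deactivation, the reported dirty count should come back as 0.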

Comment 10 Joe Thornber 2019-07-09 14:02:00 UTC
The following tests check a couple of properties of a writethrough cache:

https://github.com/jthornber/device-mapper-test-suite/blob/master/lib/dmtest/tests/cache/writethrough_tests.rb

i) a cache is still clean after applying an I/O workload
ii) after a crash, reboot, clean, reboot cycle, it is still clean.

I've run these against upstream kernels and the RHEL 7.6 kernel and both pass.

At this point I need more info:

- is the cache clean at the point you reboot?
- is the cache taken down cleanly?  e.g., fs unmounted, LV deactivated?
- after the reboot, how many blocks are dirty?  How long does it take to subsequently clean it?
- do dirty blocks appear if you just deactivate then activate the cache LV without rebooting?

Comment 12 Lucas Vinicius Hartmann 2019-07-11 15:10:47 UTC
I ran into a similar issue on a Fedora 30 desktop. My partitioning was something like:

/        LVM cached writeback
/boot    Simple GPT partition
/home    LVM cached writeback

On every single boot the entire cache for / got marked dirty and took hours to clean, but /home was not affected at all.

I suspect Fedora is not cleanly shutting down LVM when root is cached.

Comment 13 John Pittman 2019-07-11 20:07:48 UTC
(In reply to Lucas Vinicius Hartmann from comment #12)
> I ran into a similar issue on a Fedora 30 desktop. My partitioning was
> something like:
> 
> /        LVM cached writeback
> /boot    Simple GPT partition
> /home    LVM cached writeback
> 
> On every single boot the entire cache for / got marked dirty and took hours
> to clean, but /home was not affected at all.
> 
> I suspect Fedora is not cleanly shutting down LVM when root is cached.

Thanks Lucas.  I'm able to reproduce what you're seeing.  I installed F30 using automatic LVM partitioning, then converted the root LV to a cached LV following https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/logical_volume_manager_administration/lvm_cache_volume_creation.

# vgextend fedora_localhost-live /dev/vdb
# lvcreate -n lv_cache -L 2G fedora_localhost-live /dev/vdb
# lvcreate -L 30m -n lv_cache_meta fedora_localhost-live /dev/vdb
# lvconvert --type cache-pool --poolmetadata fedora_localhost-live/lv_cache_meta fedora_localhost-live/lv_cache
# lvconvert --type cache --cachepool fedora_localhost-live/lv_cache fedora_localhost-live/root 

Ended up with the below config.

[root@localhost-live ~]# lvs -a
  LV               VG                    Attr       LSize   Pool       Origin       Data%  Meta%  Move Log Cpy%Sync Convert
  [lv_cache]       fedora_localhost-live Cwi---C---   2.00g                         71.77  1.37            0.00            
  [lv_cache_cdata] fedora_localhost-live Cwi-ao----   2.00g                                                                
  [lv_cache_cmeta] fedora_localhost-live ewi-ao----  32.00m                                                                
  [lvol0_pmspare]  fedora_localhost-live ewi-------  32.00m                                                                
  root             fedora_localhost-live Cwi-aoC--- <12.50g [lv_cache] [root_corig] 71.77  1.37            0.00            
  [root_corig]     fedora_localhost-live owi-aoC--- <12.50g                                                                
  swap             fedora_localhost-live -wi-ao----   1.50g          
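
A quick way to double-check the resulting cache mode and starting dirty count (a sketch; assumes this lvm2 build reports the pool_lv and cache_dirty_blocks fields):

# dmsetup table fedora_localhost--live-root
# lvs -a -o name,pool_lv,cache_dirty_blocks fedora_localhost-live

The dm table line should list writethrough among the cache feature arguments, and the dirty count should be 0 straight after the conversion.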

After the config was done, I rebooted, and when the system came back up the dirty block count was gradually growing.

fedora_localhost--live-root: 0 26206208 cache 8 109/8192 128 23511/32768 30852 52940 4413 22171 0 22193 0 2 metadata2 writethrough 2 migration_threshold 2048 smq 0 rw - 
fedora_localhost--live-root: 0 26206208 cache 8 109/8192 128 23513/32768 30857 52940 4431 22182 0 22195 0 2 metadata2 writethrough 2 migration_threshold 2048 smq 0 rw - 
fedora_localhost--live-root: 0 26206208 cache 8 109/8192 128 23513/32768 30886 52944 4515 22182 0 22195 0 2 metadata2 writethrough 2 migration_threshold 2048 smq 0 rw - 
fedora_localhost--live-root: 0 26206208 cache 8 109/8192 128 23517/32768 31201 53005 4535 22183 0 22199 0 2 metadata2 writethrough 2 migration_threshold 2048 smq 0 rw - 
fedora_localhost--live-root: 0 26206208 cache 8 109/8192 128 23519/32768 31328 53007 4553 22187 0 22201 0 2 metadata2 writethrough 2 migration_threshold 2048 smq 0 rw - 

Your symptoms and the symptoms of the original issue are the same, but I'm unsure if the root cause is the same.

Comment 15 Lucas Vinicius Hartmann 2019-07-11 20:49:23 UTC
Created attachment 1589665 [details]
Used and dirty blocks over time.

The blank spans are periods when the computer was powered down, except for the glitch on 6/11 09:00, which was a network issue.
Notice the full conversion of all used blocks to dirty on every power cycle.
All reboots were supposedly clean shutdowns.

Scenario: ext4 / filesystem on LVM-cached LVs, backed by a pair of LVM PVs, each PV separately encrypted with LUKS. No separate /root mountpoint.

I happened to be playing with InfluxDB and Grafana at the time, which is how I stumbled on the issue.

Comment 16 Lucas Vinicius Hartmann 2019-07-11 20:59:00 UTC
(In reply to John Pittman from comment #13)

Just occurred to me: did you enable the noatime mount option for the cached partition?

Maybe during boot there is a piece of software searching/scanning/reading a lot of files? Without noatime, reading files causes access times to be updated and written back to disk. This could potentially lead to an increase in dirty blocks at boot, followed by normal operation afterwards.

Careful, though, as noatime may have adverse effects on some applications.
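
For reference, enabling it is just a mount-option change (a sketch; the device path below is a placeholder for the cached root LV). In /etc/fstab:

/dev/mapper/vg-root  /  ext4  defaults,noatime  0 1

or, on a mounted filesystem without rebooting:

# mount -o remount,noatime /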

Comment 17 John Pittman 2019-07-11 21:27:59 UTC
(In reply to Lucas Vinicius Hartmann from comment #16)
> (In reply to John Pittman from comment #13)
> 
> Just occurred to me: did you enable the noatime mount option for the cached
> partition?
> 
> Maybe during boot there is a piece of software searching/scanning/reading a
> lot of files? Without noatime, reading files causes access times to be
> updated and written back to disk. This could potentially lead to an increase
> in dirty blocks at boot, followed by normal operation afterwards.
> 
> Careful, though, as noatime may have adverse effects on some applications.

Thanks, I tried adding noatime, and rebooted.  When the system came back up there were maybe 60 dirty blocks and the value was slowly incrementing.  However, I tried about 5 more reboots with noatime, then maybe 10 without, and have not been able to reproduce again.  Note my cache volume is writethrough and yours is writeback.  Will look further later.

Comment 23 John Pittman 2019-07-12 12:29:20 UTC
(In reply to Lucas Vinicius Hartmann from comment #16)
> (In reply to John Pittman from comment #13)
> 
> Just occurred to me: did you enable the noatime mount option for the cached
> partition?
> 
> Maybe during boot there is a piece of software searching/scanning/reading a
> lot of files? Without noatime, reading files causes access times to be
> updated and written back to disk. This could potentially lead to an increase
> in dirty blocks at boot, followed by normal operation afterwards.
> 
> Careful, though, as noatime may have adverse effects on some applications.

Hi Lucas, would you mind please splitting off your issue into a separate Fedora 30 bug?  We will continue to track your issue there.  You can note in that bug where you split it off from here.

As a note, I was mistaken that I reproduced your issue.  It was late in the day and I think my brain was seeing what it wanted. :)

fedora_localhost--live-root: 0 26206208 cache 8 109/8192 128 23511/32768 30852 52940 4413 22171 0 22193 [0] 2 metadata2 writethrough 2 migration_threshold 2048 smq 0 rw - 

Above, I've put the dirty blocks in brackets []; you can see they are 0.

Comment 24 John Pittman 2019-07-15 14:15:34 UTC
I have a reproduction environment, but I have not yet been able to reproduce as the customer has.

LV               VG  Attr       LSize  Pool       Origin     Data%  Meta%  Move Log Cpy%Sync Convert Devices          
lv               vg2 Cwi-aoC--- <3.42g [lv_cache] [lv_corig] 99.95  2.13            0.00             lv_corig(0)      
[lv_cache]       vg2 Cwi---C---  2.00g                       99.95  2.13            0.00             lv_cache_cdata(0)
[lv_cache_cdata] vg2 Cwi-ao----  2.00g                                                               /dev/md0p2(0)    
[lv_cache_cmeta] vg2 ewi-ao---- 20.00m                                                               /dev/md0p2(512)  
[lv_corig]       vg2 owi-aoC--- <3.42g                                                               /dev/rbd0(0)     
[lvol0_pmspare]  vg2 ewi------- 20.00m                                                               /dev/md0p2(517)  

NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
vg2-lv 253:145  0  3.4G  0 lvm  /mnt/ceph-block-device2

In one terminal, I run:

# fio --bs=128k --ioengine=libaio --rw=randrw --name=blah --filename=/mnt/ceph-block-device2/file --size=3g --direct=0

And in another:

# while true ; do fio --bs=32k --ioengine=libaio --rw=randrw --name=blah2 --filename=/mnt/ceph-block-device2/file2 --size=40m --direct=1 ; done

I let them run for a while, then ctrl-c, quickly check that there are no dirty blocks, then reboot.  This process has not reproduced the issue.  I've also tried just killing the fio threads, unmounting the filesystem, then vgchange -an/-ay; this did not reproduce either.

Please note, however, that I did once issue a reboot while the fio threads were still running.  This caused all sorts of hung tasks and failures.  When the system came back up, there were many dirty blocks which eventually drained to 0.  So I was able to reproduce dirty blocks in that way.

If there are flaws in my procedure, or anything you guys want me to try, please let me know.  Otherwise, I will report back if I or the customer finds anything else.

Comment 25 John Pittman 2019-07-17 12:16:17 UTC
Someone asked similar questions on the dm-devel mailing list at the link below:

https://www.redhat.com/archives/dm-devel/2019-July/msg00114.html

Comment 28 John Pittman 2019-07-24 14:25:00 UTC
As a note, in my reproduction environment I added the NFS setup.  Even when issuing I/O to the NFS mount from the NFS client system, then rebooting the NFS server while the I/O is still going, I have been unable to reproduce.

Still, the only reproduction I've been able to do is when I was issuing I/O directly to the cache LV, then rebooting while the fio threads were still issuing I/O.  All sorts of hung tasks and failures as the system goes down; dirty blocks when the system comes back up.

Comment 29 Joe Thornber 2019-10-08 09:04:13 UTC
Closing since no one has reproduced this issue when there is a clean shutdown.

