Bug 2112147 - simultaneous writes to a page on xfs can result in zero-byte data
Summary: simultaneous writes to a page on xfs can result in zero-byte data
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: kernel
Version: 8.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Carlos Maiolino
QA Contact: Murphy Zhou
URL:
Whiteboard:
Depends On:
Blocks: 2184101 2184102 2184103 2208413
 
Reported: 2022-07-29 00:01 UTC by Frank Sorenson
Modified: 2023-11-14 17:13 UTC
CC List: 19 users

Fixed In Version: kernel-4.18.0-491.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 2184101 2184102 2184103 2208413
Environment:
Last Closed: 2023-11-14 15:37:57 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments
reproducer (19.74 KB, text/x-csrc) - 2022-07-29 00:01 UTC, Frank Sorenson
test run data (4.91 MB, application/x-xz) - 2022-07-29 00:20 UTC, Frank Sorenson
updated reproducer (32.90 KB, text/x-csrc) - 2022-08-04 03:55 UTC, Frank Sorenson
program to consume memory (2.26 KB, text/x-csrc) - 2022-08-10 06:34 UTC, Frank Sorenson
updated reproducer (79.68 KB, text/x-csrc) - 2022-08-26 21:01 UTC, Frank Sorenson
xfs_io-based reproducer (2.00 KB, application/x-shellscript) - 2022-10-04 23:03 UTC, Frank Sorenson


Links
Gitlab: redhat/rhel/src/kernel rhel-8 merge_requests 4240 (last updated 2023-03-15 08:53:10 UTC)
Red Hat Issue Tracker: RHELPLAN-129544 (last updated 2022-07-29 00:03:45 UTC)
Red Hat Knowledge Base (Solution): 7006705 (last updated 2023-04-07 07:46:31 UTC)
Red Hat Product Errata: RHSA-2023:7077 (last updated 2023-11-14 15:38:33 UTC)

Description Frank Sorenson 2022-07-29 00:01:23 UTC
Created attachment 1900079 [details]
reproducer

Description of problem:

simultaneous writes to a page on xfs can result in zero-byte data


Version-Release number of selected component (if applicable):

4.18.0-348.7.1.el8_5.x86_64


How reproducible:

easy; reproducer attached

Steps to Reproduce:

# gcc test_write_bug.c -o test_write_bug -lpthread -Wall -g
# ./test_write_bug . 40 3

(starts 40 processes, each having 3 write threads, writing to files in ./testfiles/test*)

each of the 3 write threads writes 1 MiB at the following offsets:

thread 0 - 0x300
thread 1 - 0x100300
thread 2 - 0x200300

the threads then wait for each other to complete, then continue with 1 MiB writes at the following offsets:

thread 0 - 0x300300
thread 1 - 0x400300
thread 2 - 0x500300

etc.

after writing 100 MiB, each test process checks the file to see whether the bug has reproduced
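
For illustration, a minimal sketch of this write pattern (not the attached reproducer, which adds per-process forking, argument handling, logging, and verification) looks roughly like this:

/*
 * Minimal sketch of the write pattern above (illustration only; filenames
 * and sizes are simplified).  Three threads share one fd and pwrite() 1 MiB
 * at 0x300, 0x100300, 0x200300, ..., waiting for each other between rounds,
 * so adjacent writes from different threads meet inside a single page.
 *
 * build: gcc -Wall -g write_sketch.c -o write_sketch -lpthread
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define NTHREADS  3
#define BUFSZ     0x100000UL    /* 1 MiB per write */
#define START_OFF 0x300UL       /* initial offset, 768 bytes into the first page */
#define NROUNDS   33            /* ~99 MiB written in total */

static int fd;
static pthread_barrier_t barrier;

static void *writer(void *arg)
{
    long tid = (long)arg;
    char *buf = malloc(BUFSZ);

    for (long round = 0; round < NROUNDS; round++) {
        long idx = round * NTHREADS + tid;
        off_t off = START_OFF + idx * BUFSZ;

        memset(buf, 'A' + (idx % 26), BUFSZ);
        if (pwrite(fd, buf, BUFSZ, off) != (ssize_t)BUFSZ)
            perror("pwrite");

        /* all threads wait for each other before the next round of writes */
        pthread_barrier_wait(&barrier);
    }
    free(buf);
    return NULL;
}

int main(int argc, char **argv)
{
    pthread_t tids[NTHREADS];

    fd = open(argc > 1 ? argv[1] : "testfile", O_CREAT | O_TRUNC | O_WRONLY, 0644);
    if (fd < 0) { perror("open"); return 1; }

    pthread_barrier_init(&barrier, NULL, NTHREADS);
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tids[i], NULL, writer, (void *)i);
    for (long i = 0; i < NTHREADS; i++)
        pthread_join(tids[i], NULL);

    /* the file would then be read back and checked for zeroed ranges */
    close(fd);
    return 0;
}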



Actual results:

2022-07-28 18:50:54.891379862  [3452412] file size will be 104858368 bytes, and buffer size will be 1048576
2022-07-28 18:50:54.892680344  [3452412] forked test proc 0 as pid 0
2022-07-28 18:50:54.893127424  [3452412] forked test proc 1 as pid 0
2022-07-28 18:50:54.893589137  [3452412] forked test proc 2 as pid 0
2022-07-28 18:50:54.894022621  [3452412] forked test proc 3 as pid 0
...
2022-07-28 18:51:09.906519882  [3452412] 40 test processes started, 40 test processes running, 66 tests started, bug replicated: 0
2022-07-28 18:51:14.906558757  [3452412] 40 test processes started, 40 test processes running, 66 tests started, bug replicated: 0
2022-07-28 18:51:17.575708075  [3452412] child pid 3452428 exiting
2022-07-28 18:51:17.575783604  [3452412] child 15 replicated the bug
...
2022-07-28 18:51:25.636657183  [3452412] child pid 3452413 exiting
2022-07-28 18:51:25.636690800  [3452412] child 0 exited without replicating the bug
2022-07-28 18:51:25.636699962  [3452412] exiting after replicating the bug 4 time

(in this case, 4 of the 40 processes replicated the bug)

the zeroed bytes are always the last 3328 bytes of the 4096-byte page (768 + 3328 = 4096), i.e. the first 3328 bytes written by the higher-numbered thread; the first 768 bytes of the page contain the expected data (the tail of the lower-numbered thread's write).

logs/test15.log
2022-07-28 18:50:54.996502725  [3452428] opened './testfiles/test15', which is inode 157 on device 252:32
2022-07-28 18:50:54.996780933  [3452505 / 1] pid 3452412 child is tid 3452505
2022-07-28 18:50:54.996789468  [3452505 / 1] filesize=0x6400300 initial offset=0x100300 step=3
2022-07-28 18:50:54.997995547  [3452506 / 2] pid 3452412 child is tid 3452506
2022-07-28 18:50:54.998010562  [3452506 / 2] filesize=0x6400300 initial offset=0x200300 step=3
...
2022-07-28 18:51:17.114529951  [3452504 / 0] all writes complete
2022-07-28 18:51:17.116878403  [3452428] verifying file contents
2022-07-28 18:51:17.548658338  [3452428] error: found zero bytes at offset 0x04600300
0x046002c0: 5e 5e 5e 5e 5e 5e 5e 5e | 5e 5e 5e 5e 5e 5e 5e 5e  |^^^^^^^^^^^^^^^^|
0x046002d0: 5e 5e 5e 5e 5e 5e 5e 5e | 5e 5e 5e 5e 5e 5e 5e 5e  |^^^^^^^^^^^^^^^^|
0x046002e0: 5e 5e 5e 5e 5e 5e 5e 5e | 5e 5e 5e 5e 5e 5e 5e 5e  |^^^^^^^^^^^^^^^^|
0x046002f0: 5e 5e 5e 5e 5e 5e 5e 5e | 5e 5e 5e 5e 5e 5e 5e 5e  |^^^^^^^^^^^^^^^^|
0x04600300: 00 00 00 00 00 00 00 00 | 00 00 00 00 00 00 00 00  |................|
0x04600310: 00 00 00 00 00 00 00 00 | 00 00 00 00 00 00 00 00  |................|
0x04600320: 00 00 00 00 00 00 00 00 | 00 00 00 00 00 00 00 00  |................|
0x04600330: 00 00 00 00 00 00 00 00 | 00 00 00 00 00 00 00 00  |................|
2022-07-28 18:51:17.574989030  [3452428] test proc 15 replicated the bug on test 1


Expected results:

The page contains the data as written by both processes.


Additional info:

the writes do not overlap; however, thread 0 fills the first 768 bytes of the page in which thread 1's write begins at offset 768, and the same is true for threads 1 and 2, so two threads always share the page that straddles the boundary between their writes.

the reproducer creates the test files in testfiles/test[0-9]+ and writes a separate log file for each test process to logs/test[0-9]+.log
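
The verification step itself amounts to scanning the file for zero bytes anywhere pattern data was written; a minimal sketch (illustrative only; the attached reproducer also maps each bad range back to the thread and write that produced it):

/* build: gcc -Wall -g verify_sketch.c -o verify_sketch */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int fd = open(argc > 1 ? argv[1] : "testfiles/test0", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    unsigned char buf[4096];
    off_t off = 0x300;          /* pattern data starts here; it is never 0x00 */
    ssize_t n;
    int bad = 0;

    while ((n = pread(fd, buf, sizeof(buf), off)) > 0) {
        for (ssize_t i = 0; i < n; i++) {
            if (buf[i] == 0) {
                printf("zero byte at offset 0x%llx\n",
                       (unsigned long long)(off + i));
                bad = 1;
                break;          /* report only the first hit per block */
            }
        }
        off += n;
    }
    close(fd);
    return bad;
}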

Comment 1 Frank Sorenson 2022-07-29 00:20:19 UTC
Created attachment 1900080 [details]
test run data

ran the reproducer and got 5 files with zeroed data.

attaching a tarball of the logged results.

# grep zero logs/* -l
logs/test0.log
logs/test13.log
logs/test31.log
logs/test33.log
logs/test8.log

# for i in 0 8 13 31 33 ; do echo test$i ; egrep zero logs/test$i.log -B2 -A10 ; hexdump -C testfiles/test$i ; echo "==============================================" ; done > hexdumps.out


tarball also contains the trace.dat from 
 # trace-cmd record -p nop -e 'xfs:*'

Comment 2 Dave Chinner 2022-07-29 02:00:33 UTC
The test is doing buffered writes, so by definition we can't have simultaneous writes occurring to a single file. i.e. buffered writes are all done under XFS_IOLOCK_EXCL - we can only have one buffered write (modifying data, zeroing around EOF, or moving EOF) running at any point in time. So I can't see how there's a race between the threads writing data into the file that would cause this.

This smells of EOF block zeroing during writeback racing with incoming writes updating the file size - the only place we zero the tail of a newly written page is during writeback when EOF lands inside the page, and that can run simultaneously with incoming writes. But we do that zeroing in writeback while we hold the page lock, and the only place in the incoming write path where we change the inode size is when we copy new data into the EOF block, which is also done under the page lock. So we still have exclusion between writing new data into the EOF block and writeback zeroing the empty part of that block.

Hence from looking at the code, I can't explain how this corruption is occurring and it gives me no insight into how to reproduce it. And I haven't been able to reproduce it locally with the test code, so I can't work out the cause by observation, either.

Frank, have you reproduced this on an in-house system? What is the hardware config of the system this is being reproduced on, and what is the storage/filesystem config it is being run on? Does ext4 reproduce a similar issue, or is it just an XFS issue?

-Dave.

Comment 4 Frank Sorenson 2022-07-29 03:50:11 UTC
(In reply to Dave Chinner from comment #2)

> Frank, have you reproduced this on an in-house system? What is the hardware
> config of the system this is being reproduced on, and what is the
> storage/filesystem config it is being run on? Does ext4 reproduce a similar
> issue, or is it just an XFS issue?


The customer initially reported this on aarch64 (Fujitsu MP1000 - 50 cpus, 32 GiB RAM), then reproduced it on x86_64 (KVM - 2 cpus, 3.7 GiB RAM)


I've reproduced this on my test VM:
	KVM - 8 cpus, 2 GiB RAM
	tested both qcow2 and raw types - filesystem image files are on xfs backing

meta-data=/dev/vdc               isize=512    agcount=4, agsize=3276800 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1
data     =                       bsize=4096   blocks=13107200, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=6400, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

(reproduced on 5 of the 40 test processes in 13 seconds)



I have not been able to reproduce this with ext4

Comment 5 Dave Chinner 2022-07-29 04:43:05 UTC
(In reply to Frank Sorenson from comment #1)
> Created attachment 1900080 [details]
> # grep zero logs/* -l
> logs/test0.log
> logs/test13.log
> logs/test31.log
> logs/test33.log
> logs/test8.log

Frank, what are the inode numbers of these files? The traces index all the file operations by inode number, and the filename->inode number information is not in the traces so it's hard to correlate the operations recorded with the corruptions in the files...

Also, if you unmount/mount and recheck the files for corruption, is it still there? i.e. is it only corrupting the page cache in memory, or are the zeros actually getting written to disk?

And, on a long shot - what happens if you modify the test program to unlink and recreate the files between each iteration of the test, instead of just opening them with O_TRUNC? i.e. make sure the initial state of the file for each test is identical and there's nothing being left behind in the page cache from previous test runs, to rule out a truncate bug....

-Dave.

Comment 6 Frank Sorenson 2022-07-29 11:37:39 UTC
# grep -l zero logs/test* | xargs head -n1
==> logs/test0.log <==
2022-07-28 19:08:09.313059226  [3454771] opened './testfiles/test0', which is inode 149 on device 252:32

==> logs/test13.log <==
2022-07-28 19:08:09.367999567  [3454787] opened './testfiles/test13', which is inode 159 on device 252:32

==> logs/test31.log <==
2022-07-28 19:08:09.462633591  [3454805] opened './testfiles/test31', which is inode 172 on device 252:32

==> logs/test33.log <==
2022-07-28 19:08:09.450968160  [3454807] opened './testfiles/test33', which is inode 178 on device 252:32

==> logs/test8.log <==
2022-07-28 19:08:09.366622996  [3454782] opened './testfiles/test8', which is inode 155 on device 252:32


the log files also have the tid of the writing threads:

logs/test0.log

2022-07-28 19:08:09.313059226  [3454771] opened './testfiles/test0', which is inode 149 on device 252:32
2022-07-28 19:08:09.313946242  [3454778 / 0] pid 3454765 child is tid 3454778
2022-07-28 19:08:09.314026552  [3454778 / 0] filesize=0x6400300 initial offset=0x300 step=3
2022-07-28 19:08:09.314116471  [3454779 / 1] pid 3454765 child is tid 3454779
2022-07-28 19:08:09.314196609  [3454779 / 1] filesize=0x6400300 initial offset=0x100300 step=3
2022-07-28 19:08:09.314597478  [3454781 / 2] pid 3454765 child is tid 3454781
2022-07-28 19:08:09.314621304  [3454781 / 2] filesize=0x6400300 initial offset=0x200300 step=3
...
2022-07-28 19:08:19.357378024  [3454778 / 0] offset=0x6000300 count=0x100000 c=0x43 (2) starting
2022-07-28 19:08:19.357386652  [3454781 / 2] offset=0x6200300 count=0x100000 c=0x45 (4) starting
2022-07-28 19:08:19.357413114  [3454779 / 1] offset=0x6100300 count=0x100000 c=0x44 (3) starting
2022-07-28 19:08:19.358070271  [3454778 / 0] offset=0x6000300 count=0x100000 c=0x43 complete (0x100000)
2022-07-28 19:08:19.435329367  [3454781 / 2] offset=0x6200300 count=0x100000 c=0x45 complete (0x100000)
2022-07-28 19:08:19.435405982  [3454781 / 2] all writes complete
2022-07-28 19:08:19.435431267  [3454781 / 2] waiting for other threads to complete
2022-07-28 19:08:19.760010128  [3454779 / 1] offset=0x6100300 count=0x100000 c=0x44 complete (0x100000)
2022-07-28 19:08:19.760041545  [3454779 / 1] all writes complete
2022-07-28 19:08:19.760046715  [3454779 / 1] waiting for other threads to complete
2022-07-28 19:08:19.760309096  [3454778 / 0] offset=0x6300300 count=0x100000 c=0x46 (5) starting
2022-07-28 19:08:19.762794774  [3454778 / 0] offset=0x6300300 count=0x100000 c=0x46 complete (0x100000)
2022-07-28 19:08:19.762829844  [3454778 / 0] all writes complete
2022-07-28 19:08:19.763432871  [3454771] verifying file contents
2022-07-28 19:08:20.756985111  [3454771] error: found zero bytes at offset 0x06200300
0x062002c0: 44 44 44 44 44 44 44 44 | 44 44 44 44 44 44 44 44  |DDDDDDDDDDDDDDDD|
0x062002d0: 44 44 44 44 44 44 44 44 | 44 44 44 44 44 44 44 44  |DDDDDDDDDDDDDDDD|
0x062002e0: 44 44 44 44 44 44 44 44 | 44 44 44 44 44 44 44 44  |DDDDDDDDDDDDDDDD|
0x062002f0: 44 44 44 44 44 44 44 44 | 44 44 44 44 44 44 44 44  |DDDDDDDDDDDDDDDD|
0x06200300: 00 00 00 00 00 00 00 00 | 00 00 00 00 00 00 00 00  |................|
0x06200310: 00 00 00 00 00 00 00 00 | 00 00 00 00 00 00 00 00  |................|
0x06200320: 00 00 00 00 00 00 00 00 | 00 00 00 00 00 00 00 00  |................|
0x06200330: 00 00 00 00 00 00 00 00 | 00 00 00 00 00 00 00 00  |................|
2022-07-28 19:08:20.792478877  [3454771] test proc 0 replicated the bug on test 1


so this is operating on test0 - device 252:32 inode number 149
tid 3454781 started to write 1 MiB at 0x6200300-
and tid 3454779 started to write 1 MiB at 0x6100300-

    (times are self-reported, so may not necessarily align with the syscall start/complete times)
    2022-07-28 19:08:19.357378024  [3454778 / 0] offset=0x6000300 count=0x100000 c=0x43 (2) starting
    2022-07-28 19:08:19.357386652  [3454781 / 2] offset=0x6200300 count=0x100000 c=0x45 (4) starting  << the thread which wrote at 0x6200300-
    2022-07-28 19:08:19.357413114  [3454779 / 1] offset=0x6100300 count=0x100000 c=0x44 (3) starting  << the thread which wrote at 0x6100300-0x62002ff

    2022-07-28 19:08:19.358070271  [3454778 / 0] offset=0x6000300 count=0x100000 c=0x43 complete (0x100000)
    2022-07-28 19:08:19.435329367  [3454781 / 2] offset=0x6200300 count=0x100000 c=0x45 complete (0x100000)  << the thread which wrote at 0x6200300-
    2022-07-28 19:08:19.760010128  [3454779 / 1] offset=0x6100300 count=0x100000 c=0x44 complete (0x100000)  << the thread which wrote at 0x6100300-0x62002ff

the relevant events in the trace would be:

	$ trace-cmd report -F 'xfs/:dev==0xfc00020 && ino==149' -t

(or for events which aren't inode-specific (though this also includes all the events which involve inodes as well):
	$ trace-cmd report -F 'xfs/:dev==0xfc00020' -t


It looks like we can combine these:

	$ trace-cmd report -F 'xfs/:dev==0xfc00020 && !ino!=149' -t


these particular writes, for getting our bearings:

	$ trace-cmd report -F 'xfs_file_buffered_write/: dev==0xfc00020 && !ino!=149 && (offset==0x6200300 || offset==0x6100300)' -t
	cpus=8
           <...>-3454781 [006] 3122356.893118267: xfs_file_buffered_write: dev 252:32 ino 0x95 size 0x5c7e000 offset 0x6200300 count 0x100000
           <...>-3454779 [005] 3122356.970414183: xfs_file_buffered_write: dev 252:32 ino 0x95 size 0x5c7e000 offset 0x6100300 count 0x100000



I added an unlink and retested without O_TRUNC; the bug still occurs.

Comment 7 Frank Sorenson 2022-07-29 21:26:45 UTC
(In reply to Dave Chinner from comment #5)

> Also, if you unmount/mount and recheck the files for corruption, is it still
> there? i.e. is it only corrupting the page cache in memory, or are teh zeros
> actually getting written to disk?

I can confirm that the corruption is still there after unmount/mount.

Comment 8 Dave Chinner 2022-07-29 22:56:49 UTC
(In reply to Frank Sorenson from comment #6)
> # grep -l zero logs/test* | xargs head -n1
> ==> logs/test0.log <==
> 2022-07-28 19:08:09.313059226  [3454771] opened './testfiles/test0', which
> is inode 149 on device 252:32

Thanks, I missed this.

> the relevant events in the trace would be:
> 
> 	$ trace-cmd report -F 'xfs/:dev==0xfc00020 && ino==149' -t
> 
> (or for events which aren't inode-specific (though this also includes all
> the events which involve inodes as well):
> 	$ trace-cmd report -F 'xfs/:dev==0xfc00020' -t

Yup, trace-cmd filtering is mostly just an esoteric way to grep. When you have large trace files, parsing the trace file repeatedly to apply different filters is a lot slower than just dumping the entire trace to a text file once and running grep and awk on it repeatedly to filter it.

$ time trace-cmd report  -F 'xfs/:dev==0xfc00020' > f.t 

real	0m3.729s
user	0m3.267s
sys	0m0.275s

$ time grep "dev 232:32" t.t > tt.t

real	0m0.153s
user	0m0.080s
sys	0m0.053s

Sure, if you're only looking at one thing, you know exactly what you are looking for, and you're an expert in the filtering language, filtering can be faster. But I typically run 10s of different filters over a trace in the course of triage, and at that point sed, awk and grep are much easier, faster and more flexible than trace-cmd filtering, and I don't have to learn yet another filtering expression language.... :)

> these particular writes, for getting our bearings:
> 
> 	$ trace-cmd report -F 'xfs_file_buffered_write/: dev==0xfc00020 &&
> !ino!=149 && (offset==0x6200300 || offset==0x6100300)' -t
> 	cpus=8
>            <...>-3454781 [006] 3122356.893118267: xfs_file_buffered_write:
> dev 252:32 ino 0x95 size 0x5c7e000 offset 0x6200300 count 0x100000
>            <...>-3454779 [005] 3122356.970414183: xfs_file_buffered_write:
> dev 252:32 ino 0x95 size 0x5c7e000 offset 0x6100300 count 0x100000

I've already looked at this specific file corruption because the xfs_lookup trace is present for test0 so I knew it's inode number. There's no smoking gun for that file corruption, unfortunately, as the exact same pattern of writes, EOF zeroing and inode locking exclusion occurs earlier in the trace and there is no corruption at that point in the file....

Cheers,

Dave.

Comment 9 Frank Sorenson 2022-08-03 01:33:31 UTC
I just replicated this on RHEL 9 kernel 5.14.0-1.1.1.el9.x86_64

Comment 10 Eric Sandeen 2022-08-03 18:17:00 UTC
(In reply to Frank Sorenson from comment #9)
> I just replicated this on RHEL 9 kernel 5.14.0-1.1.1.el9.x86_64

Good to know, makes it unlikely that the problem is a backporting error in RHEL8.

Comment 11 Dave Chinner 2022-08-04 01:33:01 UTC
(In reply to Frank Sorenson from comment #9)
> I just replicated this on RHEL 9 kernel 5.14.0-1.1.1.el9.x86_64

Interesting.

I've been trying for the past hour to reproduce this on 5.10 (XFS in RHEL 8.4 is 5.10 based), 5.14 (RHEL 9 base), 5.15 (approximately where RHEL 9.0 was released) and current TOT, and I can't get the reproducer to trigger on any of the VMs or bare metal machines I have locally. So it's not clear yet whether there is a problem in the upstream kernels or whether it's something specific to RHEL kernels.

Frank, can you reproduce this on an unmodified upstream 5.14 or 5.15 kernel on the same OS image that the RHEL 9 kernel reproduces the problem on?

-Dave.

Comment 12 Frank Sorenson 2022-08-04 03:55:56 UTC
Created attachment 1903488 [details]
updated reproducer

Comment 13 Frank Sorenson 2022-08-04 03:57:35 UTC
(In reply to Dave Chinner from comment #11)

> Frank, can you reproduce this on an unmodified upstream 5.14 or 5.15 kernel
> on the same OS image that the RHEL 9 kernel reproduces the problem on?

Yes, will do, and let you know the results.

Comment 14 Pierguido Lambri 2022-08-04 15:54:50 UTC
I've tested Frank's reproducer on a RHEL 8 VM (2 vCPUs, 2 GB RAM) with these kernels:

4.18.0-372.16.1.el8_6.x86_64:

2022-08-04 16:47:50.044498141  [1797] child 35 (pid 1845) replicated the bug on test #1 with device 253:0 inode 33625091
2022-08-04 16:47:53.102051778  [1797] 40 test processes started, 22 test processes running, 74 tests started, bug replicated: 1

2022-08-04 16:47:50.037380611  [1845 / test proc 35] error: found zero bytes at offset 0x0bb00300 (196084480)
0x0bb002c0: 2f 2f 2f 2f 2f 2f 2f 2f | 2f 2f 2f 2f 2f 2f 2f 2f  |////////////////|
0x0bb002d0: 2f 2f 2f 2f 2f 2f 2f 2f | 2f 2f 2f 2f 2f 2f 2f 2f  |////////////////|
0x0bb002e0: 2f 2f 2f 2f 2f 2f 2f 2f | 2f 2f 2f 2f 2f 2f 2f 2f  |////////////////|
0x0bb002f0: 2f 2f 2f 2f 2f 2f 2f 2f | 2f 2f 2f 2f 2f 2f 2f 2f  |////////////////|
0x0bb00300: 00 00 00 00 00 00 00 00 | 00 00 00 00 00 00 00 00  |................|
0x0bb00310: 00 00 00 00 00 00 00 00 | 00 00 00 00 00 00 00 00  |................|
0x0bb00320: 00 00 00 00 00 00 00 00 | 00 00 00 00 00 00 00 00  |................|
0x0bb00330: 00 00 00 00 00 00 00 00 | 00 00 00 00 00 00 00 00  |................|
2022-08-04 16:47:50.043739431  [1845 / test proc 35] test proc 35 replicated the bug on test #1 with device 253:0 inode 33625091

# hexdump -C testfiles/test35 | grep ' 00 00 ' -C3
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000300  41 41 41 41 41 41 41 41  41 41 41 41 41 41 41 41  |AAAAAAAAAAAAAAAA|
*
--
*
0ba00300  2f 2f 2f 2f 2f 2f 2f 2f  2f 2f 2f 2f 2f 2f 2f 2f  |////////////////|
*
0bb00300  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
0bb01000  3f 3f 3f 3f 3f 3f 3f 3f  3f 3f 3f 3f 3f 3f 3f 3f  |????????????????|
*


4.18.0-372.20.1.el8_6.sfdc03278655.x86_64 (a kernel with a small NFS patch applied) had the same result (sorry, I didn't save the logs, but it seems to be easily reproducible).
A test is still running on RHEL 9 and doesn't seem to reproduce the bug (still running after 30+ minutes).

Comment 16 Frank Sorenson 2022-08-10 06:34:45 UTC
Created attachment 1904637 [details]
program to consume memory

program will fork one or more processes which will mmap() and mlock() a chunk of memory

# gcc -Wall eatmemmap.c -o eatmemmap

usage: eatmemmap <memory_size> [<process_count>]
  
    <memory_size> may be specified with a suffix [kKmMgG];
        if omitted, size is interpreted as KiB

    if <process_count> is given, multiple processes will be
        forked, each mapping the specified memory size

# ./eatmemmap 200M
forked 9434 to allocate 200M
#


simply kill the processes to free the memory
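
The core of the attachment is just mmap() plus mlock(); a minimal sketch of the approach (illustrative only; the attached eatmemmap.c adds the size-suffix parsing and multi-process support described above, and locking large sizes needs root or a raised RLIMIT_MEMLOCK):

/* build: gcc -Wall eatmem_sketch.c -o eatmem_sketch */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    size_t mib = argc > 1 ? strtoul(argv[1], NULL, 0) : 200;   /* size in MiB */
    size_t len = mib << 20;

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    memset(p, 0x5a, len);                  /* touch the pages */
    if (mlock(p, len) != 0) {              /* pin them so they can't be reclaimed */
        perror("mlock");
        return 1;
    }

    printf("pid %d locked %zu MiB; kill this process to free it\n",
           (int)getpid(), mib);
    pause();                               /* hold the memory until killed */
    return 0;
}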

Comment 17 Brian Foster 2022-08-11 12:17:12 UTC
Just a semi-random data point based on the lowmem observation described in the previous couple comments..

A similar corruption variant has been reported upstream [1] fairly recently via generic/522. It was originally presumed folio related (and it may still be), but eventually turned out to be related to a dirty page accounting problem/leak that more than likely exacerbated the issue via increased writeback/reclaim activity. IOW, the corruption problem became regularly reproducible due to a secondary issue that increased memory pressure. The corruption seemingly disappeared when the accounting problem was fixed, but was then discovered to be reproducible again when tested in an environment with artificially induced memory/writeback pressure (i.e. elevating dirty_[background_]bytes). There isn't enough context to say whether these corruption issues are the same or even related, but the broader point is that it wouldn't be terribly surprising if the upstream variant is a longer-standing issue than originally anticipated, only discovered due to the more recently introduced accounting problem.

[1] https://lore.kernel.org/linux-xfs/YrO243DkbckLTfP7@magnolia/

Comment 18 Dave Chinner 2022-08-11 22:42:49 UTC
(In reply to Frank Sorenson from comment #15)
> However, I think I'm zeroing in on the key to making this bug reproduce
> on-demand.
> 
> It appears that if I consume most of the system's memory, the bug will occur
> relatively quickly.

Hmmm. I tried that on my systems because I suspect a page uptodate vs EOF zeroing vs writeback issue as the cause. xfstests has a src/usemem.c that allows you to mlock up to 2047MB of RAM, so I ran half a dozen instances to lock about 14.5GB of RAM on the 16GB RAM test VM I was using. It still didn't reproduce, so I guess it's not a guaranteed way of reproducing the issue....

Frank, to cut down on random kernel testing, can you confirm the customer's observation that RHEL8.3 didn't reproduce the problem, whilst 8.4 did? There were some potentially interesting changes in 8.4 (like converting the page cache to use Xarray instead of radix trees) and I'm starting to wonder if a subtle change of locking and/or implied memory ordering is affecting the page flags as a result of some other structural change. Confirming that it doesn't reproduce on 8.3 would help narrow down the potential causes enormously.

- Dave.

Comment 19 Frank Sorenson 2022-08-15 21:47:25 UTC
(In reply to Dave Chinner from comment #18)

> Frank, to cut down on random kernel testing, can you confirm the customer's
> observation that RHEL8.3 didn't reproduce the problem, whilst 8.4 did? There
> were some potentially interesting changes in 8.4 (like converting the page
> cache to use Xarray instead of radix trees) and I'm starting to wonder if a
> subtle change of locking and/or implied memory ordering is affecting the
> page flags as a result of some other structural change. Confirming that it
> doesn't reproduce on 8.3 would help narrow down the potential causes
> enormously.

I can not reproduce this on 8.3 (4.18.0-240*)

I can reproduce the bug on 8.4+ (4.18.0-305*), RHEL 9, unmodified 5.14 and 5.15,
and last Friday's upstream kernel (v5.19-14090-g4a9350597aff)


I'll start looking at kernels between -240 and -305

Comment 20 Frank Sorenson 2022-08-16 17:31:41 UTC
I can replicate this as far back as 4.18.0-269.el8.x86_64, but have not been able to replicate the bug on 4.18.0-268.el8.x86_64 or earlier  (yet... I'm pretty certain, however -- "Absence of evidence is not evidence of absence" and all that... I plan to let it run overnight tonight for more confidence)


which would mean that the bug appears to have been introduced in one of the following:

$ git log --oneline kernel-4.18.0-268.el8..kernel-4.18.0-269.el8 -- mm fs/*.[ch] fs/xfs drivers/block | cat
76241485743f [fs] xfs: force writes to delalloc regions to unwritten
2c5f5435cb50 [fs] xfs: refactor xfs_iomap_prealloc_size
d7229798b05e [fs] xfs: measure all contiguous previous extents for prealloc size
34fc77215098 [fs] xfs: don't fail unwritten extent conversion on writeback due to edquot
ef55ab3db5bf [fs] xfs: allow individual quota grace period extension
b28eef8ab5e8 [fs] xfs: per-type quota timers and warn limits
f293c40fdb40 [fs] xfs: switch xfs_get_defquota to take explicit type
66529fed37aa [fs] xfs: pass xfs_dquot to xfs_qm_adjust_dqtimers
3acb9d1d67a4 [fs] xfs: fix up some whitespace in quota code
97c94c002e90 [fs] xfs: preserve default grace interval during quotacheck

(though only the first three would reasonably be suspected)



I've had the most success replicating the bug when I first chew up all but ~50-60 MiB of the 'available' memory, as displayed by 'free'

# sync ; sysctl vm.drop_caches=3 ; free -m ; echo ; size=$(free -m | awk '$1=="Mem:" {print ($NF-50)}') ; ./eatmemmap ${size}M ; sleep 2 ; free -m
vm.drop_caches = 3
              total        used        free      shared  buff/cache   available
Mem:           7549         240        7235           1          74        7134
Swap:          7951          60        7891

forked 1 process, allocating 7084M
              total        used        free      shared  buff/cache   available
Mem:           7549        7338         132           1          79          33
Swap:          7951          60        7891

then starting the test program:

# ./test_write_rhbz2112147 -p 25 -s 25M -t 25 -b 16384 -o 768 -V .
...
2022-08-16 12:15:31.901986827  [59480] test proc 4 (pid 59491) replicated the bug on test #1 with device 253:0 inode 135661946 at offset 0x30300 (197376)
...
==========================================================
replicated the bug 8 times
/var/tmp/testfiles/test1 - device 253:0 inode 135661942 - 1 error
	  1 - offset 0x4300 (17152) - 0x4fff (20479) - length: 3328
/var/tmp/testfiles/test2 - device 253:0 inode 135661943 - 1 error
	  1 - offset 0x4c300 (312064) - 0x4cfff (315391) - length: 3328
/var/tmp/testfiles/test4 - device 253:0 inode 135661946 - 1 error
	  1 - offset 0x30300 (197376) - 0x30fff (200703) - length: 3328
/var/tmp/testfiles/test9 - device 253:0 inode 135742183 - 1 error
	  1 - offset 0x8300 (33536) - 0x8fff (36863) - length: 3328
/var/tmp/testfiles/test11 - device 253:0 inode 135742184 - 1 error
	  1 - offset 0x54300 (344832) - 0x54fff (348159) - length: 3328
/var/tmp/testfiles/test18 - device 253:0 inode 135742192 - 1 error
	  1 - offset 0x3c300 (246528) - 0x3cfff (249855) - length: 3328
/var/tmp/testfiles/test20 - device 253:0 inode 135742195 - 1 error
	  1 - offset 0x20300 (131840) - 0x20fff (135167) - length: 3328
/var/tmp/testfiles/test22 - device 253:0 inode 135742196 - 1 error
	  1 - offset 0x4300 (17152) - 0x4fff (20479) - length: 3328
results logged to /var/tmp/log.out


(I'm still trying to zero in on the magic combination of memory and test parameters; I'll update if/when I find a bulletproof reproducer, or at least bullet-resistant)

Comment 21 Frank Sorenson 2022-08-17 15:49:35 UTC
Dave,

Per your explanation at https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/, I wonder if we couldn't artificially force the system to reclaim without having to push it to the brink of OOM first? I've tried very aggressive sysctl vm.* settings, which help force writeback, but only extreme memory pressure is causing the reclaim.

Off to dig into the kernel.  I envision vmcores in my future.

Comment 22 Frank Sorenson 2022-08-26 21:01:41 UTC
Created attachment 1907936 [details]
updated reproducer

So, I went a different direction.


Rather than carefully trying to push the entire system just to the brink of OOM, I'm throwing the test processes into a cgroup and setting a very low memory limit for the cgroup.

Only the processes in the cgroup are under memory pressure, so the system doesn't suffer as a whole, and with low-enough limits, the processes in the cgroup are _always_ reclaiming.


This version will probably work with just the directory parameter; it tries to spawn 1 test process for each online CPU, sets very aggressive memory limits (based on the actual usage of the test threads at steady-state), and should work with both cgroups v1 and v2 under /sys/fs/cgroup

(obviously requires root in order to play with cgroups, etc.)
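
Roughly, the cgroup part amounts to the sketch below (illustrative only; it assumes cgroups v2 mounted at /sys/fs/cgroup with the memory controller enabled, and the limits are arbitrary example values, whereas the reproducer derives them from measured steady-state usage and also handles cgroups v1):

/* build: gcc -Wall cgroup_sketch.c -o cgroup_sketch   (run as root) */
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

static void write_file(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); exit(1); }
    fputs(val, f);
    if (fclose(f) != 0) { perror(path); exit(1); }
}

int main(int argc, char **argv)
{
    char pid[32];

    if (argc < 2) {
        fprintf(stderr, "usage: %s <command> [args...]\n", argv[0]);
        return 1;
    }

    /* create the cgroup and cap its memory (EEXIST from mkdir is harmless) */
    mkdir("/sys/fs/cgroup/wrtest", 0755);
    write_file("/sys/fs/cgroup/wrtest/memory.high", "41943040");   /* ~40 MiB */
    write_file("/sys/fs/cgroup/wrtest/memory.max",  "54525952");   /* ~52 MiB */

    /* move this process (and therefore its children) into the cgroup, then exec */
    snprintf(pid, sizeof(pid), "%d", (int)getpid());
    write_file("/sys/fs/cgroup/wrtest/cgroup.procs", pid);

    execvp(argv[1], &argv[1]);
    perror("execvp");
    return 1;
}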


# gcc -Wall -lpthread -lm -g test_write_rhbz2112147.c -o test_write_rhbz2112147
# /mnt/vdc/test_write_rhbz2112147 /mnt/vdc
2022-08-26 15:55:35.801984700  [186866] test running on 'vm9' arch 'x86_64' kernel '4.18.0-348.23.1.el8_5.x86_64'
2022-08-26 15:55:35.802066659  [186866] base directory for testing is '/mnt/vdc'
2022-08-26 15:55:35.802116518  [186866] size of each testfile is 0x12c00300 (314573568 - 300.00 MiB) bytes, and buffer size will be 0x100000 (1048576 - 1.00 MiB)
2022-08-26 15:55:35.802174202  [186866] initial offset is 0x300 (768)
2022-08-26 15:55:35.802200697  [186866] creating 8 test processes, each having 3 threads
...
2022-08-26 15:55:36.829056310  [186866] 8/8 running, writing: 8, verifying: 0, 8 tests started, replicated: 0
2022-08-26 15:55:36.829183361  [186866] current/minimum memory usage: 25.67 MiB
2022-08-26 15:55:36.829191647  [186866] setting cgroup 'high' memory size: 38.51 MiB
2022-08-26 15:55:36.829226467  [186866] setting cgroup 'max' memory size: 51.35 MiB
...
2022-08-26 15:55:37.829124430  [186866] 8/8 running, writing: 8, verifying: 0, 8 tests started, replicated: 0
...
2022-08-26 15:56:02.147933348  [186866] test proc 5 (pid 186872) replicated the bug 6 times on test #1 with device 252:32 inode 176
	see the logfile '/mnt/vdc/logs/test5.log' for further details
2022-08-26 15:56:02.148026896  [186866] 0/8 running, writing: 0, verifying: 0, 8 tests started, replicated: 8
==========================================================
replicated the bug 8 times
/mnt/vdc/testfiles/test0 - device 252:32 inode 61785 - 4 errors
	  1 - offset 0x4600300 (73401088) - 0x4600fff (73404415) - length: 3328
		start is page 17920, offset 768; byte 0 of write 23 by thread 1
		end is page 17920, offset 4095; byte 3327 of write 23 by thread 1
...

40 total errors in 8 total test files
results logged to /mnt/vdc/log.out
per-test results are logged in /mnt/vdc/logs/
total runtime: 26.354212983 seconds


# hexdump -C /mnt/vdc/testfiles/test0 | grep -v ^\* | grep '00 00' -B1 -A2
04500300  65 65 65 65 65 65 65 65  65 65 65 65 65 65 65 65  |eeeeeeeeeeeeeeee|
04600300  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
04601000  66 66 66 66 66 66 66 66  66 66 66 66 66 66 66 66  |ffffffffffffffff|
04700300  67 67 67 67 67 67 67 67  67 67 67 67 67 67 67 67  |gggggggggggggggg|
--
0b100300  72 72 72 72 72 72 72 72  72 72 72 72 72 72 72 72  |rrrrrrrrrrrrrrrr|
0b200300  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
0b201000  73 73 73 73 73 73 73 73  73 73 73 73 73 73 73 73  |ssssssssssssssss|
0b300300  74 74 74 74 74 74 74 74  74 74 74 74 74 74 74 74  |tttttttttttttttt|
--
0dc00300  3e 3e 3e 3e 3e 3e 3e 3e  3e 3e 3e 3e 3e 3e 3e 3e  |>>>>>>>>>>>>>>>>|
0dd00300  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
0dd01000  3f 3f 3f 3f 3f 3f 3f 3f  3f 3f 3f 3f 3f 3f 3f 3f  |????????????????|
0de00300  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40  |@@@@@@@@@@@@@@@@|
--
12300300  26 26 26 26 26 26 26 26  26 26 26 26 26 26 26 26  |&&&&&&&&&&&&&&&&|
12400300  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
12401000  27 27 27 27 27 27 27 27  27 27 27 27 27 27 27 27  |''''''''''''''''|
12500300  28 28 28 28 28 28 28 28  28 28 28 28 28 28 28 28  |((((((((((((((((|

Comment 24 Dave Chinner 2022-09-21 08:43:59 UTC
First pass at a fix for this in the upstream kernel:

https://lore.kernel.org/linux-xfs/20220921082959.1411675-1-david@fromorbit.com/

Frank, your latest reproducer works on my upstream test machines and reproduces the issue on 6.0-rc6 in under 10s. I've been able to validate the above series fixes the data corruption with the reproducer - it's been running for over an hour without reproducing a corruption with the patches in place.

I suspect the fix is going to take a few iterations to sort out - the fixes I've made are awful and hacky, but I'm not able to come up with anything better right now. I also suspect that the fix is going to be a bit of a pain to backport because of the different iomap iterator structures in upstream vs RHEL8...

-Dave.

Comment 25 Frank Sorenson 2022-10-04 23:03:48 UTC
Created attachment 1916098 [details]
xfs_io-based reproducer

It's embarrassing to see all the work that went into the original reproducer pruned down to 75 lines of bash code calling xfs_io, but now that I know the necessary conditions...  this script seems to reproduce the bug just as well.

Comment 112 errata-xmlrpc 2023-11-14 15:37:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: kernel security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:7077

