Bug 1573189 - parallel aio file extending writes can result in corrupted on-disk superblock fdblocks value
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kernel
Version: 7.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: fs-maint
QA Contact: Zorro Lang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-04-30 12:48 UTC by Frank Sorenson
Modified: 2019-01-15 23:38 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-01-15 23:38:09 UTC
Target Upstream Version:


Attachments
stand-alone reproducer (5.21 KB, text/x-csrc)
2018-04-30 12:48 UTC, Frank Sorenson
trace (818.91 KB, text/plain)
2018-05-09 22:39 UTC, Frank Sorenson
full trace (14.25 MB, application/x-xz)
2018-05-10 11:57 UTC, Frank Sorenson
dmesg from second vmcore in comment 19 (43.02 KB, text/plain)
2018-05-14 22:11 UTC, Frank Sorenson
xfs trace from second vmcore in comment 19 (190.62 KB, text/plain)
2018-05-14 22:13 UTC, Frank Sorenson
slightly faster reproducer (6.78 KB, text/x-csrc)
2018-05-15 20:46 UTC, Frank Sorenson

Description Frank Sorenson 2018-04-30 12:48:25 UTC
Created attachment 1428816 [details]
stand-alone reproducer

Description of problem:

A program performing parallel AIO file-extending writes on an XFS filesystem can sometimes leave the superblock's fdblocks count significantly incorrect, causing 'df' output to show many TiB of negative free space on a filesystem whose size is in the hundred-GiB range.

The per-AG counts are correct, and xfs_repair recomputes the correct fdblocks count.


Version-Release number of selected component (if applicable):

RHEL 7.4 with 3.10.0-693.21.1.el7.x86_64 kernel


How reproducible:

Mixed results. Reported to reproduce either 100% or 0% of the time so far (though the sample set is just 3 at this point: 2 positive, 1 negative).


Steps to Reproduce:


Originally reported to occur during the preparation phase of iometer.

A stand-alone reproducer was then created which approximates the syscalls made by iometer, and reproduces the incorrect block count.

# mkfs.xfs -f /dev/vde ; mount /dev/vde /mnt/vde
meta-data=/dev/vde               isize=512    agcount=4, agsize=9830400 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=0, sparse=0
data     =                       bsize=4096   blocks=39321600, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=19200, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

# df -h /mnt/vde
Filesystem      Size  Used Avail Use% Mounted on
/dev/vde        150G   33M  150G   1% /mnt/vde

# ./aio_df_repl-1 /mnt/vde/testfile
iteration 1
iteration 2
iteration 3
replicated: TRUE


Actual results:

# df -h /mnt/vde
Filesystem      Size  Used Avail Use% Mounted on
/dev/vde        150G  -16T   17T    - /mnt/vde


Expected results:

the in-use block count remains accurate


Additional info:

# for AG in $(seq 0 3); do xfs_db -c "agf $AG" -c "print freeblks flcount btreeblks" /dev/vde ; done | awk '{total=total+$3;} END{print "Total is " total}'
Total is 39253223

# xfs_db -c sb -c "print dblocks fdblocks agblocks agcount" /dev/vde

fdblocks = 4334220519
agblocks = 9830400
agcount = 4

# echo $(( 9830400 * 4 ))
39321600

# xfs_repair /dev/vde
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
sb_fdblocks 4334220519, counted 39253223
...

# printf %x\\n 39253223
256f4e7

# printf %x\\n 4334220519
10256f4e7

Comment 2 Frank Sorenson 2018-04-30 14:55:34 UTC
# df -h /mnt/vde
Filesystem      Size  Used Avail Use% Mounted on
/dev/vde        150G  -16T   17T    - /mnt/vde

# df -B 4096 /mnt/vde
Filesystem     4K-blocks        Used  Available Use% Mounted on
/dev/vde        39302400 -4294875091 4334177491    - /mnt/vde


# xfs_db -f -c sb -c 'print fdblocks' /dev/vde
fdblocks = 4334177511


crash> px 4334177511
$2 = 0x102564ce7

crash> mount
     MOUNT           SUPERBLK     TYPE   DEVNAME   DIRNAME
ffff88001444d200 ffff88006adab000 xfs    /dev/vde  /mnt/vde  

crash> super_block.s_fs_info ffff88006adab000
  s_fs_info = 0xffff88005ffda000

crash> xfs_mount.m_fdblocks 0xffff88005ffda000
  m_fdblocks = {
    lock = {
      raw_lock = {
        val = {
          counter = 0x0
        }
      }
    }, 
    count = 0x102564cdf, 
    list = {
      next = 0xffff88005ffda160, 
      prev = 0xffffffff81abb880 <percpu_counters>
    }, 
    counters = 0x60ff8f800b7c
  }

Comment 3 Frank Sorenson 2018-04-30 15:15:16 UTC
  This is reproducing for me within the first 3 iterations on my test vm:

4 cpus
1.7 GiB of memory
model name      : Intel Xeon E312xx (Sandy Bridge)
cpu MHz         : 1995.192


the reproducer (patterned after iometer) works as follows:

parent process
  truncate-creates the test file
  starts 8 threads
      each thread starts 16 aio pwrites of 512 KiB sequentially from the beginning of the file
      each thread waits for completion of all 16 aio writes
      each thread then begins 16 aio pwrites of 512 KiB to the next 16 512KiB 'blocks' of the file
      if the child threads reach an arbitrary file size (currently 5 GiB), they exit and the parent process begins the test again

parent process then checks free blocks on the filesystem, and if 'free blocks' becomes larger than the filesystem size, terminates all the children and exits successfully


My filesystem size is 150 GiB, but I suspect that any filesystem whose block count fits within the lower 32 bits could experience the bug, since the invalid block count appears to always be off by a multiple of 16 TiB.


crash> pd (0xffffffff)
$4 = 4294967295

crash> pd (0xffffffffULL * 4096 / 1024.0/1024.0/1024.0/1024.0)
$9 = 15.99999999627471

crash> pd (0x100000000 * 4096 / 1024.0/1024.0/1024.0/1024.0)
$7 = 16

Comment 4 Frank Sorenson 2018-05-05 20:24:03 UTC
the invalid free blocks count occurs when xfs_mod_fdblocks is called with a ridiculously large 'delta' value.  In fact, it's so large, it looks like it may be a 32-bit negative number.  For example:

kworker/3:1 (2755) - ALERT: xfs_mod_fdblocks (delta = 4294967049)

# printf 0x%x\\n 4294967049
0xffffff09

# echo $(( (1<<32) - 0xffffff09))
247

this occurred in the following backtrace:
 0xffffffffc04a93f0 : xfs_mod_fdblocks+0x0/0x1b0 [xfs]
 0xffffffffc04aeee3 : xfs_trans_unreserve_and_mod_sb+0x1c3/0x4b0 [xfs]
 0xffffffffc04b5b70 : xfs_log_commit_cil+0x440/0x670 [xfs]
 0xffffffffc04af43d : __xfs_trans_commit+0x12d/0x260 [xfs]
 0xffffffffc04af830 : xfs_trans_commit+0x10/0x20 [xfs]
 0xffffffffc049f68a : xfs_iomap_write_unwritten+0x1aa/0x300 [xfs]
 0xffffffffc04898b8 : xfs_end_io+0x78/0x80 [xfs]
 0xffffffffc048b0d7 : xfs_end_io_direct_write+0x67/0x140 [xfs]
 0xffffffffaee5b5d5 : dio_complete+0x125/0x2a0 [kernel]


In another test run, I got the following trace:
    kworker/2:19-15843 [002] 1525505049.628362: xfs_buf_item_format:  dev 253:48 bno 0x640008 len 0x1000 hold 3 pincount 1 lock 0 flags ASYNC|DONE|PAGES recur 0 refcount 2 bliflags DIRTY|LOGGE
D lidesc 0x0xffff8f295acd02d0 liflags IN_AIL
    kworker/2:19-15843 [002] 1525505049.628362: xfs_buf_item_format:  dev 253:48 bno 0x640010 len 0x1000 hold 3 pincount 1 lock 0 flags ASYNC|DONE|PAGES recur 0 refcount 2 bliflags DIRTY|LOGGE
D lidesc 0x0xffff8f295acd1d88 liflags IN_AIL
    kworker/2:19-15843 [002] 1525505049.628363: xfs_log_done_nonperm: dev 253:48 t_ocnt 2 t_cnt 1 t_curr_res 112676 t_unit_res 112720 t_flags XLOG_TIC_INITED|XLOG_TIC_PERM_RESERV reserveq empt
y writeq empty grant_reserve_cycle 1 grant_reserve_bytes 431072 grant_write_cycle 1 grant_write_bytes 431072 curr_cycle 1 curr_block 175 tail_cycle 1 tail_block 171
    kworker/2:19-15843 [002] 1525505049.628365: print:                xlog_ungrant_log_space: CALL:   kworker/2:19 - m_fdblocks = 11935911
    kworker/2:19-15843 [002] 1525505049.628367: xfs_log_ungrant_enter: dev 253:48 t_ocnt 2 t_cnt 0 t_curr_res 112676 t_unit_res 112720 t_flags XLOG_TIC_INITED|XLOG_TIC_PERM_RESERV reserveq emp
ty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 431072 grant_write_cycle 1 grant_write_bytes 431072 curr_cycle 1 curr_block 175 tail_cycle 1 tail_block 171
    kworker/2:19-15843 [002] 1525505049.628367: xfs_log_ungrant_sub:  dev 253:48 t_ocnt 2 t_cnt 0 t_curr_res 112676 t_unit_res 112720 t_flags XLOG_TIC_INITED|XLOG_TIC_PERM_RESERV reserveq empt
y writeq empty grant_reserve_cycle 1 grant_reserve_bytes 431072 grant_write_cycle 1 grant_write_bytes 431072 curr_cycle 1 curr_block 175 tail_cycle 1 tail_block 171
    kworker/2:19-15843 [002] 1525505049.628368: xfs_log_ungrant_exit: dev 253:48 t_ocnt 2 t_cnt 0 t_curr_res 112676 t_unit_res 112720 t_flags XLOG_TIC_INITED|XLOG_TIC_PERM_RESERV reserveq empt
y writeq empty grant_reserve_cycle 1 grant_reserve_bytes 318396 grant_write_cycle 1 grant_write_bytes 318396 curr_cycle 1 curr_block 175 tail_cycle 1 tail_block 171
    kworker/2:19-15843 [002] 1525505049.628370: print:                xlog_ungrant_log_space: RETURN: kworker/2:19 - m_fdblocks = 11935911
    kworker/2:19-15843 [002] 1525505049.628388: print:                xfs_mod_fdblocks: CALL:   kworker/2:19 - m_fdblocks = 11935911
    kworker/2:19-15843 [002] 1525505049.628392: print:                xfs_mod_fdblocks: RETURN: kworker/2:19 - m_fdblocks = 4306903088
    kworker/2:19-15843 [002] 1525505049.628399: xfs_buf_item_unlock:  dev 253:48 bno 0x640001 len 0x200 hold 3 pincount 1 lock 0 flags ASYNC|DONE|KMEM recur 0 refcount 2 bliflags DIRTY lidesc 0x(nil) liflags IN_AIL
    kworker/2:19-15843 [002] 1525505049.628400: xfs_buf_unlock:       dev 253:48 bno 0x640001 nblks 0x1 hold 3 pincount 1 lock 1 flags ASYNC|DONE|KMEM caller xfs_buf_item_unlock
    kworker/2:19-15843 [002] 1525505049.628400: xfs_buf_rele:         dev 253:48 bno 0x640001 nblks 0x1 hold 3 pincount 1 lock 1 flags ASYNC|DONE|KMEM caller xfs_buf_item_unlock
    kworker/2:19-15843 [002] 1525505049.628406: xfs_buf_item_unlock:  dev 253:48 bno 0x640008 len 0x1000 hold 3 pincount 1 lock 0 flags ASYNC|DONE|PAGES recur 0 refcount 2 bliflags DIRTY lidesc 0x(nil) liflags IN_AIL
    kworker/2:19-15843 [002] 1525505049.628406: xfs_buf_unlock:       dev 253:48 bno 0x640008 nblks 0x8 hold 3 pincount 1 lock 1 flags ASYNC|DONE|PAGES caller xfs_buf_item_unlock
    kworker/2:19-15843 [002] 1525505049.628406: xfs_buf_rele:         dev 253:48 bno 0x640008 nblks 0x8 hold 3 pincount 1 lock 1 flags ASYNC|DONE|PAGES caller xfs_buf_item_unlock
    kworker/2:19-15843 [002] 1525505049.628412: xfs_buf_item_unlock:  dev 253:48 bno 0x640010 len 0x1000 hold 3 pincount 1 lock 0 flags ASYNC|DONE|PAGES recur 0 refcount 2 bliflags DIRTY lidesc 0x(nil) liflags IN_AIL
    kworker/2:19-15843 [002] 1525505049.628412: xfs_buf_unlock:       dev 253:48 bno 0x640010 nblks 0x8 hold 3 pincount 1 lock 1 flags ASYNC|DONE|PAGES caller xfs_buf_item_unlock
    kworker/2:19-15843 [002] 1525505049.628413: xfs_buf_rele:         dev 253:48 bno 0x640010 nblks 0x8 hold 3 pincount 1 lock 1 flags ASYNC|DONE|PAGES caller xfs_buf_item_unlock
   aio_df_repl-1-16449 [000] 1525505049.628416: print:                xfs_fs_statfs: CALL:   aio_df_repl-1 - m_fdblocks = 4306903088
   aio_df_repl-1-16449 [000] 1525505049.628421: print:                xfs_fs_statfs: RETURN: aio_df_repl-1 - m_fdblocks = 4306903088

here again, xfs_mod_fdblocks was called with a delta close to the maximum 32-bit number, again hinting at the possibility of a negative 32-bit number.  The entry and exit values (I wasn't capturing 'delta' yet):

# echo $(( 4306903088 - 11935911 ))
4294967177

[root@vm6 tmp]# echo $(( (1<<32) - 4294967177))
119



and again in this test run, where the free blocks glitched twice:

    kworker/3:12-16292 [003] 1525532322.388169: xfs_log_done_nonperm: dev 253:48 t_ocnt 2 t_cnt 1 t_curr_res 112676 t_unit_res 112720 t_flags XLOG_TIC_INITED|XLOG_TIC_PERM_RESERV reserveq empt
y writeq empty grant_reserve_cycle 1 grant_reserve_bytes 254272 grant_write_cycle 1 grant_write_bytes 254272 curr_cycle 1 curr_block 270 tail_cycle 1 tail_block 266
    kworker/3:12-16292 [003] 1525532322.388171: print:                xlog_ungrant_log_space: CALL:   kworker/3:12 - m_fdblocks = 12278447
    kworker/3:12-16292 [003] 1525532322.388172: xfs_log_ungrant_enter: dev 253:48 t_ocnt 2 t_cnt 0 t_curr_res 112676 t_unit_res 112720 t_flags XLOG_TIC_INITED|XLOG_TIC_PERM_RESERV reserveq emp
ty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 254272 grant_write_cycle 1 grant_write_bytes 254272 curr_cycle 1 curr_block 270 tail_cycle 1 tail_block 266
    kworker/3:12-16292 [003] 1525532322.388173: xfs_log_ungrant_sub:  dev 253:48 t_ocnt 2 t_cnt 0 t_curr_res 112676 t_unit_res 112720 t_flags XLOG_TIC_INITED|XLOG_TIC_PERM_RESERV reserveq empt
y writeq empty grant_reserve_cycle 1 grant_reserve_bytes 254272 grant_write_cycle 1 grant_write_bytes 254272 curr_cycle 1 curr_block 270 tail_cycle 1 tail_block 266
    kworker/3:12-16292 [003] 1525532322.388173: xfs_log_ungrant_exit: dev 253:48 t_ocnt 2 t_cnt 0 t_curr_res 112676 t_unit_res 112720 t_flags XLOG_TIC_INITED|XLOG_TIC_PERM_RESERV reserveq empt
y writeq empty grant_reserve_cycle 1 grant_reserve_bytes 141596 grant_write_cycle 1 grant_write_bytes 141596 curr_cycle 1 curr_block 270 tail_cycle 1 tail_block 266
    kworker/3:12-16292 [003] 1525532322.388176: print:                xlog_ungrant_log_space: RETURN: kworker/3:12 - m_fdblocks = 12278447
    kworker/3:12-16292 [003] 1525532322.388184: print:                xfs_mod_fdblocks: CALL:   kworker/3:12 - m_fdblocks = 12278447, delta = 4294967177, rsvd = 1
    kworker/3:12-16292 [003] 1525532322.388188: print:                xfs_mod_fdblocks: RETURN: kworker/3:12 - m_fdblocks = 4307245624, delta = 4294967177, rsvd = 1
   aio_df_repl-1-16192 [001] 1525532322.388199: sys_enter_statfs:     pathname: 0x7ffd87ed261e, buf: 0x7ffd87ed0660
   aio_df_repl-1-16192 [001] 1525532322.388209: print:                xfs_fs_statfs: CALL:   aio_df_repl-1 - m_fdblocks = 4307245624
   aio_df_repl-1-16192 [001] 1525532322.388213: print:                xfs_fs_statfs: RETURN: aio_df_repl-1 - m_fdblocks = 4307245624
   aio_df_repl-1-16192 [001] 1525532322.388213: sys_exit_statfs:      0x0
   aio_df_repl-1-16192 [001] 1525532322.388221: xfs_getattr:          dev 253:48 ino 0x43
   aio_df_repl-1-16192 [001] 1525532322.388233: sys_enter_fstatfs:    fd: 0x00000003, buf: 0x7ffd87ed0660
   aio_df_repl-1-16192 [001] 1525532322.388236: print:                xfs_fs_statfs: CALL:   aio_df_repl-1 - m_fdblocks = 4307245624
    kworker/3:12-16292 [003] 1525532322.388246: xfs_buf_item_unlock:  dev 253:48 bno 0x1 len 0x200 hold 3 pincount 1 lock 0 flags ASYNC|DONE|KMEM recur 0 refcount 2 bliflags DIRTY lidesc 0x(nil) liflags IN_AIL
...
    kworker/3:12-16292 [003] 1525532322.391088: print:                xlog_ungrant_log_space: CALL:   kworker/3:12 - m_fdblocks = 4307245095
       trace-cmd-16191 [000] 1525532322.391089: xfs_iunlock:          dev 253:2 ino 0x147a85d flags ILOCK_EXCL caller kretprobe_trampoline
    kworker/3:12-16292 [003] 1525532322.391090: xfs_log_ungrant_enter: dev 253:48 t_ocnt 2 t_cnt 0 t_curr_res 112676 t_unit_res 112720 t_flags XLOG_TIC_INITED|XLOG_TIC_PERM_RESERV reserveq emp
ty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 479836 grant_write_cycle 1 grant_write_bytes 479836 curr_cycle 1 curr_block 270 tail_cycle 1 tail_block 266
    kworker/3:12-16292 [003] 1525532322.391090: xfs_log_ungrant_sub:  dev 253:48 t_ocnt 2 t_cnt 0 t_curr_res 112676 t_unit_res 112720 t_flags XLOG_TIC_INITED|XLOG_TIC_PERM_RESERV reserveq empt
y writeq empty grant_reserve_cycle 1 grant_reserve_bytes 479836 grant_write_cycle 1 grant_write_bytes 479836 curr_cycle 1 curr_block 270 tail_cycle 1 tail_block 266
    kworker/3:12-16292 [003] 1525532322.391091: xfs_log_ungrant_exit: dev 253:48 t_ocnt 2 t_cnt 0 t_curr_res 112676 t_unit_res 112720 t_flags XLOG_TIC_INITED|XLOG_TIC_PERM_RESERV reserveq empt
y writeq empty grant_reserve_cycle 1 grant_reserve_bytes 367160 grant_write_cycle 1 grant_write_bytes 367160 curr_cycle 1 curr_block 270 tail_cycle 1 tail_block 266
    kworker/3:12-16292 [003] 1525532322.391094: print:                xlog_ungrant_log_space: RETURN: kworker/3:12 - m_fdblocks = 4307245095
    kworker/3:12-16292 [003] 1525532322.391102: print:                xfs_mod_fdblocks: CALL:   kworker/3:12 - m_fdblocks = 4307245095, delta = 4294967177, rsvd = 1
       trace-cmd-16191 [000] 1525532322.391103: xfs_iunlock:          dev 253:2 ino 0x147a85d flags IOLOCK_EXCL caller kretprobe_trampoline
    kworker/3:12-16292 [003] 1525532322.391108: print:                xfs_mod_fdblocks: RETURN: kworker/3:12 - m_fdblocks = 8602212272, delta = 4294967177, rsvd = 1
    kworker/3:12-16292 [003] 1525532322.391115: xfs_buf_item_unlock:  dev 253:48 bno 0x1 len 0x200 hold 3 pincount 1 lock 0 flags ASYNC|DONE|KMEM recur 0 refcount 2 bliflags DIRTY lidesc 0x(nil) liflags IN_AIL
       trace-cmd-16191 [000] 1525532322.391121: xfs_ilock:            dev 253:2 ino 0x147a85d flags IOLOCK_EXCL caller kretprobe_trampoline
       trace-cmd-16191 [000] 1525532322.391123: xfs_file_splice_write: dev 253:2 ino 0x147a85d size 0x401b000 offset 0x407c000 count 0x1000


both of which were called with delta 4294967177:

    kworker/3:12-16292 [003] 1525532322.388184: print:                xfs_mod_fdblocks: CALL:   kworker/3:12 - m_fdblocks = 12278447, delta = 4294967177, rsvd = 1
    kworker/3:12-16292 [003] 1525532322.388188: print:                xfs_mod_fdblocks: RETURN: kworker/3:12 - m_fdblocks = 4307245624, delta = 4294967177, rsvd = 1

    kworker/3:12-16292 [003] 1525532322.391102: print:                xfs_mod_fdblocks: CALL:   kworker/3:12 - m_fdblocks = 4307245095, delta = 4294967177, rsvd = 1
       trace-cmd-16191 [000] 1525532322.391103: xfs_iunlock:          dev 253:2 ino 0x147a85d flags IOLOCK_EXCL caller kretprobe_trampoline
    kworker/3:12-16292 [003] 1525532322.391108: print:                xfs_mod_fdblocks: RETURN: kworker/3:12 - m_fdblocks = 8602212272, delta = 4294967177, rsvd = 1

# echo $(( (1<<32) - 4294967177))
119

Comment 5 Dave Chinner 2018-05-07 01:38:37 UTC
(In reply to Frank Sorenson from comment #4)
> the invalid free blocks count occurs when xfs_mod_fdblocks is called with a
> ridiculously large 'delta' value.  In fact, it's so large, it looks like it
> may be a 32-bit negative number.  For example:
> 
> kworker/3:1 (2755) - ALERT: xfs_mod_fdblocks (delta = 4294967049)

delta is a int64_t, not a 32 bit value. It can be negative, too.

> # printf 0x%x\\n 4294967049
> 0xffffff09
> 
> # echo $(( (1<<32) - 0xffffff09))
> 247
> 
> this occurred in the following backtrace:
>  0xffffffffc04a93f0 : xfs_mod_fdblocks+0x0/0x1b0 [xfs]
>  0xffffffffc04aeee3 : xfs_trans_unreserve_and_mod_sb+0x1c3/0x4b0 [xfs]
>  0xffffffffc04b5b70 : xfs_log_commit_cil+0x440/0x670 [xfs]
>  0xffffffffc04af43d : __xfs_trans_commit+0x12d/0x260 [xfs]
>  0xffffffffc04af830 : xfs_trans_commit+0x10/0x20 [xfs]
>  0xffffffffc049f68a : xfs_iomap_write_unwritten+0x1aa/0x300 [xfs]
>  0xffffffffc04898b8 : xfs_end_io+0x78/0x80 [xfs]
>  0xffffffffc048b0d7 : xfs_end_io_direct_write+0x67/0x140 [xfs]
>  0xffffffffaee5b5d5 : dio_complete+0x125/0x2a0 [kernel]

Getting a negative block count here is indicative of data blocks being allocated during unwritten extent conversion at IO completion. This should be an extent conversion, not an allocation, and so doesn't have a space reservation for allocation.

We account for the space allocated in xfs_alloc_ag_vextent():

        if (!args->isfl) {
                xfs_trans_mod_sb(args->tp, args->wasdel ?
                                 XFS_TRANS_SB_RES_FDBLOCKS :
                                 XFS_TRANS_SB_FDBLOCKS,
                                 -((long)(args->len)));
        }

If we landed in a hole then wasdel = false. Hence we add -(args->len) to tp->t_fdblocks_delta. When the transaction is committed, we call xfs_trans_unreserve_and_mod_sb() which does:

        if (tp->t_blk_res > 0)
                blkdelta = tp->t_blk_res;
        if ((tp->t_fdblocks_delta != 0) &&
            (xfs_sb_version_haslazysbcount(&mp->m_sb) ||
             (tp->t_flags & XFS_TRANS_SB_DIRTY)))
                blkdelta += tp->t_fdblocks_delta;

But because we don't have a block reservation for the allocation, tp->t_blk_res will be very small (10-20 blocks) and then we add the negative block count for the allocation, which is large. Hence blkdelta ends up being a negative value. We then call xfs_mod_fdblocks(blkdelta) and it ends up doing:

        __percpu_counter_add(&mp->m_fdblocks, delta, batch);

and that updates the in-memory free block count. Everything in this path is using signed 64 bit integers, so there should not be any integer overflows in this path. Hence my suspicion that the negative block count is a case of allocation without reservation during conversion.

I'm going to run this on a TOT kernel to see if it reproduces there, and then go from there....

-Dave.

Comment 6 Dave Chinner 2018-05-07 01:52:17 UTC
FWIW:

> trace-cmd-16191 [000] 1525532322.391121: xfs_ilock:            dev 253:2 ino 0x147a85d flags IOLOCK_EXCL caller kretprobe_trampoline

Does the retpoline code in the RHEL kernel break all the return pointer tracing information? Or is this something specific to the kernel you were testing, Frank?

-Dave.

Comment 7 Dave Chinner 2018-05-07 05:39:07 UTC
I haven't been able to reproduce this on a 4.17-rc3 kernel through hundreds of iterations.

-Dave.

Comment 8 Donald Douwsma 2018-05-07 11:41:07 UTC
It can take some time; on 3.10.0-693.21.1.el7.x86_64 I eventually reproduced it after 3000-odd iterations. Its reproduction may have coincided with a suspend/resume, but I can't be sure as I'd stopped paying attention to it.

That was on a 7 CPU W541 bare metal (Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz). Didn't try with a VM.

Comment 9 Frank Sorenson 2018-05-07 14:38:17 UTC
(In reply to Dave Chinner from comment #6)
> FWIW:
> 
> > trace-cmd-16191 [000] 1525532322.391121: xfs_ilock:            dev 253:2 ino 0x147a85d flags IOLOCK_EXCL caller kretprobe_trampoline
> 
> Does the retpoline code in the RHEL kernel break all the return pointer
> tracing information? Or is this something specific to the kernel you were
> testing, Frank?
> 
> -Dave.

No, I was running a systemtap with a lot of probes on function returns.  I'll add debugging directly into the kernel and recompile so I get valid returns in backtraces.

I tried reproducing on 4.16 overnight, and it hadn't occurred after 14,000 iterations, despite always hitting within the first 10 with the 3.10.0-* kernels I tried (usually within the first 3-4, though I've only tried on VMs).


Yes, I was seeing that all the variables which appeared to be relevant were 64-bit, and that negative values were possible, as well.  I was just observing that if a negative value were cut at 32 bits, the delta would make sense.

Comment 10 Frank Sorenson 2018-05-08 04:11:31 UTC
Okay, this reproduces a lot faster with fewer AGs (limit to 4 or 2), and with these changes in the reproducer:

#define WRITE_SIZE (128ULL * KiB)

#define IOCB_NUM 256
#define MAX_EVENTS 256



the bug hits in the following (I have not tracked the transaction back any further yet, so I don't know the origin of the t_blk_res value):

xfs_log_commit_cil(
...
  xfs_trans_unreserve_and_mod_sb(tp);  t_fdblocks_delta = 1, tp->t_blk_res = 4294967272
    xfs_mod_fdblocks(m_fdblocks = 12579877, delta = 4294967273, rsvd = 1)



     kworker/0:2-26564 [000] print: xfs_trans_unreserve_and_mod_sb: CALL:   - m_fdblocks = 12579877, t_blk_res = 4294967272, t_fdblocks_delta = 1, t_rtx_res = 0, t_frextents_delta = 0, t_icount_delta = 0
     kworker/0:2-26564 [000] print: xfs_mod_fdblocks: CALL:   - m_fdblocks = 12579877, delta = 4294967273, rsvd = 1
     kworker/0:2-26564 [000] print: xfs_mod_fdblocks: - ALERT: xfs_mod_fdblocks (delta = 4294967273)
     kworker/0:2-26564 [000] print: xfs_mod_fdblocks: RETURN: - m_fdblocks = 4307547150, delta = 4294967273, rsvd = 1
     kworker/0:2-26564 [000] print: xfs_trans_unreserve_and_mod_sb: RETURN: - m_fdblocks = 4307547150


xfs_log_commit_cil(
...

        /* check we didn't blow the reservation */
        if (tp->t_ticket->t_curr_res < 0)
                xlog_print_tic_res(mp, tp->t_ticket);

        tp->t_commit_lsn = cil->xc_ctx->sequence;
        if (commit_lsn)
                *commit_lsn = tp->t_commit_lsn;

        xfs_log_done(mp, tp->t_ticket, NULL, regrant);
        xfs_trans_unreserve_and_mod_sb(tp); <<<<<<<<<<<<<<<<

Comment 11 Dave Chinner 2018-05-09 01:57:14 UTC
> m_fdblocks = 12579877, t_blk_res = 4294967272, t_fdblocks_delta = 1,

OK, so t_blk_res is going negative, not t_fdblocks_delta. I'm not sure how that can happen, as the only time t_blk_res is modified is at transaction allocation (and that should never over/underflow) or during a transaction roll. Hence this implies we're committing a transaction we've just duplicated, as xfs_trans_dup() does:

       tp->t_blk_res = tp->t_blk_res_used;

So that when it commits it should end up only accounting for the blocks allocated within that transaction and not freeing the unused part of the block reservation it has passed on to the duplicated transaction. 

However, t_blk_res_used is unsigned, and we only ever change it when the delta is negative by doing:

      if (delta < 0)
             tp->t_blk_res_used += (uint)-delta;

Hence it should only have small positive increments.

However, we've apparently *freed* one block (t_fdblocks_delta = 1) during this transaction, so delta was not less than zero, which would have indicated allocation.

That's just confusing. Frank, can you dump t_blk_res_used as well as t_blk_res, and also dump the stack if t_blk_res_used is negative in xfs_trans_dup()?

-Dave.

Comment 12 Frank Sorenson 2018-05-09 22:39:25 UTC
Created attachment 1434042 [details]
trace

the last ~4100 lines of the trace.

the delta was first applied to the xfs_mount here:

kworker/3:7 (14753) xfs_mod_fdblocks(m_fdblocks = 12096966, delta = 4294967273, rsvd = 1
 0xffffffffc03fe3f0 : xfs_mod_fdblocks+0x0/0x1b0 [xfs]
 0xffffffffc0403ee3 : xfs_trans_unreserve_and_mod_sb+0x1c3/0x4b0 [xfs]
 0xffffffffc040ab70 : xfs_log_commit_cil+0x440/0x670 [xfs]
 0xffffffffc040443d : __xfs_trans_commit+0x12d/0x260 [xfs]
 0xffffffffc0404830 : xfs_trans_commit+0x10/0x20 [xfs]
 0xffffffffc03f468a : xfs_iomap_write_unwritten+0x1aa/0x300 [xfs]
 0xffffffffc03de8b8 : xfs_end_io+0x78/0x80 [xfs]
 0xffffffffc03e00d7 : xfs_end_io_direct_write+0x67/0x140 [xfs]
 0xffffffffa9e5b5d5 : dio_complete+0x125/0x2a0 [kernel]


bogus value in t_blk_res is first seen in this location:

 0xffffffffc0403430 : xfs_trans_reserve+0x0/0x1f0 [xfs]
 0xffffffffc0404b12 : __xfs_trans_roll+0xb2/0xe0 [xfs]
 0xffffffffc0404b65 : xfs_trans_roll+0x25/0x40 [xfs]
 0xffffffffc03c6ad9 : xfs_defer_trans_roll+0x69/0x160 [xfs]
 0xffffffffc03c6cda : xfs_defer_finish+0x7a/0x3a0 [xfs]
 0xffffffffc03f4676 : xfs_iomap_write_unwritten+0x196/0x300 [xfs]
 0xffffffffc03de8b8 : xfs_end_io+0x78/0x80 [xfs]
 0xffffffffc03e00d7 : xfs_end_io_direct_write+0x67/0x140 [xfs]
 0xffffffffa9e5b5d5 : dio_complete+0x125/0x2a0 [kernel]

Comment 13 Frank Sorenson 2018-05-09 23:04:43 UTC
the call to xfs_trans_dup is with a positive t_blk_res_used:

xfs_trans_dup: caller __xfs_trans_roll, m_fdblocks = 12096966, tid = 0xb7452d4a, t_blk_res = 8, t_blk_res_used = 32

Okay, so this actually makes sense now...

typedef struct xfs_trans {
   [0x0] unsigned int t_magic;
   [0x4] unsigned int t_log_res;
   [0x8] unsigned int t_log_count;
   [0xc] unsigned int t_blk_res;
  [0x10] unsigned int t_blk_res_used;
...

in xfs_trans_dup:
  100           ntp->t_blk_res = tp->t_blk_res - tp->t_blk_res_used;
  101           tp->t_blk_res = tp->t_blk_res_used;



#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

int main(int argc, char *argv[]) {
        uint32_t tp_t_blk_res = 8;
        uint32_t tp_t_blk_res_used = 32;
        uint32_t ntp_t_blk_res = tp_t_blk_res - tp_t_blk_res_used;

        printf("ntp_t_blk_res = %u\n", ntp_t_blk_res);
}

ntp_t_blk_res = 4294967272


and that's the t_blk_res value:

     kworker/3:7-14753 [003] 1525903278.255302: xfs_trans_reserve: caller __xfs_trans_roll, m_fdblocks = 12096966, tid = 0xb7452d4a, blocks = 0, t_blk_res = 4294967272, t_blk_res_used = 0

Comment 14 Dave Chinner 2018-05-10 06:25:30 UTC
Frank,

Do you have the entire trace? It looks like IO completion is running unwritten extent conversion on a hole, but there's not enough history in the trace to determine why. i.e.:

kworker/2:30-14998 [002] 1525903278.258586: xfs_bmap_pre_update:  dev 253:48 ino 0x43 state LC idx 248 offset 995584 block 996025 count 32 flag 0 caller xfs_bmap_add_extent_hole_real
    kworker/2:30-14998 [002] 1525903278.258587: xfs_bmap_post_update: dev 253:48 ino 0x43 state LC idx 248 offset 995584 block 996025 count 64 flag 0 caller xfs_bmap_add_extent_hole_real

Those two traces in an IO completion context indicate that there is a conversion allocating a written extent over a hole. It seems to be happening a lot in that trace, and I'm not sure why...

-Dave.

Comment 15 Frank Sorenson 2018-05-10 11:57:41 UTC
Created attachment 1434336 [details]
full trace

Comment 16 Brian Foster 2018-05-10 12:02:00 UTC
Just a couple high-level points that might be worth noting:

- Upstream should check for transaction block reservation overrun as of commit 3e78b9a468 ("xfs: shutdown if block allocation overruns tx reservation"). The consequences it describes sound like the resulting behavior here.

- It doesn't sound like this has been reproduced upstream at all. FWIW, upstream is fully using the iomap interface while rhel7 still uses the old __blockdev_direct_IO()/__xfs_get_blocks() model for direct I/O.

Comment 17 Frank Sorenson 2018-05-10 21:23:32 UTC
Ah, yes.  This condition would definitely hit the ASSERT; I haven't looked closely at xfs_trans_mod_sb, but it does look like we'd hit the condition.

Was the commit added to address a specific condition (did someone else hit this bug or similar?), or was it just for added safety?


Correct, I don't think anyone has been able to reproduce this upstream, so perhaps the switch to the iomap interface has removed the possibility of hitting it.

Comment 18 Brian Foster 2018-05-11 10:32:42 UTC
(In reply to Frank Sorenson from comment #17)
> Ah, yes.  This condition would definitely hit the ASSERT; I haven't looked
> closely at xfs_trans_mod_sb, but it does look like we'd hit the condition.
> 
> Was the commit added to address a specific condition (did someone else hit
> this bug or similar?), or was it just for added safety?
> 

I think it was associated with the very next commit, which is b3fed43482 ("xfs: account format bouncing into rmapbt swapext tx reservation"). IIRC, that reproduced a similar block reservation overrun during swapext (xfs_fsr) on reverse mapping enabled filesystems. I don't think that is related to this problem, since the reproducer isn't doing extent swaps (and rhel7 doesn't support rmap).

I'm not sure what you have going for debug/instrumentation code, but that first patch might be useful if you wanted to catch the block usage modification that actually put the transaction over reservation (before it gets propagated into an overrun or whatnot..).

> 
> Correct, I don't think anyone has been able to reproduce this upstream, so
> perhaps the switch to the iomap interface has removed the possibility of
> hitting it.

Indeed, or whatever customization we may have introduced to support both methods could have caused a downstream only regression...

Comment 19 Frank Sorenson 2018-05-14 21:33:45 UTC
I enabled xfs debugging on kernel-3.10.0-883.el7, and started hitting assertions immediately after I started running the reproducer.

currently
	8 threads, each:
		opens the same file O_DIRECT
		while position < 5 GiB:
			io_submit for 512 pwrites of 64 KiB sequentially from the beginning of the file
			io_getevents for all 512 ios
			advance position by 512*64 KiB



[ 1444.475850] XFS: Assertion failed: offset + size <= ioend->io_offset + ioend->io_size, file: fs/xfs/xfs_aops.c, line: 1488
[ 1444.475987] ------------[ cut here ]------------
[ 1444.476018] kernel BUG at fs/xfs/xfs_message.c:113!
...
[ 1444.476714] Workqueue: dio/vdd dio_aio_complete_work
[ 1444.476751] task: ffffa11d753a0000 ti: ffffa11d732f8000 task.ti: ffffa11d732f8000

[ 1444.477341] Call Trace:
[ 1444.477393]  [<ffffffffc05c39f6>] xfs_end_io_direct_write+0x186/0x190 [xfs]
[ 1444.477441]  [<ffffffffb9c5b405>] dio_complete+0x125/0x2a0
[ 1444.477506]  [<ffffffffb9c5b701>] dio_aio_complete_work+0x21/0x30
[ 1444.477558]  [<ffffffffb9ab1f0f>] process_one_work+0x17f/0x440
[ 1444.477608]  [<ffffffffb9ab2fa6>] worker_thread+0x126/0x3c0
[ 1444.477644]  [<ffffffffb9ab2e80>] ? manage_workers.isra.24+0x2a0/0x2a0
[ 1444.477689]  [<ffffffffb9ab9f41>] kthread+0xd1/0xe0
[ 1444.477718]  [<ffffffffb9ab9e70>] ? insert_kthread_work+0x40/0x40
[ 1444.477752]  [<ffffffffba114637>] ret_from_fork_nospec_begin+0x21/0x21
[ 1444.477796]  [<ffffffffb9ab9e70>] ? insert_kthread_work+0x40/0x40

static void
__xfs_end_io_direct_write(
        struct inode            *inode,
        struct xfs_ioend        *ioend,
        loff_t                  offset,
        ssize_t                 size)
{
        struct xfs_mount        *mp = XFS_I(inode)->i_mount;
        unsigned long           flags;

        if (XFS_FORCED_SHUTDOWN(mp) || ioend->io_error)
                goto out_end_io;

        if (size <= 0)
                return;
...
        ASSERT(offset + size <= ioend->io_offset + ioend->io_size);  <<<<
        ioend->io_size = size;
        ioend->io_offset = offset;


0xffffffffc05c38d7 <xfs_end_io_direct_write+103>:       mov    0x28(%r15),%rax
0xffffffffc05c38db <xfs_end_io_direct_write+107>:       add    0x20(%r15),%rax
0xffffffffc05c38df <xfs_end_io_direct_write+111>:       lea    (%r12,%rbx,1),%r13
0xffffffffc05c38e3 <xfs_end_io_direct_write+115>:       cmp    %rax,%r13
0xffffffffc05c38e6 <xfs_end_io_direct_write+118>:       ja     0xffffffffc05c39de <xfs_end_io_direct_write+366>
...
0xffffffffc05c39de <xfs_end_io_direct_write+366>:       mov    $0x5d0,%edx
0xffffffffc05c39e3 <xfs_end_io_direct_write+371>:       mov    $0xffffffffc061bbd8,%rsi
0xffffffffc05c39ea <xfs_end_io_direct_write+378>:       mov    $0xffffffffc0626c68,%rdi
0xffffffffc05c39f1 <xfs_end_io_direct_write+385>:       callq  0xffffffffc05e43d0 <assfail>
0xffffffffc05c39f6 <xfs_end_io_direct_write+390>:       jmpq   0xffffffffc05c38ec <xfs_end_io_direct_write+124>



0xffffffffb9c5b3ea <dio_complete+266>:  mov    0x20(%r12),%rcx << 4th arg in %rcx and at %r12+0x20
0xffffffffb9c5b3ef <dio_complete+271>:  mov    0x58(%r12),%rdi << 1st arg in %rdi and at %r12+0x58
0xffffffffb9c5b3f4 <dio_complete+276>:  xor    %r9d,%r9d
0xffffffffb9c5b3f7 <dio_complete+279>:  xor    %r8d,%r8d
0xffffffffb9c5b3fa <dio_complete+282>:  mov    %rbx,%rdx  << 3rd arg in %rbx and %rdx
0xffffffffb9c5b3fd <dio_complete+285>:  mov    %r14,%rsi  << 2nd arg in %r14 and %rsi
0xffffffffb9c5b400 <dio_complete+288>:  callq  0xffffffffb9d5bc20 <__x86_indirect_thunk_rax>

void xfs_end_io_direct_write(struct kiocb *, loff_t, ssize_t, void *, int, bool);


0xffffffffc05c3870 <xfs_end_io_direct_write>:   data32 data32 data32 xchg %ax,%ax [FTRACE NOP]
0xffffffffc05c3875 <xfs_end_io_direct_write+5>: push   %rbp
0xffffffffc05c3876 <xfs_end_io_direct_write+6>: mov    %rsp,%rbp
0xffffffffc05c3879 <xfs_end_io_direct_write+9>: push   %r15
0xffffffffc05c387b <xfs_end_io_direct_write+11>:        mov    %rcx,%r15
0xffffffffc05c387e <xfs_end_io_direct_write+14>:        push   %r14
0xffffffffc05c3880 <xfs_end_io_direct_write+16>:        push   %r13
0xffffffffc05c3882 <xfs_end_io_direct_write+18>:        push   %r12
0xffffffffc05c3884 <xfs_end_io_direct_write+20>:        mov    %rsi,%r12
0xffffffffc05c3887 <xfs_end_io_direct_write+23>:        push   %rbx
0xffffffffc05c3888 <xfs_end_io_direct_write+24>:        mov    %rdx,%rbx
0xffffffffc05c388b <xfs_end_io_direct_write+27>:        sub    $0x10,%rsp

 #8 [ffffa11d732fbd88] xfs_end_io_direct_write at ffffffffc05c39f6 [xfs]
    ffffa11d732fbd90: ffffa11da74a2638 0000000000000000 
    ffffa11d732fbda0: 0000000000010000 ffffa11da2f70c00  << %rbx and %r12
    ffffa11d732fbdb0: 0000000000010000 0000000000010000  << %r13 and %r14
    ffffa11d732fbdc0: 0000000000000001 ffffa11d732fbe08  << %r15 and %rbp
    ffffa11d732fbdd0: ffffffffb9c5b405                   << return address
 #9 [ffffa11d732fbdd0] dio_complete at ffffffffb9c5b405


kiocb is ffffa11dab987600
offset is 0000000000010000
size is 0000000000010000

xfs_ioend is ffffa11da2bfe300
dio is ffffa11da2f70c00

crash> xfs_ioend.io_offset,io_size ffffa11da2bfe300
  io_offset = 0
  io_size = 65536

        ASSERT(offset + size <= ioend->io_offset + ioend->io_size);  <<<<

the offset + size is 0x20000 (131072)
ioend->io_offset + ioend->io_size is 0x10000 (65536)

=============================================================================

On the next run, I hit this assert immediately:

[10750.624262] XFS (vdd): Ending clean mount
[10752.243679] XFS: Assertion failed: offset + size <= i_size_read(inode), file: fs/xfs/xfs_aops.c, line: 1553
[10752.243755] ------------[ cut here ]------------
[10752.243780] kernel BUG at fs/xfs/xfs_message.c:113!
[10752.243826] invalid opcode: 0000 [#1] SMP 
...
[10752.244207] CPU: 2 PID: 18710 Comm: aio_df_repl-3te Kdump: loaded Tainted: G           OE  ------------   3.10.0 #1
[10752.244255] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.1-1.fc24 04/01/2014
[10752.244300] task: ffff8a0b72c4c480 ti: ffff8a0b7fc4c000 task.ti: ffff8a0b7fc4c000
[10752.244336] RIP: 0010:[<ffffffffc02ca3f2>]  [<ffffffffc02ca3f2>] assfail+0x22/0x30 [xfs]
[10752.244729] Call Trace:
[10752.244743]  <IRQ> 
[10752.244790]  [<ffffffffc02a99b3>] xfs_end_io_direct_write+0x143/0x190 [xfs]
[10752.244828]  [<ffffffff95e5b405>] dio_complete+0x125/0x2a0
[10752.244856]  [<ffffffff95e5b168>] ? dio_bio_complete+0x98/0xc0
[10752.244885]  [<ffffffff95e5b667>] dio_bio_end_aio+0xe7/0x130
[10752.244914]  [<ffffffff95e58537>] bio_endio+0x67/0xb0
[10752.244941]  [<ffffffff95f1c290>] blk_update_request+0x90/0x360
[10752.244972]  [<ffffffff95f263ea>] blk_mq_end_request+0x1a/0x70
[10752.245003]  [<ffffffffc022b3bf>] virtblk_request_done+0x3f/0x70 [virtio_blk]
[10752.245038]  [<ffffffff95f249e9>] __blk_mq_complete_request_remote+0x19/0x20
[10752.245074]  [<ffffffff95d07113>] flush_smp_call_function_queue+0x63/0x130
[10752.245108]  [<ffffffff95d07813>] generic_smp_call_function_single_interrupt+0x13/0x30
[10752.246327]  [<ffffffff95c53b4d>] smp_call_function_single_interrupt+0x2d/0x40
[10752.247488]  [<ffffffff96316aa2>] call_function_single_interrupt+0x162/0x170
[10752.248661]  <EOI> 
[10752.248676]  [<ffffffff95f53c2d>] ? rwsem_spin_on_owner+0x4d/0x70
[10752.250995]  [<ffffffff963091e1>] rwsem_down_write_failed+0xe1/0x3a0
[10752.252178]  [<ffffffffc02c0abd>] ? xfs_vn_update_time+0xcd/0x150 [xfs]
[10752.253340]  [<ffffffff95d5948c>] ? ring_buffer_unlock_commit+0x2c/0x260
[10752.254512]  [<ffffffffc02c0abd>] ? xfs_vn_update_time+0xcd/0x150 [xfs]
[10752.255659]  [<ffffffff95f5be87>] call_rwsem_down_write_failed+0x17/0x30
[10752.256784]  [<ffffffff96306bad>] down_write+0x2d/0x3d
[10752.257915]  [<ffffffffc02c2b1d>] xfs_ilock+0x11d/0x200 [xfs]
[10752.259049]  [<ffffffffc02c0abd>] xfs_vn_update_time+0xcd/0x150 [xfs]
[10752.260147]  [<ffffffff95e375a8>] update_time+0x28/0xd0
[10752.261224]  [<ffffffff95e376f0>] file_update_time+0xa0/0xf0
[10752.262318]  [<ffffffffc02b6075>] xfs_file_aio_write_checks+0x185/0x1f0 [xfs]
[10752.263425]  [<ffffffffc02b6358>] xfs_file_dio_aio_write+0xc8/0x380 [xfs]
[10752.264503]  [<ffffffffc02b69d2>] xfs_file_aio_write+0x102/0x1b0 [xfs]
[10752.265549]  [<ffffffffc02b68d0>] ? xfs_file_buffered_aio_write+0x2c0/0x2c0 [xfs]
[10752.266563]  [<ffffffff95e70103>] do_io_submit+0x3c3/0x870
[10752.267532]  [<ffffffff95e705c0>] SyS_io_submit+0x10/0x20
[10752.268476]  [<ffffffff96314a51>] tracesys+0x9d/0xc3

PID: 18710  TASK: ffff8a0b72c4c480  CPU: 2   COMMAND: "aio_df_repl-3te"
 #0 [ffff8a0bb0103af0] machine_kexec at ffffffff95c5fefa
 #1 [ffff8a0bb0103b50] __crash_kexec at ffffffff95d12aa2
 #2 [ffff8a0bb0103c20] crash_kexec at ffffffff95d12b90
 #3 [ffff8a0bb0103c38] oops_end at ffffffff9630c768
 #4 [ffff8a0bb0103c60] die at ffffffff95c2d96b
 #5 [ffff8a0bb0103c90] do_trap at ffffffff9630beb0
 #6 [ffff8a0bb0103ce0] do_invalid_op at ffffffff95c2a284
 #7 [ffff8a0bb0103d90] invalid_op at ffffffff96317b2e
    [exception RIP: assfail+34]
    RIP: ffffffffc02ca3f2  RSP: ffff8a0bb0103e40  RFLAGS: 00010046
    RAX: 0000000000000000  RBX: 0000000000020000  RCX: 0000000000000000
    RDX: ffff8a0bb0103d48  RSI: 000000000000000a  RDI: ffffffffc03027d2
    RBP: ffff8a0bb0103e40   R8: 0000000000000000   R9: 0000000000000000
    R10: 000000000000000a  R11: f000000000000000  R12: 0000000000010000
    R13: 0000000000000000  R14: ffff8a0b478e0020  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #8 [ffff8a0bb0103e48] xfs_end_io_direct_write at ffffffffc02a99b3 [xfs]
 #9 [ffff8a0bb0103e90] dio_complete at ffffffff95e5b405
#10 [ffff8a0bb0103ed0] dio_bio_end_aio at ffffffff95e5b667
#11 [ffff8a0bb0103f00] bio_endio at ffffffff95e58537
#12 [ffff8a0bb0103f28] blk_update_request at ffffffff95f1c290
#13 [ffff8a0bb0103f68] blk_mq_end_request at ffffffff95f263ea
#14 [ffff8a0bb0103f88] virtblk_request_done at ffffffffc022b3bf [virtio_blk]
#15 [ffff8a0bb0103f98] __blk_mq_complete_request_remote at ffffffff95f249e9
#16 [ffff8a0bb0103fa8] flush_smp_call_function_queue at ffffffff95d07113

This assertion is:
void
xfs_end_io_direct_write(
        struct kiocb            *iocb,
        loff_t                  offset,
        ssize_t                 size,
        void                    *private,
        int                     __attribute__((unused))ret,
        bool                    __attribute__((unused))is_async)
{
        struct inode            *inode = file_inode(iocb->ki_filp);
        struct xfs_ioend        *ioend = private;

        trace_xfs_gbmap_direct_endio(XFS_I(inode), offset, size,
                                     ioend ? ioend->io_type : 0, NULL);

        if (!ioend) {
                ASSERT(offset + size <= i_size_read(inode));  <<<<<<<<<<<<<<<<<
                return;
        }

        __xfs_end_io_direct_write(inode, ioend, offset, size);

inode is ffff8a0b43ba2638

crash> inode.i_size ffff8a0b43ba2638
  i_size = 65536

void xfs_end_io_direct_write(struct kiocb *, loff_t, ssize_t, void *, int, bool);


 #8 [ffff8a0bb0103e48] xfs_end_io_direct_write at ffffffffc02a99b3 [xfs]
    ffff8a0bb0103e50: ffff8a0b43ba2638 ffff8a0b43ba24c0 
    ffff8a0bb0103e60: 0000000000010000 ffff8a0baac74400  << %rbx and %r12
    ffff8a0bb0103e70: 0000000000010000 0000000000010000  << %r13 and %r14
    ffff8a0bb0103e80: 0000000000000001 ffff8a0bb0103ec8  << %r15 and %rbp
    ffff8a0bb0103e90: ffffffff95e5b405                   << return addr
 #9 [ffff8a0bb0103e90] dio_complete at ffffffff95e5b405

dio is ffff8a0baac74400
kiocb is 0xffff8a0b7fda2800

crash> dio.inode,i_size,private ffff8a0baac74400
  inode = 0xffff8a0b43ba2638
  i_size = 0
  private = 0x0

crash> inode.i_size 0xffff8a0b43ba2638
  i_size = 65536

offset is 0000000000010000
size is 0000000000010000

offset + size = 0000000000020000 (131072)
                ASSERT(offset + size <= i_size_read(inode));

I also had all the xfs tracepoints enabled, and I can attach the trace.


I'm wondering if these assertions are even related to the incorrect freespace block accounting issue.  These are hitting long before the block count has ever become incorrect.

Comment 20 Frank Sorenson 2018-05-14 22:11:16 UTC
Created attachment 1436597 [details]
dmesg from second vmcore in comment 19

Comment 21 Frank Sorenson 2018-05-14 22:13:01 UTC
Created attachment 1436598 [details]
xfs trace from second vmcore in comment 19

Comment 22 Frank Sorenson 2018-05-15 20:17:40 UTC
After turning on just warnings (not XFS_DEBUG) with kernel-3.10.0-883.el7, here are the assertions hit before the free-blocks accounting bug occurs:

# grep Assertion messages | cut -c 29- | sort | uniq -c | sort -rn
   2289 XFS: Assertion failed: offset + size <= ioend->io_offset + ioend->io_size, file: fs/xfs/xfs_aops.c, line: 1488
   1530 XFS: Assertion failed: tp->t_blk_res_used <= tp->t_blk_res, file: fs/xfs/xfs_trans.c, line: 309
    483 XFS: Assertion failed: offset + size <= i_size_read(inode), file: fs/xfs/xfs_aops.c, line: 1553


I have hit the free blocks bug with:
	3.10.0-327.10.1.el7.x86_64
	3.10.0-693.32.1.el7.x86_64
	3.10.0-841.el7.x86_64
	3.10.0-869.el7.x86_64
	3.10.0-883.el7.x86_64

I don't have a -514 or earlier than -327 readily available so can't test those right now.  At any rate, if this is a regression, it's not a very recent one.

Comment 23 Frank Sorenson 2018-05-15 20:46:35 UTC
Created attachment 1437003 [details]
slightly faster reproducer

updated reproducer which uses smaller aio writes, and a lot more of them

Also eliminates the extra statvfs() calls from the child threads; they're not necessary for reproduction, so only the parent thread calls statvfs() for monitoring whether the reproducer has succeeded.


Tested with 150 GiB, 50 GiB, and 10 GiB xfs filesystems on 4-cpu VMs with between 1.2 and 4 GiB of memory; reproduced in 4 or fewer iterations.

Comment 24 Brian Foster 2018-05-16 11:11:04 UTC
Just an FYI, since I hit this when looking into some unrelated work. I produce a stream of the following assert when running generic/083 (and some other tests) against current master:

XFS: Assertion failed: offset + size <= i_size_read(inode), file: fs/xfs/xfs_aops.c, line: 1553

This appears to be related to rhel7 commit c513c2d835 ("[fs] direct-io: always call ->end_io if non-NULL").

Comment 25 Dave Chinner 2018-05-17 05:07:36 UTC
TL;DR: It's a bug in the crusty old direct IO code paths, which is why RHEL7 hits it and upstream doesn't. Upstream commit that fixes the bug is at the end.

When doing AIO DIO completion from the crusty old RHEL7 DIO code, we get the file offset from:

                       dio_complete(dio, dio->iocb->ki_pos, 0,
                                     DIO_COMPLETE_ASYNC);

the offset of the IO in the kiocb. Seems simple enough, right? But the trace is telling me there's something wrong - there's *lots* of allocation being done from unwritten extent conversion, so we're screwing up the completion offset somehow. The trace from /IO submission/ points us in the right direction (trimmed process/timestamp for readability) - I simply picked the last unwritten extent conversion in the trace that did allocation, and went grepping from there. The IO completion was for:

xfs_unwritten_convert: dev 253:48 ino 0x43 isize 0x12da0000 disize 0x12da0000 offset 0x12c80000 count 131072

an offset that only appeared in one other place in the trace - IO submission. but that was a strange trace:


xfs_file_direct_write: dev 253:48 ino 0x43 size 0x129e0000 offset 0x12c80000 count 0x20000
xfs_ilock:            dev 253:48 ino 0x43 flags ILOCK_SHARED caller xfs_ilock_data_map_shared
xfs_get_blocks_found: dev 253:48 ino 0x43 size 0x129e0000 offset 0x129e0000 count 131072 type 

So we got an IO for the offset we see the completion for, but when the mapping for that IO is done, the extent that was mapped has got a /different offset/.

When I first looked at the reproducer, I wondered if the O_APPEND flag had any bearing on the result, and it turns out it does. When the XFS direct IO code
calls generic_write_checks(), it passes in a "pos" parameter that we got from
the VFS, set up originally way up in the aio code to be equal to the offset passed in from the user AIO iocb. All good up to here, but this is where it goes wrong.

generic_write_checks() sees the O_APPEND flag set, and sets pos to the current inode size - it gets changed to point at the current EOF rather than the offset passed into the kernel held by kiocb->ki_pos. The IO is set up and run with "pos", which is also passed to the direct IO code. IOWs, the IO is submitted using "pos", not kiocb->ki_pos.

Now go back and look at what offset the direct IO code is saying the IO is completing at.....

This bug doesn't exist upstream, because the VFS doesn't supply filesystems with two different IO offset variables for a single IO. We get one - kiocb->ki_pos - and it gets passed around everywhere, including to generic_write_checks(). Hence when O_APPEND modifies the offset, that new offset is always used, both on IO submission and completion.

This problem was fixed upstream as part of the iocb/iov_iter infrastructure rework. This part landed in 4.1 in this commit, and it just so happens that it fixed the O_APPEND bug here by passing &kiocb->ki_pos to generic_write_checks() in preparation for passing the whole kiocb rather than the position. I.e. the commit is not related to the bug, but changes in it did fix it.

$ git describe --contains 99733fa372ea
v4.1-rc1~7^2~37
$ gl -n 1 99733fa
commit 99733fa372eaaa59cfb93fd383cee7b0ff056e1d
Author: Al Viro <viro@zeniv.linux.org.uk>
Date:   Tue Apr 7 14:25:18 2015 -0400

    xfs_file_aio_write_checks: switch to iocb/iov_iter
    
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

-Dave.

Comment 26 Frank Sorenson 2018-05-17 06:49:40 UTC
And after removing O_APPEND, the assertions do indeed stop, and the free blocks counter looks sane.

It looks like the free space was getting messed up as a result of un-reserving a 'bad' reservation in xfs_trans_commit(), where the 'bad' reservation occurred when t_blk_res_used was greater than t_blk_res.


The free space count is then a secondary consequence of the actual bug, which seems like it could result in data loss due to the mismatched offset between issue and completion.

Comment 27 Dave Chinner 2018-05-18 01:31:49 UTC
(In reply to Frank Sorenson from comment #26)
> The free space count is then a secondary consequence of the actual bug,
> which seems like it could result in data loss due to the mismatched offset
> between issue and completion.

Any application using O_APPEND to provide file extension via concurrent DIO+AIO writes is going to corrupt their data. There is no guarantee of order of O_APPEND writes, not to mention that the concurrent writes can race against file size update and hence overwrite each other. That's probably why it's taken so long for anyone to notice this - it's an insane thing for anyone who cares about their data to do. O_DIRECT gives you the rope, AIO hangs you with it....

I can only imagine that this was found via some artificial IO benchmark that never validates the data it writes. /me goes and checks the original report.

Oh, yeah. Iometer. 'nuf said.


-Dave.

Comment 28 Frank Sorenson 2018-05-18 03:51:58 UTC
It took me a while to understand why this would be the case.

in the manpage for pwrite(2):

BUGS
       POSIX requires that opening a file with the O_APPEND flag should
       have  no  effect  on the location at which pwrite() writes data.
       However, on Linux, if a file is opened with  O_APPEND,  pwrite()
       appends  data to the end of the file, regardless of the value of
       offset.


Only, as you point out, that's not entirely accurate (or perhaps 'precise' is better)...  more specifically (at least behavior-wise), 'offset' is changed to be the size of the file at the time the pwrite() began.

Because we threw aio into the mix, with high parallelism, multiple in-flight IOs may have the same (new) 'offset', so the writes do not all get laid out sequentially.  If the number of outstanding aio ops and number of threads were both small, they may not race at all, and the end file size could be the expected size multiplied by the number of threads.


Okay, so we don't need to care about the data here.  Just about not breaking the free block accounting.

Comment 29 Dave Wysochanski 2018-11-24 21:49:30 UTC
So what's the story here - is this a CLOSED WONTFIX for rhel7?

Comment 30 Dave Chinner 2018-11-26 00:40:44 UTC
No idea, and not really my call. It needs a custom fix for RHEL 7 that won't ever go upstream, but it doesn't look that complex to fix (i.e. just use the modified offset everywhere). Eric?

-Dave.

Comment 31 Eric Sandeen 2019-01-15 16:40:48 UTC
Missed this needinfo, sorry.

I'm torn on this one.  If Iometer is the only thing currently triggering it, it's not a real-world problem at this point.  And it's never been reported in a non-synthetic workload for the life of RHEL7, as far as I know. (correct?) Making changes to the DIO path this far into the life of RHEL7 seems unwise to me.

I'd prefer to CLOSE / WONTFIX this one, in general.  However, I'm not super happy with leaving the ability of unprivileged userspace to corrupt the filesystem just lying around unfixed.  So it's a risk/reward thing.

Dave, how much subtlety do you think there would be in the "just use the modified offset everywhere" change?  My fear is that some other bug will pop out post-release in a more common codepath.

Comment 32 Dave Chinner 2019-01-15 23:03:48 UTC
IIRC, I think it's just a matter of replacing local "pos" variables with kiocb->ki_pos in the XFS direct IO write path, and it wasn't that complex to do. But, yeah, historically speaking, changing this code to fix "simple" things has always led to either introducing new bugs or unmasking existing bugs in other parts of the IO path. i.e. it's code we played whack-a-mole with so many times we ended up re-implementing it completely (i.e. the iomap infrastructure). Hence I'd be very cautious about changing RHEL 7 code to avoid a problem on a workload that simply can't be used for anything but benchmarketing....

-Dave.

Comment 33 Eric Sandeen 2019-01-15 23:38:09 UTC
Thanks, Dave.  Yeah, that was my point - the unknown unknowns that may pop up.  My gut feeling is that we should not make changes at this point.

I'm going to close it WONTFIX, but if support has any significant heartburn over that, please let us know.
