Bug 1829662 - kernel BUG at fs/fscache/operation.c:70! FS-Cache: 4 == 5 is false - current state is FSCACHE_OP_ST_COMPLETE but should be FSCACHE_OP_CANCELLED in fscache_enqueue_operation
Summary: kernel BUG at fs/fscache/operation.c:70! FS-Cache: 4 == 5 is false - current ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kernel
Version: 7.7
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Assignee: Dave Wysochanski
QA Contact: JianHong Yin
URL:
Whiteboard:
Depends On:
Blocks: 1822123 1839756 1839757
 
Reported: 2020-04-30 02:28 UTC by Jay Shin
Modified: 2020-09-29 21:15 UTC (History)
11 users

Fixed In Version: kernel-3.10.0-1145.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1839756 1839757
Environment:
Last Closed: 2020-09-29 21:13:47 UTC
Target Upstream Version:


Attachments (Terms of Use)
Reproducer outline v1 (2.89 KB, text/plain) - 2020-05-01 11:22 UTC, Dave Wysochanski
patch submitted to linux-nfs (5.10 KB, text/plain) - 2020-05-07 12:52 UTC, Dave Wysochanski
Debug patches on top of 5.7-rc3 used to track down this problem (6.41 KB, application/gzip) - 2020-05-19 10:30 UTC, Dave Wysochanski


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1683490 1 None None None 2021-01-20 06:05:38 UTC
Red Hat Knowledge Base (Solution) 5061061 0 None None None 2020-05-08 09:42:35 UTC
Red Hat Product Errata RHSA-2020:4060 0 None None None 2020-09-29 21:15:11 UTC

Comment 4 Dave Wysochanski 2020-04-30 13:25:43 UTC
I don't see anything related to this problem that has not yet been backported from upstream to RHEL7.  The latest RHEL7.7.z build also contains all the fscache patches in RHEL7.

This looks like either a new problem, a driver problem, or something else.  It's interesting that the crash is in a code path similar to the one touched by the patches that just went in for https://bugzilla.redhat.com/show_bug.cgi?id=1683490, but I've not yet looked deeply, tried to reproduce it, or traced it.

Next steps may be to ask the customer to re-run with tracing and/or a debug kernel with fscache debugging turned on, to track down what is occurring.

Comment 6 Dave Wysochanski 2020-04-30 22:25:55 UTC
FWIW, in the vmcore the I/Os are roughly 28 KB reads against files 4 MB in size.  I'm going to closely mimic these sizes, especially the odd read size, which may cross page boundaries.

Comment 7 Dave Wysochanski 2020-05-01 03:05:36 UTC
I just reproduced this!

[ 8921.107413] FS-Cache:
[ 8921.122996] FS-Cache: Assertion failed
[ 8921.124677] FS-Cache: 4 == 5 is false
[ 8923.282147] ------------[ cut here ]------------
[ 8923.289081] kernel BUG at fs/fscache/operation.c:70!
[ 8923.291119] invalid opcode: 0000 [#1] SMP
[ 8923.292860] Modules linked in: nfsv3 nfs_acl nfs lockd grace cachefiles fscache ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat iptable_mangle iptable_security iptable_raw nf_conntrack ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ext4 mbcache jbd2 iosf_mbi crc32_pclmul ghash_clmulni_intel ppdev raid0 aesni_intel lrw gf128mul glue_helper ablk_helper cryptd joydev pcspkr virtio_balloon virtio_rng parport_pc parport i2c_piix4 auth_rpcgss sunrpc ip_tables xfs libcrc32c ata_generic pata_acpi virtio_net virtio_blk ata_piix crct10dif_pclmul
[ 8923.323869]  crct10dif_common libata virtio_pci virtio_ring crc32c_intel serio_raw virtio floppy dm_mirror dm_region_hash dm_log dm_mod
[ 8923.328644] CPU: 2 PID: 32001 Comm: dd Kdump: loaded Not tainted 3.10.0-1062.23.1.el7.x86_64 #1
[ 8923.332067] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 8923.334347] task: ffff931db5ee1070 ti: ffff931d29c40000 task.ti: ffff931d29c40000
[ 8923.337296] RIP: 0010:[<ffffffffc07194cf>]  [<ffffffffc07194cf>] fscache_enqueue_operation+0xff/0x170 [fscache]
[ 8923.341320] RSP: 0018:ffff931dbfc83ca8  EFLAGS: 00010046
[ 8923.343432] RAX: 0000000000000019 RBX: ffff931d2232f300 RCX: 0000000000000006
[ 8923.346237] RDX: 0000000000000000 RSI: 0000000000000046 RDI: ffff931dbfc93890
[ 8923.349048] RBP: ffff931dbfc83cb0 R08: 0000000000000000 R09: 0000000000000000
[ 8923.351855] R10: 00000000000059e1 R11: ffff931dbfc839a6 R12: ffff931d2232f300
[ 8923.354629] R13: ffff931d2ee02d20 R14: ffff931dbfc83d48 R15: ffff931d015a81e0
[ 8923.357433] FS:  00007f539cbfd740(0000) GS:ffff931dbfc80000(0000) knlGS:0000000000000000
[ 8923.360598] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8923.362870] CR2: 00000000006f2000 CR3: 0000000052ca2000 CR4: 00000000000406e0
[ 8923.382822] Call Trace:
[ 8923.383913]  <IRQ>
[ 8923.384732]  [<ffffffffc0733c12>] cachefiles_read_waiter+0x92/0xf0 [cachefiles]
[ 8923.387735]  [<ffffffff858d3942>] __wake_up_common+0x82/0x120
[ 8923.389994]  [<ffffffff858d3a63>] __wake_up_common_lock+0x83/0xc0
[ 8923.392394]  [<ffffffff858d3510>] ? task_rq_unlock+0x20/0x20
[ 8923.394634]  [<ffffffff858d3ab3>] __wake_up+0x13/0x20
[ 8923.396625]  [<ffffffff858c72a0>] __wake_up_bit+0x50/0x70
[ 8923.398768]  [<ffffffff859bdcdb>] unlock_page+0x2b/0x30
[ 8923.400845]  [<ffffffff859bdd09>] page_endio+0x29/0x90
[ 8923.460251]  [<ffffffff85a8fc81>] mpage_end_io+0x51/0x80

Comment 9 Jay Shin 2020-05-01 08:43:17 UTC
There was a similar upstream discussion, with a patch, but it seems the patch was not accepted.
It's a new bug.

https://lore.kernel.org/patchwork/patch/889373/

diff --git a/fs/cachefiles/rdwr.c b/fs/cachefiles/rdwr.c
index 883bc7bb12c5..9d5d13e150fb 100644
--- a/fs/cachefiles/rdwr.c
+++ b/fs/cachefiles/rdwr.c
@@ -58,9 +58,9 @@  static int cachefiles_read_waiter(wait_queue_entry_t *wait, unsigned mode,
 
 	spin_lock(&object->work_lock);
 	list_add_tail(&monitor->op_link, &monitor->op->to_do);
+	fscache_enqueue_retrieval(monitor->op);
 	spin_unlock(&object->work_lock);
 
-	fscache_enqueue_retrieval(monitor->op);
 	return 0;
 }

Comment 10 Jay Shin 2020-05-01 09:13:31 UTC
ah below has been committed, instead.

    commit 934140ab028713a61de8bca58c05332416d037d1
    Author: Kiran Kumar Modukuri <kiran.modukuri@gmail.com>
    Date: 2018-07-25 15:04:25 +0100
    
        cachefiles: Fix refcounting bug in backing-file read monitoring

Comment 12 Dave Wysochanski 2020-05-01 10:57:34 UTC
(In reply to Jay Shin from comment #10)
> ah below has been committed, instead.
> 
>     commit 934140ab028713a61de8bca58c05332416d037d1
>     Author: Kiran Kumar Modukuri <kiran.modukuri@gmail.com>
>     Date: 2018-07-25 15:04:25 +0100
>     
>         cachefiles: Fix refcounting bug in backing-file read monitoring

Oddly, this patch is already in RHEL7, via https://bugzilla.redhat.com/show_bug.cgi?id=1356390#c52

    O-Subject: [RHEL7 PATCH 2/5] cachefiles: Fix refcounting bug in backing-file read monitoring
    Bugzilla: 1356390

The patch header describes what sounds like this identical problem, though, so I'm not sure what's happening here.  I need to look further into whether this problem was never really fixed, whether there are multiple vectors to it, or something else.

Comment 13 Dave Wysochanski 2020-05-01 11:22:20 UTC
Created attachment 1683654 [details]
Reproducer outline v1

Reproducer outline - needs refinement, as I've only reproduced the crash once so far; I do not know whether md in the stack is significant, etc.

Comment 14 Dave Wysochanski 2020-05-01 12:59:29 UTC
I wonder if this is an overly aggressive assertion or a real bug...

Comment 16 Dave Wysochanski 2020-05-01 14:39:52 UTC
This crash may provide a clue, but I'm not sure.  When I run the test I get a lot of console spam due to low-memory situations, but those do not produce the crash.  However, it looks like the crash happened due to an interrupt taken during alloc_pages of another read, so it may provide a clue for refining the reproducer.


      KERNEL: /cores/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/3.10.0-1062.23.1.el7.x86_64/vmlinux
    DUMPFILE: /cores/retrace/tasks/289876432/crash/vmcore  [PARTIAL DUMP]
        CPUS: 8
        DATE: Thu Apr 30 19:51:43 2020
      UPTIME: 02:28:40
LOAD AVERAGE: 962.38, 667.31, 651.21
       TASKS: 2083
    NODENAME: rhel7u7-node1.dwysocha.net
     RELEASE: 3.10.0-1062.23.1.el7.x86_64
     VERSION: #1 SMP Mon Apr 27 07:15:53 EDT 2020
     MACHINE: x86_64  (1795 Mhz)
      MEMORY: 4 GB
       PANIC: "kernel BUG at fs/fscache/operation.c:70!"
         PID: 32001
     COMMAND: "dd"
        TASK: ffff931db5ee1070  [THREAD_INFO: ffff931d29c40000]
         CPU: 2
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 32001  TASK: ffff931db5ee1070  CPU: 2   COMMAND: "dd"
 #0 [ffff931dbfc83950] machine_kexec at ffffffff85865b34
 #1 [ffff931dbfc839b0] __crash_kexec at ffffffff859225d2
 #2 [ffff931dbfc83a80] crash_kexec at ffffffff859226c0
 #3 [ffff931dbfc83a98] oops_end at ffffffff85f87798
 #4 [ffff931dbfc83ac0] die at ffffffff85830a7b
 #5 [ffff931dbfc83af0] do_trap at ffffffff85f86ee0
 #6 [ffff931dbfc83b40] do_invalid_op at ffffffff8582d2a4
 #7 [ffff931dbfc83bf0] invalid_op at ffffffff85f9322e
    [exception RIP: fscache_enqueue_operation+255]
    RIP: ffffffffc07194cf  RSP: ffff931dbfc83ca8  RFLAGS: 00010046
    RAX: 0000000000000019  RBX: ffff931d2232f300  RCX: 0000000000000006
    RDX: 0000000000000000  RSI: 0000000000000046  RDI: ffff931dbfc93890
    RBP: ffff931dbfc83cb0   R8: 0000000000000000   R9: 0000000000000000
    R10: 00000000000059e1  R11: ffff931dbfc839a6  R12: ffff931d2232f300
    R13: ffff931d2ee02d20  R14: ffff931dbfc83d48  R15: ffff931d015a81e0
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #8 [ffff931dbfc83ca0] fscache_enqueue_operation at ffffffffc07194cf [fscache]
 #9 [ffff931dbfc83cb8] cachefiles_read_waiter at ffffffffc0733c12 [cachefiles]
#10 [ffff931dbfc83ce0] __wake_up_common at ffffffff858d3942
#11 [ffff931dbfc83d38] __wake_up_common_lock at ffffffff858d3a63
#12 [ffff931dbfc83da8] __wake_up at ffffffff858d3ab3
#13 [ffff931dbfc83db8] __wake_up_bit at ffffffff858c72a0
#14 [ffff931dbfc83de8] unlock_page at ffffffff859bdcdb
#15 [ffff931dbfc83e00] page_endio at ffffffff859bdd09
#16 [ffff931dbfc83e20] mpage_end_io at ffffffff85a8fc81
#17 [ffff931dbfc83e48] bio_endio at ffffffff85a89de7
#18 [ffff931dbfc83e70] blk_update_request at ffffffff85b519d0
#19 [ffff931dbfc83eb0] blk_mq_end_request at ffffffff85b5bb7a
#20 [ffff931dbfc83ed0] virtblk_request_done at ffffffffc02bc3bf [virtio_blk]
#21 [ffff931dbfc83ee0] blk_done_softirq at ffffffff85b592f6
#22 [ffff931dbfc83f20] __do_softirq at ffffffff858a54b5
#23 [ffff931dbfc83f90] call_softirq at ffffffff85f9342c
#24 [ffff931dbfc83fa8] do_softirq at ffffffff8582f715
#25 [ffff931dbfc83fc8] irq_exit at ffffffff858a5835
#26 [ffff931dbfc83fe0] smp_call_function_single_interrupt at ffffffff85859479
#27 [ffff931dbfc83ff0] call_function_single_interrupt at ffffffff85f921aa
--- <IRQ stack> ---
#28 [ffff931d29c43638] call_function_single_interrupt at ffffffff85f921aa
    [exception RIP: get_page_from_freelist+679]
    RIP: ffffffff859c7ae7  RSP: ffff931d29c436e0  RFLAGS: 00000246
    RAX: 0000000000000022  RBX: ffff931d29c436d0  RCX: 000000000001fc30
    RDX: 000000000001fcff  RSI: 000000000000001f  RDI: 0000000000000246
    RBP: ffff931d29c437e8   R8: fffffffffffffff6   R9: 0000000000000000
    R10: ffff931dbfb83c60  R11: 000000000000327a  R12: ffffe9b6406f9940
    R13: 0000000000000001  R14: ffff931d22276d00  R15: ffffffff85b581ff
    ORIG_RAX: ffffffffffffff04  CS: 0010  SS: 0018
#29 [ffff931d29c437f0] __alloc_pages_nodemask at ffffffff859c8446
#30 [ffff931d29c43898] alloc_pages_current at ffffffff85a16d38
#31 [ffff931d29c438e0] __page_cache_alloc at ffffffff859bd957
#32 [ffff931d29c43918] cachefiles_read_or_alloc_pages at ffffffffc07349f2 [cachefiles]
#33 [ffff931d29c43a18] __fscache_read_or_alloc_pages at ffffffffc071bba6 [fscache]
#34 [ffff931d29c43a60] __nfs_readpages_from_fscache at ffffffffc077b294 [nfs]
#35 [ffff931d29c43ab0] nfs_readpages at ffffffffc0771583 [nfs]
#36 [ffff931d29c43ba8] __do_page_cache_readahead at ffffffff859cabef
#37 [ffff931d29c43c68] ondemand_readahead at ffffffff859cad9f
#38 [ffff931d29c43cb0] page_cache_sync_readahead at ffffffff859cb1c4
#39 [ffff931d29c43cf0] generic_file_aio_read at ffffffff859bf612
#40 [ffff931d29c43dc8] nfs_file_read at ffffffffc0764161 [nfs]
#41 [ffff931d29c43e00] do_sync_read at ffffffff85a49fa3
#42 [ffff931d29c43ed8] vfs_read at ffffffff85a4a9df
#43 [ffff931d29c43f08] sys_read at ffffffff85a4b84f
#44 [ffff931d29c43f50] system_call_fastpath at ffffffff85f8fede
    RIP: 00007f539c709990  RSP: 00007fffc5dea030  RFLAGS: 00010206
    RAX: 0000000000000000  RBX: 0000000000611640  RCX: ffffffff85f8fe21
    RDX: 0000000000007000  RSI: 00000000006eb000  RDI: 0000000000000000
    RBP: 0000000000007000   R8: 0000000000000003   R9: 0000000000009003
    R10: 00007fffc5de9aa0  R11: 0000000000000246  R12: 00000000006eb000
    R13: 0000000000000000  R14: 00007fffc5deb819  R15: 00007fffc5dea288
    ORIG_RAX: 0000000000000000  CS: 0033  SS: 002b

Comment 17 Dave Wysochanski 2020-05-01 15:01:22 UTC
I just got a second crash with only 18 minutes of uptime.  It does not have the same initial backtrace, but this time the interrupted thread was the idle thread.

Comment 19 Dave Wysochanski 2020-05-01 15:07:19 UTC
The second crash: the machine had only been up 18 minutes, and the interrupt hit an idle thread.


      KERNEL: /cores/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/3.10.0-1062.23.1.el7.x86_64/vmlinux
    DUMPFILE: /cores/retrace/tasks/596434237/crash/vmcore  [PARTIAL DUMP]
        CPUS: 8
        DATE: Fri May  1 10:48:39 2020
      UPTIME: 00:18:02
LOAD AVERAGE: 388.53, 707.87, 547.47
       TASKS: 1659
    NODENAME: rhel7u7-node1.dwysocha.net
     RELEASE: 3.10.0-1062.23.1.el7.x86_64
     VERSION: #1 SMP Mon Apr 27 07:15:53 EDT 2020
     MACHINE: x86_64  (1795 Mhz)
      MEMORY: 4 GB
       PANIC: "kernel BUG at fs/fscache/operation.c:70!"

crash> log | grep "=="
[ 1083.310044] FS-Cache: 4 == 5 is false
crash> bt
PID: 0      TASK: ffff9ae279510000  CPU: 4   COMMAND: "swapper/4"
 #0 [ffff9ae27fd03950] machine_kexec at ffffffff85465b34
 #1 [ffff9ae27fd039b0] __crash_kexec at ffffffff855225d2
 #2 [ffff9ae27fd03a80] crash_kexec at ffffffff855226c0
 #3 [ffff9ae27fd03a98] oops_end at ffffffff85b87798
 #4 [ffff9ae27fd03ac0] die at ffffffff85430a7b
 #5 [ffff9ae27fd03af0] do_trap at ffffffff85b86ee0
 #6 [ffff9ae27fd03b40] do_invalid_op at ffffffff8542d2a4
 #7 [ffff9ae27fd03bf0] invalid_op at ffffffff85b9322e
    [exception RIP: fscache_enqueue_operation+255]
    RIP: ffffffffc04674cf  RSP: ffff9ae27fd03ca8  RFLAGS: 00010046
    RAX: 0000000000000019  RBX: ffff9ae1f80b2480  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: 0000000000000082  RDI: 0000000000000082
    RBP: ffff9ae27fd03cb0   R8: 000000008e9111e2   R9: ffffffff864e49ad
    R10: 000000000001ef38  R11: 0000000000100000  R12: ffff9ae1f80b2480
    R13: ffff9ae143ed0ba0  R14: ffff9ae27fd03d48  R15: ffff9ae1c56089c0
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #8 [ffff9ae27fd03ca0] fscache_enqueue_operation at ffffffffc04674cf [fscache]
 #9 [ffff9ae27fd03cb8] cachefiles_read_waiter at ffffffffc0443c12 [cachefiles]
#10 [ffff9ae27fd03ce0] __wake_up_common at ffffffff854d3942
#11 [ffff9ae27fd03d38] __wake_up_common_lock at ffffffff854d3a63
#12 [ffff9ae27fd03da8] __wake_up at ffffffff854d3ab3
#13 [ffff9ae27fd03db8] __wake_up_bit at ffffffff854c72a0
#14 [ffff9ae27fd03de8] unlock_page at ffffffff855bdcdb
#15 [ffff9ae27fd03e00] page_endio at ffffffff855bdd09
#16 [ffff9ae27fd03e20] mpage_end_io at ffffffff8568fc81
#17 [ffff9ae27fd03e48] bio_endio at ffffffff85689de7
#18 [ffff9ae27fd03e70] blk_update_request at ffffffff857519d0
#19 [ffff9ae27fd03eb0] blk_mq_end_request at ffffffff8575bb7a
#20 [ffff9ae27fd03ed0] virtblk_request_done at ffffffffc010a3bf [virtio_blk]
#21 [ffff9ae27fd03ee0] blk_done_softirq at ffffffff857592f6
#22 [ffff9ae27fd03f20] __do_softirq at ffffffff854a54b5
#23 [ffff9ae27fd03f90] call_softirq at ffffffff85b9342c
#24 [ffff9ae27fd03fa8] do_softirq at ffffffff8542f715
#25 [ffff9ae27fd03fc8] irq_exit at ffffffff854a5835
#26 [ffff9ae27fd03fe0] smp_call_function_single_interrupt at ffffffff85459479
#27 [ffff9ae27fd03ff0] call_function_single_interrupt at ffffffff85b921aa
--- <IRQ stack> ---
#28 [ffff9ae27950fdf8] call_function_single_interrupt at ffffffff85b921aa
    [exception RIP: native_safe_halt+11]
    RIP: ffffffff85b842eb  RSP: ffff9ae27950fea8  RFLAGS: 00000246
    RAX: ffffffff85b840a0  RBX: ffffffff854d51ae  RCX: 0100000000000000
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: 0000000000000046
    RBP: ffff9ae27950fea8   R8: 0000000000000000   R9: 0000000000000001
    R10: 0000000000000000  R11: 000000000000000c  R12: ffff9ae27950dfd8
    R13: 0000000000000000  R14: ffff9ae279510698  R15: 0000000100000286
    ORIG_RAX: ffffffffffffff04  CS: 0010  SS: 0018
#29 [ffff9ae27950feb0] default_idle at ffffffff85b840be
#30 [ffff9ae27950fed0] arch_cpu_idle at ffffffff85437c80
#31 [ffff9ae27950fee0] cpu_startup_entry at ffffffff8550185a
#32 [ffff9ae27950ff28] start_secondary at ffffffff8545a0c7
#33 [ffff9ae27950ff50] start_cpu at ffffffff854000d5

Comment 20 Dave Wysochanski 2020-05-01 19:43:07 UTC
Just reproduced this on upstream plus a few patches that address other upstream issues.
This is going to be a longer road to get fixed: it is an upstream problem, and just reproducing it requires other upstream patches that are still queued and not yet accepted.


[  681.410052] 0 pages hwpoisoned
[  707.035117] bash (13115): drop_caches: 3
[  707.107184] FS-Cache:
[  707.115381] FS-Cache: Assertion failed
[  707.117343] FS-Cache: 4 == 5 is false
[  707.122937] ------------[ cut here ]------------
[  707.146624] kernel BUG at fs/fscache/operation.c:69!
[  707.157357] invalid opcode: 0000 [#1] SMP PTI
[  707.159160] CPU: 5 PID: 11070 Comm: kworker/u16:13 Kdump: loaded Not tainted 5.7.0-rc3-bz1829662-1+ #4
[  707.176470] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[  707.189125] Workqueue:  0x0 (fscache_object)
[  707.195660] RIP: 0010:fscache_enqueue_operation.cold+0x8c/0x12a [fscache]
[  707.198639] Code: c7 28 86 4b c0 e8 c5 0e c9 f3 48 c7 c7 36 86 4b c0 e8 b9 0e c9 f3 8b 75 40 ba 05 00 00 00 48 c7 c7 b8 6c 4b c0 e8 a5 0e c9 f3 <0f> 0b 65 48 8b 34 25 c0 8b 01 00 48 
c7 c7 05 87 4b c0 48 81 c6 78
[  707.206758] RSP: 0018:ffffb11680188d80 EFLAGS: 00010046
[  707.209509] RAX: 0000000000000019 RBX: ffff9ea1e564e0c0 RCX: 0000000000000007
[  707.213054] RDX: 0000000000000000 RSI: 0000000000000086 RDI: ffff9ea3b7d59c20
[  707.216444] RBP: ffff9ea1dcd23840 R08: 00000000000002d6 R09: 0000000000000028
[  707.219751] R10: ffffb11680188c30 R11: 0000000000000000 R12: ffff9ea35a2ccf00
[  707.222812] R13: ffff9ea1e564eac0 R14: ffff9ea1e564e100 R15: ffff9ea1dcd238c8
[  707.225707] FS:  0000000000000000(0000) GS:ffff9ea3b7d40000(0000) knlGS:0000000000000000
[  707.229339] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  707.231734] CR2: 00000000000000b0 CR3: 000000020927e000 CR4: 00000000000406e0
[  707.234711] Call Trace:
[  707.235769]  <IRQ>
[  707.236829]  cachefiles_read_waiter+0x13b/0x160 [cachefiles]
[  707.239147]  __wake_up_common+0x80/0x180
[  707.240962]  wake_up_page_bit+0xab/0x100
[  707.242768]  __read_end_io+0xb6/0x120
[  707.244350]  blk_update_request+0x17b/0x2d0
[  707.246066]  blk_mq_end_request+0x1c/0x130
[  707.267961]  blk_done_softirq+0xa1/0xd0
[  707.269567]  __do_softirq+0xee/0x2ff
[  707.271055]  irq_exit+0xe9/0xf0
[  707.272355]  do_IRQ+0x4d/0xd0
[  707.273595]  common_interrupt+0xf/0xf
[  707.275057]  </IRQ>
[  707.275936] RIP: 0010:finish_task_switch+0x7d/0x250
[  707.277850] Code: 90 65 48 8b 1c 25 c0 8b 01 00 66 66 66 66 90 66 66 66 66 90 41 c7 45 38 00 00 00 00 4c 89 e7 e8 19 cf 02 00 66 90 fb 66 66 90 <66> 66 90 4d 85 f6 74 21 65 48 8b 04

Comment 27 Dave Wysochanski 2020-05-05 21:40:08 UTC
Upon further review, I think Jay and Radomir are correct to suspect that one of the patches that fixed the previous bug is related to this issue:
https://bugzilla.redhat.com/show_bug.cgi?id=1683490#c11

Even though the commits fix obvious problems with inverted logic, I also started suspecting that the commit below is related, based on code paths and some testing data, even though I have not yet fully put the pieces together:
b27ddd462453 fscache: Pass the correct cancelled indications to fscache_op_complete()

If we wanted to, and people are concerned, we could revert the above commit in the z-streams as a shorter-term measure.  However, this is an upstream bug as well, so we will need to track it down and fix it properly.

Note that we only started seeing this new bug _after_ we pulled in the above patch, which fixed inverted logic in multiple places in the code.  However, it was not the primary fix for the original bug, so in that sense it was a peripheral change.  What follows is a summary of my understanding so far.  I have not yet obtained a trace that definitively shows the full picture of what is happening, but I'm working on it.

This bug hits an assertion: op->state is FSCACHE_OP_ST_COMPLETE (4) when it should be FSCACHE_OP_ST_CANCELLED (5):
kernel BUG at fs/fscache/operation.c:70! FS-Cache: 4 == 5 is false
     50 /**
     51  * fscache_enqueue_operation - Enqueue an operation for processing
     52  * @op: The operation to enqueue
     53  *
     54  * Enqueue an operation for processing by the FS-Cache thread pool.
     55  *
     56  * This will get its own ref on the object.
     57  */
     58 void fscache_enqueue_operation(struct fscache_operation *op)
     59 {
     60         struct fscache_cookie *cookie = op->object->cookie;
     61
     62         _enter("{OBJ%x OP%x,%u}",
     63                op->object->debug_id, op->debug_id, atomic_read(&op->usage));
     64
     65         ASSERT(list_empty(&op->pend_link));
     66         ASSERT(op->processor != NULL);
     67         ASSERT(fscache_object_is_available(op->object));
     68         ASSERTCMP(atomic_read(&op->usage), >, 0);
     69         ASSERTIFCMP(op->state != FSCACHE_OP_ST_IN_PROGRESS,
---> 70                     op->state, ==,  FSCACHE_OP_ST_CANCELLED);

I started by looking at which code actually sets op->state = FSCACHE_OP_ST_COMPLETE.  As it turns out, this is done in only one place and, not surprisingly, it is inside the async operation code of fscache (for more info, see Documentation/filesystems/caching/operations.txt), in fscache_op_complete():
fs/fscache/operation.c
    439 /*
    440  * Record the completion or cancellation of an in-progress operation.
    441  */
    442 void fscache_op_complete(struct fscache_operation *op, bool cancelled)
    443 {
    444         struct fscache_object *object = op->object;
    445 
    446         _enter("OBJ%x", object->debug_id);
    447 
    448         ASSERTCMP(op->state, ==, FSCACHE_OP_ST_IN_PROGRESS);
    449         ASSERTCMP(object->n_in_progress, >, 0);
    450         ASSERTIFCMP(test_bit(FSCACHE_OP_EXCLUSIVE, &op->flags),
    451                     object->n_exclusive, >, 0);
    452         ASSERTIFCMP(test_bit(FSCACHE_OP_EXCLUSIVE, &op->flags),
    453                     object->n_in_progress, ==, 1);
    454 
    455         spin_lock(&object->lock);
    456 
    457         if (!cancelled) {
    458                 op->state = FSCACHE_OP_ST_COMPLETE;
    459         } else {
    460                 op->cancel(op);
    461                 op->state = FSCACHE_OP_ST_CANCELLED;
    462         }
    463 

I looked at that commit and thought: OK, so we changed that last argument because the logic was erroneously inverted.  If FSCACHE_OP_ST_COMPLETE is set by someone racing ahead, that could give us this issue, whereas an erroneous FSCACHE_OP_ST_CANCELLED would be benign (the latter would pass the ASSERTIFCMP hit in this bug).  COMPLETE is only set when we call fscache_op_complete(..., false).

So then I looked at the commit and noticed there were 3 instances where the last argument was changed from true to false, so I built 3 test kernels, each with one of the 3 hunks reverted.  With the first hunk reverted, the problem reproduced easily.  Reverting the second hunk rendered fscache nonfunctional.  The third hunk changed fscache_retrieval_complete(), and with it reverted the test ran OK again.  Why might this make sense?  Well, I'm pretty sure what is actually happening is some sort of race between cachefiles_read_backing_file() and the "page monitor callback" cachefiles_read_waiter().  I'm pretty sure we are somehow getting into the following situation:
cachefiles_read_backing_file() 
 -> fscache_retrieval_complete()
   -> fscache_op_complete(.., false)

then at some later time
cachefiles_read_waiter() gets called (via backtrace in this bug) 
-> fscache_enqueue_operation()

The last operation blows up because op->state is not correct - the op has already completed.  So if this is what is happening, either the monitor callback needs to be cancelled, or we need to recognize the state inside the monitor and not call fscache_enqueue_operation().  But I need to get a trace of what is happening, unless I or someone else can spot something obvious there.  I also note that, unfortunately, fscache_enqueue_operation() has a fairly large number of callers, so reverting hunk #3 is a fairly significant logic change.


Undo hunk #3 in commit b27ddd462453:
Test #3:
diff --git a/include/linux/fscache-cache.h b/include/linux/fscache-cache.h
index 3b03e29e2f1a..b19fa8592fc2 100644
--- a/include/linux/fscache-cache.h
+++ b/include/linux/fscache-cache.h
@@ -185,7 +185,7 @@ static inline void fscache_retrieval_complete(struct fscache_retrieval *op,
 {
        atomic_sub(n_pages, &op->n_pages);
        if (atomic_read(&op->n_pages) <= 0)
-               fscache_op_complete(&op->op, true);
+               fscache_op_complete(&op->op, false);
 }

Comment 29 Dave Wysochanski 2020-05-06 18:50:43 UTC
Jay - thanks for the willingness to help.  At this point it is a matter of tracking down the code path(s) in question and then crafting an acceptable patch.  I was having problems with kdump and ftrace, so I decided to forgo ftrace and be more laser-focused on this specific problem and its code paths; I wrote some tracing code specific to this bug.  While ftrace is very useful for seeing the general picture, and there are some good fscache trace points upstream, to my knowledge ftrace cannot trace code paths into a per-structure buffer, which is what is needed here.  With ftrace it is very easy to blow out the whole buffer with irrelevant information, and/or to change the timing enough to affect reproducibility; both were happening to me.

So, after adding custom tracing code to record specific events on a per-fscache_operation basis, I have a somewhat clearer picture of this bug, per the below.  It is a race between cachefiles_read_copier() and cachefiles_read_waiter(): read_copier calls fscache_retrieval_complete() and ends up inside fscache_op_complete() (the same code that was changed in hunk #3, per comment #27).  And right after read_copier runs, cachefiles_read_waiter() is called via the end_io() wake_up on the page queue.  So I now have better clarity about what is going on in this bug, though a few of the specifics (and thus the solution) are still unclear.

[   79.653142] 4294745512 pid=11130 comm=kworker/u16:11 file:line=fs/fscache/operation.c:661 func=fscache_op_work_func op->state=3 event=fscache_put_operation
[   79.653143] 4294745512 pid=13074 comm=pmgetopt file:line=fs/cachefiles/rdwr.c:65 func=cachefiles_read_waiter op->state=3 event=fscache_enqueue_retrieval->fscache_enqueue_operation
[   79.653144] 4294745512 pid=13074 comm=pmgetopt file:line=include/linux/fscache-cache.h:252 func=fscache_put_retrieval op->state=3 event=fscache_put_operation
[   79.653145] 4294745512 pid=554 comm=kworker/u16:4 file:line=fs/cachefiles/rdwr.c:204 func=cachefiles_read_copier op->state=3 event=fscache_retrieval_complete
[   79.653146] 4294745512 pid=13074 comm=pmgetopt file:line=include/linux/fscache-cache.h:252 func=fscache_put_retrieval op->state=3 event=fscache_put_operation
[   79.653147] 4294745512 pid=13074 comm=pmgetopt file:line=fs/cachefiles/rdwr.c:65 func=cachefiles_read_waiter op->state=3 event=fscache_enqueue_retrieval->fscache_enqueue_operation
[   79.653147] 4294745512 pid=554 comm=kworker/u16:4 file:line=include/linux/fscache-cache.h:252 func=fscache_put_retrieval op->state=3 event=fscache_put_operation
[   79.653148] 4294745512 pid=13074 comm=pmgetopt file:line=fs/fscache/operation.c:110 func=fscache_enqueue_operation_real op->state=3 event=fscache_put_operation
[   79.653149] 4294745512 pid=554 comm=kworker/u16:4 file:line=fs/cachefiles/rdwr.c:204 func=cachefiles_read_copier op->state=3 event=fscache_retrieval_complete
[   79.653150] 4294745512 pid=554 comm=kworker/u16:4 file:line=include/linux/fscache-cache.h:252 func=fscache_put_retrieval op->state=3 event=fscache_put_operation
[   79.653151] 4294745512 pid=554 comm=kworker/u16:4 file:line=fs/fscache/operation.c:661 func=fscache_op_work_func op->state=3 event=fscache_put_operation
[   79.653152] 4294745512 pid=59 comm=kworker/u16:1 file:line=fs/cachefiles/rdwr.c:204 func=cachefiles_read_copier op->state=3 event=fscache_retrieval_complete
[   79.653152] 4294745512 pid=59 comm=kworker/u16:1 file:line=fs/cachefiles/rdwr.c:204 func=cachefiles_read_copier op->state=3 event=fscache_retrieval_complete->fscache_op_complete
[   79.653155] 4294745512 pid=59 comm=kworker/u16:1 file:line=fs/cachefiles/rdwr.c:204 func=cachefiles_read_copier op->state=3 event=fscache_op_complete cancelled=false
[   79.662397] 4294745512 pid=59 comm=kworker/u16:1 file:line=include/linux/fscache-cache.h:252 func=fscache_put_retrieval op->state=4 event=fscache_put_operation
[   79.662398] 4294745512 pid=59 comm=kworker/u16:1 file:line=fs/fscache/operation.c:661 func=fscache_op_work_func op->state=4 event=fscache_put_operation
[   79.662399] 4294745512 pid=13074 comm=pmgetopt file:line=fs/cachefiles/rdwr.c:65 func=cachefiles_read_waiter op->state=4 event=fscache_enqueue_retrieval->fscache_enqueue_operation
[   79.662410] FS-Cache:
[   79.662411] FS-Cache: Assertion failed
[   79.662412] FS-Cache: 4 == 5 is false
[   79.662754] ------------[ cut here ]------------
[   79.677409] kernel BUG at fs/fscache/operation.c:100!
[   79.677498] invalid opcode: 0000 [#1] SMP PTI
[   79.690824] CPU: 7 PID: 13074 Comm: pmgetopt Kdump: loaded Not tainted 5.7.0-rc3-1836833f67ab-dbg+ #13
[   79.690825] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[   79.690841] RIP: 0010:fscache_enqueue_operation_real.cold+0xa3/0x141 [fscache]
[   79.690845] Code: c7 64 1a 7e c0 e8 6e 80 96 e7 48 c7 c7 72 1a 7e c0 e8 62 80 96 e7 8b 75 40 ba 05 00 00 00 48 c7 c7 60 ff 7d c0 e8 4e 80 96 e7 <0f> 0b 65 48 8b 34 25 c0 8b 01 00 48 c7 c
[   79.690846] RSP: 0018:ffffb740001e0da0 EFLAGS: 00010046
[   79.690848] RAX: 0000000000000019 RBX: ffff893b102c2640 RCX: 0000000000000007
[   79.690849] RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffff893c37dd9c20
[   79.690850] RBP: ffff893b1e640000 R08: 000000128c3ff314 R09: ffffffffa9efc024
[   79.690852] R10: 0000000000000368 R11: 000000000002c288 R12: ffff893b48e3bf80
[   79.690853] R13: ffff893b102c30b8 R14: ffff893a1203e580 R15: ffff893b1e650090
[   79.690855] FS:  0000000000000000(0000) GS:ffff893c37dc0000(0000) knlGS:0000000000000000
[   79.690856] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   79.690857] CR2: 00007fc5964c72f0 CR3: 00000000348d0000 CR4: 00000000000406e0
[   79.690862] Call Trace:
[   79.707917]  <IRQ>
[   79.707956]  cachefiles_read_waiter+0x1a7/0x230 [cachefiles]
[   79.707983]  __wake_up_common+0x80/0x180
[   79.717234]  wake_up_page_bit+0xab/0x100
[   79.725948]  __read_end_io+0xb6/0x120
[   79.737417]  blk_update_request+0x17b/0x2d0
[   79.744330]  blk_mq_end_request+0x1c/0x130
[   79.751155]  blk_done_softirq+0xa1/0xd0
[   79.762223]  __do_softirq+0xee/0x2ff
[   79.762256]  irq_exit+0xe9/0xf0

Comment 30 Dave Wysochanski 2020-05-06 22:24:43 UTC
OK, so my next trace proves that this bug is another manifestation of the first race, the one Jay pointed out in comment #9 that started the email thread:
https://lore.kernel.org/patchwork/patch/889373/

diff --git a/fs/cachefiles/rdwr.c b/fs/cachefiles/rdwr.c
index 883bc7bb12c5..9d5d13e150fb 100644
--- a/fs/cachefiles/rdwr.c
+++ b/fs/cachefiles/rdwr.c
@@ -58,9 +58,9 @@  static int cachefiles_read_waiter(wait_queue_entry_t *wait, unsigned mode,
 
 	spin_lock(&object->work_lock);
 	list_add_tail(&monitor->op_link, &monitor->op->to_do);
+	fscache_enqueue_retrieval(monitor->op);
 	spin_unlock(&object->work_lock);
 
-	fscache_enqueue_retrieval(monitor->op);
 	return 0;
 }

Basically, the patch committed in comment #10 does indeed fix a use-after-free bug. However, a race still remains: the read_waiter can add the monitor to op->to_do and release the spinlock, and in that window the read_copier can run all the way through fscache_retrieval_complete() before the read_waiter even calls fscache_enqueue_retrieval(). The discussion died out and missed that the original poster's patch in comment #9 was still needed to fix this second race.

Since this is an fscache_retrieval operation, op->flags has FSCACHE_OP_MYTHREAD set. If you look at fscache_enqueue_operation() for that case (note that fscache_enqueue_retrieval() is a one-liner that calls fscache_enqueue_operation()), the logic is purely a few ASSERT checks, stats accounting, and ftrace, so it has no functional effect. In theory, another approach would be to rework the ASSERT being hit so that it tolerates the race and accepts FSCACHE_OP_ST_COMPLETE (i.e. assume our thread already picked the op up). In essence there is a small missing piece in the API here, since the real "enqueue" happened in the caller, not here, but that is probably rambling further than we need to go.

 58 void fscache_enqueue_operation(struct fscache_operation *op)
 59 {
 60         struct fscache_cookie *cookie = op->object->cookie;
 61 
 62         _enter("{OBJ%x OP%x,%u}",
 63                op->object->debug_id, op->debug_id, atomic_read(&op->usage));
 64 
 65         ASSERT(list_empty(&op->pend_link));
 66         ASSERT(op->processor != NULL);
 67         ASSERT(fscache_object_is_available(op->object));
 68         ASSERTCMP(atomic_read(&op->usage), >, 0);
 69         ASSERTIFCMP(op->state != FSCACHE_OP_ST_IN_PROGRESS,
 70                     op->state, ==,  FSCACHE_OP_ST_CANCELLED);
 71 
 72         fscache_stat(&fscache_n_op_enqueue);
 73         switch (op->flags & FSCACHE_OP_TYPE) {
 74         case FSCACHE_OP_ASYNC:
 75                 trace_fscache_op(cookie, op, fscache_op_enqueue_async);
 76                 _debug("queue async");
 77                 atomic_inc(&op->usage);
 78                 if (!queue_work(fscache_op_wq, &op->work))
 79                         fscache_put_operation(op);
 80                 break;
 81         case FSCACHE_OP_MYTHREAD:
 82                 trace_fscache_op(cookie, op, fscache_op_enqueue_mythread);
 83                 _debug("queue for caller's attention");
 84                 break;
 85         default:
 86                 pr_err("Unexpected op type %lx", op->flags);
 87                 BUG();
 88                 break;
 89         }
 90 }
 91 EXPORT_SYMBOL(fscache_enqueue_operation);


I am going to prep the original patch that moves the call to fscache_enqueue_retrieval() inside the spinlock, with credit to the author, and probably add a sentence or two of explanation, because that is probably the most reasonable approach given where we are.


[   94.996481] 4294760822 pid=41 comm=ksoftirqd/6 file:line=fs/cachefiles/rdwr.c:62 func=cachefiles_read_waiter op->state=3 event=cachefiles_read_waiter: adding to monitor to_do
[   95.002117] 4294760822 pid=11120 comm=kworker/u16:8 file:line=include/linux/fscache-cache.h:252 func=fscache_put_retrieval op->state=3 event=fscache_put_operation
[   95.008096] 4294760822 pid=41 comm=ksoftirqd/6 file:line=fs/cachefiles/rdwr.c:66 func=cachefiles_read_waiter op->state=3 event=fscache_enqueue_retrieval->fscache_enqueue_operation
[   95.013673] 4294760822 pid=11120 comm=kworker/u16:8 file:line=fs/cachefiles/rdwr.c:201 func=cachefiles_read_copier op->state=3 event=cachefiles_read_copier: error=0
[   95.019839] 4294760822 pid=11120 comm=kworker/u16:8 file:line=fs/cachefiles/rdwr.c:206 func=cachefiles_read_copier op->state=3 event=fscache_retrieval_complete
[   95.025515] 4294760822 pid=11120 comm=kworker/u16:8 file:line=fs/cachefiles/rdwr.c:206 func=cachefiles_read_copier op->state=3 event=fscache_retrieval_complete->fscache_op_complete
[   95.030996] 4294760822 pid=11120 comm=kworker/u16:8 file:line=fs/cachefiles/rdwr.c:206 func=cachefiles_read_copier op->state=3 event=fscache_op_complete cancelled=false
[   95.037197] 4294760822 pid=11120 comm=kworker/u16:8 file:line=include/linux/fscache-cache.h:252 func=fscache_put_retrieval op->state=4 event=fscache_put_operation
[   95.042968] 4294760822 pid=11120 comm=kworker/u16:8 file:line=fs/fscache/operation.c:661 func=fscache_op_work_func op->state=4 event=fscache_put_operation
[   95.048545] 4294760822 pid=11120 comm=kworker/u16:8 file:line=fs/fscache/operation.c:661 func=fscache_op_work_func op->state=4 event=fscache_put_operation
[   95.059218] FS-Cache: 
[   95.060174] FS-Cache: Assertion failed
[   95.061647] FS-Cache: 4 == 5 is false
[   95.063659] ------------[ cut here ]------------
[   95.065486] kernel BUG at fs/fscache/operation.c:100!
[   95.067596] invalid opcode: 0000 [#1] SMP PTI
[   95.069333] CPU: 6 PID: 41 Comm: ksoftirqd/6 Kdump: loaded Not tainted 5.7.0-rc3-1836833f67ab-dbg+ #13
[   95.072973] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[   95.075333] RIP: 0010:fscache_enqueue_operation_real.cold+0xa3/0x141 [fscache]
[   95.078163] Code: c7 64 3a 4c c0 e8 6e 60 c8 c7 48 c7 c7 72 3a 4c c0 e8 62 60 c8 c7 8b 75 40 ba 05 00 00 00 48 c7 c7 60 1f 4c c0 e8 4e 60 c8 c7 <0f> 0b 65 48 8b 34 25 c0 8b 01 00 48 c7 c
[   95.085371] RSP: 0018:ffffb80a4019fca0 EFLAGS: 00010046
[   95.087419] RAX: 0000000000000019 RBX: ffffa061f6faa640 RCX: 0000000000000007
[   95.090172] RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffffa061f7d99c20
[   95.092915] RBP: ffffa06032120000 R08: 00000000000003a8 R09: 0000000000000028
[   95.095708] R10: ffffb80a4019fb50 R11: 0000000000000000 R12: ffffa06130590f00
[   95.098486] R13: ffffa061f6fab0b8 R14: ffffa061f13bb820 R15: ffffa06032130090
[   95.101254] FS:  0000000000000000(0000) GS:ffffa061f7d80000(0000) knlGS:0000000000000000
[   95.104397] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   95.106631] CR2: 00007fb88756c320 CR3: 00000000ae2f6000 CR4: 00000000000406e0
[   95.109419] Call Trace:
[   95.110504]  cachefiles_read_waiter+0x216/0x290 [cachefiles]
[   95.112818]  __wake_up_common+0x80/0x180
[   95.114459]  wake_up_page_bit+0xab/0x100
[   95.116054]  __read_end_io+0xb6/0x120
[   95.117572]  blk_update_request+0x17b/0x2d0
[   95.119299]  blk_mq_end_request+0x1c/0x130
[   95.120967]  blk_done_softirq+0xa1/0xd0
[   95.122536]  __do_softirq+0xee/0x2ff
[   95.123986]  run_ksoftirqd+0x26/0x40
[   95.125451]  smpboot_thread_fn+0xc5/0x160
[   95.127059]  kthread+0xf9/0x130


    In read_waiter:
         20 static int cachefiles_read_waiter(wait_queue_entry_t *wait, unsigned mode,
         21                                   int sync, void *_key)
         22 {
    ..
         49
         50         /* move onto the action list and queue for FS-Cache thread pool */
         51         ASSERT(op);
         52
         53         /* We need to temporarily bump the usage count as we don't own a ref
         54          * here otherwise cachefiles_read_copier() may free the op between the
         55          * monitor being enqueued on the op->to_do list and the op getting
         56          * enqueued on the work queue.
         57          */
         58         fscache_get_retrieval(op);
         59
         60         object = container_of(op->op.object, struct cachefiles_object, fscache);
         61         spin_lock(&object->work_lock);
         62         fscache_op_event_record(&op->op, "cachefiles_read_waiter: adding to monitor to_do");
         63         list_add_tail(&monitor->op_link, &op->to_do);
         64         spin_unlock(&object->work_lock);
         65
         66         fscache_enqueue_retrieval(op);
         67         fscache_put_retrieval(op);
         68         return 0;
         69 }

    In read_copier:
        149 /*
        150  * copy data from backing pages to netfs pages to complete a read operation
        151  * - driven by FS-Cache's thread pool
        152  */
        153 static void cachefiles_read_copier(struct fscache_operation *_op)
        154 {
    ...
        167         spin_lock_irq(&object->work_lock);
        168
        169         while (!list_empty(&op->to_do)) {
        170                 monitor = list_entry(op->to_do.next,
        171                                      struct cachefiles_one_read, op_link);
        172                 list_del(&monitor->op_link);
        173
        174                 spin_unlock_irq(&object->work_lock);
    ...
        178         recheck:
        179                 if (test_bit(FSCACHE_COOKIE_INVALIDATING,
        180                              &object->fscache.cookie->flags)) {
        181                         error = -ESTALE;
        182                 } else if (PageUptodate(monitor->back_page)) {
        183                         copy_highpage(monitor->netfs_page, monitor->back_page);
        184                         fscache_mark_page_cached(monitor->op,
        185                                                  monitor->netfs_page);
        186                         error = 0;
        187                 } else if (!PageError(monitor->back_page)) {
        188                         /* the page has probably been truncated */
        189                         error = cachefiles_read_reissue(object, monitor);
        190                         if (error == -EINPROGRESS)
        191                                 goto next;
        192                         goto recheck;
        193                 } else {
        194                         cachefiles_io_error_obj(
        195                                 object,
        196                                 "Readpage failed on backing file %lx",
        197                                 (unsigned long) monitor->back_page->flags);
        198                         error = -EIO;
        199                 }
        200
        201                 fscache_op_event_record(&op->op, error == 0 ? "cachefiles_read_copier: error=0":"cachefiles_read_copier: error=-ESTALE or -EIO");
        202                 put_page(monitor->back_page);
        203
        204                 fscache_end_io(op, monitor->netfs_page, error);
        205                 put_page(monitor->netfs_page);
        206                 fscache_retrieval_complete(op, 1);
        207                 fscache_put_retrieval(op);
        208                 kfree(monitor);

Comment 31 Dave Wysochanski 2020-05-07 10:16:37 UTC
With the patch in comment #30 added to dhowells' fscache-fixes branch (5.7-rc3 plus a few recent small fixes), the reproducer ran for over 10 hours, whereas prior to the patch it would oops within a few minutes. Aside from the periodic page allocation failure shown below, which per dhowells is a separate, known issue and one of the reasons for the in-progress rewrite of fscache, I saw no other problems with the test, and I could kill the test, unmount the share, etc.

One final note: in the course of running these tests, I did see a periodic cachefilesd crash on startup, and I do have a corefile of it, but I've not yet filed a bug. The crash resolved with a simple restart of cachefilesd, so it was not deemed critical. Once I sort this bug out, I'll investigate further and file another bug as needed, if one doesn't already exist.

[38532.455608] warn_alloc: 2 callbacks suppressed
[38532.455624] dd: page allocation failure: order:0, mode:0x90c00(GFP_NOIO|__GFP_NORETRY|__GFP_NOMEMALLOC), nodemask=(null),cpuset=/,mems_allowed=0
[38532.511641] CPU: 3 PID: 462539 Comm: dd Kdump: loaded Not tainted 5.7.0-rc3-1836833f67ab-bz1829662-fix+ #14
[38532.516041] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[38532.527383] Call Trace:
[38532.534441]  dump_stack+0x66/0x90
[38532.542477]  warn_alloc.cold+0x7b/0xdf
[38532.550760]  ? unreserve_highatomic_pageblock+0x16e/0x190
[38532.562946]  __alloc_pages_slowpath.constprop.0+0xd9c/0xdd0
[38532.570996]  ? mem_cgroup_commit_charge+0xcb/0x1a0
[38532.573624]  ? __add_to_page_cache_locked+0x349/0x3f0
[38532.578266]  __alloc_pages_nodemask+0x2e8/0x330
[38532.580195]  cachefiles_read_or_alloc_pages+0x3d6/0x1020 [cachefiles]
[38532.582885]  ? __wake_up_bit+0x52/0x70
[38532.584527]  ? fscache_run_op.isra.0+0x55/0xb0 [fscache]
[38532.586686]  __fscache_read_or_alloc_pages+0x2a8/0x310 [fscache]
[38532.589320]  __nfs_readpages_from_fscache+0x60/0x160 [nfs]
[38532.591585]  nfs_readpages+0xb5/0x1a0 [nfs]
[38532.593718]  ? __switch_to_asm+0x40/0x70
[38532.595331]  read_pages+0x6b/0x1b0
[38532.596737]  __do_page_cache_readahead+0x1ba/0x1d0
[38532.598686]  ondemand_readahead+0x171/0x2e0
[38532.600355]  ? page_cache_async_readahead+0x6c/0xe0
[38532.604840]  generic_file_read_iter+0x7cb/0xe40
[38532.608240]  ? nfs3_proc_commit_setup+0x10/0x10 [nfsv3]
[38532.610356]  ? nfs_check_cache_invalid+0x33/0x90 [nfs]
[38532.612432]  nfs_file_read+0x6d/0xa0 [nfs]
[38532.614182]  new_sync_read+0x12a/0x1c0
[38532.615773]  vfs_read+0x9d/0x150
[38532.617134]  ksys_read+0x5f/0xe0
[38532.619116]  do_syscall_64+0x5b/0x1c0
[38532.620624]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[38532.622682] RIP: 0033:0x7ff8816ec3e2

Comment 33 Dave Wysochanski 2020-05-07 12:52:00 UTC
Created attachment 1686164 [details]
patch submitted to linux-nfs

Comment 50 Dave Wysochanski 2020-05-19 10:30:30 UTC
Created attachment 1689835 [details]
Debug patches on top of 5.7-rc3 used to track down this problem

These are a couple of debug patches I wrote to track down this problem. Due to the nature of the bug, I found that ftrace created too much overhead and changed the timing: it dramatically slowed the reproducer or made it fail entirely. So I shifted to a bit of custom tracing inside the structure in question. I could do this because it was not a use-after-free problem but more of a "structure in the wrong state" problem. Placing the tracing inside the structure allows "playback" of the specific trace attached to the one structure that fails right before the machine panics, rather than a deluge of trace information that is irrelevant to the problem.

Comment 51 Jan Stancek 2020-05-25 10:56:04 UTC
Patch(es) committed on kernel-3.10.0-1145.el7

Comment 62 errata-xmlrpc 2020-09-29 21:13:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: kernel security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:4060
