Bug 1849196

Summary: [ARK] kernel bug list_del corruption on s390x from stress-ng mknod and stress-ng symlink
Product: [Fedora] Fedora
Reporter: Jeff Bastian <jbastian>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: NEW ---
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium
Docs Contact:
Priority: medium
Version: rawhide
CC: acaringi, airlied, bskeggs, bugproxy, dan, dzickus, hdegoede, ichavero, itamar, jarodwilson, jeremy, jglisse, john.j5live, jonathan, josef, kernel-maint, lgoncalv, linville, masami256, mchehab, mjg59, prudo, rasibley, steved, tstaudt
Target Milestone: ---   
Target Release: ---   
Hardware: s390x   
OS: Linux   
Type: Bug
Bug Blocks: 467765    
Attachments:
Description                              Flags
5.7.2-88cd4de.cki-console.log.xz         none
5.7.4-4418f34.cki-console.log.xz         none
5.7.4-9b26e20.cki-console.log.xz         none
5.7.3-bbd1511.cki-console.log.xz         none
5.7.4-inf.cki-console.log.xz             none
5.8.0-rc1-c1f840d.cki-console.log.xz     none

Description Jeff Bastian 2020-06-19 19:40:49 UTC
1. Please describe the problem:
The stress-ng mknod and symlink stressors trigger a kernel bug on the ARK kernel on s390x:

mknod:
[ 1256.534428] list_del corruption. next->prev should be 000003e000be7c98, but was 00000001a7a6d1b0
[ 1256.534463] ------------[ cut here ]------------
[ 1256.534466] kernel BUG at lib/list_debug.c:54!
[ 1256.534535] monitor event: 0040 ilc:2 [#1] SMP
[ 1256.534540] Modules linked in: ...<snip>...
[ 1256.534806] CPU: 2 PID: 582352 Comm: stress-ng-mknod Kdump: loaded Not tainted 5.8.0-rc1-c1f840d.cki #1
[ 1256.534810] Hardware name: IBM 2964 N96 400 (z/VM 6.4.0)
...

symlink:
[ 1754.761295] list_del corruption. prev->next should be 000003e000e27a68, but was 00000001d6daa1f0
[ 1754.880585] ------------[ cut here ]------------
[ 1754.880588] kernel BUG at lib/list_debug.c:51!
[ 1754.880656] monitor event: 0040 ilc:2 [#1] SMP
[ 1754.880662] Modules linked in: ...<snip>...                            
[ 1754.880738] CPU: 3 PID: 592107 Comm: stress-ng-symli Kdump: loaded Not tainted 5.7.2-88cd4de.cki #1
[ 1754.880740] Hardware name: IBM 2964 N96 400 (z/VM 6.4.0)
...
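
For readers unfamiliar with these messages: they are printed when the kernel finds, just before unlinking a list entry, that one of the entry's neighbours no longer points back at it. The following is a minimal user-space model of that invariant in Python, not the kernel code itself; the names `Node`, `list_add_tail` and `check_list_del` are made up for illustration, and node names stand in for the pointer values shown in the logs.

```python
class Node:
    """Hypothetical stand-in for a list head embedded in some object."""

    def __init__(self, name):
        self.name = name
        self.prev = self  # an empty circular list points at itself
        self.next = self


def list_add_tail(new, head):
    """Link `new` in just before `head`, i.e. at the tail of the list."""
    new.prev = head.prev
    new.next = head
    head.prev.next = new
    head.prev = new


def check_list_del(entry):
    """Return None if `entry` can be unlinked safely, otherwise a
    description of the broken neighbour link (modelled on the messages above)."""
    if entry.prev.next is not entry:
        return f"prev->next should be {entry.name}, but was {entry.prev.next.name}"
    if entry.next.prev is not entry:
        return f"next->prev should be {entry.name}, but was {entry.next.prev.name}"
    return None


head, a, b, stray = Node("head"), Node("a"), Node("b"), Node("stray")
list_add_tail(a, head)
list_add_tail(b, head)
assert check_list_del(a) is None  # intact list: both neighbours point back at a

# Simulate corruption: b's back-pointer is overwritten while a is still linked.
b.prev = stray
print(check_list_del(a))
```

In this model the mknod failure corresponds to the second check (the successor's back-pointer is wrong) and the symlink failure to the first (the predecessor's forward pointer is wrong).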

2. What is the Version-Release number of the kernel:
several recent ARK kernel builds including:
5.8.0-rc1-c1f840d.cki
5.7.4-9b26e20.cki
5.7.4-4418f34.cki
5.7.4-inf.cki
5.7.3-bbd1511.cki
5.7.2-88cd4de.cki

3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :

Judging by the CKI logs, this first appeared in 5.7.2-88cd4de.cki

4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:

Build and run stress-ng and focus on the mknod stressor.

git clone git://kernel.ubuntu.com/cking/stress-ng.git
cd stress-ng
git checkout -b V0.09.56 V0.09.56
make
./stress-ng --mknod 0 --timeout 5 --log-file mknod.log
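
To repeat the run for both stressors and check the kernel log afterwards, something like the following Python sketch may help. It assumes it is started from the stress-ng build directory, that `dmesg` is readable by the current user (it may need root), and that `--symlink` takes the same arguments as `--mknod`; the helper names are hypothetical.

```python
import subprocess

MARKER = "list_del corruption"


def stress_cmd(stressor, timeout=5, binary="./stress-ng"):
    """Build the stress-ng command line used in the steps above."""
    return [binary, f"--{stressor}", "0",
            "--timeout", str(timeout),
            "--log-file", f"{stressor}.log"]


def corruption_lines(kernel_log):
    """Return the kernel log lines that report list corruption."""
    return [line for line in kernel_log.splitlines() if MARKER in line]


def run_once(stressor):
    """Run one stressor, then scan the kernel log for the corruption marker."""
    subprocess.run(stress_cmd(stressor), check=False)
    log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
    return corruption_lines(log)
```

Calling `run_once("mknod")` and then `run_once("symlink")` returns the matching log lines, or an empty list if nothing was reported. Note that `dmesg` also shows messages from earlier runs in the same boot, so a hit is only meaningful on a freshly booted system.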

5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:

Unknown, but I can try the Rawhide kernel if it's valuable.

6. Are you running any modules that are not shipped directly with Fedora's kernel?:

No

7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.

Coming soon.

Comment 1 Jeff Bastian 2020-06-19 19:47:22 UTC
Created attachment 1698160 [details]
5.7.2-88cd4de.cki-console.log.xz

serial console log from kernel 5.7.2-88cd4de.cki

Comment 2 Jeff Bastian 2020-06-19 19:50:51 UTC
Created attachment 1698163 [details]
5.7.4-4418f34.cki-console.log.xz

serial console log from kernel 5.7.4-4418f34.cki

Comment 3 Jeff Bastian 2020-06-19 19:50:54 UTC
Created attachment 1698164 [details]
5.7.4-9b26e20.cki-console.log.xz

serial console log from kernel 5.7.4-9b26e20.cki

Comment 4 Jeff Bastian 2020-06-19 19:50:58 UTC
Created attachment 1698165 [details]
5.7.3-bbd1511.cki-console.log.xz

serial console log from kernel 5.7.3-bbd1511.cki

Comment 5 Jeff Bastian 2020-06-19 19:51:01 UTC
Created attachment 1698166 [details]
5.7.4-inf.cki-console.log.xz

serial console log from kernel 5.7.4-inf.cki

Comment 6 Jeff Bastian 2020-06-19 19:51:05 UTC
Created attachment 1698167 [details]
5.8.0-rc1-c1f840d.cki-console.log.xz

serial console log from kernel 5.8.0-rc1-c1f840d.cki

Comment 8 Jeff Bastian 2020-06-19 20:00:02 UTC
The full trace from kernel 5.8.0-rc1-c1f840d.cki

[ 1256.534428] list_del corruption. next->prev should be 000003e000be7c98, but was 00000001a7a6d1b0
[ 1256.534463] ------------[ cut here ]------------
[ 1256.534466] kernel BUG at lib/list_debug.c:54!
[ 1256.534535] monitor event: 0040 ilc:2 [#1] SMP
[ 1256.534540] Modules linked in: loop binfmt_misc psnap llc salsa20_generic camellia_generic cast6_generic cast_common serpent_generic twofish_generic twofish_common ofb lrw tgr192 wp512 rmd320 rmd256 rmd160 rmd128 md4 lcs ctcm fsm zfcp scsi_transport_fc dasd_fba_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc qeth_l2 qeth qdio ccwgroup vfio_ccw vfio_mdev mdev vfio_iommu_type1 vfio drm drm_panel_orientation_quirks backlight i2c_core ip_tables xfs libcrc32c crc32_vx_s390 ghash_s390 prng aes_s390 des_s390 libdes sha512_s390 sha256_s390 sha1_s390 sha_common dasd_eckd_mod dasd_mod pkey zcrypt
[ 1256.534806] CPU: 2 PID: 582352 Comm: stress-ng-mknod Kdump: loaded Not tainted 5.8.0-rc1-c1f840d.cki #1
[ 1256.534810] Hardware name: IBM 2964 N96 400 (z/VM 6.4.0)
[ 1256.534820] Krnl PSW : 0404e00180000000 000000009d9b173c (__list_del_entry_valid+0x8c/0xb8)
[ 1256.534831]            R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
[ 1256.534833] Krnl GPRS: 0000000000000064 000000009e5eace8 0000000000000054 0000001f5042a08
[ 1256.534834]            00000001f5051800 0000000000000000 0000000190040180 00000001f1c86570
[ 1256.534836]            00000001f1c86620 070000009ddc1582 000003e000be7c80 00000001a7a6d1a8
[ 1256.534837]            00000001f11ac000 00000001e44f1400 000000009d9b1738 000003e000be7b30
[ 1256.534851] Krnl Code: 000000009d9b172c: e33010000004        lg      %r3,0(%r1)
[ 1256.534851]            000000009d9b1732: c0e5ffdda063        brasl   %r14,000000009d5657f8
[ 1256.534851]           #000000009d9b1738: af000000            mc      0,0
[ 1256.534851]           >000000009d9b173c: b9040032            lgr     %r3,%r2
[ 1256.534851]            000000009d9b1740: c020003251ab        larl    %r2,000000009dffba96
[ 1256.534851]            000000009d9b1746: c0e5ffdda059        brasl   %r14,000000009d5657f8
[ 1256.534851]            000000009d9b174c: af000000            mc      0,0
[ 1256.534851]            000000009d9b1750: b9040032            lgr     %r3,%r2
[ 1256.534866] Call Trace:
[ 1256.534868]  [<000000009d9b173c>] __list_del_entry_valid+0x8c/0xb8
[ 1256.534871] ([<000000009d9b1738>] __list_del_entry_valid+0x88/0xb8)
[ 1256.534876]  [<000000009d54d800>] remove_wait_queue+0x48/0xa0
[ 1256.535024]  [<000003ff8019f3d0>] xfs_log_commit_cil+0x900/0xa50 [xfs]
[ 1256.535056]  [<000003ff80197704>] __xfs_trans_commit+0x9c/0x3a8 [xfs]
[ 1256.535089]  [<000003ff80189a9c>] xfs_remove+0x274/0x328 [xfs]
[ 1256.535121]  [<000003ff80183962>] xfs_vn_unlink+0x5a/0xa8 [xfs]
[ 1256.535126]  [<000000009d768874>] vfs_unlink+0x134/0x250
[ 1256.535128]  [<000000009d76d00a>] do_unlinkat+0x1ba/0x318
[ 1256.535133]  [<000000009ddc692c>] system_call+0xe0/0x2b0
[ 1256.535134] Last Breaking-Event-Address:
[ 1256.810858]  [<000000009ddc7b40>] __s390_indirect_jump_r14+0x0/0xc
[ 1256.810925] ---[ end trace d4f63cd47d1c630e ]---

Comment 9 Jeff Bastian 2020-06-19 20:07:50 UTC
The full trace from kernel 5.7.2-88cd4de.cki

[ 1754.761295] list_del corruption. prev->next should be 000003e000e27a68, but was 00000001d6daa1f0
[ 1754.880585] ------------[ cut here ]------------
[ 1754.880588] kernel BUG at lib/list_debug.c:51!
[ 1754.880656] monitor event: 0040 ilc:2 [#1] SMP
[ 1754.880662] Modules linked in: unix_diag binfmt_misc psnap llc salsa20_generic camellia_generic cast6_generic cast_common serpent_generic twofish_generic twofish_common ofb lrw tgr192 wp512 rmd320 rmd256 rmd160 rmd128 md4 loop tun af_kecrypto_user scsi_transport_iscsi xt_multiport overlay xt_CONNSECMARK xt_SECMARK nft_counter xt_state xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables nfnetlink ah6 ah4 sctp lcs ctcm fsm zfcp scsi_transport_fc dasd_fba_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc qeth_l2 qeth qdio ccwgroup vfio_ccw vfio_mdev mdev vfio_iommu_type1 vfio drm drm_panel_orientation_quirks backlight i2c_core ip_tables xfs libcrc32c crc32_vx_s390 ghash_s390 prng aes_s390 des_s390 libdes sha512_s390 sha256_s390 sha1_s390 sha_common dasd_eckd_mod dasd_mod pkey zcrypt
[ 1754.880738] CPU: 3 PID: 592107 Comm: stress-ng-symli Kdump: loaded Not tainted 5.7.2-88cd4de.cki #1
[ 1754.880740] Hardware name: IBM 2964 N96 400 (z/VM 6.4.0)
[ 1754.880742] Krnl PSW : 0404e00180000000 0000000021651180 (__list_del_entry_valid+0xa0/0xb8)
[ 1754.880754]            R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
[ 1754.880756] Krnl GPRS: 0000000000000064 000000002226fbe8 0000000000000054 00000001f509fa10
[ 1754.880757]            00000001f50ae408 0000000000000000 0000000184ca0d30 00000001f104aae0
[ 1754.880759]            00000001f104ab90 0700000021a58c52 000003e000e27a50 00000001d6daa1e8
[ 1754.880760]            0000000089acc000 00000001e4125c00 000000002165117c 000003e000e27900
[ 1754.880768] Krnl Code: 0000000021651170: c0200031e8a3        larl    %r2,0000000021c8e2b6
[ 1754.880768]            0000000021651176: c0e5ffddf2d9        brasl   %r14,000000002120f728
[ 1754.880768]           #000000002165117c: af000000            mc      0,0
[ 1754.880768]           >0000000021651180: b9040032            lgr     %r3,%r2
[ 1754.880768]            0000000021651184: c0200031e87d        larl    %r2,0000000021c8e27e
[ 1754.880768]            000000002165118a: c0e5ffddf2cf        brasl   %r14,000000002120f728
[ 1754.880768]            0000000021651190: af000000            mc      0,0
[ 1754.880768]            0000000021651194: 0707                bcr     0,%r7
[ 1754.880783] Call Trace:
[ 1754.880786]  [<0000000021651180>] __list_del_entry_valid+0xa0/0xb8
[ 1754.880789] ([<000000002165117c>] __list_del_entry_valid+0x9c/0xb8)
[ 1754.880793]  [<00000000211f77d8>] remove_wait_queue+0x48/0xa0
[ 1754.884725]  [<000003ff801e2240>] xfs_log_commit_cil+0x900/0xa50 [xfs]
[ 1754.884763]  [<000003ff801da55c>] __xfs_trans_commit+0x9c/0x3a8 [xfs]
[ 1754.884795]  [<000003ff8015dfd8>] xfs_attr_try_sf_addname+0x68/0xc8 [xfs]
[ 1754.884827]  [<000003ff8015f09a>] xfs_attr_set_args+0x9a/0x128 [xfs]
[ 1754.885084]  [<000003ff8015f37e>] xfs_attr_set+0x1be/0x2f8 [xfs]
[ 1754.885117]  [<000003ff801c61e8>] xfs_initxattrs+0x98/0xb8 [xfs]
[ 1754.885122]  [<000000002157d542>] security_inode_init_security+0x152/0x160
[ 1754.885154]  [<000003ff801c6144>] xfs_init_security+0x2c/0x38 [xfs]
[ 1754.885187]  [<000003ff801c7bc8>] xfs_vn_symlink+0xb0/0x1d0 [xfs]
[ 1754.885190]  [<000000002140d7ae>] vfs_symlink+0xfe/0x1c8
[ 1754.885192]  [<000000002141020a>] do_symlinkat+0xa2/0xf8
[ 1754.885196]  [<0000000021a5dfd0>] system_call+0xdc/0x2c8
[ 1754.885197] Last Breaking-Event-Address:
[ 1754.885199]  [<0000000021a5f560>] __s390_indirect_jump_r14+0x0/0xc
[ 1754.885249] ---[ end trace 9e0dbe149edf1c8a ]---
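
The two traces can be compared mechanically to see where they overlap. The Python sketch below is only an illustration: the regular expression is an assumption about the frame format seen in these logs, the helper names are hypothetical, and the two sample strings are the top five frames copied from the traces in comments 8 and 9.

```python
import re

# Matches frames such as " [<000000009d54d800>] remove_wait_queue+0x48/0xa0"
FRAME = re.compile(r"\[<[0-9a-f]+>\]\s+(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def frames(trace):
    """Extract function names, in order and without repeats, from a call trace."""
    names = []
    for line in trace.splitlines():
        m = FRAME.search(line)
        if m and m.group(1) not in names:
            names.append(m.group(1))
    return names


def shared_frames(trace_a, trace_b):
    """Functions present in both traces, in trace_a's order."""
    other = set(frames(trace_b))
    return [name for name in frames(trace_a) if name in other]


mknod_trace = """\
 [<000000009d9b173c>] __list_del_entry_valid+0x8c/0xb8
 [<000000009d54d800>] remove_wait_queue+0x48/0xa0
 [<000003ff8019f3d0>] xfs_log_commit_cil+0x900/0xa50 [xfs]
 [<000003ff80197704>] __xfs_trans_commit+0x9c/0x3a8 [xfs]
 [<000003ff80189a9c>] xfs_remove+0x274/0x328 [xfs]
"""
symlink_trace = """\
 [<0000000021651180>] __list_del_entry_valid+0xa0/0xb8
 [<00000000211f77d8>] remove_wait_queue+0x48/0xa0
 [<000003ff801e2240>] xfs_log_commit_cil+0x900/0xa50 [xfs]
 [<000003ff801da55c>] __xfs_trans_commit+0x9c/0x3a8 [xfs]
 [<000003ff8015dfd8>] xfs_attr_try_sf_addname+0x68/0xc8 [xfs]
"""
print(shared_frames(mknod_trace, symlink_trace))
```

For these excerpts the shared frames are the list check itself, `remove_wait_queue`, `xfs_log_commit_cil` and `__xfs_trans_commit`; the traces diverge only in the callers above the transaction commit.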

Comment 10 Jeff Bastian 2020-06-19 20:20:02 UTC
Hanns-Joachim, can you mirror this for IBM BZ?

Comment 11 IBM Bug Proxy 2020-06-22 10:51:36 UTC
------- Comment From geraldsc.com 2020-06-22 06:41 EDT-------
This looks like a common code / xfs issue, and should probably be reported to the xfs maintainer. No s390 code is involved here.

I also cannot reproduce this on my ext4 system. Can you verify that this only shows up with xfs? Does it also show up on other architectures?

Comment 12 Jeff Bastian 2020-06-24 18:36:05 UTC
The same tests pass fine on x86_64, ppc64le, and aarch64.  It only fails on s390x, and fails fairly regularly.

I'll try ext4 and see what happens.

Comment 13 Jeff Bastian 2020-07-10 21:02:55 UTC
I ran xfs tests four times on an ext4 file system and could not reproduce the problem.