| Summary: | GFS2: AIL error, inode is actually data | | |
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Nate Straz <nstraz> |
| Component: | kernel | Assignee: | gfs2-maint |
| Status: | CLOSED WORKSFORME | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | rawhide | CC: | cz172638, gansalmon, ichavero, itamar, jonathan, kernel-maint, madhu.chinakonda, mchehab, nstraz, rpeterso |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-11-18 20:07:18 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Description
Nate Straz
2016-11-22 19:47:29 UTC
While trying to reproduce this I hit an assert before recovery started.

```
[ 3794.698704] gfs2: fsid=STSRHTS30869:STSRHTS308690.0: fatal: assertion "!atomic_read(&gl->gl_ail_count)" failed
[ 3794.698704]   function = inode_go_inval, file = fs/gfs2/glops.c, line = 258
[ 3794.700436] Kernel panic - not syncing: GFS2: fsid=STSRHTS30869:STSRHTS308690.0: panic requested
[ 3794.700436]
[ 3794.701427] CPU: 0 PID: 29006 Comm: kworker/0:42H Not tainted 4.9.0-rc4-splice-Nov16+ #40
[ 3794.701427] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
[ 3794.701427] Workqueue: glock_workqueue glock_work_func [gfs2]
[ 3794.701427]  ffffc90001dfbc28 ffffffff8135b77c ffff8800227ec000 ffffffffa03f6448
[ 3794.701427]  ffffc90001dfbca8 ffffffff8119c754 ffffc90000000010 ffffc90001dfbcb8
[ 3794.701427]  ffffc90001dfbc58 0000000046ac9281 ffff880000000018 ffff8800227ec9f8
[ 3794.701427] Call Trace:
[ 3794.701427]  [<ffffffff8135b77c>] dump_stack+0x63/0x87
[ 3794.701427]  [<ffffffff8119c754>] panic+0xeb/0x239
[ 3794.701427]  [<ffffffffa03eb36e>] gfs2_lm_withdraw+0xbe/0x180 [gfs2]
[ 3794.701427]  [<ffffffffa03eb454>] gfs2_assert_withdraw_i+0x24/0x40 [gfs2]
[ 3794.701427]  [<ffffffffa03ce1f9>] inode_go_inval+0xe9/0xf0 [gfs2]
[ 3794.701427]  [<ffffffffa03cd097>] do_xmote+0x147/0x200 [gfs2]
[ 3794.701427]  [<ffffffffa03cd242>] run_queue+0xf2/0x1b0 [gfs2]
[ 3794.701427]  [<ffffffffa03cd360>] glock_work_func+0x60/0x130 [gfs2]
[ 3794.701427]  [<ffffffff810a88f2>] process_one_work+0x152/0x400
[ 3794.701427]  [<ffffffff810a91e5>] worker_thread+0x125/0x4b0
[ 3794.701427]  [<ffffffff810a90c0>] ? rescuer_thread+0x380/0x380
[ 3794.701427]  [<ffffffff810aed49>] kthread+0xd9/0xf0
[ 3794.701427]  [<ffffffff810aec70>] ? kthread_park+0x60/0x60
[ 3794.701427]  [<ffffffff81702bd5>] ret_from_fork+0x25/0x30
[ 3794.701427] Kernel Offset: disabled
[ 3794.701427] ---[ end Kernel panic - not syncing: GFS2: fsid=STSRHTS30869:STSRHTS308690.0: panic requested
```

Second try, got something different.
```
[ 5019.657914] gfs2: fsid=STSRHTS30869:STSRHTS308690.1: fatal: invalid metadata block
[ 5019.657914]   bh = 12060162 (magic number)
[ 5019.657914]   function = gfs2_meta_indirect_buffer, file = fs/gfs2/meta_io.c, line = 420
[ 5019.659697] Kernel panic - not syncing: GFS2: fsid=STSRHTS30869:STSRHTS308690.1: panic requested
[ 5019.659697]
[ 5019.660613] CPU: 0 PID: 3687 Comm: kworker/0:223H Not tainted 4.9.0-rc4-splice-Nov16+ #40
[ 5019.660686] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
[ 5019.660686] Workqueue: glock_workqueue glock_work_func [gfs2]
[ 5019.660686]  ffffc9000220bb78 ffffffff8135b77c ffff880020ced000 ffffffffa03f8448
[ 5019.660686]  ffffc9000220bbf8 ffffffff8119c754 ffffc90000000010 ffffc9000220bc08
[ 5019.660686]  ffffc9000220bba8 00000000874d0077 ffffc90000000018 ffff880020ced9f8
[ 5019.660686] Call Trace:
[ 5019.660686]  [<ffffffff8135b77c>] dump_stack+0x63/0x87
[ 5019.660686]  [<ffffffff8119c754>] panic+0xeb/0x239
[ 5019.660686]  [<ffffffffa03ed36e>] gfs2_lm_withdraw+0xbe/0x180 [gfs2]
[ 5019.660686]  [<ffffffffa03d4c6c>] ? gfs2_meta_read+0x21c/0x2c0 [gfs2]
[ 5019.660686]  [<ffffffffa03ed65f>] gfs2_meta_check_ii+0x2f/0x40 [gfs2]
[ 5019.660686]  [<ffffffffa03d50d4>] gfs2_meta_indirect_buffer+0xe4/0x140 [gfs2]
[ 5019.660686]  [<ffffffffa03d0aa6>] gfs2_inode_refresh+0x36/0x2d0 [gfs2]
[ 5019.660686]  [<ffffffffa03d0db2>] inode_go_lock+0x72/0xf0 [gfs2]
[ 5019.660686]  [<ffffffffa03cd219>] do_promote+0x159/0x270 [gfs2]
[ 5019.660686]  [<ffffffffa03ced38>] finish_xmote+0x158/0x370 [gfs2]
[ 5019.660686]  [<ffffffffa03cf3cb>] glock_work_func+0xcb/0x130 [gfs2]
[ 5019.660686]  [<ffffffff810a88f2>] process_one_work+0x152/0x400
[ 5019.660686]  [<ffffffff810a91e5>] worker_thread+0x125/0x4b0
[ 5019.660686]  [<ffffffff810a90c0>] ? rescuer_thread+0x380/0x380
[ 5019.660686]  [<ffffffff810aed49>] kthread+0xd9/0xf0
[ 5019.660686]  [<ffffffff810aec70>] ? kthread_park+0x60/0x60
[ 5019.660686]  [<ffffffff81702bd5>] ret_from_fork+0x25/0x30
[ 5019.660686] Kernel Offset: disabled
[ 5019.660686] ---[ end Kernel panic - not syncing: GFS2: fsid=STSRHTS30869:STSRHTS308690.1: panic requested
```

Block 12060162 in this case is a data block from a file that appears to be unlinked. I was not able to find it on the file system.

```
[root@host-054 18]# gfs2_edit /dev/STSRHTS30869/STSRHTS308690 -p 12060162 -x | head
Block #12060162   (0xb80602) of 13106176 (0xc7fc00)

0000000b80602000 686d6876 665f3138 5f686f73 742d3035 [hmhvf_18_host-05]
0000000b80602010 332e7669 72742e6c 61622e6d 73702e72 [3.virt.lab.msp.r]
0000000b80602020 65646861 742e636f 6d5f3330 38325f39 [edhat.com_3082_9]
0000000b80602030 36333536 325f3134 37393834 34323637 [63562_1479844267]
```

Hit the original problem again.

```
[ 2437.647199] gfs2: fsid=STSRHTS30869:STSRHTS308690.0: AIL buffer ffff88000e46ee38: blocknr 835821 state 0x00000003 mapping (null) page state 0x1fffff80000874
[ 2437.648556] gfs2: fsid=STSRHTS30869:STSRHTS308690.0: AIL glock 2:3966951 mapping ffff88003cc54f20
[ 2437.649430] gfs2: fsid=STSRHTS30869:STSRHTS308690.0: AIL error
[ 2437.649955] Kernel panic - not syncing: GFS2: fsid=STSRHTS30869:STSRHTS308690.0: panic requested
[ 2437.649955]
[ 2437.650422] CPU: 0 PID: 22512 Comm: kworker/0:82H Not tainted 4.9.0-rc4-splice-Nov16+ #40
[ 2437.650422] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
[ 2437.650422] Workqueue: glock_workqueue glock_work_func [gfs2]
[ 2437.650422]  ffffc90001187b20 ffffffff8135b77c ffff88003c3c6000 ffffffffa03f4448
[ 2437.650422]  ffffc90001187ba0 ffffffff8119c754 ffffc90000000010 ffffc90001187bb0
[ 2437.650422]  ffffc90001187b50 0000000095943fad ffffffff00000018 ffff88003c3c69f8
[ 2437.650422] Call Trace:
[ 2437.650422]  [<ffffffff8135b77c>] dump_stack+0x63/0x87
[ 2437.650422]  [<ffffffff8119c754>] panic+0xeb/0x239
[ 2437.650422]  [<ffffffffa03e936e>] gfs2_lm_withdraw+0xbe/0x180 [gfs2]
[ 2437.650422]  [<ffffffffa03cc61d>] __gfs2_ail_flush+0x14d/0x1a0 [gfs2]
[ 2437.650422]  [<ffffffffa03cc758>] gfs2_ail_empty_gl+0xe8/0x130 [gfs2]
[ 2437.650422]  [<ffffffffa03cc84b>] ? inode_go_sync+0xab/0x130 [gfs2]
[ 2437.650422]  [<ffffffffa03cc84b>] inode_go_sync+0xab/0x130 [gfs2]
[ 2437.650422]  [<ffffffffa03cb02a>] do_xmote+0xda/0x200 [gfs2]
[ 2437.650422]  [<ffffffffa03cb242>] run_queue+0xf2/0x1b0 [gfs2]
[ 2437.650422]  [<ffffffffa03cb360>] glock_work_func+0x60/0x130 [gfs2]
[ 2437.650422]  [<ffffffff810a88f2>] process_one_work+0x152/0x400
[ 2437.650422]  [<ffffffff810a91e5>] worker_thread+0x125/0x4b0
[ 2437.650422]  [<ffffffff810a90c0>] ? rescuer_thread+0x380/0x380
[ 2437.650422]  [<ffffffff810aed49>] kthread+0xd9/0xf0
[ 2437.650422]  [<ffffffff810aec70>] ? kthread_park+0x60/0x60
[ 2437.650422]  [<ffffffff81702bd5>] ret_from_fork+0x25/0x30
[ 2437.650422] Kernel Offset: disabled
[ 2437.650422] ---[ end Kernel panic - not syncing: GFS2: fsid=STSRHTS30869:STSRHTS308690.0: panic requested
```

The AIL buffer points to a disk inode for a file in directory 23.

```
[root@host-053 9]# gfs2_edit /dev/STSRHTS30869/STSRHTS308690 -p 835821 | head
Block #835821   (0xcc0ed) of 13106176 (0xc7fc00) (disk inode)
[root@host-053 9]# gfs2_edit /dev/STSRHTS30869/STSRHTS308690 -p 835866 -x | head
Block #835866   (0xcc11a) of 13106176 (0xc7fc00)

cc11a000 76777578 775f3233 5f686f73 742d3035 [vwuxw_23_host-05]
cc11a010 342e7669 72742e6c 61622e6d 73702e72 [4.virt.lab.msp.r]
cc11a020 65646861 742e636f 6d5f3232 3430305f [edhat.com_22400_]
cc11a030 35353738 32355f31 34373938 34373035 [557825_147984705]
cc11a040 37767775 78775f32 335f686f 73742d30 [7vwuxw_23_host-0]
```

The glock points to a data block of a file in directory 9.
```
[root@host-053 ~]# gfs2_edit /dev/STSRHTS30869/STSRHTS308690 -p 3966951 -x | head -n 6
Block #3966951   (0x3c87e7) of 13106176 (0xc7fc00)

00000003c87e7000 76776d75 755f395f 686f7374 2d303533 [vwmuu_9_host-053]
00000003c87e7010 2e766972 742e6c61 622e6d73 702e7265 [.virt.lab.msp.re]
00000003c87e7020 64686174 2e636f6d 5f313537 34365f32 [dhat.com_15746_2]
00000003c87e7030 38333334 385f3134 37393834 37313234 [83348_1479847124]
[root@host-053 ~]# cd /mnt/STSRHTS308690/9
[root@host-053 9]# ls -li vwmuu_9_host-053.virt.lab.msp.redhat.com_15746_283348_1479847124
3966901 -rw-r--r--. 1 root root 283348 Nov 22 14:38 vwmuu_9_host-053.virt.lab.msp.redhat.com_15746_283348_1479847124
```

Nate, can you try to reproduce this problem on a relatively recent for-next kernel? Since this bug hasn't been updated in more than a year, I'm thinking the problem might be something we've already fixed, namely either (1) the gl_object issue, or (2) the recent file system corruption issue with truncates where the file size extends beyond its metadata height. I've been running the Cityscape test with revolver for more than a year, and while I've seen my fair share of problems, I've not seen this one exactly.

In this case, the corruption is almost guaranteed to be caused by GFS2 journal replay replaying an rgrp block that has since been revised by a different node. In an ideal world, that should never happen, but in fact it's very easy for me to reproduce using revolver and CITY, as long as you don't have errors=panic. I've got RHEL 8 bug #1724769 and RHEL 7 bug #1364234 for this problem, and now a growing pile of upstream patches to fix the different situations that can cause it. I'll continue to work on those problems (under the guise of those two BZs), but for now I'm going to close this one.