Bug 1397558 - GFS2: AIL error, inode is actually data
Summary: GFS2: AIL error, inode is actually data
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: rawhide
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: gfs2-maint
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2016-11-22 19:47 UTC by Nate Straz
Modified: 2020-05-22 13:20 UTC (History)
10 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-11-18 20:07:18 UTC
Type: Bug


Attachments (Terms of Use)

Description Nate Straz 2016-11-22 19:47:29 UTC
Description of problem:

While running `revolver -L CITY` on an upstream kernel I hit this panic.

[  144.696536] gfs2: fsid=STSRHTS30869:STSRHTS308690.1: AIL buffer ffff88001d6f1a28: blocknr 12321407 state 0x00000003 mapping           (null) page state 0x1
fffff80000874
[  144.698127] gfs2: fsid=STSRHTS30869:STSRHTS308690.1: AIL glock 2:951747 mapping ffff8800181f01a0
[  144.698908] gfs2: fsid=STSRHTS30869:STSRHTS308690.1: AIL error
[  144.699451] Kernel panic - not syncing: GFS2: fsid=STSRHTS30869:STSRHTS308690.1: panic requested
[  144.699451] 
[  144.700443] CPU: 0 PID: 3473 Comm: kworker/0:69H Not tainted 4.9.0-rc4-splice-Nov16+ #40
[  144.700443] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
[  144.700443] Workqueue: glock_workqueue glock_work_func [gfs2]
[  144.700443]  ffffc90000febb20 ffffffff8135b77c ffff88003b4bc000 ffffffffa03f5448
[  144.700443]  ffffc90000febba0 ffffffff8119c754 ffffc90000000010 ffffc90000febbb0
[  144.700443]  ffffc90000febb50 00000000dbc1bf66 ffffffff00000018 ffff88003b4bc9f8
[  144.700443] Call Trace:
[  144.700443]  [<ffffffff8135b77c>] dump_stack+0x63/0x87
[  144.700443]  [<ffffffff8119c754>] panic+0xeb/0x239
[  144.700443]  [<ffffffffa03ea36e>] gfs2_lm_withdraw+0xbe/0x180 [gfs2]
[  144.700443]  [<ffffffffa03cd61d>] __gfs2_ail_flush+0x14d/0x1a0 [gfs2]
[  144.700443]  [<ffffffffa03cd758>] gfs2_ail_empty_gl+0xe8/0x130 [gfs2]
[  144.700443]  [<ffffffffa03cd84b>] ? inode_go_sync+0xab/0x130 [gfs2]
[  144.700443]  [<ffffffffa03cd84b>] inode_go_sync+0xab/0x130 [gfs2]
[  144.700443]  [<ffffffffa03cc02a>] do_xmote+0xda/0x200 [gfs2]
[  144.700443]  [<ffffffffa03cc242>] run_queue+0xf2/0x1b0 [gfs2]
[  144.700443]  [<ffffffffa03cc360>] glock_work_func+0x60/0x130 [gfs2]
[  144.700443]  [<ffffffff810a88f2>] process_one_work+0x152/0x400
[  144.700443]  [<ffffffff810a91e5>] worker_thread+0x125/0x4b0
[  144.700443]  [<ffffffff810a90c0>] ? rescuer_thread+0x380/0x380
[  144.700443]  [<ffffffff810aed49>] kthread+0xd9/0xf0
[  144.700443]  [<ffffffff810aec70>] ? kthread_park+0x60/0x60
[  144.700443]  [<ffffffff81702bd5>] ret_from_fork+0x25/0x30
[  144.700443] Kernel Offset: disabled
[  144.700443] ---[ end Kernel panic - not syncing: GFS2: fsid=STSRHTS30869:STSRHTS308690.1: panic requested

The blocknr (12321407) doesn't match the glock (2:951747).

block 12321407 is actually an inode on disk.
block 951747 is actually data from inode 951624


Version-Release number of selected component (if applicable):
4.9.0-rc4-splice-Nov16+

How reproducible:
Unknown

Steps to Reproduce:
1. revolver -L CITY
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Nate Straz 2016-11-22 19:59:29 UTC
While trying to reproduce this I hit an assert before recover started.  

[ 3794.698704] gfs2: fsid=STSRHTS30869:STSRHTS308690.0: fatal: assertion "!atomic_read(&gl->gl_ail_count)" failed
[ 3794.698704]    function = inode_go_inval, file = fs/gfs2/glops.c, line = 258
[ 3794.700436] Kernel panic - not syncing: GFS2: fsid=STSRHTS30869:STSRHTS308690.0: panic requested
[ 3794.700436] 
[ 3794.701427] CPU: 0 PID: 29006 Comm: kworker/0:42H Not tainted 4.9.0-rc4-splice-Nov16+ #40
[ 3794.701427] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
[ 3794.701427] Workqueue: glock_workqueue glock_work_func [gfs2]
[ 3794.701427]  ffffc90001dfbc28 ffffffff8135b77c ffff8800227ec000 ffffffffa03f6448
[ 3794.701427]  ffffc90001dfbca8 ffffffff8119c754 ffffc90000000010 ffffc90001dfbcb8
[ 3794.701427]  ffffc90001dfbc58 0000000046ac9281 ffff880000000018 ffff8800227ec9f8
[ 3794.701427] Call Trace:
[ 3794.701427]  [<ffffffff8135b77c>] dump_stack+0x63/0x87
[ 3794.701427]  [<ffffffff8119c754>] panic+0xeb/0x239
[ 3794.701427]  [<ffffffffa03eb36e>] gfs2_lm_withdraw+0xbe/0x180 [gfs2]
[ 3794.701427]  [<ffffffffa03eb454>] gfs2_assert_withdraw_i+0x24/0x40 [gfs2]
[ 3794.701427]  [<ffffffffa03ce1f9>] inode_go_inval+0xe9/0xf0 [gfs2]
[ 3794.701427]  [<ffffffffa03cd097>] do_xmote+0x147/0x200 [gfs2]
[ 3794.701427]  [<ffffffffa03cd242>] run_queue+0xf2/0x1b0 [gfs2]
[ 3794.701427]  [<ffffffffa03cd360>] glock_work_func+0x60/0x130 [gfs2]
[ 3794.701427]  [<ffffffff810a88f2>] process_one_work+0x152/0x400
[ 3794.701427]  [<ffffffff810a91e5>] worker_thread+0x125/0x4b0
[ 3794.701427]  [<ffffffff810a90c0>] ? rescuer_thread+0x380/0x380
[ 3794.701427]  [<ffffffff810aed49>] kthread+0xd9/0xf0
[ 3794.701427]  [<ffffffff810aec70>] ? kthread_park+0x60/0x60
[ 3794.701427]  [<ffffffff81702bd5>] ret_from_fork+0x25/0x30
[ 3794.701427] Kernel Offset: disabled
[ 3794.701427] ---[ end Kernel panic - not syncing: GFS2: fsid=STSRHTS30869:STSRHTS308690.0: panic requested

Comment 2 Nate Straz 2016-11-22 20:22:04 UTC
Second try, got something different.  

[ 5019.657914] gfs2: fsid=STSRHTS30869:STSRHTS308690.1: fatal: invalid metadata block
[ 5019.657914]   bh = 12060162 (magic number)
[ 5019.657914]   function = gfs2_meta_indirect_buffer, file = fs/gfs2/meta_io.c, line = 420
[ 5019.659697] Kernel panic - not syncing: GFS2: fsid=STSRHTS30869:STSRHTS308690.1: panic requested
[ 5019.659697] 
[ 5019.660613] CPU: 0 PID: 3687 Comm: kworker/0:223H Not tainted 4.9.0-rc4-splice-Nov16+ #40
[ 5019.660686] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
[ 5019.660686] Workqueue: glock_workqueue glock_work_func [gfs2]
[ 5019.660686]  ffffc9000220bb78 ffffffff8135b77c ffff880020ced000 ffffffffa03f8448
[ 5019.660686]  ffffc9000220bbf8 ffffffff8119c754 ffffc90000000010 ffffc9000220bc08
[ 5019.660686]  ffffc9000220bba8 00000000874d0077 ffffc90000000018 ffff880020ced9f8
[ 5019.660686] Call Trace:
[ 5019.660686]  [<ffffffff8135b77c>] dump_stack+0x63/0x87
[ 5019.660686]  [<ffffffff8119c754>] panic+0xeb/0x239
[ 5019.660686]  [<ffffffffa03ed36e>] gfs2_lm_withdraw+0xbe/0x180 [gfs2]
[ 5019.660686]  [<ffffffffa03d4c6c>] ? gfs2_meta_read+0x21c/0x2c0 [gfs2]
[ 5019.660686]  [<ffffffffa03ed65f>] gfs2_meta_check_ii+0x2f/0x40 [gfs2]
[ 5019.660686]  [<ffffffffa03d50d4>] gfs2_meta_indirect_buffer+0xe4/0x140 [gfs2]
[ 5019.660686]  [<ffffffffa03d0aa6>] gfs2_inode_refresh+0x36/0x2d0 [gfs2]
[ 5019.660686]  [<ffffffffa03d0db2>] inode_go_lock+0x72/0xf0 [gfs2]
[ 5019.660686]  [<ffffffffa03cd219>] do_promote+0x159/0x270 [gfs2]
[ 5019.660686]  [<ffffffffa03ced38>] finish_xmote+0x158/0x370 [gfs2]
[ 5019.660686]  [<ffffffffa03cf3cb>] glock_work_func+0xcb/0x130 [gfs2]
[ 5019.660686]  [<ffffffff810a88f2>] process_one_work+0x152/0x400
[ 5019.660686]  [<ffffffff810a91e5>] worker_thread+0x125/0x4b0
[ 5019.660686]  [<ffffffff810a90c0>] ? rescuer_thread+0x380/0x380
[ 5019.660686]  [<ffffffff810aed49>] kthread+0xd9/0xf0
[ 5019.660686]  [<ffffffff810aec70>] ? kthread_park+0x60/0x60
[ 5019.660686]  [<ffffffff81702bd5>] ret_from_fork+0x25/0x30
[ 5019.660686] Kernel Offset: disabled
[ 5019.660686] ---[ end Kernel panic - not syncing: GFS2: fsid=STSRHTS30869:STSRHTS308690.1: panic requested
[ 5019.660686] 

Block 12060162 in this case is a data block from a file that appears to be unlinked.  I was not able to find it on the file system.

[root@host-054 18]# gfs2_edit /dev/STSRHTS30869/STSRHTS308690 -p 12060162 -x | head
Block #12060162    (0xb80602) of 13106176 (0xc7fc00)

0000000b80602000 686d6876 665f3138 5f686f73 742d3035 [hmhvf_18_host-05]
0000000b80602010 332e7669 72742e6c 61622e6d 73702e72 [3.virt.lab.msp.r]
0000000b80602020 65646861 742e636f 6d5f3330 38325f39 [edhat.com_3082_9]
0000000b80602030 36333536 325f3134 37393834 34323637 [63562_1479844267]

Comment 3 Nate Straz 2016-11-22 20:47:27 UTC
Hit the original problem again.

[ 2437.647199] gfs2: fsid=STSRHTS30869:STSRHTS308690.0: AIL buffer ffff88000e46ee38: blocknr 835821 state 0x00000003 mapping           (nul
l) page state 0x1fffff80000874
[ 2437.648556] gfs2: fsid=STSRHTS30869:STSRHTS308690.0: AIL glock 2:3966951 mapping ffff88003cc54f20
[ 2437.649430] gfs2: fsid=STSRHTS30869:STSRHTS308690.0: AIL error
[ 2437.649955] Kernel panic - not syncing: GFS2: fsid=STSRHTS30869:STSRHTS308690.0: panic requested
[ 2437.649955] 
[ 2437.650422] CPU: 0 PID: 22512 Comm: kworker/0:82H Not tainted 4.9.0-rc4-splice-Nov16+ #40
[ 2437.650422] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
[ 2437.650422] Workqueue: glock_workqueue glock_work_func [gfs2]
[ 2437.650422]  ffffc90001187b20 ffffffff8135b77c ffff88003c3c6000 ffffffffa03f4448
[ 2437.650422]  ffffc90001187ba0 ffffffff8119c754 ffffc90000000010 ffffc90001187bb0
[ 2437.650422]  ffffc90001187b50 0000000095943fad ffffffff00000018 ffff88003c3c69f8
[ 2437.650422] Call Trace:
[ 2437.650422]  [<ffffffff8135b77c>] dump_stack+0x63/0x87
[ 2437.650422]  [<ffffffff8119c754>] panic+0xeb/0x239
[ 2437.650422]  [<ffffffffa03e936e>] gfs2_lm_withdraw+0xbe/0x180 [gfs2]
[ 2437.650422]  [<ffffffffa03cc61d>] __gfs2_ail_flush+0x14d/0x1a0 [gfs2]
[ 2437.650422]  [<ffffffffa03cc758>] gfs2_ail_empty_gl+0xe8/0x130 [gfs2]
[ 2437.650422]  [<ffffffffa03cc84b>] ? inode_go_sync+0xab/0x130 [gfs2]
[ 2437.650422]  [<ffffffffa03cc84b>] inode_go_sync+0xab/0x130 [gfs2]
[ 2437.650422]  [<ffffffffa03cb02a>] do_xmote+0xda/0x200 [gfs2]
[ 2437.650422]  [<ffffffffa03cb242>] run_queue+0xf2/0x1b0 [gfs2]
[ 2437.650422]  [<ffffffffa03cb360>] glock_work_func+0x60/0x130 [gfs2]
[ 2437.650422]  [<ffffffff810a88f2>] process_one_work+0x152/0x400
[ 2437.650422]  [<ffffffff810a91e5>] worker_thread+0x125/0x4b0
[ 2437.650422]  [<ffffffff810a90c0>] ? rescuer_thread+0x380/0x380
[ 2437.650422]  [<ffffffff810aed49>] kthread+0xd9/0xf0
[ 2437.650422]  [<ffffffff810aec70>] ? kthread_park+0x60/0x60
[ 2437.650422]  [<ffffffff81702bd5>] ret_from_fork+0x25/0x30
[ 2437.650422] Kernel Offset: disabled
[ 2437.650422] ---[ end Kernel panic - not syncing: GFS2: fsid=STSRHTS30869:STSRHTS308690.0: panic requested

The AIL buffer points to a disk inode for a file in directory 23.
[root@host-053 9]# gfs2_edit /dev/STSRHTS30869/STSRHTS308690 -p 835821 | head
Block #835821    (0xcc0ed) of 13106176 (0xc7fc00) (disk inode)
[root@host-053 9]# gfs2_edit /dev/STSRHTS30869/STSRHTS308690 -p 835866 -x | head
Block #835866    (0xcc11a) of 13106176 (0xc7fc00)

cc11a000 76777578 775f3233 5f686f73 742d3035 [vwuxw_23_host-05]
cc11a010 342e7669 72742e6c 61622e6d 73702e72 [4.virt.lab.msp.r]
cc11a020 65646861 742e636f 6d5f3232 3430305f [edhat.com_22400_]
cc11a030 35353738 32355f31 34373938 34373035 [557825_147984705]
cc11a040 37767775 78775f32 335f686f 73742d30 [7vwuxw_23_host-0]

The glock points to a data block of a file in directory 9.

[root@host-053 ~]# gfs2_edit /dev/STSRHTS30869/STSRHTS308690 -p 3966951 -x | head -n 6
Block #3966951    (0x3c87e7) of 13106176 (0xc7fc00)

00000003c87e7000 76776d75 755f395f 686f7374 2d303533 [vwmuu_9_host-053]
00000003c87e7010 2e766972 742e6c61 622e6d73 702e7265 [.virt.lab.msp.re]
00000003c87e7020 64686174 2e636f6d 5f313537 34365f32 [dhat.com_15746_2]
00000003c87e7030 38333334 385f3134 37393834 37313234 [83348_1479847124]
[root@host-053 ~]# cd /mnt/STSRHTS308690/9
[root@host-053 9]# ls -li vwmuu_9_host-053.virt.lab.msp.redhat.com_15746_283348_1479847124
3966901 -rw-r--r--. 1 root root 283348 Nov 22 14:38 vwmuu_9_host-053.virt.lab.msp.redhat.com_15746_283348_1479847124

Comment 4 Robert Peterson 2018-03-27 17:53:01 UTC
Nate, can you try to reproduce this problem on a relatively recent
for-next kernel? Since this bug hasn't been updated in more
than a year, I'm thinking the problem might be something we've
already fixed, namely either (1) the gl_object issue, or (2) the
recent file system corruption issue with truncates where file size
extends beyond its metadata height.

Comment 5 Robert Peterson 2019-11-18 20:07:18 UTC
I've been running the Cityscape test with revolver for more than a year, and
while I've seen my fair share of problems, I've not seen this one exactly.

In this case, the corruption is almost guaranteed to be caused by gfs2 journal
replay replaying a rgrp block that's since been revised by a different node.

In an ideal world, that should never happen, but in fact it's very easy for
me to reproduce using revolver and CITY, as long as you don't have errors=panic.

I've got rhel8 bug #1724769 for this problem, and rhel7 bug #1364234 for this.
Now I've got a growing pile of upstream patches to fix the different situations
that can cause it. I'll continue to work on those problems (under the guise of
those two BZs) but for now I'm going to close this one.


Note You need to log in before you can comment on or make changes to this bug.