Bug 460317 - RHEL5 cmirror tracker: filesystem corruption detected after node recovery
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cmirror
Version: 5.3
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Assigned To: Jonathan Earl Brassow
QA Contact: Cluster QE
Reported: 2008-08-27 10:41 EDT by Corey Marthaler
Modified: 2010-01-11 21:08 EST
Doc Type: Bug Fix
Last Closed: 2009-01-20 16:25:59 EST

Description Corey Marthaler 2008-08-27 10:41:04 EDT
Description of problem:
I was running revolver in an attempt to reproduce bz 459820 when I came across this issue. This may be another variant of the existing corruption issues (bzs 446255 and 441970), but unlike those, there was no device failure in this case. Bz 243013 is a similar RHEL4 cmirror corruption issue. After taft-04 was shot, taft-01 panicked due to detected fs corruption.


================================================================================
Senario iteration 4.2 started at Tue Aug 26 18:30:52 CDT 2008
Sleeping 2 minute(s) to let the I/O get its lock count up...
Senario: DLM kill one node

Those picked to face the revolver... taft-04
Feeling lucky taft-04? Well do ya? Go'head make my day...
Didn't receive heartbeat for 2 seconds

Verify that taft-04 has been removed from cluster on remaining nodes
Verifying that the dueler(s) are alive
still not all alive, sleeping another 10 seconds
still not all alive, sleeping another 10 seconds
still not all alive, sleeping another 10 seconds
still not all alive, sleeping another 10 seconds
<ignore name="taft-04_1" pid="19781" time="Tue Aug 26 18:35:05 2008" type="cmd" duration="887" ec="127" />
<ignore name="taft-04_0" pid="19779" time="Tue Aug 26 18:35:05 2008" type="cmd" duration="887" ec="127" />
still not all alive, sleeping another 10 seconds
still not all alive, sleeping another 10 seconds
All killed nodes are back up (able to be pinged), making sure they're qarshable...
<fail name="taft-01_0" pid="20705" time="Tue Aug 26 18:35:47 2008" type="cmd" duration="297" ec="127" />
ALL STOP!
<stop name="iogen_803" pid="12016" time="Tue Aug 26 18:35:47 2008" type="cmd" duration="5608" ec="1" />
<stop name="iogen_993" pid="12018" time="Tue Aug 26 18:35:47 2008" type="cmd" duration="5608" ec="1" />
<stop name="taft-01_1" pid="20708" time="Tue Aug 26 18:35:47 2008" type="cmd" duration="297" ec="127" />
still not all qarshable, sleeping another 10 seconds
still not all qarshable, sleeping another 10 seconds
<killed name="taft-02_0" pid="20233" time="Tue Aug 26 18:36:06 2008" type="cmd" duration="625" signal="2" />
<killed name="taft-03_0" pid="20245" time="Tue Aug 26 18:36:06 2008" type="cmd" duration="625" signal="2" />
still not all qarshable, sleeping another 10 seconds
All killed nodes are now qarshable

Verifying that recovery properly took place (on the nodes that stayed in the cluster)
checking that all of the cluster nodes are now/still cman members...
taft-01 is not a member on taft-02
taft-01 is not a member on taft-03

GFS: fsid=TAFT:1.1: jid=3: Trying to acquire journal lock...
GFS: fsid=TAFT:2.1: jid=3: Trying to acquire journal lock...
GFS: fsid=TAFT:2.1: jid=3: Looking at journal...
GFS: fsid=TAFT:1.1: jid=3: Looking at journal...
GFS: fsid=TAFT:1.1: jid=3: Acquiring the transaction lock...
GFS: fsid=TAFT:2.1: jid=3: Acquiring the transaction lock...
GFS: fsid=TAFT:1.1: jid=3: Replaying journal...
GFS: fsid=TAFT:1.1: jid=3: Replayed 0 of 1 blocks
GFS: fsid=TAFT:1.1: jid=3: replays = 0, skips = 0, sames = 1
GFS: fsid=TAFT:2.1: fatal: filesystem consistency error
GFS: fsid=TAFT:2.1:   function = trans_go_xmote_bh
GFS: fsid=TAFT:2.1:   file = /builddir/build/BUILD/gfs-kmod-0.1.23/_kmod_build_/src/gfs/glops.c,2
GFS: fsid=TAFT:2.1:   time = 1219793543
GFS: fsid=TAFT:2.1: about to withdraw from the cluster
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at ...ld/BUILD/gfs-kmod-0.1.23/_kmod_build_/src/gfs/lm.c:110
invalid opcode: 0000 [1] SMP 
last sysfs file: /fs/gfs/TAFT:2/lock_module/recover
CPU 0 
Modules linked in: sctp gfs(U) autofs4 hidp rfcomm l2cap bluetooth dm_log_clustered(U) lock_dlm d
Pid: 6163, comm: dlm_astd Tainted: G      2.6.18-98.el5 #1
RIP: 0010:[<ffffffff88602057>]  [<ffffffff88602057>] :gfs:gfs_lm_withdraw+0x87/0xd3
RSP: 0018:ffff81021cb43c50  EFLAGS: 00010202
RAX: 000000000000003a RBX: ffffc2001045a000 RCX: ffffffff802ee9a8
RDX: ffffffff802ee9a8 RSI: 0000000000000000 RDI: ffffffff802ee9a0
RBP: ffffc200104929ac R08: ffffffff802ee9a8 R09: 0000000000000046
R10: 0000000000000000 R11: 0000000000000280 R12: 0000000000000000
R13: 0000000000000005 R14: 0000000000000003 R15: ffffffff8862d860
FS:  0000000000000000(0000) GS:ffffffff803a0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 000000000064e000 CR3: 0000000213086000 CR4: 00000000000006e0
Process dlm_astd (pid: 6163, threadinfo ffff81021cb42000, task ffff81021b497860)
Stack:  0000003000000030 ffff81021cb43d60 ffff81021cb43c70 0000000000000000
 0000000000000000 0000000000000000 ffffc200104929ac ffffc200104929ac
 ffffffff88618270 ffffc200104929ac 0000000801161970 0000000000000000
Call Trace:
 [<ffffffff88617991>] :gfs:gfs_consist_i+0x2f/0x34
 [<ffffffff885faf9c>] :gfs:trans_go_xmote_bh+0x9a/0xc9
 [<ffffffff885f9831>] :gfs:xmote_bh+0x334/0x3fe
 [<ffffffff8009de2a>] keventd_create_kthread+0x0/0xc4
 [<ffffffff885fa6f5>] :gfs:gfs_glock_cb+0xc2/0x15d
 [<ffffffff8855c85c>] :lock_dlm:gdlm_ast+0x306/0x311
 [<ffffffff8855c2c1>] :lock_dlm:gdlm_bast+0x0/0x8d
 [<ffffffff8855c2c1>] :lock_dlm:gdlm_bast+0x0/0x8d
 [<ffffffff884b320e>] :dlm:dlm_astd+0xd7/0x14f
 [<ffffffff884b3137>] :dlm:dlm_astd+0x0/0x14f
 [<ffffffff8003258b>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff8009de2a>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8003248d>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11


Code: 0f 0b 68 cf b1 61 88 c2 6e 00 48 89 ee 48 c7 c7 0f b2 61 88 
RIP  [<ffffffff88602057>] :gfs:gfs_lm_withdraw+0x87/0xd3
 RSP <ffff81021cb43c50>
 <0>Kernel panic - not syncing: Fatal exception
GFS: fsid=TAFT:1.1: jid=3: Journal replayed in 1s
GFS: fsid=TAFT:1.1: jid=3: Done
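
To line up the GFS consistency error with the revolver output above, the "time = 1219793543" field can be converted to wall-clock time. This is only a sketch: it assumes the field is seconds since the Unix epoch and that the test harness logs in CDT (UTC-5), as its "CDT 2008" stamp suggests. The helper name `epoch_to_utc` is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Value copied from the "time = 1219793543" line of the GFS error.
GFS_ERROR_EPOCH = 1219793543

# Assumption: the harness timestamps are CDT, i.e. a fixed UTC-5 offset.
CDT = timezone(timedelta(hours=-5), name="CDT")


def epoch_to_utc(seconds):
    """Interpret an integer as seconds since the Unix epoch (assumed)."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


utc_time = epoch_to_utc(GFS_ERROR_EPOCH)
cdt_time = utc_time.astimezone(CDT)

print("UTC:", utc_time.isoformat())
print("CDT:", cdt_time.isoformat())
```

Under those assumptions the error is stamped 18:32:23 CDT on 2008-08-26, which falls inside scenario iteration 4.2 (started 18:30:52 CDT). Node clocks may not be perfectly in step with the harness, so treat the comparison as approximate.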



Version-Release number of selected component (if applicable):
2.6.18-98.el5

lvm2-2.02.32-4.el5    BUILT: Fri Apr  4 06:15:19 CDT 2008
lvm2-cluster-2.02.32-4.el5    BUILT: Wed Apr  2 03:56:50 CDT 2008
device-mapper-1.02.24-1.el5    BUILT: Thu Jan 17 16:46:05 CST 2008
cmirror-1.1.22-1.el5    BUILT: Thu Jul 24 15:59:03 CDT 2008
kmod-cmirror-0.1.13-2.el5    BUILT: Thu Jul 24 16:00:48 CDT 2008
cman-2.0.87-1.el5.test.plock.3
openais-0.80.3-17.el5

How reproducible:
Only once so far
Comment 1 Corey Marthaler 2008-08-27 14:56:13 EDT
Reproduced this.
Comment 3 Jonathan Earl Brassow 2008-09-29 17:42:28 EDT
Modified by the following check-in:

commit 85d1423ec47e48ab844088ebaf4157327b928ae9
Author: Jonathan Brassow <jbrassow@redhat.com>
Date:   Fri Sep 19 16:19:02 2008 -0500

    dm-log-clustered/clogd: Fix off-by-one error and compilation errors

    Needed to tweek included header files to make dm-log-clustered compile
    again.

    Found an off-by-one error that was causing mirror corruption in the
    case where the primary mirror device was killed in a mirror.


This off-by-one error will manifest itself any time you are doing I/O while the
mirror is syncing.  This could be during the initial sync or a resync after a
failure.
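
For readers who want to see how an off-by-one in a sync boundary check can leave mirror legs out of step, here is a toy model. It is NOT the dm-log-clustered/clogd code, and the diff for the commit is not shown in this bug. It assumes a mirror that recovers regions in order and keeps a count of regions already recovered; every name below is hypothetical.

```python
def in_sync_buggy(region, sync_count):
    # Off by one: also reports the region currently being recovered as in sync.
    return region <= sync_count


def in_sync_fixed(region, sync_count):
    # Only regions strictly below the recovery point are in sync.
    return region < sync_count


def simulate(in_sync, nr_regions=4, write_region=2):
    """Recover a two-leg toy mirror while one write races with recovery."""
    primary = ["old"] * nr_regions
    secondary = ["stale"] * nr_regions
    sync_count = 0  # regions [0, sync_count) have been recovered

    for region in range(nr_regions):
        # Recovery step 1: read the region from the primary leg.
        snapshot = primary[region]

        # A write arrives for the region that is being recovered right now.
        deferred = None
        if region == write_region:
            if in_sync(region, sync_count):
                # Region believed in sync: write goes to both legs at once.
                primary[region] = "new"
                secondary[region] = "new"
            else:
                # Region not in sync: hold the write until recovery is done.
                deferred = "new"

        # Recovery step 2: copy the earlier snapshot onto the secondary leg.
        secondary[region] = snapshot
        sync_count += 1

        # Release a held write now that the region is recovered.
        if deferred is not None:
            primary[region] = deferred
            secondary[region] = deferred

    return primary, secondary


for name, check in (("buggy", in_sync_buggy), ("fixed", in_sync_fixed)):
    primary, secondary = simulate(check)
    print(name, "legs match:", primary == secondary)
```

In this model the buggy check lets the write through while recovery still holds an older snapshot, so the recovery copy overwrites the new data on one leg only. Whether the real bug followed this exact sequence is not stated in the bug; the sketch only shows why I/O during a sync is the window where such an error would bite.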
Comment 5 Corey Marthaler 2008-11-20 12:36:31 EST
Fix verified with the following rpms:

2.6.18-123.el5

lvm2-2.02.40-6.el5    BUILT: Fri Oct 24 07:37:33 CDT 2008
lvm2-cluster-2.02.40-6.el5    BUILT: Fri Oct 24 07:38:44 CDT 2008
device-mapper-1.02.28-2.el5    BUILT: Fri Sep 19 02:50:32 CDT 2008
cmirror-1.1.34-5.el5    BUILT: Thu Nov  6 15:10:44 CST 2008
kmod-cmirror-0.1.21-2.el5    BUILT: Thu Nov  6 14:12:07 CST 2008
Comment 7 errata-xmlrpc 2009-01-20 16:25:59 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-0158.html
