Bug 177577 - cman_serviced panic while doing recovery
Status: CLOSED ERRATA
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: cman
Version: 4
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Assigned To: Christine Caulfield
QA Contact: Cluster QE
Depends On:
Blocks: 180185
Reported: 2006-01-11 16:43 EST by Corey Marthaler
Modified: 2009-04-16 16:00 EDT
Fixed In Version: RHBA-2006-0559
Doc Type: Bug Fix
Last Closed: 2006-08-10 17:32:27 EDT
Attachments
Patch to test (3.22 KB, patch)
2006-01-20 10:40 EST, Christine Caulfield

Description Corey Marthaler 2006-01-11 16:43:39 EST
Description of problem:
I hit this while trying to reproduce bz 176872.
I was running derringer on the taft cluster which was serving up GFS and EXT
services. On the 7th iteration, derringer shot taft-01, and that caused taft-04
to panic. After that, taft-01 eventually came back up and joined the cluster
without problems.

Jan 11 15:29:30 taft-04 kernel: CMAN: node taft-01 has been removed from the
cluster: Missed too many heartbeats
Jan 11 15:29:31 taft-04 fenced[2728]: fencing deferred to taft-03
Jan 11 15:29:41 taft-04 qarshd[13148]: Talking to peer 10.15.87.99:51194
Jan 11 15:29:41 taft-04 qarshd[13148]: Running cmdline: cat /proc/cluster/nodes
Jan 11 15:29:41 taft-04 qarshd[13148]: That's enough
GFS: fsid=TAFT2<344>_CGLFUSS:T fERsi:dta=TfAtF9.T23:34 j_iCdL=U1ST:E
RT:rytiafngt 8t.o3 :a cjqiudi=1r:e  jToryuinrnga tlo  laocckqu.i.r.e
jGoFuSr:na lf sliod=ckT.AF..T2
4_GCFSL:U SfTsERi:dt=TafAtFT72.33:4 _jCLiUd=ST1:E RB:tuasfyt
.G3F: Sj: ifds=1id:= TTrAFyiT2n3g 4t_oCL UacSTqEuiRr:et afjto6u.r3na: l
jildo=ck1.: ..B
syG
S: fsGiFd=S:TA fFTsi2d3=4T_CAFLUTS23T4ER_C:LtaUfSTt6ER.3:t: afjit5d.=13:: 
jTridyi=n1g:  tBuos yac
quire jouGFrSn:a l fsloidc=k.TA..F
3G4F_SCL:U fSsTiEdR=:tTaAFftT243.34:_C jLUidS=TE1:R: Btafust5y
3: jid=1: Trying to acquire journal lock...                   .
GFS: fsid=TAFT234_CLUSTER:taft8.3: jid=1: Busy
GFS: fsid=TAFT234_CLUSTER:taft4.3: jid=1: Trying to acquire journal lock...
Unable to handle kernel paging request<4> at 00000000001001G0F0 S:R IfPs:i d
=T<A4FT>2<f34ff_CfLffUfSTfEa0R:2t2acf1t8a9.>{3:: cmjiadn:=1fi: ndL_oobakirrngie
ra+t13j}ou
naPl.ML.4.
20be46067 PGD 0
Oops: 0000 [1] SMP
CPU 3
Modules linked in: radeon nfsd exportfs lockd nfs_acl parport_pc lp parport
autofs4 i2c_dev i2c_core lock_dlm(U) gfs(U) lock_harness(U) dlm(U) cman(U) md5
ipv6 sunrpc dsyenta_socket pcmcia_core button battery ac uhci_hcd ehci_hcd
e752x_edac edac_mc hw_random shpchp e1000 floppy qla2300 qla2xxx sg dm_snapshot
dm_zero dm_mirror ext3 jbd dm_mod lpfc scsi_transport_fc megaraid_mbox
megaraid_mm sd_mod scsi_mod
Pid: 2690, c<o4m>m:G FScm:a fns_sied=rTviAcFeTd2 34N_oCt LtUSaTiEnRte:dt
a2f.t96..39-: 27j.iEd=Ls1:mp
neRI             Do
: 0010:[<ffffffffa022c18a>] <ffffffffa022c18a>{:cman:find_barrier+13}
RSP: 0018:000001021e629e08  EFLAGS: 00010202
RAX: 0000000000000001 RBX: 0000000000100100 RCX: 0000000000000015
RDX: 0000000000000035 RSI: 0000010219b8deda RDI: 000001021e629e81
RBP: 000001021e629e78 R08: 00000000fffffffb R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 000001021e629e78
R13: 0000000000000003 R14: 0000000000000000 R15: ffffffffa024b520
FS:  0000000000000000(0000) GS:ffffffff804d7a80(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000100100 CR3: 0000000037e34000 CR4: 00000000000006e0
Process cman_serviced (pid: 2690, threadinfo 000001021e628000, task
000001021eb4b7f0)
Stack: 0000000000000015 0000010037c48600 ffffffffa024a1a0 ffffffffa022c2cb
       0000010037c48600 000001021e629e78 ffffffffa0235136 00000000fffffffc
       000000000000004b ffffffffa0235194
Call Trace:<ffffffffa022c2cb>{:cman:kcl_barrier_register+130}
       <ffffffffa0235136>{:cman:callback_recovery_barrier+0}
       <ffffffffa0235194>{:cman:sm_barrier+67} <ffffffffa0235625>{:cman:serviced+0}
       <ffffffffa0238fde>{:cman:recovery_barrier+134}
<ffffffffa0235625>{:cman:serviced+0}
       <ffffffffa0235625>{:cman:serviced+0}
<ffffffffa023918e>{:cman:process_recoveries+397}
       <ffffffffa0235625>{:cman:serviced+0}
<ffffffff8014aad0>{keventd_create_kthread+0}
       <ffffffffa0235698>{:cman:serviced+115} <ffffffff8014aaa7>{kthread+200}
       <ffffffff80110e17>{child_rip+8} <ffffffff8014aad0>{keventd_create_kthread+0}
       <ffffffff8014a9df>{kthread+0} <ffffffff80110e0f>{child_rip+0}


Code: 48 8b 03 0f 18 08 48 81 fb 80 a1 24 a0 74 1a 48 8d 73 10 48
RIP <ffffffffa022c18a>{:cman:find_barrier+13} RSP <000001021e629e08>
CR2: 0000000000100100
 <0>Kernel panic - not syncing: Oops



Version-Release number of selected component (if applicable):
[root@taft-03 ~]# uname -ar
Linux taft-03 2.6.9-27.ELsmp #1 SMP Tue Dec 20 19:21:06 EST 2005 x86_64 x86_64
x86_64 GNU/Linux
[root@taft-03 ~]# rpm -q cman
cman-1.0.4-0
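
The fault address in the oops above, 0000000000100100, equals LIST_POISON1, the value that list_del() in 2.6-era kernels writes into the next pointer of a removed list entry. That would be consistent with find_barrier following a pointer out of an entry that had already been deleted, although nothing in this report confirms it. Below is a minimal Python sketch for pulling the faulting symbol and address out of an oops and flagging poison values. The helper name summarize_oops and the regular expressions are illustrative assumptions, not part of any existing tool, and the sample text is the last two register lines of the oops above.

```python
import re

# Values list_del() stores in a removed entry on 2.6-era kernels
# (include/linux/list.h): next = LIST_POISON1, prev = LIST_POISON2.
LIST_POISON = {0x00100100: "LIST_POISON1", 0x00200200: "LIST_POISON2"}

# Matches e.g. "RIP <ffffffffa022c18a>{:cman:find_barrier+13}".
# The ":module:" part is absent for symbols in the core kernel.
RIP_RE = re.compile(r"^RIP <([0-9a-f]+)>\{(?::(\w+):)?(\w+)\+(\d+)\}", re.MULTILINE)
CR2_RE = re.compile(r"^CR2: ([0-9a-f]+)", re.MULTILINE)


def summarize_oops(text):
    """Return the faulting symbol and address, or None if not found."""
    rip = RIP_RE.search(text)
    cr2 = CR2_RE.search(text)
    if rip is None or cr2 is None:
        return None
    fault = int(cr2.group(1), 16)
    return {
        "module": rip.group(2),          # None for core-kernel symbols
        "function": rip.group(3),
        "offset": int(rip.group(4)),     # offset as printed in the oops
        "fault_address": fault,
        "poison": LIST_POISON.get(fault),  # None if not a poison value
    }


SAMPLE = """\
RIP <ffffffffa022c18a>{:cman:find_barrier+13} RSP <000001021e629e08>
CR2: 0000000000100100
"""

summary = summarize_oops(SAMPLE)
print(summary)
```

A non-None "poison" entry is only a hint that a list entry was used after removal; it does not say which list or which code path removed it.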
Comment 1 Corey Marthaler 2006-01-17 12:15:48 EST
This is reproducible. I hit it again last night with a similar testing scenario and
got a slightly clearer-looking stack trace:

GFS: fsid=TAFT234_CLUSTER:taft5.2: jid=1: Acquiring the transaction lock...
GFS: fsid=TAFT234_CLUSTER:taft5.2: jid=1: Replaying journal...
GFS: fsid=TAFT234_CLUSTER:taft5.2: jid=1: Replayed 0 of 0 blocks
GFS: fsid=TAFT234_CLUSTER:taft5.2: jid=1: replays = 0, skips = 0, sames = 0
GFS: fsid=TAFT234_CLUSTER:taft5.2: jid=1: Journal replayed in 1s
GFS: fsid=TAFT234_CLUSTER:taft5.2: jid=1: Done
GFS: fsid=TAFT234_CLUSTER:taft9.2: jid=1: Busy
Unable to handle kernel paging request at 0000000000100100 RIP:
<ffffffffa021419c>{:cman:find_barrier+13}
PML4 214f52067 PGD 0
Oops: 0000 [1] SMP
CPU 2
Modules linked in: radeon nfsd exportfs lockd nfs_acl parport_pc lp parport
autofs4 i2c_dev i2c_core lock_dlm(U) gfs(U) lock_harness(U) dlm(U) cman(U) md5
ipv6 sunrpc dsyenta_socket pcmcia_core button battery ac uhci_hcd ehci_hcd
e752x_edac edac_mc hw_random e1000 floppy qla2300 qla2xxx sg dm_snapshot dm_zero
dm_mirror ext3 jbd dm_mod lpfc scsi_transport_fc megaraid_mbox megaraid_mm
sd_mod scsi_mod
Pid: 2746, comm: cman_serviced Tainted: GF     2.6.9-27.ELsmp
RIP: 0010:[<ffffffffa021419c>] <ffffffffa021419c>{:cman:find_barrier+13}
RSP: 0018:000001021d3bbe08  EFLAGS: 00010286
RAX: 00000000fffffffc RBX: 0000000000100100 RCX: 0000000000000015
RDX: 0000000000000031 RSI: 0000010037d3469a RDI: 000001021d3bbe81
RBP: 000001021d3bbe78 R08: 00000000fffffffb R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 000001021d3bbe78
R13: 0000000000000003 R14: 0000000000000000 R15: ffffffffa02385e0
FS:  0000000000000000(0000) GS:ffffffff804d7a00(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000100100 CR3: 00000000dffbe000 CR4: 00000000000006e0
Process cman_serviced (pid: 2746, threadinfo 000001021d3ba000, task
000001021d59a7f0)
Stack: 0000000000000015 00000101ffc0b600 ffffffffa0236760 ffffffffa02142dd
       00000101ffc0b600 000001021d3bbe78 ffffffffa021d71e 00000000fffffffc
       0000000000000064 ffffffffa021d77c
Call Trace:<ffffffffa02142dd>{:cman:kcl_barrier_register+130}
       <ffffffffa021d71e>{:cman:callback_recovery_barrier+0}
       <ffffffffa021d77c>{:cman:sm_barrier+67} <ffffffffa021dc0d>{:cman:serviced+0}
       <ffffffffa022165a>{:cman:recovery_barrier+134}
<ffffffffa021dc0d>{:cman:serviced+0}
       <ffffffffa021dc0d>{:cman:serviced+0}
<ffffffffa022180a>{:cman:process_recoveries+397}
       <ffffffffa021dc0d>{:cman:serviced+0}
<ffffffff8014aad0>{keventd_create_kthread+0}
       <ffffffffa021dc80>{:cman:serviced+115} <ffffffff8014aaa7>{kthread+200}
       <ffffffff80110e17>{child_rip+8} <ffffffff8014aad0>{keventd_create_kthread+0}
       <ffffffff8014a9df>{kthread+0} <ffffffff80110e0f>{child_rip+0}


Code: 48 8b 03 0f 18 08 48 81 fb 40 67 23 a0 74 1a 48 8d 73 10 48
RIP <ffffffffa021419c>{:cman:find_barrier+13} RSP <000001021d3bbe08>
CR2: 0000000000100100
 <0>Kernel panic - not syncing: Oops
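
The two oopses were captured with the cman module loaded at different addresses, so the raw addresses differ even where the code path is the same. The following is a small sketch, assuming the `<address>{:module:function+offset}` frame format seen in these traces, that reduces each frame to (module, function, offset) so the two reports can be compared directly. The helper frames() is a hypothetical name, and the inputs are abridged from the traces in this report (the "+0" entries are left out).

```python
import re

# One frame looks like "<ffffffffa022c2cb>{:cman:kcl_barrier_register+130}"
# or, for core-kernel symbols, "<ffffffff8014aaa7>{kthread+200}".
FRAME_RE = re.compile(r"<[0-9a-f]{16}>\{(?::(\w+):)?(\w+)\+(\d+)\}")


def frames(trace):
    """Return (module, function, offset) tuples in order of appearance."""
    return [(mod or "kernel", func, int(off))
            for mod, func, off in FRAME_RE.findall(trace)]


# Abridged from the oops in the description (2006-01-11).
TRACE_A = """
<ffffffffa022c18a>{:cman:find_barrier+13}
<ffffffffa022c2cb>{:cman:kcl_barrier_register+130}
<ffffffffa0235194>{:cman:sm_barrier+67}
<ffffffffa0238fde>{:cman:recovery_barrier+134}
<ffffffffa023918e>{:cman:process_recoveries+397}
<ffffffffa0235698>{:cman:serviced+115}
<ffffffff8014aaa7>{kthread+200}
"""

# Abridged from the oops in comment 1 (2006-01-17).
TRACE_B = """
<ffffffffa021419c>{:cman:find_barrier+13}
<ffffffffa02142dd>{:cman:kcl_barrier_register+130}
<ffffffffa021d77c>{:cman:sm_barrier+67}
<ffffffffa022165a>{:cman:recovery_barrier+134}
<ffffffffa022180a>{:cman:process_recoveries+397}
<ffffffffa021dc80>{:cman:serviced+115}
<ffffffff8014aaa7>{kthread+200}
"""

same_path = frames(TRACE_A) == frames(TRACE_B)
print("same call path:", same_path)
for module, function, offset in frames(TRACE_A):
    print(f"{module}:{function}+{offset}")
```

Matching symbol and offset sequences suggest both panics went through the same recovery-barrier path, which fits the report that the problem is reproducible.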



Comment 2 Christine Caulfield 2006-01-20 05:09:11 EST
Is there any chance of finding out what the other CPU is doing when this
happens? I have a hunch I know where it will be, but it would be nice to be sure.
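
Why the other CPU matters can be shown with a toy model. The sketch below is not cman code; it is a hypothetical Python model of a kernel-style doubly linked list in which list_del() poisons the removed entry, as 2.6-era kernels do. If one context removes an entry while another is still standing on it, the walker's next step lands on LIST_POISON1 (0x00100100), which matches the fault address reported in this bug. Whether that is what actually happened here is an assumption, and all names in the sketch are illustrative.

```python
LIST_POISON1 = 0x00100100  # stored in entry.next by list_del()
LIST_POISON2 = 0x00200200  # stored in entry.prev by list_del()


class Node:
    def __init__(self, name):
        self.name = name
        self.next = self  # a lone node is an empty circular list
        self.prev = self


def make_list(names):
    """Build a circular list with a sentinel head, appending at the tail."""
    head = Node("head")
    for name in names:
        node = Node(name)
        node.prev, node.next = head.prev, head
        head.prev.next = node
        head.prev = node
    return head


def list_del(entry):
    """Unlink an entry and poison its pointers, like the kernel helper."""
    entry.prev.next = entry.next
    entry.next.prev = entry.prev
    entry.next = LIST_POISON1
    entry.prev = LIST_POISON2


def walk(head, interfere=None):
    """Visit each entry; interfere() stands in for another CPU running."""
    visited = []
    pos = head.next
    while pos is not head:
        if pos == LIST_POISON1:
            # The kernel would take a paging fault at this address.
            raise RuntimeError("walker followed LIST_POISON1 (0x%08x)" % pos)
        visited.append(pos.name)
        if interfere is not None:
            interfere(pos)
        pos = pos.next
    return visited


head = make_list(["a", "b", "c"])
print(walk(head))  # undisturbed walk visits every entry


def remove_b(pos):
    # Simulated second context deleting the entry the walker is on.
    if pos.name == "b":
        list_del(pos)


try:
    walk(head, interfere=remove_b)
    outcome = "completed"
except RuntimeError as exc:
    outcome = str(exc)
print(outcome)
```

In the model, the walk only fails when the removal happens between the walker reaching an entry and reading its next pointer, which is why knowing what the other CPU was doing at the time of the panic would help confirm or rule out this kind of race.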
Comment 3 Christine Caulfield 2006-01-20 10:40:05 EST
Created attachment 123489
Patch to test

If it's what I think it is, then this patch should fix it. 

I'd like a little more evidence before committing this into the tree though, if
that's possible.
Comment 9 Christine Caulfield 2006-04-19 03:55:23 EDT
Patch committed to STABLE:
Checking in cnxman-private.h;
/cvs/cluster/cluster/cman-kernel/src/Attic/cnxman-private.h,v  <--  cnxman-private.h
new revision: 1.12.2.2.6.3; previous revision: 1.12.2.2.6.2
done
Checking in cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/cnxman.c,v  <--  cnxman.c
new revision: 1.42.2.12.4.1.2.11; previous revision: 1.42.2.12.4.1.2.10
done

and RHEL4:
Checking in cnxman-private.h;
/cvs/cluster/cluster/cman-kernel/src/Attic/cnxman-private.h,v  <--  cnxman-private.h
new revision: 1.12.2.5; previous revision: 1.12.2.4
done
Checking in cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/cnxman.c,v  <--  cnxman.c
new revision: 1.42.2.24; previous revision: 1.42.2.23
done
Comment 10 Corey Marthaler 2006-08-03 15:20:28 EDT
Haven't seen this bug in almost 4 months since the fix went in. Marking verified.
Comment 12 Red Hat Bugzilla 2006-08-10 17:32:27 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0559.html
