Bug 222820 - Kernel panic while trying to rejoin cluster
Status: CLOSED NOTABUG
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: dlm
Version: 4
Hardware: All
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Target Release: ---
Assigned To: David Teigland
QA Contact: Cluster QE
Depends On:
Blocks:
Reported: 2007-01-16 07:51 EST by Pawel Sadowski
Modified: 2009-04-16 16:31 EDT (History)
CC List: 2 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2007-05-30 14:20:37 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:


Attachments: None
Description Pawel Sadowski 2007-01-16 07:51:51 EST
Description of problem:
 We have a cluster of two nodes with shared storage (GFS).
During a test we got a kernel panic. We had disabled fencing's ability to
reboot the other cluster node via iLO (by configuring a wrong iLO password).

Version-Release number of selected component (if applicable):
cman-devel-1.0.11-0
kernel-devel-2.6.9-42.0.3.EL
cman-kernel-2.6.9-45.8
dlm-kernel-2.6.9-44.3
GFS-kernel-smp-2.6.9-60.3
dlm-devel-1.0.1-1
kernel-smp-2.6.9-42.0.3.EL
cman-kernheaders-2.6.9-45.8
dlm-kernel-smp-2.6.9-44.3
dlm-kernheaders-2.6.9-44.3
GFS-kernel-2.6.9-60.3
cman-1.0.11-0
dlm-1.0.1-1
cmanic-7.6.0-5.rhel4
kernel-smp-devel-2.6.9-42.0.3.EL
cman-kernel-smp-2.6.9-45.8

cat /etc/redhat-release
Red Hat Enterprise Linux ES release 4 (Nahant Update 4)



How reproducible (a consolidated command sketch follows these steps):
  1. disable fencing (wrong password for iLO)
  2. ifdown bond0 on the first node
  3. there is no communication between the nodes and no access to the GFS
shared storage (ls -l /shared hangs on both nodes; correct behavior)
  4. ifup bond0 on the first node after a 4-minute break
  5. after a few seconds, kernel panic on the first node
  6. first node restarted manually
  7. after a successful reboot, both nodes can operate on the GFS shared storage
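
The steps above condense to the following sketch, run on the first node.
The interface and mount names are taken from this report; the sleep interval
approximates the 4-minute break, and it assumes fencing has already been
broken with a wrong iLO password in cluster.conf:

  #!/bin/sh
  # Reproduction sketch for this report; run on the first node.
  ifdown bond0        # step 2: cut all cluster communication
  # step 3: 'ls -l /shared' now hangs on both nodes (expected, since GFS
  # cannot grant locks without cluster communication)
  sleep 240           # step 4: the 4-minute break
  ifup bond0          # the panic followed within a few seconds here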


Actual results:
  The first node got a kernel panic after communication was restored.

Expected results:
  Why does the first node get a kernel panic after communication is restored?
Is it correct to panic once communication has been restored?

Logs from first node:
Jan 16 09:23:01 node1 kernel: bonding: bond0: released all slaves
Jan 16 09:23:05 node1 kernel: CMAN: sendmsg failed: -22
Jan 16 09:23:10 node1 kernel: CMAN: sendmsg failed: -22
Jan 16 09:23:15 node1 kernel: CMAN: sendmsg failed: -22
Jan 16 09:23:20 node1 kernel: CMAN: sendmsg failed: -22
Jan 16 09:23:20 node1 kernel: CMAN: sendmsg failed: -22
Jan 16 09:23:20 node1 kernel: CMAN: removing node node2 from the cluster :
Missed too many heartbeats
Jan 16 09:23:20 node1 kernel: CMAN: sendmsg failed: -22
Jan 16 09:23:20 node1 fenced[3457]: node2 not a cluster member after 0 sec
post_fail_delay
Jan 16 09:23:20 node1 fenced[3457]: fencing node "node2"
Jan 16 09:23:20 node1 fenced[3457]: agent "fence_ilo" reports: Connect failed:
connect: Network is unreachable; Network is unreachable at
/usr/lib/perl5/vendor_perl/5.8.5/i386-linux-thread-multi/Net/SSL.pm line 104, <>
line 4.
Jan 16 09:23:20 node1 fence_manual: Node node2 needs to be reset before recovery
can procede.  Waiting for node2 to rejoin the cluster or for manual
acknowledgement that it has been reset (i.e. fence_ack_manual -n node2)
Jan 16 09:23:25 node1 kernel: CMAN: sendmsg failed: -22
Jan 16 09:23:25 node1 kernel: CMAN: No functional network interfaces, leaving
cluster
Jan 16 09:23:25 node1 kernel: CMAN: sendmsg failed: -22
Jan 16 09:23:25 node1 kernel: CMAN: we are leaving the cluster.
Jan 16 09:23:25 node1 kernel: WARNING: dlm_emergency_shutdown
Jan 16 09:23:25 node1 clurgmgrd[5737]: <warning> #67: Shutting down uncleanly
Jan 16 09:23:25 node1 kernel: WARNING: dlm_emergency_shutdown
Jan 16 09:23:25 node1 kernel: SM: 00000001 sm_stop: SG still joined
Jan 16 09:23:25 node1 kernel: SM: 01000002 sm_stop: SG still joined
Jan 16 09:23:25 node1 kernel: SM: 02000003 sm_stop: SG still joined
Jan 16 09:23:25 node1 kernel: SM: 03000006 sm_stop: SG still joined
Jan 16 09:23:25 node1 ccsd[3338]: Cluster manager shutdown.  Attemping to
reconnect...
Jan 16 09:23:33 node1 ccsd[3338]: Unable to connect to cluster infrastructure
after 30 seconds.
Jan 16 09:24:03 node1 ccsd[3338]: Unable to connect to cluster infrastructure
after 60 seconds.
Jan 16 09:24:33 node1 ccsd[3338]: Unable to connect to cluster infrastructure
after 90 seconds.
Jan 16 09:25:04 node1 ccsd[3338]: Unable to connect to cluster infrastructure
after 120 seconds.
Jan 16 09:25:34 node1 ccsd[3338]: Unable to connect to cluster infrastructure
after 150 seconds.
Jan 16 09:26:04 node1 ccsd[3338]: Unable to connect to cluster infrastructure
after 180 seconds.
Jan 16 09:26:35 node1 ccsd[3338]: Unable to connect to cluster infrastructure
after 210 seconds.
Jan 16 09:26:57 node1 kernel: bonding: bond0: enslaving eth0 as a backup
interface with a down link.
Jan 16 09:26:57 node1 kernel: bonding: bond0: enslaving eth1 as a backup
interface with a down link.
Jan 16 09:26:58 node1 kernel: tg3: eth0: Link is up at 100 Mbps, full duplex.
Jan 16 09:26:58 node1 kernel: tg3: eth0: Flow control is off for TX and off for RX.
Jan 16 09:26:59 node1 kernel: bonding: bond0: link status definitely up for
interface eth0.
Jan 16 09:26:59 node1 kernel: bonding: bond0: making interface eth0 the new
active one.
Jan 16 09:26:59 node1 kernel: tg3: eth1: Link is up at 100 Mbps, full duplex.
Jan 16 09:26:59 node1 kernel: tg3: eth1: Flow control is off for TX and off for RX.
Jan 16 09:26:59 node1 kernel: bonding: bond0: link status definitely up for
interface eth1.
Jan 16 09:27:05 node1 ccsd[3338]: Unable to connect to cluster infrastructure
after 240 seconds.
Jan 16 09:27:35 node1 ccsd[3338]: Unable to connect to cluster infrastructure
after 270 seconds.
Jan 16 09:27:50 node1 kernel: dlm: dlm_unlock: lkid 2004a lockspace not found
Jan 16 09:27:50 node1 kernel: (first)
Jan 16 09:27:50 node1 kernel: shared add nodes
Jan 16 09:27:50 node1 kernel: shared total nodes 2
Jan 16 09:27:50 node1 kernel: shared rebuild resource directory
Jan 16 09:27:50 node1 kernel: shared rebuilt 113 resources
Jan 16 09:27:50 node1 kernel: shared recover event 2 done
Jan 16 09:27:50 node1 kernel: shared move flags 0,0,1 ids 0,2,2
Jan 16 09:27:50 node1 kernel: shared process held requests
Jan 16 09:27:50 node1 kernel: shared processed 0 requests
Jan 16 09:27:50 node1 kernel: shared recover event 2 finished
Jan 16 09:27:50 node1 kernel: shared_data move flags 0,1,0 ids 0,4,0
Jan 16 09:27:50 node1 kernel: shared_data move use event 4
Jan 16 09:27:50 node1 kernel: shared_data recover event 4 (first)
Jan 16 09:27:50 node1 kernel: shared_data add nodes
Jan 16 09:27:50 node1 kernel: shared_data total nodes 2
Jan 16 09:27:50 node1 kernel: shared_data rebuild resource directory
Jan 16 09:27:50 node1 kernel: shared_data rebuilt 52 resources
Jan 16 09:27:50 node1 kernel: shared_data recover event 4 done
Jan 16 09:27:50 node1 kernel: shared_data move flags 0,0,1 ids 0,4,4
Jan 16 09:27:50 node1 kernel: shared_data process held requests
Jan 16 09:27:50 node1 kernel: shared_data processed 0 requests
Jan 16 09:27:50 node1 kernel: shared_data recover event 4 finished
Jan 16 09:27:50 node1 kernel: Magma move flags 0,1,0 ids 0,7,0
Jan 16 09:27:50 node1 kernel: Magma move use event 7
Jan 16 09:27:50 node1 kernel: Magma recover event 7 (first)
Jan 16 09:27:51 node1 kernel: Magma add nodes
Jan 16 09:27:51 node1 kernel: Magma total nodes 2
Jan 16 09:27:51 node1 kernel: Magma rebuild resource directory
Jan 16 09:27:51 node1 kernel: Magma rebuilt 4 resources
Jan 16 09:27:51 node1 kernel: Magma recover event 7 done
Jan 16 09:27:51 node1 kernel: Magma move flags 0,0,1 ids 0,7,7
Jan 16 09:27:51 node1 kernel: Magma process held requests
Jan 16 09:27:51 node1 kernel: Magma processed 0 requests
Jan 16 09:27:51 node1 kernel: Magma recover event 7 finished
Jan 16 09:27:51 node1 kernel: shared move flags 1,0,0 ids 2,2,2
Jan 16 09:27:51 node1 kernel: shared_data move flags 1,0,0 ids 4,4,4
Jan 16 09:27:51 node1 kernel: Magma move flags 1,0,0 ids 7,7,7
Jan 16 09:27:51 node1 kernel: 3487 pr_start last_stop 0 last_start 3 last_finish 0
Jan 16 09:27:51 node1 kernel: 3487 pr_start count 2 type 2 event 3 flags 250
Jan 16 09:27:51 node1 kernel: 3487 claim_jid 0
Jan 16 09:27:51 node1 kernel: 3487 pr_start 3 done 1
Jan 16 09:27:51 node1 kernel: 3487 pr_finish flags 5a
Jan 16 09:27:51 node1 kernel: 3469 recovery_done jid 0 msg 309 a
Jan 16 09:27:51 node1 kernel: 3469 recovery_done nodeid 1 flg 18
Jan 16 09:27:51 node1 kernel: 3498 pr_start last_stop 0 last_start 5 last_finish 0
Jan 16 09:27:51 node1 kernel: 3498 pr_start count 2 type 2 event 5 flags 250
Jan 16 09:27:52 node1 kernel: 3498 claim_jid 0
Jan 16 09:27:52 node1 kernel: 3498 pr_start 5 done 1
Jan 16 09:27:52 node1 kernel: 3498 pr_finish flags 5a
Jan 16 09:27:52 node1 kernel: 3469 recovery_done jid 0 msg 309 a
Jan 16 09:27:52 node1 kernel: 3469 recovery_done nodeid 1 flg 18
Jan 16 09:27:52 node1 kernel:
Jan 16 09:27:52 node1 kernel: lock_dlm:  Assertion failed on line 357 of file
/usr/src/redhat/BUILD/gfs-kernel-2.6.9-60/smp/src/dlm/lock.c
Jan 16 09:27:52 node1 kernel: lock_dlm:  assertion:  "!error"
Jan 16 09:27:52 node1 kernel: lock_dlm:  time = 56781181
Jan 16 09:27:52 node1 kernel: shared_data: error=-22 num=2,19 lkf=0 flags=84
Jan 16 09:27:52 node1 kernel:
Jan 16 09:27:52 node1 kernel: ------------[ cut here ]------------
Jan 16 09:27:52 node1 kernel: kernel BUG at
/usr/src/redhat/BUILD/gfs-kernel-2.6.9-60/smp/src/dlm/lock.c:357!
Jan 16 09:27:52 node1 kernel: invalid operand: 0000 [#1]
Jan 16 09:27:52 node1 kernel: SMP
Jan 16 09:27:52 node1 kernel: Modules linked in: iptable_filter mptctl mptbase
8021q sg cpqci(U) i2c_dev i2c_core lock_dlm(U) gfs(U) lock_harness(U) dlm(U)
cman(U) iptable_nat ip_conntrack ip_tables dm_mirror dm_mod button battery ac
uhci_hcd ehci_hcd hw_random tg3 bonding(U) floppy ext3 jbd qla6312 qla2400
qla2300 qla2xxx scsi_transport_fc cciss sd_mod scsi_mod
Jan 16 09:27:52 node1 kernel: CPU:    1
Jan 16 09:27:52 node1 kernel: EIP:    0060:[<f8bbc5f3>]    Tainted: P      VLI
Jan 16 09:27:52 node1 kernel: EFLAGS: 00010246   (2.6.9-42.0.3.ELsmp)
Jan 16 09:27:52 node1 kernel: EIP is at do_dlm_unlock+0x8b/0xa0 [lock_dlm]
Jan 16 09:27:52 node1 kernel: eax: 00000001   ebx: f7d9c400   ecx: f6cb4f2c  
edx: f8bc1217
Jan 16 09:27:53 node1 kernel: esi: ffffffea   edi: f7d9c400   ebp: f8bc7000  
esp: f6cb4f28
Jan 16 09:27:53 node1 kernel: ds: 007b   es: 007b   ss: 0068
Jan 16 09:27:53 node1 kernel: Process gfs_glockd (pid: 3500, threadinfo=f6cb4000
task=f73368b0)
Jan 16 09:27:53 node1 kernel: Stack: f8bc1217 f8bc7000 00000003 f8bbc893
f8c2a97a 00000001 f73a5f2c f73a5f10
Jan 16 09:27:53 node1 kernel:        f8c20bc7 f8c56460 c3799e00 f73a5f10
f8c56460 f73a5f10 f8c200bb f73a5f10
Jan 16 09:27:53 node1 kernel:        00000001 f73a5fa4 f8c20172 f73a5f10
f73a5f2c f8c20295 00000001 f73a5f2c
Jan 16 09:27:53 node1 kernel: Call Trace:
Jan 16 09:27:53 node1 kernel:  [<f8bbc893>] lm_dlm_unlock+0x14/0x1c [lock_dlm]
Jan 16 09:27:53 node1 kernel:  [<f8c2a97a>] gfs_lm_unlock+0x2c/0x42 [gfs]
Jan 16 09:27:53 node1 kernel:  [<f8c20bc7>] gfs_glock_drop_th+0xf3/0x12d [gfs]
Jan 16 09:27:53 node1 kernel:  [<f8c200bb>] rq_demote+0x7f/0x98 [gfs]
Jan 16 09:27:53 node1 kernel:  [<f8c20172>] run_queue+0x5a/0xc1 [gfs]
Jan 16 09:27:53 node1 kernel:  [<f8c20295>] unlock_on_glock+0x1f/0x28 [gfs]
Jan 16 09:27:53 node1 kernel:  [<f8c2224d>] gfs_reclaim_glock+0xc3/0x13c [gfs]
Jan 16 09:27:53 node1 kernel:  [<f8c14e05>] gfs_glockd+0x39/0xde [gfs]
Jan 16 09:27:53 node1 kernel:  [<c011e794>] default_wake_function+0x0/0xc
Jan 16 09:27:53 node1 kernel:  [<c02d46e6>] ret_from_fork+0x6/0x14
Jan 16 09:27:53 node1 kernel:  [<c011e794>] default_wake_function+0x0/0xc
Jan 16 09:27:53 node1 kernel:  [<f8c14dcc>] gfs_glockd+0x0/0xde [gfs]
Jan 16 09:27:53 node1 kernel:  [<c01041f5>] kernel_thread_helper+0x5/0xb
Jan 16 09:27:53 node1 kernel: Code: 73 34 8b 03 ff 73 2c ff 73 08 ff 73 04 ff 73
0c 56 ff 70 18 68 0f 13 bc f8 e8 b8 62 56 c7 83 c4 34 68 17 12 bc f8 e8 ab 62 56
c7 <0f> 0b 65 01 5f 11 bc f8 68 19 12 bc f8 e8 66 5a 56 c7 5b 5e c3
Jan 16 09:27:54 node1 kernel:  <0>Fatal exception: panic in 5 seconds


Logs from second node:
Jan 16 09:23:23 node2 kernel: CMAN: removing node node1 from the cluster :
Missed too many heartbeats
Jan 16 09:23:23 node2 fenced[3437]: node1 not a cluster member after 0 sec
post_fail_delay
Jan 16 09:23:23 node2 fenced[3437]: fencing node "node1"
Jan 16 09:23:34 node2 fenced[3437]: agent "fence_ilo" reports: error: User login
name was not found power_status: unexpected error
Jan 16 09:23:34 node2 fence_manual: Node node1 needs to be reset before recovery
can procede.  Waiting for node1 to rejoin the cluster or for manual
acknowledgement that it has been reset (i.e. fence_ack_manual -n node1)
Jan 16 09:33:28 node2 kernel: CMAN: node node1 rejoining
Jan 16 09:33:29 node2 fenced[3437]: fence "node1" success
Jan 16 09:33:34 node2 ccsd[3338]: Attempt to close an unopened CCS descriptor
(172230).
Jan 16 09:33:34 node2 ccsd[3338]: Error while processing disconnect: Invalid
request descriptor
Jan 16 09:33:36 node2 kernel: GFS: fsid=cl-node:shared_data.1: jid=0: Trying to
acquire journal lock...
Jan 16 09:33:36 node2 kernel: GFS: fsid=cl-node:node1: jid=0: Trying to acquire
journal lock...
Jan 16 09:33:36 node2 kernel: GFS: fsid=cl-node:shared_data.1: jid=0: Looking at
journal...
Jan 16 09:33:36 node2 kernel: GFS: fsid=cl-node:node1: jid=0: Looking at journal...
Jan 16 09:33:36 node2 kernel: GFS: fsid=cl-node:node1: jid=0: Acquiring the
transaction lock...
Jan 16 09:33:36 node2 kernel: GFS: fsid=cl-node:shared_data.1: jid=0: Acquiring
the transaction lock...
Jan 16 09:33:36 node2 kernel: GFS: fsid=cl-node:shared_data.1: jid=0: Replaying
journal...
Jan 16 09:33:36 node2 kernel: GFS: fsid=cl-node:shared_data.1: jid=0: Replayed 0
of 0 blocks
Jan 16 09:33:36 node2 kernel: GFS: fsid=cl-node:shared_data.1: jid=0: replays =
0, skips = 0, sames = 0
Jan 16 09:33:36 node2 kernel: GFS: fsid=cl-node:shared_data.1: jid=0: Journal
replayed in 1s
Jan 16 09:33:36 node2 kernel: GFS: fsid=cl-node:shared_data.1: jid=0: Done
Jan 16 09:33:36 node2 kernel: GFS: fsid=cl-node:node1: jid=0: Replaying journal...
Jan 16 09:33:36 node2 kernel: GFS: fsid=cl-node:node1: jid=0: Replayed 0 of 0 blocks
Jan 16 09:33:36 node2 kernel: GFS: fsid=cl-node:node1: jid=0: replays = 0, skips
= 0, sames = 0
Jan 16 09:33:36 node2 clurgmgrd[5923]: <info> Magma Event: Membership Change
Jan 16 09:33:36 node2 kernel: GFS: fsid=cl-node:node1: jid=0: Journal replayed in 1s
Jan 16 09:33:36 node2 clurgmgrd[5923]: <info> State change: node1 DOWN
Jan 16 09:33:36 node2 kernel: GFS: fsid=cl-node:node1: jid=0: Done
Jan 16 09:34:23 node2 clurgmgrd[5923]: <info> Magma Event: Membership Change
Jan 16 09:34:23 node2 clurgmgrd[5923]: <info> State change: node1 UP


I looked at this for a while and found that everything begins when the first
node is able to communicate again. It then tries to do a dlm_unlock:
dlm: dlm_unlock: lkid 2004a lockspace not found
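
For what it's worth, the isolated node's state can be inspected from the shell
before bringing the interface back up; a hedged sketch follows (these are the
standard RHEL4 cluster tools, though the exact proc paths may differ between
releases). Once cman has logged "we are leaving the cluster", a clean reboot,
rather than an ifup, appears to be the only safe way back in:

  # Inspect cluster state on the isolated node before 'ifup bond0'.
  cman_tool status              # membership view; errors out once cman has left
  cman_tool services            # fence, DLM and GFS service groups and states
  cat /proc/cluster/services    # kernel-side view of the same service groups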
Comment 1 Christine Caulfield 2007-01-16 08:40:38 EST
What's happened here is that cman has left the cluster because it can't find any
usable interfaces to communicate on (because you've removed them all).

That sort of forced leave causes the DLM to remove any lockspaces and do an
emergency tidy-up. The assert you're seeing is from the GFS lock_dlm module,
which is complaining that the DLM is returning an error for all of its lock
operations (because the node is no longer in a cluster).

The oddity (which I haven't seen before) is the way that the DLM tries to
process the requests that come in well after it has been shut down - this is odd
because it should close all its sockets and stop the daemons.
Comment 2 David Teigland 2007-01-16 13:51:19 EST
This is RHEL4 (version 4) code.
Comment 3 Pawel Sadowski 2007-01-22 05:18:12 EST
Hi,

 Any ideas on how to avoid this kernel panic?

Thanks
Comment 4 Christine Caulfield 2007-01-22 05:33:24 EST
Enable fencing, and don't down live Ethernet interfaces? ;-)
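
Along those lines, it may help to verify the fence device by hand before
relying on it, since both fence_ilo attempts in the logs above failed. A
sketch, assuming the usual RHEL4 fence-agent options (-a address, -l login,
-p password, -o action); the hostname and credentials are placeholders, not
values from this report:

  # Confirm fence_ilo can reach and query the peer's iLO before a real failure.
  fence_ilo -a ilo-node2.example.com -l Administrator -p SECRET -o status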
Comment 5 Pawel Sadowski 2007-07-02 04:35:02 EDT
(In reply to comment #1)

> The oddity (which I haven't seen before) is the way that the DLM tries to
> process the requests that come in well after it has been shut down - this is odd
> because it should close all its sockets and stop the daemons.
> 

And what about the above comment? Is that proper behaviour, or should it be
treated as a bug?
Comment 6 Christine Caulfield 2007-07-02 04:48:57 EDT
It's probably a bug, though far from a high-priority one. If I can get a
better reproduction scenario, it might get looked at.
