Bug 129655 - dlm_recoverd Ooops while running I/O and two nodes are taken down
Summary: dlm_recoverd Ooops while running I/O and two nodes are taken down
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: dlm
Version: 4
Hardware: i686
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Christine Caulfield
QA Contact: GFS Bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2004-08-11 14:55 UTC by Corey Marthaler
Modified: 2009-04-16 19:58 UTC
CC List: 2 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2005-01-10 22:25:00 UTC
Embargoed:



Description Corey Marthaler 2004-08-11 14:55:10 UTC
Description of problem:
I was running the accordion test for many hours and then took down two
of the nodes in the cluster (morph-04 and morph-05). This then caused
morph-02 to oops:


Aug 11 09:31:38 morph-02 kernel: dlm: gfs1: purge locks of departed nodes
Aug 11 09:31:38 morph-02 kernel: dlm: gfs1: purged 3142 locks
Aug 11 09:31:38 morph-02 kernel: dlm: gfs1: update remastered resources
dlm: gfs1: updated 22853 resources
dlm: gfs1: rebuild locks
Unable to handle kernel paging request at virtual address 20202020
 printing eip:
c02cb344
*pde = 00000000
Oops: 0002 [#1]
Modules linked in: gnbd lock_gulm lock_nolock lock_dlm dlm cman gfs
lock_harness ipv6 autofs4 sunrpc e1000 microcode dm_mod uhci_hcd
ehci_hcd button battery asus_acpi ac ext3 jbd qla2300 qla2xxx
scsi_transport_fc sd_mod scsi_mod
CPU:    0
EIP:    0060:[<c02cb344>]    Not tainted
EFLAGS: 00010202   (2.6.7)
EIP is at rwsem_down_write_failed+0x44/0x16a
eax: ffffffff   ebx: e6695554   ecx: e6695558   edx: 20202020
esi: f74dfe90   edi: f77591b0   ebp: e6695554   esp: f74dfe88
ds: 007b   es: 007b   ss: 0068
Process dlm_recoverd (pid: 3212, threadinfo=f74de000 task=f77591b0)
Stack: 646c6975 e6695558 e6695558 00000246 f77591b0 00000002 ffffffe4 e66954f0
       f74dff04 e6695554 f8a46eea 00000000 e66954f0 00000000 f74dff04 f7cd1e38
       f8a4617b f74dff00 f7cd1e38 dff08000 f74dff2c f7cd1f08 f8a4626f f7cd1e38
Call Trace:
 [<f8a46eea>] .text.lock.rebuild+0x5a/0xc0 [dlm]
 [<f8a4617b>] fill_rcom_buffer+0x9b/0xe0 [dlm]
 [<f8a4626f>] rebuild_rsbs_send+0xaf/0x1e0 [dlm]
 [<f8a48dca>] ls_reconfig+0xca/0x230 [dlm]
 [<f8a49be5>] do_ls_recovery+0x175/0x430 [dlm]
 [<f8a49fc8>] dlm_recoverd+0x128/0x170 [dlm]
 [<c0118850>] default_wake_function+0x0/0x10
 [<c0105c12>] ret_from_fork+0x6/0x14
 [<c0118850>] default_wake_function+0x0/0x10
 [<f8a49ea0>] dlm_recoverd+0x0/0x170 [dlm]
 [<c010429d>] kernel_thread_helper+0x5/0x18

Code: 89 32 89 54 24 0c 0f c1 03 48 66 85 c0 75 2d 8d b6 00 00 00
Aug 11 09:31:39 morph-02 kernel: dlm: gfs1: restbl_rsb_update_recv rsb not found 16243
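
A side note on the trace: the faulting address 0x20202020 is four ASCII
spaces, and the first word on the stack, 0x646c6975, decodes
(little-endian) to "uild" (compare .text.lock.rebuild in the call
trace), which looks like string data overwriting the rw_semaphore being
acquired. A minimal standalone decoder, illustrative only and not part
of the captured logs:

#include <stdio.h>

static void decode(unsigned int word)
{
    unsigned char b[4];
    /* i386 is little-endian: the lowest byte sits first in memory */
    b[0] = word & 0xff;
    b[1] = (word >> 8) & 0xff;
    b[2] = (word >> 16) & 0xff;
    b[3] = (word >> 24) & 0xff;
    printf("0x%08x -> \"%c%c%c%c\"\n", word, b[0], b[1], b[2], b[3]);
}

int main(void)
{
    decode(0x20202020);  /* faulting address: "    ", four spaces */
    decode(0x646c6975);  /* first stack word: "uild" */
    return 0;
}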


How reproducible:
Didn't try

Comment 1 David Teigland 2004-08-12 07:05:28 UTC
I used the following args:
accordion -p 10 -L fcntl -s 1024000 -e 4097 -t -m 100 acc1 acc2 acc3 acc4

I only let it run for a couple of hours on 8 nodes before killing 2 and
didn't have any problem.  The initial problem may be somewhat removed
from the actual oops, based on the "restbl_rsb_update_recv" error
message.  We can add some better error checking and reporting around
that code to help the next time we're able to cause this.  It probably
won't appear every time, although it would be nice if it did.
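
For illustration, the kind of checking meant here might look like the
sketch below, written as standalone C so it compiles; the names
(find_rsb_by_id, the struct layouts) are assumptions made up for the
example, not the actual dlm code:

#include <stdio.h>
#include <stdint.h>
#include <errno.h>

struct rsb;                                  /* stand-in for struct dlm_rsb */
struct lockspace { const char *name; };      /* stand-in for struct dlm_ls  */

/* Assumed lookup helper: returns NULL when the resource id is unknown. */
static struct rsb *find_rsb_by_id(struct lockspace *ls, uint32_t id)
{
    (void)ls; (void)id;
    return NULL;    /* simulate the "rsb not found" case from the log */
}

static int restbl_rsb_update_recv(struct lockspace *ls, uint32_t rsb_id)
{
    struct rsb *rsb = find_rsb_by_id(ls, rsb_id);

    if (!rsb) {
        /* Report enough context to correlate with a later oops instead
         * of silently continuing recovery with a missing resource. */
        fprintf(stderr, "dlm: %s: restbl_rsb_update_recv: rsb not found %u\n",
                ls->name, rsb_id);
        return -ENOENT;
    }

    /* ... normal update path would go here ... */
    return 0;
}

int main(void)
{
    struct lockspace ls = { "gfs1" };
    return restbl_rsb_update_recv(&ls, 16243) ? 1 : 0;
}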

I'm curious about the rather large number of locks ("updated 22853
resources") in your test.  Are there different accordion args than I
used that might explain that?


Comment 2 Corey Marthaler 2004-08-12 15:28:14 UTC
Dave,

I might also have been running genesis to bump up the lock count. They
were running on each of 6 nodes, against either 3 or 5 filesystems per
node, with 10 files each, overnight (about 15 hours) before I took down
the 2 other nodes, which caused the oops.

Here are the command lines I used:
./accordion -L flock -s 2097152 -e 1024 -t -m 100000 -S 54321 accd1
accd2 accd3 accd4 accd5 accd6 accd7 accd8 accd9 accd10

./genesis -S 12345 -n 7500 -s 1048576 -L flock -d 700

(genesis alone, at -n 7500 files on 3 filesystems, would plausibly
account for roughly 22500 resources, close to the 22853 reported above.)



Comment 3 Kiersten (Kerri) Anderson 2004-11-04 15:14:26 UTC
Updated with the proper version and component name.

Comment 4 Corey Marthaler 2005-01-10 22:25:00 UTC
This hasn't been seen in 5 months (with a lot of recovery testing).

