Bug 204697 - rgmanager hangs waiting for DLM event
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: rgmanager
Hardware: All, OS: Linux
Priority: medium, Severity: medium
Assigned To: Lon Hohberger
QA Contact: Cluster QE
Reported: 2006-08-30 17:23 EDT by Lon Hohberger
Modified: 2009-04-16 18:35 EDT (History)
Doc Type: Bug Fix
Last Closed: 2006-09-28 14:05:19 EDT

Description Lon Hohberger 2006-08-30 17:23:55 EDT
Description of problem:

If rgmanager enters select() on the DLM file descriptor and the cluster loses
quorum before a DLM event arrives, nothing is ever sent to wake up the call to
select().  This means that rgmanager can hang forever.  What is needed is a
way for loss of quorum to wake up rgmanager, even if it is in select()
waiting for a DLM event.

A possible solution is to have the lock subsystem open a pipe that a waiter
can use to wake up the thread in wait_for_dlm_event().  The pipe should only
be used when quorum is lost.
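
The pipe idea above can be sketched roughly as follows.  This is only an
illustration of the general self-pipe technique, not rgmanager's actual code:
wake_pipe_init(), wake_dlm_waiter(), wait_for_dlm_event_or_wake() and the
wait_result values are hypothetical names, and dlm_fd stands in for whatever
descriptor the lock subsystem already passes to select().

```c
#include <errno.h>
#include <sys/select.h>
#include <unistd.h>

/* Hypothetical return codes for this sketch (not rgmanager's real API). */
enum wait_result { WAIT_ERROR = -1, WAIT_DLM_EVENT = 0, WAIT_QUORUM_LOST = 1 };

/* Wake-up pipe: [0] is the read end, [1] is the write end. */
static int wake_pipe[2] = { -1, -1 };

/* Create the pipe once, e.g. when the lock subsystem initializes. */
int wake_pipe_init(void)
{
    return pipe(wake_pipe);
}

/* Called only on loss of quorum: one byte makes the read end readable. */
int wake_dlm_waiter(void)
{
    char c = 'q';
    ssize_t n;

    do {
        n = write(wake_pipe[1], &c, 1);
    } while (n < 0 && errno == EINTR);

    return n == 1 ? 0 : -1;
}

/* Block until either the DLM fd or the wake-up pipe becomes readable. */
enum wait_result wait_for_dlm_event_or_wake(int dlm_fd)
{
    fd_set rfds;
    int maxfd = dlm_fd > wake_pipe[0] ? dlm_fd : wake_pipe[0];
    int rv;

    do {
        /* Rebuild the set each time: select() modifies it in place. */
        FD_ZERO(&rfds);
        FD_SET(dlm_fd, &rfds);
        FD_SET(wake_pipe[0], &rfds);
        rv = select(maxfd + 1, &rfds, NULL, NULL, NULL);
    } while (rv < 0 && errno == EINTR);

    if (rv < 0)
        return WAIT_ERROR;

    /* Quorum loss takes priority over a pending DLM event. */
    if (FD_ISSET(wake_pipe[0], &rfds)) {
        char c;

        /* Drain the wake-up byte so the next wait blocks again. */
        if (read(wake_pipe[0], &c, 1) < 0)
            return WAIT_ERROR;
        return WAIT_QUORUM_LOST;
    }

    return WAIT_DLM_EVENT;
}
```

With something like this, the quorum-loss handler would call
wake_dlm_waiter(), and the thread blocked in select() would return
WAIT_QUORUM_LOST instead of sleeping forever.  Checking the pipe before the
DLM fd is a design choice of the sketch, so that loss of quorum is never
masked by a DLM event arriving at the same time.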

Version: CVS/head
Comment 1 Lon Hohberger 2006-08-31 10:54:07 EDT
Note: after I killed the stuck clurgmgrd, the DLM oopsed.  I have not been
able to reproduce this problem since.
Comment 2 Lon Hohberger 2006-09-01 11:49:29 EDT
dlm: got new/restarted association 1 nodeid 2
dlm: rgmanager: recover 1
dlm: rgmanager: add member 2
dlm: rgmanager: add member 1
dlm: rgmanager: total members 2
dlm: rgmanager: dlm_recover_directory
dlm: rgmanager: dlm_recover_directory 0 entries
dlm: rgmanager: recover 1 done: 56 ms
dlm: rgmanager: recover 3
dlm: rgmanager: remove member 2
dlm: rgmanager: total members 1
dlm: rgmanager: dlm_recover_directory
dlm: rgmanager: dlm_recover_directory 0 entries
dlm: rgmanager: pre recover waiter lkid 10304 type 1 flags 1
dlm: rgmanager: dlm_purge_locks
dlm: rgmanager: dlm_recover_masters
dlm: rgmanager: dlm_recover_masters 1 resources
dlm: rgmanager: dlm_recover_locks
dlm: rgmanager: dlm_recover_locks 0 locks
dlm: rgmanager: dlm_recover_rsbs
dlm: rgmanager: dlm_recover_rsbs 1 rsbs
dlm: rgmanager: recover_waiters_post 10304 type 1 flags 1 rg="service:test00"
dlm: rgmanager: recover 3 done: 0 ms
BUG: unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
*pde = 00000000
Oops: 0000 [#1]
last sysfs file: /kernel/dlm/rgmanager/control
Modules linked in: md5 sctp autofs4 hidp rfcomm l2cap bluetooth lock_dlm gfs2
dlm configfs sunrpc acpi_cpufreq video sbs i2c_ec button battery asus_acpi ac
ipv6 parport_pc lp parport floppy sg i2c_piix4 i2c_savage4 e100 i2c_algo_bit
i2c_core mii ohci_hcd ide_cd cdrom pcspkr dm_snapshot dm_zero dm_mirror dm_mod
ext3 jbd aic7xxx scsi_transport_spi sd_mod scsi_mod
CPU:    1
EIP:    0060:[<c04e50d9>]    Not tainted VLI
EFLAGS: 00010286   (2.6.17-1.2439.fc6 #1) 
EIP is at list_del+0x9/0x62
eax: 00000000   ebx: cf471770   ecx: cf471738   edx: cf471738
esi: d18db1a4   edi: cf471738   ebp: cf5a7d74   esp: cf5a7d70
ds: 007b   es: 007b   ss: 0068
Process clurgmgrd (pid: 2751, ti=cf5a7000 task=d4076000 task.ti=cf5a7000)
Stack: cf471738 cf5a7d80 e0b41efc cf471738 cf5a7d8c e0b41f5f d18db1a4 cf5a7da4 
       e0b42d0b cf471738 00000000 d18db1a4 cf471738 cf5a7dbc e0b4320b d18db1ac 
       cf5a7dd8 cf471738 cfbecc04 cf5a7dfc e0b43f88 d6b59694 db0797a4 db079bb0 
Call Trace:
 [<e0b41efc>] del_lkb+0x12/0x6a [dlm]
 [<e0b41f5f>] _remove_lock+0xb/0x67 [dlm]
 [<e0b42d0b>] do_unlock+0x7c/0x9d [dlm]
 [<e0b4320b>] unlock_lock+0x94/0xaf [dlm]
 [<e0b43f88>] dlm_clear_proc_locks+0x100/0x13c [dlm]
 [<e0b4a670>] device_close+0x4d/0x8a [dlm]
 [<c04726cf>] __fput+0xb3/0x18a
 [<c04729c0>] fput+0x17/0x19
 [<c046ff8e>] filp_close+0x51/0x5b
 [<c0425f6a>] put_files_struct+0x6d/0xa9
 [<c0426faa>] do_exit+0x255/0x78c
 [<c042755a>] sys_exit_group+0x0/0x11
 [<df2e80f0>] 0xdf2e80f0
Code: e8 7f 00 00 00 8d 4b 0c 8b 51 04 8d 46 0c e8 71 00 00 00 89 f8 e8 9d fe ff
ff 8d 65 f4 5b 5e 5f 5d c3 55 89 e5 53 89 c3 8b 40 04 <8b> 00 39 d8 74 17 50 53
68 66 75 63 c0 e8 f3 fd f3 ff 0f 0b 41 
EIP: [<c04e50d9>] list_del+0x9/0x62 SS:ESP 0068:cf5a7d70
 <1>Fixing recursive fault but reboot is needed!

^^ Dlm oops
Comment 3 Lon Hohberger 2006-09-28 14:05:19 EDT
I have not seen this in some time.
Comment 4 Nate Straz 2007-12-13 12:18:30 EST
Moving all RHCS ver 5 bugs to RHEL 5 so we can remove RHCS v5, which never existed.
