Description of problem: If rgmanager enters select() on the DLM file descriptor and the cluster loses quorum before a DLM event arrives, nothing ever wakes up the select() call, so rgmanager can hang forever. What is needed is a way for loss of quorum to wake up rgmanager even while it is in select() waiting for a DLM event. A possible solution is to have the lock subsystem open a pipe that a waiter can use to wake up the thread in wait_for_dlm_event(). The pipe should only be used when quorum is lost.
Version: CVS/head
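The pipe-based wakeup proposed in the description can be sketched roughly as follows. This is only an illustration in Python, not rgmanager code: the class name WakeableWaiter, its methods, and the return strings are hypothetical, and an ordinary pipe stands in for the DLM file descriptor. The idea is that the read end of a private pipe is added to the select() set, so a quorum-loss handler can interrupt the wait by writing a single byte.

```python
import os
import select
import threading


class WakeableWaiter:
    """Hypothetical waiter: blocks in select() on an event fd, but can
    also be woken through a private pipe (the 'self-pipe' technique)."""

    def __init__(self, event_fd):
        self.event_fd = event_fd  # stand-in for the DLM file descriptor
        # Read end joins the select() set; write end is for the waker.
        self.wake_r, self.wake_w = os.pipe()

    def wake(self):
        """Called by whoever notices that quorum was lost."""
        os.write(self.wake_w, b"q")

    def wait(self, timeout=None):
        """Block until an event arrives, a wakeup is sent, or timeout."""
        readable, _, _ = select.select(
            [self.event_fd, self.wake_r], [], [], timeout)
        if self.wake_r in readable:
            os.read(self.wake_r, 1)  # drain the wakeup byte
            return "quorum_lost"
        if self.event_fd in readable:
            return "event"
        return "timeout"

    def close(self):
        os.close(self.wake_r)
        os.close(self.wake_w)


if __name__ == "__main__":
    fake_dlm_r, fake_dlm_w = os.pipe()  # assumption: pipe mimics the DLM fd
    waiter = WakeableWaiter(fake_dlm_r)
    # Simulate quorum loss shortly after the waiter blocks in select().
    timer = threading.Timer(0.1, waiter.wake)
    timer.start()
    print(waiter.wait(timeout=5))
    timer.join()
    waiter.close()
    os.close(fake_dlm_r)
    os.close(fake_dlm_w)
```

Without the second descriptor in the select() set, the wait() call in this sketch would block for the full timeout (or forever with no timeout), which mirrors the hang described above.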
Note: after killing the stuck clurgmgrd, the DLM oopsed. I have not been able to reproduce this problem since.
dlm: got new/restarted association 1 nodeid 2
dlm: rgmanager: recover 1
dlm: rgmanager: add member 2
dlm: rgmanager: add member 1
dlm: rgmanager: total members 2
dlm: rgmanager: dlm_recover_directory
dlm: rgmanager: dlm_recover_directory 0 entries
dlm: rgmanager: recover 1 done: 56 ms
dlm: rgmanager: recover 3
dlm: rgmanager: remove member 2
dlm: rgmanager: total members 1
dlm: rgmanager: dlm_recover_directory
dlm: rgmanager: dlm_recover_directory 0 entries
dlm: rgmanager: pre recover waiter lkid 10304 type 1 flags 1
dlm: rgmanager: dlm_purge_locks
dlm: rgmanager: dlm_recover_masters
dlm: rgmanager: dlm_recover_masters 1 resources
dlm: rgmanager: dlm_recover_locks
dlm: rgmanager: dlm_recover_locks 0 locks
dlm: rgmanager: dlm_recover_rsbs
dlm: rgmanager: dlm_recover_rsbs 1 rsbs
dlm: rgmanager: recover_waiters_post 10304 type 1 flags 1 rg="service:test00"
dlm: rgmanager: recover 3 done: 0 ms
BUG: unable to handle kernel NULL pointer dereference at virtual address 00000000
printing eip: c04e50d9
*pde = 00000000
Oops: 0000 [#1] SMP
last sysfs file: /kernel/dlm/rgmanager/control
Modules linked in: md5 sctp autofs4 hidp rfcomm l2cap bluetooth lock_dlm gfs2 dlm configfs sunrpc acpi_cpufreq video sbs i2c_ec button battery asus_acpi ac ipv6 parport_pc lp parport floppy sg i2c_piix4 i2c_savage4 e100 i2c_algo_bit i2c_core mii ohci_hcd ide_cd cdrom pcspkr dm_snapshot dm_zero dm_mirror dm_mod ext3 jbd aic7xxx scsi_transport_spi sd_mod scsi_mod
CPU: 1
EIP: 0060:[<c04e50d9>] Not tainted VLI
EFLAGS: 00010286 (2.6.17-1.2439.fc6 #1)
EIP is at list_del+0x9/0x62
eax: 00000000 ebx: cf471770 ecx: cf471738 edx: cf471738
esi: d18db1a4 edi: cf471738 ebp: cf5a7d74 esp: cf5a7d70
ds: 007b es: 007b ss: 0068
Process clurgmgrd (pid: 2751, ti=cf5a7000 task=d4076000 task.ti=cf5a7000)
Stack: cf471738 cf5a7d80 e0b41efc cf471738 cf5a7d8c e0b41f5f d18db1a4 cf5a7da4
       e0b42d0b cf471738 00000000 d18db1a4 cf471738 cf5a7dbc e0b4320b d18db1ac
       cf5a7dd8 cf471738 cfbecc04 cf5a7dfc e0b43f88 d6b59694 db0797a4 db079bb0
Call Trace:
 [<e0b41efc>] del_lkb+0x12/0x6a [dlm]
 [<e0b41f5f>] _remove_lock+0xb/0x67 [dlm]
 [<e0b42d0b>] do_unlock+0x7c/0x9d [dlm]
 [<e0b4320b>] unlock_lock+0x94/0xaf [dlm]
 [<e0b43f88>] dlm_clear_proc_locks+0x100/0x13c [dlm]
 [<e0b4a670>] device_close+0x4d/0x8a [dlm]
 [<c04726cf>] __fput+0xb3/0x18a
 [<c04729c0>] fput+0x17/0x19
 [<c046ff8e>] filp_close+0x51/0x5b
 [<c0425f6a>] put_files_struct+0x6d/0xa9
 [<c0426faa>] do_exit+0x255/0x78c
 [<c042755a>] sys_exit_group+0x0/0x11
 [<df2e80f0>] 0xdf2e80f0
Code: e8 7f 00 00 00 8d 4b 0c 8b 51 04 8d 46 0c e8 71 00 00 00 89 f8 e8 9d fe ff ff 8d 65 f4 5b 5e 5f 5d c3 55 89 e5 53 89 c3 8b 40 04 <8b> 00 39 d8 74 17 50 53 68 66 75 63 c0 e8 f3 fd f3 ff 0f 0b 41
EIP: [<c04e50d9>] list_del+0x9/0x62 SS:ESP 0068:cf5a7d70
<1>Fixing recursive fault but reboot is needed!
^^ Dlm oops
I have not seen this in some time.
Moving all RHCS ver 5 bugs to RHEL 5 so we can remove RHCS v5 which never existed.