Description of problem:
I was running the accordion test for many hours and then took down two of the nodes in the cluster (morph-04 and morph-05). This then caused morph-02 to Oops:

Aug 11 09:31:38 morph-02 kernel: dlm: gfs1: purge locks of departed nodes
Aug 11 09:31:38 morph-02 kernel: dlm: gfs1: purged 3142 locks
Aug 11 09:31:38 morph-02 kernel: dlm: gfs1: update remastered resources
dlm: gfs1: updated 22853 resources
dlm: gfs1: rebuild locks
Unable to handle kernel paging request at virtual address 20202020
printing eip:
c02cb344
*pde = 00000000
Oops: 0002 [#1]
Modules linked in: gnbd lock_gulm lock_nolock lock_dlm dlm cman gfs lock_harness ipv6 autofs4 sunrpc e1000 microcode dm_mod uhci_hcd ehci_hcd button battery asus_acpi ac ext3 jbd qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
CPU: 0
EIP: 0060:[<c02cb344>] Not tainted
EFLAGS: 00010202 (2.6.7)
EIP is at rwsem_down_write_failed+0x44/0x16a
eax: ffffffff ebx: e6695554 ecx: e6695558 edx: 20202020
esi: f74dfe90 edi: f77591b0 ebp: e6695554 esp: f74dfe88
ds: 007b es: 007b ss: 0068
Process dlm_recoverd (pid: 3212, threadinfo=f74de000 task=f77591b0)
Stack: 646c6975 e6695558 e6695558 00000246 f77591b0 00000002 ffffffe4 e66954f0
f74dff04 e6695554 f8a46eea 00000000 e66954f0 00000000 f74dff04 f7cd1e38
f8a4617b f74dff00 f7cd1e38 dff08000 f74dff2c f7cd1f08 f8a4626f f7cd1e38
Call Trace:
[<f8a46eea>] .text.lock.rebuild+0x5a/0xc0 [dlm]
[<f8a4617b>] fill_rcom_buffer+0x9b/0xe0 [dlm]
[<f8a4626f>] rebuild_rsbs_send+0xaf/0x1e0 [dlm]
[<f8a48dca>] ls_reconfig+0xca/0x230 [dlm]
[<f8a49be5>] do_ls_recovery+0x175/0x430 [dlm]
[<f8a49fc8>] dlm_recoverd+0x128/0x170 [dlm]
[<c0118850>] default_wake_function+0x0/0x10
[<c0105c12>] ret_from_fork+0x6/0x14
[<c0118850>] default_wake_function+0x0/0x10
[<f8a49ea0>] dlm_recoverd+0x0/0x170 [dlm]
[<c010429d>] kernel_thread_helper+0x5/0x18
Code: 89 32 89 54 24 0c 0f c1 03 48 66 85 c0 75 2d 8d b6 00 00 00

Aug 11 09:31:39 morph-02 kernel: dlm: gfs1: restbl_rsb_update_recv rsb not found 16243
Aug 11 09:31:40 morph-02 kernel: dlm: gfs1: updated 22853 resources
Aug 11 09:31:40 morph-02 kernel: dlm: gfs1: rebuild locks
Aug 11 09:31:40 morph-02 kernel: Unable to handle kernel paging request at virtual address 20202020
Aug 11 09:31:40 morph-02 kernel: printing eip:
Aug 11 09:31:40 morph-02 kernel: c02cb344
Aug 11 09:31:40 morph-02 kernel: *pde = 00000000
Aug 11 09:31:40 morph-02 kernel: Oops: 0002 [#1]
Aug 11 09:31:40 morph-02 kernel: Modules linked in: gnbd lock_gulm lock_nolock lock_dlm dlm cman gfs lock_harness ipv6 autofs4 sunrpc e1000 microcode dm_mod uhci_hcd ehci_hcd button battery asus_acpi ac ext3 jbd qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
Aug 11 09:31:40 morph-02 kernel: CPU: 0
Aug 11 09:31:40 morph-02 kernel: EIP: 0060:[<c02cb344>] Not tainted
Aug 11 09:31:40 morph-02 kernel: EFLAGS: 00010202 (2.6.7)
Aug 11 09:31:40 morph-02 kernel: EIP is at rwsem_down_write_failed+0x44/0x16a
Aug 11 09:31:40 morph-02 kernel: eax: ffffffff ebx: e6695554 ecx: e6695558 edx: 20202020
Aug 11 09:31:40 morph-02 kernel: esi: f74dfe90 edi: f77591b0 ebp: e6695554 esp: f74dfe88
Aug 11 09:31:40 morph-02 kernel: ds: 007b es: 007b ss: 0068
Aug 11 09:31:40 morph-02 kernel: Process dlm_recoverd (pid: 3212, threadinfo=f74de000 task=f77591b0)
Aug 11 09:31:40 morph-02 kernel: Stack: 646c6975 e6695558 e6695558 00000246 f77591b0 00000002 ffffffe4 e66954f0
Aug 11 09:31:40 morph-02 kernel: f74dff04 e6695554 f8a46eea 00000000 e66954f0 00000000 f74dff04 f7cd1e38
Aug 11 09:31:40 morph-02 kernel: f8a4617b f74dff00 f7cd1e38 dff08000 f74dff2c f7cd1f08 f8a4626f f7cd1e38
Aug 11 09:31:40 morph-02 kernel: Call Trace:
Aug 11 09:31:41 morph-02 kernel: [<f8a46eea>] .text.lock.rebuild+0x5a/0xc0 [dlm]
Aug 11 09:31:41 morph-02 kernel: [<f8a4617b>] fill_rcom_buffer+0x9b/0xe0 [dlm]
Aug 11 09:31:41 morph-02 kernel: [<f8a4626f>] rebuild_rsbs_send+0xaf/0x1e0 [dlm]
Aug 11 09:31:41 morph-02 kernel: [<f8a48dca>] ls_reconfig+0xca/0x230 [dlm]
Aug 11 09:31:41 morph-02 kernel: [<f8a49be5>] do_ls_recovery+0x175/0x430 [dlm]
Aug 11 09:31:41 morph-02 kernel: [<f8a49fc8>] dlm_recoverd+0x128/0x170 [dlm]
Aug 11 09:31:41 morph-02 kernel: [<c0118850>] default_wake_function+0x0/0x10
Aug 11 09:31:41 morph-02 kernel: [<c0105c12>] ret_from_fork+0x6/0x14
Aug 11 09:31:41 morph-02 kernel: [<c0118850>] default_wake_function+0x0/0x10
Aug 11 09:31:41 morph-02 kernel: [<f8a49ea0>] dlm_recoverd+0x0/0x170 [dlm]
Aug 11 09:31:41 morph-02 kernel: [<c010429d>] kernel_thread_helper+0x5/0x18
Aug 11 09:31:41 morph-02 kernel:
Aug 11 09:31:41 morph-02 kernel: Code: 89 32 89 54 24 0c 0f c1 03 48 66 85 c0 75 2d 8d b6 00 00 00

How reproducible:
Didn't try
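One thing that can be checked mechanically from the dump above is whether the faulting address and the first stack word look like text rather than pointers. The sketch below is only an illustration: the helper name as_ascii is made up for this note, and it assumes the words are read as little-endian, which holds for the i386 kernel shown in the oops.

```python
import struct


def as_ascii(word_hex: str) -> str:
    """Render a 32-bit hex word from a little-endian dump as 4 characters.

    Non-printable bytes are shown as '.'. The helper name is hypothetical;
    it is not part of any kernel or dlm tooling.
    """
    raw = struct.pack("<I", int(word_hex, 16))
    return "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in raw)


# Values copied from the oops above.
fault_address = "20202020"     # edx, also the faulting virtual address
first_stack_word = "646c6975"  # first word on the "Stack:" line

print(repr(as_ascii(fault_address)))     # '    '
print(repr(as_ascii(first_stack_word)))  # 'uild'
```

Both words decode to printable ASCII (four spaces, and "uild"), which may hint that string data was written over the memory involved. That is only a guess from the dump; nothing in this report confirms it.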
I used the following args:

accordion -p 10 -L fcntl -s 1024000 -e 4097 -t -m 100 acc1 acc2 acc3 acc4

I only let it run for a couple of hours on 8 nodes before killing 2 and didn't have any problem.

The initial problem may be somewhat removed from the actual oops, based on the "restbl_rsb_update_recv" error message. We can add some better error checking and reporting around that to help the next time we're able to cause this. It's probably not something that will appear every time, although that would be nice.

I'm curious about the rather large number of locks ("updated 22853 resources") in your test. Are there different accordion args than I used that might explain that?
Dave, I might also have been running genesis to bump up the lock count. Both tests were running on each of 6 nodes, to each of either 3 or 5 filesystems per node, to 10 files, overnight (so about 15 hours) before I took down the 2 other nodes, which caused the oops.

Here are the command lines I used:

./accordion -L flock -s 2097152 -e 1024 -t -m 100000 -S 54321 accd1 accd2 accd3 accd4 accd5 accd6 accd7 accd8 accd9 accd10

./genesis -S 12345 -n 7500 -s 1048576 -L flock -d 700
Updates with the proper version and component name.
This hasn't been seen in 5 months (with a lot of recovery testing).