Description of problem: System crashes after two hours of operations, of reading and writing millions of small files (avg. 10K). All reads and writes occur from a main root directory which contain two subdirectories: incoming/ : all files are untar'd to this directory. outgoing/ : all the files from incoming are processed and written to this directory. Work/ | |---- ----| incoming/ outgoing/ | | 1/ 2/ 3/.../500K 1/ 2/ 3/.../4million This cluster configuration consist of 4 nodes and 15 gfs2 mount points, which are presented through clvm / multipath. The program that processes the files on the system is java which has a maximum thread count of 7. Before the system crashes the one thing that stands out while watching top, where glock_workqueue processes start to accumulate (16)...which also looks like delete_workqueue was spawned for each glock_workqueue. KERNEL: ./usr/lib/debug/lib/modules/2.6.18-164.el5/vmlinux DUMPFILE: cmil_224_vmcore-incomplete CPUS: 16 DATE: Wed Nov 18 17:50:15 2009 UPTIME: 02:43:43 LOAD AVERAGE: 42.88, 24.12, 13.60 TASKS: 711 NODENAME: cmil224 RELEASE: 2.6.18-164.el5 VERSION: #1 SMP Tue Aug 18 15:51:48 EDT 2009 MACHINE: x86_64 (2533 Mhz) MEMORY: 71 GB Pid: 18266, comm: glock_workqueue Tainted: G 2.6.18-164.el5 #1 RIP: 0010:[<ffffffff885846c7>] [<ffffffff885846c7>] :gfs2:gfs2_glock_put+0x38/0x135 RSP: 0018:ffff810970825e20 EFLAGS: 00010206 RAX: 0000000000100100 RBX: ffff810c9e80ae18 RCX: 0000000000000004 RDX: 0000000000200200 RSI: 0000000000000286 RDI: ffffffff885fbcb0 RBP: ffff810c9e80af08 R08: ffff810c7d227428 R09: 0000000000000011 R10: ffff81099f942d60 R11: ffffffff88593c3c R12: ffff81127d5fb340 R13: 0000000000000282 R14: ffff810c9e80ae18 R15: ffffffff88585996 FS: 0000000000000000(0000) GS:ffff8109800f5340(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 0000000000200200 CR3: 000000097d6c1000 CR4: 00000000000006e0 Process glock_workqueue (pid: 18266, threadinfo ffff810970824000, task ffff81127dad9080) Stack: ffff810c7d227428 ffff810c9e80af00 ffff810c9e80af08 ffffffff8004d80a ffff810970825e80 ffff81127d5fb340 ffffffff8004a052 ffff81127d5a3d98 0000000000000282 ffff81127d5a3d88 ffffffff8009f4a9 ffffffff8004a142 Call Trace: [<ffffffff8004d80a>] run_workqueue+0x94/0xe4 [<ffffffff8004a052>] worker_thread+0x0/0x122 [<ffffffff8009f4a9>] keventd_create_kthread+0x0/0xc4 [<ffffffff8004a142>] worker_thread+0xf0/0x122 [<ffffffff8008be71>] default_wake_function+0x0/0xe [<ffffffff8009f4a9>] keventd_create_kthread+0x0/0xc4 [<ffffffff8003298b>] kthread+0xfe/0x132 [<ffffffff8005dfb1>] child_rip+0xa/0x11 [<ffffffff8009f4a9>] keventd_create_kthread+0x0/0xc4 [<ffffffff8003288d>] kthread+0x0/0x132 [<ffffffff8005dfa7>] child_rip+0x0/0x11 Code: 48 89 02 74 04 48 89 50 08 8b 43 3c 48 c7 03 00 01 10 00 48 RIP [<ffffffff885846c7>] :gfs2:gfs2_glock_put+0x38/0x135 RSP <ffff810970825e20> Version-Release number of selected component (if applicable): How reproducible: Steps to Reproduce: 1. 2. 3. Actual results: Expected results: Additional info:
Created attachment 372530 [details] Proposed patch Originally added this to the wrong bz. Sorry about that.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Created attachment 372531 [details] RHEL5 version from Abhi
Created attachment 374004 [details] Updated patch, based upon latest info Here is an updated RHEL5 patch. It is based upon the information we gained from the latest backtrace. Here is the low down: 1. Since the previous tests were done with an existing GFS2 patch, I've kept that as a base since its one less variable to consider 2. The test for trying to "hold" a glock with non-zero ref count is still there just in case this is not the correct fix, we should still gain further information about the issue 3. I've also added what I believe to be the fix for the shrinker issue. It has two parts: a) In gfs_glock_put() we now use atomic_dec_and_lock on the lru_lock since that means that we can be certain that there will never be any glocks on the lru_list which have a zero ref count. We basically make the last decrement of the ref count and the locking of lru_list to remove the glock an atomic operation. b) In the shrinker we remove the original check for two reasons: i) If this isn't the right fix, we'll hit the bug much faster and thus shorten the testing cycle ii) If it is the right fix, then this code is entirely redundant anyway and is thus removed to avoid future confusion. I'm fairly confident that this will fix the issue, but having said that we'd not managed to hit the issue ourselves since the previously attempted fix was included. I'll try and track down the tests which Ben was using to test his original patch, but that might be tricky since its thanksgiving and I'm sure he isn't going to be around until next week now. I've done a very basic sanity test of this patch in upstream, but nothing more than that at the moment.
Bug #508806 was the previous attempt to fix this issue. Looks like the testing was done using postmark in that case.
With all the latest patches we are seeing these soft lockups at the customer's site: Nov 27 14:46:22 cmil221 kernel: BUG: soft lockup - CPU#6 stuck for 10s! [cat:4840] Nov 27 14:46:22 cmil221 kernel: CPU 6: Nov 27 14:46:22 cmil221 kernel: Modules linked in: netconsole autofs4 ipmi_devintf ipmi_si ipmi_msghandler hidp rfcomm l2cap bluetooth dm_log_clustered(U) lock_dlm gfs2 dlm configfs su nrpc ipv6 xfrm_nalgo crypto_api cpufreq_ondemand acpi_cpufreq freq_table dm_round_robin dm_emc dm_multipath scsi_dh video hwmon backlight sbs i2c_ec button battery asus_acpi acpi_memho tplug ac parport_pc lp parport joydev sg shpchp i2c_i801 ide_cd cdrom cdc_ether i2c_core bnx2 usbnet pcspkr dm_snapshot dm_zero dm_mirror dm_log dm_mod qla2xxx scsi_transport_fc ata_pi ix libata megaraid_sas sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd Nov 27 14:46:22 cmil221 kernel: Pid: 4840, comm: cat Tainted: G 2.6.18-164.6.1.el5.bz539240.fix.1 #1 Nov 27 14:46:22 cmil221 kernel: RIP: 0010:[<ffffffff80064bcc>] [<ffffffff80064bcc>] .text.lock.spinlock+0x2/0x30 Nov 27 14:46:22 cmil221 kernel: RSP: 0018:ffff8103c3bf5e30 EFLAGS: 00000286 Nov 27 14:46:22 cmil221 kernel: RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000002 Nov 27 14:46:22 cmil221 kernel: RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff88572580 Nov 27 14:46:22 cmil221 kernel: RBP: ffff810d49a08000 R08: 0000000000001000 R09: ffff8109fd03c380 Nov 27 14:46:22 cmil221 kernel: R10: 0000000000000003 R11: ffffffff88548d84 R12: ffffffff80008d84 Nov 27 14:46:22 cmil221 kernel: R13: 00000009f5d26025 R14: ffffffff00000000 R15: ffff810981aed050 Nov 27 14:46:22 cmil221 kernel: FS: 00002b038d91d260(0000) GS:ffff81099faaa8c0(0000) knlGS:0000000000000000 Nov 27 14:46:22 cmil221 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b Nov 27 14:46:22 cmil221 kernel: CR2: 0000000000402fa0 CR3: 0000000c0f1fa000 CR4: 00000000000006e0 Nov 27 14:46:22 cmil221 kernel: Nov 27 14:46:22 cmil221 kernel: Call Trace: Nov 27 14:46:22 cmil221 kernel: [<ffffffff88546e72>] :gfs2:gfs2_glock_schedule_for_reclaim+0x5d/0x9a Nov 27 14:46:22 cmil221 kernel: [<ffffffff8854780f>] :gfs2:gfs2_glock_put+0x118/0x132 Nov 27 14:46:22 cmil221 kernel: [<ffffffff885478b1>] :gfs2:gfs2_glock_iter_next+0x66/0xda Nov 27 14:46:22 cmil221 kernel: [<ffffffff88547966>] :gfs2:gfs2_glock_seq_start+0x1b/0x3c Nov 27 14:46:22 cmil221 kernel: [<ffffffff8003f584>] seq_read+0xe5/0x28c Nov 27 14:46:22 cmil221 kernel: [<ffffffff8000b695>] vfs_read+0xcb/0x171 Nov 27 14:46:22 cmil221 kernel: [<ffffffff80011b72>] sys_read+0x45/0x6e Nov 27 14:46:22 cmil221 kernel: [<ffffffff8005d28d>] tracesys+0xd5/0xe0 This looks like contention on glock's lru_lock in gfs2_glock_schedule_for_reclaim().
Created attachment 374374 [details] Patch to address spinlock contention This patch eliminates the need to acquire glock's lru_lock if it's not needed.
Created attachment 374527 [details] Yet another patch to address spinlock contention This patch adds a cond_resched_lock() to the gfs2 glock memory shrinker. If the lru_list is very large then this routine can hog the lru_lock spinlock.
Event posted on 30-11-2009 12:07am EST by lmcilroy Action plan: * We really do need to understand the workload the customer is running, especially what is happening around the time of the high load. Can we ask the customer for a detailed description of how it works? Also can we ask them for strace output from the applications? We need to know if they are regularly using readdir and flock operations in their applications since these can increase the glock count. * We should continue to try to reproduce the workload in-house. From our current understanding of the problem we should only need to run find over millions of files to prime the caches and we probably need to do this until all memory is used up to activate the shrinker. * We should get the .4 kernel to them and installed. This may or may not fix the high load issue so we should be prepared to re-instate the drop_caches script if need be. Other code changes to instrument the glock cache, particularly the size of the lru_list, might prove useful. * We should get the customer to add the 'profile=2' argument to the kernel command line in grub.conf so we can run readprofile or use oprofile instead. * We need to investigate which vm and gfs tunables will help avoid this issue. Maybe vfs_cache_pressure could be reduced to limit the size of gfs/glock slab caches. Maybe min_free_kbytes too? * We need to continue to try and get sysrq-t, sysrq-m and /proc/slabinfo data during the periods of high load so configuring hangwatch to get all this info should be a priority. Other data that would be useful is vmstat output. I think that's about it. This event sent from IssueTracker by lmcilroy issue 368454
Created attachment 374715 [details] screen shot of panic seen Nov 30 running .3 kernel
in kernel-2.6.18-177.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5 Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However feel free to provide a comment indicating that this fix has been verified.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: This fixes a race which could result in a reference count on a glock which had hit zero being incremented incorrectly. This could result in crashes when GFS2 was under high loads and particularly in configurations with large amounts of RAM.
~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~ RHEL 5.5 Beta has been released! There should be a fix present in this release that addresses your request. Please test and report back results here, by March 3rd 2010 (2010-03-03) or sooner. Upon successful verification of this request, post your results and update the Verified field in Bugzilla with the appropriate value. If you encounter any issues while testing, please describe them and set this bug into NEED_INFO. If you encounter new defects or have additional patch(es) to request for inclusion, please clone this bug per each request and escalate through your support representative.
confirmed -191.el5 kernel src included following patch : linux-2.6-fs-gfs2-fix-glock-ref-count-issues.patch
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2010-0178.html