Description of problem:
The system crashes after about two hours of operation, reading and writing millions of small files (avg. 10K). All reads and writes occur under a main root directory, which contains two subdirectories:
incoming/ : all files are untar'd to this directory.
outgoing/ : all the files from incoming are processed and written to this directory.
Each contains numbered subdirectories: incoming/ has 1/ 2/ 3/ ... 500K; outgoing/ has 1/ 2/ 3/ ... 4 million.
The cluster configuration consists of 4 nodes and 15 GFS2 mount points, which are presented through clvm/multipath. The program that processes the files is a Java application with a maximum thread count of 7. Before the system crashes, the one thing that stands out while watching top is that glock_workqueue processes start to accumulate (16); it also looks like a delete_workqueue was spawned for each glock_workqueue.
DATE: Wed Nov 18 17:50:15 2009
LOAD AVERAGE: 42.88, 24.12, 13.60
VERSION: #1 SMP Tue Aug 18 15:51:48 EDT 2009
MACHINE: x86_64 (2533 Mhz)
MEMORY: 71 GB
Pid: 18266, comm: glock_workqueue Tainted: G 2.6.18-164.el5 #1
RIP: 0010:[<ffffffff885846c7>] [<ffffffff885846c7>] :gfs2:gfs2_glock_put+0x38/0x135
RSP: 0018:ffff810970825e20 EFLAGS: 00010206
RAX: 0000000000100100 RBX: ffff810c9e80ae18 RCX: 0000000000000004
RDX: 0000000000200200 RSI: 0000000000000286 RDI: ffffffff885fbcb0
RBP: ffff810c9e80af08 R08: ffff810c7d227428 R09: 0000000000000011
R10: ffff81099f942d60 R11: ffffffff88593c3c R12: ffff81127d5fb340
R13: 0000000000000282 R14: ffff810c9e80ae18 R15: ffffffff88585996
FS: 0000000000000000(0000) GS:ffff8109800f5340(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000200200 CR3: 000000097d6c1000 CR4: 00000000000006e0
Process glock_workqueue (pid: 18266, threadinfo ffff810970824000, task ffff81127dad9080)
Stack: ffff810c7d227428 ffff810c9e80af00 ffff810c9e80af08 ffffffff8004d80a
ffff810970825e80 ffff81127d5fb340 ffffffff8004a052 ffff81127d5a3d98
0000000000000282 ffff81127d5a3d88 ffffffff8009f4a9 ffffffff8004a142
Code: 48 89 02 74 04 48 89 50 08 8b 43 3c 48 c7 03 00 01 10 00 48
RIP [<ffffffff885846c7>] :gfs2:gfs2_glock_put+0x38/0x135
Version-Release number of selected component (if applicable):
Steps to Reproduce:
Created attachment 372530 [details]
Originally added this to the wrong bz. Sorry about that.
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release. Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products. This request is not yet committed for inclusion in an Update release.
Created attachment 372531 [details]
RHEL5 version from Abhi
Created attachment 374004 [details]
Updated patch, based upon latest info
Here is an updated RHEL5 patch. It is based upon the information we gained from the latest backtrace. Here is the lowdown:
1. Since the previous tests were done with an existing GFS2 patch, I've kept that as a base, since it's one less variable to consider.
2. The test for trying to "hold" a glock with a non-zero ref count is still there; just in case this is not the correct fix, we should still gain further information about the issue.
3. I've also added what I believe to be the fix for the shrinker issue. It has two parts:
a) In gfs2_glock_put() we now use atomic_dec_and_lock() on the lru_lock, which means we can be certain that there will never be any glocks on the lru_list with a zero ref count. We essentially make the final decrement of the ref count and the taking of lru_lock (to remove the glock) a single atomic operation.
b) In the shrinker we remove the original check for two reasons:
i) If this isn't the right fix, we'll hit the bug much faster and thus shorten the testing cycle
ii) If it is the right fix, then this code is entirely redundant anyway and is thus removed to avoid future confusion.
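Part (a) can be sketched in userspace, with C11 atomics and a pthread mutex standing in for the kernel's atomic_t and the lru_lock spinlock (all names and types here are illustrative, not the actual GFS2 code):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

static pthread_mutex_t lru_lock = PTHREAD_MUTEX_INITIALIZER;

struct glock {
    atomic_int ref;   /* reference count */
    bool on_lru;      /* whether the glock sits on the lru_list */
};

/* Userspace imitation of the kernel's atomic_dec_and_lock():
 * decrement *ref; iff it reaches zero, return true with the lock held. */
static bool dec_and_lock(atomic_int *ref, pthread_mutex_t *lock)
{
    int old = atomic_load(ref);
    /* fast path: while the count is above 1, a plain decrement is safe
     * and the lock is never touched */
    while (old > 1) {
        if (atomic_compare_exchange_weak(ref, &old, old - 1))
            return false;
    }
    /* slow path: take the lock first, then decrement, so the 1 -> 0
     * transition and the list removal happen under the same lock */
    pthread_mutex_lock(lock);
    if (atomic_fetch_sub(ref, 1) == 1)
        return true;           /* last reference: caller unlinks and frees */
    pthread_mutex_unlock(lock);
    return false;
}

static void glock_put(struct glock *gl)
{
    if (dec_and_lock(&gl->ref, &lru_lock)) {
        gl->on_lru = false;    /* remove from the lru_list under lru_lock */
        pthread_mutex_unlock(&lru_lock);
        /* the glock would be freed here */
    }
}
```

Because the final decrement can only happen with lru_lock held, any code that walks the list under lru_lock (such as the shrinker) can never observe a glock with a zero ref count.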
I'm fairly confident that this will fix the issue, but having said that, we had not managed to hit the issue ourselves since the previously attempted fix was included. I'll try to track down the tests Ben was using to test his original patch, but that might be tricky since it's Thanksgiving and I'm sure he won't be around until next week.
I've done a very basic sanity test of this patch in upstream, but nothing more than that at the moment.
Bug #508806 was the previous attempt to fix this issue. Looks like the testing was done using postmark in that case.
With all the latest patches we are seeing these soft lockups at the customer's site:
Nov 27 14:46:22 cmil221 kernel: BUG: soft lockup - CPU#6 stuck for 10s! [cat:4840]
Nov 27 14:46:22 cmil221 kernel: CPU 6:
Nov 27 14:46:22 cmil221 kernel: Modules linked in: netconsole autofs4 ipmi_devintf ipmi_si ipmi_msghandler hidp rfcomm l2cap bluetooth dm_log_clustered(U) lock_dlm gfs2 dlm configfs sunrpc ipv6 xfrm_nalgo crypto_api cpufreq_ondemand acpi_cpufreq freq_table dm_round_robin dm_emc dm_multipath scsi_dh video hwmon backlight sbs i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp parport joydev sg shpchp i2c_i801 ide_cd cdrom cdc_ether i2c_core bnx2 usbnet pcspkr dm_snapshot dm_zero dm_mirror dm_log dm_mod qla2xxx scsi_transport_fc ata_piix libata megaraid_sas sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
Nov 27 14:46:22 cmil221 kernel: Pid: 4840, comm: cat Tainted: G 2.6.18-164.6.1.el5.bz539240.fix.1 #1
Nov 27 14:46:22 cmil221 kernel: RIP: 0010:[<ffffffff80064bcc>] [<ffffffff80064bcc>] .text.lock.spinlock+0x2/0x30
Nov 27 14:46:22 cmil221 kernel: RSP: 0018:ffff8103c3bf5e30 EFLAGS: 00000286
Nov 27 14:46:22 cmil221 kernel: RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000002
Nov 27 14:46:22 cmil221 kernel: RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff88572580
Nov 27 14:46:22 cmil221 kernel: RBP: ffff810d49a08000 R08: 0000000000001000 R09: ffff8109fd03c380
Nov 27 14:46:22 cmil221 kernel: R10: 0000000000000003 R11: ffffffff88548d84 R12: ffffffff80008d84
Nov 27 14:46:22 cmil221 kernel: R13: 00000009f5d26025 R14: ffffffff00000000 R15: ffff810981aed050
Nov 27 14:46:22 cmil221 kernel: FS: 00002b038d91d260(0000) GS:ffff81099faaa8c0(0000) knlGS:0000000000000000
Nov 27 14:46:22 cmil221 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Nov 27 14:46:22 cmil221 kernel: CR2: 0000000000402fa0 CR3: 0000000c0f1fa000 CR4: 00000000000006e0
Nov 27 14:46:22 cmil221 kernel:
Nov 27 14:46:22 cmil221 kernel: Call Trace:
Nov 27 14:46:22 cmil221 kernel: [<ffffffff88546e72>] :gfs2:gfs2_glock_schedule_for_reclaim+0x5d/0x9a
Nov 27 14:46:22 cmil221 kernel: [<ffffffff8854780f>] :gfs2:gfs2_glock_put+0x118/0x132
Nov 27 14:46:22 cmil221 kernel: [<ffffffff885478b1>] :gfs2:gfs2_glock_iter_next+0x66/0xda
Nov 27 14:46:22 cmil221 kernel: [<ffffffff88547966>] :gfs2:gfs2_glock_seq_start+0x1b/0x3c
Nov 27 14:46:22 cmil221 kernel: [<ffffffff8003f584>] seq_read+0xe5/0x28c
Nov 27 14:46:22 cmil221 kernel: [<ffffffff8000b695>] vfs_read+0xcb/0x171
Nov 27 14:46:22 cmil221 kernel: [<ffffffff80011b72>] sys_read+0x45/0x6e
Nov 27 14:46:22 cmil221 kernel: [<ffffffff8005d28d>] tracesys+0xd5/0xe0
This looks like contention on glock's lru_lock in gfs2_glock_schedule_for_reclaim().
Created attachment 374374 [details]
Patch to address spinlock contention
This patch avoids acquiring the glock lru_lock when it is not actually needed.
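This is essentially a check-before-lock pattern: peek at the glock's LRU state without the lock, and only take lru_lock when there is actually work to do, rechecking the state under the lock. A minimal userspace sketch, with a pthread mutex in place of the spinlock (names are illustrative, not the actual patch):

```c
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lru_lock = PTHREAD_MUTEX_INITIALIZER;

struct glock {
    bool on_lru;   /* set while the glock sits on the reclaim list */
};

/* Queue a glock for reclaim, taking lru_lock only when the glock is
 * not already queued.  The unlocked peek is purely an optimisation;
 * the flag is rechecked under the lock before being changed. */
static void schedule_for_reclaim(struct glock *gl)
{
    if (gl->on_lru)            /* cheap unlocked check: already queued */
        return;
    pthread_mutex_lock(&lru_lock);
    if (!gl->on_lru) {         /* recheck now that the lock is held */
        gl->on_lru = true;
        /* list_add_tail(&gl->lru, &lru_list) would go here */
    }
    pthread_mutex_unlock(&lru_lock);
}
```

On a workload like the soft lockup above, most calls would hit the early return and never contend on lru_lock at all.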
Created attachment 374527 [details]
Yet another patch to address spinlock contention
This patch adds a cond_resched_lock() to the gfs2 glock memory shrinker. If the lru_list is very large then this routine can hog the lru_lock spinlock.
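Schematically, the shrinker now offers to drop the lock at intervals while walking a long list. Below is a userspace sketch of that pattern; BATCH, the function names, and the always-yield behaviour are assumptions for illustration (the real cond_resched_lock() only drops the lock when a reschedule is actually pending):

```c
#include <pthread.h>
#include <sched.h>

static pthread_mutex_t lru_lock = PTHREAD_MUTEX_INITIALIZER;

/* Userspace stand-in for the kernel's cond_resched_lock(): release the
 * lock, yield the CPU so contending threads can grab it, then retake it.
 * Returns 1 to indicate the lock was dropped. */
static int cond_resched_lock(pthread_mutex_t *lock)
{
    pthread_mutex_unlock(lock);
    sched_yield();
    pthread_mutex_lock(lock);
    return 1;
}

#define BATCH 32   /* items processed per lock hold; illustrative value */

/* Sketch of the shrinker scan: dispose of nr_items entries from the
 * LRU list under lru_lock, but offer to release the lock every BATCH
 * items so a very long list cannot hog the spinlock.  Returns how
 * many times the lock was dropped. */
static int shrink_scan(int nr_items)
{
    int drops = 0;
    pthread_mutex_lock(&lru_lock);
    for (int i = 0; i < nr_items; i++) {
        /* ... one glock would be removed and disposed of here ... */
        if ((i + 1) % BATCH == 0)
            drops += cond_resched_lock(&lru_lock);
    }
    pthread_mutex_unlock(&lru_lock);
    return drops;
}
```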
Event posted on 30-11-2009 12:07am EST by lmcilroy
* We really do need to understand the workload the customer is running,
especially what is happening around the time of the high load. Can we ask
the customer for a detailed description of how it works? Also can we ask
them for strace output from the applications? We need to know if they are
regularly using readdir and flock operations in their applications since
these can increase the glock count.
* We should continue to try to reproduce the workload in-house. From our
current understanding of the problem we should only need to run find over
millions of files to prime the caches and we probably need to do this
until all memory is used up to activate the shrinker.
* We should get the .4 kernel to them and have it installed. This may or may not
fix the high load issue so we should be prepared to re-instate the
drop_caches script if need be. Other code changes to instrument the glock
cache, particularly the size of the lru_list, might prove useful.
* We should get the customer to add the 'profile=2' argument to the
kernel command line in grub.conf so we can run readprofile or use oprofile.
* We need to investigate which vm and gfs tunables will help avoid this
issue. Maybe vfs_cache_pressure could be reduced to limit the size of
gfs/glock slab caches. Maybe min_free_kbytes too?
* We need to continue to try to get sysrq-t, sysrq-m and /proc/slabinfo
data during the periods of high load, so configuring hangwatch to get all
this info should be a priority. Other data that would be useful is vmstat output.
I think that's about it.
This event sent from IssueTracker by lmcilroy
Created attachment 374715 [details]
screen shot of panic seen Nov 30 running .3 kernel
You can download this test kernel from http://people.redhat.com/dzickus/el5
Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so. However feel free
to provide a comment indicating that this fix has been verified.
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.
This fixes a race in which the reference count of a glock that had already hit zero could be incremented incorrectly. This could result in crashes when GFS2 was under high load, particularly in configurations with large amounts of RAM.
~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~
RHEL 5.5 Beta has been released! There should be a fix present in this
release that addresses your request. Please test and report back results
here, by March 3rd 2010 (2010-03-03) or sooner.
Upon successful verification of this request, post your results and update
the Verified field in Bugzilla with the appropriate value.
If you encounter any issues while testing, please describe them and set
this bug into NEED_INFO. If you encounter new defects or have additional
patch(es) to request for inclusion, please clone this bug per each request
and escalate through your support representative.
Confirmed that the -191.el5 kernel source includes the following patch:
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.