Bug 539240 - glock_workqueue -- glock ref count via gfs2_glock_hold
Summary: glock_workqueue -- glock ref count via gfs2_glock_hold
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.4
Hardware: All
OS: Linux
Target Milestone: rc
Assignee: Steve Whitehouse
QA Contact: Cluster QE
Depends On:
Blocks: 526947 5.5TechNotes-Updates 544978 545221 546634
Reported: 2009-11-19 17:58 UTC by Toure Dunnon
Modified: 2018-11-14 13:44 UTC (History)
22 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
This fixes a race which could result in a reference count on a glock which had hit zero being incremented incorrectly. This could result in crashes when GFS2 was under high loads and particularly in configurations with large amounts of RAM.
Clone Of:
: 545221 546634 (view as bug list)
Last Closed: 2010-03-30 06:58:36 UTC
Target Upstream Version:

Attachments (Terms of Use)
Proposed patch (1.61 KB, patch)
2009-11-20 16:07 UTC, Steve Whitehouse
no flags Details | Diff
RHEL5 version from Abhi (1.58 KB, patch)
2009-11-20 16:13 UTC, Steve Whitehouse
no flags Details | Diff
Updated patch, based upon latest info (2.80 KB, patch)
2009-11-26 13:54 UTC, Steve Whitehouse
no flags Details | Diff
Patch to address spinlock contention (609 bytes, patch)
2009-11-28 08:22 UTC, Lachlan McIlroy
no flags Details | Diff
Yet another patch to address spinlock contention (470 bytes, patch)
2009-11-29 08:55 UTC, Lachlan McIlroy
no flags Details | Diff
screen shot of panic seen Nov 30 running .3 kernel (42.78 KB, image/png)
2009-11-30 10:03 UTC, Norm Murray
no flags Details

System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2010:0178 normal SHIPPED_LIVE Important: Red Hat Enterprise Linux 5.5 kernel security and bug fix update 2010-03-29 12:18:21 UTC

Description Toure Dunnon 2009-11-19 17:58:42 UTC
Description of problem:
System crashes after two hours of operation, reading and writing millions of small files (avg. 10K). All reads and writes occur from a main root directory which contains two subdirectories:

incoming/ : all files are untar'd to this directory.
outgoing/ : all the files from incoming are processed and written to this directory.

                 root/
                /     \
        incoming/     outgoing/
            |             |
    1/ 2/ 3/.../500K   1/ 2/ 3/.../4million

This cluster configuration consists of 4 nodes and 15 gfs2 mount points, which are presented through clvm / multipath. The program that processes the files on the system is a Java application with a maximum thread count of 7. Before the system crashes, the one thing that stands out while watching top is that glock_workqueue processes start to accumulate (16); it also looks as though a delete_workqueue was spawned for each glock_workqueue.

KERNEL: ./usr/lib/debug/lib/modules/2.6.18-164.el5/vmlinux
    DUMPFILE: cmil_224_vmcore-incomplete
        CPUS: 16
        DATE: Wed Nov 18 17:50:15 2009
      UPTIME: 02:43:43
LOAD AVERAGE: 42.88, 24.12, 13.60
       TASKS: 711
    NODENAME: cmil224
     RELEASE: 2.6.18-164.el5
     VERSION: #1 SMP Tue Aug 18 15:51:48 EDT 2009
     MACHINE: x86_64  (2533 Mhz)
      MEMORY: 71 GB

Pid: 18266, comm: glock_workqueue Tainted: G      2.6.18-164.el5 #1
RIP: 0010:[<ffffffff885846c7>]  [<ffffffff885846c7>] :gfs2:gfs2_glock_put+0x38/0x135
RSP: 0018:ffff810970825e20  EFLAGS: 00010206
RAX: 0000000000100100 RBX: ffff810c9e80ae18 RCX: 0000000000000004
RDX: 0000000000200200 RSI: 0000000000000286 RDI: ffffffff885fbcb0
RBP: ffff810c9e80af08 R08: ffff810c7d227428 R09: 0000000000000011
R10: ffff81099f942d60 R11: ffffffff88593c3c R12: ffff81127d5fb340
R13: 0000000000000282 R14: ffff810c9e80ae18 R15: ffffffff88585996
FS:  0000000000000000(0000) GS:ffff8109800f5340(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000200200 CR3: 000000097d6c1000 CR4: 00000000000006e0
Process glock_workqueue (pid: 18266, threadinfo ffff810970824000, task ffff81127dad9080)
Stack:  ffff810c7d227428 ffff810c9e80af00 ffff810c9e80af08 ffffffff8004d80a
 ffff810970825e80 ffff81127d5fb340 ffffffff8004a052 ffff81127d5a3d98
 0000000000000282 ffff81127d5a3d88 ffffffff8009f4a9 ffffffff8004a142
Call Trace:
 [<ffffffff8004d80a>] run_workqueue+0x94/0xe4
 [<ffffffff8004a052>] worker_thread+0x0/0x122
 [<ffffffff8009f4a9>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8004a142>] worker_thread+0xf0/0x122
 [<ffffffff8008be71>] default_wake_function+0x0/0xe
 [<ffffffff8009f4a9>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8003298b>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff8009f4a9>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8003288d>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

Code: 48 89 02 74 04 48 89 50 08 8b 43 3c 48 c7 03 00 01 10 00 48
RIP  [<ffffffff885846c7>] :gfs2:gfs2_glock_put+0x38/0x135
 RSP <ffff810970825e20>

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
Actual results:

Expected results:

Additional info:

Comment 1 Steve Whitehouse 2009-11-20 16:07:05 UTC
Created attachment 372530 [details]
Proposed patch

Originally added this to the wrong bz. Sorry about that.

Comment 3 RHEL Program Management 2009-11-20 16:11:13 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update release.

Comment 4 Steve Whitehouse 2009-11-20 16:13:57 UTC
Created attachment 372531 [details]
RHEL5 version from Abhi

Comment 19 Steve Whitehouse 2009-11-26 13:54:58 UTC
Created attachment 374004 [details]
Updated patch, based upon latest info

Here is an updated RHEL5 patch. It is based upon the information we gained from the latest backtrace. Here is the lowdown:

1. Since the previous tests were done with an existing GFS2 patch, I've kept that as a base, since it's one less variable to consider.
2. The test for trying to "hold" a glock with a non-zero ref count is still there: just in case this is not the correct fix, we should still gain further information about the issue.
3. I've also added what I believe to be the fix for the shrinker issue. It has two parts:
 a) In gfs2_glock_put() we now use atomic_dec_and_lock() on the lru_lock, since that means we can be certain that there will never be any glocks on the lru_list which have a zero ref count. We basically make the last decrement of the ref count and the locking of the lru_list to remove the glock one atomic operation.
 b) In the shrinker we remove the original check, for two reasons:
    i) If this isn't the right fix, we'll hit the bug much faster and thus shorten the testing cycle.
    ii) If it is the right fix, then this code is entirely redundant anyway, and is thus removed to avoid future confusion.

I'm fairly confident that this will fix the issue, but having said that, we'd not managed to hit the issue ourselves since the previously attempted fix was included. I'll try and track down the tests which Ben was using to test his original patch, but that might be tricky since it's Thanksgiving and I'm sure he isn't going to be around until next week now.

I've done a very basic sanity test of this patch in upstream, but nothing more than that at the moment.
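The atomic_dec_and_lock() scheme in point 3a can be modeled in userspace. This is only an illustrative sketch, not the kernel code: the dec_and_lock() helper, the pthread mutex standing in for lru_lock, and the counter handling are hypothetical stand-ins, but the logic mirrors the kernel primitive (avoid the lock unless the count may hit zero, and return with the lock held when the final decrement happens):

```c
#include <stdatomic.h>
#include <pthread.h>

/* Userspace model of atomic_dec_and_lock(): decrement the ref count,
 * and iff it drops to zero, return 1 with *lock held so the caller can
 * remove the object from the list before anyone else can see it. */
static int dec_and_lock(atomic_int *cnt, pthread_mutex_t *lock)
{
    int old = atomic_load(cnt);

    /* fast path: if the count is clearly above 1, decrement locklessly */
    while (old > 1) {
        if (atomic_compare_exchange_weak(cnt, &old, old - 1))
            return 0;              /* not the last reference */
    }

    /* slow path: take the lock, then perform the final decrement under it */
    pthread_mutex_lock(lock);
    if (atomic_fetch_sub(cnt, 1) == 1)
        return 1;                  /* last ref: caller holds the lock */
    pthread_mutex_unlock(lock);    /* raced with another holder */
    return 0;
}
```

In this model the shrinker can never observe an object on the list whose ref count has already reached zero, because removal from the list and the final decrement are serialized under the same lock.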

Comment 21 Steve Whitehouse 2009-11-26 15:03:53 UTC
Bug #508806 was the previous attempt to fix this issue. Looks like the testing was done using postmark in that case.

Comment 30 Lachlan McIlroy 2009-11-28 08:10:38 UTC
With all the latest patches we are seeing these soft lockups at the customer's site:

Nov 27 14:46:22 cmil221 kernel: BUG: soft lockup - CPU#6 stuck for 10s! [cat:4840]
Nov 27 14:46:22 cmil221 kernel: CPU 6:
Nov 27 14:46:22 cmil221 kernel: Modules linked in: netconsole autofs4 ipmi_devintf ipmi_si ipmi_msghandler hidp rfcomm l2cap bluetooth dm_log_clustered(U) lock_dlm gfs2 dlm configfs sunrpc ipv6 xfrm_nalgo crypto_api cpufreq_ondemand acpi_cpufreq freq_table dm_round_robin dm_emc dm_multipath scsi_dh video hwmon backlight sbs i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp parport joydev sg shpchp i2c_i801 ide_cd cdrom cdc_ether i2c_core bnx2 usbnet pcspkr dm_snapshot dm_zero dm_mirror dm_log dm_mod qla2xxx scsi_transport_fc ata_piix libata megaraid_sas sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
Nov 27 14:46:22 cmil221 kernel: Pid: 4840, comm: cat Tainted: G      2.6.18-164.6.1.el5.bz539240.fix.1 #1
Nov 27 14:46:22 cmil221 kernel: RIP: 0010:[<ffffffff80064bcc>]  [<ffffffff80064bcc>] .text.lock.spinlock+0x2/0x30
Nov 27 14:46:22 cmil221 kernel: RSP: 0018:ffff8103c3bf5e30  EFLAGS: 00000286
Nov 27 14:46:22 cmil221 kernel: RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000002
Nov 27 14:46:22 cmil221 kernel: RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff88572580
Nov 27 14:46:22 cmil221 kernel: RBP: ffff810d49a08000 R08: 0000000000001000 R09: ffff8109fd03c380
Nov 27 14:46:22 cmil221 kernel: R10: 0000000000000003 R11: ffffffff88548d84 R12: ffffffff80008d84
Nov 27 14:46:22 cmil221 kernel: R13: 00000009f5d26025 R14: ffffffff00000000 R15: ffff810981aed050
Nov 27 14:46:22 cmil221 kernel: FS:  00002b038d91d260(0000) GS:ffff81099faaa8c0(0000) knlGS:0000000000000000
Nov 27 14:46:22 cmil221 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Nov 27 14:46:22 cmil221 kernel: CR2: 0000000000402fa0 CR3: 0000000c0f1fa000 CR4: 00000000000006e0
Nov 27 14:46:22 cmil221 kernel: 
Nov 27 14:46:22 cmil221 kernel: Call Trace:
Nov 27 14:46:22 cmil221 kernel:  [<ffffffff88546e72>] :gfs2:gfs2_glock_schedule_for_reclaim+0x5d/0x9a
Nov 27 14:46:22 cmil221 kernel:  [<ffffffff8854780f>] :gfs2:gfs2_glock_put+0x118/0x132
Nov 27 14:46:22 cmil221 kernel:  [<ffffffff885478b1>] :gfs2:gfs2_glock_iter_next+0x66/0xda
Nov 27 14:46:22 cmil221 kernel:  [<ffffffff88547966>] :gfs2:gfs2_glock_seq_start+0x1b/0x3c
Nov 27 14:46:22 cmil221 kernel:  [<ffffffff8003f584>] seq_read+0xe5/0x28c
Nov 27 14:46:22 cmil221 kernel:  [<ffffffff8000b695>] vfs_read+0xcb/0x171
Nov 27 14:46:22 cmil221 kernel:  [<ffffffff80011b72>] sys_read+0x45/0x6e
Nov 27 14:46:22 cmil221 kernel:  [<ffffffff8005d28d>] tracesys+0xd5/0xe0

This looks like contention on glock's lru_lock in gfs2_glock_schedule_for_reclaim().

Comment 31 Lachlan McIlroy 2009-11-28 08:22:02 UTC
Created attachment 374374 [details]
Patch to address spinlock contention

This patch avoids acquiring the glock lru_lock when it is not needed.
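The idea can be sketched as a double-checked queueing test. This is a userspace model with hypothetical names (on_lru, schedule_for_reclaim, lru_len stand in for the real GFS2 fields), assuming the actual patch works along these lines: test an "already queued" flag without the lock, and only take the lock for glocks that genuinely need to be added.

```c
#include <stdatomic.h>
#include <pthread.h>

struct glock_model {
    atomic_int on_lru;              /* 1 if already on the reclaim list */
};

static pthread_mutex_t lru_lock = PTHREAD_MUTEX_INITIALIZER;
static int lru_len;                 /* length of the modeled lru_list */

/* returns 1 if this call actually queued the glock */
static int schedule_for_reclaim(struct glock_model *gl)
{
    /* unlocked fast path: skip the contended lock if already queued */
    if (atomic_load(&gl->on_lru))
        return 0;

    pthread_mutex_lock(&lru_lock);
    int queued = 0;
    /* recheck under the lock so two racing callers can't double-queue */
    if (!atomic_exchange(&gl->on_lru, 1)) {
        lru_len++;                  /* model of list_add(&gl->lru, ...) */
        queued = 1;
    }
    pthread_mutex_unlock(&lru_lock);
    return queued;
}
```

Since most puts hit glocks that are already on the list, the fast path turns the common case into a single atomic load instead of a lock round trip, which is what relieves the contention seen in the soft-lockup trace.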

Comment 33 Lachlan McIlroy 2009-11-29 08:55:03 UTC
Created attachment 374527 [details]
Yet another patch to address spinlock contention

This patch adds a cond_resched_lock() to the gfs2 glock memory shrinker.  If the lru_list is very large, this routine can hog the lru_lock spinlock.
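The effect of cond_resched_lock() can be modeled in userspace as periodically dropping and retaking the lock during a long list walk, so other CPUs waiting on it can make progress. A sketch with hypothetical names (shrink_lru, batch); the real kernel helper additionally checks whether a reschedule is actually pending before dropping the lock:

```c
#include <pthread.h>
#include <sched.h>

static pthread_mutex_t lru_lock = PTHREAD_MUTEX_INITIALIZER;

/* Walk n lru entries, yielding the lock every `batch` items.
 * Returns how many times the lock was dropped and retaken. */
static int shrink_lru(int n, int batch)
{
    int drops = 0;

    pthread_mutex_lock(&lru_lock);
    for (int i = 1; i <= n; i++) {
        /* ... dispose of one lru entry here ... */
        if (i % batch == 0) {
            /* analogue of cond_resched_lock(&lru_lock): let waiters in */
            pthread_mutex_unlock(&lru_lock);
            sched_yield();
            pthread_mutex_lock(&lru_lock);
            drops++;
        }
    }
    pthread_mutex_unlock(&lru_lock);
    return drops;
}
```

Without the periodic drop, a shrinker walking a multi-million-entry lru_list would hold the spinlock for the entire pass, which matches the 10-second soft-lockup reports above.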

Comment 34 Issue Tracker 2009-11-29 13:07:17 UTC
Event posted on 30-11-2009 12:07am EST by lmcilroy

Action plan:

* We really do need to understand the workload the customer is running,
especially what is happening around the time of the high load.  Can we ask
the customer for a detailed description of how it works?  Also can we ask
them for strace output from the applications?  We need to know if they are
regularly using readdir and flock operations in their applications since
these can increase the glock count.

* We should continue to try to reproduce the workload in-house.  From our
current understanding of the problem we should only need to run find over
millions of files to prime the caches and we probably need to do this
until all memory is used up to activate the shrinker.

* We should get the .4 kernel shipped to them and installed.  This may or
may not fix the high-load issue, so we should be prepared to re-instate the
drop_caches script if need be.  Other code changes to instrument the glock
cache, particularly the size of the lru_list, might prove useful.

* We should get the customer to add the 'profile=2' argument to the
kernel command line in grub.conf so we can run readprofile or use oprofile.

* We need to investigate which vm and gfs tunables will help avoid this
issue.  Maybe vfs_cache_pressure could be reduced to limit the size of
gfs/glock slab caches.  Maybe min_free_kbytes too?

* We need to continue trying to get sysrq-t, sysrq-m and /proc/slabinfo
data during the periods of high load, so configuring hangwatch to get all
this info should be a priority.  Other data that would be useful is vmstat
output.

I think that's about it.

This event sent from IssueTracker by lmcilroy 
 issue 368454

Comment 37 Norm Murray 2009-11-30 10:03:09 UTC
Created attachment 374715 [details]
screen shot of panic seen Nov 30 running .3 kernel

Comment 42 Don Zickus 2009-12-04 19:00:18 UTC
in kernel-2.6.18-177.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 45 Steve Whitehouse 2010-01-14 12:22:59 UTC
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
This fixes a race which could result in a reference count on a glock which had hit zero being incremented incorrectly. This could result in crashes when GFS2 was under high loads and particularly in configurations with large amounts of RAM.

Comment 46 Chris Ward 2010-02-11 10:22:33 UTC
~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~

RHEL 5.5 Beta has been released! There should be a fix present in this 
release that addresses your request. Please test and report back results 
here, by March 3rd 2010 (2010-03-03) or sooner.

Upon successful verification of this request, post your results and update 
the Verified field in Bugzilla with the appropriate value.

If you encounter any issues while testing, please describe them and set 
this bug into NEED_INFO. If you encounter new defects or have additional 
patch(es) to request for inclusion, please clone this bug per each request
and escalate through your support representative.

Comment 48 Alex He 2010-03-09 10:10:31 UTC
Confirmed that the -191.el5 kernel source includes the following patch:

Comment 50 errata-xmlrpc 2010-03-30 06:58:36 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

