Escalated to Bugzilla from IssueTracker
The jbd slab cache creation/deletion is racy. If multiple jbd-based filesystems are mounted concurrently, and no other jbd-based filesystem is already mounted, then we can race creating the slab caches, since jbd_slab[] is not locked. This is not commonly observed because /root is typically mounted early with a jbd-based filesystem, making the race impossible. On our diskless systems /root does not use jbd, but we do have attached storage which does, and which is mounted in parallel. Basically, our setup is similar to what may be found in a NAS-style appliance.

The upstream patch that appears to fix it is:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=c089d490dfbf53bc0893dc9ef57cf3ee6448314d

Another, RHEL5-specific approach to fixing the problem is attached.

This event sent from IssueTracker by dejohnso [Support Engineering Group] issue 286416
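For illustration, here is a minimal sketch of the unsynchronized check-then-create pattern being described, modeled on the 2.6.18-era fs/jbd/journal.c (the exact RHEL5 source may differ in detail):

    /*
     * Sketch of the racy path (modeled on 2.6.18 fs/jbd/journal.c;
     * the exact RHEL5 code may differ).  Two concurrent mounts can
     * both observe jbd_slab[i] == NULL and both call
     * kmem_cache_create() for the same index.
     */
    static struct kmem_cache *jbd_slab[JBD_MAX_SLABS];

    static int journal_create_jbd_slab(size_t slab_size)
    {
            int i = JBD_SLAB_INDEX(slab_size);

            BUG_ON(i >= JBD_MAX_SLABS);

            if (jbd_slab[i])
                    return 0;       /* already created -- but this check is unlocked */

            /* race window: a second mount can reach here for the same i */
            jbd_slab[i] = kmem_cache_create(jbd_slab_names[i],
                                            slab_size, slab_size, 0, NULL, NULL);
            if (!jbd_slab[i]) {
                    printk(KERN_EMERG "JBD: no memory for jbd_slab cache\n");
                    return -ENOMEM;
            }
            return 0;
    }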
Issue Registered (Severity: 3) This event sent from IssueTracker by dejohnso [Support Engineering Group] issue 286416
Date: Wed, 15 Apr 2009 16:37:13 -0700
From: Ben Woodard <bwoodard>
To: Brian Behlendorf <behlendorf1>
CC: "Mark A. Grondona" <mgrondona>
Subject: Re: JBD Race

Brian Behlendorf wrote:
> Can you open an issue with Red Hat for the following RHEL jbd issue? I've
> proposed a minimal-impact fix in the following Lustre bugzilla, and Andreas was
> nice enough to provide a reference to where this was fixed properly upstream.
> Whichever way Red Hat wants to fix it is fine with me. This is pretty much a
> no-brainer, but it would be very nice to have it fixed in the next errata.
>
> https://bugzilla.lustre.org/show_bug.cgi?id=19184

Done. I sent that up and bounced it off of one of our FS engineers, who happens to be my friend. I don't know what the preferred approach is going to be. Since it is a one-time cost, the extra locks wouldn't be much of a penalty. The upstream fix is obviously a different way to fix the problem; I kind of like fixes that get rid of the code with the bug in it. I don't know if they will go for the small targeted fix with no future, or if they will want the bigger, potentially more disruptive upstream fix. However, since that path gets exercised on every boot of practically every machine, it is fairly easy to qualify and test.

One thing that I wonder about is whether the same problem exists in the jbd2 used by ext4.

-ben

woodard assigned to issue for LLNL (HPC).
Status set to: Waiting on Tech
This event sent from IssueTracker by dejohnso [Support Engineering Group] issue 286416
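For reference, the upstream commit linked above takes the "remove the buggy code" route Ben describes: it drops the per-size jbd slab caches entirely in favor of page-order allocations, roughly along these lines (paraphrased sketch, not the literal patch text):

    /*
     * Rough shape of the upstream replacement (paraphrased, not the
     * literal patch).  Journal buffer sizes are powers of two (the
     * filesystem block size), so plain page allocations fit, and with
     * no jbd_slab[] array there is nothing left to race on.
     */
    static inline void *jbd_alloc(size_t size, gfp_t flags)
    {
            return (void *)__get_free_pages(flags, get_order(size));
    }

    static inline void jbd_free(void *ptr, size_t size)
    {
            free_pages((unsigned long)ptr, get_order(size));
    }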
Date problem reported: 4-15-09

Problem: jbd slab creation/deletion is racy. The customer has put together a patch to fix this based on upstream, and would like us to consider adding the fix to a future kernel errata.

SEG, can you guys do a sanity check on the attached patch from LLNL? This patch is a RHEL5-specific approach to fixing some racy behavior in jbd slab cache creation/deletion. The upstream approach to fixing the problem, as mentioned in the summary, is here:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=c089d490dfbf53bc0893dc9ef57cf3ee6448314d

If everything looks good, can we get this into a BZ for the kernel guys? Thanks.

Issue escalated to Support Engineering Group by: kbaxley.
Internal Status set to 'Waiting on SEG'
Summary edited.
This event sent from IssueTracker by dejohnso [Support Engineering Group] issue 286416
Hi, I have looked at this patch and do not see any issues as far as it working as designed; however, development obviously needs to look into this, as we are changing where the journaling memory is being allocated. I will elevate this to BZ. Debbie

This event sent from IssueTracker by dejohnso [Support Engineering Group] issue 286416
Reassigning to kernel component. Thanks & regards, Phil
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Any response to my question in c#8?
checking again.
Turns out we have a reproducer. However, it isn't with RHEL's kernel; it involves mounting Lustre FSs. The race is therefore real and not theoretical. As for the locking being gratuitous: it does fix the problem, so somewhere your analysis is flawed. I can probably get you a backtrace from a non-RHEL kernel if you would like.
https://bugzilla.lustre.org/show_bug.cgi?id=19184
On the other hand, he does say that they really only saw the bug appear during creation. So the locking may be more heavy-handed than needed, but he said "it seemed right".
Yes, the locking during creation/deletion is obviously needed. I'm talking about the locking around actually making the allocations, which is what seems gratuitous, since that path is already properly protected. So if it's a matter of "seemed right", then we'd rather not take that part, just in case of a performance regression. If he had some other reason for protecting the allocations and frees, we'd like to hear what it was, because it doesn't seem necessary; just the locking around the actual slab creation/deletion is what is needed (see the sketch below).
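To make the distinction concrete, here is a minimal sketch of the locking scope being argued for, reusing the jbd_slab_lock rw_semaphore from the attached patch. Only the create path (and, symmetrically, the destroy path) takes the lock; alloc/free stay lock-free because a slab is never destroyed while a mounted journal can still allocate from it:

    static DECLARE_RWSEM(jbd_slab_lock);    /* name from the attached patch */

    static int journal_create_jbd_slab(size_t slab_size)
    {
            int i = JBD_SLAB_INDEX(slab_size);
            int ret = 0;

            BUG_ON(i >= JBD_MAX_SLABS);

            down_write(&jbd_slab_lock);
            if (!jbd_slab[i]) {
                    /* re-check under the lock: only one creator can win */
                    jbd_slab[i] = kmem_cache_create(jbd_slab_names[i],
                                                    slab_size, slab_size,
                                                    0, NULL, NULL);
                    if (!jbd_slab[i]) {
                            printk(KERN_EMERG "JBD: no memory for jbd_slab cache\n");
                            ret = -ENOMEM;
                    }
            }
            up_write(&jbd_slab_lock);
            return ret;
    }

(With only writers left, a plain mutex would serve equally well; the rwsem shape simply matches the attached patch.)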
*** Bug 542832 has been marked as a duplicate of this bug. ***
This fix is in kernel-2.6.18-179.el5. You can download this test kernel from http://people.redhat.com/dzickus/el5

Please update the appropriate value in the Verified field (cf_verified) to indicate this fix has been successfully verified. Include a comment with verification details.
~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~

RHEL 5.5 Beta has been released! There should be a fix present in this release that addresses your request. Please test and report back results here by March 3rd, 2010 (2010-03-03) or sooner.

Upon successful verification of this request, post your results and update the Verified field in Bugzilla with the appropriate value.

If you encounter any issues while testing, please describe them and set this bug into NEED_INFO. If you encounter new defects or have additional patches to request for inclusion, please clone this bug for each request and escalate through your support representative.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html
Just comparing the patch that landed in -164.15.1 with the patch in the Lustre bugzilla, I notice that the chunks in jbd_slab_alloc() and jbd_slab_free() in the Red Hat patch are missing. Was that intentional?

 void * jbd_slab_alloc(size_t size, gfp_t flags)
 {
+	void *ptr;
 	int idx;

+	down_read(&jbd_slab_lock);
 	idx = JBD_SLAB_INDEX(size);
 	BUG_ON(jbd_slab[idx] == NULL);
-	return kmem_cache_alloc(jbd_slab[idx], flags | __GFP_NOFAIL);
+	ptr = kmem_cache_alloc(jbd_slab[idx], flags | __GFP_NOFAIL);
+	up_read(&jbd_slab_lock);
+
+	return ptr;
 }

 void jbd_slab_free(void *ptr, size_t size)
 {
 	int idx;

+	down_read(&jbd_slab_lock);
 	idx = JBD_SLAB_INDEX(size);
 	BUG_ON(jbd_slab[idx] == NULL);
 	kmem_cache_free(jbd_slab[idx], ptr);
+	up_read(&jbd_slab_lock);
 }

 /*

Thanks,
Bernd
Yes, that locking was too much; we didn't want to adversely affect performance, so it was removed. It doesn't make it any less safe, as we only want to serialize the creation/deletion of the slabs.
Thanks for your help Josef!