Bug 1496836

Summary: [RH 7.5 bug] Request for upstream commit 3664847d95e6 to be merged into RHEL 7.5/7.4
Product: Red Hat Enterprise Linux 7
Reporter: John Pittman <jpittman>
Component: kernel
Assignee: Nigel Croxon <ncroxon>
kernel sub component: Multiple Devices (MD)
QA Contact: guazhang <guazhang>
Status: CLOSED ERRATA
Docs Contact:
Severity: urgent
Priority: urgent
CC: alex.wang, anrussel, bhu, dhoward, guazhang, jan.iven, jpittman, jshortt, riehecky, sthiell, surkumar, thomas.oulevey, toracat, tumeya, xni
Version: 7.4
Keywords: TestOnly, ZStream
Target Milestone: rc
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: kernel-3.10.0-820.el7
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1535883
Environment:
Last Closed: 2018-04-10 22:14:43 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1442258, 1535883

Description John Pittman 2017-09-28 14:38:53 UTC
Description of problem:

Request for upstream commit 3664847d95e6 to be merged into RHEL 7.5/7.4. This patch was recently added to the latest -rc by Linus under commit 12fcf66e74b1.

md/raid5: fix a race condition in stripe batch
https://github.com/torvalds/linux/commit/3664847d95e60a9a943858b7800f8484669740fc

There have been reports of RHEL 7 systems crashing due to this bug in large environments when using Lustre.  Testing with the mentioned patch indicates that it fixes the issue.

Version-Release number of selected component (if applicable):

kernel-3.10.0-693.2.2.el7

Actual results:

System crash

Expected results:

No crash

Comment 3 Nigel Croxon 2017-10-06 16:34:47 UTC
A test kernel is available to verify the fix:

http://people.redhat.com/ncroxon/rhel7/.rhel75_Oct6/kernel-3.10.0-726.el7.test.x86_64.rpm 

Please test as soon as possible and give feedback on your results.

-Nigel

Comment 4 Stephane Thiell 2017-10-06 17:19:35 UTC
Hi Nigel,

While others might be able to test, we had to put our new hardware into production very recently, so although I was able to test last week, we are no longer in a position to test. :/
We have servers connected to multiple JBODs and using mdraid. I applied the patch to 7.4 myself and it's working fine now. We have been using the patch for more than a month now, maybe even 6 weeks. Without the patch, we had about 1 or 2 kernel panics per week (in 7.3 and 7.4). The tiny patch fixes a race condition that is well explained by the MD kernel maintainer. I hope you'll still consider it for RH 7.4/7.5. Sorry for not being more helpful right now, but I prefer to be honest with you. :)

Thanks for your help!

Stephane

Comment 6 Nigel Croxon 2017-10-06 18:51:07 UTC
I will be including commit ID 3664847d95e60a9a943858b7800f8484669740fc in RHEL 7.5.

Are you ok with closing this issue?

-Nigel

Comment 7 Stephane Thiell 2017-10-06 20:42:51 UTC
Great news, thanks Nigel!

Is there any plan to integrate this patch into a RHEL7.4 kernel update?

Thanks,

Stephane

Comment 8 Nigel Croxon 2017-10-09 13:29:58 UTC
Hello Stephane,

Yes, we will backport this to a RHEL 7.4 kernel update.

-Nigel

Comment 9 Stephane Thiell 2017-10-09 14:53:01 UTC
Thanks much. And yes, I'm ok with closing this issue now.

Best,

Stephane

Comment 18 Nigel Croxon 2018-01-09 20:03:54 UTC
I fixed this in RHEL 7.5 with the following commit:
commit 4f386c203da33195fa5ac0379aebec7ab1948e2e
Author: Nigel Croxon <ncroxon>
Date:   Thu Nov 2 19:14:22 2017 -0400

-Nigel

Comment 23 guazhang@redhat.com 2018-01-19 06:45:00 UTC
Hello

Did a code review and verified that the patches were applied correctly on kernel-3.10.0-820.el7, so I will move this to "VERIFIED".

Comment 24 errata-xmlrpc 2018-04-10 22:14:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:1062