Bug 507860 - [5.4][REG] System panic occurs when a file is accessed by multiple processes simultaneously
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.4
Hardware: All
OS: Linux
Priority: urgent
Severity: high
Target Milestone: rc
Assigned To: Larry Woodman
QA Contact: Red Hat Kernel QE team
Blocks: 499522 508030
Reported: 2009-06-24 10:59 EDT by Flavio Leitner
Modified: 2014-03-17 04:03 EDT
CC: 19 users

Doc Type: Bug Fix
Last Closed: 2009-11-04 13:08:41 EST


Attachments:
- reproducer (1.31 KB, application/octet-stream), 2009-06-24 11:01 EDT, Flavio Leitner
- sosreport (627.17 KB, application/octet-stream), 2009-06-24 11:01 EDT, Flavio Leitner
- this is in replacement of ce25201608bd5af3a4a9653320094beaadef5f58 (3.32 KB, patch), 2009-07-08 10:47 EDT, Andrea Arcangeli

Description Flavio Leitner 2009-06-24 10:59:42 EDT
This bug was first identified here:
https://bugzilla.redhat.com/show_bug.cgi?id=506684#c26

When multiple processes access the same file on hugetlbfs
simultaneously, the system panics and the following message
is shown.

---
kernel BUG at mm/hugetlb.c:418!
pthread-read-an[15693]: bugcheck! 0 [1]
Modules linked in: ipv6 xfrm_nalgo crypto_api autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc
ipmi_watchdog panicforpcl2(U) fefpcl(U) mptctl sadump(U) ipmi_si ipmi_devintf ipmi_msghandler vfat
fat dm_mirror dm_multipath scsi_dh button parport_pc lp parport sg e100 tg3 mii dm_raid45 dm_message
dm_region_hash dm_log dm_mod dm_mem_cache usb_storage lpfc scsi_transport_fc shpchp mptspi mptscsih
mptbase scsi_transport_spi sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd

Pid: 15693, CPU 2, comm:      pthread-read-an
psr : 00001010085a6010 ifs : 8000000000000713 ip  : [<a000000100154e50>]    Tainted: G
(2.6.18-152.el5)
ip is at copy_hugetlb_page_range+0x370/0x4a0
unat: 0000000000000000 pfs : 0000000000000713 rsc : 0000000000000003
rnat: a000000100b11de8 bsps: 0000000000000004 pr  : 000000000055a559
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70433f
csd : 0000000000000000 ssd : 0000000000000000
b0  : a000000100154e50 b6  : a000000100011000 b7  : a0000001002eb380
f6  : 1003e00000000000000a0 f7  : 1003e20c49ba5e353f7cf
f8  : 1003e00000000000004e2 f9  : 1003e000000000fa00000
f10 : 1003e000000003b9aca00 f11 : 1003e431bde82d7b634db
r1  : a000000100c36430 r2  : a000000100a4ef08 r3  : a00000010097b4e0
r8  : 0000000000000023 r9  : a000000100a4ef38 r10 : a000000100a4ef38
r11 : 0000000000000000 r12 : e0000040be76fbf0 r13 : e0000040be768000
r14 : a000000100a4ef08 r15 : 0000000000000000 r16 : a00000010097b4e8
r17 : e00000409825fe18 r18 : 0000000000000000 r19 : 0000000000000000
r20 : a000000100879280 r21 : a000000100a36a98 r22 : a000000100a4ef10
r23 : a000000100a4ef10 r24 : a0000001007fd054 r25 : 0000000000000000
r26 : a0000001007fd05c r27 : a0000001007fd040 r28 : a0000001007fc008
r29 : 0000063ff9c00000 r30 : 0000000000000000 r31 : 0000000000000000

Call Trace:
[<a000000100013b40>] show_stack+0x40/0xa0
                               sp=e0000040be76f780 bsp=e0000040be7694f8
[<a000000100014470>] show_regs+0x870/0x8c0
                               sp=e0000040be76f950 bsp=e0000040be7694a0
[<a000000100037e20>] die+0x1c0/0x2c0
                               sp=e0000040be76f950 bsp=e0000040be769458
[<a000000100037f70>] die_if_kernel+0x50/0x80
                               sp=e0000040be76f970 bsp=e0000040be769428
[<a000000100669390>] ia64_bad_break+0x270/0x4a0
                               sp=e0000040be76f970 bsp=e0000040be769400
[<a00000010000bfe0>] __ia64_leave_kernel+0x0/0x280
                               sp=e0000040be76fa20 bsp=e0000040be769400
[<a000000100154e50>] copy_hugetlb_page_range+0x370/0x4a0
                               sp=e0000040be76fbf0 bsp=e0000040be769360
[<a0000001001388d0>] copy_page_range+0xd0/0x1760
                               sp=e0000040be76fbf0 bsp=e0000040be769280
[<a000000100079270>] copy_process+0x18d0/0x2920
                               sp=e0000040be76fc00 bsp=e0000040be7691c8
[<a00000010007a350>] do_fork+0x90/0x3c0
                               sp=e0000040be76fc00 bsp=e0000040be769178
[<a00000010000b6a0>] sys_clone2+0x60/0x80
                               sp=e0000040be76fc20 bsp=e0000040be769128
[<a00000010000bd70>] __ia64_trace_syscall+0xd0/0x110
                               sp=e0000040be76fe30 bsp=e0000040be769128
[<a000000000010620>] __start_ivt_text+0xffffffff00010620/0x400
                               sp=e0000040be770000 bsp=e0000040be769128
<0>Kernel panic - not syncing: Fatal exception
---

We have never seen this panic before, and we confirmed that it does not
occur on the errata kernel (2.6.18-128.1.14.el5) or on 5.3 GA.
We therefore believe this is a regression introduced somewhere between
5.3 GA and the 5.4 alpha.

Version-Release number of selected component:

Red Hat Enterprise Linux Version Number: 5.4
Release Number: Partner Alpha
Architecture: ia64
Kernel Version: 2.6.18-152.el5
Related Package Version:
Related Middleware / Application: None

Drivers or hardware or architecture dependency:
None

How reproducible:
Always.

Steps to Reproduce:
Execute our reproducer as follows (a minimal sketch of what it does appears after these steps).
1. Make hugetlbfs available
 # mkdir /huge
 # mount -t hugetlbfs hugetlbfs /huge
 # echo 4 > /proc/sys/vm/nr_hugepages
2. Extract the reproducer (I'll attach it)
 # tar zxvf pthread-read-and-fork-hugetlb.tar.gz
3. Compile the reproducer
 # cd pthread-read-and-fork-hugetlb
 # make
4. Execute the reproducer
 # ./pthread-read-and-fork /dev/sda1 &
 # ./pthread-read-and-fork /dev/sda2 &
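
For reference, here is a minimal sketch of what the attached reproducer does. This is a hypothetical reconstruction, not the attached code: the file name, thread count, and loop counts are illustrative, and the attached tarball is authoritative. The idea is to keep reader threads touching a hugetlbfs-backed mapping while the process forks repeatedly, so that copy_hugetlb_page_range() in fork() races with concurrent access to the same huge pages.

---
/* Hypothetical reconstruction of the reproducer (illustrative only).
 * Assumes hugetlbfs is mounted on /huge and nr_hugepages is set. */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAP_LEN (256UL << 20)   /* one 256M huge page (ia64) */

static volatile char *map;
static volatile int stop;

static void *reader(void *arg)
{
    volatile char sink;

    (void)arg;
    while (!stop)                         /* keep touching the mapping */
        for (size_t off = 0; off < MAP_LEN; off += 4096)
            sink = map[off];
    (void)sink;
    return NULL;
}

int main(void)
{
    pthread_t tid[4];
    int i, fd;

    fd = open("/huge/testfile", O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("open"); return 1; }

    map = mmap(NULL, MAP_LEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }
    map[0] = 1;                           /* fault the huge page in */

    for (i = 0; i < 4; i++)
        pthread_create(&tid[i], NULL, reader, NULL);

    /* fork repeatedly while the readers run; each fork walks the
     * hugetlb page tables via copy_hugetlb_page_range() */
    for (i = 0; i < 10000; i++) {
        pid_t pid = fork();
        if (pid == 0)
            _exit(0);
        waitpid(pid, NULL, 0);
    }

    stop = 1;
    for (i = 0; i < 4; i++)
        pthread_join(tid[i], NULL);
    return 0;
}
---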

Actual Results:
system panic occurs

Expected Results:
system panic does not occur

Summary of actions taken to resolve issue:

Location of diagnostic data:

Hardware configuration:

Model: PRIMEQUEST
CPU Info: Itanium 2 (1.6GHz) x 4
Memory Info: 4GB
Hardware Component Information: None
Configuration Info: None
Guest Configuration Info: None

Business Impact:
This problem is a regression and a critical bug that causes a
system panic. hugetlbfs is widely used by our software
products. Until this problem is fixed, we cannot ship RHEL 5.4.

Target Release: 5.4 GA
Errata Request:
Hotfix Request:

Additional Info:
I'll attach a sosreport and a reproducer.
Comment 1 Flavio Leitner 2009-06-24 11:01:11 EDT
Created attachment 349251 [details]
reproducer
Comment 2 Flavio Leitner 2009-06-24 11:01:48 EDT
Created attachment 349252 [details]
sosreport
Comment 3 RHEL Product and Program Management 2009-06-24 16:33:11 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 11 Don Zickus 2009-06-30 16:23:02 EDT
in kernel-2.6.18-156.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so. However, feel free
to provide a comment indicating that this fix has been verified.
Comment 13 Jon Thomas 2009-07-01 09:17:02 EDT
Hi,

FYI, there are some side effects of these patches being reported in IT311086/bz508919.

In general, it looks like we're getting soft lockups as a result of the additional locking. It sounds like the issue was resolved for i386 and x86_64, but the customer is claiming it's not fixed on ia64.
Comment 14 Larry Woodman 2009-07-01 09:59:53 EDT
We have not been able to reproduce any of these "soft lockups" internally.  Please attach the /var/log/messages file and/or console output when the soft-lockups occur.  

Thanks, Larry Woodman
Comment 15 Linda Wang 2009-07-01 11:22:00 EDT
*** Bug 508919 has been marked as a duplicate of this bug. ***
Comment 16 Issue Tracker 2009-07-01 12:02:12 EDT
Event posted on 07-02-2009 01:02am JST by mfuruta@redhat.com

Hi Larry-san,

Fujitsu asked me to pass along their comment below:

----
Hi Furuta-san,

Could you send this message to Larry Woodman?

Soft lockups do not occur with this problem. When this problem is reproduced,
the system panics without any soft lockups.

We are a little bit confused. Are you talking about IT311086/bz508919?

Regards,
Yasuaki Ishimatsu 
----

Thanks in advance.

Regards,
Masaki Furuta



This event sent from IssueTracker by mfuruta@redhat.com 
 issue 310582
Comment 17 Issue Tracker 2009-07-01 12:06:52 EDT
Event posted on 07-02-2009 01:06am JST by mfuruta@redhat.com

Hi Ishimatsu-san,

Thanks for the comment!
I've sent your message to Larry Woodman and flagged it on the BZ.
Please wait a while; I'll get back to you, or he'll answer you
directly.

----
Could you send this message to Larry Woodman?
----

Regards,
Masaki Furuta


This event sent from IssueTracker by mfuruta@redhat.com 
 issue 310582
Comment 18 Jon Thomas 2009-07-01 13:38:09 EDT
re IT311086:

Well, I guess I'm a bit confused, because in IT311086 the same customer is running the same reproducer on 2.6.18-128.1.15 and 2.6.18-128.1.14 and seeing soft lockups. Evidence was in the core file. They do not see soft lockups with 2.6.18-128. I tested 2.6.18-128.1.14 without the linux-2.6-mm-fork-vs-gup-race-fix.patch and didn't hit a lockup either.

The latest report in IT311086 mentioned that issue 311086 was resolved with 2.6.18-128.1.16 for i386 and x86_64, but not ia64. Perhaps that comment was meant for a different IT?
Comment 20 Larry Woodman 2009-07-01 15:05:07 EDT
Jon, someone has to try the latest 5.3.z kernel (2.6.18-128.1.16.el5) because it has the patch that should prevent the softlockups you are seeing.  The problem was that hugetlb_cow() called copy_huge_page() with the page_table_lock spinlock held and it reschedules!  I changed hugetlb_cow() so that it will not drop the page_table_lock spinlock and changed copy_huge_page() so that it will not reschedule until it's done copying a hugepage.

Larry Woodman
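
To illustrate the pattern being described, here is a minimal sketch based on the upstream 2.6.18-era mm/hugetlb.c. It is not the actual RHEL patch; it only shows why rescheduling under a held spinlock causes softlockups and what the fix changes.

---
/* Sketch only; based on upstream 2.6.18 mm/hugetlb.c, not the RHEL patch.
 *
 * Before: copy_huge_page() may reschedule between subpage copies.
 * If hugetlb_cow() calls it with mm->page_table_lock held, the
 * cond_resched() can schedule away while holding a spinlock,
 * producing the softlockups reported above. */
static void copy_huge_page(struct page *dst, struct page *src,
                           unsigned long addr)
{
        int i;

        for (i = 0; i < HPAGE_SIZE / PAGE_SIZE; i++) {
                cond_resched();  /* unsafe if the caller holds a spinlock */
                copy_user_highpage(dst + i, src + i,
                                   addr + i * PAGE_SIZE);
        }
}

/* After (per this comment): drop the cond_resched() so the whole huge
 * page is copied without rescheduling, and hugetlb_cow() keeps
 * mm->page_table_lock held across the copy. */
---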
Comment 26 Chris Ward 2009-07-03 14:46:23 EDT
~~ Attention - RHEL 5.4 Beta Released! ~~

RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner!

If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity.

Please do not flip the bug status to VERIFIED. Only post your verification results and, if available, update the Verified field with the appropriate value.

Questions can be posted to this bug or your customer or partner representative.
Comment 27 Andrea Arcangeli 2009-07-08 10:47:59 EDT
Created attachment 350947 [details]
this is in replacement of ce25201608bd5af3a4a9653320094beaadef5f58

Using this fix to close bug #507860, the bug in #510235 can't materialize.

This only applies to RHEL 5.4 minus ce25201608bd5af3a4a9653320094beaadef5f58.

It doesn't apply to 5.3, where the latency of 256M pages won't be easily fixed unless we also apply the other 5.4 patches and then bring this version of the patches from 5.4 back to 5.3.
Comment 28 Andrea Arcangeli 2009-07-08 10:49:37 EDT
5.3 should probably go with the cond_resched addition from #510235; if they want the real deal, they can use 5.4 and the new patch to close #507860.
Comment 31 Don Zickus 2009-07-22 11:54:03 EDT
This patch has been reverted from 5.4 (kernel-2.6.18-159.el5).  Moving back to ASSIGNED for further work during 5.5.
Comment 37 Larry Woodman 2009-11-04 13:08:41 EST
This problem is gone in RHEL5-U4.  We reverted the fork-gup patches that
introduced this problem in -159.

------------------------------------------------------------------------------
* Mon Jul 20 2009 Don Zickus <dzickus@redhat.com> [2.6.18-159.el5]
...
- Revert: [mm] fix swap race in fork-gup patch group (Larry Woodman ) [508919]
...
------------------------------------------------------------------------------
