Bug 359671

Summary: RHEL4: Hald causes system deadlock on ia64
Product: Red Hat Enterprise Linux 4
Component: kernel
Version: 4.5
Hardware: ia64
OS: Linux
Status: CLOSED ERRATA
Severity: low
Priority: low
Reporter: Norm Murray <nmurray>
Assignee: Prarit Bhargava <prarit>
QA Contact: Martin Jenner <mjenner>
CC: cward, jbaron, tao
Keywords: OtherQA
Target Milestone: ---
Target Release: ---
Fixed In Version: RHSA-2008-0665
Doc Type: Bug Fix
Last Closed: 2008-07-24 19:19:04 UTC
Bug Blocks: 430698
Attachments: RHEL4 patch for this issue (flags: none)

Description Wade Mealing 2007-10-31 05:41:27 UTC
Description of problem:

Hald attempts to invalidate a block device (invalidate_bdev), which in turn
calls lock_kernel. If other CPUs are attempting to handle IPI events at the
same time, the kernel deadlocks.

Version-Release number of selected component (if applicable):

2.6.9-42.EL #1 SMP Wed Jul 12 23:25:09 EDT 2006 ia64 ia64 ia64 GNU/Linux

How reproducible:

Not reproducible.

Steps to Reproduce:
1. Boot kernel
2. Time passes
3. Eject CD.
  
Actual results:

The system deadlocks (I think) and becomes unresponsive; diskdump captures a core.

Expected results:

No deadlock.

Additional info:

*********************back trace for hald**************************
PID: 10303  TASK: e00000012bbc8000  CPU: 0   COMMAND: "hald"
#0 [BSP:e00000012bbc95b8] start_disk_dump at a000000200059170
#1 [BSP:e00000012bbc9598] try_crashdump at a0000001000b02d0
#2 [BSP:e00000012bbc9418] ia64_init_handler at a000000100050390
 EFRAME: e0000000047e5f30
     B0: a00000010005bfd0      CR_IIP: a00000010005c020
CR_IPSR: 0000101008126010      CR_IFS: 8000000000000691
 AR_PFS: 0000000000000000      AR_RSC: 0000000000000003
AR_UNAT: 0000000000000000     AR_RNAT: 0000000000000000
 AR_CCV: e00000040bbf6488     AR_FPSR: 0009804c8a70033f
 LOADRS: 0000000000000000 AR_BSPSTORE: 0000000000000000
     B6: a000000100015f80          B7: a00000010000ee90
     PR: 0000000005aa9655          R1: a0000001009bb720
     R2: 0000000000037000          R3: 0000000000037000
     R8: 0000000000003700          R9: 00000000000000fe
    R10: 0000000000000000         R11: 0000000000000000
    R12: e00000012bbcfb00         R13: e00000012bbc8000
    R14: c0000000fee37000         R15: c0000000fee00000
    R16: 00000000000000fe         R17: 000000000000001f
    R18: 0000000000000000         R19: a0000001007ce9ac
    R20: 0000000000000001         R21: 0000000000000000
    R22: e00000012bbc8dd4         R23: e000000001080000
    R24: 0000000000000000         R25: 0000000000000000
    R26: 0000000000000001         R27: 0000000000000000
    R28: a00000010066fc28         R29: a000000100015f80
    R30: 000000000000000e         R31: a0000001006fc880
     F6: 1003e0000000000000000     F7: 1003e0000000000004000
     F8: 1003e00000000a02f2882     F9: 1003e00000753c4092dc4
    F10: 1003eb40f8974e2cbdd88    F11: 1003e0000000000000495
#3 [BSP:e00000012bbc9388] smp_call_function at a00000010005c020
#4 [BSP:e00000012bbc9360] invalidate_bdev at a000000100128b30
#5 [BSP:e00000012bbc9330] __invalidate_device at a000000100167780
#6 [BSP:e00000012bbc9300] check_disk_change at a000000100139940
#7 [BSP:e00000012bbc9290] cdrom_open at a0000001003f0f00
#8 [BSP:e00000012bbc9248] idecd_open at a0000001003e7860
#9 [BSP:e00000012bbc91d0] do_open at a00000010013a640
#10 [BSP:e00000012bbc9198] blkdev_open at a00000010013ae00
#11 [BSP:e00000012bbc9128] __dentry_open at a0000001001220a0
#12 [BSP:e00000012bbc90e8] filp_open at a0000001001223b0
#13 [BSP:e00000012bbc9068] sys_open at a0000001001229b0
#14 [BSP:e00000012bbc9068] ia64_ret_from_syscall at a00000010000f4a0
 EFRAME: e00000012bbcfe40
     B0: 4000000000037870      CR_IIP: a000000000010640
CR_IPSR: 00001213081a6018      CR_IFS: 0000000000000000
 AR_PFS: c000000000000006      AR_RSC: 000000000000000f
AR_UNAT: 0000000000000000     AR_RNAT: 0000000000000000
 AR_CCV: 0000000000000000     AR_FPSR: 0009804c8a70033f
 LOADRS: 0000000002580000 AR_BSPSTORE: 60000fff7fffc120
     B6: 20000000005e1740          B7: 0000000000000000
     PR: 0000000005aaaa61          R1: 2000000000664238
     R2: c00000000000030d          R3: 0000000000000001
     R8: 8000000000000000          R9: 0000000046e66943
    R10: 0000000000000000         R11: c000000000000612
    R12: 60000fffffffb7f0         R13: 20000000002334a0
    R14: 600000000000ad50         R15: 0000000000000404
    R16: 000000000000006e         R17: 00000000e0001000
    R18: 0000000000000000         R19: 0000000000000000
    R20: 0009804c8a70033f         R21: 40000000000378c0
    R22: 20000000005ef260         R23: 60000fff7fffc120
    R24: 0000000000000000         R25: 0000000000000000
    R26: c000000000000006         R27: 000000000000000f
    R28: a000000000010640         R29: 00001213081a6018
    R30: 0000000000000000         R31: 0000000005a9aa61
     F6: 000000000000000000000     F7: 000000000000000000000
     F8: 000000000000000000000     F9: 000000000000000000000
    F10: 000000000000000000000    F11: 000000000000000000000
#15 [BSP:e00000012bbc9068] __kernel_syscall_via_break at a000000000010640


*************************back trace for cpu 8************************
PID: 23343  TASK: e000000318738000  CPU: 8   COMMAND: "date"
#0 [BSP:e000000318739078] freeze_cpu at a0000002000586c0
#1 [BSP:e000000318739000] handle_IPI at a00000010005b340
#2 [BSP:e000000318738fb8] handle_IRQ_event at a0000001000130f0
#3 [BSP:e000000318738f50] do_IRQ at a000000100013ae0
#4 [BSP:e000000318738f08] ia64_handle_irq at a000000100015d10
#5 [BSP:e000000318738f08] ia64_leave_kernel at a00000010000f600
 EFRAME: e00000031873fc20
     B0: a0000001000881d0      CR_IIP: a000000100008cb0
CR_IPSR: 0000101008126030      CR_IFS: 8000000000000000
 AR_PFS: 0000000000000081      AR_RSC: 0000000000000003
AR_UNAT: 0000000000000000     AR_RNAT: 0000000000000010
 AR_CCV: 0000000000000000     AR_FPSR: 0009804c0270033f
 LOADRS: 0000000000000000 AR_BSPSTORE: 000000002986ac30
     B6: a000000100593de0          B7: a0000001000128d0
     PR: 0000000095555a99          R1: a0000001009bb720
     R2: 0000000000000612          R3: 60000fffffffba41
     R8: 0000000000000000          R9: 0000000000000001
    R10: 0000000000000000         R11: 00000000955506d9
    R12: e00000031873fde0         R13: e000000318738000
    R14: 0000000000000000         R15: 60000fffffffb9f0
    R16: 0000000000000050         R17: e00000031873fe40
    R18: e00000031873fe41         R19: 60000fffffffba70
    R20: 00000000ffffffff         R21: e000000318738028
    R22: e000000318738e10         R23: 0000000000000000
    R24: 0000000000000000         R25: 0000000000000000
    R26: c00000000000038f         R27: 0000000000000000
    R28: 0000000000000000         R29: 0000000000000002
    R30: 0000000000000001         R31: a0000001006fd600
     F6: 0ffff8000000000000000     F7: 1003e0000000000000020
     F8: 1003e0000000000000037     F9: 100048000000000000000
    F10: 1003e0000000000000037    F11: 10004dfdffffff2020000
#6 [BSP:e000000318738f08] ia64_spinlock_contention at a000000100008cb0
#7 [BSP:e000000318738f00] _spin_lock at a000000100593de0
#8 [BSP:e000000318738ea0] sys_sysctl at a0000001000881d0
#9 [BSP:e000000318738ea0] ia64_ret_from_syscall at a00000010000f4a0
 EFRAME: e00000031873fe40
     B0: 20000000003074a0      CR_IIP: a000000000010640
CR_IPSR: 0000121308126038      CR_IFS: 0000000000000000
 AR_PFS: c00000000000038f      AR_RSC: 000000000000000f
AR_UNAT: 0000000000000000     AR_RNAT: 0000000000000000
 AR_CCV: 0000000000000000     AR_FPSR: 0009804c8a70033f
 LOADRS: 0000000000900000 AR_BSPSTORE: 60000fff7fffc078
     B6: 2000000000230980          B7: 0000000000000000
     PR: 000000009555a261          R1: 2000000000294238
     R2: c000000000000410          R3: 60000fffffffba30
     R8: 8000000000000000          R9: 200000000032f1b8
    R10: 0000000000000001         R11: 0000000000000000
    R12: 60000fffffffb9f0         R13: 2000000000334e00
    R14: 2000000000294238         R15: 000000000000047e
    R16: 200000000003cc78         R17: 20000000002308b0
    R18: 200000000003cc30         R19: 0000001000000815
    R20: 2000000000305d00         R21: 200000000003cc30
    R22: 200000000003cc30         R23: 20000000002f72d8
    R24: 2000000000043560         R25: 20000000000435a8
    R26: 20000000002f72c0         R27: 20000000002f72d0
    R28: 20000000000432e8         R29: 20000000000432f0
    R30: 000000000019c8b0         R31: 20000000000999c8
     F6: 1003e00000000000006e0     F7: 1003e0000000000000020
     F8: 1003e0000000000000037     F9: 100048000000000000000
    F10: 1003e0000000000000037    F11: 10004dfdffffff2020000
#10 [BSP:e000000318738ea0] __kernel_syscall_via_break at a000000000010640
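
For anyone comparing the two traces above, the call chains are easier to read with the register dumps stripped out. The following is a minimal sketch in Python that keeps only the numbered frame lines; the regular expression and the name parse_frames are illustrative assumptions based on the frame layout shown in this report, not part of the crash utility.

```python
import re

# Matches frame lines such as:
#   #3 [BSP:e00000012bbc9388] smp_call_function at a00000010005c020
# The pattern is an assumption derived from the traces in this report.
FRAME_RE = re.compile(
    r"^\s*#(\d+)\s+\[BSP:([0-9a-f]+)\]\s+(\S+)\s+at\s+([0-9a-f]+)\s*$"
)


def parse_frames(text):
    """Return (frame_number, function_name, address) for each frame line.

    Register dump lines (EFRAME, B0, R1, ...) do not match and are skipped.
    """
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.match(line)
        if m:
            frames.append((int(m.group(1)), m.group(3), int(m.group(4), 16)))
    return frames


# A few lines copied from the hald trace above.
SAMPLE = """\
#2 [BSP:e00000012bbc9418] ia64_init_handler at a000000100050390
 EFRAME: e0000000047e5f30
     B0: a00000010005bfd0      CR_IIP: a00000010005c020
#3 [BSP:e00000012bbc9388] smp_call_function at a00000010005c020
#4 [BSP:e00000012bbc9360] invalidate_bdev at a000000100128b30
"""

if __name__ == "__main__":
    for number, name, address in parse_frames(SAMPLE):
        print(f"#{number:<2} {name} ({address:#x})")
```

Run against the full hald trace, this leaves the chain from start_disk_dump down to __kernel_syscall_via_break without the EFRAME blocks.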

Possible fix at
http://readlist.com/lists/vger.kernel.org/linux-kernel/49/245359.html , although
even if I could test it, I don't know how to reproduce the problem reliably.
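
To make the suspected cycle concrete, here is a small wait-for graph sketch in Python. It is only an illustration of the deadlock shape described in this report, under two assumptions that the traces alone do not confirm: that hald on CPU 0 holds the lock CPU 8 is spinning on, and that CPU 8 cannot acknowledge the IPI that smp_call_function is waiting for. The node names and the find_cycle helper are hypothetical.

```python
def find_cycle(waits_for):
    """Return the nodes of a cycle in a wait-for graph, or None.

    waits_for maps each node to the single node it is blocked on.
    """
    for start in waits_for:
        path = [start]
        seen = {start}
        node = waits_for.get(start)
        while node is not None:
            if node == start:
                return path          # walked back to where we began
            if node in seen:
                break                # a cycle that does not include start
            seen.add(node)
            path.append(node)
            node = waits_for.get(node)
    return None


# Hypothetical edges, loosely based on the two back traces:
#   CPU 0 (hald) is in smp_call_function, waiting for CPU 8 to respond.
#   CPU 8 (date) is in _spin_lock under sys_sysctl, waiting for a lock.
# ASSUMPTION: that lock is held by hald, and CPU 8 cannot respond while spinning.
WAITS_FOR = {
    "cpu0/hald": "ipi_ack_from_cpu8",
    "ipi_ack_from_cpu8": "cpu8/date",
    "cpu8/date": "kernel_lock",
    "kernel_lock": "cpu0/hald",
}

if __name__ == "__main__":
    print(find_cycle(WAITS_FOR))
```

If either assumed edge is wrong (for example, if CPU 8 can still service IPIs while spinning), the graph has no cycle and the hang needs a different explanation.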

Comment 6 Prarit Bhargava 2008-01-09 18:40:42 UTC
>Hald attempts to invalidate a block device (invalidate_bdev), which in turn
>calls lock_kernel. If other CPUs are attempting to handle IPI events at the
>same time, the kernel deadlocks.

lock_kernel should be safe to do when smp_call_function is called.

Wade, is the customer ejecting the CD and seeing a hang?

P.

Comment 8 Luming Yu 2008-01-28 07:04:18 UTC
From the discussion of the patch at
http://readlist.com/lists/vger.kernel.org/linux-kernel/49/245359.html , the
problem seems to be Altix-specific.

Prarit, what's your plan for this bug?

Comment 9 Russell Doty 2008-01-30 18:21:23 UTC
re: comment 8 - please note that this is a Unisys system.

Comment 13 Prarit Bhargava 2008-02-19 14:31:11 UTC
This is a tough one -- I've tried to reproduce it here on a 64p/128G box but so
far have been unable to.  I'll ping gbeshers (the on-site SGI engineer)
as well as jes (who acked the patch upstream) to see if they have any
ideas for reproducers.

P.

Comment 14 Prarit Bhargava 2008-02-19 14:50:07 UTC
Spoke with Jes regarding the patch -- he agrees with me that the patch is safe
to take in and shouldn't require any other patches.

I will post to RHKL shortly.

P.

Comment 15 Prarit Bhargava 2008-02-19 15:02:37 UTC
Created attachment 295293 [details]
RHEL4 patch for this issue

Comment 16 Russell Doty 2008-02-27 18:18:09 UTC
Since this is in POST, can we get a DEV ACK?

Comment 17 RHEL Program Management 2008-02-27 19:28:54 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 18 Vivek Goyal 2008-02-28 23:29:25 UTC
Committed in 68.15.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/

Comment 20 Chris Ward 2008-06-05 15:54:37 UTC
~~~~~~~~~~~~~~
~ Attention: ~ Feedback requested regarding this **High Priority** bug. 
~~~~~~~~~~~~~~

A fix for this issue should be included in the latest packages contained in
RHEL4.7-Snapshot1--available now on partners.redhat.com.

After you (Red Hat Partner) have verified that this issue has been addressed,
submit a comment describing the passing results of your test in appropriate
detail, along with which snapshot and package version tested. The bugzilla will
be updated by Red Hat Quality Engineering for you when this information has been
received.

If you believe this issue has not been properly fixed or you are unable to verify
the issue for any reason, please add a comment describing the most recent issues
you are experiencing, along with which snapshot and package version tested.

If you believe the bug has not been fixed, change the status of the bug to ASSIGNED.

If you are receiving this message in Issue Tracker, please reply with a message
to Issue Tracker about your results and bugzilla will be updated for you. 

If you need assistance accessing ftp://partners.redhat.com, please contact your
Partner Manager.

Thank you
Red Hat QE Partner Management

Comment 21 Chris Ward 2008-06-06 08:40:30 UTC
Correction: This bug is **Low Priority**. Sorry.

Comment 22 Chris Ward 2008-06-19 13:07:15 UTC
~~~~~~~~~~~~~~
~ Attention: ~ 
~~~~~~~~~~~~~~

A fix for this issue should be included in the latest kernel packages contained
in **kernel 2.6.9-73.EL**, accessible now on http://partners.redhat.com.

After you (Red Hat Partner) have verified that this issue has been addressed,
submit a comment describing the results of your test in appropriate detail,
along with which snapshot and package version tested. The bugzilla will be
updated by Red Hat Quality Engineering for you when this information has been
received.

If this issue has not been properly fixed or you are unable to verify the issue
for any reason, please add a comment describing the most recent issues you are
experiencing, along with which snapshot and package version tested. If you are
sure the bug has not been fixed, change the status of the bug to ASSIGNED.

For IssueTracker users, submit verification results as usual; Bugzilla will be
updated by Red Hat Quality Engineering for you.

For additional information, contact your Partner Manager.

Thank you,
Red Hat QE Partner Management

Comment 24 Chris Ward 2008-06-27 11:37:16 UTC
~ Attention ~ Testing Feedback Deadlines are approaching!

This bug should be fixed in Partner Snapshot 4, which is now available on
partners.redhat.com, ready for testing.

We are approaching the end of the RHEL 4.7 testing cycle. It is important to let
us know of your testing results *as soon as possible*. If any issues are found
once the testing phase deadline has passed, you might lose your opportunity to
include the fix in this RHEL update.

If you have verified that this bug has been properly fixed or have discovered
any issues, please provide detailed feedback of your testing results, including
the package version and snapshot tested. The bug status will be updated for
you, based on testing results.

Contact your Partner Manager with any additional questions.


Comment 25 Chris Ward 2008-07-14 11:09:12 UTC
~ Final Notice ~ Testing Phase Ending Soon

This bug should have been fixed in the latest RHEL 4.7 Release Candidate,
available **now** on partners.redhat.com.

If you have already verified that this bug has been properly fixed or have found
any issues, please provide detailed feedback of your testing results, including
the package version and snapshot tested. The bug status will be updated for you,
based on the returned testing results.

Contact your Partner Manager with any additional questions.

Comment 29 errata-xmlrpc 2008-07-24 19:19:04 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2008-0665.html

Comment 30 Chris Ward 2008-07-29 07:26:35 UTC
Partners, I would like to thank you all for your participation in assuring the
quality of this RHEL 4.7 Update Release. My hat's off to you all. Thanks.