Description of problem:
Hald attempts to invalidate a block device (invalidate_bdev), which in turn calls lock_kernel. If other CPUs are attempting to handle IPI events, the kernel deadlocks.

Version-Release number of selected component (if applicable):
2.6.9-42.EL #1 SMP Wed Jul 12 23:25:09 EDT 2006 ia64 ia64 ia64 GNU/Linux

How reproducible:
Unreproducible.

Steps to Reproduce:
1. Boot kernel
2. Time passes
3. Eject CD.

Actual results:
System deadlocks (I think), not responsive, diskdump catches core.

Expected results:
No deadlock.

Additional info:

*********************back trace for hald**************************
PID: 10303  TASK: e00000012bbc8000  CPU: 0  COMMAND: "hald"
 #0 [BSP:e00000012bbc95b8] start_disk_dump at a000000200059170
 #1 [BSP:e00000012bbc9598] try_crashdump at a0000001000b02d0
 #2 [BSP:e00000012bbc9418] ia64_init_handler at a000000100050390
  EFRAME: e0000000047e5f30
      B0: a00000010005bfd0       CR_IIP: a00000010005c020
 CR_IPSR: 0000101008126010       CR_IFS: 8000000000000691
  AR_PFS: 0000000000000000       AR_RSC: 0000000000000003
 AR_UNAT: 0000000000000000      AR_RNAT: 0000000000000000
  AR_CCV: e00000040bbf6488      AR_FPSR: 0009804c8a70033f
  LOADRS: 0000000000000000  AR_BSPSTORE: 0000000000000000
      B6: a000000100015f80           B7: a00000010000ee90
      PR: 0000000005aa9655           R1: a0000001009bb720
      R2: 0000000000037000           R3: 0000000000037000
      R8: 0000000000003700           R9: 00000000000000fe
     R10: 0000000000000000          R11: 0000000000000000
     R12: e00000012bbcfb00          R13: e00000012bbc8000
     R14: c0000000fee37000          R15: c0000000fee00000
     R16: 00000000000000fe          R17: 000000000000001f
     R18: 0000000000000000          R19: a0000001007ce9ac
     R20: 0000000000000001          R21: 0000000000000000
     R22: e00000012bbc8dd4          R23: e000000001080000
     R24: 0000000000000000          R25: 0000000000000000
     R26: 0000000000000001          R27: 0000000000000000
     R28: a00000010066fc28          R29: a000000100015f80
     R30: 000000000000000e          R31: a0000001006fc880
      F6: 1003e0000000000000000      F7: 1003e0000000000004000
      F8: 1003e00000000a02f2882      F9: 1003e00000753c4092dc4
     F10: 1003eb40f8974e2cbdd88     F11: 1003e0000000000000495
 #3 [BSP:e00000012bbc9388] smp_call_function at a00000010005c020
 #4 [BSP:e00000012bbc9360] invalidate_bdev at a000000100128b30
 #5 [BSP:e00000012bbc9330] __invalidate_device at a000000100167780
 #6 [BSP:e00000012bbc9300] check_disk_change at a000000100139940
 #7 [BSP:e00000012bbc9290] cdrom_open at a0000001003f0f00
 #8 [BSP:e00000012bbc9248] idecd_open at a0000001003e7860
 #9 [BSP:e00000012bbc91d0] do_open at a00000010013a640
#10 [BSP:e00000012bbc9198] blkdev_open at a00000010013ae00
#11 [BSP:e00000012bbc9128] __dentry_open at a0000001001220a0
#12 [BSP:e00000012bbc90e8] filp_open at a0000001001223b0
#13 [BSP:e00000012bbc9068] sys_open at a0000001001229b0
#14 [BSP:e00000012bbc9068] ia64_ret_from_syscall at a00000010000f4a0
  EFRAME: e00000012bbcfe40
      B0: 4000000000037870       CR_IIP: a000000000010640
 CR_IPSR: 00001213081a6018       CR_IFS: 0000000000000000
  AR_PFS: c000000000000006       AR_RSC: 000000000000000f
 AR_UNAT: 0000000000000000      AR_RNAT: 0000000000000000
  AR_CCV: 0000000000000000      AR_FPSR: 0009804c8a70033f
  LOADRS: 0000000002580000  AR_BSPSTORE: 60000fff7fffc120
      B6: 20000000005e1740           B7: 0000000000000000
      PR: 0000000005aaaa61           R1: 2000000000664238
      R2: c00000000000030d           R3: 0000000000000001
      R8: 8000000000000000           R9: 0000000046e66943
     R10: 0000000000000000          R11: c000000000000612
     R12: 60000fffffffb7f0          R13: 20000000002334a0
     R14: 600000000000ad50          R15: 0000000000000404
     R16: 000000000000006e          R17: 00000000e0001000
     R18: 0000000000000000          R19: 0000000000000000
     R20: 0009804c8a70033f          R21: 40000000000378c0
     R22: 20000000005ef260          R23: 60000fff7fffc120
     R24: 0000000000000000          R25: 0000000000000000
     R26: c000000000000006          R27: 000000000000000f
     R28: a000000000010640          R29: 00001213081a6018
     R30: 0000000000000000          R31: 0000000005a9aa61
      F6: 000000000000000000000      F7: 000000000000000000000
      F8: 000000000000000000000      F9: 000000000000000000000
     F10: 000000000000000000000     F11: 000000000000000000000
#15 [BSP:e00000012bbc9068] __kernel_syscall_via_break at a000000000010640

*************************back trace for cpu 8************************
PID: 23343  TASK: e000000318738000  CPU: 8  COMMAND: "date"
 #0 [BSP:e000000318739078] freeze_cpu at a0000002000586c0
 #1 [BSP:e000000318739000] handle_IPI at a00000010005b340
 #2 [BSP:e000000318738fb8] handle_IRQ_event at a0000001000130f0
 #3 [BSP:e000000318738f50] do_IRQ at a000000100013ae0
 #4 [BSP:e000000318738f08] ia64_handle_irq at a000000100015d10
 #5 [BSP:e000000318738f08] ia64_leave_kernel at a00000010000f600
  EFRAME: e00000031873fc20
      B0: a0000001000881d0       CR_IIP: a000000100008cb0
 CR_IPSR: 0000101008126030       CR_IFS: 8000000000000000
  AR_PFS: 0000000000000081       AR_RSC: 0000000000000003
 AR_UNAT: 0000000000000000      AR_RNAT: 0000000000000010
  AR_CCV: 0000000000000000      AR_FPSR: 0009804c0270033f
  LOADRS: 0000000000000000  AR_BSPSTORE: 000000002986ac30
      B6: a000000100593de0           B7: a0000001000128d0
      PR: 0000000095555a99           R1: a0000001009bb720
      R2: 0000000000000612           R3: 60000fffffffba41
      R8: 0000000000000000           R9: 0000000000000001
     R10: 0000000000000000          R11: 00000000955506d9
     R12: e00000031873fde0          R13: e000000318738000
     R14: 0000000000000000          R15: 60000fffffffb9f0
     R16: 0000000000000050          R17: e00000031873fe40
     R18: e00000031873fe41          R19: 60000fffffffba70
     R20: 00000000ffffffff          R21: e000000318738028
     R22: e000000318738e10          R23: 0000000000000000
     R24: 0000000000000000          R25: 0000000000000000
     R26: c00000000000038f          R27: 0000000000000000
     R28: 0000000000000000          R29: 0000000000000002
     R30: 0000000000000001          R31: a0000001006fd600
      F6: 0ffff8000000000000000      F7: 1003e0000000000000020
      F8: 1003e0000000000000037      F9: 100048000000000000000
     F10: 1003e0000000000000037     F11: 10004dfdffffff2020000
 #6 [BSP:e000000318738f08] ia64_spinlock_contention at a000000100008cb0
 #7 [BSP:e000000318738f00] _spin_lock at a000000100593de0
 #8 [BSP:e000000318738ea0] sys_sysctl at a0000001000881d0
 #9 [BSP:e000000318738ea0] ia64_ret_from_syscall at a00000010000f4a0
  EFRAME: e00000031873fe40
      B0: 20000000003074a0       CR_IIP: a000000000010640
 CR_IPSR: 0000121308126038       CR_IFS: 0000000000000000
  AR_PFS: c00000000000038f       AR_RSC: 000000000000000f
 AR_UNAT: 0000000000000000      AR_RNAT: 0000000000000000
  AR_CCV: 0000000000000000      AR_FPSR: 0009804c8a70033f
  LOADRS: 0000000000900000  AR_BSPSTORE: 60000fff7fffc078
      B6: 2000000000230980           B7: 0000000000000000
      PR: 000000009555a261           R1: 2000000000294238
      R2: c000000000000410           R3: 60000fffffffba30
      R8: 8000000000000000           R9: 200000000032f1b8
     R10: 0000000000000001          R11: 0000000000000000
     R12: 60000fffffffb9f0          R13: 2000000000334e00
     R14: 2000000000294238          R15: 000000000000047e
     R16: 200000000003cc78          R17: 20000000002308b0
     R18: 200000000003cc30          R19: 0000001000000815
     R20: 2000000000305d00          R21: 200000000003cc30
     R22: 200000000003cc30          R23: 20000000002f72d8
     R24: 2000000000043560          R25: 20000000000435a8
     R26: 20000000002f72c0          R27: 20000000002f72d0
     R28: 20000000000432e8          R29: 20000000000432f0
     R30: 000000000019c8b0          R31: 20000000000999c8
      F6: 1003e00000000000006e0      F7: 1003e0000000000000020
      F8: 1003e0000000000000037      F9: 100048000000000000000
     F10: 1003e0000000000000037     F11: 10004dfdffffff2020000
#10 [BSP:e000000318738ea0] __kernel_syscall_via_break at a000000000010640

Possible fix at http://readlist.com/lists/vger.kernel.org/linux-kernel/49/245359.html , although even if I could test this, I don't know how to reproduce the problem in any practical way.
>Hald attempts to invalidate a block device ( invalidate_bdev ) which in turn
>calls lock_kernel, if other CPU's are attempting to handle IPI events, the
>kernel becomes deadlocked.

lock_kernel should be safe to do when smp_call_function is called.

Wade, is the customer ejecting the CD and seeing a hang?

P.
From the discussion of the patch at http://readlist.com/lists/vger.kernel.org/linux-kernel/49/245359.html , the problem seems to be Altix-specific. Prarit, what's your plan for this bug?
re: comment 8 - please note that this is a Unisys system.
This is a tough one -- I've tried to reproduce it here on a 64p/128G box but so far have been unable to. I'll ping gbeshers (on-site SGI engineer) as well as jes (who acked the patch upstream) to see if they have any ideas on reproducers.

P.
Spoke with Jes regarding the patch -- he agrees with me that the patch is safe to take in and shouldn't require any other patches. I will post to RHKL shortly.

P.
Created attachment 295293: RHEL4 patch for this issue
Since this is in POST, can we get a DEV ACK?
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Committed in 68.15.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/
~~~~~~~~~~~~~~
~ Attention: ~ Feedback requested regarding this **High Priority** bug.
~~~~~~~~~~~~~~

A fix for this issue should be included in the latest packages contained in RHEL4.7-Snapshot1--available now on partners.redhat.com.

After you (Red Hat Partner) have verified that this issue has been addressed, submit a comment describing the passing results of your test in appropriate detail, along with which snapshot and package version tested. The bugzilla will be updated by Red Hat Quality Engineering for you when this information has been received.

If you believe this issue has not been properly fixed or you are unable to verify the issue for any reason, please add a comment describing the most recent issues you are experiencing, along with which snapshot and package version tested. If you believe the bug has not been fixed, change the status of the bug to ASSIGNED.

If you are receiving this message in Issue Tracker, please reply with a message to Issue Tracker about your results and bugzilla will be updated for you. If you need assistance accessing ftp://partners.redhat.com, please contact your Partner Manager.

Thank you
Red Hat QE Partner Management
Correction: This bug is **Low Priority**. Sorry.
~~~~~~~~~~~~~~
~ Attention: ~
~~~~~~~~~~~~~~

A fix for this issue should be included in the latest kernel packages contained in **kernel 2.6.9-73.EL**, accessible now on http://partners.redhat.com.

After you (Red Hat Partner) have verified that this issue has been addressed, submit a comment describing the results of your test in appropriate detail, along with which snapshot and package version tested. The bugzilla will be updated by Red Hat Quality Engineering for you when this information has been received.

If this issue has not been properly fixed or you are unable to verify the issue for any reason, please add a comment describing the most recent issues you are experiencing, along with which snapshot and package version tested. If you are sure the bug has not been fixed, change the status of the bug to ASSIGNED.

For IssueTracker users, submit verification results as usual; Bugzilla will be updated by Red Hat Quality Engineering for you. For additional information, contact your Partner Manager.

Thank you,
Red Hat QE Partner Management
~ Attention ~ Testing Feedback Deadlines are approaching!

This bug should be fixed in Partner Snapshot 4, which is now available on partners.redhat.com, ready for testing.

We are approaching the end of the RHEL 4.7 testing cycle. It is important to let us know of your testing results *as soon as possible*. If any issues are found once the testing phase deadline has passed, you might lose your opportunity to include the fix in this RHEL update.

If you have verified that this bug has been properly fixed or have discovered any issues, please provide detailed feedback of your testing results, including the package version and snapshot tested. The bug status will be updated for you, based on testing results.

Contact your Partner Manager with any additional questions.
~ Final Notice ~ Testing Phase Ending Soon

This bug should have been fixed in the latest RHEL 4.7 Release Candidate, available **now** on partners.redhat.com.

If you have already verified that this bug has been properly fixed or have found any issues, please provide detailed feedback of your testing results, including the package version and snapshot tested. The bug status will be updated for you, based on the returned testing results.

Contact your Partner Manager with any additional questions.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2008-0665.html
Partners, I would like to thank you all for your participation in assuring the quality of this RHEL 4.7 Update Release. My hat's off to you all. Thanks.