Bug 446086
| Summary: | crash formatting a DVD under libata | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Charlotte Richardson <charlotte.richardson> |
| Component: | kernel | Assignee: | David Milburn <dmilburn> |
| Status: | CLOSED ERRATA | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
| Severity: | medium | Docs Contact: | |
| Priority: | low | | |
| Version: | 5.2 | CC: | cward, dmilburn, jgarzik, lwang, mchehab, pasteur, peterm, ToddAndMargo |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2009-09-02 08:56:24 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |

Description
Charlotte Richardson
2008-05-12 17:17:02 UTC
Could you please provide more detailed info, including: kernel version that you're using; the OOPS dump; vmcore, if it generated one; sosreport.

RHEL5.2, so it is 2.6.18-92.el5. I had installed the k3b from the Client images since it is no longer in the Server images. I did not save the dump, only the console stack trace, though it ought to be easy enough to reproduce if I remove the fix to __atapi_pio_bytes in libata-core.c (we occlude libata.ko for our customers to avoid crashes); let me know if you need it. Here's the stack trace from the console:

Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
[<ffffffff880d7326>] :libata:ata_hsm_move+0x3b3/0x770
PGD b062067 PUD b45e067 PMD 0
Oops: 0000 [1] SMP
last sysfs file: /class/net/bond0/flags
CPU 0
Modules linked in: ppp_deflate zlib_deflate ppp_async crc_ccitt ppp_generic slhc ipv6 xfrm_nalgo crypto_api autofs4 hidp rfcomm l2cap bluetooth sunrpc bonding iscsi_tcp ib_iser libiscsi scsi_transport_iscsi rdma_ucm ib_ucm ib_srp ib_sdp rdma_cm ib_cm iw_cm ib_addr ib_ipoib ib_sa ib_uverbs ib_umad ib_mad ib_core dm_mirror(U) dm_multipath(U) dm_mod(U) video sbs backlight i2c_ec i2c_core button battery asus_acpi acpi_memhotplug ac parport_pc lp parport ipmi_devintf ftmod(U) ipmi_msghandler st joydev vtm(FU) sr_mod cdrom(U) sg(U) i5000_edac edac_mc pcspkr radeonfb(FU) fosil(U) e1000(U) ata_piix(U) aic79xx(U) scsi_transport_spi aic94xx(U) libsas(U) libata(U) scsi_transport_sas(U) sd_mod(U) scsi_mod(U) raid1(U) ext3 jbd ehci_hcd ohci_hcd uhci_hcd(U)
Pid: 0, comm: swapper Tainted: GF 2.6.18-78.el5 #1
RIP: 0010:[<ffffffff880d7326>] [<ffffffff880d7326>] :libata:ata_hsm_move+0x3b3/0x770
RSP: 0018:ffffffff80416e58 EFLAGS: 00010046
RAX: ffff81007e4622f0 RBX: ffff81007e4600e0 RCX: 0000000000000000
RDX: ffff81007e460000 RSI: 0000000000000000 RDI: 0000000000015096
RBP: 0000000000000000 R08: 0000000000000000 R09: 000000000000003e
R10: ffff81007fe64038 R11: 0000000000000060 R12: 0000000000000002
R13: ffff81007e460000 R14: ffff81005e7ce6e0 R15: 0000000000000058
FS: 0000000000000000(0000) GS:ffffffff8039e000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 000000000b7e3000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo ffffffff803ce000, task ffffffff802e3ae0)
Stack: 0000000000000060 ffff81007e462420 ffff81007e460000 ffff81007e460000
0000000000000002 ffff81007e4622f0 0000000000000001 0000000000000000
0000000000000001 ffff81007e460000 ffff81007e4600e0 0000000000000058
Call Trace:
<IRQ> [<ffffffff880dbbda>] :libata:ata_interrupt+0x15a/0x1e2
[<ffffffff8004d401>] hrtimer_run_queues+0xd9/0x16d
[<ffffffff8001088b>] handle_IRQ_event+0x29/0x58
[<ffffffff800b76eb>] __do_IRQ+0xa4/0x103
[<ffffffff8006b3e1>] do_IRQ+0xe7/0xf5
[<ffffffff80069d28>] default_idle+0x0/0x50
[<ffffffff8005c615>] ret_from_intr+0x0/0xa
<EOI> [<ffffffff80069d51>] default_idle+0x29/0x50
[<ffffffff8004721d>] cpu_idle+0x95/0xb8
[<ffffffff803d9801>] start_kernel+0x220/0x225
[<ffffffff803d922f>] _sinittext+0x22f/0x236
Code: 48 8b 11 44 89 c7 41 03 7e 08 48 c1 ea 33 48 89 d0 48 c1 e8
RIP [<ffffffff880d7326>] :libata:ata_hsm_move+0x3b3/0x770
RSP <ffffffff80416e58>
CR2: 0000000000000000
<0>Kernel panic - not syncing: Fatal exception

If you already have a fix, could you please attach it?

I tried to reproduce the bug on two different machines without success. Could you please provide us the crash dump? Are you applying any patch to libata or to ide?

I will have to install a clean system to make you a dump. You need to be using libata (without the fix mentioned above) to see this one. We force libata to be used via piix.intel_via_libata=1 on the boot command line so that libata is used for the DVDROM instead of piix. So you need to try this on a system with a PIIX in it, or you will not be running libata for the DVDROM. I'll make you a dump, but probably cannot until next week or so since we are about to go into a test cycle on different code here.

(Don't worry, I haven't forgotten!)

Created attachment 321550 [details]
svn diff of fix to libata/ata-core.c
I attached the patch to libata-core.c as the svn diff of our fix to libata. It just replaces sg_next with the correct macro, ata_qc_next_sg, to avoid chasing the null pointer. I am installing a clean system now so I can make you a crash dump, which I will attach when I have it.

I tried to attach the vmcore file from reproducing this bug after doing a clean install of RHEL5.2 2.6.18-92.el5, but eventually, after an hour or so, Firefox gave up transferring the very big vmcore file. We are going to have to come up with some other way to get you the crash dump file if you really need it.

/proc/cmdline is:
ro root=/dev/md2 nmi_watchdog=0 console=tty0 console=ttyS0,115200 nosoftlockup piix.intel_via_libata=1 crashkernel=128M@16M

I copied over RHEL5.2-Client-20080516.6-x86_64-disc4-ftp.iso, both to use as the test image to burn to the DVD and to install the k3b RPM from, since using k3b is easier than typing the lengthy growisofs command manually, though you can reproduce this crash without installing k3b from this Client ISO if you don't mind all the typing. Then I installed k3b and fed the DVD drive a blank unformatted DVD+RW. In k3b, go to tools->burn DVD ISO image, select the ISO image to burn, and click on "start" (I also set "verify written image" but I doubt that matters since it doesn't get that far). The crash occurred shortly thereafter. No other extra RPMs or any non-RedHat packages are installed.

To reproduce the bug, you need a system where the DVD burner IDE device is connected to a PIIX or ESB2 chip so that piix.intel_via_libata=1 will cause the code path to go via libata (which works much better than piix, which is why we are using it, at least once we fixed this bug by overlaying libata.ko with one with this bug fixed). Intel-based Apple Macs have ESB2 chips, as do Stratus GeminiR (Rhapsody) and FusionH boxes (I did this on a FusionH). I think there is also a recent Dell server box that has that chip. I don't know what systems used the original PIIX.
(Jim Paradis has a Stratus box in your lab in Westford so he can probably help you repro this.)

Created attachment 331120 [details]
/var/log/messages
Follows the sanitized /var/log/messages. I think all info at dmesg is inside this log.
(In reply to comment #9)
> Created an attachment (id=331120) [details]
> /var/log/messages
>
> Follows the sanitized /var/log/messages. I think all info at dmesg is inside
> this log.

Please discard this comment. It was added to the wrong BZ#.

Charlotte, I think I have reproduced the problem that you are seeing and captured a core dump.

crash> sys
KERNEL: /usr/lib/debug/lib/modules/2.6.18-92.1.17.el5/vmlinux
DUMPFILE: /var/crash/172.16.17.131-2009-01-28-13:29:02/vmcore
CPUS: 4
DATE: Wed Jan 28 13:28:46 2009
UPTIME: 5 days, 22:46:31
LOAD AVERAGE: 0.55, 0.21, 0.09
TASKS: 188
NODENAME: dhcp-122.hsv.redhat.com
RELEASE: 2.6.18-92.1.17.el5
VERSION: #1 SMP Wed Oct 22 04:19:38 EDT 2008
MACHINE: x86_64 (2500 Mhz)
MEMORY: 3.9 GB
PANIC: "Oops: 0000 [1] SMP " (check log for details)

crash> bt
PID: 6707 TASK: ffff8101150d4860 CPU: 2 COMMAND: "Xorg"
#0 [ffff8101043e7bb0] crash_kexec at ffffffff800aab3e
#1 [ffff8101043e7c70] __die at ffffffff800650ff
#2 [ffff8101043e7cb0] do_page_fault at ffffffff80066af1
#3 [ffff8101043e7da0] error_exit at ffffffff8005dde9
[exception RIP: ata_hsm_move+947]
RIP: ffffffff8810a30e RSP: ffff8101043e7e58 RFLAGS: 00013046
RAX: ffff81012e5ce2f0 RBX: ffff81012e5cc0e0 RCX: 0000000000000000
RDX: ffff81012e5cc000 RSI: 0000000000000000 RDI: 000000000001cc36
RBP: 0000000000000000 R8: 0000000000000000 R9: 00000000000000ff
R10: 00002b339e908904 R11: 000000001d426610 R12: 0000000000000002
R13: ffff81012e5cc000 R14: ffff8100b15950a0 R15: 0000000000000058
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000
#4 [ffff8101043e7e50] ata_hsm_move at ffffffff8810a1e7
#5 [ffff8101043e7ed0] ata_interrupt at ffffffff8810eb62
#6 [ffff8101043e7f20] handle_IRQ_event at ffffffff800109a8
#7 [ffff8101043e7f50] __do_IRQ at ffffffff800b73ae
#8 [ffff8101043e7f90] do_IRQ at ffffffff8006c575
--- <IRQ stack> ---
#9 [ffff81012e857f58] ret_from_intr at ffffffff8005d615
RIP: 00002b339e15396b RSP: 00007fff0f039f90 RFLAGS: 00003206
RAX: 00000000000000ff RBX: 0080008000800080 RCX: 00000000000000d4
RDX: 0000000000000d00 RSI: 00002b339e908958 RDI: 000000001d426625
RBP: 0080008000800080 R8: 0000000000000032 R9: 00000000000000ff
R10: 00002b339e908904 R11: 000000001d426610 R12: 00ff00ff00ff00ff
R13: 0000000000000007 R14: 00000000000000ff R15: 0080008000800080
ORIG_RAX: ffffffffffffff9d CS: 0033 SS: 002b

crash> dis -l ata_hsm_move
.
.
/usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/drivers/ata/libata-core.c: 5195
0xffffffff8810a24a <ata_hsm_move+751>: mov 0x78(%rbx),%r14

The above corresponds to line 5195 in libata-core.c:

sg = qc->cursg;

Dump out the content of what R14 points to:

crash> rd ffff8100b15950a0
ffff8100b15950a0: 0000000000000000 ........

include/linux/libata-compat.h: 37
0xffffffff8810a302 <ata_hsm_move+935>: mov (%r14),%rcx

The above moves what R14 points to into RCX, so RCX now contains zero.

/usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/drivers/ata/libata-core.c: 5232
0xffffffff8810a30e <ata_hsm_move+947>: mov (%rcx),%rdx

And finally, this dereferences the NULL pointer while trying to move what RCX points to into RDX, and the system crashes.

I agree with your analysis and patch; I have not been able to reproduce the problem since applying it to -92.1.17.el5. I will need to apply your patch to the latest RHEL5 sources and make sure the problem is not reproducible. I should have a test kernel available for you soon.

(Mauro, if you don't mind I will take over this bug since it looks very similar to BZ 467308.)

Thank you, David

Created attachment 331709 [details]
Update patch against -131.el5
Update patch against current RHEL5 sources.
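
The failure mode analysed in this thread (a PIO transfer loop stepping its current scatter/gather pointer past the last valid entry and then dereferencing what it finds there) can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the attached patch and not the kernel's code: `struct sg_entry`, `struct cmd`, `next_sg_bounded` and `pio_consume` are hypothetical names invented for this example, and the real `ata_qc_next_sg` helper may handle cases this model leaves out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one scatter/gather segment
 * (NOT the kernel's struct scatterlist). */
struct sg_entry {
    const unsigned char *buf; /* data backing this segment */
    size_t len;               /* number of bytes in this segment */
};

/* Hypothetical stand-in for a queued command that owns a segment list. */
struct cmd {
    struct sg_entry *sg;    /* first segment of the list */
    size_t n_elem;          /* number of valid segments */
    struct sg_entry *cursg; /* segment currently being transferred */
    size_t cursg_ofs;       /* bytes already consumed from cursg */
};

/* Bounds-aware step: unlike a bare "sg + 1", this knows how many
 * segments the command owns and returns NULL after the last one. */
static struct sg_entry *next_sg_bounded(struct sg_entry *sg,
                                        const struct cmd *c)
{
    size_t next_idx = (size_t)(sg - c->sg) + 1;
    return next_idx < c->n_elem ? sg + 1 : NULL;
}

/* Consume up to `bytes` bytes from the command's segments.
 * Returns the number of bytes actually consumed, and stops cleanly
 * when the list is exhausted instead of touching a stale entry. */
static size_t pio_consume(struct cmd *c, size_t bytes)
{
    size_t done = 0;

    assert(c != NULL);
    while (done < bytes && c->cursg != NULL) {
        size_t avail = c->cursg->len - c->cursg_ofs;
        size_t n = (bytes - done < avail) ? bytes - done : avail;

        done += n;
        c->cursg_ofs += n;
        if (c->cursg_ofs == c->cursg->len) {
            /* Segment finished: advance with the bounded helper. */
            c->cursg = next_sg_bounded(c->cursg, c);
            c->cursg_ofs = 0;
        }
    }
    return done;
}
```

With a two-segment list of 4 and 6 bytes, asking `pio_consume` for 16 bytes returns 10 and leaves `cursg` set to NULL, so a later call returns 0 rather than reading through an entry the command does not own.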
I was able to reproduce the problem running -131.el5; after applying your patch (Comment #12), I was no longer able to reproduce the problem.

RIP: 0010:[<ffffffff8811640c>] [<ffffffff8811640c>] :libata:ata_sff_hsm_move+0x335/0x6de
RSP: 0000:ffff8101043e7e68 EFLAGS: 00013046
RAX: ffff81012e6a63f0 RBX: ffff81012e6a40e0 RCX: 6f727020676e6974
RDX: ffff81012e6a6528 RSI: 0000000000000000 RDI: 000000000001cc36
RBP: 0000000000000000 R08: 0000000000000000 R09: ffff81012e6a40e0
R10: 00002b37bd46d794 R11: 000000001a4fb864 R12: 0000000000000002
R13: ffff81012e6a4000 R14: ffff8100c86e77e0 R15: 0000000000000058
FS: 00002b37ba5f2ad0(0000) GS:ffff81010439eec0(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000acc46c8 CR3: 000000012c828000 CR4: 00000000000006e0
Process Xorg (pid: 8461, threadinfo ffff81012d18c000, task ffff81010a9c37e0)
Stack: ffffffff88078eee 00000000cd41ab08 ffff81012e6a4000 ffff81012e6a6528
ffff81012e6a4000 000000002d86cb00 ffff81012e6a63f0 ffff81012e6a4000
ffff81012e6a40e0 0000000000000058 0000000000000000 ffff81012fc155d8
Call Trace:
<IRQ> [<ffffffff88078eee>] :scsi_mod:scsi_next_command+0x2d/0x39
[<ffffffff8811760d>] :libata:ata_sff_interrupt+0x14c/0x1c7
[<ffffffff80010a0d>] handle_IRQ_event+0x51/0xa6
[<ffffffff800b7b09>] __do_IRQ+0xa4/0x103
[<ffffffff80011f8b>] __do_softirq+0x89/0x133
[<ffffffff8006c95d>] do_IRQ+0xe7/0xf5
[<ffffffff8005d615>] ret_from_intr+0x0/0xa
<EOI>

Please verify the problem is no longer reproducible running the kernel-2.6.18-131.el5.bz446086.1 test kernel.

http://people.redhat.com/dmilburn/

Sorry for the delay - I had to scrounge for a system with an ESB2 in it as I am not working on that platform anymore these days. It works fine now with your fixed kernel. I checked and I see that the patch is in linux-kernel-test.patch. The offending code had moved from libata-core.c to libata-sff.c, routines ata_pio_sector and __atapi_pio_bytes. Thanks!

*** Bug 468624 has been marked as a duplicate of this bug. ***

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

in kernel-2.6.18-134.el5

You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However feel free to provide a comment indicating that this fix has been verified.

~~ Attention - RHEL 5.4 Beta Released! ~~

RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner!

If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity.

Please do not flip the bug status to VERIFIED. Only post your verification results, and if available, update Verified field with the appropriate value.

Questions can be posted to this bug or your customer or partner representative.

Hi All,

I have been on 5.3 for a while. I have since upgraded to AHCI. My /etc/modprobe:

old: alias scsi_hostadapter2 ata_piix
new: alias scsi_hostadapter2 ahci

Am I safe from this bug till I can finally upgrade to 5.4?

Many thanks,
-T

Todd,

The original problem was seen using ata_piix. The fix was actually to libata core code, so it's possible that you could have a case where you run into this crash, though you probably would have hit it by now since this fix was committed to -134.el5.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html

(In reply to comment #23)
> Todd,
>
> The original problem was seen using ata_piix, the fix was actually to
> libata core code so its possible that you could have a case where you run
> into this crash, though you probably would have hit it by now since this
> fix was committed to -134.el5.

Please excuse my paranoia on this issue. The last time I tried to cut a data DVD, this bug wiped out my hard drive (twice). If I had not also been paranoid about backup, I would have lost my business. To put it bluntly, I am scared.

$ uname -r
2.6.18-164.9.1.el5

Does this mean I am now safe to cut a data DVD with K3B?

Many thanks,
-T

Todd,

It should be ok, but I will try and retest on ahci/k3b. Please note this BZ has been verified on the reported configuration and closed out; we will need to open a new BZ for any new problems encountered.

Thanks.