Red Hat Bugzilla – Bug 446086
crash formatting a DVD under libata
Last modified: 2010-01-08 09:38:38 EST
Description of problem:
We boot with piix.intel_via_libata=1 in order to use libata and ata-piix in
place of piix to support our IDE devices on the ESB2 instead of the older IDE
code. Using libata, if you attempt to burn a blank DVD using k3b or directly
using growisofs, you get a crash. (CDs work fine.)
Version-Release number of selected component (if applicable):
All 5.1 and 5.2 kernels.
NOTE: If you escape this problem because the DVD is already formatted, you will
hit the problem in bug 220481, probably.
Steps to Reproduce:
1. Insert a completely-blank, unformatted DVD into the drive.
2. Either go to tools->burn DVD ISO image in k3b, or run growisofs directly:
/usr/bin/growisofs -Z /dev/scd0=<ISO image file> -use-the-force-luke=notray
-use-the-force-luke=tty -dvd-compat -speed=4 -use-the-force-luke=bufsize=32m
3. Crashes (null pointer dereference in __atapi_pio_bytes).
Hit bug 220481 unless you set the buffer size to 4M instead of 32M, probably,
but otherwise should burn DVD correctly.
What's going on here is that the ATAPI command (register file) is DMAed to the
IDE device, but the data to be written is PIOed. The queued command (qc)
contains a pointer to the scatterlist containing the data to be written. sg_next
just increments the pointer to the scatterlist. However, the scatterlist in the
qc has an additional piece if the transfer needed to be padded, which is stored
separately. ata_qc_next_sg knows how to iterate including this piece, whereas
sg_next will fall off the end of the list. The two places sg_next is used in
libata-core.c need to be replaced by ata_qc_next_sg.
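To make the difference concrete, here is a minimal user-space sketch of the two iteration styles described above. It is a simplified model, not the kernel code: the struct layouts and the names sg_entry, queued_cmd, naive_next, qc_next_sg and count_entries are assumptions made up for illustration, loosely following the description of a queued command that keeps its padding entry outside the scatterlist array.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures. Names and layout are
 * illustrative assumptions, not the real libata definitions. */
struct sg_entry {
    const void *buf;
    unsigned int length;
};

struct queued_cmd {
    struct sg_entry *sg;       /* start of the caller's scatterlist array */
    unsigned int n_elem;       /* number of entries in that array (>= 1) */
    unsigned int pad_len;      /* non-zero if the transfer was padded */
    struct sg_entry pad_sgent; /* padding entry, stored outside the array */
};

/* Naive step: just advance the pointer. After the last array entry this
 * yields a pointer one past the array, which is not a valid entry. */
static struct sg_entry *naive_next(struct sg_entry *sg)
{
    return sg + 1;
}

/* Pad-aware step: array entries first, then the separate pad entry
 * (if the transfer was padded), then NULL to end the walk. */
static struct sg_entry *qc_next_sg(struct sg_entry *sg, struct queued_cmd *qc)
{
    assert(sg != NULL && qc != NULL);
    if (sg == &qc->pad_sgent)
        return NULL;
    if ((size_t)(sg + 1 - qc->sg) < qc->n_elem)
        return sg + 1;
    if (qc->pad_len)
        return &qc->pad_sgent;
    return NULL;
}

/* Count the entries the pad-aware iterator visits, starting at qc->sg. */
static unsigned int count_entries(struct queued_cmd *qc)
{
    unsigned int n = 0;
    for (struct sg_entry *sg = qc->sg; sg != NULL; sg = qc_next_sg(sg, qc))
        n++;
    return n;
}
```

In this model, a padded command with two array entries is walked as three entries by qc_next_sg, while naive_next applied to the last array entry returns an address past the array, which is the "fall off the end of the list" case.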
Could you please provide more detailed info, including:
kernel version that you're using;
the OOPS dump;
vmcore, if it generated one:
RHEL5.2, so it is 2.6.18-92.el5. I had installed the k3b from the Client images since it is no longer in the Server images. I did not save the dump, only the console stack trace, though it ought to be easy enough to reproduce if I remove the fix to __atapi_pio_bytes in libata-core.c (we occlude libata.ko for our customers to avoid crashes); let me know if you need it. Here's the stack trace from the console:
Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
PGD b062067 PUD b45e067 PMD 0
Oops: 0000  SMP
last sysfs file: /class/net/bond0/flags
Modules linked in: ppp_deflate zlib_deflate ppp_async crc_ccitt ppp_generic
slhc ipv6 xfrm_nalgo crypto_api autofs4 hidp rfcomm l2cap bluetooth sunrpc
bonding iscsi_tcp ib_iser libiscsi scsi_transport_iscsi rdma_ucm ib_ucm ib_srp
ib_sdp rdma_cm ib_cm iw_cm ib_addr ib_ipoib ib_sa ib_uverbs ib_umad ib_mad
ib_core dm_mirror(U) dm_multipath(U) dm_mod(U) video sbs backlight i2c_ec
i2c_core button battery asus_acpi acpi_memhotplug ac parport_pc lp parport
ipmi_devintf ftmod(U) ipmi_msghandler st joydev vtm(FU) sr_mod cdrom(U) sg(U)
i5000_edac edac_mc pcspkr radeonfb(FU) fosil(U) e1000(U) ata_piix(U) aic79xx(U)
scsi_transport_spi aic94xx(U) libsas(U) libata(U) scsi_transport_sas(U)
sd_mod(U) scsi_mod(U) raid1(U) ext3jbd ehci_hcd ohci_hcd uhci_hcd(U)
Pid: 0, comm: swapper Tainted: GF 2.6.18-78.el5 #1
RIP: 0010:[<ffffffff880d7326>] [<ffffffff880d7326>]
RSP: 0018:ffffffff80416e58 EFLAGS: 00010046
RAX: ffff81007e4622f0 RBX: ffff81007e4600e0 RCX: 0000000000000000
RDX: ffff81007e460000 RSI: 0000000000000000 RDI: 0000000000015096
RBP: 0000000000000000 R08: 0000000000000000 R09: 000000000000003e
R10: ffff81007fe64038 R11: 0000000000000060 R12: 0000000000000002
R13: ffff81007e460000 R14: ffff81005e7ce6e0 R15: 0000000000000058
FS: 0000000000000000(0000) GS:ffffffff8039e000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 000000000b7e3000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo ffffffff803ce000, task ffffffff802e3ae0)
Stack: 0000000000000060 ffff81007e462420 ffff81007e460000 ffff81007e460000
0000000000000002 ffff81007e4622f0 0000000000000001 0000000000000000
0000000000000001 ffff81007e460000 ffff81007e4600e0 0000000000000058
<IRQ> [<ffffffff880dbbda>] :libata:ata_interrupt+0x15a/0x1e2
<EOI> [<ffffffff80069d51>] default_idle+0x29/0x50
Code: 48 8b 11 44 89 c7 41 03 7e 08 48 c1 ea 33 48 89 d0 48 c1 e8
RIP [<ffffffff880d7326>] :libata:ata_hsm_move+0x3b3/0x770
<0>Kernel panic - not syncing: Fatal exception
If you have already a fix, could you please attach?
I tried to reproduce the bug on two different machines without success. Could you please provide us the crash dump?
Are you applying any patch at libata or at ide?
I will have to install a clean system to make you a dump. You need to be using libata (without the fix mentioned above) to see this one. We force libata to be used via the piix.intel_via_libata=1 on the boot command line so that libata is used for the DVDROM instead of piix. So you need to try this on a system with a PIIX in it, or you will not be running libata for the DVDROM. I'll make you a dump, but probably cannot until next week or so since we are about to go into a test cycle on different code here. (Don't worry, I haven't forgotten!)
Created attachment 321550 [details]
svn diff of fix to libata/ata-core.c
I attached the patch to libata-core.c as the svn diff of our fix to libata.
It just replaces sg_next with the correct macro, ata_qc_next_sg, to avoid chasing the null pointer.
I am installing a clean system now so I can make you a crash dump, which I will attach when I have it.
I tried to attach the vmcore file from reproducing this bug after doing a clean install of RHEL5.2 2.6.18-92.el5, but eventually, after an hour or so, Firefox gave up transferring the very big vmcore file. We are going to have to come up with some other way to get you the crash dump file if you really need it.
ro root=/dev/md2 nmi_watchdog=0 console=tty0 console=ttyS0,115200 nosoftlockup piix.intel_via_libata=1 crashkernel=128M@16M
I copied over RHEL5.2-Client-20080516.6-x86_64-disc4-ftp.iso, both to use as the test image to burn to the DVD and to install the k3b RPM from since using k3b is easier than typing the lengthy growisofs command manually, though you can reproduce this crash without installing k3b from this Client ISO if you don't mind all the typing. Then I installed k3b and fed the DVD drive a blank unformatted DVD+RW. In k3b, go to tools->burn DVD ISO image, select the ISO image to burn, and click on "start" (I also set "verify written image" but I doubt that matters since it doesn't get that far). Crash occurred shortly thereafter. No other extra RPMs or any non-RedHat packages are installed.
To reproduce the bug, you need a system where the DVD burner IDE device is connected to a PIIX or ESB2 chip so that the piix.intel_via_libata=1 will cause the code path via libata (which works much better than piix, which is why we are using it, at least once we fixed this bug by overlaying libata.ko with one with this bug fixed). Intel-based Apple Macs have ESB2 chips, as do Stratus GeminiR (Rhapsody) and FusionH boxes (I did this on a FusionH). I think there is also a recent Dell server box that has that chip. I don't know what systems used the original PIIX. (Jim Paradis has a Stratus box in your lab in Westford so he can probably help you repro this.)
Created attachment 331120 [details]
Follows the sanitized /var/log/messages. I think all info at dmesg is inside this log.
(In reply to comment #9)
> Created an attachment (id=331120) [details]
> Follows the sanitized /var/log/messages. I think all info at dmesg is inside
> this log.
Please discard this comment. It was added to the wrong BZ#.
I think I have reproduced the problem that you are seeing and captured a
DATE: Wed Jan 28 13:28:46 2009
UPTIME: 5 days, 22:46:31
LOAD AVERAGE: 0.55, 0.21, 0.09
VERSION: #1 SMP Wed Oct 22 04:19:38 EDT 2008
MACHINE: x86_64 (2500 Mhz)
MEMORY: 3.9 GB
PANIC: "Oops: 0000  SMP " (check log for details)
PID: 6707 TASK: ffff8101150d4860 CPU: 2 COMMAND: "Xorg"
#0 [ffff8101043e7bb0] crash_kexec at ffffffff800aab3e
#1 [ffff8101043e7c70] __die at ffffffff800650ff
#2 [ffff8101043e7cb0] do_page_fault at ffffffff80066af1
#3 [ffff8101043e7da0] error_exit at ffffffff8005dde9
[exception RIP: ata_hsm_move+947]
RIP: ffffffff8810a30e RSP: ffff8101043e7e58 RFLAGS: 00013046
RAX: ffff81012e5ce2f0 RBX: ffff81012e5cc0e0 RCX: 0000000000000000
RDX: ffff81012e5cc000 RSI: 0000000000000000 RDI: 000000000001cc36
RBP: 0000000000000000 R8: 0000000000000000 R9: 00000000000000ff
R10: 00002b339e908904 R11: 000000001d426610 R12: 0000000000000002
R13: ffff81012e5cc000 R14: ffff8100b15950a0 R15: 0000000000000058
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000
#4 [ffff8101043e7e50] ata_hsm_move at ffffffff8810a1e7
#5 [ffff8101043e7ed0] ata_interrupt at ffffffff8810eb62
#6 [ffff8101043e7f20] handle_IRQ_event at ffffffff800109a8
#7 [ffff8101043e7f50] __do_IRQ at ffffffff800b73ae
#8 [ffff8101043e7f90] do_IRQ at ffffffff8006c575
--- <IRQ stack> ---
#9 [ffff81012e857f58] ret_from_intr at ffffffff8005d615
RIP: 00002b339e15396b RSP: 00007fff0f039f90 RFLAGS: 00003206
RAX: 00000000000000ff RBX: 0080008000800080 RCX: 00000000000000d4
RDX: 0000000000000d00 RSI: 00002b339e908958 RDI: 000000001d426625
RBP: 0080008000800080 R8: 0000000000000032 R9: 00000000000000ff
R10: 00002b339e908904 R11: 000000001d426610 R12: 00ff00ff00ff00ff
R13: 0000000000000007 R14: 00000000000000ff R15: 0080008000800080
ORIG_RAX: ffffffffffffff9d CS: 0033 SS: 002b
crash> dis -l ata_hsm_move
/usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/drivers/ata/libata-core.c: 5195
0xffffffff8810a24a <ata_hsm_move+751>: mov 0x78(%rbx),%r14
The above corresponds to line 5195 in libata-core.c
sg = qc->cursg;
Dump out content of R14
crash> rd ffff8100b15950a0
ffff8100b15950a0: 0000000000000000 ........
0xffffffff8810a302 <ata_hsm_move+935>: mov (%r14),%rcx
The above moves what R14 points to into RCX, now RCX contains zero.
/usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/drivers/ata/libata-core.c: 5232
0xffffffff8810a30e <ata_hsm_move+947>: mov (%rcx),%rdx
And finally, this dereferences the NULL pointer while trying to move what RCX
points to into RDX, and the system crashes.
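The two-load chain walked through above can be modelled in a few lines of user-space C. This is only a sketch of the reasoning, not kernel code: classify_chain and the fault_step enum are hypothetical names, and the model simply assumes (as the disassembly suggests) that the first word of the entry R14 points to is itself used as a pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Which step of the two-load chain would fault in this toy model. */
enum fault_step { NO_FAULT = 0, FAULT_FIRST_LOAD = 1, FAULT_SECOND_LOAD = 2 };

/* 'entry' models the address held in R14. Its first word is loaded
 * (into RCX in the trace) and then used as a pointer for a second load. */
static enum fault_step classify_chain(const uintptr_t *entry)
{
    if (entry == NULL)
        return FAULT_FIRST_LOAD;   /* mov (%r14),%rcx would fault */

    const uintptr_t *target = (const uintptr_t *)*entry;   /* value in RCX */
    if (target == NULL)
        return FAULT_SECOND_LOAD;  /* mov (%rcx),%rdx would fault */

    return NO_FAULT;
}
```

With an entry whose first word is zero, as in the rd output above, the model reports the fault at the second load, which matches the faulting instruction at ata_hsm_move+947.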
I agree with your analysis and patch, I have not been able to reproduce
the problem since applying it to -92.1.17.el5. I will need to apply your
patch to the latest RHEL5 sources and make sure the problem is not reproducible.
I should have a test kernel available for you soon.
(Mauro, if you don't mind I will take over this bug since it looks very
similar to BZ 467308).
Created attachment 331709 [details]
Update patch against -131.el5
Update patch against current RHEL5 sources.
I was able to reproduce the problem running -131.el5. After applying your
patch (Comment#12), I was no longer able to reproduce the problem.
RIP: 0010:[<ffffffff8811640c>] [<ffffffff8811640c>] :libata:ata_sff_hsm_move+0\
RSP: 0000:ffff8101043e7e68 EFLAGS: 00013046
RAX: ffff81012e6a63f0 RBX: ffff81012e6a40e0 RCX: 6f727020676e6974
RDX: ffff81012e6a6528 RSI: 0000000000000000 RDI: 000000000001cc36
RBP: 0000000000000000 R08: 0000000000000000 R09: ffff81012e6a40e0
R10: 00002b37bd46d794 R11: 000000001a4fb864 R12: 0000000000000002
R13: ffff81012e6a4000 R14: ffff8100c86e77e0 R15: 0000000000000058
FS: 00002b37ba5f2ad0(0000) GS:ffff81010439eec0(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000acc46c8 CR3: 000000012c828000 CR4: 00000000000006e0
Process Xorg (pid: 8461, threadinfo ffff81012d18c000, task ffff81010a9c37e0)
Stack: ffffffff88078eee 00000000cd41ab08 ffff81012e6a4000 ffff81012e6a6528
ffff81012e6a4000 000000002d86cb00 ffff81012e6a63f0 ffff81012e6a4000
ffff81012e6a40e0 0000000000000058 0000000000000000 ffff81012fc155d8
<IRQ> [<ffffffff88078eee>] :scsi_mod:scsi_next_command+0x2d/0x39
Please verify the problem is no longer reproducible running the kernel-2.6.18-131.el5.bz446086.1 test kernel.
Sorry for the delay - I had to scrounge for a system with an ESB2 in it as I am not working on that platform anymore these days. It works fine now with your fixed kernel. I checked and I see that the patch is in linux-kernel-test.patch. The offending code had moved from libata-core.c to libata-sff.c, routines ata_pio_sector and __atapi_pio_bytes. Thanks!
*** Bug 468624 has been marked as a duplicate of this bug. ***
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release. Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products. This request is not yet committed for inclusion in an Update
You can download this test kernel from http://people.redhat.com/dzickus/el5
Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so. However feel free
to provide a comment indicating that this fix has been verified.
~~ Attention - RHEL 5.4 Beta Released! ~~
RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner!
If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity.
Please do not flip the bug status to VERIFIED. Only post your verification results, and if available, update Verified field with the appropriate value.
Questions can be posted to this bug or your customer or partner representative.
I am on 5.3 for a while. I have since upgraded to AHCI. My /etc/modprobe:
old: alias scsi_hostadapter2 ata_piix
new: alias scsi_hostadapter2 ahci
Am I safe from this bug till I can finally upgrade to 5.4?
The original problem was seen using ata_piix, but the fix was actually to
libata core code, so it's possible that you could have a case where you run
into this crash, though you probably would have hit it by now since this
fix was committed to -134.el5.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.
(In reply to comment #23)
> The original problem was seen using ata_piix, but the fix was actually to
> libata core code, so it's possible that you could have a case where you run
> into this crash, though you probably would have hit it by now since this
> fix was committed to -134.el5.
Please excuse my paranoia on this issue. The last time I tried to cut a data DVD, this bug wiped out my hard drive (twice). If I had not also been paranoid about backup, I would have lost my business. To put it bluntly, I am scared.
$ uname -r
Does this mean I am now safe to cut a data DVD with K3B?
It should be ok, but I will try and retest on ahci/k3b. Please note this
BZ has been verified on the reported configuration and closed out, we will
need to open a new BZ for any new problems encountered. Thanks.