Bug 446086

Summary: crash formatting a DVD under libata
Product: Red Hat Enterprise Linux 5
Reporter: Charlotte Richardson <charlotte.richardson>
Component: kernel
Assignee: David Milburn <dmilburn>
Status: CLOSED ERRATA
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: medium
Priority: low
Version: 5.2
CC: cward, dmilburn, jgarzik, lwang, mchehab, pasteur, peterm, ToddAndMargo
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2009-09-02 08:56:24 UTC
Attachments:
Description                            Flags
svn diff of fix to libata/ata-core.c   none
/var/log/messages                      none
Update patch against -131.el5          none

Description Charlotte Richardson 2008-05-12 17:17:02 UTC
Description of problem:
We boot with piix.intel_via_libata=1 in order to use libata and ata-piix in
place of piix to support our IDE devices on the ESB2 instead of the older IDE
code. Using libata, if you attempt to burn a blank DVD using k3b or directly
using growisofs, you get a crash. (CDs work fine.)


Version-Release number of selected component (if applicable):
All 5.1 and 5.2 kernels.


How reproducible:
Always. 

NOTE: If you escape this problem because the DVD is already formatted, you will
hit the problem in bug 220481, probably.


Steps to Reproduce:
1. Insert a completely-blank, unformatted DVD into the drive.
2. Either go to tools->burn DVD ISO image in k3b, or run growisofs directly:

/usr/bin/growisofs -Z /dev/scd0=<ISO image file> -use-the-force-luke=notray
-use-the-force-luke=tty -dvd-compat -speed=4 -use-the-force-luke=bufsize=32m

3. Crashes (null pointer dereference in __atapi_pio_bytes).
  
Actual results:
Boom!


Expected results:
Hit bug 220481 unless you set the buffer size to 4M instead of 32M, probably,
but otherwise should burn DVD correctly.


Additional info:
What's going on here is that the ATAPI command (register file) is DMAed to the
IDE device, but the data to be written is PIOed. The queued command (qc)
contains a pointer to the scatterlist containing the data to be written. sg_next
just increments the pointer to the scatterlist. However, the scatterlist in the
qc has an additional piece if the transfer needed to be padded, which is stored
separately. ata_qc_next_sg knows how to iterate including this piece, whereas
sg_next will fall off the end of the list. The two places sg_next is used in
libata-core.c need to be replaced by ata_qc_next_sg.
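
The iteration problem described above can be illustrated with a small userspace model. This is only a sketch: the struct layouts, the field names (sg_table, n_elem, pad_len, pad_sgent) and the helper names (naive_next, qc_next, total_bytes, naive_step_leaves_table) are simplified assumptions made up for illustration, not the RHEL5 kernel definitions. It shows why a plain pointer increment walks past the end of the array, while an iterator that knows about the qc visits the separately stored pad entry and then stops.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a scatterlist entry (hypothetical layout). */
struct seg {
    void *buf;            /* data for this segment */
    unsigned int length;  /* bytes in this segment */
};

/* Simplified stand-in for the queued command (hypothetical layout). */
struct qcmd {
    struct seg *sg_table;  /* array of n_elem regular segments */
    unsigned int n_elem;   /* number of entries in sg_table */
    unsigned int pad_len;  /* non-zero if the transfer was padded */
    struct seg pad_sgent;  /* pad segment, stored outside sg_table */
};

/* Naive iterator: just bumps the pointer. It knows nothing about the
 * end of the array or about the pad entry. */
static struct seg *naive_next(struct seg *sg)
{
    return sg + 1;
}

/* qc-aware iterator: walks sg_table, then the pad entry (if any),
 * then returns NULL. */
static struct seg *qc_next(struct seg *sg, struct qcmd *qc)
{
    if (sg == &qc->pad_sgent)
        return NULL;
    if ((size_t)(sg + 1 - qc->sg_table) < qc->n_elem)
        return sg + 1;
    if (qc->pad_len)
        return &qc->pad_sgent;
    return NULL;
}

/* Total bytes reachable with the qc-aware iterator. */
static unsigned int total_bytes(struct qcmd *qc)
{
    unsigned int total = 0;
    struct seg *sg;

    assert(qc->n_elem > 0);
    for (sg = qc->sg_table; sg != NULL; sg = qc_next(sg, qc))
        total += sg->length;
    return total;
}

/* Returns 1 if one naive step from the last regular entry lands just
 * past the array (the pointer is only compared, never dereferenced). */
static int naive_step_leaves_table(struct qcmd *qc)
{
    struct seg *last = &qc->sg_table[qc->n_elem - 1];
    struct seg *next = naive_next(last);

    return next == qc->sg_table + qc->n_elem;
}
```

With a two-entry table of 2048 and 1022 bytes plus a 2-byte pad entry, total_bytes visits three segments, while naive_next applied to the last table entry yields a pointer past the array and never reaches the pad entry. That is the "falls off the end of the list" case in the description.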

Comment 1 Mauro Carvalho Chehab 2008-08-07 18:27:47 UTC
Could you please provide more detailed info, including:

kernel version that you're using;
the OOPS dump;
vmcore, if it generated one;
sosreport.

Comment 2 Charlotte Richardson 2008-08-07 18:54:00 UTC
RHEL5.2, so it is 2.6.18-92.el5. I had installed the k3b from the Client images since it is no longer in the Server images. I did not save the dump, only the console stack trace, though it ought to be easy enough to reproduce if I remove the fix to __atapi_pio_bytes in libata-core.c (we occlude libata.ko for our customers to avoid crashes); let me know if you need it. Here's the stack trace from the console:

Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
 [<ffffffff880d7326>] :libata:ata_hsm_move+0x3b3/0x770
PGD b062067 PUD b45e067 PMD 0
Oops: 0000 [1] SMP
last sysfs file: /class/net/bond0/flags
CPU 0
Modules linked in: ppp_deflate zlib_deflate ppp_async crc_ccitt ppp_generic
slhc ipv6 xfrm_nalgo crypto_api autofs4 hidp rfcomm l2cap bluetooth sunrpc
bonding iscsi_tcp ib_iser libiscsi scsi_transport_iscsi rdma_ucm ib_ucm ib_srp
ib_sdp rdma_cm ib_cm iw_cm ib_addr ib_ipoib ib_sa ib_uverbs ib_umad ib_mad
ib_core dm_mirror(U) dm_multipath(U) dm_mod(U) video sbs backlight i2c_ec
i2c_core button battery asus_acpi acpi_memhotplug ac parport_pc lp parport
ipmi_devintf ftmod(U) ipmi_msghandler st joydev vtm(FU) sr_mod cdrom(U) sg(U)
i5000_edac edac_mc pcspkr radeonfb(FU) fosil(U) e1000(U) ata_piix(U) aic79xx(U)
scsi_transport_spi aic94xx(U) libsas(U) libata(U) scsi_transport_sas(U)
sd_mod(U) scsi_mod(U) raid1(U) ext3jbd ehci_hcd ohci_hcd uhci_hcd(U)
Pid: 0, comm: swapper Tainted: GF     2.6.18-78.el5 #1
RIP: 0010:[<ffffffff880d7326>]  [<ffffffff880d7326>]
:libata:ata_hsm_move+0x3b3/0x770
RSP: 0018:ffffffff80416e58  EFLAGS: 00010046
RAX: ffff81007e4622f0 RBX: ffff81007e4600e0 RCX: 0000000000000000
RDX: ffff81007e460000 RSI: 0000000000000000 RDI: 0000000000015096
RBP: 0000000000000000 R08: 0000000000000000 R09: 000000000000003e
R10: ffff81007fe64038 R11: 0000000000000060 R12: 0000000000000002
R13: ffff81007e460000 R14: ffff81005e7ce6e0 R15: 0000000000000058
FS:  0000000000000000(0000) GS:ffffffff8039e000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 000000000b7e3000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo ffffffff803ce000, task ffffffff802e3ae0)
Stack:  0000000000000060 ffff81007e462420 ffff81007e460000 ffff81007e460000
 0000000000000002 ffff81007e4622f0 0000000000000001 0000000000000000
 0000000000000001 ffff81007e460000 ffff81007e4600e0 0000000000000058
Call Trace:
 <IRQ>  [<ffffffff880dbbda>] :libata:ata_interrupt+0x15a/0x1e2
 [<ffffffff8004d401>] hrtimer_run_queues+0xd9/0x16d
 [<ffffffff8001088b>] handle_IRQ_event+0x29/0x58
 [<ffffffff800b76eb>] __do_IRQ+0xa4/0x103
 [<ffffffff8006b3e1>] do_IRQ+0xe7/0xf5
 [<ffffffff80069d28>] default_idle+0x0/0x50
 [<ffffffff8005c615>] ret_from_intr+0x0/0xa
 <EOI>  [<ffffffff80069d51>] default_idle+0x29/0x50
 [<ffffffff8004721d>] cpu_idle+0x95/0xb8
 [<ffffffff803d9801>] start_kernel+0x220/0x225
 [<ffffffff803d922f>] _sinittext+0x22f/0x236

Code: 48 8b 11 44 89 c7 41 03 7e 08 48 c1 ea 33 48 89 d0 48 c1 e8
RIP  [<ffffffff880d7326>] :libata:ata_hsm_move+0x3b3/0x770
RSP <ffffffff80416e58>
CR2: 0000000000000000
 <0>Kernel panic - not syncing: Fatal exception

Comment 3 Mauro Carvalho Chehab 2008-08-08 13:53:11 UTC
If you already have a fix, could you please attach it?

Comment 4 Mauro Carvalho Chehab 2008-10-20 18:26:14 UTC
I tried to reproduce the bug on two different machines without success. Could you please provide us the crash dump?

Are you applying any patch to libata or to ide?

Comment 5 Charlotte Richardson 2008-10-22 22:02:29 UTC
I will have to install a clean system to make you a dump. You need to be using libata (without the fix mentioned above) to see this one. We force libata to be used via piix.intel_via_libata=1 on the boot command line so that libata is used for the DVDROM instead of piix. So you need to try this on a system with a PIIX in it, or you will not be running libata for the DVDROM. I'll make you a dump, but probably cannot until next week or so since we are about to go into a test cycle on different code here. (Don't worry, I haven't forgotten!)

Comment 6 Charlotte Richardson 2008-10-26 16:08:59 UTC
Created attachment 321550 [details]
svn diff of fix to libata/ata-core.c

Comment 7 Charlotte Richardson 2008-10-26 16:12:04 UTC
I attached the patch to libata-core.c as the svn diff of our fix to libata.
It just replaces sg_next with the correct macro, ata_qc_next_sg, to avoid chasing the null pointer.

I am installing a clean system now so I can make you a crash dump, which I will attach when I have it.

Comment 8 Charlotte Richardson 2008-10-26 18:28:44 UTC
I tried to attach the vmcore file from reproducing this bug after doing a clean install of RHEL5.2 2.6.18-92.el5, but eventually, after an hour or so, Firefox gave up transferring the very big vmcore file. We are going to have to come up with some other way to get you the crash dump file if you really need it.

/proc/cmdline is

ro root=/dev/md2 nmi_watchdog=0 console=tty0 console=ttyS0,115200 nosoftlockup piix.intel_via_libata=1 crashkernel=128M@16M

I copied over RHEL5.2-Client-20080516.6-x86_64-disc4-ftp.iso, both to use as the test image to burn to the DVD and to install the k3b RPM from since using k3b is easier than typing the lengthy growisofs command manually, though you can reproduce this crash without installing k3b from this Client ISO if you don't mind all the typing. Then I installed k3b and fed the DVD drive a blank unformatted DVD+RW. In k3b, go to tools->burn DVD ISO image, select the ISO image to burn, and click on "start" (I also set "verify written image" but I doubt that matters since it doesn't get that far). Crash occurred shortly thereafter. No other extra RPMs or any non-RedHat packages are installed.

To reproduce the bug, you need a system where the DVD burner IDE device is connected to a PIIX or ESB2 chip so that piix.intel_via_libata=1 will cause the code path to go via libata (which works much better than piix, which is why we are using it, at least once we fixed this bug by overlaying libata.ko with one with this bug fixed). Intel-based Apple Macs have ESB2 chips, as do Stratus GeminiR (Rhapsody) and FusionH boxes (I did this on a FusionH). I think there is also a recent Dell server box that has that chip. I don't know what systems used the original PIIX. (Jim Paradis has a Stratus box in your lab in Westford so he can probably help you repro this.)

Comment 9 Mauro Carvalho Chehab 2009-02-06 12:56:16 UTC
Created attachment 331120 [details]
/var/log/messages

Follows the sanitized /var/log/messages. I think all info at dmesg is inside this log.

Comment 10 Mauro Carvalho Chehab 2009-02-06 12:57:37 UTC
(In reply to comment #9)
> Created an attachment (id=331120) [details]
> /var/log/messages
> 
> Follows the sanitized /var/log/messages. I think all info at dmesg is inside
> this log.

Please discard this comment. It was added to the wrong BZ#.

Comment 11 David Milburn 2009-02-11 18:06:32 UTC
Charlotte, 

I think I have reproduced the problem that you are seeing and captured a 
core dump.

crash> sys
      KERNEL: /usr/lib/debug/lib/modules/2.6.18-92.1.17.el5/vmlinux
    DUMPFILE: /var/crash/172.16.17.131-2009-01-28-13:29:02/vmcore
        CPUS: 4
        DATE: Wed Jan 28 13:28:46 2009
      UPTIME: 5 days, 22:46:31
LOAD AVERAGE: 0.55, 0.21, 0.09
       TASKS: 188
    NODENAME: dhcp-122.hsv.redhat.com
     RELEASE: 2.6.18-92.1.17.el5
     VERSION: #1 SMP Wed Oct 22 04:19:38 EDT 2008
     MACHINE: x86_64  (2500 Mhz)
      MEMORY: 3.9 GB
       PANIC: "Oops: 0000 [1] SMP " (check log for details)

crash> bt
PID: 6707   TASK: ffff8101150d4860  CPU: 2   COMMAND: "Xorg"
 #0 [ffff8101043e7bb0] crash_kexec at ffffffff800aab3e
 #1 [ffff8101043e7c70] __die at ffffffff800650ff
 #2 [ffff8101043e7cb0] do_page_fault at ffffffff80066af1
 #3 [ffff8101043e7da0] error_exit at ffffffff8005dde9
    [exception RIP: ata_hsm_move+947]
    RIP: ffffffff8810a30e  RSP: ffff8101043e7e58  RFLAGS: 00013046
    RAX: ffff81012e5ce2f0  RBX: ffff81012e5cc0e0  RCX: 0000000000000000
    RDX: ffff81012e5cc000  RSI: 0000000000000000  RDI: 000000000001cc36
    RBP: 0000000000000000   R8: 0000000000000000   R9: 00000000000000ff
    R10: 00002b339e908904  R11: 000000001d426610  R12: 0000000000000002
    R13: ffff81012e5cc000  R14: ffff8100b15950a0  R15: 0000000000000058
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
 #4 [ffff8101043e7e50] ata_hsm_move at ffffffff8810a1e7
 #5 [ffff8101043e7ed0] ata_interrupt at ffffffff8810eb62
 #6 [ffff8101043e7f20] handle_IRQ_event at ffffffff800109a8
 #7 [ffff8101043e7f50] __do_IRQ at ffffffff800b73ae
 #8 [ffff8101043e7f90] do_IRQ at ffffffff8006c575
--- <IRQ stack> ---
 #9 [ffff81012e857f58] ret_from_intr at ffffffff8005d615
    RIP: 00002b339e15396b  RSP: 00007fff0f039f90  RFLAGS: 00003206
    RAX: 00000000000000ff  RBX: 0080008000800080  RCX: 00000000000000d4
    RDX: 0000000000000d00  RSI: 00002b339e908958  RDI: 000000001d426625
    RBP: 0080008000800080   R8: 0000000000000032   R9: 00000000000000ff
    R10: 00002b339e908904  R11: 000000001d426610  R12: 00ff00ff00ff00ff
    R13: 0000000000000007  R14: 00000000000000ff  R15: 0080008000800080
    ORIG_RAX: ffffffffffffff9d  CS: 0033  SS: 002b

crash> dis -l ata_hsm_move
 .
 .
/usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/drivers/ata/libata-core.c: 5195
0xffffffff8810a24a <ata_hsm_move+751>:  mov    0x78(%rbx),%r14

The above corresponds to line 5195 in libata-core.c
        sg = qc->cursg;

Dump out content of R14

crash> rd ffff8100b15950a0
ffff8100b15950a0:  0000000000000000                    ........

include/linux/libata-compat.h: 37
0xffffffff8810a302 <ata_hsm_move+935>:  mov    (%r14),%rcx

The above moves what R14 points to into RCX, now RCX contains zero.

/usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/drivers/ata/libata-core.c: 5232
0xffffffff8810a30e <ata_hsm_move+947>:  mov    (%rcx),%rdx

And finally, this dereferences the NULL pointer while trying to move what RCX
points to into RDX, and the system crashes.
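
The three loads above can be read as roughly the following C. This is a hedged sketch, not the kernel source: the struct names (qc_model, sg_model, page_model), the field names page and flags, and the helper read_page_flags are assumptions for illustration, and the real offsets (such as 0x78 for cursg) are not reproduced. It only shows the chain of dereferences and where a NULL first word in the scatterlist entry turns into a fault.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layouts: only the first field of each struct matters. */
struct page_model { unsigned long flags; };
struct sg_model   { struct page_model *page; unsigned int offset, length; };
struct qc_model   { struct sg_model *cursg; };

/* Mirrors the three loads, but checks before the final dereference.
 * Returns 0 and stores the first word of the page struct on success,
 * or -1 if the entry's first word is NULL (the case seen in the dump). */
static int read_page_flags(const struct qc_model *qc, unsigned long *out)
{
    const struct sg_model *sg = qc->cursg;   /* like: mov 0x78(%rbx),%r14 */
    const struct page_model *page;

    assert(sg != NULL);
    page = sg->page;                         /* like: mov (%r14),%rcx */
    if (page == NULL)
        return -1;                           /* in the dump, the next load faulted */
    *out = page->flags;                      /* like: mov (%rcx),%rdx */
    return 0;
}
```

Called on an entry whose first word is zero, as in the rd output above, read_page_flags returns -1 instead of faulting. The guard is there only to make the model safe to run. The fix discussed in this bug changes how the next entry is chosen rather than adding such a check.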

I agree with your analysis and patch, I have not been able to reproduce
the problem since applying it to -92.1.17.el5. I will need to apply your
patch to the latest RHEL5 sources and make sure the problem is not reproducible.
I should have a test kernel available for you soon.

(Mauro, if you don't mind I will take over this bug since it looks very
similar to BZ 467308).

Thank you,
David

Comment 12 David Milburn 2009-02-12 16:30:51 UTC
Created attachment 331709 [details]
Update patch against -131.el5

Update patch against current RHEL5 sources.

Comment 13 David Milburn 2009-02-12 16:35:30 UTC
I was able to reproduce the problem running -131.el5. After applying your
patch (updated in comment #12), I was no longer able to reproduce it.

RIP: 0010:[<ffffffff8811640c>]  [<ffffffff8811640c>] :libata:ata_sff_hsm_move+0x335/0x6de
RSP: 0000:ffff8101043e7e68  EFLAGS: 00013046
RAX: ffff81012e6a63f0 RBX: ffff81012e6a40e0 RCX: 6f727020676e6974
RDX: ffff81012e6a6528 RSI: 0000000000000000 RDI: 000000000001cc36
RBP: 0000000000000000 R08: 0000000000000000 R09: ffff81012e6a40e0
R10: 00002b37bd46d794 R11: 000000001a4fb864 R12: 0000000000000002
R13: ffff81012e6a4000 R14: ffff8100c86e77e0 R15: 0000000000000058
FS:  00002b37ba5f2ad0(0000) GS:ffff81010439eec0(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000acc46c8 CR3: 000000012c828000 CR4: 00000000000006e0
Process Xorg (pid: 8461, threadinfo ffff81012d18c000, task ffff81010a9c37e0)
Stack:  ffffffff88078eee 00000000cd41ab08 ffff81012e6a4000 ffff81012e6a6528
 ffff81012e6a4000 000000002d86cb00 ffff81012e6a63f0 ffff81012e6a4000
 ffff81012e6a40e0 0000000000000058 0000000000000000 ffff81012fc155d8
Call Trace:
 <IRQ>  [<ffffffff88078eee>] :scsi_mod:scsi_next_command+0x2d/0x39
 [<ffffffff8811760d>] :libata:ata_sff_interrupt+0x14c/0x1c7
 [<ffffffff80010a0d>] handle_IRQ_event+0x51/0xa6
 [<ffffffff800b7b09>] __do_IRQ+0xa4/0x103
 [<ffffffff80011f8b>] __do_softirq+0x89/0x133
 [<ffffffff8006c95d>] do_IRQ+0xe7/0xf5
 [<ffffffff8005d615>] ret_from_intr+0x0/0xa
 <EOI>

Comment 14 David Milburn 2009-02-12 16:37:07 UTC
Please verify the problem is no longer reproducible running the kernel-2.6.18-131.el5.bz446086.1 test kernel.

http://people.redhat.com/dmilburn/

Comment 15 Charlotte Richardson 2009-02-18 18:56:18 UTC
Sorry for the delay - I had to scrounge for a system with an ESB2 in it as I am not working on that platform anymore these days. It works fine now with your fixed kernel. I checked and I see that the patch is in linux-kernel-test.patch. The offending code had moved from libata-core.c to libata-sff.c, routines ata_pio_sector and __atapi_pio_bytes. Thanks!

Comment 16 David Milburn 2009-02-19 15:09:39 UTC
*** Bug 468624 has been marked as a duplicate of this bug. ***

Comment 18 RHEL Program Management 2009-02-19 23:00:07 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 19 Don Zickus 2009-03-09 18:54:18 UTC
in kernel-2.6.18-134.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 21 Chris Ward 2009-07-03 18:02:50 UTC
~~ Attention - RHEL 5.4 Beta Released! ~~

RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner!

If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity.

Please do not flip the bug status to VERIFIED. Only post your verification results, and if available, update Verified field with the appropriate value.

Questions can be posted to this bug or your customer or partner representative.

Comment 22 Todd 2009-07-27 19:44:15 UTC
Hi All,

I have been on 5.3 for a while.  I have since upgraded to AHCI.  My /etc/modprobe:

old: alias scsi_hostadapter2 ata_piix
new: alias scsi_hostadapter2 ahci

Am I safe from this bug till I can finally upgrade to 5.4?

Many thanks,
-T

Comment 23 David Milburn 2009-07-27 22:21:19 UTC
Todd,

The original problem was seen using ata_piix, but the fix was actually to
libata core code, so it's possible that you could have a case where you run
into this crash, though you probably would have hit it by now since this
fix was committed to -134.el5.

Comment 25 errata-xmlrpc 2009-09-02 08:56:24 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html

Comment 26 Todd 2009-12-28 04:28:27 UTC
(In reply to comment #23)
> Todd,
> 
> The original problem was seen using ata_piix, but the fix was actually to
> libata core code, so it's possible that you could have a case where you run
> into this crash, though you probably would have hit it by now since this
> fix was committed to -134.el5.

Please excuse my paranoia on this issue.  The last time I tried to cut a data DVD, this bug wiped out my hard drive (twice).  If I had not also been paranoid about backup, I would have lost my business.  To put it bluntly, I am scared.

$ uname -r
2.6.18-164.9.1.el5

Does this mean I am now safe to cut a data DVD with K3B?

Many thanks,
-T

Comment 27 David Milburn 2010-01-08 14:38:38 UTC
Todd,

It should be ok, but I will try and retest on ahci/k3b. Please note this
BZ has been verified on the reported configuration and closed out; we will
need to open a new BZ for any new problems encountered. Thanks.