Bug 480317

Summary:

guest reports repeatedly ATA error

Product:

Red Hat Enterprise Linux 5

Reporter:

Karel Volný <kvolny>

Component:

xen

Assignee:

Michal Novotny <minovotn>

Status:

CLOSED ERRATA

QA Contact:

Virtualization Bugs <virt-bugs>

Severity:

medium

Priority:

low

Version:

5.5

CC:

alehman, areis, clalance, drjones, iaslanidis, jbastian, jdenemar, mathieu-acct, mganisin, minovotn, mmalik, nicolas.monnet, pbonzini, swilsonau, tuchkin, xen-maint, xinsun, yuzhang

Hardware:

All

OS:

Linux

Fixed In Version:

xen-3.0.3-102.el5

Doc Type:

Bug Fix

Last Closed:

2010-03-30 08:59:22 UTC

Attachments:

Description	Flags
bootscreen	none
Qemu Libata fix	none

Description Karel Volný 2009-01-16 13:35:00 UTC

Description of problem:
I tried to install Fedora 10 as a guest on RHEL-5 host. The installation process got frozen at the end, which I believe to be a consequence of this error (as it was the bootloader installation phase). Despite that, I am able to boot the guest system, but I am getting a lot of ATA errors.

Version-Release number of selected component (if applicable):
kernel-xen-2.6.18-128.el5
xen-3.0.3-80.el5

How reproducible:
always

Steps to Reproduce:
1. (run xen enabled system)
2. virt-install -n F10 -r 512 -f F10.img -s 10 --vnc --hvm -c ./boot.iso
3. perform the default installation
4. reboot the guest
  
Actual results:
the guest console is flooded with repetitions of the following error message:

ata2: soft resetting link
ata2.00: configured for MWDMA2
ata2: EH complete
ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata2.00: cmd a0/00:00:00:00:00/00:00:00:00:00/a0 tag 0
         cdb 1e 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
         res 41/20:03:00:00:00/00:00:00:00:00/a0 Emask 0x3 (HSM violation)
ata2.00: status: { DRDY ERR }
ata2: soft resetting link
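
For readers decoding the log above: `cmd a0` is the ATA PACKET command, so ata2.00 is an ATAPI device (most likely the emulated CD-ROM rather than the hard disk), and the first byte of the `cdb` line is the SCSI opcode carried in that packet. A minimal Python sketch follows; the helper name `describe_cdb` is hypothetical, and the opcode table is deliberately limited to the opcodes that appear in this report.

```python
# Hypothetical helper, not part of libata or of any tool named in this bug.
# The table only covers the SCSI/MMC opcodes that show up in this report.
SCSI_OPCODES = {
    0x00: "TEST UNIT READY",
    0x12: "INQUIRY",
    0x1E: "PREVENT ALLOW MEDIUM REMOVAL",
    0x4A: "GET EVENT STATUS NOTIFICATION",
    0x5A: "MODE SENSE(10)",
}


def describe_cdb(line):
    """Return the opcode name for a libata 'cdb ...' log line."""
    fields = line.split()
    if not fields or fields[0] != "cdb":
        raise ValueError("not a cdb line: %r" % line)
    opcode = int(fields[1], 16)  # first CDB byte is the opcode
    return SCSI_OPCODES.get(opcode, "unknown opcode 0x%02x" % opcode)


if __name__ == "__main__":
    print(describe_cdb("cdb 1e 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00"))
```

Under these assumptions, the failing commands in this thread are routine ATAPI housekeeping requests, not reads or writes of disk data.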


Expected results:
no errors occur

Additional info:
the physical hardware is Dell Precision 490

what I find strange is that virt-manager reports the virtual disk drive as IDE (hda), while during the guest installation it was detected as sda

Comment 1 Karel Volný 2009-02-20 12:53:56 UTC

Created attachment 332698 [details]
bootscreen

I have experienced the same problem, installation freezing at the end, also with recent Rawhide

unfortunately, after rebooting the guest, I am unable to boot it, see the screenshot - pay attention also to the reported hard drive size

Comment 2 Karel Volný 2009-02-20 12:57:28 UTC

I forgot to mention that the virtual guest in the screenshot uses a disk partition instead of an image file as the hard drive device

Comment 4 Sergey Tuchkin 2009-03-02 12:23:38 UTC

Reproduced on an FC10 guest with an image file as the hard drive device hda.
The host is Scientific Linux 5.2 x86_64, xen-3.0.3-64.el5_2.9.x86_64

Comment 5 Chris Lalancette 2009-03-02 14:27:46 UTC

Can you try passing "clocksource=acpi_pm" to the guest kernel before you boot it, and see if that makes a difference?  There is a bug in F-10 having to do with paravirtualized clocks, and I'm wondering if this is another instance of it.

I'm also going to change the component to "xen" for the time being; this is either a bug in the guest emulation (i.e. xen), or it's a bug in the guest kernel (in which case we would move it to F-10 kernel).  But it's definitely not python-virtinst's problem.

Chris Lalancette

Comment 6 Sergey Tuchkin 2009-03-02 14:54:17 UTC

Yes, I tried, but it didn't help - I see the same ata2 errors in dmesg output:

[root@fc10 ~]# cat /proc/cmdline 
ro root=/dev/VolGroup00/LogVol00 rhgb quiet clocksource=acpi_pm
[root@fc10 ~]# dmesg|tail
ata2.00: status: { DRDY ERR }
ata2: soft resetting link
ata2.00: configured for MWDMA2
ata2: EH complete
ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata2.00: cmd a0/00:00:00:00:00/00:00:00:00:00/a0 tag 0
         cdb 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
         res 41/20:03:00:00:00/00:00:00:00:00/a0 Emask 0x3 (HSM violation)
ata2.00: status: { DRDY ERR }
ata2: soft resetting link
[root@fc10 ~]# uname -a
Linux fc10.xen.home 2.6.27.15-170.2.24.fc10.x86_64 #1 SMP Wed Feb 11 23:14:31 EST 2009 x86_64 x86_64 x86_64 GNU/Linux

And I agree that python-virtinst is not the source of this problem.

Comment 7 Karel Volný 2009-03-05 10:33:19 UTC

(In reply to comment #5)
> Can you try passing "clocksource=acpi_pm" to the guest kernel before you
> boot it, and see if that makes a difference?

the same for me, passing this option does not help

tried using Rawhide, kernel 2.6.29-0.179.rc6.git5.fc11.x86_64

Comment 8 Michal Novotny 2009-06-09 11:36:18 UTC

Karel: Could you attach the domain configuration file to this BZ?

Sergey: I have also run into this issue using xen-3.0.3-87.el5 RPMs with kernel-xen-2.6.18-146.el5xen. This may be a kernel-xen problem as well as an IOEMU problem, and it is most definitely not a python-virtinst problem because it happens even after the VM is installed. I tried an F10 i386 FV guest...

Comment 9 Michal Novotny 2009-06-09 15:08:15 UTC

I've been poking around in the IOEMU code with no luck so far, but it may be a kernel issue, because I found some information on fedora-kernel-list at:

http://www.mail-archive.com/fedora-kernel-list@redhat.com/msg00087.html

It may be related to 2.6.20+ kernels, but "pci=nomsi" is not the answer because it does not work either. Maybe it is some kernel issue.

Michal

Comment 10 Sam Wilson 2009-06-15 07:10:40 UTC

I have been running into this issue as well. However, it may still be related to libvirt somehow: once I created a secondary disk in virt-manager set to "SCSI Disk", there were no errors when accessing that disk, whereas there is a stream of "soft resetting link" errors when accessing the IDE-created device (which shows up as /dev/sda1).

Sam.

Comment 11 Michal Novotny 2009-10-09 12:58:38 UTC

Hi Sam,
well, you're suggesting a relation to libvirt. I don't think that's the issue, but for clarification, could you provide your libvirt version and the exact steps you took both to see and not to see those errors?

Thanks,
Michal

Comment 12 Jeff Bastian 2009-10-09 16:20:30 UTC

I'm also hitting this error installing early builds of RHEL 6.0 on a RHEL 5.4 host with 
   kernel-xen-2.6.18-164.el5
   libvirt-0.6.3-20.1.el5_4
   libvirt-python-0.6.3-20.1.el5_4
   python-virtinst-0.400.3-5.el5
   virt-manager-0.6.1-8.el5
   xen-3.0.3-94.el5
   xen-libs-3.0.3-94.el5

I started the RHEL 6 install with virt-install:
  virt-install -n rhel6 -r 512 --vcpus=1 -f /var/lib/xen/images/rhel6 \
    -b xenbr0 --vnc --noautoconsole -v --os-type=linux --os-variant=fedora11 \
    -c /tmp/rhel6/boot.iso

Note that I used an OS variant of fedora11 since rhel6 is not listed yet for virt-install.

On the first boot after installation it spit out hundreds of these errors, but it eventually booted all the way.

This thread implies this is fixed upstream:
  http://www.mail-archive.com/linux-ide@vger.kernel.org/msg14513.html

Comment 14 Andrew Jones 2009-10-12 08:31:13 UTC

*** Bug 526662 has been marked as a duplicate of this bug. ***

Comment 16 Paolo Bonzini 2009-11-25 15:45:43 UTC

Upstream patch is here: http://www.mail-archive.com/qemu-devel@nongnu.org/msg11844.html

The backport to Xen's qemu is almost trivial.

Comment 17 Michal Novotny 2009-11-26 12:55:47 UTC

(In reply to comment #16)
> Upstream patch is here:
> http://www.mail-archive.com/qemu-devel@nongnu.org/msg11844.html
> 
> The backport to Xen's qemu is almost trivial.  

Thanks for pointing this out. I'll backport this one ...

Michal

Comment 18 Michal Novotny 2009-11-26 14:47:32 UTC

Created attachment 374020 [details]
Qemu Libata fix

Well, I have this backported, but I am unable to reproduce the problem even with Fedora 10 and Fedora 12 x86_64... This is the patch, but could somebody tell me how to reproduce the problem?

Michal

Comment 20 Karel Volný 2009-11-26 16:17:28 UTC

(In reply to comment #18)
> Well, I have this backported but I am unable to reproduce it even with Fedora
> 10 and Fedora 12 x86_64...

could it be that it is somehow hardware dependent?

(unfortunately, I can't reinstall my machine to RHEL-5 right now to try)

Comment 21 Andrew Jones 2009-11-26 16:34:37 UTC

When I boot a RHEL-6 64b fv guest with xen -100 I get tons and tons of ata
errors on the console. After applying the patch in comment 18 I don't get those
errors to the console anymore, but dmesg still shows a few of these.

ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata2.00: BMDMA stat 0x5
ata2.00: cmd a0/01:00:00:80:00/00:00:00:00:00/a0 tag 0 dma 16512 in
         cdb 5a 00 2a 00 00 00 00 00  80 00 00 00 00 00 00 00
         res 48/20:02:00:1c:00/00:00:00:00:00/a0 Emask 0x2 (HSM violation)
ata2.00: status: { DRDY DRQ }
ata2: soft resetting link

The same results for f11 (2.6.30.9-96).

Comment 22 Michal Novotny 2009-11-26 22:31:06 UTC

(In reply to comment #21)
> When I boot a RHEL-6 64b fv guest with xen -100 I get tons and tons of ata
> errors on the console. After applying the patch in comment 18 I don't get those
> errors to the console anymore, but dmesg still shows a few of these.
> 
> ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
> ata2.00: BMDMA stat 0x5
> ata2.00: cmd a0/01:00:00:80:00/00:00:00:00:00/a0 tag 0 dma 16512 in
>          cdb 5a 00 2a 00 00 00 00 00  80 00 00 00 00 00 00 00
>          res 48/20:02:00:1c:00/00:00:00:00:00/a0 Emask 0x2 (HSM violation)
> ata2.00: status: { DRDY DRQ }
> ata2: soft resetting link
> 
> The same results for f11 (2.6.30.9-96).  

Well, maybe that is the effect of the upstream qemu patch: once an error occurs, the other HSM violations are no longer shown, so only a few of them appear. So did this improve the situation?

Michal

Comment 23 Andrew Jones 2009-11-27 10:28:44 UTC

Now that I look again more closely, the error I reported in comment 21 is different from the error originally reported in this bug. I have DRDY DRQ and the original report was for DRDY ERR. It looks like the proposed patch does eliminate the DRDY ERRs. So the DRDY DRQ errors are something else and deserve a different bug.
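
The difference between DRDY DRQ and DRDY ERR can also be read directly from the first byte of the `res` line, which is the ATA status register value. The following is a minimal Python sketch, not code from the patch; the helper names `decode_status` and `status_from_res_line` and the regular expression are illustrative assumptions, while the bit masks are the standard ATA status register bits.

```python
import re

# Standard ATA status register bits (mask, name). Only the flags that
# appear in libata's "status: { ... }" lines are listed here.
STATUS_BITS = [
    (0x80, "BSY"),
    (0x40, "DRDY"),
    (0x20, "DF"),
    (0x08, "DRQ"),
    (0x01, "ERR"),
]

# Assumed shape of the log line: "res SS/EE:..." where SS is the status byte.
RES_RE = re.compile(r"res ([0-9a-fA-F]{2})/")


def decode_status(status):
    """Return the names of the status bits set in an ATA status byte."""
    return [name for mask, name in STATUS_BITS if status & mask]


def status_from_res_line(line):
    """Extract and decode the status byte from a libata 'res ...' line."""
    match = RES_RE.search(line)
    if match is None:
        raise ValueError("not a res line: %r" % line)
    return decode_status(int(match.group(1), 16))


if __name__ == "__main__":
    for value in (0x41, 0x48, 0x49):
        print("0x%02x" % value, " ".join(decode_status(value)))
```

With these masks, 0x41 decodes to DRDY ERR, 0x48 to DRDY DRQ, and 0x49 to DRDY DRQ ERR, which matches the `status:` lines printed next to the corresponding `res` lines in this report.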

Comment 24 Andrew Jones 2009-11-27 10:46:18 UTC

Ok, I just backed up and double-checked without the patch. The error continuously output to the console is Emask 0x2 { DRDY DRQ ERR }. So I never reproduced exactly the same thing as the originator. This may not make a difference, but it should perhaps be investigated. I'll sort it out and open a new bug for it if necessary.

As far as this bug goes, I believe the patch works. When we don't violate the HSM, we avoid getting constant exceptions.

Comment 26 Ioannis Aslanidis 2009-11-30 12:03:25 UTC

I still see the same errors under a fully-virtualized environment:

Linux fedora-11-64 2.6.30.9-96.fc11.x86_64 #1 SMP Wed Nov 4 00:02:04 EST 2009 x86_64 x86_64 x86_64 GNU/Linux

{{{
 ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
 ata2.00: cmd a0/00:00:00:08:00/00:00:00:00:00/a0 tag 0 pio 16392 in
         cdb 4a 01 00 00 10 00 00 00  08 00 00 00 00 00 00 00
         res 41/50:03:00:08:00/00:00:00:00:00/a0 Emask 0x3 (HSM violation)
 ata2.00: status: { DRDY ERR }
 ata2: soft resetting link
 ata2.00: configured for MWDMA2
 ata2: EH complete
}}}

Apart from that, the guest tends to hang every few days.

Comment 27 Michal Novotny 2009-12-07 10:32:13 UTC

(In reply to comment #26)
> I still see the same errors under a fully-virtualized environment:
> 
> Linux fedora-11-64 2.6.30.9-96.fc11.x86_64 #1 SMP Wed Nov 4 00:02:04 EST 2009
> x86_64 x86_64 x86_64 GNU/Linux
> 
> {{{
>  ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
>  ata2.00: cmd a0/00:00:00:08:00/00:00:00:00:00/a0 tag 0 pio 16392 in
>          cdb 4a 01 00 00 10 00 00 00  08 00 00 00 00 00 00 00
>          res 41/50:03:00:08:00/00:00:00:00:00/a0 Emask 0x3 (HSM violation)
>  ata2.00: status: { DRDY ERR }
>  ata2: soft resetting link
>  ata2.00: configured for MWDMA2
>  ata2: EH complete
> }}}
> 
> Apart from that, the guest tends to hang every few days.

Well, this may be kernel related... Does it also happen with older/newer kernels?

Michal

Comment 28 Ioannis Aslanidis 2009-12-07 10:51:53 UTC

Seems to be doing it with all fedora 11 kernels I tried, including the last one. It may be related to bug #543947

Comment 29 Michal Novotny 2009-12-07 11:19:49 UTC

(In reply to comment #28)
> Seems to be doing it with all fedora 11 kernels I tried, including the last
> one. It may be related to bug #543947  

Well, I can't claim I understand that stuff well, but could you also try with F10 or F12 kernels? If this is not an issue on F10 and F12 kernels, it may be related to the bug you mentioned above...

Michal

Comment 30 Ioannis Aslanidis 2009-12-07 11:50:45 UTC

I can tell you for sure that it does not happen with Fedora 9 kernels. I did not try with Fedora 12 or Fedora 10.

Comment 31 Ioannis Aslanidis 2009-12-10 09:41:23 UTC

Any updates on this?

Comment 32 Michal Novotny 2009-12-10 16:10:33 UTC

Well, I did some testing with Fedora 8, 9 and Fedora 10 kernels (all 32-bit i386 guests) just to be sure, and this problem didn't occur on those guests: DRDY DRQ messages are present in the dmesg output, but not DRDY DRQ ERR ones. It seems to be related to BZ #543947. Also, I've not been able to install Fedora 12 again (there were some errors), and we need to be sure...

Michal

Comment 33 Michal Novotny 2009-12-10 17:29:31 UTC

Well, I managed to install a Fedora 12 32-bit guest and I saw no DRDY DRQ ERR errors, only DRDY DRQ messages, so it seems the problem is really related to bug #543947 because I saw no such issue on any guest other than Fedora 11.

Michal

Comment 43 Alan Lehman 2009-12-25 16:30:33 UTC

I am seeing this problem with Fedora 12 fully virtualized guest on RHEL 5.3.

kernel-2.6.18-164.6.1.el5xen
xen-3.0.3-94.el5_4.2

This string of errors is logged every few seconds whenever the guest is up:

Dec 24 19:05:24 web1 kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Dec 24 19:05:24 web1 kernel: ata2.00: ST_FIRST: DRQ=1 with device error, dev_stat 0x49
Dec 24 19:05:24 web1 kernel: ata2.00: cmd a0/00:00:00:24:00/00:00:00:00:00/a0 tag 0 pio 36 in
Dec 24 19:05:24 web1 kernel:         cdb 12 00 00 00 24 00 00 00  00 00 00 00 00 00 00 00
Dec 24 19:05:24 web1 kernel:         res 49/20:01:00:24:00/00:00:00:00:00/a0 Emask 0x2 (HSM violation)
Dec 24 19:05:24 web1 kernel: ata2.00: status: { DRDY DRQ ERR }
Dec 24 19:05:24 web1 kernel: ata2: soft resetting link
Dec 24 19:05:25 web1 kernel: ata2.00: configured for MWDMA2
Dec 24 19:05:25 web1 kernel: ata2: EH complete

Comment 44 Alan Lehman 2009-12-25 16:55:08 UTC

A little more info on my post above:

host hardware: Proliant DL365 Opteron 
I tried clocksource=acpi_pm, but it made no difference.

guest: 2.6.31.6-166.fc12.x86_64

Comment 45 Paolo Bonzini 2009-12-29 12:11:47 UTC

Alan, packages that fix this bug will be available shortly.

Comment 53 XinSun 2010-01-04 08:16:12 UTC

According to Comment #52, I checked this bug on xen-3.0.3-102.el5 and rhel5.4 for the x86_64 and i386 platforms:
1. (run xen enabled system)
2. virt-install -n F10 -r 512 -f F10.img -s 10 --vnc --hvm -c /root/Fedora-10-i386-DVD.iso
3. perform the default installation
4. reboot the guest
5. dmesg | grep ata2

After step 5, I get the following results:

ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata2.00: BMDMA stat 0x5
ata2.00: cmd a0/01:00:00:80:00/00:00:00:00:00/a0 tag 0 dma 16512 in
ata2.00: status: { DRDY DRQ }
ata2: soft resetting link
ata2.00: configured for MWDMA2
ata2: EH complete

These ata2 messages are about {DRDY DRQ}, not the original {DRDY ERR}. So this bug is fixed in xen-3.0.3-102.el5 and I am changing the bug's status to verified.
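
The check in step 5 can be made less error-prone by tallying the `status:` lines instead of reading them by eye. The following is a minimal Python sketch under stated assumptions: the function names `tally_status` and `has_original_error` are hypothetical, the regular expression assumes the log format shown in this report, and the input is the text of the guest's `dmesg` output saved by any means.

```python
import re
from collections import Counter

# Assumed line shape: "ata2.00: status: { DRDY DRQ }", optionally preceded
# by a syslog prefix such as "Dec 24 19:05:24 web1 kernel: ".
STATUS_RE = re.compile(r"(ata\d+\.\d+): status: \{ ([A-Z ]+?) \}")


def tally_status(dmesg_text):
    """Count occurrences of each (device, flags) pair in dmesg text."""
    counts = Counter()
    for match in STATUS_RE.finditer(dmesg_text):
        counts[(match.group(1), match.group(2))] += 1
    return counts


def has_original_error(counts):
    """True if any device still reports the originally reported { DRDY ERR }."""
    return any(flags == "DRDY ERR" for _device, flags in counts)


if __name__ == "__main__":
    sample = (
        "ata2.00: status: { DRDY DRQ }\n"
        "ata2: soft resetting link\n"
        "ata2.00: status: { DRDY DRQ }\n"
    )
    counts = tally_status(sample)
    for (device, flags), n in sorted(counts.items()):
        print(device, "{ %s }" % flags, n)
    print("original DRDY ERR still present:", has_original_error(counts))
```

The exact-match comparison is a deliberate choice: { DRDY DRQ ERR } is treated as a different symptom from { DRDY ERR }, in line with how the two were separated earlier in this thread.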

Comment 55 errata-xmlrpc 2010-03-30 08:59:22 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0294.html

Comment 56 Paolo Bonzini 2010-04-08 15:44:28 UTC

This bug was closed during 5.5 development and it's being removed from the internal tracking bugs (which are now for 5.6).