Bug 618097

Summary: kernel thread ata/1 consumes too much CPU time
Product: Red Hat Enterprise Linux 5 Reporter: Mark Wu <dwu>
Component: kernel Assignee: David Milburn <dmilburn>
Status: CLOSED NOTABUG QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: medium Docs Contact:
Priority: medium    
Version: 5.5 CC: cww, gsgatlin, jjneely, jwilson, tao
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-01-18 17:35:21 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
dmesg on -128 kernel which also works fine. none
dmesg on -194 kernel which has bad performance none

Description Mark Wu 2010-07-26 06:18:19 UTC
Description of problem:
Performance degraded on a Dell OptiPlex 740 after booting the 2.6.18-194 kernel. System response is very slow. The ata/1 process consumes too much CPU and is in R state all the time. The hald-addon-storage and scsi_eh_1 processes are also in D or R state most of the time.

The system works well with any of the following workarounds:

1. Reverting to the 2.6.18-164 kernel.
2. Stopping hald's polling of the CD-ROM.
3. Booting with the acpi=off option.

They found that both CD-ROM models, GSA-H73N and DH-16A6S, have this problem.
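
The R and D states and the CPU time described above can be sampled from /proc/<pid>/stat. The following is only a minimal sketch of one way to do that, not something taken from this report: the names task_sample, parse_stat_line and read_task_sample are hypothetical, and the field positions (state = field 3, utime = field 14, stime = field 15) are assumed from the proc(5) layout.

```c
#include <stdio.h>
#include <string.h>

/* Fields of interest from one /proc/<pid>/stat line (hypothetical helper type). */
struct task_sample {
    char state;            /* R, S, D, ... (field 3) */
    unsigned long utime;   /* user-mode clock ticks (field 14) */
    unsigned long stime;   /* kernel-mode clock ticks (field 15) */
};

/* Parse one stat line. Returns 0 on success, -1 on malformed input. */
static int parse_stat_line(const char *line, struct task_sample *out)
{
    /* comm is wrapped in parentheses and may contain spaces,
     * so start parsing after the last ')'. */
    const char *p = strrchr(line, ')');
    if (p == NULL)
        return -1;

    /* state, then skip fields 4..13, then utime and stime. */
    int n = sscanf(p + 1,
                   " %c %*d %*d %*d %*d %*d %*lu %*lu %*lu %*lu %*lu %lu %lu",
                   &out->state, &out->utime, &out->stime);
    return n == 3 ? 0 : -1;
}

/* Read and parse /proc/<pid>/stat. Returns 0 on success, -1 on any error. */
static int read_task_sample(int pid, struct task_sample *out)
{
    char path[64];
    char line[1024];

    snprintf(path, sizeof(path), "/proc/%d/stat", pid);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fgets(line, sizeof(line), f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return parse_stat_line(line, out);
}
```

Calling read_task_sample() twice for the ata/1 PID, a few seconds apart, and comparing stime would show how much kernel-mode CPU time the thread accumulated in that interval.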

Version-Release number of selected component (if applicable):
kernel - 2.6.18-194 

How reproducible:


Steps to Reproduce:
1. Boot the -194 kernel (with kernel parameter "hda=ide-scsi" and without "acpi=off").

  
Actual results:


Expected results:


Additional info:
A recursive "diff" between the ata drivers of -164 and -194 shows no change related to this issue.
# diff -r ata-194/ ata-164/
diff -r ata-194/ahci.c ata-164/ahci.c
481d480
< { PCI_VDEVICE(INTEL, 0x3a22), board_ahci }, /* ICH10 */
505,510d503
< /* AMD */
< { PCI_VDEVICE(AMD, 0x7800), board_ahci }, /* AMD Hudson-2 */
< /* AMD is using RAID class only for ahci controllers */
< { PCI_VENDOR_ID_AMD, PCI_ANY_ID, PCI_ANY_ID, PCI_ANY_ID,
<  PCI_CLASS_STORAGE_RAID << 8, 0xffffff, board_ahci },
<
2295,2297c2288,2289
< if ((pdev->vendor == PCI_VENDOR_ID_AMD && pdev->device == 0x7800) ||
<    (pdev->vendor == PCI_VENDOR_ID_ATI &&
<     (pdev->device == 0x4380 || pdev->device == 0x4390))) {
---
> if (pdev->vendor == PCI_VENDOR_ID_ATI &&
>    (pdev->device == 0x4380 || pdev->device == 0x4390)) {
diff -r ata-194/pata_atiixp.c ata-164/pata_atiixp.c
255d254
< { PCI_VDEVICE(AMD, PCI_DEVICE_ID_AMD_HUDSON2_IDE), },

Comment 1 Mark Wu 2010-07-26 06:33:36 UTC
crash> bt 443
PID: 443    TASK: ffff81007f5fe860  CPU: 1   COMMAND: "ata/1"
 #0 [ffff81000a6faf20] crash_nmi_callback at ffffffff8007bf44
 #1 [ffff81000a6faf40] do_nmi at ffffffff8006688a
 #2 [ffff81000a6faf50] nmi at ffffffff80065eef
    [exception RIP: __delay+6]
    RIP: ffffffff8000c9f2  RSP: ffff81007ef5be18  RFLAGS: 00000287
    RAX: 00000000000619a4  RBX: ffff81007eed8000  RCX: 00000000083fe723
    RDX: 000000000000010d  RSI: 0000000000000282  RDI: 00000000029094f0
    RBP: 0000000000000005   R8: ffff81007ef5a000   R9: ffff81007eed8000
    R10: ffff81007ed721d8  R11: ffffffff8807aeb3  R12: ffff81007eed80e0
    R13: 0000000000000282  R14: ffff81007eed8000  R15: ffffffff880c6511
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #3 [ffff81007ef5be18] __delay at ffffffff8000c9f2
 #4 [ffff81007ef5be18] ata_pio_task at ffffffff880c6565
 #5 [ffff81007ef5be38] run_workqueue at ffffffff8004dc37
 #6 [ffff81007ef5be78] worker_thread at ffffffff8004a562
 #7 [ffff81007ef5bee8] kthread at ffffffff80032bdc
 #8 [ffff81007ef5bf48] kernel_thread at ffffffff8005efb1

void ata_pio_task(void *_data)
{
	struct ata_port *ap = _data;
	struct ata_queued_cmd *qc = ap->port_task_data;
	u8 status;
	int poll_next;

fsm_start:
	WARN_ON(ap->hsm_task_state == HSM_ST_IDLE);

	/*
	 * This is purely heuristic.  This is a fast path.
	 * Sometimes when we enter, BSY will be cleared in
	 * a chk-status or two.  If not, the drive is probably seeking
	 * or something.  Snooze for a couple msecs, then
	 * chk-status again.  If still busy, queue delayed work.
	 */
	status = ata_sff_busy_wait(ap, ATA_BUSY, 5);
	if (status & ATA_BUSY) {
		msleep(2);
		status = ata_sff_busy_wait(ap, ATA_BUSY, 10);
		if (status & ATA_BUSY) {
			ata_pio_queue_task(ap, qc, ATA_SHORT_PAUSE);
			return;
		}
	}
...
}
It seems that kernel thread ata/1 is waiting for the ATA_BUSY bit to be cleared. top shows ata/1 consuming too much CPU, so it may be spending a lot of time in the busy wait.
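
To make the cost of that polling pattern concrete, here is a small user-space model of the loop quoted above. It is only a sketch under stated assumptions, not the libata code: sim_port, sim_read_status, sim_busy_wait, sim_pio_pass and sim_run_until_ready are hypothetical names, the device is simulated by a counter, and msleep() and the re-queue are replaced by counters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SIM_ATA_BUSY 0x80u  /* BSY bit of the ATA status register */

/* Hypothetical simulated port: reports BSY for the first busy_polls reads. */
struct sim_port {
    unsigned busy_polls;  /* status reads left before BSY clears */
    unsigned spin_reads;  /* status reads done while spinning (CPU burned) */
    unsigned sleeps;      /* times the poller slept (stands in for msleep(2)) */
    unsigned requeues;    /* times the work was re-queued for a later pass */
};

static uint8_t sim_read_status(struct sim_port *p)
{
    p->spin_reads++;
    if (p->busy_polls > 0) {
        p->busy_polls--;
        return SIM_ATA_BUSY;
    }
    return 0x50;  /* DRDY | DSC, BSY clear */
}

/* Poll the status up to max times; return the last status seen.
 * The real helper also delays between reads, which is where __delay
 * shows up in the backtrace; the model only counts reads. */
static uint8_t sim_busy_wait(struct sim_port *p, unsigned max)
{
    uint8_t status;

    assert(max > 0);
    do {
        status = sim_read_status(p);
        max--;
    } while ((status & SIM_ATA_BUSY) && max > 0);
    return status;
}

/* One pass of the heuristic: short spin, sleep, longer spin, else re-queue.
 * Returns true if BSY cleared, false if the pass gave up and re-queued. */
static bool sim_pio_pass(struct sim_port *p)
{
    uint8_t status = sim_busy_wait(p, 5);

    if (status & SIM_ATA_BUSY) {
        p->sleeps++;
        status = sim_busy_wait(p, 10);
        if (status & SIM_ATA_BUSY) {
            p->requeues++;
            return false;
        }
    }
    return true;
}

/* Repeat passes until the device is ready; returns the number of passes. */
static unsigned sim_run_until_ready(struct sim_port *p)
{
    unsigned passes = 1;

    while (!sim_pio_pass(p))
        passes++;
    return passes;
}
```

In this model a device that clears BSY after 3 status reads finishes in a single pass with no sleep, while one that stays busy for 100 reads needs 7 passes, 6 re-queues and 101 status reads. A device that never clears BSY would keep the worker cycling through spin, sleep and re-queue indefinitely, which would be consistent with ata/1 staying in R state.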

Comment 2 Mark Wu 2010-07-26 06:35:40 UTC
vmcore is available at megatron.gsslab.rdu.redhat.com:/cores/20100518081635/work

Comment 3 Mark Wu 2010-07-26 06:40:01 UTC
Specifying the option "noacpi=1" for the libata module still results in slow response.

A similar issue was reported in https://bugzilla.redhat.com/show_bug.cgi?id=468027#c49

Comment 4 David Milburn 2010-08-03 18:35:01 UTC
Would you please attach your dmesg output after a successful -164.el5 boot
using the kernel parameter "hda=ide-scsi"?

Comment 5 Mark Wu 2010-08-04 06:02:30 UTC
Created attachment 436442 [details]
dmesg on -128 kernel which also works fine.

Currently we only have sosreports for the -194 and -128 kernels. The system in question also works fine with the -128 kernel. I am going to collect dmesg on the -164 kernel from the customer.

Comment 6 Mark Wu 2010-08-04 06:41:26 UTC
Created attachment 436452 [details]
dmesg on -194 kernel which has bad performance

Comment 7 Gary Gatling 2010-08-30 13:48:19 UTC
With respect to the OptiPlex 740 workstation: it seems that updating
the BIOS to the latest version (version 2.2.5) fixes the issue with
ata/1 consuming too much CPU. I hit this bug when upgrading from 32-bit
to 64-bit RHEL 5. It's easy to confuse this with bug 586532, so make sure
you have the enable_msi=0 workaround in /etc/modprobe.conf and the
latest BIOS, and this problem goes away.

Comment 8 Chris Williams 2012-01-18 17:35:21 UTC
It's been almost a year and a half since this BZ was updated. The customer issue is closed. Closing the BZ as NOTABUG. If this is still an issue, please open a case with Red Hat Support via the Customer Portal.