190309 – kernel randomly loses all interrupts from my UHCI or SATA controllers.

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	5
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:	bzcl34nup
Depends On:
Blocks:

Reported:	2006-05-01 00:11 UTC by Nicholas Miell
Modified:	2008-05-06 15:51 UTC
CC List:	4 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2008-05-06 15:51:45 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
lspci and dmidecode output (23.23 KB, text/plain) 2006-05-01 00:11 UTC, Nicholas Miell
dmesg output (16.84 KB, text/plain) 2006-05-01 00:27 UTC, Nicholas Miell
updated dmesg, this time with apic=debug (19.31 KB, text/plain) 2006-05-03 20:05 UTC, Nicholas Miell
dmesg output (17.50 KB, text/plain) 2006-11-15 21:56 UTC, John Holmstadt

Description Nicholas Miell 2006-05-01 00:11:34 UTC

Description of problem:

Randomly (but often -- it's about once per day at this point), I'll stop getting
any interrupts (the values in /proc/interrupts stop incrementing) from either my
SATA controller or my UHCI controllers (or, occasionally, both at once).
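The symptom described above (counters in /proc/interrupts that stop incrementing) can be checked mechanically. The following is a minimal sketch, not part of the original report: it assumes the usual /proc/interrupts layout (a header row of CPU columns, then one row per IRQ), and the function names are made up for illustration. A counter that does not advance may also just mean the device was idle, so this only flags candidates.

```python
def parse_interrupts(text):
    """Return {irq_label: total count across CPUs} from /proc/interrupts text."""
    counts = {}
    lines = text.splitlines()
    if not lines:
        return counts
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue
        total = 0
        # Only the first ncpus fields can be counters; stop at the first
        # non-numeric field (rows such as ERR: carry fewer columns).
        for field in rest.split()[:ncpus]:
            if not field.isdigit():
                break
            total += int(field)
        counts[label.strip()] = total
    return counts


def stalled_irqs(before, after, watch):
    """Return the IRQ labels in `watch` whose counters did not advance."""
    return [irq for irq in watch
            if irq in before and irq in after and after[irq] == before[irq]]
```

To use it, read /proc/interrupts twice with a delay in between while the device is known to be busy, parse both samples, and pass the IRQ labels of interest (for example "16" for sata_via on the reporter's machine) as `watch`.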

If I remove and reload the driver modules or use sysfs to unbind and then bind
the devices, they immediately start working again.
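The sysfs unbind/bind workaround mentioned above can be sketched as follows. This is an illustration under assumptions, not the reporter's actual procedure: it relies on the standard `unbind` and `bind` files under /sys/bus/<bus>/drivers/<driver>/, the PCI address in the usage note is hypothetical, and the `sysfs_root` parameter exists only so the function can be exercised against a scratch directory.

```python
import os


def rebind(driver, device, bus="pci", sysfs_root="/sys"):
    """Unbind `device` from `driver`, then bind it again, via sysfs.

    `device` is the bus address as it appears under the driver directory.
    Writing to the real sysfs files requires root.
    """
    driver_dir = os.path.join(sysfs_root, "bus", bus, "drivers", driver)
    for action in ("unbind", "bind"):
        with open(os.path.join(driver_dir, action), "w") as f:
            f.write(device)
```

For example, `rebind("sata_via", "0000:00:0f.0")` would detach and reattach a SATA controller at that (made-up) address. Unbinding the controller that holds the root filesystem is risky, so this is best tried on a device that can be lost without harm.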

Version-Release number of selected component (if applicable):
kernel-2.6.16-1.2096_FC5, but it's been happening off and on for months now.

How reproducible: Random.

Steps to Reproduce: Unknown.

Comment 1 Nicholas Miell 2006-05-01 00:11:35 UTC

Created attachment 128428
lspci and dmidecode output

Comment 2 Nicholas Miell 2006-05-01 00:27:12 UTC

Created attachment 128430
dmesg output

You'll probably notice I'm running a custom kernel.

This kernel is identical to the stock 2.6.16-1.2096_FC5 kernel except it also
includes the following patch (which didn't make any difference w.r.t. this
bug):

http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=75cf7456dd87335f574dcd53c4ae616a2ad71a11;hp=f01f4182597a3bb4b6fbf92e041faf7a1016f4b6

Comment 3 Nicholas Miell 2006-05-03 20:05:41 UTC

Created attachment 128564
updated dmesg, this time with apic=debug

I forgot to mention originally that using noapic seems to make this problem go
away.

I also noticed that disabling all my GL screensavers (but not DRI entirely)
seems to make the problem go away as well. No clue what that means.

Comment 4 Nicholas Miell 2006-05-06 21:07:30 UTC

Still happens with kernel-2.6.16-1.2111_FC5, and not using GL seems to merely
make the problem less likely to happen rather than not happen at all.

Comment 5 Nicholas Miell 2006-06-05 01:42:02 UTC

Still happens with kernel-2.6.16-1.2122_FC5.

When I'm experiencing interrupt loss, crash's irq command says the interrupts
are actually enabled. However, the following systemtap script fixes the problem
instantly:

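/* Embedded C requires guru mode: run with "stap -g". */
function unmask(irq:long) %{
        /* Hard-coded address of unmask_IO_APIC_irq in
           kernel-2.6.16-1.2122_FC5.x86_64; valid only for that exact build. */
        void (*unmaskP)(int) = (void(*)(int)) 0xffffffff80119dc0;
        unmaskP(THIS->irq);
%}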

probe begin {
        unmask(16); /* sata_via gets irq 16 */
        exit();
}

This leads me to believe that interrupts are getting masked at the IOAPIC level
without the kernel's knowledge. (Or else the kernel is screwing up and masking
them at the IOAPIC level without disabling them at the software level.)
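For anyone comparing IO-APIC state before and after a stall, the mask bit can be read out of a redirection table entry directly. The sketch below decodes the low 32 bits of an entry using the bit layout from the 82093AA IOAPIC datasheet (vector in bits 0-7, mask in bit 16). The function name and the sample value in the usage note are made up for illustration; they are not taken from the attached dmesg output.

```python
def decode_rte_low(low):
    """Decode the low 32 bits of an IO-APIC redirection table entry."""
    return {
        "vector": low & 0xFF,                       # bits 0-7
        "delivery_mode": (low >> 8) & 0x7,          # bits 8-10
        "dest_mode_logical": bool(low & (1 << 11)),
        "delivery_pending": bool(low & (1 << 12)),  # delivery status
        "active_low": bool(low & (1 << 13)),        # pin polarity
        "remote_irr": bool(low & (1 << 14)),
        "level_triggered": bool(low & (1 << 15)),
        "masked": bool(low & (1 << 16)),            # 1 = interrupt masked
    }
```

With a hypothetical entry value of 0x0001A0A9, `decode_rte_low(0x0001A0A9)["masked"]` is true, while the same entry with bit 16 cleared (0x0000A0A9) decodes as unmasked. An entry that shows up as masked while the kernel still reports the IRQ as enabled would match the situation described above.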

Comment 6 Dave Jones 2006-10-16 21:55:14 UTC

A new kernel update has been released (Version: 2.6.18-1.2200.fc5)
based upon a new upstream kernel release.

Please retest against this new kernel, as a large number of patches
go into each upstream release, possibly including changes that
may address this problem.

This bug has been placed in NEEDINFO state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks' time, it will be closed.

Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.

In the last few updates, some users upgrading from FC4->FC5
have reported that installing a kernel update has left their
systems unbootable. If you have been affected by this problem
please check you only have one version of device-mapper & lvm2
installed.  See bug 207474 for further details.

If this bug is a problem preventing you from installing the
release this version is filed against, please see bug 169613.

If this bug has been fixed, but you are now experiencing a different
problem, please file a separate bug for the new problem.

Thank you.

Comment 7 Nicholas Miell 2006-10-18 20:19:13 UTC

Still happens with kernel-2.6.18-1.2200.fc5

Comment 8 Nicholas Miell 2006-10-18 20:23:18 UTC

Hey, there's new error messages now:

ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.00: (BMDMA stat 0x4)
ata1.00: tag 0 cmd 0xca Emask 0x4 stat 0x40 err 0x0 (timeout)
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata1: soft resetting port
ata1.00: qc timeout (cmd 0xec)
ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata1.00: revalidation failed (errno=-5)
ata1: failed to recover some devices, retrying in 5 secs
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata1: soft resetting port
ata1.00: qc timeout (cmd 0xec)
ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata1.00: revalidation failed (errno=-5)
ata1: failed to recover some devices, retrying in 5 secs
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata1: soft resetting port
ata1.00: qc timeout (cmd 0xec)
ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata1.00: revalidation failed (errno=-5)
ata1.00: disabled
ata1: EH complete
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 61863
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 218871695
Buffer I/O error on device dm-2, logical block 27358906
lost page write due to I/O error on dm-2
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 218895999
Buffer I/O error on device dm-2, logical block 27361944
lost page write due to I/O error on dm-2
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 189010383
EXT3-fs error (device dm-2): ext3_find_entry: reading directory #11815289 offset 0
Aborting journal on device dm-2.
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 12807
Buffer I/O error on device dm-2, logical block 1545
lost page write due to I/O error on dm-2
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 61903
Buffer I/O error on device dm-2, logical block 7682
lost page write due to I/O error on dm-2
journal commit I/O error
ext3_abort called.
EXT3-fs error (device dm-2): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 447
Buffer I/O error on device dm-2, logical block 0
lost page write due to I/O error on dm-2
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 218932055
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 218932055
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 218932087
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 218932087

Which is irritating, actually -- with the older kernels it would block
indefinitely, and I could manually unmask the IRQ with a grotesque SystemTap
hack and everything would work fine. Now I have to catch it before it times out
and things go horribly wrong.
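The repeated "return code = 0x00040000" lines in the log above can be split into their component bytes. The sketch below assumes the conventional Linux SCSI result layout (status, message, host and driver bytes, lowest byte first) and the host byte names from the kernel's SCSI headers as I understand them for kernels of this era. The function names are made up for illustration.

```python
import re

# Host byte values; assumed to match the kernel's DID_* definitions.
HOST_BYTE_NAMES = {
    0x00: "DID_OK", 0x01: "DID_NO_CONNECT", 0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT", 0x04: "DID_BAD_TARGET", 0x05: "DID_ABORT",
    0x06: "DID_PARITY", 0x07: "DID_ERROR", 0x08: "DID_RESET",
    0x09: "DID_BAD_INTR",
}


def decode_scsi_result(code):
    """Split a 32-bit SCSI result into its four bytes."""
    host = (code >> 16) & 0xFF
    return {
        "status": code & 0xFF,
        "msg": (code >> 8) & 0xFF,
        "host": host,
        "driver": (code >> 24) & 0xFF,
        "host_name": HOST_BYTE_NAMES.get(host, "unknown"),
    }


def return_codes(log_text):
    """Extract every 'SCSI error: return code' value from kernel log text."""
    pattern = r"SCSI error: return code = (0x[0-9a-fA-F]+)"
    return [int(m, 16) for m in re.findall(pattern, log_text)]
```

Under that assumption, 0x00040000 has a host byte of 0x04 (DID_BAD_TARGET) and a zero status byte, which would be consistent with libata having already disabled ata1.00 by the time these requests failed.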

Comment 9 John Holmstadt 2006-11-15 21:56:07 UTC

Created attachment 141317
dmesg output

I'm getting something very similar on kernel 2.6.18-1.2239.fc5. This is using a
Silicon Image, Inc. SiI 3112 (sata_sil) controller. Everything works fine for a
while, then after copying ~24G to the drives (at least in this case), the
timeout errors come up and I eventually get repeated "I/O error in filesystem"
errors when XFS tries to write to the logical volume.

Comment 10 Shawn OReilly 2007-12-13 20:57:54 UTC

I am also seeing this trouble; I will gather logs and provide them upon the next
occurrence.

It seems to have cropped up since I went to a dual core processor (literally
upgraded the processor in-place; all other hardware is unchanged).

In this particular system I have five SATA controllers (three SIL3112 with the
latest firmware, one Promise 378, and one VIA). All controllers have seen this
problem once, and I lose the whole controller once it happens (I have not taken
steps to attempt to repair without a reboot; I have just kicked the box over).

Logs to follow.

- Shawn

Comment 11 Bug Zapper 2008-04-04 02:47:40 UTC

Fedora apologizes that these issues have not been resolved yet. We're
sorry it's taken so long for your bug to be properly triaged and acted
on. We appreciate the time you took to report this issue and want to
make sure no important bugs slip through the cracks.

If you're currently running a version of Fedora Core between 1 and 6,
please note that Fedora no longer maintains these releases. We strongly
encourage you to upgrade to a current Fedora release. In order to
refocus our efforts as a project we are flagging all of the open bugs
for releases which are no longer maintained and closing them.
http://fedoraproject.org/wiki/LifeCycle/EOL

If this bug is still open against Fedora Core 1 through 6, thirty days
from now, it will be closed 'WONTFIX'. If you can reproduce this bug in
the latest Fedora version, please change to the respective version. If
you are unable to do this, please add a comment to this bug requesting
the change.

Thanks for your help, and we apologize again that we haven't handled
these issues to this point.

The process we are following is outlined here:
http://fedoraproject.org/wiki/BugZappers/F9CleanUp

We will be following the process here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping to ensure this
doesn't happen again.

And if you'd like to join the bug triage team to help make things
better, check out http://fedoraproject.org/wiki/BugZappers

Comment 12 Bug Zapper 2008-05-06 15:51:44 UTC

This bug is open for a Fedora version that is no longer maintained and
will not be fixed by Fedora. Therefore we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora, please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.
