Bug 652086 - IO (writes?) hang indefinitely
Summary: IO (writes?) hang indefinitely
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.0
Hardware: x86_64
OS: Linux
Priority: low
Severity: high
Target Milestone: rc
Assignee: Red Hat Kernel Manager
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2010-11-10 23:40 UTC by Travers Carter
Modified: 2013-02-28 07:09 UTC
CC List: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-02-28 07:09:27 UTC
Target Upstream Version:
Embargoed:


Attachments
 * The output of dmesg after the problem starts (101.31 KB, text/plain) - 2010-11-10 23:40 UTC, Travers Carter
 * The output of iostat -x 1 2 after the problem starts (5.48 KB, text/plain) - 2010-11-10 23:42 UTC, Travers Carter
 * The output of ps aufxww after the problem starts (37.16 KB, text/plain) - 2010-11-10 23:43 UTC, Travers Carter
 * The output of ls -l /dev/mapper (871 bytes, text/plain) - 2010-11-10 23:44 UTC, Travers Carter
 * The output of cat /proc/mdstat (739 bytes, text/plain) - 2010-11-10 23:44 UTC, Travers Carter

Description Travers Carter 2010-11-10 23:40:22 UTC
Created attachment 459582 [details]
The output of dmesg after the problem starts

Description of problem:
I have a KVM host with 14 guests which periodically appears to suffer from IO requests getting stuck in the queue; once the initial request gets stuck, the load goes up to ~120 or higher and several tasks end up stuck in the D (uninterruptible sleep) state.  These tasks never recover without power-cycling the box (shutdown hangs and fails to complete).
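A minimal sketch of inspecting such stuck tasks on the host (not part of the original attachments; the sysrq step assumes sysrq is enabled):

  # List tasks in uninterruptible sleep (state D) and the kernel function
  # each one is blocked in:
  ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'
  # Ask the kernel to dump the stacks of all blocked tasks into dmesg:
  echo 1 > /proc/sys/kernel/sysrq     # only needed if sysrq is disabled
  echo w > /proc/sysrq-trigger
  dmesg | tail -n 200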


Version-Release number of selected component (if applicable):
kernel-2.6.32-44.2.el6.x86_64

How reproducible:
Unsure of the trigger conditions, but it seems to occur around 1-2 times per week

Steps to Reproduce:
Unsure, but it seems to happen when several of the KVM guests are heavily loaded.
  
Actual results:
Individual guests become completely unresponsive and the only solution is to power cycle the host.

Expected results:
I/O shouldn't hang.

Additional info:

The disk layout on the system is 6 x SAS drives arranged as two 3-drive RAID-5 groups using md, with an additional RAID-0 md across the two groups and LVM on top.  The KVM guests are a mix of raw image files on an ext4 filesystem and some directly on LVs (all in the same VG).
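For illustration, a rough sketch of how such an md/LVM stack could be assembled; all device, VG, and LV names here are hypothetical and not taken from the attachments:

  mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sda1 /dev/sdb1 /dev/sdc1
  mdadm --create /dev/md1 --level=5 --raid-devices=3 /dev/sdd1 /dev/sde1 /dev/sdf1
  mdadm --create /dev/md2 --level=0 --raid-devices=2 /dev/md0 /dev/md1
  pvcreate /dev/md2
  vgcreate vg_guests /dev/md2
  lvcreate -L 20G  -n guest01 vg_guests   # example LV used directly by a guest
  lvcreate -L 200G -n images  vg_guests   # example LV holding the ext4 image store
  mkfs.ext4 /dev/vg_guests/images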

The host is an IBM x3650M2 with 60G of physical RAM and 2 x Xeon E5520 Quad Core CPUs.

I have attached the output of
 * dmesg
 * ps aufxww
 * iostat -x 1 2
 * ls -l /dev/mapper
 * cat /proc/mdstat

Comment 2 Travers Carter 2010-11-10 23:42:56 UTC
Created attachment 459583 [details]
The output of iostat -x 1 2 after the problem starts

After the problem starts, the avgqu-sz column doesn't change at all on any of the devices showing 100% utilisation.
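A minimal cross-check, with hypothetical device names: for a genuinely hung device the block layer's in-flight counters stay non-zero and never change between samples.

  for d in sda md2 dm-3; do
      printf '%s: ' "$d"
      cat /sys/block/$d/inflight    # two columns: in-flight reads, in-flight writes
  done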

Comment 3 Travers Carter 2010-11-10 23:43:42 UTC
Created attachment 459584 [details]
The output of ps aufxww after the problem starts

Comment 4 Travers Carter 2010-11-10 23:44:13 UTC
Created attachment 459585 [details]
The out of ls -l /dev/mapper

Comment 5 Travers Carter 2010-11-10 23:44:41 UTC
Created attachment 459586 [details]
The output of cat /proc/mdstat

Comment 6 Travers Carter 2010-11-16 22:24:12 UTC
This problem still appears to be present in kernel-2.6.32-71.7.1.el6.x86_64

Comment 7 Travers Carter 2010-11-24 01:23:08 UTC
The problem appears to be triggered by an interaction with irqbalance: after stopping irqbalance the system has not run into an I/O deadlock for the past week now, whereas it was typically happening at least once a week previously and sometimes more.

This [http://xen.1045712.n5.nabble.com/Fix-the-occasional-xen-blkfront-deadlock-when-irqbalancing-td2644296.html] looks like a vaguely similar symptom affecting the Xen block frontend; could there be a similar problem with virtio?

Most of the guest VMs on this server are using virtio.
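For reference, a hedged sketch of stopping irqbalance on RHEL 6 and checking how the block/virtio interrupts are distributed afterwards (the IRQ number and driver names below are examples only):

  service irqbalance stop          # stop it for the current boot
  chkconfig irqbalance off         # keep it from starting on the next boot
  grep -iE 'virtio|ahci|mpt|megasas' /proc/interrupts   # per-CPU interrupt counts
  cat /proc/irq/24/smp_affinity    # CPU affinity mask of a given IRQ (24 is an example)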

Comment 8 Travers Carter 2010-11-24 22:29:39 UTC
Spoke too soon, hit the problem again, so disabling irqbalance doesn't solve it (although it may reduce the probability).

Comment 9 Travers Carter 2010-12-26 04:47:55 UTC
The problem also occurs in vanilla 2.6.36, reported upstream at https://bugzilla.kernel.org/show_bug.cgi?id=25632

Comment 10 RHEL Program Management 2011-01-07 04:22:31 UTC
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated
in the current release, Red Hat is unfortunately unable to
address this request at this time. Red Hat invites you to
ask your support representative to propose this request, if
appropriate and relevant, in the next release of Red Hat
Enterprise Linux. If you would like it considered as an
exception in the current release, please ask your support
representative.

Comment 11 Suzanne Logcher 2011-01-07 16:08:05 UTC
This request was erroneously denied for the current release of Red Hat
Enterprise Linux.  The error has been fixed and this request has been
re-proposed for the current release.

Comment 12 Stanislaw Gruszka 2011-01-12 22:55:24 UTC
Please install kernel-debug and run it. It should print more precise information about where the problem is.
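A minimal sketch of doing that on RHEL 6; the exact version string of the installed debug kernel will differ:

  yum install kernel-debug
  # The debug kernel is added as a separate grub entry ending in ".debug";
  # either pick it at boot or make it the default, for example:
  grubby --set-default=/boot/vmlinuz-2.6.32-44.2.el6.x86_64.debug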

Comment 13 Travers Carter 2011-01-13 22:36:43 UTC
I am planning to revert from vanilla 2.6.36 back to the RHEL kernel at the next reboot window and will switch to kernel-debug then. However, as I reported upstream (in the kernel Bugzilla report mentioned a few posts ago), moving the VM disk images off of the ext4 filesystem and onto individual LVs directly appears to have worked around the problem - I haven't seen it in over two weeks now.
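For completeness, a rough sketch of that workaround; the VG, LV, guest, and path names are hypothetical, and it assumes the guest is shut down and its image is a raw file:

  lvcreate -L 20G -n guest01 vg_guests
  dd if=/var/lib/libvirt/images/guest01.img of=/dev/vg_guests/guest01 bs=1M
  # Point the guest's disk at /dev/vg_guests/guest01 (e.g. via `virsh edit guest01`)
  # and delete the old image file once the guest boots cleanly from the LV.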

Comment 14 RHEL Program Management 2011-02-01 05:53:21 UTC
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated
in the current release, Red Hat is unfortunately unable to
address this request at this time. Red Hat invites you to
ask your support representative to propose this request, if
appropriate and relevant, in the next release of Red Hat
Enterprise Linux. If you would like it considered as an
exception in the current release, please ask your support
representative.

Comment 15 RHEL Program Management 2011-02-01 18:51:29 UTC
This request was erroneously denied for the current release of
Red Hat Enterprise Linux.  The error has been fixed and this
request has been re-proposed for the current release.

Comment 16 RHEL Program Management 2011-04-04 02:26:52 UTC
Since RHEL 6.1 External Beta has begun, and this bug remains
unresolved, it has been rejected as it is not proposed as an
exception or blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.

Comment 17 RHEL Program Management 2011-10-07 15:17:22 UTC
Since RHEL 6.2 External Beta has begun, and this bug remains
unresolved, it has been rejected as it is not proposed as an
exception or blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.

Comment 18 Jes Sorensen 2013-02-27 08:02:02 UTC
Hi,

Is this still a problem, or did it get resolved with the changes made earlier?

Thanks,
Jes

Comment 19 Travers Carter 2013-02-27 22:28:12 UTC
I'm not sure if the bug itself still exists, but it hasn't been a problem for me since switching from files (on ext4) to raw LV devices for the guest images (and I've also switched back to the RHEL kernels).

Unfortunately I can't really experiment with the system to confirm whether the problem still shows up when using image files, as it's now in production.

Comment 20 Jes Sorensen 2013-02-28 07:09:27 UTC
Thanks for the update.

Since you are not suffering from this problem with the change of setup and
I am not seeing others report similar issues, I am going to close it for now.

If you see something else in the future, please go ahead and open a new BZ.

Thanks,
Jes

