Bug 701447

Summary:

sata_sil24 -> md raid5 -> dm snapshot -> ext4 jbd -> hung I/O

Product:

Red Hat Enterprise Linux 6

Reporter:

scott.m.mcdermott+rh-omnisys-bugs

Component:

kernel

Assignee:

Red Hat Kernel Manager <kernel-mgr>

Status:

CLOSED INSUFFICIENT_DATA

QA Contact:

Red Hat Kernel QE team <kernel-qe>

Severity:

high

Docs Contact:

Priority:

unspecified

Version:

6.3

CC:

esandeen, Jes.Sorensen, lczerner, msnitzer, orion, scott.m.mcdermott+rh-omnisys-bugs, syeghiay

Target Milestone:

Target Release:

---

Hardware:

x86_64

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2014-01-08 12:27:36 UTC

Type:

---

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
sysrq backtrace	none
sysrq state dump	none
sysrq cpu register dump	none
sysrq memory dump	none
sysrq blocked task dump	none
sysrq timers output	none
lspci verbose	none
kernel init messages	none
hung kernel syslog dump	none
system boot messages	none
/proc/ioports	none
/proc/iomem	none
/proc/cpuinfo	none
/proc/interrupts	none

Description scott.m.mcdermott+rh-omnisys-bugs 2011-05-02 21:11:34 UTC

Description of problem:

    Kernel hangs when doing I/O on ext4 while also
    doing I/O on a snapshot.  LVM is stored on MD RAID5.
    It did not take much I/O for this to happen, although
    my hardware is admittedly old.

Version-Release number of selected component (if applicable):

    2.6.32-71.18.2.el6.x86_64

How reproducible:

    Encountered once so far.  The machine is in production;
    will try to reproduce if needed.  In the meantime we have
    stopped taking snapshots while there is I/O on the filesystem.

Steps to Reproduce:

    1. mdadm --create --level=5 --raid-devices=3 --metadata=1.1 --bitmap=internal
    2. pvcreate / vgcreate / lvcreate
    3. mkfs -t ext4
    4. copy bootstrap data for rsync (a bunch of trees, a couple million files)
    5. wait for a few days of changes to the tree (backups of ~50 machines)
    6. start several parallel "rsync -aHXA remote:/foo/ /path/to/lv/" on gigabit
    7. while parallel rsync backups are ongoing: lvcreate -s -n temporary -l 50%free
    8. rsync using the snapshot as src, push to a different host, also on gigabit
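
The storage stack from steps 1-3 and the snapshot from step 7 can be sketched as a dry-run command plan. This is only an illustration, not the reporter's exact procedure: the member devices, the array name, the volume group and volume names, and the origin volume size are all hypothetical placeholders, since the report does not give them. The script only builds and prints the commands; it does not run them.

```python
import shlex

# Hypothetical placeholders: the report does not name any of these.
MEMBERS = ["/dev/sdX1", "/dev/sdY1", "/dev/sdZ1"]
MD_DEV = "/dev/md0"
VG = "vg_backup"
LV = "lv_backup"


def setup_commands():
    """Commands for steps 1-3: RAID5 array, LVM on top of it, ext4 filesystem."""
    return [
        ["mdadm", "--create", MD_DEV, "--level=5", "--raid-devices=3",
         "--metadata=1.1", "--bitmap=internal", *MEMBERS],
        ["pvcreate", MD_DEV],
        ["vgcreate", VG, MD_DEV],
        # Origin size is an assumption; some space must stay free for the snapshot.
        ["lvcreate", "-n", LV, "-l", "50%VG", VG],
        ["mkfs", "-t", "ext4", f"/dev/{VG}/{LV}"],
    ]


def snapshot_command():
    """Command for step 7: snapshot taken while the rsync writers are running."""
    return ["lvcreate", "-s", "-n", "temporary", "-l", "50%FREE", f"{VG}/{LV}"]


if __name__ == "__main__":
    # Print the plan only; nothing is executed.
    for cmd in setup_commands() + [snapshot_command()]:
        print(shlex.join(cmd))
```

Steps 4-6 and 8 (seeding the data and running the parallel rsync jobs) depend on the site's backup layout and are left out.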

Actual results:

    Kernel hangs all disk I/O but otherwise continues to run.
    System disks are on different PVs and a different controller,
    but shells also appeared to hang as soon as there was any
    I/O to those.  Could not get in with a new shell, but the
    kernel was running and I got good dumps from the serial console.

Expected results:

    Kernel should continue to run with no problem regardless
    of how much I/O there is.  In this case it did not take
    much I/O, only a few megabytes, before it hung, so now we
    are scared to take a snapshot and read from it.

Additional info:

    See attached.

Comment 2 scott.m.mcdermott+rh-omnisys-bugs 2011-05-02 21:13:19 UTC

Created attachment 496368 [details]
sysrq backtrace

Comment 3 scott.m.mcdermott+rh-omnisys-bugs 2011-05-02 21:13:57 UTC

Created attachment 496369 [details]
sysrq state dump

Comment 4 scott.m.mcdermott+rh-omnisys-bugs 2011-05-02 21:14:44 UTC

Created attachment 496370 [details]
sysrq cpu register dump

Comment 5 scott.m.mcdermott+rh-omnisys-bugs 2011-05-02 21:15:28 UTC

Created attachment 496371 [details]
sysrq memory dump

Comment 6 scott.m.mcdermott+rh-omnisys-bugs 2011-05-02 21:16:15 UTC

Created attachment 496372 [details]
sysrq blocked task dump

Comment 7 scott.m.mcdermott+rh-omnisys-bugs 2011-05-02 21:18:21 UTC

Created attachment 496373 [details]
sysrq timers output

Comment 8 scott.m.mcdermott+rh-omnisys-bugs 2011-05-02 21:19:01 UTC

Created attachment 496374 [details]
lspci verbose

Comment 9 scott.m.mcdermott+rh-omnisys-bugs 2011-05-02 21:19:53 UTC

Created attachment 496375 [details]
kernel init messages

Comment 10 scott.m.mcdermott+rh-omnisys-bugs 2011-05-02 21:21:13 UTC

Created attachment 496378 [details]
hung kernel syslog dump

Comment 11 scott.m.mcdermott+rh-omnisys-bugs 2011-05-02 21:22:15 UTC

Created attachment 496379 [details]
system boot messages

Comment 12 scott.m.mcdermott+rh-omnisys-bugs 2011-05-02 21:23:54 UTC

Created attachment 496381 [details]
/proc/ioports

Comment 13 scott.m.mcdermott+rh-omnisys-bugs 2011-05-02 21:24:24 UTC

Created attachment 496382 [details]
/proc/iomem

Comment 14 scott.m.mcdermott+rh-omnisys-bugs 2011-05-02 21:25:08 UTC

Created attachment 496383 [details]
/proc/cpuinfo

Comment 15 scott.m.mcdermott+rh-omnisys-bugs 2011-05-02 21:25:33 UTC

Created attachment 496384 [details]
/proc/interrupts

Comment 16 Orion Poplawski 2011-05-17 19:32:21 UTC

I have what appears to be nearly identical hardware (Tyan S2881), but I'm running the SATA controller in "Ultra" rather than "RAID" mode, so there is no sata_sil24 device.  I'm unable to install EL6 because the I/O system locks up during initial formatting of partitions on the drive.

Comment 17 scott.m.mcdermott+rh-omnisys-bugs 2011-05-18 04:53:39 UTC

Note that, while I am running the onboard SATA (an
SI3114 in JBOD mode, not RAID), it is for system disks,
which appeared to still be functioning at the time of
the crash (syslog appeared to be flushed).

The lockup occurred on a device attached to an add-in card
(SI3124 PCI-X), also in JBOD mode.

Comment 18 RHEL Program Management 2011-10-07 15:33:26 UTC

Since RHEL 6.2 External Beta has begun, and this bug remains
unresolved, it has been rejected as it is not proposed as
exception or blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.

Comment 19 Eric Sandeen 2013-02-26 16:19:49 UTC

Almost everything is in congestion_wait or io_schedule, except for this:

kdmflush      D ffff88007fc23480     0  3625      2 0x00000080
 ffff8800f3c65ba0 0000000000000046 0000000000000000 ffff8800b5fedcc0
 ffff8800b5fedcc0 ffff8800f103b840 00000000df3b5f00 000000011bb3c9cf
 ffff8800f362a678 ffff8800f3c65fd8 0000000000010518 ffff8800f362a678
Call Trace:
 [<ffffffff813d0755>] md_make_request+0x85/0x230
 [<ffffffff81156a79>] ? ____cache_alloc_node+0x99/0x160
 [<ffffffff81091df0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff81241882>] generic_make_request+0x1b2/0x4f0
 [<ffffffff811a2a02>] ? bio_alloc_bioset+0xb2/0xf0
 [<ffffffffa00016bd>] __map_bio+0xad/0x130 [dm_mod]
 [<ffffffffa00033ec>] __split_and_process_bio+0x46c/0x630 [dm_mod]
 [<ffffffff81013c8e>] ? apic_timer_interrupt+0xe/0x20
 [<ffffffff81013c8e>] ? apic_timer_interrupt+0xe/0x20
 [<ffffffffa00039b1>] dm_wq_work+0x161/0x200 [dm_mod]
 [<ffffffffa0003850>] ? dm_wq_work+0x0/0x200 [dm_mod]
 [<ffffffff8108c720>] worker_thread+0x170/0x2a0
 [<ffffffff81091df0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8108c5b0>] ? worker_thread+0x0/0x2a0
 [<ffffffff81091a86>] kthread+0x96/0xa0
 [<ffffffff810141ca>] child_rip+0xa/0x20
 [<ffffffff810919f0>] ? kthread+0x0/0xa0
 [<ffffffff810141c0>] ? child_rip+0x0/0x20
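
A triage pass like this one (finding the task that is not waiting where the others are) can be approximated with a small script over the sysrq blocked task dump. The sketch below is a hypothetical helper, not a tool used in this bug: it assumes the RHEL 6 layout shown above (a task header line with state D, followed by "Call Trace:" frames) and reports the first frame that is not marked "?" for each blocked task. The embedded sample is an excerpt of the trace above.

```python
import re
from collections import Counter

# Task header, e.g. "kdmflush      D ffff88007fc23480     0  3625      2 0x00000080"
TASK_RE = re.compile(
    r"(\S+)\s+D\s+[0-9a-f]{16}\s+\d+\s+(\d+)\s+\d+\s+0x[0-9a-f]+\s*$")
# Frame, e.g. "[<ffffffff813d0755>] md_make_request+0x85/0x230"
# Group 1 is set when the frame carries the "?" (unreliable) marker.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")

SAMPLE = """\
kdmflush      D ffff88007fc23480     0  3625      2 0x00000080
 ffff8800f3c65ba0 0000000000000046 0000000000000000 ffff8800b5fedcc0
Call Trace:
 [<ffffffff813d0755>] md_make_request+0x85/0x230
 [<ffffffff81156a79>] ? ____cache_alloc_node+0x99/0x160
 [<ffffffff81241882>] generic_make_request+0x1b2/0x4f0
"""


def top_frames(dump_text):
    """Return (task, pid, function) for the first reliable frame of each D-state task."""
    results = []
    current = None
    for line in dump_text.splitlines():
        task = TASK_RE.search(line)
        if task:
            current = (task.group(1), int(task.group(2)))
            continue
        frame = FRAME_RE.search(line)
        if frame and current and not frame.group(1):
            results.append((current[0], current[1], frame.group(2)))
            current = None  # only the first reliable frame per task is kept
    return results


def summarize(dump_text):
    """Count how many blocked tasks share each top frame."""
    return Counter(func for _, _, func in top_frames(dump_text))


if __name__ == "__main__":
    for func, count in summarize(SAMPLE).most_common():
        print(count, func)
```

On a full dump, functions that appear only once in the summary are the candidates worth reading by hand. A real tool would probably also skip generic scheduler frames before picking the top function; this sketch does not.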

Comment 20 Lukáš Czerner 2013-10-17 16:56:14 UTC

This is not likely to be an ext4 or jbd2 issue, but rather an MD issue. Is this reproducible on a recent RHEL6 kernel? Jes, can you take a look at this?

Thanks!
-Lukas

Comment 21 Jes Sorensen 2014-01-08 12:27:36 UTC

There has been no update from the bug reporter to the NEEDINFO for 3 months,
so closing.

If this happens again on a recent RHEL release, please feel free to open a new
bugzilla.

Jes

Comment 22 scott.m.mcdermott+rh-omnisys-bugs 2014-07-01 00:52:50 UTC

Sorry, just noticed this.  No longer have access to the hardware.