Bug 701447

Summary: sata_sil24 -> md raid5 -> dm snapshot -> ext4 jbd -> hung I/O
Product: Red Hat Enterprise Linux 6
Component: kernel
Version: 6.3
Hardware: x86_64
OS: Linux
Severity: high
Priority: unspecified
Status: CLOSED INSUFFICIENT_DATA
Reporter: scott.m.mcdermott+rh-omnisys-bugs
Assignee: Red Hat Kernel Manager <kernel-mgr>
QA Contact: Red Hat Kernel QE team <kernel-qe>
CC: esandeen, Jes.Sorensen, lczerner, msnitzer, orion, scott.m.mcdermott+rh-omnisys-bugs, syeghiay
Target Milestone: rc
Doc Type: Bug Fix
Last Closed: 2014-01-08 12:27:36 UTC
Attachments:
    sysrq backtrace
    sysrq state dump
    sysrq cpu register dump
    sysrq memory dump
    sysrq blocked task dump
    sysrq timers output
    lspci verbose
    kernel init messages
    hung kernel syslog dump
    system boot messages
    /proc/ioports
    /proc/iomem
    /proc/cpuinfo
    /proc/interrupts

Description scott.m.mcdermott+rh-omnisys-bugs 2011-05-02 21:11:34 UTC
Description of problem:

    Kernel hangs when doing I/O on ext4 while also doing
    I/O on a snapshot of the same LV.  The LVM volume is
    stored on MD RAID5.  It did not take much I/O for this
    to happen, though admittedly my hardware is old junk.

Version-Release number of selected component (if applicable):

    2.6.32-71.18.2.el6.x86_64

How reproducible:

    Encountered once so far; the machine is in production,
    but I will try to reproduce if needed.  In the meantime
    we have stopped taking snapshots while there is I/O on
    the filesystem.

Steps to Reproduce:

    1. mdadm --create --level=5 --raid-devices=3 --metadata=1.1 --bitmap=internal
    2. pvcreate / vgcreate / lvcreate
    3. mkfs -t ext4
    4. copy bootstrap data for rsync (a bunch of trees, a couple million files)
    5. wait for a few days of changes to tree (backups of ~50 machines)
    6. start several parallel "rsync -aHXA remote:/foo/ /path/to/lv/" on gigabit
    7. while the parallel rsync backups are ongoing: lvcreate -s -n temporary -l 50%free
    8. rsync using the snapshot as src, push to a different host, also gigabit
       (full sequence sketched below)
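
    A rough, hypothetical sketch of the sequence above (device
    names, VG/LV names, and paths are placeholders, not taken
    from this report):

        mdadm --create /dev/md0 --level=5 --raid-devices=3 \
              --metadata=1.1 --bitmap=internal /dev/sdb1 /dev/sdc1 /dev/sdd1
        pvcreate /dev/md0
        vgcreate vg_backup /dev/md0
        lvcreate -n lv_backup -l 100%FREE vg_backup
        mkfs -t ext4 /dev/vg_backup/lv_backup
        mount /dev/vg_backup/lv_backup /path/to/lv

        # seed the trees, let a few days of backup churn accumulate, then:
        rsync -aHXA remote:/foo/ /path/to/lv/ &        # several in parallel
        lvcreate -s -n temporary -l 50%FREE /dev/vg_backup/lv_backup
        mount -o ro /dev/vg_backup/temporary /mnt/snap
        rsync -aHXA /mnt/snap/ otherhost:/backup/      # push from the snapshot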

Actual results:

    The kernel hangs all disk I/O but otherwise continues
    to run.  The system disks are on different PVs and a
    different controller, but shells also appeared to hang
    as soon as they did any I/O.  Could not log in with a
    new shell, but the kernel was still running; good dumps
    were captured from the serial console.
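
    For reference, a minimal sketch of how such dumps can be
    captured once the box is hung (assumes magic sysrq is
    enabled; the same keys can be sent over the serial
    console):

        echo 1 > /proc/sys/kernel/sysrq   # enable all sysrq functions
        echo t > /proc/sysrq-trigger      # backtrace of all tasks
        echo w > /proc/sysrq-trigger      # blocked (D-state) task dump
        echo m > /proc/sysrq-trigger      # memory usage dump
        # output goes to the kernel log (serial console, dmesg,
        # /var/log/messages)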

Expected results:

    The kernel should continue to run without problems
    regardless of how much I/O is issued.  In this case it
    hung after only a few megabytes of I/O, so we are now
    wary of taking a snapshot and reading from it.

Additional info:

    See attached.

Comment 2 scott.m.mcdermott+rh-omnisys-bugs 2011-05-02 21:13:19 UTC
Created attachment 496368 [details]
sysrq backtrace

Comment 3 scott.m.mcdermott+rh-omnisys-bugs 2011-05-02 21:13:57 UTC
Created attachment 496369 [details]
sysrq state dump

Comment 4 scott.m.mcdermott+rh-omnisys-bugs 2011-05-02 21:14:44 UTC
Created attachment 496370 [details]
sysrq cpu register dump

Comment 5 scott.m.mcdermott+rh-omnisys-bugs 2011-05-02 21:15:28 UTC
Created attachment 496371 [details]
sysrq memory dump

Comment 6 scott.m.mcdermott+rh-omnisys-bugs 2011-05-02 21:16:15 UTC
Created attachment 496372 [details]
sysrq blocked task dump

Comment 7 scott.m.mcdermott+rh-omnisys-bugs 2011-05-02 21:18:21 UTC
Created attachment 496373 [details]
sysrq timers output

Comment 8 scott.m.mcdermott+rh-omnisys-bugs 2011-05-02 21:19:01 UTC
Created attachment 496374 [details]
lspci verbose

Comment 9 scott.m.mcdermott+rh-omnisys-bugs 2011-05-02 21:19:53 UTC
Created attachment 496375 [details]
kernel init messages

Comment 10 scott.m.mcdermott+rh-omnisys-bugs 2011-05-02 21:21:13 UTC
Created attachment 496378 [details]
hung kernel syslog dump

Comment 11 scott.m.mcdermott+rh-omnisys-bugs 2011-05-02 21:22:15 UTC
Created attachment 496379 [details]
system boot messages

Comment 12 scott.m.mcdermott+rh-omnisys-bugs 2011-05-02 21:23:54 UTC
Created attachment 496381 [details]
/proc/ioports

Comment 13 scott.m.mcdermott+rh-omnisys-bugs 2011-05-02 21:24:24 UTC
Created attachment 496382 [details]
/proc/iomem

Comment 14 scott.m.mcdermott+rh-omnisys-bugs 2011-05-02 21:25:08 UTC
Created attachment 496383 [details]
/proc/cpuinfo

Comment 15 scott.m.mcdermott+rh-omnisys-bugs 2011-05-02 21:25:33 UTC
Created attachment 496384 [details]
/proc/interrupts

Comment 16 Orion Poplawski 2011-05-17 19:32:21 UTC
I have what appears to be nearly identical hardware (Tyan S2881), but I'm running the SATA controller in "Ultra" rather than "RAID" mode - no sata_sil24 device.  I'm unable to install EL6 because the I/O system locks up during initial formatting of partitions on the drive.

Comment 17 scott.m.mcdermott+rh-omnisys-bugs 2011-05-18 04:53:39 UTC
Note that, while I am running the onboard SATA (an
SI3114 in JBOD mode, not RAID), it is for the system
disks, which appeared to still be functioning at the
time of the crash (syslog appeared to be flushed).

The lockup occurred on a device attached to an add-in
card (SI3124 PCI-X), also in JBOD mode.

Comment 18 RHEL Program Management 2011-10-07 15:33:26 UTC
Since the RHEL 6.2 External Beta has begun, and this bug
remains unresolved, it has been rejected because it was
not proposed as an exception or blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.

Comment 19 Eric Sandeen 2013-02-26 16:19:49 UTC
Almost everything is in congestion_wait or io_schedule, except for this:

kdmflush      D ffff88007fc23480     0  3625      2 0x00000080
 ffff8800f3c65ba0 0000000000000046 0000000000000000 ffff8800b5fedcc0
 ffff8800b5fedcc0 ffff8800f103b840 00000000df3b5f00 000000011bb3c9cf
 ffff8800f362a678 ffff8800f3c65fd8 0000000000010518 ffff8800f362a678
Call Trace:
 [<ffffffff813d0755>] md_make_request+0x85/0x230
 [<ffffffff81156a79>] ? ____cache_alloc_node+0x99/0x160
 [<ffffffff81091df0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff81241882>] generic_make_request+0x1b2/0x4f0
 [<ffffffff811a2a02>] ? bio_alloc_bioset+0xb2/0xf0
 [<ffffffffa00016bd>] __map_bio+0xad/0x130 [dm_mod]
 [<ffffffffa00033ec>] __split_and_process_bio+0x46c/0x630 [dm_mod]
 [<ffffffff81013c8e>] ? apic_timer_interrupt+0xe/0x20
 [<ffffffff81013c8e>] ? apic_timer_interrupt+0xe/0x20
 [<ffffffffa00039b1>] dm_wq_work+0x161/0x200 [dm_mod]
 [<ffffffffa0003850>] ? dm_wq_work+0x0/0x200 [dm_mod]
 [<ffffffff8108c720>] worker_thread+0x170/0x2a0
 [<ffffffff81091df0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8108c5b0>] ? worker_thread+0x0/0x2a0
 [<ffffffff81091a86>] kthread+0x96/0xa0
 [<ffffffff810141ca>] child_rip+0xa/0x20
 [<ffffffff810919f0>] ? kthread+0x0/0xa0
 [<ffffffff810141c0>] ? child_rip+0x0/0x20
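
For reference, a quick way to spot tasks stuck like this on a live system (a minimal sketch using standard procps/proc interfaces, not taken from the attached dumps):

    # list tasks in uninterruptible sleep (D state) and what they wait on
    ps -eo pid,stat,wchan:32,comm | awk 'NR == 1 || $2 ~ /^D/'
    # kernel stack of one task, e.g. the kdmflush pid above
    # (needs root and a kernel with /proc/<pid>/stack support)
    cat /proc/3625/stack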

Comment 20 Lukáš Czerner 2013-10-17 16:56:14 UTC
This is not likely to be an ext4 or jbd2 issue, but rather an MD issue.  Is this reproducible on a recent RHEL 6 kernel?  Jes, can you take a look at this?

Thanks!
-Lukas

Comment 21 Jes Sorensen 2014-01-08 12:27:36 UTC
There has been no update from the bug reporter to the NEEDINFO
request for 3 months, so closing.

If this happens again against a recent RHEL release, please feel
free to open a new bugzilla.

Jes

Comment 22 scott.m.mcdermott+rh-omnisys-bugs 2014-07-01 00:52:50 UTC
Sorry, just noticed this.  No longer have access to the hardware.