Description of problem:
Kernel hangs when doing I/O on ext4 while also doing I/O on a snapshot. LVM is stored on MD RAID5. It did not even take much I/O for this to happen, but my hardware is admittedly old/junk.

Version-Release number of selected component (if applicable):
2.6.32-71.18.2.el6.x86_64

How reproducible:
Encountered once so far. The machine is in production; will try to reproduce if needed. In the meantime we have stopped taking snapshots while there is I/O on the filesystem.

Steps to Reproduce:
1. mdadm --create --level=5 --raid-devices=3 --metadata=1.1 --bitmap=internal
2. pvcreate / vgcreate / lvcreate
3. mkfs -t ext4
4. copy bootstrap data for rsync (a bunch of trees, a couple million files)
5. wait for a few days of changes to the tree (backups of ~50 machines)
6. start several parallel "rsync -aHXA remote:/foo/ /path/to/lv/" on gigabit
7. while the parallel rsync backups are ongoing: lvcreate -s -n temporary -l 50%free
8. rsync using the snapshot as src, push to a different host, also gigabit

Actual results:
Kernel hangs all disk I/O but otherwise continues to run. The system disks are on different PVs and a different controller, but shells also appeared to hang as soon as there was any I/O to those. Could not get in with a new shell, but the kernel was running and I got good dumps from the serial console.

Expected results:
Kernel should continue to run with no problem regardless of how much I/O there is. In this case it did not take very much I/O, only a few megabytes, before it hung, so now we are scared to take a snapshot and read from it.

Additional info:
See attached.
Created attachment 496368 sysrq backtrace
Created attachment 496369 sysrq state dump
Created attachment 496370 sysrq cpu register dump
Created attachment 496371 sysrq memory dump
Created attachment 496372 sysrq blocked task dump
Created attachment 496373 sysrq timers output
Created attachment 496374 lspci verbose
Created attachment 496375 kernel init messages
Created attachment 496378 hung kernel syslog dump
Created attachment 496379 system boot messages
Created attachment 496381 /proc/ioports
Created attachment 496382 /proc/iomem
Created attachment 496383 /proc/cpuinfo
Created attachment 496384 /proc/interrupts
I have what appears to be nearly identical hardware (Tyan S2881), but I'm running the sata controller in "Ultra" rather than "RAID" mode - no sata_sil24 device. I'm unable to install EL6 because the io system locks up during initial formatting of partitions on the drive.
Note that, while I am running the onboard SATA (an SI3114 in JBOD mode, not RAID), it is for system disks, which appeared to still be functioning at the time of the crash (syslog appeared to be flushed). The lockup occurred on a device with an add-in card (SI3124 PCI-X) also with a JBOD
Since RHEL 6.2 External Beta has begun, and this bug remains unresolved, it has been rejected as it is not proposed as exception or blocker. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.
Almost everything is in congestion_wait or io_schedule, except for this:

kdmflush      D ffff88007fc23480     0  3625      2 0x00000080
 ffff8800f3c65ba0 0000000000000046 0000000000000000 ffff8800b5fedcc0
 ffff8800b5fedcc0 ffff8800f103b840 00000000df3b5f00 000000011bb3c9cf
 ffff8800f362a678 ffff8800f3c65fd8 0000000000010518 ffff8800f362a678
Call Trace:
 [<ffffffff813d0755>] md_make_request+0x85/0x230
 [<ffffffff81156a79>] ? ____cache_alloc_node+0x99/0x160
 [<ffffffff81091df0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff81241882>] generic_make_request+0x1b2/0x4f0
 [<ffffffff811a2a02>] ? bio_alloc_bioset+0xb2/0xf0
 [<ffffffffa00016bd>] __map_bio+0xad/0x130 [dm_mod]
 [<ffffffffa00033ec>] __split_and_process_bio+0x46c/0x630 [dm_mod]
 [<ffffffff81013c8e>] ? apic_timer_interrupt+0xe/0x20
 [<ffffffff81013c8e>] ? apic_timer_interrupt+0xe/0x20
 [<ffffffffa00039b1>] dm_wq_work+0x161/0x200 [dm_mod]
 [<ffffffffa0003850>] ? dm_wq_work+0x0/0x200 [dm_mod]
 [<ffffffff8108c720>] worker_thread+0x170/0x2a0
 [<ffffffff81091df0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8108c5b0>] ? worker_thread+0x0/0x2a0
 [<ffffffff81091a86>] kthread+0x96/0xa0
 [<ffffffff810141ca>] child_rip+0xa/0x20
 [<ffffffff810919f0>] ? kthread+0x0/0xa0
 [<ffffffff810141c0>] ? child_rip+0x0/0x20
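For anyone triaging a similar blocked-task dump, the filtering described above (skip the tasks sitting in congestion_wait or io_schedule, look at what is left) can be sketched in a few lines of Python. This is only a rough sketch: it assumes the dump uses the 2.6.32-style layout seen in the kdmflush entry, the names `parse_tasks`, `unusual_tasks` and `SAMPLE` are made up for illustration, and the `example_task` entry in the sample is synthetic, not taken from the attachments.

```python
import re

# Task header as printed by sysrq on 2.6.32, e.g.
# "kdmflush  D ffff88007fc23480  0  3625  2 0x00000080"
TASK_RE = re.compile(
    r"(\S+)\s+([A-Z])\s+[0-9a-f]{16}\s+\d+\s+(\d+)\s+\d+\s+0x[0-9a-f]+\s*$"
)
# Stack frame, e.g. "[<ffffffff813d0755>] md_make_request+0x85/0x230".
# A leading "?" marks an unreliable frame; those are skipped below.
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\]\s+(\?\s+)?([\w.]+)\+0x")

# The two wait functions that almost every blocked task was sitting in.
COMMON_WAITS = {"congestion_wait", "io_schedule"}


def parse_tasks(dump):
    """Split a sysrq task dump into tasks with their reliable frames."""
    tasks = []
    current = None
    for line in dump.splitlines():
        header = TASK_RE.search(line)
        if header:
            current = {
                "name": header.group(1),
                "state": header.group(2),
                "pid": int(header.group(3)),
                "frames": [],
            }
            tasks.append(current)
            continue
        frame = FRAME_RE.search(line)
        if frame and current is not None and frame.group(1) is None:
            current["frames"].append(frame.group(2))
    return tasks


def unusual_tasks(tasks):
    """Return tasks with no common wait function among their reliable frames.

    Note that a task with no parsed frames at all is also returned.
    """
    return [t for t in tasks if not COMMON_WAITS & set(t["frames"])]


# First entry: synthetic placeholder (NOT from the attachments).
# Second entry: abbreviated kdmflush trace from this report.
SAMPLE = """\
example_task  D 0000000000000000     0     1      0 0x00000000
Call Trace:
 [<0000000000000000>] io_schedule+0x0/0x0
kdmflush      D ffff88007fc23480     0  3625      2 0x00000080
Call Trace:
 [<ffffffff813d0755>] md_make_request+0x85/0x230
 [<ffffffff81156a79>] ? ____cache_alloc_node+0x99/0x160
 [<ffffffff81241882>] generic_make_request+0x1b2/0x4f0
 [<ffffffffa00016bd>] __map_bio+0xad/0x130 [dm_mod]
"""

for task in unusual_tasks(parse_tasks(SAMPLE)):
    # Prints: kdmflush (pid 3625): md_make_request
    print(f"{task['name']} (pid {task['pid']}): {task['frames'][0]}")
```

Frames prefixed with "?" are ignored because the kernel marks them as unreliable stack contents, so only frames the unwinder is sure about decide whether a task counts as an ordinary waiter.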
This is not likely to be an ext4 or jbd2 issue, but rather an MD issue. Is this reproducible on a recent RHEL6 kernel? Jes, can you take a look at this? Thanks! -Lukas
There has been no update from bug reporter to the NEEDINFO for 3 months, so closing. If this happens again against a recent RHEL, please feel free to open a new bugzilla. Jes
Sorry, just noticed this. No longer have access to the hardware.