721127 – Heavy disk I/O (MD RAID?) crashes or freezes Fedora 15

Summary: Heavy disk I/O (MD RAID?) crashes or freezes Fedora 15

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	16
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2011-07-13 19:31 UTC by Paul Flinders
Modified:	2014-03-06 10:52 UTC
CC List:	35 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2012-08-07 12:32:14 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
Computer 1 (7.25 KB, text/plain) 2012-05-10 12:55 UTC, G. Michael Carter
Computer 2 (5.45 KB, text/plain) 2012-05-10 12:59 UTC, G. Michael Carter

Description Paul Flinders 2011-07-13 19:31:21 UTC

Description of problem:
Heavy disk I/O crashes or freezes Fedora 15, possibly MD RAID specific.

Just installed Fedora 15. Clean install to a separate partition on a spare HD with no problems. Set out to back up approx. 1.3TB of data (mostly camcorder footage) to a new 2TB hard drive but experienced freezes or reboots after a few tens of seconds with anything that caused heavy disk I/O, such as resize2fs or rsync.

Unfortunately no OOPS or any other diagnostic to attach, either the X window session froze with a blank screen requiring a reboot or more often the machine simply rebooted.

I have successfully copied the data in F14 (kernel 2.6.36-1), and am about 90 minutes into verifying the copy (with that much data the chances of an error somewhere become rather more likely - it'll take a few hours more to compare the whole tree), but both the original copy and an attempt to verify the copy crash in F15 very quickly. I've tried both 2.6.38.6-26, which was the original F15 kernel, and 2.6.38.8-35, which is the current one.

The filesystem in question is an md RAID5 of 4 Samsung drives with an ext4 filesystem. All the drives are fairly new (18 months).

I'm not sure that this is RAID specific though - the machine also crashed during a "yum install" of some stuff which would have gone to the system partition on a separate Hitachi Deskstar T7K500 - this drive is a bit older but, again, shows no problems with smartctl.

The destination drive is a new WD20EARS

This pretty much makes F15 unusable

Version-Release number of selected component (if applicable):
Kernels 2.6.38.6-26 and (through?) 2.6.38.8-35 affected

How reproducible:
100%

Steps to Reproduce:
1. Install Fedora 15
2. Copy large filesystem

Actual results:
Reboots & freezes after a few 10s of seconds.

Expected results:
Data copied without reboots or freezes

Additional info:
Platform: Core i7 920 (2.67GHz Bloomfield, not O/C'd), x86_64 kernel versions as above, Asus P6TSE with 12GB DDR3 DRAM, md RAID5 (4x Samsung HD103SJ, all clean with no errors according to smartctl) plus a 2TB Western Digital drive acting as temporary backup while I reconfigure the RAID partition, and a 320GB Hitachi for system files.

Comment 1 risticmiroslav 2011-08-02 22:24:06 UTC

Same here, but with a single-HDD configuration. Whenever a large file is copied, both to the HDD and to a USB thumb drive, high I/O occurs and the system freezes (is unusable) until the copying is finished. No crash (not yet), though.

Comment 2 Mace Moneta 2011-08-03 23:49:33 UTC

Same thing.  It happens on any long running sustained I/O.  SATA, USB, or Firewire drives all experience the same problem.  It even happens when copying with 'ionice -c3'.  I started using rsync with '--bwlimit' to about half the normal throughput to work around the issue.

Only X hangs, I can ssh in, and don't see any issues in /var/log/messages or dmesg.  I do note that one core is running at 100% when the problem occurs, vs. about 10% when I use rsync bwlimit to throttle the throughput.

Same problem on kernel-2.6.40-4.fc15.x86_64 as well.
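
As a rough illustration of the throttling workaround described above, the sketch below builds an rsync command limited to half of a measured transfer rate. The function name, paths, and the example rate are hypothetical; `--bwlimit` takes its value in KB/s. `ionice -c3` is left out because, per this comment, it did not help.

```shell
#!/bin/sh
# Sketch only: the function name and the example paths/rate are hypothetical.
# Prints (does not run) an rsync command throttled to half a measured rate.
# The rate is given in KB/s, the unit rsync's --bwlimit expects.
throttled_rsync_cmd() {
    src="$1"; dest="$2"; full_kbps="$3"
    limit=$((full_kbps / 2))          # "about half the normal throughput"
    echo "rsync -a --bwlimit=$limit $src $dest"
}

# Example: a drive that normally sustains about 80000 KB/s
throttled_rsync_cmd /data/ /mnt/backup/ 80000
# prints: rsync -a --bwlimit=40000 /data/ /mnt/backup/
```

Printing the command rather than running it keeps the sketch safe to try; the right limit depends on the device and has to be found by experiment.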

Comment 3 Elad Alfassa 2011-08-16 17:27:51 UTC

Same here.



-- 
Fedora Bugzappers volunteer triage team
https://fedoraproject.org/wiki/BugZappers

Comment 4 Mace Moneta 2011-08-20 02:29:13 UTC

I was able to easily recreate the problem, by copying a large amount of data (gigabytes) to a USB attached flash card.  After a few minutes, the user interface froze.  I ssh'd in from another machine and noticed high iowait, but little actual I/O taking place.  I switched all the devices from the CFQ scheduler to the deadline scheduler, and the problem immediately cleared.  With deadline, I can't recreate it anymore.  If nothing else, this should be a workaround for those with the problem.  You can switch on the fly for each device with:

echo "deadline" > /sys/block/sdX/queue/scheduler

or boot with kernel option elevator=deadline
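
A minimal sketch of applying the workaround above to every `sd*` device at once follows. The function names and the `SYSFS_BLOCK` override variable are illustrative assumptions, not part of any existing tool. Writing to the real `/sys/block` requires root, and a change made this way does not survive a reboot.

```shell
#!/bin/sh
# Sketch only: function names and SYSFS_BLOCK are illustrative assumptions.
# SYSFS_BLOCK lets the functions be pointed at a test directory instead of /sys/block.
SYSFS_BLOCK="${SYSFS_BLOCK:-/sys/block}"

# Print the active scheduler for one device, i.e. the name shown in [brackets].
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p' "$SYSFS_BLOCK/$1/queue/scheduler"
}

# Switch every sd* device to the given scheduler (default: deadline).
set_scheduler_all() {
    sched="${1:-deadline}"
    for f in "$SYSFS_BLOCK"/sd*/queue/scheduler; do
        [ -w "$f" ] || continue       # skip if missing or not writable (needs root)
        echo "$sched" > "$f"
    done
}
```

Running `active_scheduler sda` before and after `set_scheduler_all deadline` shows whether the switch took effect on that device.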

Comment 5 Tommy He 2011-10-05 14:52:43 UTC

Confirmed this issue is still reproducible on Fedora 16 Beta with kernel 3.1.0-0.rc8.git0.1.fc16.i686.

Not good.

This is really a serious issue. It made backing up my data and switching to a GPT partition table far less smooth than it should be.

Comment 6 Peter C 2011-10-05 21:30:09 UTC

Same issue here on 2.6.40.3-0.fc15.x86_64.

At first I thought it was just copying to USB devices (cf. https://bugzilla.redhat.com/show_bug.cgi?id=734516), but the issue also occurs when copying between two internal SATA drives.  Really annoying...

Comment 7 Tommy He 2011-10-06 01:04:13 UTC

(In reply to comment #4)
> I was able to easily recreate the problem, by copying a large amount of data
> (gigabytes) to a USB attached flash card.  After a few minutes, the user
> interface froze.  I ssh'd in from another machine and noticed high iowait, but
> little actual I/O taking place.  I switched all the devices from the CFQ
> scheduler to the deadline scheduler, and the problem immediately cleared.  With
> deadline, I can't recreate it anymore.  If nothing else, this should be a
> workaround for those with the problem.  You can switch on the fly for each
> device with:
> 
> echo "deadline" > /sys/block/sdX/queue/scheduler
> 
> or boot with kernel option elevator=deadline

For me, or at least on 3.1.0-0.rc8.git0.1.fc16.i686, switching to the deadline scheduler only reduces the chance of a freeze. I still hit two freezes when copying between an internal drive and a USB-connected external hard drive. Both have ext4 file systems.

So I guess the cause lies somewhere else.

Comment 8 Elad Alfassa 2011-10-18 18:02:08 UTC

Deadline didn't really help in my case, which was copying large files to a USB storage device.
My USB storage device was formatted with NTFS. I reformatted it to ext4, and that seems to have fixed the problem.

Comment 9 Tommy He 2011-10-21 11:26:29 UTC

(In reply to comment #8)
> deadline didn't really help in my case. my case was copying large files to a
> usb storage device.
> My usb storage device was formatted with NTFS. I re-formatted it to ext4, and
> it seems to fix the problem.

My USB external hard drive is formatted with ext4. However, the problem still exists.

Comment 10 Ankur Sinha (FranciscoD) 2011-10-24 14:39:55 UTC

hello,

Seeing this on my up to date F16 box. I tried copying 7gigs to a flash drive using rsync, and my interface froze. Pull the flash disk out, and it starts working again.

[ankur@ankur ~]$ uname -a
Linux ankur.pc 3.1.0-0.rc10.git0.1.fc16.x86_64 #1 SMP Wed Oct 19 05:02:17 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux

Thanks,
Ankur

Comment 11 Emilio Scalise 2011-10-28 10:43:47 UTC

I think that https://bugzilla.redhat.com/show_bug.cgi?id=742802 is a duplicate of this bug.

I use Fedora 15 x86_64, and high I/O causes severe system freezes. Kernel version is 2.6.40.6-0.fc15.x86_64.
Today I was copying large files to a slow SD card (3 MB/s), and the whole system was mostly unresponsive.
So it doesn't matter which device you read from or write to.

That's very bad.
This could be also related to systemd/cgroup changes.

This problem should be treated as very high priority.

Comment 12 Tommy He 2011-10-29 13:26:42 UTC

This issue is reproducible on Fedora 16 RC with kernel 3.1.0-1.fc16.i686.

Copying a 7.5G file to a USB stick at a speed of 7MB/s.

System completely freezes and then the cooling fan starts being noisy. So I guess high CPU happened as well.

Had to long press the power button to force close. REISUB didn't help.

Comment 13 Magnus Tuominen 2011-10-30 18:18:19 UTC

System UI froze while copying a 1.5GB file from an external HD (ext4) to a USB key (vfat).

Linux verne 3.1.0-1.fc16.x86_64 #1 SMP Mon Oct 24 12:18:13 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux

Comment 14 Dave Jones 2011-10-31 16:36:38 UTC

There's not much to go on here without any backtraces.
Try running the kernel-debug kernel to see if that makes any additional output appear before the lockups.

Comment 15 Tommy He 2011-11-01 12:40:19 UTC

(In reply to comment #14)
> There's not much to go on here without any backtraces.
> Try running the kernel-debug kernel to see if that makes any additional output
> appear before the lockups.

Here are the lines in /var/log/messages before the freeze (20:18:50) and after I forced a power-off and restarted (20:21:42). I was copying a 7.5G file to an NTFS-formatted USB thumb drive. The freeze happened in less than two minutes:

Nov  1 20:18:33 localhost systemd-logind[924]: New session 6 of user lvp.
Nov  1 20:18:37 localhost gnome-session[3645]: DEBUG(+): GsmDBusClient: obj_path=/org/gnome/SessionManager interface=org.gnome.SessionManager method=IsInhibited
Nov  1 20:18:37 localhost gnome-session[3645]: DEBUG(+): GsmDBusClient: obj_path=/org/gnome/SessionManager interface=org.gnome.SessionManager method=IsInhibited
Nov  1 20:18:50 localhost dbus-daemon[940]: ** Message: No devices in use, exit
Nov  1 20:21:42 localhost systemd-tmpfiles[3850]: Successfully loaded SELinux database in 52ms 820us, size on heap is 363K.
Nov  1 20:24:42 localhost kernel: imklog 5.8.5, log source = /proc/kmsg started.
Nov  1 20:24:42 localhost rsyslogd: [origin software="rsyslogd" swVersion="5.8.5" x-pid="1032" x-info="http://www.rsyslog.com"] start
Nov  1 20:24:42 localhost kernel: [    0.000000] Initializing cgroup subsys cpuset
Nov  1 20:24:42 localhost kernel: [    0.000000] Initializing cgroup subsys cpu

I use kernel-debug-3.1.0-5.fc16.i686 in this case. The session 6 was started because the pulseaudio constantly crashed and resulted in high Gnome Shell CPU usage. So I performed the copy operation in VTE.

Let me know what other info I can provide.

Comment 16 Tommy He 2011-11-03 05:27:04 UTC

(In reply to comment #14)
> There's not much to go on here without any backtraces.
> Try running the kernel-debug kernel to see if that makes any additional output
> appear before the lockups.

Please instruct me how to collect the information you require to identify this issue.

Comment 17 Gilboa Davara 2011-12-02 05:17:11 UTC

Dave,

Would an oprofile dump do?
It will be a rather large dump...

- Gilboa

Comment 18 Emilio Scalise 2011-12-20 14:46:17 UTC

Still happens also on Fedora 16, x86_64 kernel 3.1.5-6.fc16.x86_64.

If you switch virtual console you can work on the other, but Xorg session is blocked by flush operations on the slow io device.

I was copying 10GB of files from a micro SD card (with USB adapter) and a Nexus S phone.

X session is not usable when copying files (copy started using dolphin file manager).

Comment 19 Paulo Fidalgo 2012-01-02 11:34:14 UTC

I can confirm the behaviour described in comment #18, although if I start htop from the virtual console it doesn't start. Every time this happens and I can start htop, I see that the CPU and memory use are low/normal.

Tested with 320GB external drive, 8 GB android phone (sdcard) and 2/4 GB flash drive.

Comment 20 Elad Alfassa 2012-01-02 11:52:03 UTC

After updating my computer's BIOS and switching off "legacy usb support" in the BIOS, I no longer see the problem.



-- 
Fedora Bugzappers volunteer triage team
https://fedoraproject.org/wiki/BugZappers

Comment 21 Emilio Scalise 2012-01-06 13:13:52 UTC

http://phoronix.com/forums/showthread.php?67502-Linux-3-2-Kernel-Officially-Christened&p=245623#post245623

A user says that the 3.2 kernel fixes the problem, I'm going to grab a 3.2 build!!!

Comment 22 Josh Boyer 2012-01-06 14:15:04 UTC

(In reply to comment #21)
> http://phoronix.com/forums/showthread.php?67502-Linux-3-2-Kernel-Officially-Christened&p=245623#post245623
> 
> an user says that 3.2 kernel fixes the problem, I'm going to grab a 3.2
> build!!!

http://koji.fedoraproject.org/koji/buildinfo?buildID=281207 would be the one I suggest if you want to try 3.2 out.  It's the final 3.2 release with the debugging options disabled.

Comment 23 Emilio Scalise 2012-01-06 16:17:30 UTC

Do I need to rebuild it?
It has been compiling for two hours, because it's building many kernel flavours.
I've done a simple rpmbuild --rebuild kernel-3.2.0-2.fc17.src.rpm

Comment 24 Josh Boyer 2012-01-06 16:28:49 UTC

(In reply to comment #23)
> do I need to rebuild it?
> It's compiling since two hours ago, because it's compiling many kernel
> flavours..
> I've done a simple rpmbuild --rebuild kernel-3.2.0-2.fc17.src.rpm

No... you should just be able to download the built RPMs directly from that link already.

Comment 25 Emilio Scalise 2012-01-07 20:35:38 UTC

I've built and installed that 3.2 kernel and it definitely fixes this bug for me.

That's great!

Comment 26 Ankur Sinha (FranciscoD) 2012-01-09 12:57:37 UTC

Hi Josh,

Will 3.2 please be made available to F16? This bug is a real pain, and F17 is a long way off :/

Thanks,
Ankur

Comment 27 Peter Robinson 2012-01-09 13:03:04 UTC

(In reply to comment #26)
> Hi Josh,
> 
> Will 3.2 please be made available to F16?. This bug is a real pain, and F17 is
> a long way off :/

Yes. It's been discussed on the lists: http://lists.fedoraproject.org/pipermail/devel/2012-January/160970.html

You can also use the 3.2.0 F17 quite easily with F-16 without issues.

Comment 28 Ankur Sinha (FranciscoD) 2012-01-09 13:16:43 UTC

Great! Thanks! :)

Comment 29 Julien HENRY 2012-01-09 15:59:43 UTC

I also confirm that updating to kernel 3.2 fixes the issue. Even under high computing/IO load, the UI is slowed down but still responsive! Much better than waiting until the process is completed.

Comment 30 Paulo Fidalgo 2012-01-11 12:21:00 UTC

After updating the kernel package to the one provided here:
http://koji.fedoraproject.org/koji/buildinfo?buildID=281207

I've noticed that I can now use my system while copying files from/to USB drives, although the speed is low and not constant, despite the low system resource usage.

I also experience some minor hangups, and copying between two USB drives is really slow (below 1MByte/s) and inconsistent (the drives are connected to the same internal USB hub).

Comment 31 Emilio Scalise 2012-01-20 11:56:47 UTC

It seems that F16 kernel version 3.1.9-1.fc16.x86_64 also fixes the problem.
Could someone try that to confirm that it works?
Has anyone backported those IO patches from 3.2?

Comment 32 Gilboa Davara 2012-01-21 15:03:13 UTC

No go.
I just created a 16GB VM image using dd if=/dev/zero bs=1M ... and the DE (KDE) was completely unresponsive (beyond the mouse pointer, that is).
P.S. my 5 x 320GB Software RAID 5 setup was writing at ~350MBps and SSH access worked just fine, so the problem, at least as far as I could see, was limited to the GUI.

- Gilboa

Comment 33 Emilio Scalise 2012-01-22 00:59:12 UTC

Which kernel did you use? Could you try with the 3.2.1 kernel from F17?

Comment 34 Gilboa Davara 2012-01-23 03:59:33 UTC

F16/3.2.1 from koji hung my machine after a couple of hours of heavy usage. As I didn't have a serial console attached, I couldn't get a call stack.
However, as I'm using the nVidia binary driver (and couldn't reproduce the results on a nVidia-less machine) I decided not to post a bug report.

I'll wait for next koji build before trying again. (F17 already moved to 3.3pre-rc)

- Gilboa

Comment 35 Paulo Fidalgo 2012-01-24 10:15:47 UTC

It appears to be solved with 3.2.1-3.fc16.x86_64.

At least I could copy a file of about 12GB without a system hang using KDE Dolphin. I can observe slowness but no hanging for a long period, which I consider normal due to the I/O operations.

Comment 36 Gilboa Davara 2012-01-30 16:00:38 UTC

Using F16/x86_64/v3.2.1-3, I dd'ed a 60GB VM image (dd if=/dev/zero of=/image/name bs=1M count=XXX) and KDE was unbearably slow.
The test was conducted on a 2 x 6C Xeon w/12GB RAM and 5 x 320GB SATA drives on software RAID5 (w/ HT on).

No idea why, but I can't remember hitting the same issue in early F15 kernels.
(I used to compile kernels while playing spring-rts... :))

Comment 37 G. Michael Carter 2012-02-15 18:47:55 UTC

I'm getting the same sort of issue:

Under heavy IO it can freeze for a bit, then may or may not return.   When running my monthly backups things start queuing up till nothing works... 

That is: while frozen, if you do a ps -ef it hangs, as does top or whatever.  The system is still running and if I'm lucky I can type reboot.  If not, processes just start piling up.

Also if it's doing a resync when the backups trigger... forget it... the system will lock up.

Two systems:  both 3.2.5-3.fc16.x86_64 and using xfs on the array

System 1: 
[root@whitestar log]# mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Sun Aug 14 00:28:09 2011
     Raid Level : raid5
     Array Size : 2930284032 (2794.54 GiB 3000.61 GB)
  Used Dev Size : 976761344 (931.51 GiB 1000.20 GB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

    Update Time : Wed Feb 15 13:26:35 2012
          State : active, resyncing 
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

  Resync Status : 84% complete

           Name : whitestar:0  (local to host whitestar)
           UUID : ff6d1724:e688b10f:e1cd698e:017c37f7
         Events : 13921

    Number   Major   Minor   RaidDevice State
       0       8       32        0      active sync   /dev/sdc
       1       8       48        1      active sync   /dev/sdd
       3       8       64        2      active sync   /dev/sde
       4       8        0        3      active sync   /dev/sda

System #2:

[root@andromeda /]# mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Tue Aug 16 20:31:39 2011
     Raid Level : raid5
     Array Size : 9767564800 (9315.08 GiB 10001.99 GB)
  Used Dev Size : 1953512960 (1863.02 GiB 2000.40 GB)
   Raid Devices : 6
  Total Devices : 6
    Persistence : Superblock is persistent

    Update Time : Wed Feb 15 13:43:11 2012
          State : clean 
 Active Devices : 6
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

           Name : andromeda:0  (local to host andromeda)
           UUID : a15f896f:5adcb015:8908feeb:8f0ce6f2
         Events : 35578

    Number   Major   Minor   RaidDevice State
       0       8       48        0      active sync   /dev/sdd
       1       8       64        1      active sync   /dev/sde
       2       8       80        2      active sync   /dev/sdf
       4       8       32        3      active sync   /dev/sdc
       6       8       16        4      active sync   /dev/sdb
       5       8        0        5      active sync   /dev/sda

Comment 38 Magnus Tuominen 2012-03-09 20:04:31 UTC

This was still present in 3.2.9-2, so I grabbed the 3.3.0-0.rc6.git2.2.fc18 from koji today and installed it on my f16 x86_64 and I no longer experience the gui lockup.

Comment 39 Ankur Sinha (FranciscoD) 2012-03-10 05:15:06 UTC

Confirming that I still get the UI lock up:

Linux ankur.pc 3.2.9-1.fc16.x86_64 #1 SMP Thu Mar 1 01:41:10 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

Will test with 3.3 when it hits stable and re-confirm.

Comment 40 Dave Jones 2012-03-22 16:41:00 UTC

[mass update]
kernel-3.3.0-4.fc16 has been pushed to the Fedora 16 stable repository.
Please retest with this update.

Comment 41 Dave Jones 2012-03-22 16:45:52 UTC

[mass update]
kernel-3.3.0-4.fc16 has been pushed to the Fedora 16 stable repository.
Please retest with this update.

Comment 42 Dave Jones 2012-03-22 16:55:01 UTC

[mass update]
kernel-3.3.0-4.fc16 has been pushed to the Fedora 16 stable repository.
Please retest with this update.

Comment 43 Emilio Scalise 2012-03-23 01:07:33 UTC

Updated to kernel-3.3.0-4.fc16, wrote many small files to a slow USB pen drive: no slowdowns in the graphical UI (I use KDE), no lag. Great!

Hope that this will remain this way over next kernel updates.

Thanks Dave and other kernel devs!

Comment 44 Gilboa Davara 2012-03-25 18:27:29 UTC

Still no go.
With a fully updated kernel on both host and VMs, I simultaneously updated the kernel cscope DB on two F16 VMs while browsing the web on a quad-core AMD Phenom 635 machine w/ 8GB RAM.
I/O was at ~10-20MBps, CPU usage was at under 100% (effectively 25%), but the UI was *very* sluggish.

- GIlboa

Comment 45 Gilboa Davara 2012-03-25 18:29:51 UTC

... Though I should point out that I had minor swap usage (~250MB), so the slowdown might have been triggered by swap-out.
I'll redo the test on my Xeon workstation (see above) and report the results.

- Gilboa

Comment 46 Daniel L. 2012-03-27 08:02:33 UTC

I'm getting the same issues with the new 3.3.0-4.fc16 kernel.

I've built a software raid 1 with two usb drives for my backups. The system freezes happen when I copy data from or to the raid or when it is resyncing.

This is really annoying, as the raid needs to be resynced whenever the system froze while I was writing to the raid.

Comment 47 Jes Sorensen 2012-03-27 13:08:41 UTC

Daniel,

(In reply to comment #46)
> I'm getting the same issues with the new 3.3.0-4.fc16 kernel.
> 
> I've built a software raid 1 with two usb drives for my backups. The system
> freezes happen when I copy data from or to the raid or when it is resyncing.
> 
> This is really annoying as the raid needs to be resynced when the system
> freezed while I was writing to the raid..

Could you please clarify here - did you build the raid1 from the two USB
drives, or do you have a system with a raid1 built from regular disk and
just two USB drives installed?

Thanks,
Jes

Comment 48 Jes Sorensen 2012-03-27 13:17:23 UTC

(In reply to comment #44)
> Still no go.
> With fully update kernel on both host and VM's I've simultaneously update the
> kernel cscope DB on two F16 VM's while browsing the web on the quad core AMD
> Phenom 635 machine w/ 8GB RAM.
> I/O was at ~10-20MBps, CPU usage was at under 100% (effectively 25%), but the
> UI was *very* sluggish.
> 
> - GIlboa

How are you running the VMs? Are you making sure to launch them with O_DIRECT
access to the file VM image files? I think it's cache=none on the QEMU
command line.

If you run the image files with regular buffered I/O and the guests are large
compared to the amount of memory you have, you can easily thrash the system
memory of the host which will lead to a very sluggish system.

Jes
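
For illustration, direct (O_DIRECT) access to a guest image is requested in QEMU with the `cache=none` suboption of `-drive`. The helper name and the image path in the example are hypothetical, and a libvirt-managed guest would set this in its domain configuration instead.

```shell
#!/bin/sh
# Sketch only: drive_arg and the image path in the example are hypothetical.
# Builds a QEMU -drive argument that bypasses the host page cache (O_DIRECT).
drive_arg() {
    echo "-drive file=$1,cache=none"
}

# Example use (not run here):
#   qemu-system-x86_64 -m 2048 $(drive_arg /var/lib/libvirt/images/f16.img)
```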

Comment 49 Daniel L. 2012-03-27 13:58:28 UTC

(In reply to comment #47)
> Daniel,
> 
> (In reply to comment #46)
> > I'm getting the same issues with the new 3.3.0-4.fc16 kernel.
> > 
> > I've built a software raid 1 with two usb drives for my backups. The system
> > freezes happen when I copy data from or to the raid or when it is resyncing.
> > 
> > This is really annoying as the raid needs to be resynced when the system
> > freezed while I was writing to the raid..
> 
> Could you please clarify here - did you build the raid1 from the two USB
> drives, or do you have a system with a raid1 built from regular disk and
> just two USB drives installed?
> 
> Thanks,
> Jes

Hi Jes,

The raid is built from the two USB drives.

Comment 50 frollic nilsson 2012-04-03 10:43:22 UTC

I'm having the same issue, kernel 3.3.0-4.fc16.i686, 4 disk soft RAID5 set. 

When RAID I/O is high, the process generating the load will eventually freeze, and the CPU core it runs on goes into an endless 99-100% WAIT state.

When this happens, a shutdown -h now doesn't work: the computer doesn't restart, it hangs during shutdown, and only the power or reset button helps. 

Once restarted, the raid set has to be resynced.

Hardware is 

Supermicro X7SPA-H (Atom D510)
2GB RAM
soft RAID1 for OS
soft RAID5 for data

The problem occurred today again, last time it happened was 10 days ago.

"screenshot" from nmon (don't know if it'll be readable though, once posted)

│CPU  User%  Sys% Wait% Idle|0          |25         |50          |75       100|│
│ 1   0.5   0.5   0.0   99.0| >                                               |│
│ 2   0.0   0.0   0.0  100.0| >                                               |│
│ 3   0.0   0.5  99.5    0.0|WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW>│
│ 4   0.0   0.0   0.0  100.0| >                                               |│
│                           +-------------------------------------------------+│
│Avg  0.1   0.2  25.0   74.6|WWWWWWWWWWWW >                                   |│
│                           +-------------------------------------------------+│

Comment 51 Gilboa Davara 2012-04-03 11:01:25 UTC

Jes,

My VM's already use cache=none.
*However*, I can only seem to reproduce this issue on a fairly (?) low-end quad-core AMD Phenom machine w/ 8GB RAM.
Running the same scenario (including the minor memory over-commit) on a dual 6C Xeon w/ 12GB RAM + 5-drive software RAID5 doesn't produce the same results.

Furthermore, dd'ing 4GB to a USB thumb drive while dd'ing 40GB to the MD RAID5 array did produce some sluggishness, but it was far, *far* better than on 3.0 or 3.1.

I'll do some additional testing and report back.

Comment 52 Daniel L. 2012-04-03 18:19:28 UTC

I can confirm the behavior that is described in comment 50.

When the system freezes there is nothing you can do. Even Caps Lock and Num Lock don't work anymore.

Comment 53 frollic nilsson 2012-05-08 07:20:07 UTC

Problem still occurs in 3.3.2-6.fc16.i686

Comment 54 frollic nilsson 2012-05-09 12:17:33 UTC

3.3.4-3.fc16.i686 is a no go, managed to run into the problem again in less than 24hrs.

It's amazing this problem doesn't seem to be assigned to anyone at RH .... :(

Comment 55 Jes Sorensen 2012-05-09 12:58:25 UTC

Frollic,

We're watching the bug, however there is no obvious pattern and it is
not clear what is going wrong. Only a very small number of people have seen
this problem - I haven't seen anything like it in my own testing.

In addition, you are running a 32 bit kernel on a 64 bit processor with
more than 1GB of RAM. That doesn't make sense as you end up using highmem
and bounce buffers which will slow down your I/O performance a fair bit.

There are two issues at play here, some people have reported problems when
running RAID on top of USB devices, which is just begging for bad things
to happen with the short queues and slow performance of USB. The other case
is with real SATA connections, but there it also smells like there could be
dodgy hardware at play.

Jes
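
A quick way to check for the 32-bit-kernel-on-64-bit-CPU situation described above is sketched here. The function names are illustrative assumptions. The check is x86-specific and relies on the `lm` (long mode) CPU flag in `/proc/cpuinfo` and on the architecture string printed by `uname -m`.

```shell
#!/bin/sh
# Sketch only: function names are illustrative. x86-specific.

# Succeeds if the flags line of a cpuinfo-format file lists "lm" (long mode),
# meaning the CPU is able to run a 64-bit kernel.
cpu_has_long_mode() {
    grep -m1 '^flags' "${1:-/proc/cpuinfo}" | grep -qw lm
}

# Classify a kernel architecture string as reported by `uname -m`.
kernel_bits() {
    case "$1" in
        x86_64) echo 64 ;;
        i?86)   echo 32 ;;
        *)      echo unknown ;;
    esac
}

# Example use (not run here):
#   cpu_has_long_mode && [ "$(kernel_bits "$(uname -m)")" = 32 ] \
#       && echo "32-bit kernel on a 64-bit capable CPU"
```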

Comment 56 Emilio Scalise 2012-05-09 13:29:46 UTC

I would also add that the problem with slow io devices (without raid) and Xorg freezing seems almost to be gone.
Previously, on every machine I had, the GUI froze while writing to slow USB disks or SD cards (both 32-bit and 64-bit).

Perhaps problems related to raid arrays are different.

Regards

Comment 57 frollic nilsson 2012-05-09 13:30:54 UTC

Well, it's a very old system, which has been migrated from a previous setup. 

You can trust me, when I say I would have "upgraded" a long time ago, if there was an easy way of doing it :) 

Is there a way I/we can help you in investigating the problem ? 
I seem to run into the issue quite frequently :(

Btw, I think the problems started once I upgraded to FC16, I don't think I had them in FC15.

Comment 58 Jes Sorensen 2012-05-09 13:45:04 UTC

Frollic,

It may be the case, but as I said, running a 32bit kernel is begging for
problems if you want to have I/O performance. I realize upgrading might
be tricky, but it would be very interesting to know if the problems you
see go away if you run a 64 bit install on the box.

Emilio,

I suspect the USB issues are not so much related to raid, but simply that
the raid code is more likely to trigger the issue in the first place because
of a different way of hitting the device.

Jes

Comment 59 G. Michael Carter 2012-05-09 18:00:32 UTC

(In reply to comment #55)
> Frollic,
> 
> We're watching the bug, however there is nothing obvious pattern and it is
> not clear what is going wrong. Only a very small number of people have seen
> this problem - I haven't seen anything like it in my own testing.
> 
> In addition, you are running a 32 bit kernel on a 64 bit processor with
> more than 1GB of RAM. That doesn't make sense as you end up using highmem
> and bounce buffers which will slow down your I/O performance a fair bit.
> 
> There are two issues at play here, some people have reported problems when
> running RAID on top of USB devices, which is just begging for bad things
> to happen with the short queues and slow performance of USB. The other case
> is with real SATA connections, but there it also smells like there could be
> dodgy hardware at play.
> 
> Jes

If it's dodgy hardware that would be interesting.  I have the same problem on two different computers.  (both i7 2600k)   Different motherboards, and controller cards.  Different hard drive manufacturers.

Comment 60 Jes Sorensen 2012-05-10 07:57:50 UTC

(In reply to comment #59)
> If it's dodgy hardware that would be interesting.  I have the same problem on
> two different computers.  (both i7 2600k)   Different motherboards, and
> controller cards.  Different hard drive manufactures.

This is the first time I have seen anyone mention multiple cases of this
problem, whereas others have mentioned the problem went away when moving
to other hardware.

Can you list the controller info, motherboard chipset, and kernel version?

Are you running partitions directly on top of the raid or do you have lvm
in between?

From your previous posting I presume you are running the 64 bit kernel?

Jes

Comment 61 G. Michael Carter 2012-05-10 12:55:15 UTC

Created attachment 583544 [details]
Computer 1

The raid is direct partitions and the filesystem on top is XFS.

Comment 62 G. Michael Carter 2012-05-10 12:59:36 UTC

Created attachment 583552 [details]
Computer 2

Here's the second computer.

Comment 63 frollic nilsson 2012-05-14 18:06:32 UTC

OK, swapped to x86_64, problem appeared in less than 24hrs after the system was ready (same as when I ran i686), and the soft raid5 IO increased.

kernel is 3.3.4-3.fc16.x86_64

I'll try the 3.3.5-2 kernel after I reboot the server tomorrow morning.

Comment 64 Tommy He 2012-05-20 15:36:28 UTC

Though this issue seemed to have gone with kernel 3.2.X series, it returns on  all the current 3.3.X kernels on Fedora 17. That includes: 

kernel-3.3.0-1.fc17.x86_64
kernel-3.3.4-5.fc17.x86_64
kernel-3.3.6-3.fc17.x86_64

For other hardware info, please see my smolt profile: http://www.smolts.org/client/show/pub_79c55d9b-ea8d-45f5-8159-fb546e8d06f6

I have tried two different portable hard disks connected via USB. One is a 500GB Seagate formatted with NTFS and the other is similar but formatted with ext4.
The system will completely freeze while either writing large files to them or reading large files from them.

Comment 65 frollic nilsson 2012-05-20 19:48:12 UTC

OK, upgraded to x86_64 last week, and (as expected) the problem still occurs.

Kernel release is 3.3.5-2.fc16.x86_64.

This time I'm actually able to kill the application causing the IO wait, but the wait itself doesn't go away.

Comment 66 Tommy He 2012-05-21 08:45:29 UTC

With the official announcement of kernel 3.4.0, I am kind of hoping this issue has been fixed upstream.

Will give the 3.4.0 f17 kernel a try once it's in Koji.

Comment 67 frollic nilsson 2012-05-24 10:57:19 UTC

After Tommy He's findings (comment #64), I decided to downgrade to 3.2.6-3.fc16.x86_64.

I've used the kernel for over a week, and had no freeze so far.

Comment 68 Jes Sorensen 2012-05-24 14:36:29 UTC

I am starting to think this isn't raid related, at least not all the time.

When copying files from an external USB drive to a fast memory stick, I am
seeing my laptop freeze regularly.

3.3.5-2.fc16.x86_64

It feels like it fills up the write queue and then stalls while the writes
complete rather than scheduling and letting other stuff run in the meantime. It
doesn't lock up solid, but that is most likely due to the flash drive being
a very fast one....
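
One way to check the write-queue hypothesis above might be to watch the kernel's Dirty and Writeback counters in /proc/meminfo while the copy runs. The following is a minimal Python sketch, not something posted in this bug; the function names `parse_dirty_writeback` and `watch`, the one-second interval, and the sample count are illustrative assumptions.

```python
import time


def parse_dirty_writeback(meminfo_text):
    """Return the Dirty and Writeback counters (in kB) from /proc/meminfo text."""
    counters = {}
    for line in meminfo_text.splitlines():
        # Lines look like "Dirty:              1234 kB".
        name, _, rest = line.partition(":")
        if name in ("Dirty", "Writeback"):
            counters[name] = int(rest.split()[0])
    return counters


def watch(interval_s=1.0, samples=10, path="/proc/meminfo"):
    """Print the counters a few times while a copy runs in another terminal."""
    for _ in range(samples):
        with open(path) as f:
            print(parse_dirty_writeback(f.read()))
        time.sleep(interval_s)
```

Calling `watch()` in a second terminal during a USB-to-USB copy would print one sample per interval. A Dirty value that climbs steadily and then stays high while the desktop stalls would be consistent with the description above, though it would not by itself identify the cause.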

Comment 69 Tommy He 2012-05-24 15:49:49 UTC

I tried a few other things to see if anything changed.

1. Upgraded to the latest BIOS. No change.
2. Added "usb-handoff" to the kernel boot parameters. No change.

Comment 70 Jes Sorensen 2012-05-25 07:28:01 UTC

Ran some more tests yesterday - I found I see this problem if I copy from
one USB drive to another. However when I switched to copying from network
to the flash drive, the hangs went away.

Question for those who see this problem with USB: Has anyone seen this on
USB using just one drive, or does it always happen when more than one USB
drive is involved?

If the latter, it could indicate we have the USB stack fighting over a lock.

Comment 71 Benjamín Valero Espinosa 2012-05-25 07:46:26 UTC

(In reply to comment #70)
> Question for those who see this problem with USB: Has anyone seen this on
> USB using just one drive, or does it always happen when more than one USB
> drive is involved?
> 
> If the latter, it could indicate we have the USB stack fighting over a lock.

I had this problem months ago making backups with DejaDup on an NTFS-formatted external hard disk. I also had a similar problem not so long ago (system almost unusable) copying an 8 GB file to an NTFS-formatted pendrive. I will try to run a test this weekend.

Comment 72 Daniel L. 2012-05-26 10:12:37 UTC

Jes,

I can confirm that the problem mostly happens when I am copying data from one USB drive to another. I can't remember it happening when copying data between USB and another source (such as the network) or my internal HDD.

Moreover I found out that the problem occurs when I am copying data from a fast USB drive to a much slower one.

Comment 73 Tommy He 2012-05-26 13:57:16 UTC

(In reply to comment #70)
> Ran some more tests yesterday - I found I see this problem if I copy from
> one USB drive to another. However when I switched to copying from network
> to the flash drive, the hangs went away.
> 
> Question for those who see this problem with USB: Has anyone seen this on
> USB using just one drive, or does it always happen when more than one USB
> drive is involved?
> 
> If the latter, it could indicate we have the USB stack fighting over a lock.

The freeze happens in both cases, at least for me. The issue normally occurs when copying large files or multiple small files.

Comment 74 Tommy He 2012-06-10 16:03:03 UTC

Hi all,

This issue seems to be addressed in 3.4.0-1.fc17.x86_64. I tried copying and moving several large files from a USB-connected portable HDD to the internal HDD and no system freeze happened.

Thanks,

Comment 75 Paulo Fidalgo 2012-08-03 12:10:04 UTC

I agree with Tommy He, I've stopped experiencing it.

Comment 76 frollic nilsson 2012-08-05 10:21:26 UTC

Yes,

problem appears to be fixed.

I transferred almost 1 TB between two RAID5 sets, where one of them was USB-based, with no problem whatsoever.

Comment 77 Jes Sorensen 2012-08-07 12:32:14 UTC

Hi,

Since it looks like the problem has been resolved for everyone, I am going
to close this bug.

If you see this problem returning, please reopen the bug.

Thanks,
Jes

Comment 78 Mauricio Henriquez 2013-10-16 20:06:53 UTC

Hi guys, sadly I have to report the same problem described here on Fedora 19, kernel 3.11.3-201.fc19.i686.PAE.

The system freezes/hangs under heavy I/O load, no matter whether it is from one internal partition to another, from an internal partition to a USB disk, or from a USB disk to an internal partition.

The system freezes completely; it is not possible to switch to another terminal, and I am forced to do a hard reboot.

My internal system drive is an HP sps-drv hdd 750gb 7200sat2.5in new; the USB disk can be any.

My system is an Alienware 15x Intel® Core™ i7 CPU Q 820 @ 1.73GHz × 8 with an Nvidia GeForce GTX 260M/PCIe/SSE2. The problem also happens with the nouveau driver.

What else can I add to help with this report? At least the "messages" log seems to report the same things already posted in this thread.

Comment 79 Scott Baker 2013-10-29 21:13:51 UTC

I just tried making a 1Gig SD card for my Raspberry Pi and I got this issue again. It's been solved for a LONG time, but it just came back. I'm running: 3.11.4-201.fc19.x86_64

Comment 80 Daniel L. 2013-11-09 15:27:30 UTC

Hi,

this problem caught me too on 3.11.6-200.fc19.x86_64. I was syncing data with eSATA drives when the system froze :(

With 3.11.3-201.fc19.x86_64 everything is working for me so far.

Comment 81 Scott Baker 2013-11-13 21:26:00 UTC

Not sure what changed... but when I made a Pi SD card today I got no slowness. I'm on 3.11.7-200.fc19.x86_64 now.

Comment 82 Peter Meyer 2014-01-18 15:20:13 UTC

I switched from Ubuntu 12.04 to Fedora (because it has a usable Desktop UI in the form of KDE). After 3 months of Fedora 17, I upgraded to Fedora 19.

This bug unnerved me in Fedora 17, and continues to do so in Fedora 19.

Copying large amounts of data, starting a VM with >512MB in vmware ... most things, though not everything, that move a large amount of data slow KDE down to a crawl. Twice now, some misbehaving program has forced me to shut down my computer via the power button, because the UI did not react and I could not kill the program.

This could be either a kernel bug introduced after Ubuntu 12.04, or some Red Hat/Ubuntu difference/optimization.

Nevertheless, this never happened in Ubuntu or Windows.

Running on a Lenovo Thinkpad X201: 3.12.7-200.fc19.x86_64 #1 SMP Fri Jan 10 15:32:06 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

Please provide a workaround/fix for this.

Comment 84 Roland Pallai 2014-01-22 21:12:19 UTC

(In reply to Scott Baker from comment #79)
> I just tried making a 1Gig SD card for my Raspberry Pi and I got this issue
> again. It's been solved for a LONG time, but it just came back. I'm running:
> 3.11.4-201.fc19.x86_64

I totally agree. It was an issue for me too before kernel 3.4, then it was solved, and now it's back. I'm running 3.12.7-300.fc20.x86_64.

It does not depend on the type of the destination drive: I've experienced it with an internal HDD, a pendrive, and an SD card. cp on filesystems and dd on raw block devices are also affected.

Comment 85 Peter Meyer 2014-02-13 12:01:19 UTC

Could someone please advise what to do and how to get to the root of the problem?

Right now I am at a loss as to which Linux distribution is usable at all. Ubuntu has an unusable interface but not this annoying bug; Red Hat has a working KDE, but becomes unusable whenever something creates a larger I/O load.

If you need additional information I am willing to provide it.

Comment 86 frollic nilsson 2014-02-13 12:10:34 UTC

I've been running kernel-3.10.5-201.fc19 for a long time (180 days or so), without any soft-RAID5 issues at all.

You should be able to find it at http://koji.fedoraproject.org/koji/packageinfo?packageID=8

Today I upgraded to FC20 and 3.12.10-300.fc20, we'll see if it works properly.

Comment 87 Peter Meyer 2014-02-14 16:07:15 UTC

Thank you,

I downloaded 
kernel-3.10.5-201.fc19.x86_64.rpm
kernel-devel-3.10.5-201.fc19.x86_64.rpm
kernel-headers-3.10.5-201.fc19.x86_64.rpm
kernel-tools-3.10.5-201.fc19.x86_64.rpm
kernel-tools-libs-3.10.5-201.fc19.x86_64.rpm

Installing with the rpm command did not work, so I tried 
# yum localinstall *.rpm 
which did (I come from Ubuntu, so everything Fedora-specific is new to me)

This worked.

Then I tried everything important to me that depends on the kernel version, namely vmware and virtualbox, both of which are working now.

So I do indeed have a working system now!
I will come back and comment again if it doesn't solve the problem.

In the meantime thank you for your advice and help to provide me with a working linux system again.

Comment 88 Carl van Tonder 2014-02-19 09:50:13 UTC

Seeing this whenever my F19 system is carrying out a reasonable amount of I/O.

3.12.9-201.fc19.x86_64

The easiest way to reproduce it is copying files to a USB stick (both FAT and NTFS seem to trigger it).

Comment 89 Peter Meyer 2014-03-06 10:52:30 UTC

Installing Kernel 3.10 did the trick!

Now I just have to figure out how to set this kernel as default without resetting all of my /boot/grub2/grub.cfg entries to some default.

Yeah well, not complaining, but Ubuntu had a nice GUI for this :-)

Nevertheless, the problem is solved (for me) by a new kernel - thank you, especially to frollic nilsson, who pointed me to how to work around this problem.
