|Summary:||Bad I/O throughput when stressing Dell PERC 4/im LSI Fusion|
|Product:||Red Hat Enterprise Linux 3||Reporter:||Ingvar Hagelund <ingvar>|
|Component:||kernel||Assignee:||Tom Coughlan <coughlan>|
|Status:||CLOSED WONTFIX||QA Contact:||Brian Brock <bbrock>|
|Version:||3.0||CC:||dledford, johan.lithander, petrides, trond.nordheim|
|Fixed In Version:||Doc Type:||Bug Fix|
|Doc Text:||Story Points:||---|
|Last Closed:||2007-10-19 18:51:21 UTC||Type:||---|
Description Ingvar Hagelund 2005-11-09 20:41:28 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; nb-NO; rv:1.7.12) Gecko/20050920 Firefox/1.0.7

Description of problem:
When stressing the RAID, throughput is good for a short period (some 10 minutes) and then drops dramatically. All I/O takes a very long time, so most processes using the disk go into iowait and the load rises. The system becomes totally unusable: basic file operations take several minutes, and logins time out.

We reproduced this by extracting a large tarball (24GB compressed, 86GB uncompressed, 14.5 million files and directories) onto an ext3 filesystem on a 90GB LVM volume. The underlying disk is a hardware RAID1 mirror, which is a shared resource: all local volumes are on that mirror, as it is the only disk resource on the blade.

We can reproduce this both with the driver in the u6 kernel 2.4.21-37.EL and with Dell's newer dkms driver mptlinux-2.05.16-1dkms. With the latest Dell driver the situation becomes a tiny bit less terrible, but the system is still unusable.

The hardware is a Dell 1855 blade server with 6GB RAM. The Dell PERC 4/im is actually an LSI Fusion-MPT in disguise:

# lspci | grep LSI
04:04.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 08)

Version-Release number of selected component (if applicable):
kernel-2.4.21-37.EL

How reproducible:
Always

Steps to Reproduce:
1. Add a very large compressed tarball, consisting of some 760,000 files spread out over the bottom of an 18-level-deep file tree (reverse hashtree directory structure), about 14.5 million files and directories in total.
2. Start untarring the file. Watch for some 10 minutes.
3. See throughput for the rest of the system crumble.

Actual Results:
All processes using the disk, except the untarring process, go into iosleep. Terrible disk throughput. High system load (naturally). System unusable.

Expected Results:
Good throughput. Only short iowait periods.
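A scaled-down stand-in for the workload in the steps above can be sketched as follows. This is not the reporter's tarball or tooling: the function names `hashed_path` and `populate`, the file naming scheme, the 1 KB payload, and the reading of "reverse hashtree" as one directory level per hex digit taken from the end of a SHA-256 digest are all assumptions made for illustration.

```python
import hashlib
import os
import tempfile


def hashed_path(name, depth=18):
    """Map a file name to a relative path nested `depth` directories deep.

    Assumption: one directory level per hex digit of the name's SHA-256
    digest, read from the end of the digest towards the start.
    """
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()
    levels = digest[::-1][:depth]  # last hex digit becomes the top level
    return os.path.join(*levels, name)


def populate(root, count, depth=18, payload=b"x" * 1024):
    """Create `count` small files under `root`; return the number written."""
    written = 0
    for i in range(count):
        target = os.path.join(root, hashed_path("file%07d" % i, depth))
        os.makedirs(os.path.dirname(target), exist_ok=True)
        with open(target, "wb") as handle:
            handle.write(payload)
        written += 1
    return written


if __name__ == "__main__":
    # Scaled far down from the roughly 760,000 files in the report.
    with tempfile.TemporaryDirectory() as scratch:
        print(populate(scratch, count=100))
```

Because almost every file gets its own chain of directories, the count of directories dwarfs the count of files, which matches the shape of the figures quoted in the report. To approximate the real test, the tree would have to be created on the affected volume with a much larger `count`.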
Additional info:

# uname -a
Linux some.host 2.4.21-37.EL #1 SMP Wed Sep 7 13:32:18 EDT 2005 x86_64 x86_64 x86_64 GNU/Linux

# lsmod | grep mpt
mptscsih               43792   3
mptbase                50472   3  [mptscsih]
diskdumplib             6548   0  [mptscsih mptbase]
scsi_mod              130124   4  [usb-storage sg mptscsih sd_mod]

# cat /proc/scsi/scsi
Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: DELL     Model: VIRTUAL DISK IM  Rev: 1998
  Type:   Direct-Access                    ANSI SCSI revision: 02
Host: scsi0 Channel: 00 Id: 06 Lun: 00
  Vendor: SDR      Model: GEM318P          Rev: 1
  Type:   Processor                        ANSI SCSI revision: 02

Excerpt from lspci -vvv:
04:04.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 08)
        Subsystem: Dell: Unknown device 018a
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR+ FastB2B-
        Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 72 (4250ns min, 4500ns max), cache line size 10
        Interrupt: pin A routed to IRQ 42
        Region 0: I/O ports at ec00 [size=256]
        Region 1: Memory at dfdf0000 (64-bit, non-prefetchable) [size=64K]
        Region 3: Memory at dfde0000 (64-bit, non-prefetchable) [size=64K]
        Expansion ROM at dfe00000 [disabled] [size=1M]
        Capabilities: Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: Message Signalled Interrupts: 64bit+ Queue=0/0 Enable-
                Address: 0000000000000000  Data: 0000
        Capabilities: PCI-X non-bridge device.
                Command: DPERE- ERO- RBC=0 OST=4
                Status: Bus=0 Dev=0 Func=0 64bit- 133MHz- SCD- USC-, DC=simple, DMMRBC=0, DMOST=0, DMCRS=0, RSCEM-

We have also tried a similar blade with the latest RHEL 4 kernel, 2.6.9-22.0.1.EL. The problem is visible on this blade too, but takes a lot longer (some 70-80 minutes) to trigger.
Once the problem is visible, throughput is very bad but a little less terrible than with the RHEL 3 installation: it is possible to log into the system and do small basic file operations like ls and cp without waiting several minutes for an answer. The setup was tested in our customer's lab using CentOS 4, and worked flawlessly on simple SATA disks with the CentOS 4 2.6.9-11.ELsmp kernel.
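The point at which the system tips over (about 10 minutes on RHEL 3, 70-80 minutes on RHEL 4) could be pinned down by sampling the aggregate `cpu` line of /proc/stat at intervals and watching the iowait share. The sketch below is illustrative only and was not part of the original test. It assumes the 2.6-style field order documented in proc(5) (user, nice, system, idle, iowait, irq, softirq, steal); the RHEL 3 2.4 kernel may lay the line out differently. The helper names `parse_cpu_line` and `iowait_share` are hypothetical.

```python
def parse_cpu_line(line):
    """Return the counters from an aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("not an aggregate cpu line: %r" % line)
    # Keep user..steal only; later guest fields are already part of user time.
    return [int(value) for value in fields[1:9]]


def iowait_share(before, after):
    """Fraction of CPU time spent in iowait between two samples.

    Assumption: iowait is the fifth counter, as on 2.6 and later kernels.
    Returns 0.0 when the samples carry no iowait field or no time elapsed.
    """
    deltas = [b - a for a, b in zip(parse_cpu_line(before), parse_cpu_line(after))]
    total = sum(deltas)
    if len(deltas) < 5 or total <= 0:
        return 0.0
    return deltas[4] / total


if __name__ == "__main__":
    import time

    def read_cpu_line():
        # The first line of /proc/stat is the aggregate over all CPUs.
        with open("/proc/stat") as stat:
            return stat.readline()

    first = read_cpu_line()
    time.sleep(5)
    print("iowait share: %.2f" % iowait_share(first, read_cpu_line()))
```

Logging this value once a minute alongside the untar would show whether the collapse is gradual or abrupt, and would make the RHEL 3 and RHEL 4 runs directly comparable.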
Comment 1 RHEL Product and Program Management 2007-10-19 18:51:21 UTC
This bug is filed against RHEL 3, which is in maintenance phase. During the maintenance phase, only security errata and select mission critical bug fixes will be released for enterprise products. Since this bug does not meet those criteria, it is now being closed. For more information on the RHEL errata support policy, please visit: http://www.redhat.com/security/updates/errata/ If you feel this bug is indeed mission critical, please contact your support representative. You may be asked to provide detailed information on how this bug is affecting you.