Bug 865755 - ext3/4: Writing large file on large memory system with slow disks makes machine unusable
Summary: ext3/4: Writing large file on large memory system with slow disks makes machine unusable
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.8
Hardware: Unspecified
OS: Unspecified
Target Milestone: rc
Assignee: Carlos Maiolino
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2012-10-12 10:48 UTC by Michael Weiser
Modified: 2018-11-30 21:09 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-03-20 18:50:52 UTC
Target Upstream Version:
Embargoed:


Attachments: none

Description Michael Weiser 2012-10-12 10:48:36 UTC
Description of problem:

We have a number of new machines with 128GB of RAM each. The operating system lives on a hardware RAID1 of two 1TB SATA drives. The machines are installed with RHEL5.8 and the latest patches, in particular kernel 2.6.18-308.16.1.el5.

When writing a large file (e.g. 150GB) to the root filesystem, further writes start to hang for a very long time after about a minute. Since syslogd writes (and syncs) log messages for all logins, logging into the machine becomes impossible. Open sessions (e.g. via SSH) continue to run normally until a write is triggered.

After the large write has finished, it takes roughly another 10 minutes until the write cache has apparently shrunk to a size at which the system handles other write requests in a timely fashion. All in all it takes about 30 minutes for the system to become usable again.

This only happens with ext3 or ext4 as the root filesystem. When running on ext2, the machine stays responsive throughout the large write operation and logins remain possible.

The symptom also does not appear when the machine is installed with RHEL6.

The problem also affects a separate RAID0 of four 300GB SATA drives performing at about 600MB/s, although not as severely: writes still hang once a certain cache fill level is reached, but since it is not the root filesystem, logins to the machine remain possible.

Changing values in /proc/sys/vm/{dirty_ratio,dirty_background_ratio,dirty_writeback_centisecs,dirty_expire_centisecs} brings no noticeable change.
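For reference, these are the kinds of settings we tried at runtime (illustrative values, not our exact test matrix; run as root):

# echo 10 > /proc/sys/vm/dirty_ratio                 (max % of RAM dirty before writers block)
# echo 5 > /proc/sys/vm/dirty_background_ratio       (% of RAM dirty before background flushing starts)
# echo 500 > /proc/sys/vm/dirty_writeback_centisecs  (wake the flusher threads every 5 seconds)
# echo 1000 > /proc/sys/vm/dirty_expire_centisecs    (write back data dirty for more than 10 seconds)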

The problem is mitigated noticeably when limiting system RAM to 2GB using the kernel parameter mem=2048M. At first the system keeps working as if the write were not running at all; the write cache alone cannot explain this, because about 50GB get written to disk in the meantime. After that it starts to hang as with the full amount of memory, but SSH logins do eventually succeed (after about two minutes). Stopping the running dd with Ctrl+C makes the system run normally again almost immediately (as opposed to the 10 minutes mentioned above).
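For reference, the limit was applied by appending mem=2048M to the kernel line in GRUB legacy's /boot/grub/grub.conf, roughly like this (the root= value below is a typical placeholder, not our actual setup):

kernel /vmlinuz-2.6.18-308.16.1.el5 ro root=/dev/VolGroup00/LogVol00 mem=2048M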

It seems that a bug in RHEL5's ext3/4 is aggravated by the large discrepancy between the amount of system RAM and the write speed of the root filesystem.

Version-Release number of selected component (if applicable):
2.6.18-308.16.1.el5

How reproducible:

Steps to Reproduce:
1. dd if=/dev/zero of=/tmp/t bs=10240k count=15000
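
While the dd runs, the cache fill-up can be watched from an already-open session, e.g.:

# watch -n 5 'grep -E "^(Dirty|Writeback):" /proc/meminfo'

(Dirty grows towards the dirty_ratio limit; once writers start blocking, the hangs begin.)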
  
Actual results:
all further writes hang, system becomes unusable

Expected results:
system remains usable

Additional info:
The RAID controller is an LSI MegaRAID SAS 2208 [Thunderbolt] driven by the stock RHEL5 megasas driver. All write caches are disabled, and enabling them makes no difference.

Comment 1 Carlos Maiolino 2013-03-04 17:18:29 UTC
Hi, can you let me know the values you tested in /proc/sys/vm/dirty_ratio?

I was able to reproduce the same behaviour you're seeing, and it looks like the system gets busy due to the amount of data being written back to disk, which keeps the IO subsystem really busy. This isn't unexpected on systems with a huge amount of RAM.
RHEL6 has a better IO mechanism which mitigates the problem; that's why you are not facing it on RHEL6.

A change in dirty_ratio should mitigate the problem, keeping the machine usable while also writing to the root filesystem, since it limits the amount of data being written back per flush.

I was able to reproduce this behaviour on a machine with 260GiB of RAM and dirty_ratio at its default value (40%), and fixed it by setting dirty_ratio to 10 with:

# echo 10 > /proc/sys/vm/dirty_ratio
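
To make the setting persistent across reboots, put it into /etc/sysctl.conf:

vm.dirty_ratio = 10

and load it with:

# sysctl -p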

This should fix the lockups you're seeing.

--Carlos

Comment 2 Michael Weiser 2013-03-19 15:44:14 UTC
I'm pretty sure (as sure as I can be after five months) that I tested everything from 5 to 95 in increments of 10 without any change. I can't test any further, since the machines have gone into production in the meantime.

However, our customer has decided not to pursue the issue. Seemingly, it doesn't pose as big a problem in production as initially anticipated. You can close this bug.

Comment 3 Carlos Maiolino 2013-03-20 18:50:52 UTC
Ok, I'm closing this bug; however, feel free to reopen it if it becomes an issue again.

--Carlos

