Bug 141173 - RHEL4 U1: x86 smp mkfs.ext2 on USB ext hd causes xfs to be killed & video lost
Summary: RHEL4 U1: x86 smp mkfs.ext2 on USB ext hd causes xfs to be killed & video lost
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.0
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Larry Woodman
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 147461
 
Reported: 2004-11-29 18:56 UTC by Stuart Hayes
Modified: 2007-11-30 22:07 UTC (History)
CC: 12 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2005-05-03 22:59:26 UTC
Target Upstream Version:
Embargoed:


Attachments
OOM on 2.6.9-1.860_ELsmp during mkfs on usb 2.0 HD (58.89 KB, text/plain), 2004-12-07 20:04 UTC, Robert Hentosh
contains meminfo and slabinfo when oom killer is active. (2.57 KB, text/plain), 2004-12-09 06:17 UTC, Robert Hentosh
Reproduced without USB. (100.00 KB, text/plain), 2004-12-15 20:26 UTC, Robert Hentosh
The last patch failed (15.09 KB, text/plain), 2004-12-16 22:05 UTC, Robert Hentosh
Failure with Congest kernel patch (dmesg output) (15.09 KB, text/plain), 2004-12-21 14:44 UTC, Robert Hentosh
OOPS as described in comment #59 (341.81 KB, image/jpeg), 2004-12-23 16:57 UTC, Amit Bhutani
syslog of system that had OOPS (177.12 KB, text/plain), 2004-12-23 17:15 UTC, Amit Bhutani
message log of congest3 kernel failure. (799.06 KB, text/plain), 2005-01-03 18:58 UTC, Robert Hentosh
Failure in Austin on congest3 after a few days (747.95 KB, text/plain), 2005-01-03 20:37 UTC, Robert Hentosh
Other issue failure (19.49 KB, text/plain), 2005-01-06 18:34 UTC, Robert Hentosh
Shredder single mkfs failure. (204.54 KB, text/plain), 2005-01-07 16:24 UTC, Robert Hentosh
Hardware configuration for SAN testcase (11.44 KB, text/plain), 2005-01-11 16:07 UTC, Robert Hentosh
2.6.9-4smp hang with single mkfs.ext in loop (225.33 KB, text/plain), 2005-01-11 17:45 UTC, Robert Hentosh
Failure with 2.6.9-5smp (101.57 KB, text/plain), 2005-01-11 20:34 UTC, Robert Hentosh
failure on single mkfs without "loop" (154.73 KB, application/octet-stream), 2005-01-13 15:10 UTC, Robert Hentosh
congest5 failure (8.39 KB, application/octet-stream), 2005-01-13 15:29 UTC, Robert Hentosh
/var/log/messages from RH cert test (863.28 KB, text/plain), 2005-02-11 17:50 UTC, Robert Hentosh

Description Stuart Hayes 2004-11-29 18:56:47 UTC
Description of problem:

We have seen this problem on a PE1655MC and then on a PE2800 when 
trying to duplicate it.

This loss of video happens when a USB external hard drive is attached 
and is being formatted with mkfs.ext2.  This is with RHEL4 Beta2 x86, 
smp kernel, run level 5 with X Windows running:

F1 virtual screen: hdparm -t /dev/sda            (boot drive)
F2 virtual screen: mkfs.ext2 /dev/sdb1          (USB external hard 
drive)
F6 virtual screen: top

When we see the usage going up past 10 or so, then we can expect the 
problem on the 1655MC.  On the PE2800 it has to go up to about 13-25 
or so before the problem is seen.

On the 1655MC blade server we lose all video, even virtual screens, 
every time during one format of the USB external hard drive.

But on the PE2800, we did not see it on the first few formats of the 
USB external hard drive.  However, if we run the mkfs.ext2 format in 
a loop, pausing 5 seconds after each format, we see it complete many 
times and then we get the loss of video.  On the PE2800, this video 
loss is seen within about 15 minutes.

Using the Avocent KVM, we lose ALL video (console and all virtual 
screens are neither visible nor accessible).  Using a direct-connect 
monitor, as before on the 1655MC, xfs is seen as the last process 
killed; on the monitor we see X Windows collapsing, but because it is 
launched from /etc/inittab it gets respawned.  By that time, though, 
the user would have lost any remote connections, any files being 
edited, and their desktop.


Version-Release number of selected component (if applicable):


How reproducible:
Easy

Steps to Reproduce:
1. see description
2.
3.
  
Actual results:
system crashes as described above

Expected results:
successful format, system still running

Additional info:

Comment 1 Stephen Tweedie 2004-12-01 14:58:03 UTC
"xfs is seen as the last process killed"

What log messages do you see when this is happening?

This feels like a VM problem.  I've fixed similar problems several
times in the past, especially on 2.2 / 2.4 systems, where mke2fs was
consuming all of memory with dirty buffer heads.  If processes are
being actively killed by the kernel, that could well be the
out-of-memory killer, which would imply that we've got a similar
buffer management problem in the current kernel.

However, we really need the kernel logs to know for sure --- can
somebody try to recreate this with netconsole or serial console
enabled?  Trying to capture "vmstat 1" during the problem could also
help to verify the problem.

If we confirm it's an out-of-memory problem, then the next step would
be to look for USB errors in the log.  If those are present, then
that's the next problem to attack.  If not, though, it implies that
it's a VM balancing problem that we need to deal with.
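
(For anyone recreating this, a minimal sketch of the capture Stephen is
asking for; the serial device, baud rate, and log paths are assumptions
to adjust for the actual setup:)

   # capture VM activity once per second so the last samples before the
   # OOM kill survive on another terminal or over ssh
   vmstat 1 | tee /tmp/vmstat.log &

   # to get the kernel/OOM messages off the box, boot with a serial
   # console by appending to the kernel line in /boot/grub/grub.conf:
   #   console=tty0 console=ttyS0,115200n8
   # and attach minicom (or similar) on another machine's serial port.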

Comment 3 Jason Baron 2004-12-02 23:39:06 UTC
please verify this on the latest rhel4 beta kernel:

http://people.redhat.com/davej/kernels/RHEL4/RPMS.kernel/

thanks.

Comment 6 Robert Hentosh 2004-12-07 17:20:14 UTC
This problem still occurs on 2.6.9-1.849_ELsmp.  It definitely seems 
to me to be the OOM going wild.  Right before it goes serial the 
memory usage seems fine.  I am attempting to gather the logs.

Comment 7 Robert Hentosh 2004-12-07 20:04:22 UTC
Created attachment 108064 [details]
OOM on 2.6.9-1.860_ELsmp during mkfs on usb 2.0  HD

Comment 8 Robert Hentosh 2004-12-07 23:21:31 UTC
Issue does not appear on stock 2.6.9 kernel.

Comment 9 Robert Hentosh 2004-12-09 06:09:41 UTC
**Initial** testing reveals that this issue is fixed by Rik's patch 
in BZ#133858 (on 12/04/2004)

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=133858


Comment 10 Robert Hentosh 2004-12-09 06:17:22 UTC
Created attachment 108177 [details]
contains meminfo and slabinfo when oom killer is active.

It seems that the dm layer is allocating a bunch of biovec-1 and bio from the
slab cache. Then the memory is exhausted, so the OOM killer comes out. I
can reproduce this problem easily by formatting an 80GB USB 2.0 HD attached to
a PE1655MC.  I don't need any other stress.

I am not sure if Rik's patch will prevent it in all cases... or just delay the
action.

Comment 11 Robert Hentosh 2004-12-09 06:19:30 UTC
What is really the throttling mechanism when one process does a lot of 
writes to a device? What should stall the writer until those 
writes complete, rather than letting it suck up all the cache?

Comment 12 Robert Hentosh 2004-12-09 17:10:07 UTC
My bad. BZ133858 does not fix this issue.  It was booting an older 
kernel and not the kernel correctly patched with Rik's patch.  The 
patch does not resolve the issue.

Also, if you look at the logs posted in #10, you can see that the 
biovec-1 and bio slabs have around 400,000 active objects.  With the 
kernel.org 2.6.9 kernel, which doesn't have the problem, they are around 
20,000.
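
(As a side note for anyone reproducing this, a quick sketch of how to 
watch those two slab caches grow during the mkfs; the exact 
/proc/slabinfo column layout varies between kernels, so treat this as 
illustrative:)

   watch -n 1 'grep -E "^(bio|biovec-1) " /proc/slabinfo'

On a kernel that behaves, the active-object counts stay around the 
20,000 mentioned above; on the failing kernels they climb past 400,000.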


Comment 13 Robert Hentosh 2004-12-14 16:54:08 UTC
This bug is still present with 2.6.9-1.906_ELsmp


Comment 15 Susan Denham 2004-12-15 15:28:07 UTC
We agree that this isn't pretty.   

We've included the following text in the RHEL 4 release notes, which
you haven't yet seen:

"The initial release of Red Hat Enterprise Linux 4 does not support
USB hard disk drives. However, other USB storage devices, such as
flash media, CD-ROM, and DVD-ROM devices are currently supported."

We're pursuing a fix for this for possible inclusion in an early RHEL 4
update.

Comment 16 Robert Hentosh 2004-12-15 16:20:16 UTC
I would agree with your release notes if the root-cause was found to 
be only tied to large USB storage devices. I am not confident that it 
is at this time.


Comment 19 Robert Hentosh 2004-12-15 20:26:00 UTC
Created attachment 108649 [details]
Reproduced without USB.

This issue has now been reproduced without USB being present.  The system is a
PE700 with a CERC SATA 6CH card.  Two mkfs's were run simultaneously with an
increased number of inodes (-N with more than 40,000,000)

Comment 20 Rik van Riel 2004-12-15 21:01:14 UTC
Looks like we need the lower zone protection hack we've got in RHEL3,
ensuring that allocations with __GFP_HIGHMEM do not come from the DMA
zone, on a system with highmem present.

Active:45841 inactive:211029 dirty:642 writeback:194444 unstable:0
free:690536 slab:22839 mapped:45294 pagetables:928
DMA free:16kB min:16kB low:32kB high:48kB active:0kB inactive:11500kB
present:16384kB
protections[]: 0 0 0
Normal free:2256kB min:936kB low:1872kB high:2808kB active:0kB
inactive:768776kB present:901120kB
protections[]: 0 0 0
HighMem free:2759872kB min:512kB low:1024kB high:1536kB
active:183364kB inactive:63968kB present:3014400kB
protections[]: 0 0 0
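
(If the lower-zone-protection knob Rik refers to is exposed on this 
kernel, it can be inspected and raised from userspace; the tunable name 
and its availability here are assumptions to verify, and the 
protections[] rows of 0 above suggest it is currently disabled:)

   cat /proc/sys/vm/lower_zone_protection
   echo 100 > /proc/sys/vm/lower_zone_protection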

Comment 21 Robert Hentosh 2004-12-15 21:16:57 UTC
Thanks for looking at this Rik.

Also, I just tested 2.6.9-1.906_EL_exitfixsmp and it didn't resolve 
the issue.


Comment 22 Tim Burke 2004-12-15 21:42:14 UTC
Robert,

Regarding comment #19.  Is the only "load" you put on the system the 2
simultaneous `mkfs` commands?  Or is thrasher or other stress running at the
same time?


Comment 23 Rogelio Noriega 2004-12-15 22:08:43 UTC
No that was the only load... other than running it under X windows.
The tar ball should have a PS listing.

Comment 24 Rogelio Noriega 2004-12-15 22:10:52 UTC
Tim,

Comment #23 was submitted by me. I was just reading the issue using
Rogers workstation.

-- Robert Hentosh.

Comment 25 Rik van Riel 2004-12-15 22:54:19 UTC
It looks like one of those ludicrously easy to introduce threshold
errors that tend to plague the VM on a regular basis. Basically, the
block device cache only uses lowmem (AFAIK) so most of highmem is free. 

It looks like 80% of the lowmem zone is in writeback state, leaving
not many pages around that the VM could reclaim. Unfortunately, this is
only around 20% of total system memory, or half of the 40% required to
make the systemwide writeback throttling work right. This means that
the writers (mkfs) can continue dirtying pages without any throttling,
pushing the VM right off a cliff.

I guess this means that the VM should deal with this situation,
instead of relying on writeback throttling...


Comment 26 Robert Hentosh 2004-12-16 00:17:36 UTC
I am not sure why I am seeing the following behavior... but if one 
goes and does a 

   dd if=/dev/zero of=/dev/sdb1 bs=512 count=78116000

everything works fine... but... 

   dd if=/dev/zero of=/dev/sdb1 bs=4096 count=9764500

fails and causes the OOM to go postal. Using 512 as a block size 
keeps the bio and biovec-1 counts in /proc/slabinfo below 200, while 
using 4096 causes them to blow up to over 400,000 active objects.

BTW, if you look at /proc/slabinfo it seems that the <batchcount> and 
<limit> field names under tunables are swapped. (Refer to mm/slab.c, 
functions s_start() for the header and s_show() for the data 
seq_printf.)


Comment 27 Rik van Riel 2004-12-16 03:52:40 UTC
Robert,

I've combed through the VM code but can't see an easy workaround for
this bug that won't risk introducing big regressions elsewhere in
the VM.  Because of this, I would like to ask you if you could please
try lowering /proc/sys/vm/dirty_ratio to something like 10% and see if
the bug is still reproducible.
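
(For reference, a sketch of how to apply that; the change does not 
persist across reboots unless also added to /etc/sysctl.conf, and the 
default of 40 is an assumption based on comment #25:)

   cat /proc/sys/vm/dirty_ratio           # typically 40 on this kernel
   echo 10 > /proc/sys/vm/dirty_ratio
   # or equivalently: sysctl -w vm.dirty_ratio=10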



Comment 31 Rik van Riel 2004-12-16 16:51:17 UTC
Robert, 

I have another question for you, trying to narrow down the bug
further. Can you reproduce the bug when you boot the system with
"mem=1G" ?

Comment 33 Robert Hentosh 2004-12-16 18:46:10 UTC
With mem=1G I do not see the problem.  The bio* slabs entries are 
still around 400000. I will try lowering the dirty_ratio.

Comment 34 Robert Hentosh 2004-12-16 19:15:10 UTC
So far I am unable to reproduce with /proc/sys/vm/dirty_ratio lowered 
to 10%.


Comment 35 Rik van Riel 2004-12-16 21:22:33 UTC
Robert,

I am not 100% positive yet whether my test patch fixes the bug, but I
would invite you to try and reproduce the bug with my test kernels
anyway. You can get test kernels, and the patch that produces them,
from my people page:

http://people.redhat.com/riel/.test-kernel-bug-141173/

If this kernel resolves your issue, I will also make sure the fix gets
merged into the upstream kernel.

Comment 37 Robert Hentosh 2004-12-16 22:05:25 UTC
Created attachment 108749 [details]
The last patch failed

Nope. It didn't pass.  Attached is the output of dmesg.

Comment 41 Susan Denham 2004-12-17 14:23:14 UTC
Sent to Dale earlier today in email:

Dale:

As you saw by Robert's most recent reply last evening, Rik's patch
didn't resolve the problem and he's working on another one.  However,
to keep as many things moving forward as possible -- and although we
do indeed have the exact configuration you're seeing this failure on
-- can you box up one of those systems TODAY and overnight it to us in
Westford, attention Joe Connolly?   If seeing for ourselves lets us
crack the problem, then it was worth the effort.

Could this be a BIOS problem?  We're running out of
possibilities to explain this....  Hence getting the actual hardware
in our hands.

Thanks,
Sue

Comment 44 Rik van Riel 2004-12-17 19:41:52 UTC
Robert,

I have created a test kernel with an additional patch
(rhel4-sc-congested.patch) as well as yesterday's patch.  Could you
please try out
http://people.redhat.com/riel/.test-kernel-bug-141173/kernel-smp-2.6.9-1.906_ELcongest.i686.rpm
 ?

I'm currently compiling the up and hugemem kernels of this set, too,
and will upload them later. Initial test results in our lab show
positive results, but I'm sure this new kernel could also use a good
smoke test...

Comment 45 Rik van Riel 2004-12-17 19:49:41 UTC
Reassigning to myself, since the bug doesn't look very USB related, at
least not these layers of it ...

Comment 46 Robert Hentosh 2004-12-17 20:30:37 UTC
The ELcongest kernel has passed the first test.

So far so good.


Comment 49 Amit Bhutani 2004-12-20 14:56:28 UTC
A PE420SC running the 906_ELcongest kernel and Thrasher lost its 
video over the weekend. You can see the "X" printed on the screen and 
some garbage characters, but the system is hung for all practical 
purposes. Had to do a physical reboot. This one did not have an 
attached serial console. Restarting the test with serial console. 

Comment 50 Robert Hentosh 2004-12-21 14:44:33 UTC
Created attachment 108947 [details]
Failure with Congest kernel patch (dmesg output)

Well, tests were running fine up to Saturday. Further testing on Monday
resulted in failures.  I am not sure what variable changed between the two. I
don't see any except that the machines were sitting idle over Sunday.  Here
is the dmesg output from the latest kernel; this system was just running
"mkfs.ext2 /dev/sdb1". It was a PE1655MC with a USB drive attached.

Comment 51 Rik van Riel 2004-12-21 16:09:20 UTC
OK, I guess that means I "forgot to clean a spot" and there are a few
more changes needed to the VM.  Andrew Morton already gave us another
one-line change yesterday (on lkml) and I will audit the code further
to see what else needs changing.

I will build you a new test kernel later today.

Comment 52 Rik van Riel 2004-12-21 18:35:37 UTC
OK, the new test kernel is ready.  Could you please try
kernel-smp-2.6.9-1.906_EL.congest2.i686.rpm from
http://people.redhat.com/riel/.test-kernel-bug-141173/ ?

Note that due to the urgency of this issue, we have not done internal
testing on this kernel yet, but there is only one line difference from
the previous kernel, so...

Comment 53 Amit Bhutani 2004-12-21 22:25:49 UTC
No kernel-smp-2.6.9-1.906_EL.congest2.i686.rpm in 
http://people.redhat.com/riel/.test-kernel-bug-141173/ ;-(

Last file posted to that URL was Dec 17.

Comment 54 Rik van Riel 2004-12-21 22:29:24 UTC
Oops, I put them in the wrong directory.  They should be there now...

Comment 55 Amit Bhutani 2004-12-22 00:05:23 UTC
So far so good with 906_EL.congest2. 

mkfs.ext3 on the external USB HD in an infinite loop with a sleep of 5 
secs between each iteration seems to have *not* invoked OOM yet. The 
test has only run for 15 mins though. 'free' is showing that not a lot 
of memory is being consumed by the single mkfs command. Based on that, 
I have gone ahead and created two additional partitions on the same 
USB HD and am formatting (mkfs.ext3) them in parallel in an infinite 
loop as well. Will update with results later tonight.

Comment 56 Amit Bhutani 2004-12-22 15:35:19 UTC
No OOM invocations yet! The test is running fine with the three 
mkfs.ext3 processes chugging along. I even added an up2date in an 
infinite loop before leaving last night.

I do notice that swap is turned off on this system though. That should 
not matter, right?

Comment 57 Rik van Riel 2004-12-22 15:42:29 UTC
Having swap turned off should make OOM kills _more_ likely, not less.  Your test
results are great news, let's see if the system can go another day without OOM
kills...

Comment 59 Amit Bhutani 2004-12-23 16:56:07 UTC
Yuck! Kernel panic on the screen when I inspected the system this 
morning. No serial console attached, but I will upload a screenshot 
in a moment, as that is all I have to offer right now.

The (partial) trace of the panic appears to be strikingly similar to 
the panic induced by Thrasher in bugzilla 141284.

Comment 60 Amit Bhutani 2004-12-23 16:57:44 UTC
Created attachment 109093 [details]
OOPS as described in comment #59

Comment 61 Amit Bhutani 2004-12-23 17:14:12 UTC
SysRq was enabled but was not responsive. Rebooted the machine and, on 
examining the syslog, it appears that the OOM killer was invoked twice, 
killing httpd both times, about an hour apart. Will attach the syslog 
in a minute.

Comment 62 Amit Bhutani 2004-12-23 17:15:39 UTC
Created attachment 109095 [details]
syslog of system that had OOPS

Comment 65 Rik van Riel 2004-12-23 20:53:58 UTC
OK, so we have 660 MB of pages in writeback stage, and still get an
OOM kill.  In fact, it looks like all the inactive pages are under
writeback...

I will audit the VM code to try and find out what might cause this bug
to trigger.

Comment 66 Rik van Riel 2004-12-28 14:24:05 UTC
I have added another patch (rhel4-vm-extraround.patch) and uploaded
new test kernels (congest3) to
http://people.redhat.com/riel/.test-kernel-bug-141173/

Comment 67 Amit Bhutani 2004-12-30 17:54:49 UTC
India test folks have been able to reproduce the OOM invocation part 
of the issue with the latest "congest3" kernel. I have requested the 
syslog messages and also details on how long it took to reproduce the 
issue, etc. Here is what information I have at the moment:
System: PE1655 (Shredder)
External USB HD
Kernel: 2.6.9-1.906_EL.congest3
Run Level: 5
mkfs.ext2 -N 40000000 /dev/sdb1
mkfs.ext2 -N 40000000 /dev/sdb2

Continuing to let the test run to see if it eventually panics. 
Meanwhile regressing the same issue with the congest3 kernel in our 
setup here in Austin as well.

Comment 69 Robert Hentosh 2005-01-03 18:58:01 UTC
Created attachment 109268 [details]
message log of congest3 kernel failure.

The congest3 kernel was tested and resulted in failures. Here is the
/var/log/messages file from that machine. Please note that there is more
than one test run in the log file; when you see the oom message in the log,
double-check that it was actually from the congest3 kernel.

Comment 70 Rik van Riel 2005-01-03 19:30:44 UTC
Robert,

this morning I found another thing in vmscan.c that might be wrong,
and just finished compiling a kernel rpm (congest4) with the patch
rhel4-nr_scanned-count-writeback.patch, which you'll be able to find on:

http://people.redhat.com/riel/.test-kernel-bug-141173/

Comment 71 Robert Hentosh 2005-01-03 20:37:38 UTC
Created attachment 109277 [details]
Failure in Austin on congest3 after a few days

Here is the log file for a congest3 failure after a few days of running the
mkfs test.

Comment 73 Robert Hentosh 2005-01-05 01:29:20 UTC
Just an observation noted with congest4.  We started the test of 
`mkfs.ext2 /dev/sdb1` on console F1 and `mkfs.ext2 /dev/sdb2` on 
console F2 and `mkfs.ext2 /dev/sdb3` on console F3. Each mkfs was in 
a loop to restart after it was complete.  The test ran without the 
OOM killer appearing for over 20 hours.  However, it was noted that 
the first one, started on /dev/sdb1, didn't complete its first 
iteration. The other mkfs's completed and continued to format.  I 
killed the other two and the first one still didn't complete after 
10 minutes.  Checking iostat after the other formats were killed, 
with only the one on F1 running, the I/O throughput to /dev/sdb had 
dropped to zero.

I re-ran this test under the 1-910 hugemem kernel and it behaved 
similarly, i.e. the OOM killer didn't appear and the first 2 mkfs's 
were hung.

We also have another issue that was reported elsewhere that occurs 
when doing a `cat /dev/zero > /mnt/largedisk/tempfile` where the 
partition size is > 1TB.  The system eventually hangs, and examining 
the /var/log/messages file discloses that the OOM killer was invoked.

Comment 75 Susan Denham 2005-01-05 14:35:01 UTC
Robert:  re: #73:  Dale says that the "other issue reported elsewhere"
to which you refer is coming from the Dell Storage team.  Are we going
to see more details on what they're seeing (log file, etc.)?



Comment 76 Dale Kaisner 2005-01-05 14:38:28 UTC
Yes, Robert's comments are referring to the "other issue."  Based on 
information gathered overnight, that issue does not appear to be 
directly related to this bugzilla.

Comment 77 Robert Hentosh 2005-01-06 18:34:23 UTC
Created attachment 109434 [details]
Other issue failure

Contains the messages file of the failure on the "other issue"

Comment 78 Robert Hentosh 2005-01-06 19:03:43 UTC
I have been able to reproduce the "other issue" reported overseas.  I 
installed a PE2850 with 6GB of RAM and attached a CX700 SAN with a 
~1TB LUN.

I mounted the SAN at /bigdisk after formatting it with 
`mkfs.ext2 /dev/sdd1`. I then ran the following command:

  while true
  do 
    dd if=/dev/zero of=/bigdisk/bigfile bs=1M count=1024000
  done

That command ran overnight without issue.  Curious to see the 
throughput on the system, I then ftp'd over the sysstat rpm file and 
installed it.  I then ran `iostat 5` in another console window and 
noted that I was averaging about 300MB/s writes to the SAN.  

Noting that the log file from the original failure I was able to 
reproduce on the pre-rc2 kernel might be useful to Red Hat, I then 
took a floppy and mounted it.  At that instant I received an oom-
killer message.  The log file is attached above with both the failure 
from the pre-rc2 kernel and the latest 2.6.9-4 kernel.


Comment 79 Tim Burke 2005-01-07 00:58:02 UTC
Additional experiments we discussed on the phone to gather more datapoints:

- instead of doing if=/dev/zero, try a "real" input device, ie another disk.
- What if you vary the bs?  half and/or 3/4 the size.


Comment 80 Robert Hentosh 2005-01-07 15:47:46 UTC
Ran several tests last night.. all on the -4 kernel.

- PE2850 with 6GB attached to ~1TB SAN.   I took this configuration 
and repartitioned the SAN LUN to 3 approximately equal partitions.  I 
had to simultaneous cp's of large files on 1 and 3 being copied to 
partition 2. That ran overnight without issue.

- I had 4 PE1655MC blades each attached to a USB HD (120GB). Each one 
ran one single mkfs (not in a loop) and one of the four blades had an 
oom-killer. I will attach log after this comment.

- In remote testing we had a PE1655MC with one USB HD with 2 80GB 
partitions doing 2 simultaneous mkfs's. It resulted in the following 
oom-kill:

Kernel-2.6.9-4.ELsmp
Jan  7 08:27:15 shredder-1-2 kernel: oom-killer: gfp_mask=0xd0
Jan  7 08:27:15 shredder-1-2 kernel: DMA per-cpu:
Jan  7 08:27:15 shredder-1-2 kernel: cpu 0 hot: low 2, high 6, batch 1
Jan  7 08:27:15 shredder-1-2 kernel: cpu 0 cold: low 0, high 2, batch 1
Jan  7 08:27:15 shredder-1-2 kernel: cpu 1 hot: low 2, high 6, batch 1
Jan  7 08:27:15 shredder-1-2 kernel: cpu 1 cold: low 0, high 2, batch 1
Jan  7 08:27:15 shredder-1-2 kernel: Normal per-cpu:
Jan  7 08:27:15 shredder-1-2 kernel: cpu 0 hot: low 32, high 96, batch 16
Jan  7 08:27:15 shredder-1-2 kernel: cpu 0 cold: low 0, high 32, batch 16
Jan  7 08:27:15 shredder-1-2 kernel: cpu 1 hot: low 32, high 96, batch 16
Jan  7 08:27:15 shredder-1-2 kernel: cpu 1 cold: low 0, high 32, batch 16
Jan  7 08:27:15 shredder-1-2 kernel: HighMem per-cpu:
Jan  7 08:27:15 shredder-1-2 kernel: cpu 0 hot: low 32, high 96, batch 16
Jan  7 08:27:21 shredder-1-2 kernel: cpu 0 cold: low 0, high 32, batch 16
Jan  7 08:27:21 shredder-1-2 kernel: cpu 1 hot: low 32, high 96, batch 16
Jan  7 08:28:02 shredder-1-2 kernel: cpu 1 cold: low 0, high 32, batch 16
Jan  7 08:28:02 shredder-1-2 kernel:
Jan  7 08:28:02 shredder-1-2 kernel: Free pages:     1022784kB (1022784kB HighMem)
Jan  7 08:28:02 shredder-1-2 kernel: Active:31104 inactive:201824 dirty:185930 writeback:9511 unstable:0 free:255696 slab:26826 mapped:32478 pagetables:747
Jan  7 08:28:02 shredder-1-2 kernel: DMA free:0kB min:16kB low:32kB high:48kB active:68kB inactive:8452kB present:16384kB
Jan  7 08:28:02 shredder-1-2 kernel: protections[]: 0 0 0
Jan  7 08:28:02 shredder-1-2 kernel: Normal free:0kB min:936kB low:1872kB high:2808kB active:0kB inactive:772856kB present:901120kB
Jan  7 08:28:02 shredder-1-2 kernel: protections[]: 0 0 0
Jan  7 08:28:02 shredder-1-2 kernel: HighMem free:1022784kB min:512kB low:1024kB high:1536kB active:124348kB inactive:25988kB present:1179520kB
Jan  7 08:28:02 shredder-1-2 kernel: protections[]: 0 0 0
Jan  7 08:28:02 shredder-1-2 kernel: DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
Jan  7 08:28:02 shredder-1-2 kernel: Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
Jan  7 08:28:02 shredder-1-2 kernel: HighMem: 34*4kB 21*8kB 109*16kB 168*32kB 51*64kB 3*128kB 22*256kB 11*512kB 5*1024kB 4*2048kB 241*4096kB = 1022784kB
Jan  7 08:28:02 shredder-1-2 kernel: Swap cache: add 0, delete 0, find 0/0, race 0+0
Jan  7 08:28:02 shredder-1-2 kernel: Out of Memory: Killed process 2033 (cannaserver).

- Also on a PE700 with several simultaneous mkfs's on SATA drives, 
the oom-killer was not seen in the log. But while in X Windows some 
applications couldn't be launched (like gnome-terminal, Firefox, 
GIMP, the calculator, etc.). The pointer would change to an hourglass 
and then back to a pointer without any application being launched, 
while other apps could be launched (like Disk Management and Hardware 
Browser).



Comment 81 Jay Turner 2005-01-07 16:00:21 UTC
Robert . . . what mkfs command did you run specifically in the second case above
(Shredder with single mkfs)?  Thanks.

Comment 82 Robert Hentosh 2005-01-07 16:14:50 UTC
The command used in the four shredders to format the USB HD 
was 'mkfs.ext2 /dev/sdb1'  only one instance was excuted on each 
blade.


Comment 83 Robert Hentosh 2005-01-07 16:24:44 UTC
Created attachment 109474 [details]
Shredder single mkfs failure.

Comment 84 Tim Burke 2005-01-07 21:06:11 UTC
Robert - on the mkfs failures to the USB disks... can you help describe the
frequency of failure in more detail?  For example:

- does it repeatably happen?
- if it fails, is it from running that mkfs command once?  Or is it in a loop
which fails?  If so, how long does it have to run till an oom kill is
encountered?  How many successful mkfs complete prior to the failure?
- Are you seeing random failures on each of the blades?  or is only 1 of them
problematic?
- Is it safe to assume that this mkfs failure is only exhibited on large USB
devices?

Since we didn't have any USB disks, we ran out to CompUSA and got some, which we
are now hooking up.


Comment 85 Jeff Burke 2005-01-10 14:23:34 UTC
Created a test script that would run "mkfs.ext2 /dev/sdb1" in a
loop. /dev/sdb is the Iomega 120GB USB external storage device. This
test ran successfully 500 times.  It was run with the following hardware:
Dell Power Edge 1850 (Service Tag: GG0Y761, Express Service code:
35797571785), (2)3.6G CPU w/HT, 4 GB ram, Iomega 120GB USB Storage.
RHEL 4, 2.6.9-5.ELsmp.
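
(For completeness, a sketch of the kind of bounded loop used here; the 
log path, sleep interval, and iteration count are illustrative, and 
/dev/sdb1 is the USB partition as above:)

   #!/bin/bash
   for i in $(seq 1 500); do
       echo "=== iteration $i: $(date)" >> /var/log/mkfs-loop.log
       mkfs.ext2 /dev/sdb1 >> /var/log/mkfs-loop.log 2>&1 || break
       sleep 5
   done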



Comment 86 Robert Hentosh 2005-01-10 14:31:41 UTC
To answer your questions. 

- At first it was very repeatable. We couldn't get a disk formatted without it 
invoking the OOM killer. With the latest kernel, 2.6.9-4smp, we now have to put 
the mkfs in a loop so that another mkfs will start after the first one completes.  
This will run overnight on 4 boxes and then 2 of the boxes will fail within 
12 hours.  The distribution of when the failure occurs appears to be even, so 
we have seen failures within 5 mins and others occur after several days.

- We have seen it fail within a single mkfs on 2.6.9-4smp that was not in a 
loop. As described above, generally our tests are in a loop.  Again, the length 
of time seems to be an even distribution.  There are somewhere between 0 and 
>10 successes before a failure occurs.

- Yes.  The failures occur on all of the blades running the test and the 
distribution of the failures appears to be even.  No one blade appears to fault 
more than the others.

- It seems that the drive (actually the partition being formatted) has to be 
greater than 40GB before the problem is triggered. It was not a test focus, 
though, to see what the threshold is.  But the smaller drives seem to complete 
quickly and larger drives hit a performance drop that causes the format to take 
much longer.

We had to buy USB drives to expand our testing.  We purchased the cases with 
the IDE to USB/FireWire converter and the IDE drives separately.  This allowed 
us to get 120GB drives, which were larger and cheaper than the pre-packaged USB 
HDs.  We wanted to see if the much larger drives would cause the problem to 
occur more frequently.  It didn't seem to increase the failures per unit time, 
but in that case it increased the failures per mkfs iteration.

Comment 89 Tim Burke 2005-01-11 13:41:53 UTC
Question for Dell to help get more datapoints...

We have been unable to reproduce the problem on hardware from other IHVs.  This
makes me wonder if certain storage adapters could be more vulnerable than others.
For example, would you hit the same issue encountered on the 2TB storage failure
if a Qlogic adapter was used?  We are pursuing various combinations here to 
see if a pattern can be established.


Comment 90 Robert Hentosh 2005-01-11 16:07:33 UTC
Created attachment 109615 [details]
Hardware configuration for SAN testcase

Here is the lspci output of the PE2850 box attached to a SAN where we have seen
the problem.  The Fibre channel card is an EMULEX.

Comment 91 Robert Hentosh 2005-01-11 17:45:27 UTC
Created attachment 109625 [details]
2.6.9-4smp hang with single mkfs.ext in loop

This is the messages log file from a PE1655MC attached to a USB drive doing a
mkfs of the entire USB drive in a loop. The last mkfs eventually hung; it was
given about 30 minutes to see if progress was made, then a Ctrl-Z was typed on
the screen with the mkfs running.  That was given 15 minutes to suspend. At that
time I turned on sysrq and then took the SysRq-T/M dumps that appear in the
attached messages file.  After more waiting (more than 10 minutes) the mkfs
finally suspended.

Comment 92 Robert Hentosh 2005-01-11 17:51:29 UTC
I need to mention that the issue seen first with a 1TB filesystem was done not 
on a fibre channel card.. but a RAID card. (PERC4e)  It had more than 1TB of 
SCSI disks attached to it. Since I was unable to get that in SCSI 
configuration i reproduced the problem on the SAN locally.

So the issue has been witnessed on SATA (comment #19), USB, SCSI RAID (comment 
#77), and Fibre channel SAN (comment #78).



Comment 93 Robert Hentosh 2005-01-11 20:34:19 UTC
Created attachment 109638 [details]
Failure with 2.6.9-5smp

Failure on a PE1655MC (shredder) with USB HD after running 3 hours and 50 mins.
Failure induced with the following:

   while true
   do
      mkfs.ext2 /dev/sdb1
   done

Two other blades at this time are still running.

Comment 94 Tim Burke 2005-01-11 21:25:30 UTC
Last week on a con-call it was discussed that Dell would try to use a "real"
input device in the `dd` test, rather than /dev/zero (ref comment #79).  Any
luck taking if= another disk?


Comment 95 Susan Denham 2005-01-11 21:36:53 UTC
Dell had asked whether running a single mkfs.ext or dd if=/dev/zero in
a loop would create cumulative stress on the kernel that would explain
the OOM kill scenario.  At this time, we haven't established a firm
pattern that would indicate a cumulative effect and are expanding our
test scenarios and broadening our examination to other kernel
subsystems such as the I/O path.   

Comment 96 Robert Hentosh 2005-01-11 22:57:11 UTC
(In reply to comment #94)
> Last week on a con-call it was discussed that Dell would try to use a "real"
> input device in the `dd` test, rather than /dev/zero (ref comment #79).  Any
> luck taking if= another disk?

We ran a test using cp instead of dd and were unable to produce a failure in 
that configuration.


Comment 97 Jeff Burke 2005-01-12 15:22:12 UTC
Update:
1.) Testing dd with LVM using command from Comment #26
        Started:  Jan 11,13:05
        Test:     `dd if=/dev/zero of=/dev/VolGroup00/LogVol02 bs=4096
count=9764500` in a loop.
        Status:   This test is still running
        Hardware: Dell PowerEdge 1800, Service Tag:CNQD961
        OS:       RHEL4 RC2, 2.6.9-5.ELsmp

2.) Testing dd with partitions > 1TB
        Started:  Jan 11,13:20
        Test:     `dd if=/dev/zero of=/bigdisk/bigfile bs=1M count=1024000` in a
loop.
        Status:   This test is still running
        Hardware: Dell PowerEdge 2850, Service Tag:3054961, EMULEX HBA attached
to a Winchester Systems SAN
        OS:       RHEL4 RC2, 2.6.9-5.ELsmp

3.) Testing with USB storage device.
        Started:  Jan 11,13:47
        Test:     `mkfs.ext2 -N 40000000 /dev/sdb1` in a loop. 
        Status:   This test is still running
        Hardware: Dell PowerEdge SC1425, Service Tag:9CGQC61, USB 1.1 Hub attached
to an Iomega USB 120GB external storage device.
        OS:       RHEL4 RC2, 2.6.9-5.ELsmp

4.) Testing Bonnie (http://www.textuality.com/bonnie/) Recommended by Tom  Coughlan
        Started:  Jan 11,15:10
        Test:     `./Bonnie -d /spare/scratch -s 2047 -m pe1850.$num` in a loop.
        Status:   This test is still running
        Hardware: Dell PowerEdge 1850, Service Tag:GG0Y761
        OS:       RHEL4 RC2, 2.6.10 upstream Kernel

Other testing activities:
        FWIW: I had `dd if=/dev/zero of=/dev/VolGroup00/LogVol02 bs=4096
count=9764500` in a loop running on non-Dell hardware for approximately 2 weeks;
the kernel was Rik van Riel's congest3 kernel. It was still running when I
stopped it yesterday evening.

Comment 98 Rik van Riel 2005-01-13 00:30:38 UTC
Larry Woodman spotted a few more subtle bugs in vmscan.c, so I built
and uploaded a -congest5 kernel today.  This is based on 2.6.9-5.EL +
Larry's patch.

As usual, you can get it from
http://people.redhat.com/riel/.test-kernel-bug-141173/

Comment 99 Robert Hentosh 2005-01-13 15:03:29 UTC
I have started 2 tests with the congest5 kernel just now.  I 
didn't have access to the machines until now.

I have also done the following test: I added the following line 
at the end of /etc/rc.local:

  sh -c "sleep 1m; mkfs.ext2 /dev/sdb1; sleep 1m; reboot" &

There was some speculation that this issue wouldn't appear if the 
mkfs was not in a loop.  This was to verify that assertion. Failures 
were encountered with the 2.6.9-4 kernel.  I will attach the log in 
my next post.  2 machines were tested in this manner.  The one with 
the larger drive (160GB) produced more failures, but the 80GB USB HD 
did as well.


Comment 100 Robert Hentosh 2005-01-13 15:10:36 UTC
Created attachment 109715 [details]
failure on single mkfs without "loop"

Out of 26 mkfs's separated by system reboots, 7 triggered the oom
killer.

Comment 101 Robert Hentosh 2005-01-13 15:29:14 UTC
Created attachment 109716 [details]
congest5 failure

Congest5 failure.

Comment 102 Robert Hentosh 2005-01-13 15:33:36 UTC
The congest5 test case on a PE1655MC with a 160GB USB HD.  I put it 
in a mkfs loop but it didn't complete the first iteration.  I will 
put it into a reboot loop as described in comment #99 and see if it 
has better odds than the -5 unpatched kernel.

Comment 103 Jeff Burke 2005-01-13 19:40:54 UTC
Update:
      We (Red Hat) have received the chassis, dual Shredder blades, and 2 USB
External storage devices. I have started the `mkfs.ext2 -N 40000000 /dev/sdb1`
in a loop.  Both Shredder blades are running the exact same test. The kernel
version is 2.6.9-5.ELsmp.

      Note: Tests 1-3 from Comment #97 are still running without error.  I
stopped test 4 to load the -5 Congest5 kernel. I have started that test back up.




Comment 104 Jeff Burke 2005-01-17 15:48:05 UTC
Update:
      Tests 1-4 from Comment #97 are still running without error.

      On 1 of the 2 Shredder blades I have seen the OOM Kill happen very
intermittently. The only time I have seen the issue happen is when I changed the
test to reboot after doing a `mkfs.ext2 /dev/sdb1` as per Comment #99.  The test
running mkfs in a loop never exhibited the issue.

      The kernel version is 2.6.9-5.ELsmp.  I have spoken to Larry Woodman.  He
is going to continue to debug this issue. 

Comment 106 Robert Hentosh 2005-01-19 16:20:12 UTC
Using the reboot loop on 2 shredder blades yielded the following results.  The 
reboot loop was accomplished by adding the following line to /etc/rc.local:

   sh -c "sleep 1m; mkfs.ext2 /dev/sdb1; sleep 1m" &

Blade1 was running the -5smp kernel. Out of 41 cycles only 2 triggered the oom 
killer.

Blade2 was running the -5congest kernel. Out of 14 reboots 14 failed.

Looking at the SATA and SAN system configurations, I have been unable to get 
them to fail on the -5smp kernel with either single instances of mkfs or 
multiple mkfs's or dd's.


Comment 107 Larry Woodman 2005-01-19 16:47:03 UTC
I'm pretty sure I found the problem.

In function get_dirty_limits() we do exclude highmem from the calculation
when the request is for lowmem:
---------------------------------------------------------------------------
#ifdef CONFIG_HIGHMEM
        /*
         * If this mapping can only allocate from low memory,
         * we exclude high memory from our count.
         */
        if (mapping && !(mapping_gfp_mask(mapping) & __GFP_HIGHMEM))
                available_memory -= totalhigh_pages;
#endif
---------------------------------------------------------------------------

But we use total_pages instead of available_memory for the calculation:

---------------------------------------------------------------------------
        background = (background_ratio * total_pages) / 100;
        dirty = (dirty_ratio * total_pages) / 100;
---------------------------------------------------------------------------
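
(To see why that matters, a rough back-of-the-envelope illustration with
assumed round numbers, not taken from any log above: on a box with ~4GB
total but only ~900MB of lowmem, a 40% limit computed against total_pages
allows more dirty/writeback pages than the whole lowmem zone holds, so a
lowmem-only writer like mkfs is never throttled:)

   # illustrative only; page counts are hypothetical
   total_pages=1000000        # ~4GB of 4kB pages
   lowmem_pages=230000        # ~900MB of lowmem
   dirty_ratio=40
   echo "buggy limit: $(( dirty_ratio * total_pages  / 100 )) pages"   # 400000 - more than all of lowmem
   echo "fixed limit: $(( dirty_ratio * lowmem_pages / 100 )) pages"   # 92000  - throttles in time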

The following patch fixes this problem and we no longer see the OOM kills:
----------------------------------------------------------------------------
--- linux-2.6.9/mm/page-writeback.c.orig
+++ linux-2.6.9/mm/page-writeback.c
@@ -167,8 +167,8 @@ get_dirty_limits(struct writeback_state
        if (background_ratio >= dirty_ratio)
                background_ratio = dirty_ratio / 2;

-       background = (background_ratio * total_pages) / 100;
-       dirty = (dirty_ratio * total_pages) / 100;
+       background = (background_ratio * available_memory) / 100;
+       dirty = (dirty_ratio * available_memory) / 100;
        tsk = current;
        if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
                background += background / 4;
----------------------------------------------------------------------

I'm building an official kernel for Dell to test with now.

Larry Woodman

Comment 108 Larry Woodman 2005-01-20 15:26:07 UTC
The kernel to test with that includes the above patch is in:

>>>http://people.redhat.com/~lwoodman/RHEL4/


Larry Woodman


Comment 109 Robert Hentosh 2005-01-20 16:41:08 UTC
Initial tests the kernel from comment #108 are promising.


Comment 110 Russell C. Woodland 2005-01-20 18:01:18 UTC
Reviewed record to see progress from Red Hat and Dell.

Comment 111 Jeff Burke 2005-01-20 18:17:35 UTC
I have been running the Shredder test from Comment #106 with the
kernel from Comment #108 since yesterday at 11:45.  I have *not* seen
the OOM condition since loading the kernel from Comment #108.  It has
been more than 24 hours.

Comment 112 Robert Hentosh 2005-01-21 15:06:44 UTC
It has been over 20 hours and we have not witnessed an oom kill.  We will 
continue testing over the weekend.

Comment 113 Robert Hentosh 2005-01-25 15:29:32 UTC
The patch does fix this issue.  We are unable to reproduce the 
problem with the patched kernel.  We will regress this issue with 
the "update kernel" when it is available.

Comment 114 Susan Denham 2005-01-25 18:38:21 UTC
Thanks very much for all your exhaustive work, Robert and team.  It's been
a long haul and we greatly appreciate your efforts!   As we've
discussed, you'll see the fix in U1 (and a preliminary U1 kernel will be
available some time over the next few weeks so you can test; more
details about availability of this kernel soon).

Comment 115 Amit Bhutani 2005-01-25 20:08:23 UTC
Changing the title to reflect the Update in which a fix for this 
issue has been committed or is being tracked.

Comment 116 Jeff Burke 2005-02-08 16:18:04 UTC
Robert,
   I have run into an issue with one of the shredder blades loaned to RedHat.  I
have been running the reboot mkfs test on those two systems.  One of the two
systems hangs on reboots.  The message is that the memory size has been changed.
I can either strike a key to continue or go into setup. When I go into setup it
shows the correct amount of memory, 2048MB. I save the configuration and reboot,
and it is fine until the next reboot, when it comes up with that memory size
adjustment error again. 

   Have you run across this before? Any recommendations on how to get around it
or should I send it back to you.

Thanks in advance
Jeff

Comment 117 Robert Hentosh 2005-02-11 17:46:42 UTC
> Have you run across this before? Any recommendations on how to get around it
> or should I send it back to you.
> Thanks in advance
> Jeff


I will probably need the system back to find out why BIOS is not doing that 
correctly ... is the BIOS version recent?  (Does it have an X?? type version 
number?)


Comment 118 Robert Hentosh 2005-02-11 17:50:33 UTC
Created attachment 110978 [details]
/var/log/messages from RH cert test

This is a new failure ... that seems to be strongly related. The oom-killer is
popping up on only one platform that we have been running the Red Hat certification
test on.  Also, it fails on both the RC1 kernel and Larry's patched kernel. 
Both failures are in the log... Larry's kernel is the last run.

Comment 119 Robert Hentosh 2005-02-11 17:55:07 UTC
If you need me to, I will open a different bugzilla for the issue in #118. But 
it seems to be the same... only this happens while running our certifications 
on the Nagano platform (PE1425).  This failure occurred on 12GB of RAM, after 2 
hours of starting the cert.  It didn't fail on a 5GB configuration.  We 
believe it also didn't fail on a SCSI configuration... this failure is on SATA 
drives.

This oom-killer issue now interferes with our ability to do certifications.

Comment 120 Larry Woodman 2005-02-11 18:08:30 UTC
Please close this problem and open up a different bug for the problem you
described in comment 118.  In that case someone has allocated all of the memory
and leaked it!

Normal free:696kB min:936kB low:1872kB high:2808kB active:8300kB inactive:7268kB
present:901120kB

As you can see, only about 16MB out of 900MB of the Normal zone is on any list
and the slabcache only accounts for ~150MB.  Someone allocated ~750MB of lowmem
and never freed it.

Larry Woodman


Comment 121 Robert Hentosh 2005-02-11 19:05:40 UTC
(In reply to comment #120)
> Please close this problem and open up a different bug for the problem you
> described in comment 118.  

Okay, I will open up a new issue then for this new problem. But, I will leave 
this issue open until we receive the kernel drop to regress the older issue.



Comment 122 Robert Hentosh 2005-02-11 19:23:36 UTC
Opened BZ#147832 for comment 118.


Comment 123 Jay Turner 2005-04-16 19:50:17 UTC
Dell, any update here from testing with the latest U1-candidate code?

Comment 124 Byju Michael 2005-04-21 08:52:29 UTC
This problem is not seen in  RHEL4 Update1 Beta.

Comment 125 Jay Turner 2005-04-21 14:15:37 UTC
/me celebrates and moves the bug to PROD_READY.

Comment 126 Amit Bhutani 2005-05-03 22:59:26 UTC
Based on comment #124, closing this issue.

