Bug 99335 - [PATCH] Reboot hangs (not completed) following run of "busy" system stress test
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 2.1
Classification: Red Hat
Component: kernel
Version: 2.1
Hardware: ia64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Jason Baron
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2003-07-17 19:45 UTC by Glen A. Foster
Modified: 2013-03-06 05:55 UTC
CC List: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2003-08-21 17:40:49 UTC
Target Upstream Version:
Embargoed:


Attachments
Test patch to solve problem (479 bytes, patch)
2003-08-04 17:26 UTC, Doug Ledford
New version of test patch that fixes compile problem (484 bytes, patch)
2003-08-04 18:21 UTC, Doug Ledford


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2003:198 0 normal SHIPPED_LIVE Important: kernel security update 2003-08-21 04:00:00 UTC

Description Glen A. Foster 2003-07-17 19:45:06 UTC
Intro: This defect is one of three critical defects found at HP while running
system validation test configurations.  Background information from an internal
HP defect database is pasted below.

Description of problem: Following a run of "busy" (HP-internal system stress
test), a system reboot hung; the only way to gain control was to reset the
system through the Maintenance Processor firmware (see bug 99333).

Grant has tracked it down to the point where the reboot is hanging on processes
in "uninterruptible sleep".

Version-Release number of selected component (if applicable):
RHAS 2.1 update 2 (kernel-smp-2.4.18-e.31)

How reproducible: don't know, intermittent _at worst_

Steps to Reproduce:
1. Run "busy" for ~2 hours, interrupt (halt) the test with <CNTL>C
    
Additional info: Original report and troubleshooting data will be attached.

Comment 1 Glen A. Foster 2003-07-17 19:47:15 UTC
======== ORIGINAL DEFECT INFO ========

Bug reported on: Wednesday 06/04 2003 at 15:22 155
Abe revision: 20.01
-------------------------------------------------------
SYMPTOMS:

Got the following message during X-window shutdown; not able to reboot the
system after getting that message:

waiting for X server to shut down ICE default IO error handler doing an exit(), 
pid = 1668, errno = 0 
.X connection to :0.0 broken (explicit kill or server shutdown).
Gdk-ERROR **: X connection to :0.0 broken (explicit kill or server shutdown).
Gdk-ERROR **: X connection to :0.0 broken (explicit kill or server shutdown).
Gdk-ERROR **: X connection to :0.0 broken (explicit kill or server shutdown).
.Killing X11
Gdk-ERROR **: X connection to :0.0 broken (explicit kill or server shutdown).
. 
Busy Test Done
.Gdk-ERROR **: X connection to :0.0 broken (explicit kill or server shutdown).
.
** WARNING **: 3 scalable icons still left when destroying icon factory

Comment 2 Glen A. Foster 2003-07-17 19:50:45 UTC
Updated on: Wednesday 07/09 2003 at 16:51 190
Abe revision: 20.02
-------------------------------------------------------
UPDATE:

After running busy for about 2h, ^C to stop it.
Then get:
ev01 -> ps -elH n 
  F S   UID   PID  PPID  C PRI  NI ADDR    SZ  WCHAN TTY        TIME CMD
100 S     0     1     0  0  75   0    -   179 49d590 ?          0:09 init
...
000 S     0  1838     1  0  75   0    -   210 42c1b0 ttyS0      0:00   /usr/bin/ksh /busy2/bin/busytest -v -s /busy2/busysetup
000 D     0  5365  1838  0  75   0    -   199 4f0db0 ttyS0      0:00     sync
000 S     0  2167     1  0  75   0    -   206 42c1b0 ttyS0      0:00   /usr/bin/ksh /busy2/bin/timestamp
000 D     0 10681  2167  0  76   0    -   199 4f0db0 ttyS0      0:00     sync
000 S     0  2168     1  0  75   0    -   206 42c1b0 ttyS0      0:00   /usr/bin/ksh /busy2/bin/loadstamp
000 D     0  5364  2168  0  75   0    -   199 4f0db0 ttyS0      0:00     sync
100 D     0  2467     1  0  81   5    -   217 4f0db0 ttyS0      0:00   badblocks -w -b 512 /dev/sdc 17782784
000 S     0 24908     1  0  75   0    -   206 42c1b0 ttyS0      0:00   /usr/bin/ksh /busy2/bin/dmesgstamp
000 D     0 24922 24908  0  75   0    -   199 4f0db0 ttyS0      0:00     sync
ev01 ->

"man ps" says:
D uninterruptible sleep (usually IO)

4f0db0 gets printed as "wait_o" (truncated) when the "n" parameter is dropped.

Willy tells me reboot attempts to kill all processes and then "sync"
the files.

I need help:
o figuring out what the processes are sleeping on
o figuring out why they aren't getting woken up
o someone to troll bugzilla for all RHAS 2.1 "hang" bugs

Comment 3 Glen A. Foster 2003-07-17 19:51:19 UTC
Updated on: Thursday 07/10 2003 at 16:05 191
Abe revision: 20.02
-------------------------------------------------------
UPDATE:

Matthew Wilcox beat me to figuring out how to get the "ps" command to dump
the sleeping processes:

Grant, I figured out how to use ps ;-)

ev01 -> ps -e -o pid,comm,wchan=WIDE-WCHAN-COLUMN        
  PID COMMAND          WIDE-WCHAN-COLUMN
    1 init             schedule_timeout
    2 migration_CPU0   migration_thread
    3 keventd          context_thread
    4 ksoftirqd_CPU0   ksoftirqd
    5 kswapd           schedule_timeout
    6 bdflush          interruptible_sleep_on
    7 kupdated         schedule_timeout
    8 mdrecoveryd      md_thread
   14 scsi_eh_0        down_interruptible
   15 scsi_eh_1        down_interruptible
   16 scsi_eh_2        down_interruptible
   17 scsi_eh_3        down_interruptible
   18 scsi_eh_4        down_interruptible
   19 scsi_eh_5        down_interruptible
   20 scsi_eh_6        down_interruptible
   21 scsi_eh_7        down_interruptible
   22 scsi_eh_8        down_interruptible
   23 scsi_eh_9        down_interruptible
   24 scsi_eh_10       down_interruptible
   25 scsi_eh_11       down_interruptible
   28 kjournald        interruptible_sleep_on
   81 khubd            phys_start
 1065 syslogd          schedule_timeout
 1070 klogd            do_syslog
 1090 portmap          schedule_timeout
 1118 rpc.statd        schedule_timeout
 1251 acpid            schedule_timeout
 1270 sshd             schedule_timeout
 1304 xinetd           schedule_timeout
 1345 sendmail         schedule_timeout
 1364 gpm              schedule_timeout
 1382 crond            schedule_timeout
 1430 xfs              schedule_timeout
 1466 atd              schedule_timeout
 1475 login            wait4
 1476 mingetty         schedule_timeout
 1477 mingetty         schedule_timeout
 1478 mingetty         schedule_timeout
 1479 mingetty         schedule_timeout
 1480 mingetty         schedule_timeout
 1481 mingetty         schedule_timeout
 1484 ksh              ia64_rt_sigsuspend
 1715 oafd             schedule_timeout
 1838 busytest         ia64_rt_sigsuspend
 2167 timestamp        ia64_rt_sigsuspend
 2168 loadstamp        ia64_rt_sigsuspend
 2467 badblocks        wait_on_buffer
 5364 sync             wait_on_buffer
 5365 sync             wait_on_buffer
10681 sync             wait_on_buffer
24908 dmesgstamp       ia64_rt_sigsuspend
24922 sync             wait_on_buffer
20569 ps               -

It's normal for the SCSI EH threads to be sleeping in down_interruptible;
don't worry about that.  Obviously the sync processes in wait_on_buffer
are the worrying ones.

Comment 4 Glen A. Foster 2003-07-17 19:51:41 UTC
Updated on: Friday 07/11 2003 at 17:20 192
Abe revision: 20.02
-------------------------------------------------------
UPDATE:

Tracking this down another step: in this particular hang, only one disk has
IO outstanding, and that's "sdc".  All the other hung processes seem to be
waiting for that IO to complete.

ev01 -> ps -eH -o pid,wchan=WIDE-WCHAN-COLUMN,args
  PID WIDE-WCHAN-COLUMN COMMAND
    1 schedule_timeout  init
...
 1838 ia64_rt_sigsuspen   /usr/bin/ksh /busy2/bin/busytest -v -s /busy2/busysetup -t 4320 -r /busy2/results/res.0709.214108
 5365 wait_on_buffer        sync
 2167 ia64_rt_sigsuspen   /usr/bin/ksh /busy2/bin/timestamp
10681 wait_on_buffer        sync
 2168 ia64_rt_sigsuspen   /usr/bin/ksh /busy2/bin/loadstamp
 5364 wait_on_buffer        sync
 2467 wait_on_buffer      badblocks -w -b 512 /dev/sdc 17782784
24908 ia64_rt_sigsuspen   /usr/bin/ksh /busy2/bin/dmesgstamp
24922 wait_on_buffer        sync


Comment 5 Larry Troan 2003-07-22 16:34:47 UTC
ISSUE TRACKER 26124 OPENED AS SEV 1 -- QU3 BLOCKER.

Comment 6 Doug Ledford 2003-08-04 17:26:46 UTC
Created attachment 93378 [details]
Test patch to solve problem

This patch *should* solve the problem seen here.  It makes sure that, in the
event we have a scsi_malloc() failure on a device with no outstanding
commands, we set that device as starved, so that a command completion on
another device will trigger a requeue on this device.
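
For readers following the thread, here is a minimal, self-contained userspace
sketch of the idea the patch implements.  Every name in it (device, host,
alloc_dma, request_fn, command_done) is a simplified stand-in invented for
illustration; this is not the kernel's scsi_lib.c code, whose actual change is
the small diff quoted in comment 9 below.

/*
 * Toy model of the starvation logic the patch adds.  All names are
 * simplified stand-ins invented for illustration, not kernel code.
 */
#include <stdbool.h>
#include <stdio.h>

struct device {
    int queued;          /* requests waiting on this device's queue */
    int busy;            /* commands currently outstanding */
    bool starved;        /* set when a queued request could not start */
};

struct host {
    bool some_device_starved;
    struct device *devs;
    int ndevs;
};

/* Stand-in for scsi_malloc(), which can fail under memory pressure. */
static bool alloc_dma(bool pool_exhausted) { return !pool_exhausted; }

static void request_fn(struct host *h, struct device *d, bool pool_exhausted)
{
    while (d->queued > 0 && alloc_dma(pool_exhausted)) {
        d->queued--;
        d->busy++;                     /* command dispatched */
    }
    /*
     * The fix: requests still queued but nothing in flight on this
     * device means no completion will ever re-run this queue.  Flag
     * the device so a completion on another device restarts it.
     */
    if (d->queued > 0 && d->busy == 0) {
        d->starved = true;
        h->some_device_starved = true;
    }
}

/* Completion path: when any command finishes, kick starved devices. */
static void command_done(struct host *h, struct device *d)
{
    d->busy--;
    if (h->some_device_starved) {
        h->some_device_starved = false;
        for (int i = 0; i < h->ndevs; i++) {
            if (h->devs[i].starved) {
                h->devs[i].starved = false;
                request_fn(h, &h->devs[i], false);  /* pool freed up */
            }
        }
    }
}

int main(void)
{
    struct device devs[2] = {
        { 1, 0, false },               /* "sdc": one request queued */
        { 0, 1, false },               /* another disk: command in flight */
    };
    struct host h = { false, devs, 2 };

    request_fn(&h, &devs[0], true);    /* allocation fails, sdc goes idle */
    printf("sdc starved: %d\n", devs[0].starved);

    command_done(&h, &devs[1]);        /* completion elsewhere requeues sdc */
    printf("sdc queued: %d, busy: %d\n", devs[0].queued, devs[0].busy);
    return 0;
}

The key observation: a failed allocation on an otherwise idle device leaves no
in-flight command whose completion would re-run that device's queue, so the
queue could sleep forever; flagging the device as starved lets any other
device's completion kick it.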

Comment 7 Doug Ledford 2003-08-04 18:21:47 UTC
Created attachment 93380 [details]
New version of test patch that fixes compile problem

Fixes a compile problem in the first patch.

Comment 8 Larry Troan 2003-08-05 16:58:23 UTC
from glen foster....
Bob, Larry Troan needs some sort of read about the patch from Doug
Ledford... if that patch works, Red Hat would like to put it in a
security-errata kernel to be released very shortly.

Please let me know ASAP and I'll get word to Larry almost as soon...

Glen
----------------
Bob's response...
The initial patch has been running since yesterday.  Put it in the errata.

Comment 9 Bob Montgomery 2003-08-05 17:00:33 UTC
I applied the patch to e.31 and ran the "busy" test on the affected system ev01 
overnight (17 hours).  There were no hung sync processes and no unusual 
problems reported by busy.  

Without this patch, sync processes would often hang within the first hour, and 
always within three hours on my previous tests.  It looks like the patch fixes 
the problems we were seeing in this test.

Bob Montgomery, HP

Here is the patch I tested:

--- linux/drivers/scsi/scsi_lib.c.iorl_fix      2003-08-04 12:43:32.315265960 -0400
+++ linux/drivers/scsi/scsi_lib.c       2003-08-04 13:03:13.150751680 -0400
@@ -1100,6 +1100,12 @@ void scsi_request_fn(request_queue_t * q
                scsi_dispatch_cmd(SCpnt);
                spin_lock_irq(q->queue_lock);
        }
+       if (!list_empty(&q->queue_head) && atomic_read(&SDpnt->device_busy) == 0) {
+               SDpnt->starved = 1;
+               spin_lock(SHpnt->lock);
+               SHpnt->some_device_starved = 1;
+               spin_unlock(SHpnt->lock);
+       }
 }



Comment 11 Jason Baron 2003-08-08 16:26:37 UTC
I'm looking after this one... I've got the fix queued for the next erratum.

Comment 12 Bob Montgomery 2003-08-08 20:49:29 UTC
In email, I wrote:

> The last remaining mystery for me is why the e.31 kernel
> experiences the scsi_malloc failures on this workload when
> the 2.4.21 kernel does not (according to my instrumentation,
> anyway).  The size of the malloc pool is actually lower on
> 2.4.21 (5104 sectors vs. 5680 sectors on e.31 as reported in
> the dma_sectors variable in scsi_dma.c).  I'll probably do a
> little more looking at that, just to see if I can satisfy my
> curiosity.

When I logged scsi_malloc failures on e.31, the length of the
failed requests was 0x1200, or 4608 bytes.

When I forced scsi_malloc failures on 2.4.21 by artificially
reducing the size of the dma pool, the length of the failed
requests was 0x1000, or 4096 bytes.

These requests are made by the init_io routine (scsi_init_io_v)
for a size corresponding to

	SYM_CONF_MAX_SG (96) * sizeof (struct scatterlist).

On e.31, sizeof (struct scatterlist) is 48 bytes.  On 2.4.21,
it's only 40 bytes because this field has been removed:

        char *orig_address;     /* for use by swiotlb */                    

This size difference figures into the calculations done in the
scsi_resize_dma_pool routine in scsi_dma.c to establish the size
of the pool.  That is seen in the larger pool calculation of
e.31 (5680 sectors) vs 2.4.21 (5104 sectors).

However, there is an effect that is not taken into account by
the calculation.

The scsi_malloc pool is managed as an array of bitmaps in
dma_malloc_freelist, each map representing one page worth of
memory, and each bit representing a 512-byte "sector" of the
page.  With a 16K page size, each page bitmap is a 32 bit
integer (16K / 512 = 32).  An allocation is not allowed to cross
a page boundary, meaning it can't appear in more than one page
bitmap, and that's what causes the difference for e.31.

Since the typical allocation on e.31 is for 96 * 48 bytes or 9
512-byte sectors, only three allocations will fit on a page,
using 3 * 9 = 27 bits of the map.  That effectively wastes 5/32
of the allocated memory in unusable fragments in each page,
unless other users of scsi_malloc have requests that fit in 5
sectors.

The typical allocation on 2.4.21 is for 96 * 40 bytes = 3840,
rounded up to 8 512-byte sectors.  Four allocations will fit in
each page, wasting no space.
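
As a sanity check on the arithmetic above, the numbers can be reproduced with
a few lines of C; all figures (16K pages, 512-byte sectors, 96 scatterlist
entries at 48 vs. 40 bytes) come straight from this comment:

/* Worked arithmetic for the fragmentation described above. */
#include <stdio.h>

int main(void)
{
    const int page_sectors = 16384 / 512;      /* 32 bits per page map */
    const int sg_entries = 96;                 /* SYM_CONF_MAX_SG */
    const int sizes[] = { 48, 40 };            /* sizeof(struct scatterlist) */
    const char *kern[] = { "e.31", "2.4.21" };

    for (int i = 0; i < 2; i++) {
        int bytes = sg_entries * sizes[i];
        int sectors = (bytes + 511) / 512;     /* round up to 512-byte sectors */
        int per_page = page_sectors / sectors; /* can't cross a page boundary */
        int wasted = page_sectors - per_page * sectors;
        printf("%-7s %d bytes = %d sectors, %d allocs/page, %d/32 sectors wasted\n",
               kern[i], bytes, sectors, per_page, wasted);
    }
    return 0;
}

This prints 9 sectors, 3 allocations per page, and 5/32 wasted for e.31, and
8 sectors, 4 allocations per page, and 0/32 wasted for 2.4.21.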

Here is a dump of the first part of the dma_malloc_freelist on
e.31 at the time of a scsi_malloc failure for a 4608-byte
request:

scsi_malloc failure, len: 4608 - bitmaps:
0xffffffff 0x0fffffff 0x07ffffff 0x07ffffff
0x07ffffff 0x07ffffff 0x07ffffff 0x07ffffff
0x07ffffff 0x07ffffff 0x07ffffff 0x07ffffff
0x07ffffff 0x07ffffff 0x07ffffff 0x07ffffff
0x07ffffff 0x07ffffff 0x07ffffff 0x07ffffff
0x07ffffff 0x07ffffff 0x07ffffff 0x07ffffff
0x07ffffff 0x07ffffff 0x07ffffff 0x07ffffff
0x07ffffff 0x07ffffff 0x07ffffff 0x07ffffff
...

The 0x07ffffff pattern continues until the end of the array.
Aside from the first two bitmaps, the 5 unusable free sectors
can be seen as the 5 bits of zeros at the top of each map word.
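
That pattern can be reproduced with a small first-fit simulation written under
the assumptions stated above (one 32-bit map per 16K page, one bit per
512-byte sector, allocations never spanning two maps).  This is an
illustrative toy model, not the actual scsi_dma.c allocator; the first two
maps in the real dump differ only because other scsi_malloc users already
occupy them:

/* First-fit bitmap simulation of the dma_malloc_freelist behavior
 * described above.  Filling the pool with 9-sector requests (96 * 48
 * bytes on e.31) reproduces the 0x07ffffff pattern from the dump. */
#include <stdio.h>

#define NR_PAGES 8
#define BITS     32          /* 16K page / 512-byte sectors */

static unsigned int freelist[NR_PAGES];   /* 1 bits = allocated sectors */

/* Try to place an allocation of `sectors` bits within one page's map. */
static int bitmap_alloc(int sectors)
{
    unsigned int mask = (sectors == BITS) ? ~0u : ((1u << sectors) - 1);

    for (int p = 0; p < NR_PAGES; p++)
        for (int shift = 0; shift + sectors <= BITS; shift++)
            if ((freelist[p] & (mask << shift)) == 0) {
                freelist[p] |= mask << shift;
                return p;
            }
    return -1;               /* the scsi_malloc() failure case */
}

int main(void)
{
    while (bitmap_alloc(9) >= 0)   /* fill the pool with 9-sector requests */
        ;
    for (int p = 0; p < NR_PAGES; p++)
        printf("0x%08x%c", freelist[p], (p % 4 == 3) ? '\n' : ' ');
    return 0;
}

Every map fills to 0x07ffffff (bits 0-26 set, three 9-sector allocations)
and then a fourth allocation fails even though 5 sectors per page are free.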

Conclusion: because of the unfortunate size of struct
scatterlist on e.31, the scsi_malloc dma pool suffers a lot more
fragmentation loss than it does on 2.4.21, and scsi_malloc
failures are more likely on a busy system.  This exposed the bug
with the queue management that is addressed by Doug's patch for
Bugzilla #99335.

Since the simple design of scsi_malloc is susceptible to this
loss, the value of SYM_CONF_MAX_SG, the size of struct
scatterlist, and the PAGE_SIZE should either be managed to avoid
bad size combinations, or the initial size of the dma pool
should be "fudge-factored" upward to allow for the fragmentation
loss.  Smaller page sizes should have the potential for even
higher loss ratios.
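
Extending the same arithmetic to other page sizes backs up that last point;
this is a speculative illustration that assumes the no-page-crossing rule
stays fixed while PAGE_SIZE varies:

/* Loss ratio as a function of page size for the e.31 allocation size
 * (9 sectors).  4K pages hold only 8 sectors, so under the
 * no-page-crossing rule a 9-sector request cannot be satisfied at all. */
#include <stdio.h>

int main(void)
{
    const int alloc_sectors = 9;               /* 96 * 48 bytes on e.31 */
    const int page_sizes[] = { 4096, 16384, 65536 };

    for (int i = 0; i < 3; i++) {
        int sectors = page_sizes[i] / 512;
        int per_page = sectors / alloc_sectors;
        int wasted = sectors - per_page * alloc_sectors;
        if (per_page == 0)
            printf("%5d-byte pages: request cannot be satisfied\n",
                   page_sizes[i]);
        else
            printf("%5d-byte pages: %d allocs/page, loss ratio %d/%d\n",
                   page_sizes[i], per_page, wasted, sectors);
    }
    return 0;
}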



Comment 13 Mark J. Cox 2003-08-21 17:40:49 UTC
An erratum has been issued which should help the problem described in this bug
report.  This report is therefore being closed with a resolution of ERRATA.
For more information on the solution and/or where to find the updated files,
please follow the link below.  You may reopen this bug report if the solution
does not work for you.

http://rhn.redhat.com/errata/RHSA-2003-198.html


