Bug 99335
| Summary: | [PATCH] Reboot hangs (not completed) following run of "busy" system stress test | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 2.1 | Reporter: | Glen A. Foster <glen.foster> | ||||||
| Component: | kernel | Assignee: | Jason Baron <jbaron> | ||||||
| Status: | CLOSED ERRATA | QA Contact: | Brian Brock <bbrock> | ||||||
| Severity: | high | Docs Contact: | |||||||
| Priority: | high | ||||||||
| Version: | 2.1 | CC: | knoel, riel, shillman, tao | ||||||
| Target Milestone: | --- | ||||||||
| Target Release: | --- | ||||||||
| Hardware: | ia64 | ||||||||
| OS: | Linux | ||||||||
| Whiteboard: | |||||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||||
| Doc Text: | Story Points: | --- | |||||||
| Clone Of: | Environment: | ||||||||
| Last Closed: | 2003-08-21 17:40:49 UTC | Type: | --- | ||||||
| Regression: | --- | Mount Type: | --- | ||||||
| Documentation: | --- | CRM: | |||||||
| Verified Versions: | Category: | --- | |||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||
| Embargoed: | |||||||||
| Attachments: |
|
||||||||
|
Description
Glen A. Foster
2003-07-17 19:45:06 UTC
======== ORIGINAL DEFECT INFO ======== Bug reported on: Wednesday 06/04 2003 at 15:22 155 Abe revision: 20.01 ------------------------------------------------------- SYMPTOMS: Got following message during X-window shutdown. not able to reboot the system after got that message waiting for X server to shut down ICE default IO error handler doing an exit(), pid = 1668, errno = 0 .X connection to :0.0 broken (explicit kill or server shutdown). Gdk-ERROR **: X connection to :0.0 broken (explicit kill or server shutdown). Gdk-ERROR **: X connection to :0.0 broken (explicit kill or server shutdown). Gdk-ERROR **: X connection to :0.0 broken (explicit kill or server shutdown). .Killing X11 Gdk-ERROR **: X connection to :0.0 broken (explicit kill or server shutdown). . Busy Test Done .Gdk-ERROR **: X connection to :0.0 broken (explicit kill or server shutdown). . ** WARNING **: 3 scalable icons still left when destroying icon factory Updated on: Wednesday 07/09 2003 at 16:51 190 Abe revision: 20.02 ------------------------------------------------------- UPDATE: After running busy for about 2h, ^C to stop it. Then get: ev01 -> ps -elH n F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD 100 S 0 1 0 0 75 0 - 179 49d590 ? 0:09 init ... 000 S 0 1838 1 0 75 0 - 210 42c1b0 ttyS0 0:00 /usr/bin/ksh /busy2/bin/busytest -v -s /busy2/busysetup 000 D 0 5365 1838 0 75 0 - 199 4f0db0 ttyS0 0:00 sync 000 S 0 2167 1 0 75 0 - 206 42c1b0 ttyS0 0:00 /usr/bin/ksh /busy2/bin/timestamp 000 D 0 10681 2167 0 76 0 - 199 4f0db0 ttyS0 0:00 sync 000 S 0 2168 1 0 75 0 - 206 42c1b0 ttyS0 0:00 /usr/bin/ksh /busy2/bin/loadstamp 000 D 0 5364 2168 0 75 0 - 199 4f0db0 ttyS0 0:00 sync 100 D 0 2467 1 0 81 5 - 217 4f0db0 ttyS0 0:00 badblocks -w -b 512 /dev/sdc 17782784 000 S 0 24908 1 0 75 0 - 206 42c1b0 ttyS0 0:00 /usr/bin/ksh /busy2/bin/dmesgstamp 000 D 0 24922 24908 0 75 0 - 199 4f0db0 ttyS0 0:00 sync ev01 -> "man ps" says: D uninterruptible sleep (usually IO) 4f0db0 gets printed at "wait_o" when dropping the "n" parameter. Willy tells me reboot attempts to kill all processes and then "sync" the files. I need help o figuring out what the processes are sleeping on o why they aren't getting woken up: o someone to troll bugzilla for all RHAS 2.1 "hang" bugs Updated on: Thursday 07/10 2003 at 16:05 191
Abe revision: 20.02
-------------------------------------------------------
UPDATE:
Matthew Wilcox beat me to figure out how to get "ps" command to dump
the sleeping processes:
Grant, I figured out how to use ps ;-)
ev01 -> ps -e -o pid,comm,wchan=WIDE-WCHAN-COLUMN
PID COMMAND WIDE-WCHAN-COLUMN
1 init schedule_timeout
2 migration_CPU0 migration_thread
3 keventd context_thread
4 ksoftirqd_CPU0 ksoftirqd
5 kswapd schedule_timeout
6 bdflush interruptible_sleep_on
7 kupdated schedule_timeout
8 mdrecoveryd md_thread
14 scsi_eh_0 down_interruptible
15 scsi_eh_1 down_interruptible
16 scsi_eh_2 down_interruptible
17 scsi_eh_3 down_interruptible
18 scsi_eh_4 down_interruptible
19 scsi_eh_5 down_interruptible
20 scsi_eh_6 down_interruptible
21 scsi_eh_7 down_interruptible
22 scsi_eh_8 down_interruptible
23 scsi_eh_9 down_interruptible
24 scsi_eh_10 down_interruptible
25 scsi_eh_11 down_interruptible
28 kjournald interruptible_sleep_on
81 khubd phys_start
1065 syslogd schedule_timeout
1070 klogd do_syslog
1090 portmap schedule_timeout
1118 rpc.statd schedule_timeout
1251 acpid schedule_timeout
1270 sshd schedule_timeout
1304 xinetd schedule_timeout
1345 sendmail schedule_timeout
1364 gpm schedule_timeout
1382 crond schedule_timeout
1430 xfs schedule_timeout
1466 atd schedule_timeout
1475 login wait4
1476 mingetty schedule_timeout
1477 mingetty schedule_timeout
1478 mingetty schedule_timeout
1479 mingetty schedule_timeout
1480 mingetty schedule_timeout
1481 mingetty schedule_timeout
1484 ksh ia64_rt_sigsuspend
1715 oafd schedule_timeout
1838 busytest ia64_rt_sigsuspend
2167 timestamp ia64_rt_sigsuspend
2168 loadstamp ia64_rt_sigsuspend
2467 badblocks wait_on_buffer
5364 sync wait_on_buffer
5365 sync wait_on_buffer
10681 sync wait_on_buffer
24908 dmesgstamp ia64_rt_sigsuspend
24922 sync wait_on_buffer
20569 ps -
It's normal for the scsi EH threads to be sleeping in down_interruptible,
don't worry about that. Obviously the sync processes in wait_on_buffer
are the worrying ones.
Updated on: Friday 07/11 2003 at 17:20 192
Abe revision: 20.02
-------------------------------------------------------
UPDATE:
Tracking down another step, in this particular hang, only
one disk has IO outstanding and that's "sdc". All the other
hung processes seem to be waiting for that IO to complete.
ev01 -> ps -eH -o pid,wchan=WIDE-WCHAN-COLUMN,args
PID WIDE-WCHAN-COLUMN COMMAND
1 schedule_timeout init
...
1838 ia64_rt_sigsuspen /usr/bin/ksh /busy2/bin/busytest -v -s /busy2/busysetup
-t 4320 -r /busy2/results/res.0709.214108
5365 wait_on_buffer sync
2167 ia64_rt_sigsuspen /usr/bin/ksh /busy2/bin/timestamp
10681 wait_on_buffer sync
2168 ia64_rt_sigsuspen /usr/bin/ksh /busy2/bin/loadstamp
5364 wait_on_buffer sync
2467 wait_on_buffer badblocks -w -b 512 /dev/sdc 17782784
24908 ia64_rt_sigsuspen /usr/bin/ksh /busy2/bin/dmesgstamp
24922 wait_on_buffer sync
ISSUE TRACKER 26124 OPENED AS SEV 1 -- QU3 BLOCKER. Created attachment 93378 [details]
Test patch to solve problem
This patch *should* solve the problem seen here. It makes sure that in the
event that we have an scsi_malloc() failure on a device with no outstanding
commands then we set that device as starved so that a command completion on
another device will trigger a requeue on this device.
Created attachment 93380 [details]
New version of test patch that fixes compile problem
Fix a compile problem in the first patch
from glen foster.... Bob, Larry Troan needs some sort of read about the patch from Doug Ledford... if that patch works, Red Hat would like to put it in an security-errata kernel to be released very shortly. Please let me know ASAP and I'll get word to Larry almost as soon... Glen ---------------- Bob's response... The initial patch has been runing since yesterday. Put in in the errata. I applied the patch to e.31 and ran the "busy" test on the affected system ev01
overnight (17 hours). There were no hung sync processes and no unusual
problems reported by busy.
Without this patch, sync processes would often hang within the first hour, and
always within three hours on my previous tests. It looks like the patch fixes
the problems we were seeing in this test.
Bob Montgomery, HP
Here is the patch I tested:
--- linux/drivers/scsi/scsi_lib.c.iorl_fix 2003-08-04 12:43:32.315265960 -0
400
+++ linux/drivers/scsi/scsi_lib.c 2003-08-04 13:03:13.150751680 -0400
@@ -1100,6 +1100,12 @@ void scsi_request_fn(request_queue_t * q
scsi_dispatch_cmd(SCpnt);
spin_lock_irq(q->queue_lock);
}
+ if (!list_empty(&q->queue_head) && atomic_read(&SDpnt->device_busy) == 0
) {
+ SDpnt->starved = 1;
+ spin_lock(SHpnt->lock);
+ SHpnt->some_device_starved = 1;
+ spin_unlock(SHpnt->lock);
+ }
}
i'm looking after this one...i've got the fix queued for the next erratum In email, I wrote:
> The last remaining mystery for me is why the e.31 kernel
> experiences the scsi_malloc failures on this workload when
> the 2.4.21 kernel does not (according to my instrumentation,
> anyway). The size of the malloc pool is actually lower on
> 2.4.21 (5104 sectors vs. 5680 sectors on e.31 as reported in
> the dma_sectors variable in scsi_dma.c). I'll probably do a
> little more looking at that, just to see if I can satisfy my
> curiousity.
When I logged scsi_malloc failures on e.31, the length of the
failed requests was 0x1200, or 4608 bytes.
When I forced scsi_malloc failures on 2.4.21 by artificially
reducing the size of the dma pool, the length of the failed
requests was 0x1000, or 4096 bytes.
These requests are made by the init_io routine (scsi_init_io_v)
for a size corresponding to
SYM_CONF_MAX_SG (96) * sizeof (struct scatterlist).
On e.31, sizeof (struct scatterlist) is 48 bytes. On 2.4.21,
it's only 40 bytes because this field has been removed:
char *orig_address; /* for use by swiotlb */
This size difference figures into the calculations done in the
scsi_resize_dma_pool routine in scsi_dma.c to establish the size
of the pool. That is seen in the larger pool calculation of
e.31 (5680 sectors) vs 2.4.21 (5104 sectors).
However, there is an effect that is not taken into account by
the calculation.
The scsi_malloc pool is managed as an array of bitmaps in
dma_malloc_freelist, each map representing one page worth of
memory, and each bit representing a 512-byte "sector" of the
page. With a 16K page size, each page bitmap is a 32 bit
integer (16K / 512 = 32). An allocation is not allowed to cross
a page boundary, meaning it can't appear in more than one page
bitmap, and that's what causes the difference for e.31.
Since the typical allocation on e.31 is for 96 * 48 bytes or 9
512-byte sectors, only three allocations will fit on a page,
using 3 * 9 = 27 bits of the map. That effectively wastes 5/32
of the allocated memory in unusable fragments in each page,
unless other users of scsi_malloc have requests that fit in 5
sectors.
The typical allocation on 2.4.21 is for 96 * 40 bytes = 3840,
rounded up to 8 512-byte sectors. Four allocations will fit in
each page, wasting no space.
Here is a dump of the first part of the dma_malloc_freelist on
e.31 at the time of a scsi_malloc failure for a 4608-byte
request:
scsi_malloc failure, len: 4608 - bitmaps:
0xffffffff 0x0fffffff 0x07ffffff 0x07ffffff
0x07ffffff 0x07ffffff 0x07ffffff 0x07ffffff
0x07ffffff 0x07ffffff 0x07ffffff 0x07ffffff
0x07ffffff 0x07ffffff 0x07ffffff 0x07ffffff
0x07ffffff 0x07ffffff 0x07ffffff 0x07ffffff
0x07ffffff 0x07ffffff 0x07ffffff 0x07ffffff
0x07ffffff 0x07ffffff 0x07ffffff 0x07ffffff
0x07ffffff 0x07ffffff 0x07ffffff 0x07ffffff
...
The 0x07ffffff pattern continues until the end of the array.
Aside from the first two bitmaps, the 5 unusable free sectors
can be seen as the 5 bits of zeros at the top of each map word.
Conclusion: because of the unfortunate size of struct
scatterlist on e.31, the scsi_malloc dma pool suffers a lot more
fragmentation loss than it does on 2.4.21, and scsi_malloc
failures are more likely on a busy system. This exposed the bug
with the queue management that is addressed by Doug's patch for
Bugzilla #99335.
Since the simple design of scsi_malloc is susceptible to this
loss, the value of SYM_CONF_MAX_SG, the size of struct
scatterlist, and the PAGE_SIZE should either be managed to avoid
bad size combinations, or the initial size of the dma pool
should be "fudge-factored" upward to allow for the fragmentation
loss. Smaller page sizes should have the potential for even
higher loss ratios.
An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2003-198.html |