Intro: This defect is one of three critical defects found at HP running in system-validation test configurations. Information from an internal HP defect database is cut-and-pasted below for more background.

Description of problem:
Following a run of "busy" (an HP-internal system stress test), a system reboot hung; the only way to regain control was to reset the system through the Maintenance Processor firmware (see bug 99333). Grant has tracked it down to the point where the reboot is hanging on processes in uninterruptible sleep.

Version-Release number of selected component (if applicable):
RHAS 2.1 update 2 (kernel-smp-2.4.18-e.31)

How reproducible:
Don't know; intermittent _at worst_.

Steps to Reproduce:
1. Run "busy" for ~2 hours, then interrupt (halt) the test with <CNTL>C.

Additional info:
Original report and troubleshooting data will be attached.
======== ORIGINAL DEFECT INFO ========
Bug reported on: Wednesday 06/04 2003 at 15:22 155
Abe revision: 20.01
-------------------------------------------------------
SYMPTOMS:
Got the following messages during X-window shutdown; not able to reboot the system after getting them:

waiting for X server to shut down
ICE default IO error handler doing an exit(), pid = 1668, errno = 0
.X connection to :0.0 broken (explicit kill or server shutdown).
Gdk-ERROR **: X connection to :0.0 broken (explicit kill or server shutdown).
Gdk-ERROR **: X connection to :0.0 broken (explicit kill or server shutdown).
Gdk-ERROR **: X connection to :0.0 broken (explicit kill or server shutdown).
.Killing X11
Gdk-ERROR **: X connection to :0.0 broken (explicit kill or server shutdown).
.
Busy Test Done
.Gdk-ERROR **: X connection to :0.0 broken (explicit kill or server shutdown).
.
** WARNING **: 3 scalable icons still left when destroying icon factory
Updated on: Wednesday 07/09 2003 at 16:51 190
Abe revision: 20.02
-------------------------------------------------------
UPDATE:
After running busy for about 2h, ^C to stop it. Then get:

ev01 -> ps -elH n
  F S UID   PID  PPID C PRI NI ADDR  SZ WCHAN  TTY   TIME CMD
100 S   0     1     0 0  75  0    - 179 49d590 ?     0:09 init
...
000 S   0  1838     1 0  75  0    - 210 42c1b0 ttyS0 0:00 /usr/bin/ksh /busy2/bin/busytest -v -s /busy2/busysetup
000 D   0  5365  1838 0  75  0    - 199 4f0db0 ttyS0 0:00 sync
000 S   0  2167     1 0  75  0    - 206 42c1b0 ttyS0 0:00 /usr/bin/ksh /busy2/bin/timestamp
000 D   0 10681  2167 0  76  0    - 199 4f0db0 ttyS0 0:00 sync
000 S   0  2168     1 0  75  0    - 206 42c1b0 ttyS0 0:00 /usr/bin/ksh /busy2/bin/loadstamp
000 D   0  5364  2168 0  75  0    - 199 4f0db0 ttyS0 0:00 sync
100 D   0  2467     1 0  81  5    - 217 4f0db0 ttyS0 0:00 badblocks -w -b 512 /dev/sdc 17782784
000 S   0 24908     1 0  75  0    - 206 42c1b0 ttyS0 0:00 /usr/bin/ksh /busy2/bin/dmesgstamp
000 D   0 24922 24908 0  75  0    - 199 4f0db0 ttyS0 0:00 sync
ev01 ->

"man ps" says:
  D   uninterruptible sleep (usually IO)

The WCHAN value 4f0db0 gets printed as "wait_o" when the "n" parameter is dropped. Willy tells me reboot attempts to kill all processes and then "sync" the files.

I need help:
 o figuring out what the processes are sleeping on
 o figuring out why they aren't getting woken up
 o someone to troll bugzilla for all RHAS 2.1 "hang" bugs
Updated on: Thursday 07/10 2003 at 16:05 191
Abe revision: 20.02
-------------------------------------------------------
UPDATE:
Matthew Wilcox beat me to figuring out how to get the "ps" command to dump the sleeping processes:

Grant, I figured out how to use ps ;-)

ev01 -> ps -e -o pid,comm,wchan=WIDE-WCHAN-COLUMN
  PID COMMAND         WIDE-WCHAN-COLUMN
    1 init            schedule_timeout
    2 migration_CPU0  migration_thread
    3 keventd         context_thread
    4 ksoftirqd_CPU0  ksoftirqd
    5 kswapd          schedule_timeout
    6 bdflush         interruptible_sleep_on
    7 kupdated        schedule_timeout
    8 mdrecoveryd     md_thread
   14 scsi_eh_0       down_interruptible
   15 scsi_eh_1       down_interruptible
   16 scsi_eh_2       down_interruptible
   17 scsi_eh_3       down_interruptible
   18 scsi_eh_4       down_interruptible
   19 scsi_eh_5       down_interruptible
   20 scsi_eh_6       down_interruptible
   21 scsi_eh_7       down_interruptible
   22 scsi_eh_8       down_interruptible
   23 scsi_eh_9       down_interruptible
   24 scsi_eh_10      down_interruptible
   25 scsi_eh_11      down_interruptible
   28 kjournald       interruptible_sleep_on
   81 khubd           phys_start
 1065 syslogd         schedule_timeout
 1070 klogd           do_syslog
 1090 portmap         schedule_timeout
 1118 rpc.statd       schedule_timeout
 1251 acpid           schedule_timeout
 1270 sshd            schedule_timeout
 1304 xinetd          schedule_timeout
 1345 sendmail        schedule_timeout
 1364 gpm             schedule_timeout
 1382 crond           schedule_timeout
 1430 xfs             schedule_timeout
 1466 atd             schedule_timeout
 1475 login           wait4
 1476 mingetty        schedule_timeout
 1477 mingetty        schedule_timeout
 1478 mingetty        schedule_timeout
 1479 mingetty        schedule_timeout
 1480 mingetty        schedule_timeout
 1481 mingetty        schedule_timeout
 1484 ksh             ia64_rt_sigsuspend
 1715 oafd            schedule_timeout
 1838 busytest        ia64_rt_sigsuspend
 2167 timestamp       ia64_rt_sigsuspend
 2168 loadstamp       ia64_rt_sigsuspend
 2467 badblocks       wait_on_buffer
 5364 sync            wait_on_buffer
 5365 sync            wait_on_buffer
10681 sync            wait_on_buffer
24908 dmesgstamp      ia64_rt_sigsuspend
24922 sync            wait_on_buffer
20569 ps              -

It's normal for the scsi EH threads to be sleeping in down_interruptible; don't worry about that. Obviously the sync processes in wait_on_buffer are the worrying ones.
Updated on: Friday 07/11 2003 at 17:20 192
Abe revision: 20.02
-------------------------------------------------------
UPDATE:
Tracking this down another step: in this particular hang, only one disk has IO outstanding, and that's "sdc". All the other hung processes seem to be waiting for that IO to complete.

ev01 -> ps -eH -o pid,wchan=WIDE-WCHAN-COLUMN,args
  PID WIDE-WCHAN-COLUMN COMMAND
    1 schedule_timeout  init
...
 1838 ia64_rt_sigsuspen /usr/bin/ksh /busy2/bin/busytest -v -s /busy2/busysetup -t 4320 -r /busy2/results/res.0709.214108
 5365 wait_on_buffer    sync
 2167 ia64_rt_sigsuspen /usr/bin/ksh /busy2/bin/timestamp
10681 wait_on_buffer    sync
 2168 ia64_rt_sigsuspen /usr/bin/ksh /busy2/bin/loadstamp
 5364 wait_on_buffer    sync
 2467 wait_on_buffer    badblocks -w -b 512 /dev/sdc 17782784
24908 ia64_rt_sigsuspen /usr/bin/ksh /busy2/bin/dmesgstamp
24922 wait_on_buffer    sync
ISSUE TRACKER 26124 OPENED AS SEV 1 -- QU3 BLOCKER.
Created attachment 93378 [details]
Test patch to solve problem

This patch *should* solve the problem seen here. It makes sure that if we hit a scsi_malloc() failure on a device with no outstanding commands, we mark that device as starved, so that a command completion on another device will trigger a requeue on this device.
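For readers not familiar with the midlayer's starvation handling, here is a small self-contained toy model of the idea behind the patch. The struct and function names are my own illustration (hypothetical, not kernel code); only the two flags mirror the fields the patch actually touches (starved and some_device_starved).

/* Toy model of the "starved device" idea: a device whose allocation
 * fails while it has nothing in flight is remembered, and a later
 * completion on any other device of the same host kicks it again. */
#include <stdio.h>

struct toy_device {
        const char *name;
        int busy;      /* commands currently in flight            */
        int starved;   /* queued work stalled for lack of memory  */
};

struct toy_host {
        int some_device_starved;
        struct toy_device *devs[2];
};

/* Request side: the allocation failed and nothing is in flight on this
 * device, so no completion on *this* device will ever retry its queue.
 * Record the starvation so another device's completion can kick it. */
static void alloc_failed(struct toy_host *h, struct toy_device *d)
{
        if (d->busy == 0) {
                d->starved = 1;
                h->some_device_starved = 1;
                printf("%s: allocation failed with nothing outstanding -> starved\n", d->name);
        }
}

/* Completion side: a command on any device finished and freed memory;
 * re-drive the queues of any starved devices on the same host. */
static void command_completed(struct toy_host *h, struct toy_device *d)
{
        int i;

        d->busy--;
        printf("%s: command completed\n", d->name);

        if (!h->some_device_starved)
                return;
        for (i = 0; i < 2; i++) {
                if (h->devs[i]->starved) {
                        h->devs[i]->starved = 0;
                        printf("%s: requeued after starvation\n", h->devs[i]->name);
                }
        }
        h->some_device_starved = 0;
}

int main(void)
{
        struct toy_device sdb = { "sdb", 1, 0 };   /* has a command in flight   */
        struct toy_device sdc = { "sdc", 0, 0 };   /* idle, about to be starved */
        struct toy_host host = { 0, { &sdb, &sdc } };

        alloc_failed(&host, &sdc);        /* without the patch this queue was simply lost */
        command_completed(&host, &sdb);   /* with the patch, this wakes sdc back up       */
        return 0;
}

Without the starved flag, the failed allocation on the idle device leaves its queue with no one to restart it, which is exactly the state the hung sync processes were stuck behind.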
Created attachment 93380 [details]
New version of test patch that fixes compile problem

Fixes a compile problem in the first patch.
from glen foster....

Bob, Larry Troan needs some sort of read on the patch from Doug Ledford... if that patch works, Red Hat would like to put it in a security-errata kernel to be released very shortly. Please let me know ASAP and I'll get word to Larry as soon as I do...
Glen

----------------
Bob's response...

The initial patch has been running since yesterday. Put it in the errata.
I applied the patch to e.31 and ran the "busy" test on the affected system ev01 overnight (17 hours). There were no hung sync processes and no unusual problems reported by busy. Without this patch, sync processes would often hang within the first hour, and always within three hours, on my previous tests. It looks like the patch fixes the problems we were seeing in this test.

Bob Montgomery, HP

Here is the patch I tested:

--- linux/drivers/scsi/scsi_lib.c.iorl_fix	2003-08-04 12:43:32.315265960 -0400
+++ linux/drivers/scsi/scsi_lib.c	2003-08-04 13:03:13.150751680 -0400
@@ -1100,6 +1100,12 @@ void scsi_request_fn(request_queue_t * q
 		scsi_dispatch_cmd(SCpnt);
 		spin_lock_irq(q->queue_lock);
 	}
+	if (!list_empty(&q->queue_head) && atomic_read(&SDpnt->device_busy) == 0) {
+		SDpnt->starved = 1;
+		spin_lock(SHpnt->lock);
+		SHpnt->some_device_starved = 1;
+		spin_unlock(SHpnt->lock);
+	}
 }
I'm looking after this one... I've got the fix queued for the next erratum.
In email, I wrote:

> The last remaining mystery for me is why the e.31 kernel
> experiences the scsi_malloc failures on this workload when
> the 2.4.21 kernel does not (according to my instrumentation,
> anyway). The size of the malloc pool is actually lower on
> 2.4.21 (5104 sectors vs. 5680 sectors on e.31 as reported in
> the dma_sectors variable in scsi_dma.c). I'll probably do a
> little more looking at that, just to see if I can satisfy my
> curiosity.

When I logged scsi_malloc failures on e.31, the length of the failed requests was 0x1200, or 4608 bytes. When I forced scsi_malloc failures on 2.4.21 by artificially reducing the size of the dma pool, the length of the failed requests was 0x1000, or 4096 bytes. These requests are made by the init_io routine (scsi_init_io_v) for a size corresponding to SYM_CONF_MAX_SG (96) * sizeof(struct scatterlist).

On e.31, sizeof(struct scatterlist) is 48 bytes. On 2.4.21, it's only 40 bytes because this field has been removed:

	char *orig_address;	/* for use by swiotlb */

This size difference figures into the calculations done in the scsi_resize_dma_pool routine in scsi_dma.c to establish the size of the pool. That is seen in the larger pool calculation of e.31 (5680 sectors) vs 2.4.21 (5104 sectors).

However, there is an effect that is not taken into account by the calculation. The scsi_malloc pool is managed as an array of bitmaps in dma_malloc_freelist, each map representing one page worth of memory, and each bit representing a 512-byte "sector" of the page. With a 16K page size, each page bitmap is a 32-bit integer (16K / 512 = 32). An allocation is not allowed to cross a page boundary, meaning it can't appear in more than one page bitmap, and that's what causes the difference for e.31.

Since the typical allocation on e.31 is for 96 * 48 bytes, or 9 512-byte sectors, only three allocations will fit on a page, using 3 * 9 = 27 bits of the map. That effectively wastes 5/32 of the allocated memory in unusable fragments in each page, unless other users of scsi_malloc have requests that fit in 5 sectors. The typical allocation on 2.4.21 is for 96 * 40 bytes = 3840, rounded up to 8 512-byte sectors. Four allocations will fit in each page, wasting no space.

Here is a dump of the first part of the dma_malloc_freelist on e.31 at the time of a scsi_malloc failure for a 4608-byte request:

scsi_malloc failure, len: 4608 - bitmaps:
0xffffffff 0x0fffffff 0x07ffffff 0x07ffffff 0x07ffffff 0x07ffffff 0x07ffffff 0x07ffffff
0x07ffffff 0x07ffffff 0x07ffffff 0x07ffffff 0x07ffffff 0x07ffffff 0x07ffffff 0x07ffffff
0x07ffffff 0x07ffffff 0x07ffffff 0x07ffffff 0x07ffffff 0x07ffffff 0x07ffffff 0x07ffffff
0x07ffffff 0x07ffffff 0x07ffffff 0x07ffffff 0x07ffffff 0x07ffffff 0x07ffffff 0x07ffffff
...

The 0x07ffffff pattern continues until the end of the array. Aside from the first two bitmaps, the 5 unusable free sectors can be seen as the 5 bits of zeros at the top of each map word.

Conclusion: because of the unfortunate size of struct scatterlist on e.31, the scsi_malloc dma pool suffers a lot more fragmentation loss than it does on 2.4.21, and scsi_malloc failures are more likely on a busy system. This exposed the bug with the queue management that is addressed by Doug's patch for Bugzilla #99335.
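To make the arithmetic above concrete, here is a small user-space sketch (my own illustration, not kernel source) that reproduces the fit-per-page and waste calculation for the two scatterlist sizes. The constants mirror the values quoted above: SYM_CONF_MAX_SG = 96, 512-byte "sectors", and a 16K ia64 page.

/* Illustrative user-space calculation of scsi_malloc pool fragmentation. */
#include <stdio.h>

#define SECTOR_SIZE   512
#define PAGE_SIZE_16K (16 * 1024)
#define MAX_SG        96              /* SYM_CONF_MAX_SG */

static void report(const char *kernel, int sg_entry_size)
{
        int bytes    = MAX_SG * sg_entry_size;                   /* request size          */
        int sectors  = (bytes + SECTOR_SIZE - 1) / SECTOR_SIZE;  /* rounded up to sectors */
        int per_page = PAGE_SIZE_16K / SECTOR_SIZE;              /* bits in one page map  */
        int fits     = per_page / sectors;                       /* allocations per page  */
        int wasted   = per_page - fits * sectors;                /* unusable sectors      */

        printf("%s: %d bytes -> %d sectors, %d allocations/page, %d of %d sectors wasted\n",
               kernel, bytes, sectors, fits, wasted, per_page);
}

int main(void)
{
        report("e.31   (scatterlist = 48 bytes)", 48);  /* 4608 -> 9 sectors, 3/page, 5 wasted */
        report("2.4.21 (scatterlist = 40 bytes)", 40);  /* 3840 -> 8 sectors, 4/page, 0 wasted */
        return 0;
}

Running this reproduces the numbers in the analysis: the 48-byte scatterlist entry leaves 5 of every 32 sectors stranded per page, while the 40-byte entry packs pages exactly.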
Since the simple design of scsi_malloc is susceptible to this loss, the value of SYM_CONF_MAX_SG, the size of struct scatterlist, and the PAGE_SIZE should either be managed to avoid bad size combinations, or the initial size of the dma pool should be "fudge-factored" upward to allow for the fragmentation loss. Smaller page sizes should have the potential for even higher loss ratios.
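As a rough illustration of that last point (again just a back-of-the-envelope sketch using the 9-sector e.31 allocation size from above, not a measurement), the unusable fraction per page grows as the page shrinks:

/* Back-of-the-envelope fragmentation-loss estimate per page size,
 * assuming every scsi_malloc() request is 9 sectors (e.31's 4608-byte case)
 * and allocations may not cross a page boundary.  Illustrative only. */
#include <stdio.h>

int main(void)
{
        int page_sizes[] = { 4 * 1024, 8 * 1024, 16 * 1024, 64 * 1024 };
        int sectors_per_alloc = 9;
        int i;

        for (i = 0; i < 4; i++) {
                int sectors_per_page = page_sizes[i] / 512;
                int fits   = sectors_per_page / sectors_per_alloc;
                int wasted = sectors_per_page - fits * sectors_per_alloc;

                /* At 4K pages a 9-sector request cannot fit in any page at all. */
                printf("%2dK page: %d allocations fit, %d of %d sectors unusable (%.0f%% loss)\n",
                       page_sizes[i] / 1024, fits, wasted, sectors_per_page,
                       100.0 * wasted / sectors_per_page);
        }
        return 0;
}

The 16K case gives the 5/32 loss described above; 8K pages would lose 7/16, and with 4K pages the 9-sector request could never be satisfied from a single page at all.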
An erratum has been issued which should address the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2003-198.html