Description of problem from mark.fasheh:

A while back, a bug in __wtd_down_action() was fixed in AS 2.1 x86, but the fix somehow never made its way into the ia64 tree. Attached is a patch that ports the fix to ia64.

Basically, in arch/ia64/kernel/semaphore.c, __wtd_down_action() calls __wtd_down(), which increments sem->sleepers; under heavy load this produced invalid semaphore counts. That increment should not happen on the wakeup path, so the fix introduces a new function, __wtd_down_from_wakeup(), which does not increment sem->sleepers. This problem resulted in a P1 bug for us from a large customer. After giving them a kernel with the fix, they have been running for days without any problem. We'd like to see this in the ia64 tree ASAP. Below are some e-mail excerpts which further explain the issue and provide a method for reproducing it.

--Mark

>> Routine __wtd_down() in arch/i386/kernel/semaphore.c is the AIO equivalent
>> of routine __down(), basically handling down() failures and blocking until
>> someone up()s the semaphore. The major difference is that "sem->sleepers++"
>> happens only once each time __down() is called, when the process blocks;
>> the for loop does not re-execute the "sem->sleepers++".
>>
>> __wtd_down(), on the other hand, executes the "sem->sleepers++" multiple
>> times. __wtd_down() is initially called from "__wtd_down_failed" in the
>> context of the process issuing the AIO syscall, and, when that attempt
>> fails, again from "__wtd_down_action()" when whoever owns the inode
>> semaphore up()s it. This causes the combination of sem->sleepers and
>> sem->count to get into an inconsistent state, allowing multiple down()s
>> on the same semaphore to succeed at the same time while the count never
>> drops below zero. Subsequent up()s to that semaphore bump sem->count
>> to > 1.
>>
>> The inode eventually gets freed and reallocated without resetting the
>> semaphore, because inodes have a constructor in the slab cache. When the
>> inode gets reused for a pipe, multiple down()s succeed and the pipe data
>> gets corrupted, leading to the infamous BUG in pipe.c. We (and Oracle)
>> also think this is causing the Oracle listener process hangs, because
>> pipes are used between their processes.

Version-Release number of selected component (if applicable):
2.4.18-e.43

How reproducible:
100% of the time, using the method below.

Steps to Reproduce:
And here's a way to reproduce the problem:

> Reproducing the problem:
> Load 10+ tables simultaneously into Oracle using sqlldr. Repeat for
> 500-1000 iterations. The database must be configured with ASYNC IO ON,
> archive logging ON, and large logs (50 MB+ each). Wait for 20+ GB of
> logs to accumulate. While still loading data, "rm" 20+ GB of log files.
> Using "rm" produces the most consistent results. Wait and repeat until
> the sqlldr logs show ORA-3113 errors or other TNS errors, or loader
> processes simply hang. Eventually, all loader processes will simply
> hang. We can reproduce the problem 100% of the time using this method.

Actual results:

Expected results:

Additional info:
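To make the accounting concrete, here is a minimal user-space sketch of the sleepers/count bookkeeping. It is a simplified model, not the kernel code: struct sem, add_negative(), down_try() and waiter_check() are made-up names, and atomic_add_negative() plus the spinlocked bookkeeping are collapsed into plain integer arithmetic. Walking one up()/wakeup cycle shows how the extra sem->sleepers++ on the wakeup path leaves sem->count at 1 while the waiter still holds the semaphore, which is the corruption described above.

/*
 * Hypothetical model of the sleepers/count bookkeeping (C99).
 * Not the real arch/ia64/kernel/semaphore.c code.
 */
#include <stdio.h>
#include <stdbool.h>

struct sem { int count; int sleepers; };

/* Models atomic_add_negative(): add i to count, report whether result < 0. */
static bool add_negative(struct sem *s, int i)
{
	s->count += i;
	return s->count < 0;
}

/* down() fast path: decrement count, acquired if it stays >= 0. */
static bool down_try(struct sem *s)
{
	return --s->count >= 0;
}

/* Blocked-waiter bookkeeping; 'increment' mirrors whether sleepers++ runs. */
static bool waiter_check(struct sem *s, bool increment)
{
	if (increment)
		s->sleepers++;
	int sleepers = s->sleepers;
	bool gotit = !add_negative(s, sleepers - 1);
	s->sleepers = gotit ? 0 : 1;
	return gotit;
}

static void run(bool buggy)
{
	struct sem s = { .count = 1, .sleepers = 0 };

	down_try(&s);           /* task A acquires: count 1 -> 0          */
	down_try(&s);           /* task B fails:    count 0 -> -1         */
	waiter_check(&s, true); /* B's __wtd_down(): queued, sleepers = 1 */

	s.count++;              /* A's up(): count -1 -> 0, wakes B       */
	/* B's wakeup handler re-checks; the buggy path increments again. */
	waiter_check(&s, buggy);

	printf("%s: B holds the semaphore, yet count=%d (should be 0)\n",
	       buggy ? "buggy __wtd_down      " : "__wtd_down_from_wakeup",
	       s.count);
}

int main(void)
{
	run(true);   /* prints count=1: the next down() wrongly succeeds */
	run(false);  /* prints count=0: correct                          */
	return 0;
}

With the buggy path, B acquires with count left at 1, so B's eventual up() pushes it to 2 and two later down()s succeed simultaneously; the fixed path re-checks without the increment and leaves count at 0.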
--- linux-2.4.18-e.43/arch/ia64/kernel/semaphore.c.orig	2004-04-23 17:04:49.000000000-0700
+++ linux-2.4.18-e.43/arch/ia64/kernel/semaphore.c	2004-04-23 17:09:23.000000000-0700
@@ -46,6 +46,7 @@ __up (struct semaphore *sem)
 static spinlock_t semaphore_lock = SPIN_LOCK_UNLOCKED;
 
 void __wtd_down(struct semaphore * sem, struct worktodo *wtd);
+void __wtd_down_from_wakeup(struct semaphore * sem, struct worktodo *wtd);
 
 void __wtd_down_action(void *data)
 {
@@ -55,7 +56,7 @@ void __wtd_down_action(void *data)
 	wtd_pop(wtd);
 	sem = wtd->data;
 
-	__wtd_down(sem, wtd);
+	__wtd_down_from_wakeup(sem, wtd);
 }
 
 void __wtd_down_waiter(wait_queue_t *wait)
@@ -93,6 +94,33 @@ void __wtd_down(struct semaphore * sem,
 	}
 }
 
+/*
+ * Same as __wtd_down, but sem->sleepers is not incremented when coming
+ * from a wakeup.
+ */
+void __wtd_down_from_wakeup(struct semaphore * sem, struct worktodo *wtd)
+{
+	int gotit;
+	int sleepers;
+
+	init_waitqueue_func_entry(&wtd->wait, __wtd_down_waiter);
+	wtd->data = sem;
+
+	spin_lock_irq(&semaphore_lock);
+	sleepers = sem->sleepers;
+	gotit = add_wait_queue_exclusive_cond(&sem->wait, &wtd->wait,
+			atomic_add_negative(sleepers - 1, &sem->count));
+	if (gotit)
+		sem->sleepers = 0;
+	else
+		sem->sleepers = 1;
+	spin_unlock_irq(&semaphore_lock);
+
+	if (gotit) {
+		wake_up(&sem->wait);
+		wtd_queue(wtd);
+	}
+}
+
 /* Returns 0 if we acquired the semaphore, 1 if it was queued. */
 int wtd_down(struct worktodo *wtd, struct semaphore *sem)
 {
This is in U5 for ia64; changing the status to MODIFIED.
An erratum has been issued which should resolve the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2004-327.html