Bug 234539 - multiple streams of I/O can cause system to lock up
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: cmirror
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Assigned To: Jonathan Earl Brassow
QA Contact: Cluster QE
Reported: 2007-03-29 17:26 EDT by Jonathan Earl Brassow
Modified: 2010-01-11 21:02 EST
Doc Type: Bug Fix
Last Closed: 2008-06-30 16:29:45 EDT

Description Jonathan Earl Brassow 2007-03-29 17:26:59 EDT
I have 3 machines and 3 shared file systems on 3 cluster mirrors (9 I/O streams).

Rarely, a machine simply locks up with no indication of what happened.

Killing the log server machine virtually guarantees hitting the problem
(usually on the next machine to become the log server).  It then often continues
to strike, hitting each machine in a round-robin fashion.
Comment 1 Jonathan Earl Brassow 2007-03-29 17:28:37 EDT
Have a fix.  Will check in (and explain) after running overnight.
Comment 3 Jonathan Earl Brassow 2007-03-30 11:48:52 EDT
The issue appears to be that I am holding a spin lock across a blocking
operation.  The process holding the lock gets swapped out, and other processes
come in, try to acquire the lock, and spin.

I found other minor issues that could also be contributing factors:
1) a semaphore was being 'up'ed twice, rendering it ineffective
2) kmalloc operations should have been using GFP_NOFS instead of GFP_KERNEL
3) inadequate reserves in mempools
Comment 4 Jonathan Earl Brassow 2007-04-03 14:28:37 EDT
        This bug provoked an audit of the communications exchange, locking,
        and memory allocations/stack usage.

        Communication fixes include:
        1) Added sequence numbers to ensure that replies from the server
        correctly correspond to client requests.  It was found that if
        a client timed out waiting for a server to respond, it would send
        the request again.  However, the server may have simply been too
        busy to respond in a timely fashion.  It ends up responding to
        both the original request and the resent request - causing the
        client and server to become out-of-sync WRT log requests.
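The sequence-number check described above can be sketched in user-space C. This is a minimal illustration, not the cmirror code: the message layout, the struct and function names, and the choice to reuse the same sequence number on a resend are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical message layout; the real cmirror wire format is not shown here. */
struct log_msg {
    uint32_t seq;      /* sequence number chosen by the client */
    int      payload;  /* stand-in for the request type or the reply data */
};

struct log_client {
    uint32_t next_seq;     /* sequence number for the next new request */
    uint32_t pending_seq;  /* sequence number of the outstanding request */
    bool     waiting;      /* true while a request is outstanding */
};

/* Start a new request and remember which sequence number we expect back. */
static struct log_msg client_send(struct log_client *c, int payload)
{
    struct log_msg req = { .seq = c->next_seq++, .payload = payload };

    c->pending_seq = req.seq;
    c->waiting = true;
    return req;
}

/* Resend after a timeout.  Assumption: the resend reuses the same number. */
static struct log_msg client_resend(const struct log_client *c, int payload)
{
    struct log_msg req = { .seq = c->pending_seq, .payload = payload };

    assert(c->waiting);
    return req;
}

/* Accept a reply only if it answers the request we are waiting on. */
static bool client_accept(struct log_client *c, const struct log_msg *reply)
{
    if (!c->waiting || reply->seq != c->pending_seq)
        return false;  /* late or duplicate reply: drop it */
    c->waiting = false;
    return true;
}

/* Toy server: echoes the sequence number back with its answer. */
static struct log_msg server_reply(const struct log_msg *req)
{
    struct log_msg reply = { .seq = req->seq, .payload = req->payload * 2 };

    return reply;
}
```

If a busy server answers both the original request and the resend, the client accepts the first reply and drops the second, so the late duplicate can never be mistaken for the answer to the next request.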

        Locking fixes include:
        1) A semaphore was being "up"ed twice in some cases, rendering
        the lock impotent.

        2) A spin lock controlling region status lists was being held
        across blocking operations - sometimes causing deadlocks.  The
        spin lock was changed to a per-log lock, and some logging
        operations were restructured to better suit the way locking
        needed to be done.  A side-effect of this fix is a 20%
        improvement in write operations.

        3) The log list protection lock needed to change from a spin lock
        to a semaphore to allow blocking operations.
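One common way to avoid holding a spin lock across a blocking operation is to detach the pending work while holding the lock and do the blocking part after releasing it. The user-space sketch below shows that pattern with a C11 `atomic_flag` standing in for a kernel spin lock. The structures, the function names, and `blocking_write` are hypothetical, and the restructuring actually done in cmirror may differ.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical region record; not the cmirror data structure. */
struct region {
    unsigned long  id;
    int            written;  /* set once the blocking write has finished */
    struct region *next;
};

struct region_log {
    atomic_flag    lock;     /* per-log spin lock; protects 'pending' only */
    struct region *pending;  /* regions waiting to be written out */
};

static void log_lock(struct region_log *l)
{
    /* Spinning is acceptable only because no holder ever blocks. */
    while (atomic_flag_test_and_set_explicit(&l->lock, memory_order_acquire))
        ;
}

static void log_unlock(struct region_log *l)
{
    atomic_flag_clear_explicit(&l->lock, memory_order_release);
}

/* Queue a region; only pointer updates happen under the lock. */
static void log_add(struct region_log *l, struct region *r)
{
    assert(r != NULL);
    log_lock(l);
    r->next = l->pending;
    l->pending = r;
    log_unlock(l);
}

/* Stand-in for an operation that may sleep (disk or network I/O). */
static void blocking_write(struct region *r)
{
    r->written = 1;
}

/* Detach the whole list under the lock, then block with the lock released. */
static unsigned long log_flush(struct region_log *l)
{
    struct region *batch;
    unsigned long n = 0;

    log_lock(l);
    batch = l->pending;
    l->pending = NULL;
    log_unlock(l);  /* released before anything can sleep */

    while (batch) {
        struct region *next = batch->next;

        blocking_write(batch);
        batch = next;
        n++;
    }
    return n;
}
```

Where the blocking work cannot be moved outside the critical section, a sleeping lock such as a semaphore is the usual alternative, which is the change item 3 above describes for the log list.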

        Memory allocation fixes include:
        1) Wrong flags to kmalloc could cause deadlock.  Use GFP_NOFS
        instead of GFP_KERNEL.

        2) Mempools needed more reserves for low memory conditions.

        3) Server now allocates a communication structure instead of having
        it on the stack.  This reduces the likelihood of stack corruption.
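The idea behind a mempool reserve can be modeled in user space: a fixed number of elements is allocated up front, and the reserve is used only when the normal allocator fails. The sketch below is a simplified model, not the kernel mempool API. All names, including the `simulate_oom` switch, are hypothetical, and where the kernel version can sleep until an element is returned, this model just returns NULL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct pool {
    void  **reserve;    /* preallocated elements held back for emergencies */
    size_t  nr_free;    /* elements currently in the reserve */
    size_t  min_nr;     /* size the reserve is kept topped up to */
    size_t  elem_size;
};

/* Test switch that makes the normal allocator fail (simulated low memory). */
static bool simulate_oom;

static void *try_alloc(size_t size)
{
    return simulate_oom ? NULL : malloc(size);
}

static void pool_destroy(struct pool *p)
{
    while (p->nr_free)
        free(p->reserve[--p->nr_free]);
    free(p->reserve);
    p->reserve = NULL;
}

/* Fill the reserve up front, while memory is still available. */
static bool pool_init(struct pool *p, size_t min_nr, size_t elem_size)
{
    p->reserve = calloc(min_nr, sizeof(*p->reserve));
    if (!p->reserve)
        return false;
    p->nr_free = 0;
    p->min_nr = min_nr;
    p->elem_size = elem_size;
    while (p->nr_free < min_nr) {
        void *e = malloc(elem_size);

        if (!e) {
            pool_destroy(p);
            return false;
        }
        p->reserve[p->nr_free++] = e;
    }
    return true;
}

/* Try the normal allocator first; dip into the reserve only when it fails. */
static void *pool_alloc(struct pool *p)
{
    void *e = try_alloc(p->elem_size);

    if (e)
        return e;
    if (p->nr_free > 0)
        return p->reserve[--p->nr_free];
    return NULL;  /* reserve exhausted: this is why its size matters */
}

/* Refill the reserve before handing memory back to the system. */
static void pool_free(struct pool *p, void *e)
{
    if (p->nr_free < p->min_nr)
        p->reserve[p->nr_free++] = e;
    else
        free(e);
}
```

With a reserve of two elements, a third allocation under simulated memory pressure fails until one element is freed back, which shows how an undersized reserve can stall progress in low-memory conditions.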
Comment 5 Jonathan Earl Brassow 2007-04-03 16:12:53 EDT
post -> modified
Comment 7 Corey Marthaler 2008-06-30 16:29:45 EDT
Closing this bug.
