I have 3 machines and 3 shared FSs on 3 cluster mirrors (9 I/O streams).
Rarely, a machine can simply lock up with no indication of what happened.
Killing the log server machine virtually guarantees hitting the problem
(usually on the next machine to become the log server). It then often continues
to strike, hitting each machine in round-robin fashion.
Have a fix. Will check it in (and explain) after running it overnight.
The issue appears to be that I am using a spin lock across a blocking
operation. The process holding the lock gets swapped out, and other processes
come in, try to acquire the lock, and spin.
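To illustrate the pattern, here is a minimal user-space sketch of one common way to avoid holding a spin lock across a blocking call: copy the work out under the lock, drop the lock, then block. It is not the actual fix. Every name in it (region_list, spin_acquire, spin_release, blocking_flush, flush_pending) is hypothetical, and the toy lock is built on C11 atomic_flag rather than the kernel's spinlock_t.

```c
#include <stdatomic.h>
#include <stddef.h>

#define MAX_PENDING 8

/* Hypothetical stand-in for a lock-protected region status list. */
struct region_list {
    atomic_flag lock;          /* toy spin lock, not the kernel's spinlock_t */
    int pending[MAX_PENDING];  /* region numbers waiting to be flushed */
    size_t count;              /* number of valid entries in pending[] */
};

void spin_acquire(atomic_flag *l)
{
    /* Busy-wait until the flag was previously clear. */
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
        ;
}

void spin_release(atomic_flag *l)
{
    atomic_flag_clear_explicit(l, memory_order_release);
}

/* Stand-in for an operation that may sleep (disk or network I/O).
 * Here it only reports how many regions it was handed. */
size_t blocking_flush(const int *regions, size_t n)
{
    (void)regions;
    return n;
}

/* Copy the pending work under the lock, release the lock, and only
 * then call the operation that may block. */
size_t flush_pending(struct region_list *rl)
{
    int batch[MAX_PENDING];
    size_t n, i;

    spin_acquire(&rl->lock);
    n = rl->count;
    for (i = 0; i < n; i++)
        batch[i] = rl->pending[i];
    rl->count = 0;
    spin_release(&rl->lock);

    /* The lock is no longer held, so sleeping here cannot leave other
     * processes spinning on it. */
    return blocking_flush(batch, n);
}
```

The critical section is reduced to a short copy, so a holder that gets scheduled out during the slow part no longer owns the lock.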
I found other minor issues that could also be contributing factors:
1) a semaphore was being 'up'ed twice, rendering it ineffective
2) kmalloc operations should have been using GFP_NOFS instead of GFP_KERNEL
3) inadequate reserves in mempools
This bug provoked an audit of the communications exchange, locking,
and memory allocations/stack usage.
Communication fixes include:
1) Added sequence numbers to ensure that replies from the server
correctly correspond to client requests. It was found that if
a client timed out waiting for a server to respond, it would send
the request again. However, the server may have simply been too
busy to respond in a timely fashion. The server then ends up
responding to both the original request and the resent request,
causing the client and server to become out-of-sync WRT log requests.
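The sequence number check can be sketched roughly as follows. This is a minimal user-space illustration, not the actual message format or code. The struct layouts and function names (log_msg, client, client_send, client_resend, server_reply, client_accept) are hypothetical, and it assumes a resend reuses the sequence number of the request it repeats.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical message layout; field names are illustrative only. */
struct log_msg {
    uint32_t seq;      /* sequence number chosen by the client */
    int      payload;  /* stand-in for the real request/reply data */
};

struct client {
    uint32_t next_seq;     /* next sequence number to hand out */
    uint32_t pending_seq;  /* sequence number of the outstanding request */
    bool     waiting;      /* true while a reply is still expected */
};

/* Build a new request and remember which reply we expect. */
struct log_msg client_send(struct client *c, int payload)
{
    struct log_msg m = { c->next_seq++, payload };
    c->pending_seq = m.seq;
    c->waiting = true;
    return m;
}

/* Resend after a timeout, reusing the outstanding sequence number
 * (an assumption of this sketch). */
struct log_msg client_resend(const struct client *c, int payload)
{
    struct log_msg m = { c->pending_seq, payload };
    return m;
}

/* The server echoes the request's sequence number in its reply. */
struct log_msg server_reply(const struct log_msg *req)
{
    struct log_msg r = { req->seq, req->payload + 1 };
    return r;
}

/* Accept a reply only if it answers the outstanding request;
 * late or duplicate replies are dropped. */
bool client_accept(struct client *c, const struct log_msg *reply)
{
    if (!c->waiting || reply->seq != c->pending_seq)
        return false;
    c->waiting = false;
    return true;
}
```

With this check, when a busy server answers both the original and the resent request, the client consumes the first reply and discards the second instead of mistaking it for the answer to its next request.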
Locking fixes include:
1) A semaphore was being "up"ed twice in some cases, rendering
the lock ineffective.
2) A spin lock controlling region status lists was being held
across blocking operations - sometimes causing deadlocks. The
spin lock was changed to a per-log lock, and some logging
operations were restructured to better suit the way locking
needed to be done. A side-effect of this fix is a 20%
improvement in write operations.
3) The log list protection lock needed to change from a spin lock
to a semaphore to allow blocking operations.
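To show why the double "up" in item 1 defeats the lock, here is a toy counting semaphore in plain C. It is a single-threaded model for illustration only, not the kernel semaphore API. The names toy_sem, toy_try_down, and toy_up are hypothetical.

```c
#include <stdbool.h>

/* Toy counting semaphore; a count of 1 means "one holder allowed". */
struct toy_sem {
    int count;
};

/* Try to take the semaphore without waiting.
 * Returns true if it was acquired, false if it is already held. */
bool toy_try_down(struct toy_sem *s)
{
    if (s->count > 0) {
        s->count--;
        return true;
    }
    return false;
}

/* Release the semaphore. Nothing stops a caller from releasing more
 * often than it acquired, which is exactly the bug being modeled. */
void toy_up(struct toy_sem *s)
{
    s->count++;
}
```

After one acquire followed by two releases, the count is 2, so two callers can hold the "mutex" at the same time and it no longer protects anything.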
Memory allocation fixes include:
1) Wrong flags to kmalloc could cause deadlock. Use GFP_NOFS instead
of GFP_KERNEL.
2) Mempools needed more reserves for low memory conditions.
3) Server now allocates a communication structure instead of having
it on the stack. This reduces the likelihood of stack corruption.
post -> modified
Closing this bug.