I have 3 machines + 3 shared FSs on 3 cluster mirrors (9 I/O streams). Rarely, a machine simply locks up with no indication of what happened. Killing the log server machine virtually guarantees hitting the problem (usually on the next machine to become the log server). It then often continues to strike, hitting each machine in round-robin fashion.
Have a fix. Will check it in (and explain) after running it overnight.
The issue appears to be that I am holding a spin lock across a blocking operation. The process holding the lock gets swapped out, and other processes come in, try to acquire the lock, and spin. I also found other minor issues that could be contributing factors: 1) a semaphore was being 'up'ed twice, rendering it ineffective; 2) kmalloc calls should have been using GFP_NOFS instead of GFP_KERNEL; 3) inadequate reserves in mempools.
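For illustration, here is a minimal userspace sketch of the pattern that avoids holding a spin lock across a blocking operation: take the lock only long enough to detach the pending work, drop it, then do the blocking work unlocked. This is not the actual driver code. All names (`struct spin`, `struct log`, `flush_pending`, `demo_flush`) are hypothetical, and the spin lock is a C11-atomics stand-in for a kernel spin lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a kernel spin lock. */
struct spin { atomic_bool held; };

static void spin_acquire(struct spin *s)
{
    bool expected = false;
    /* Busy-wait until we flip 'held' from false to true. */
    while (!atomic_compare_exchange_weak(&s->held, &expected, true))
        expected = false;
}

static void spin_release(struct spin *s)
{
    atomic_store(&s->held, false);
}

/* Hypothetical region entry and per-log state. */
struct region { struct region *next; int id; };

struct log {
    struct spin lock;           /* protects 'pending' only */
    struct region *pending;     /* regions waiting to be flushed */
    int flushed;                /* regions handed to the flush callback */
    int flushed_while_locked;   /* should stay 0 */
};

/* Stand-in for an operation that may block (disk or network I/O). */
typedef void (*flush_fn)(struct log *lg, struct region *r);

/* Detach the list under the lock, then call the blocking op unlocked. */
static int flush_pending(struct log *lg, flush_fn blocking_flush)
{
    struct region *batch, *next;
    int n = 0;

    assert(lg != NULL && blocking_flush != NULL);

    spin_acquire(&lg->lock);
    batch = lg->pending;
    lg->pending = NULL;
    spin_release(&lg->lock);    /* never held across blocking_flush() */

    for (; batch != NULL; batch = next) {
        next = batch->next;
        blocking_flush(lg, batch);
        n++;
    }
    return n;
}

/* Demo callback: records whether the lock was held when it was called. */
static void demo_flush(struct log *lg, struct region *r)
{
    (void)r;
    if (atomic_load(&lg->lock.held))
        lg->flushed_while_locked++;
    lg->flushed++;
}
```

Calling `flush_pending(&lg, demo_flush)` on a log with two pending regions should leave `flushed_while_locked` at zero, since the lock is released before any callback runs. Where the protected work itself must block, a sleeping lock (a semaphore or mutex) is the usual alternative.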
This bug provoked an audit of the communications exchange, locking, and memory allocations/stack usage.

Communication fixes include:
1) Added sequence numbers to ensure that replies from the server correctly correspond to client requests. It was found that if a client timed out waiting for a server to respond, it would send the request again. However, the server may have simply been too busy to respond in a timely fashion. It ends up responding to both the original request and the resent request, causing the client and server to become out of sync with respect to log requests.

Locking fixes include:
1) A semaphore was being "up"ed twice in some cases, rendering the lock ineffective.
2) A spin lock controlling region status lists was being held across blocking operations, sometimes causing deadlocks. The spin lock was changed to a per-log lock, and some logging operations were restructured to better suit the way locking needed to be done. A side effect of this fix is a 20% improvement in write operations.
3) The log list protection lock needed to change from a spin lock to a semaphore to allow blocking operations.

Memory allocation fixes include:
1) Wrong flags to kmalloc could cause deadlock. Use GFP_NOFS instead of GFP_KERNEL.
2) Mempools needed more reserves for low memory conditions.
3) Server now allocates a communication structure instead of having it on the stack. This reduces the likelihood of stack corruption.
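The sequence-number check from the communication fixes can be sketched roughly as follows. This is a userspace illustration, not the checked-in code. The message layouts and function names are hypothetical, and it assumes one outstanding request per client and that a resent request reuses the original sequence number (the bug does not state either).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical message layouts; the real wire format is not shown here. */
struct log_request { uint32_t seq; int type; };
struct log_reply   { uint32_t seq; int status; };

struct client {
    uint32_t next_seq;     /* sequence number for the next new request */
    uint32_t awaiting;     /* sequence number of the outstanding request */
    bool     outstanding;  /* true while a reply is still expected */
};

/* Build a new request and remember which reply we are waiting for. */
static struct log_request client_new_request(struct client *c, int type)
{
    struct log_request rq;

    assert(!c->outstanding);   /* assumption: one request in flight */
    rq.seq = c->next_seq++;
    rq.type = type;
    c->awaiting = rq.seq;
    c->outstanding = true;
    return rq;
}

/* The server echoes the request's sequence number in its reply. */
static struct log_reply server_handle(const struct log_request *rq)
{
    struct log_reply rp = { .seq = rq->seq, .status = 0 };
    return rp;
}

/* Accept only the reply to the outstanding request; drop stale ones. */
static bool client_accept_reply(struct client *c, const struct log_reply *rp)
{
    if (!c->outstanding || rp->seq != c->awaiting)
        return false;
    c->outstanding = false;
    return true;
}
```

With this check, if a busy server answers both the original and the resent copy of a request, the client accepts the first reply and discards the second instead of mistaking it for the answer to its next request.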
post -> modified
Closing this bug.