Description of problem:
When running the following "stress test" on GFS, the dd processes get stuck after about 30 minutes:

root@nodea:~# touch /tmp/dd_running; for i in $(seq 1 48); do (while [ -e /tmp/dd_running ]; do dd if=/dev/mpath/P9500_0548 of=/mnt/gfstest/lilc066/dd.out.$i bs=64k count=16384 iflag=direct skip=$(echo "16384*$i"|bc) oflag=direct >/dev/null 2>&1; done& ) ; done

root@nodeb:~# touch /tmp/dd_running; for i in $(seq 1 48); do (while [ -e /tmp/dd_running ]; do dd if=/dev/mpath/P9500_0548 of=/mnt/gfstest/lilc067/dd.out.$i bs=64k count=16384 iflag=direct skip=$(echo "16384*$i"|bc) oflag=direct >/dev/null 2>&1; done& ) ; done

Version-Release number of selected component (if applicable):

How reproducible:
Run the commands listed above on both nodes.

Actual results:
After a while (< 30 min) all I/O to the GFS filesystem stops. All dd processes are waiting in glock_wait_internal:

19694 D dd glock_wait_internal
19701 D dd glock_wait_internal
19702 D dd glock_wait_internal
19706 D dd glock_wait_internal
19710 D dd glock_wait_internal
19714 D dd glock_wait_internal

Expected results:
dd should not get stuck.

Additional info:
2-node cluster: lilc066, lilc067
Storage: HP P9500
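For reference, a minimal sketch of how the state shown under "Actual results" could be captured while the hang is in progress. The mount point /mnt/gfstest is taken from the commands above; the availability of gfs_tool and sysrq on the affected nodes is an assumption, not something confirmed in this report.

# List D-state (uninterruptible) processes with the kernel function they are sleeping in
ps -eo pid,stat,comm,wchan:32 | awk '$2 ~ /^D/'

# Dump the GFS glock state for the hung mount (assumes gfs_tool is installed)
gfs_tool lockdump /mnt/gfstest > /tmp/lockdump.$(hostname).txt

# Log blocked-task stack traces to the kernel ring buffer (assumes sysrq is enabled)
echo w > /proc/sysrq-trigger; dmesg | tail -n 200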
I don't think we can realistically figure out what is going on here if the customer has given up on it. We don't have the daemon which appears to be at the root of the problem. Also, the dd test is a very strange one:

1. It reads from a block device (is this separate from the one the filesystem is on? At least I hope it is).
2. It reads and writes with the O_DIRECT flag.
3. The destination files do not appear to be pre-allocated, so the writes lose all the benefits of O_DIRECT, since in that case they turn into buffered sync writes. That makes no sense to me as a use case unless the destination files have been preallocated (see the sketch below).

As a result I'm going to close this. If you think that is wrong, then please reopen.
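For completeness, a sketch of how the destination files could be preallocated on one node before the O_DIRECT passes, using the paths and sizes from the dd commands above (16384 x 64 KiB = 1 GiB per file). Whether fallocate is supported on this GFS release is an assumption, so a plain buffered pre-write is shown as the fallback.

# Preallocate each destination file once, so later O_DIRECT writes hit existing blocks.
# fallocate support on GFS is an assumption; the dd fallback simply writes the file buffered.
for i in $(seq 1 48); do
    fallocate -l $((16384*64*1024)) /mnt/gfstest/lilc066/dd.out.$i 2>/dev/null \
        || dd if=/dev/zero of=/mnt/gfstest/lilc066/dd.out.$i bs=64k count=16384 conv=fsync >/dev/null 2>&1
done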