Bug 719415

Summary: dd on GFS gets stuck in glock_wait_internal
Summary: dd on GFS gets stuck in glock_wait_internal
Product: Red Hat Enterprise Linux 5
Component: gfs-kmod
Version: 5.6
Hardware: x86_64
OS: Linux
Severity: medium
Priority: medium
Status: CLOSED INSUFFICIENT_DATA
Reporter: Harald Klein <hklein>
Assignee: Robert Peterson <rpeterso>
QA Contact: Cluster QE <mspqa-list>
CC: adas, anprice, bmarzins, rpeterso, swhiteho, teigland
Target Milestone: rc
Doc Type: Bug Fix
Last Closed: 2011-08-01 11:46:46 UTC

Description Harald Klein 2011-07-06 18:27:08 UTC
Description of problem:

When running the following "stress test" on GFS, the dd processes get stuck after about 30 minutes:

root@nodea:~# touch /tmp/dd_running; for i in $(seq 1 48); do (while [ -e /tmp/dd_running ]; do dd if=/dev/mpath/P9500_0548 of=/mnt/gfstest/lilc066/dd.out.$i bs=64k count=16384 iflag=direct skip=$(echo "16384*$i"|bc) oflag=direct >/dev/null 2>&1; done& ) ; done

root@nodeb:~# touch /tmp/dd_running; for i in $(seq 1 48); do (while [ -e /tmp/dd_running ]; do dd if=/dev/mpath/P9500_0548 of=/mnt/gfstest/lilc067/dd.out.$i bs=64k count=16384 iflag=direct skip=$(echo "16384*$i"|bc) oflag=direct >/dev/null 2>&1; done& ) ; done
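For readability, the same loop in multi-line form (one node shown; the second node uses /mnt/gfstest/lilc067 as the output directory). The arithmetic expansion replaces the call to bc but is otherwise equivalent to the one-liners above:

touch /tmp/dd_running
for i in $(seq 1 48); do
    (
        while [ -e /tmp/dd_running ]; do
            dd if=/dev/mpath/P9500_0548 \
               of=/mnt/gfstest/lilc066/dd.out.$i \
               bs=64k count=16384 skip=$((16384 * i)) \
               iflag=direct oflag=direct >/dev/null 2>&1
        done
    ) &
done
# Removing /tmp/dd_running stops all loops.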

Version-Release number of selected component (if applicable):


How reproducible:
Run the commands listed above, one on each node.
  
Actual results:
After a while (less than 30 minutes), all I/O to the GFS filesystem stops. All dd processes are waiting in glock_wait_internal:

19694 D dd glock_wait_internal
19701 D dd glock_wait_internal
19702 D dd glock_wait_internal
19706 D dd glock_wait_internal
19710 D dd glock_wait_internal
19714 D dd glock_wait_internal
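(For reference, a sketch of how such a list can be gathered; the exact commands the reporter used are not recorded in this bug, so treat these as assumptions. Both require root, which matches the reproduction steps above.)

# List processes in uninterruptible sleep (D state) with their wait channel:
ps -eo pid,stat,comm,wchan:32 | awk '$2 ~ /D/'

# Dump stack traces of all blocked tasks to the kernel log:
echo w > /proc/sysrq-trigger
dmesg | tail -n 200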

Expected results:
dd should not get stuck

Additional info:
2-Node Cluster: lilc066, lilc067
Storage: HP P9500

Comment 10 Steve Whitehouse 2011-08-01 11:46:46 UTC
I don't think we can realistically figure out what is going on here if the customer has given up on it. We don't have the daemon which appears to be at the root of the problem. Also, the dd test is a very strange one:

1. It reads from a block device (is this separate from the device the filesystem is on? At least I hope it is)
2. It reads and writes with the O_DIRECT flag
3. The destination files do not appear to be pre-allocated, which loses all the benefits of writing with O_DIRECT, since the write falls back to a buffered sync write in that case (see the preallocation sketch below).

That makes no sense to me as a use case unless the destination files have been preallocated.
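A minimal sketch of the preallocation described above, reusing the reporter's paths and sizes purely for illustration (these commands are not from the original report). The file is written once buffered so its blocks are allocated, then overwritten in place with O_DIRECT; conv=notrunc keeps dd from truncating the preallocated file:

dd if=/dev/zero of=/mnt/gfstest/lilc066/dd.out.1 bs=64k count=16384
sync
dd if=/dev/mpath/P9500_0548 of=/mnt/gfstest/lilc066/dd.out.1 \
   bs=64k count=16384 iflag=direct oflag=direct conv=notrunc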

As a result I'm going to close this. If you think that is wrong, then please reopen.