Bug 126531

Summary: dlm slowness with shared file IO from multiple nodes
Product: [Retired] Red Hat Cluster Suite
Component: gfs
Version: 4
Hardware: i386
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: medium
Reporter: Dean Jansa <djansa>
Assignee: Ken Preslan <kpreslan>
QA Contact: Derek Anderson <danderso>
CC: cmarthal
Doc Type: Bug Fix
Last Closed: 2004-09-14 14:55:42 UTC

Description Dean Jansa 2004-06-22 21:15:20 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (compatible; Konqueror/3.1; Linux)

Description of problem:

Shared file IO using lock_dlm is slow.

A test doing IO to a single file in a gfs filesystem from a single node gets ~5000 requests/sec, ~2,560,000 bytes/sec (each request is a 512-byte read or write).

Add a second node doing IO to the same file and performance collapses.

The two processes, started at the same time, run at about 250 ops/sec for perhaps a few seconds; then one node effectively stops (its IO rate drops to one op every 4+ hours) and the other runs in fits and spurts, perhaps 1-2 ops per minute.

For reference:

IO to a clvm vol from one node:
1200 req/sec ~ 614,400 bytes/sec

clvm from two nodes to the same vol:
1000 req/sec ~ 512,000 bytes/sec (~512 KB/sec)

IO to raw disk from one node:
10,000 ops/sec ~ 5,120,000 bytes/sec (~5000 KB/sec)

from 2 nodes:
8,000 ops/sec ~ 4,096,000 bytes/sec (~4000 KB/sec)


Version-Release number of selected component (if applicable):
Lock_DLM (built Jun 17 2004 10:54:06) installed 

How reproducible:
Always

Steps to Reproduce:
1. Assume you are currently cd'ed into a gfs fs.
2. On each node, run: iogen -t 1b -T 1b 10000b:sharefile | doio -m 1
   (a rough C stand-in is sketched after these steps).
   Start with a single node, then try two nodes starting at once.

3. The clvm and raw disk reference numbers above were gathered with
   b_iogen -t 1b -T 1b -d /dev/dean/lvol0 | b_doio -m 100 and
   b_iogen -t 1b -T 1b -d /dev/sda | b_doio -m 1000 respectively.
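
If the iogen/doio test tools are not available, a rough stand-in for the shared-file workload is the small C program below: random block-aligned 512-byte reads and writes to the same file from each node. The file name, size, and read/write mix are assumptions meant to mirror the iogen invocation above, not an exact reimplementation of those tools.

/* Rough stand-in for "iogen -t 1b -T 1b 10000b:sharefile | doio":
 * random block-aligned 512-byte reads and writes to a shared file.
 * Run the same program on each node against the same GFS file.
 * File name, size and read/write mix are assumptions for illustration. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLK   512            /* one request = one 512-byte block */
#define NBLKS 10000L         /* file size: 10000 blocks, as in the iogen args */

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "sharefile";
    char buf[BLK];
    long ops;
    int fd = open(path, O_RDWR | O_CREAT, 0644);

    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (ftruncate(fd, NBLKS * BLK) < 0) {
        perror("ftruncate");
        return 1;
    }

    for (ops = 0; ; ops++) {
        off_t off = (random() % NBLKS) * BLK;

        /* Alternate reads and writes at random block-aligned offsets. */
        if (ops & 1) {
            if (pwrite(fd, buf, BLK, off) != BLK)
                perror("pwrite");
        } else {
            if (pread(fd, buf, BLK, off) != BLK)
                perror("pread");
        }
        if (ops % 5000 == 0)
            printf("%ld ops\n", ops);   /* crude progress/ops counter */
    }
}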

   

Additional info:

Comment 1 Christine Caulfield 2004-07-14 16:59:12 UTC
an email from Ken:

It's a problem I'd seen once or twice before, but it didn't seem to happen
too much and fell to the bottom of the todo pile.  But apparently there's
something about the way that the DLM threads work that triggers it.

The solution to the problem should be part of GFS.  Either GFS refuses to
respond to a callback for some minimum amount of time after a page fault,
or GFS somehow hooks into the scheduling code so the lock isn't released
until after the process that faulted gets run for a timeslice.

Either way, I think you can blame the bug on me and stop looking at it.
:-)
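
To make the first approach concrete, here is a minimal userspace sketch of the idea: after a node takes the lock to service a page fault, it refuses to honor a remote callback until a minimum hold time has passed, so the faulting process gets a chance to run. All names and the 50 ms figure are invented for illustration; the real change would live in GFS's glock/callback code, not here.

/* Minimal userspace model of "hold the lock for a minimum time after a
 * page fault before answering a callback".  All names and the 50 ms
 * value are hypothetical; this is not the actual GFS fix. */
#include <stdbool.h>
#include <time.h>

#define MIN_HOLD_NS (50 * 1000 * 1000)   /* assumed minimum hold time: 50 ms */

struct demo_lock {
    struct timespec acquired;    /* when this node last acquired the lock */
    bool held;
};

static long elapsed_ns(const struct timespec *since)
{
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    return (now.tv_sec - since->tv_sec) * 1000000000L +
           (now.tv_nsec - since->tv_nsec);
}

/* Called when another node asks us to drop the lock.  Returning false
 * means "defer the callback and retry later". */
bool should_release_now(const struct demo_lock *lk)
{
    if (!lk->held)
        return true;
    return elapsed_ns(&lk->acquired) >= MIN_HOLD_NS;
}

int main(void)
{
    struct demo_lock lk = { .held = true };
    clock_gettime(CLOCK_MONOTONIC, &lk.acquired);
    /* A callback arriving immediately after acquisition is deferred,
     * so the local process that faulted gets to run first. */
    return should_release_now(&lk) ? 1 : 0;
}

The second approach Ken mentions (not releasing until the faulting process has run for a timeslice) trades this fixed delay for scheduler awareness; either way the aim is to stop the lock from bouncing between nodes faster than the faulting processes can use it.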


Comment 2 Ken Preslan 2004-09-13 22:53:40 UTC
I just checked in code that should fix this.

Would QA please verify that this solves the problem for them?
Thanks.



Comment 3 Ken Preslan 2004-09-13 22:55:47 UTC
Oops.  Didn't mean to close the bug.



Comment 4 Dean Jansa 2004-09-14 14:55:42 UTC
Looks good now...  I ran up to 6 nodes doing read/write to the shared file,
and other than the expected slowdown as each node was added, it
seemed OK.

Comment 5 Kiersten (Kerri) Anderson 2004-11-16 19:03:29 UTC
Updating version to the right level in the defects.  Sorry for the storm.