Red Hat Bugzilla – Bug 126531
dlm slowness with shared file IO from multiple nodes
Last modified: 2010-01-11 21:52:41 EST
Description of problem:
Shared file IO using lock_dlm is slow.
A test running against a single file in a gfs fs from a single node gets ~5000 requests/sec, or ~2,560,000 bytes/sec (each request is a 512-byte read/write).
Add a second node going after the same file, and performance collapses.
The two processes, started at the same time, run at about 250 ops per sec for a few seconds. Then one of the
nodes stalls (its IO rate drops to 1 op every 4+ hours) and
the other node runs in fits and spurts, at perhaps 1-2 ops per minute.
IO to a clvm vol from one node:
1,200 req/sec ~ 614,400 bytes/sec (~600 k/sec)
IO to the same clvm vol from two nodes:
1,000 req/sec ~ 512,000 bytes/sec (~500 k/sec)
IO to raw disk from one node:
10,000 ops/sec ~ 5,120,000 bytes/sec (~5000 k/sec)
IO to raw disk from two nodes:
8,000 ops/sec ~ 4,096,000 bytes/sec (~4000 k/sec)
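For anyone checking the numbers, the byte rates above are just the request rate times the 512-byte request size. A minimal sketch of that arithmetic follows; the function name and the labels are made up for illustration, and only the rates and the request size come from this report.

```python
# Each request in these tests is one 512-byte block (the "1b" in -t 1b -T 1b).
REQUEST_SIZE_BYTES = 512


def throughput_bytes_per_sec(requests_per_sec, request_size=REQUEST_SIZE_BYTES):
    """Convert a request rate into a byte rate for fixed-size requests."""
    return requests_per_sec * request_size


# Request rates as reported in this bug; the labels are only descriptive.
reported_rates = {
    "gfs file, one node": 5000,
    "clvm vol, one node": 1200,
    "clvm vol, two nodes": 1000,
    "raw disk, one node": 10000,
    "raw disk, two nodes": 8000,
}

for label, rate in reported_rates.items():
    print(f"{label}: {rate} req/sec = {throughput_bytes_per_sec(rate):,} bytes/sec")
```

By the same arithmetic, the ~250 ops per sec seen briefly with two nodes on the gfs file is about 128,000 bytes/sec, before the rate collapses.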
Version-Release number of selected component (if applicable):
Lock_DLM (built Jun 17 2004 10:54:06) installed
Steps to Reproduce:
1. Assume you are currently cd'ed into a gfs fs.
2. Run: iogen -t 1b -T 1b 10000b:sharefile | doio -m 1 on each node.
Start with a single node, then try 2 nodes starting at once.
3. For the clvm and raw disk IO numbers, the following were used:
b_iogen -t 1b -T 1b -d /dev/dean/lvol0 | b_doio -m 100
b_iogen -t 1b -T 1b -d /dev/sda | b_doio -m 1000
An email from Ken:
It's a problem I'd seen once or twice before, but it didn't seem to happen
too much and fell to the bottom of the todo pile. But apparently there's
something about the way that the DLM threads work that triggers it.
The solution to the problem should be part of GFS. Either GFS refuses to
respond to a callback for some minimum amount of time after a page fault,
or GFS somehow hooks into the scheduling code so the lock isn't released
until after the process that faulted gets run for a timeslice.
Either way, I think you can blame the bug on me and stop looking at it.
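To illustrate the first option Ken describes (not responding to a callback for some minimum time after a page fault), here is a toy model of a minimum hold time. It is not GFS code and not the fix that was checked in. The class name, method names, and the 0.5 second value in the usage note are all hypothetical.

```python
import time


class HeldLock:
    """Toy model of a lock holder that defers release callbacks (hypothetical)."""

    def __init__(self, min_hold_s, clock=time.monotonic):
        self.min_hold_s = min_hold_s  # assumed minimum time to keep the lock
        self.clock = clock            # injectable so the model can be tested
        self.acquired_at = None       # None means the lock is not held

    def acquire(self):
        """Record when the lock was granted, e.g. at page-fault time."""
        self.acquired_at = self.clock()

    def delay_before_release(self):
        """Seconds the holder should still wait before honoring a callback."""
        if self.acquired_at is None:
            return 0.0
        held_for = self.clock() - self.acquired_at
        return max(0.0, self.min_hold_s - held_for)
```

With HeldLock(0.5), a callback arriving 0.2 seconds after acquire() would be deferred for roughly another 0.3 seconds, so the faulting process gets a chance to run before the lock bounces to the other node.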
I just checked in code that should fix this.
Would QA please verify that this solves the problem for them.
Oops. Didn't mean to close the bug.
Looks good now... I ran up to 6 nodes read/write to the shared file
and other than the expected slow down as each node was added it
Updating version to the right level in the defects. Sorry for the storm.