From Bugzilla Helper:
User-Agent: Mozilla/5.0 (compatible; Konqueror/3.1; Linux)
Description of problem:
Shared file IO using lock_dlm is slow.
A test running against a single file in a gfs filesystem, from a single node, gets ~5000 requests/sec -- ~2,560,000 bytes/sec (each request is a 512-byte read/write).
Add a second node going after the same file, and performance collapses.
The two processes, started at the same time, run for perhaps a few seconds at about 250 ops/sec; then one of the
nodes effectively stops (its IO rate drops to 1 op every 4+ hours) and
the other node runs in fits and spurts, perhaps 1-2 ops per minute.
IO to a clvm vol from one node:
1200 req/sec ~ 614,400 bytes/sec
clvm from two nodes to the same vol:
1000 req/sec ~ 512,000 bytes/sec (512 KB/sec)
IO to raw disk from one node:
10,000 ops/sec ~ 5,120,000 bytes/sec (5000 KB/sec)
from 2 nodes:
8,000 ops/sec ~ 4,096,000 bytes/sec (4000 KB/sec)
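The rates above follow directly from the 512-byte request size; a minimal sanity check (the helper name `throughput` is ours, not from the test tools):

```python
# Each iogen/doio request is a 512-byte read/write, so
# bytes/sec = requests/sec * 512.
REQUEST_SIZE = 512  # bytes per read/write

def throughput(reqs_per_sec):
    """Bytes/sec for a given request rate."""
    return reqs_per_sec * REQUEST_SIZE

gfs_one_node  = throughput(5000)   # 2,560,000 bytes/sec
clvm_one_node = throughput(1200)   # 614,400 bytes/sec
raw_one_node  = throughput(10000)  # 5,120,000 bytes/sec
```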
Version-Release number of selected component (if applicable):
Lock_DLM (built Jun 17 2004 10:54:06) installed
Steps to Reproduce:
1. Assume you are currently cd'ed into a gfs fs.
2. Run: iogen -t 1b -T 1b 10000b:sharefile | doio -m 1 on each node.
Start with a single node, then try 2 nodes starting at once.
3. For the clvm and raw disk IO numbers, b_iogen -t 1b -T 1b -d /dev/dean/lvol0 | b_doio -m 100
and b_iogen -t 1b -T 1b -d /dev/sda | b_doio -m 1000 were used.
An email from Ken:
It's a problem I'd seen once or twice before, but it didn't seem to happen
too much and fell to the bottom of the todo pile. But apparently there's
something about the way that the DLM threads work that triggers it.
The solution to the problem should be part of GFS. Either GFS refuses to
respond to a callback for some minimum amount of time after a page fault,
or GFS somehow hooks into the scheduling code so the lock isn't released
until after the process that faulted gets run for a timeslice.
Either way, I think you can blame the bug on me and stop looking at it.
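Ken's first proposal (refuse a callback for some minimum time after a page fault) can be sketched in miniature. This is a toy model of the idea, not GFS code; the class, the 50 ms hold time, and the method names are all our own illustration:

```python
import time

class MinHoldLock:
    """Toy model of the proposed fix: after a page fault takes the
    lock, callbacks asking us to release it are refused until a
    minimum hold time has passed, so the faulting process actually
    gets to run before the lock bounces to the other node."""

    MIN_HOLD = 0.05  # hypothetical minimum hold time, in seconds

    def __init__(self):
        self.acquired_at = None

    def fault_in(self):
        # A page fault acquires the lock and records when it did so.
        self.acquired_at = time.monotonic()

    def callback(self):
        # A remote node asks us to release. Refuse if the lock is
        # too fresh; otherwise release it.
        if self.acquired_at is None:
            return True  # nothing held
        held_for = time.monotonic() - self.acquired_at
        if held_for < self.MIN_HOLD:
            return False  # too soon: let the faulting process run first
        self.acquired_at = None
        return True  # released

lock = MinHoldLock()
lock.fault_in()
refused_immediately = lock.callback()  # refused: held for < MIN_HOLD
time.sleep(MinHoldLock.MIN_HOLD)
granted_later = lock.callback()        # released: minimum hold elapsed
```

Without the minimum hold, each node's page fault is immediately undone by the other node's callback, which matches the ping-pong stall described above.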
I just checked in code that should fix this.
Would QA please verify that this solves the problem for them?
Oops. Didn't mean to close the bug.
Looks good now... I ran up to 6 nodes doing read/write to the shared file,
and other than the expected slowdown as each node was added, it ran cleanly.
Updating version to the right level in the defects. Sorry for the storm.