63041 – Quorumd starvation caused shootdown

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 2.1
Classification:	Red Hat
Component:	clumanager
Sub Component:
Version:	2.1
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Lon Hohberger
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2002-04-09 15:42 UTC by Tim Burke
Modified:	2008-05-01 15:38 UTC (History)
CC List:	0 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2002-06-11 19:39:41 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2002:226	0	normal	SHIPPED_LIVE	Fixes for clumanager addressing starvation and service hangs	2002-10-08 04:00:00 UTC

Description Tim Burke 2002-04-09 15:42:46 UTC

From Bugzilla Helper:
User-Agent: Mozilla/4.78 [en] (X11; U; Linux 2.4.7-10 i686)

Description of problem:
clumanager 1.0.9

Tom established a scenario where a cluster shootdown occurred due to multiple
concurrent IO exercisers causing cluquorumd to not be scheduled to run.

Several potential remedies were proposed, including:
- quorumd's nice level was 1 lower than the other daemons.
- scale the cluster lock timeout dynamically based on the tunable parameters
governing the quorumd algorithm to declare a node down.
- break up a svcmgr loop over 99 services, so it doesn't hold the lock too long.
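
The last two remedies can be sketched roughly as follows. This is an illustration only, not the clumanager code: the names check_interval_s, missed_checks_allowed, margin, and batch_size are hypothetical, and the assumption that the lock timeout should be a simple multiple of the node-down detection time is mine, not something stated in this report.

```python
import threading


def scaled_lock_timeout(check_interval_s, missed_checks_allowed, margin=2.0):
    """Derive a lock timeout from hypothetical node-down tunables.

    Assumption: a node is declared down after `missed_checks_allowed`
    consecutive missed checks, each `check_interval_s` seconds apart.
    """
    node_down_time = check_interval_s * missed_checks_allowed
    return node_down_time * margin


def run_in_batches(services, lock, handle, batch_size=10):
    """Call handle(service) for every service, dropping the lock between batches.

    Returns the largest number of services handled under one lock hold,
    so callers can see how long the lock was pinned at worst.
    """
    longest_hold = 0
    for start in range(0, len(services), batch_size):
        batch = services[start:start + batch_size]
        with lock:  # held only for this batch, not for all services
            for svc in batch:
                handle(svc)
        longest_hold = max(longest_hold, len(batch))
    return longest_hold
```

With 99 services and a batch size of 10, the lock is released nine times during the loop, which gives other lock users a chance to run in between.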


Version-Release number of selected component (if applicable):


How reproducible:
Sometimes

Steps to Reproduce:
1. Heavy load of multiple independent disk exercisers
2. Enabling debug level logging seemed to exacerbate the problem.


Actual Results:  shootdown

Expected Results:  Quorumd should have been scheduled to run.

Additional info:

Comment 1 Tim Burke 2002-05-01 14:29:35 UTC

I also reproduced a similar quorumd starvation scenario as follows:

An external FC RAID array was configured as a single disk, /dev/sdb, which was
partitioned into 16 slices.  The power switch type was set to watchdog.  On each
cluster member the following was running:
- the cluster software itself
- 5 separate dd processes to different partitions (each repeatedly run out of a
while loop script)
- clustat -i 1
- a script in a while loop would run `cat /proc/slabinfo` once a second
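
For anyone trying to recreate a comparable load, a rough sketch of the two looping pieces is below. It is not the original scripts: write_pass and poll_file are hypothetical helper names, the block size and count are arbitrary, and it is meant to be pointed at scratch files unless you really intend to overwrite a partition.

```python
import os
import time


def write_pass(path, block_size=1 << 20, blocks=16):
    """One dd-like pass: write `blocks` zero-filled blocks to path, then fsync."""
    buf = bytes(block_size)
    written = 0
    with open(path, "wb") as f:
        for _ in range(blocks):
            written += f.write(buf)
        f.flush()
        os.fsync(f.fileno())  # push the data to the device, not just the page cache
    return written


def poll_file(path, count, interval_s=1.0):
    """Read path `count` times, sleeping interval_s between reads.

    Returns the number of bytes read on each pass.
    """
    sizes = []
    for i in range(count):
        with open(path, "rb") as f:
            sizes.append(len(f.read()))
        if i + 1 < count:
            time.sleep(interval_s)
    return sizes
```

Running write_pass in a loop from five separate processes, each on its own target, plus poll_file on /proc/slabinfo with a one second interval, approximates the dd and slabinfo parts of the setup above.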

After about half an hour each system just rebooted (not at the same time,
about 15 minutes apart).  Larry and I looked for traces of an oops, but none
could be found (in /var/log/messages upon reboot).  Therefore, we concluded that
the watchdog timer was exploding after the 10 second interval had elapsed.
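
To make the reasoning concrete, here is a small model of how a starved daemon trips a watchdog. It is a simulation only, with a hypothetical function name, and it assumes the usual watchdog behaviour: each "pet" pushes the deadline out by one full interval, and the timer fires if the next pet arrives after that deadline.

```python
def first_watchdog_expiry(pet_times, interval_s=10.0, start=0.0):
    """Return the time the watchdog would fire, or None if every pet was on time.

    pet_times: timestamps (seconds) at which the daemon reset the timer.
    Only the gaps up to the last pet are checked; what happens after the
    final pet is outside the model.
    """
    deadline = start + interval_s
    for t in sorted(pet_times):
        if t > deadline:
            return deadline  # daemon was not scheduled in time
        deadline = t + interval_s
    return None
```

For example, pets at 5, 10 and 21.5 seconds leave an 11.5 second gap, so a 10 second watchdog fires at the 20 second mark even though the daemon never crashed.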

Comment 2 Lon Hohberger 2002-06-11 19:39:35 UTC

Fixes in pool, awaiting testing by other developers.
