From Bugzilla Helper:
User-Agent: Mozilla/4.78 [en] (X11; U; Linux 2.4.7-10 i686)
Description of problem:
Tom established a scenario where a cluster shootdown occurred due to multiple
concurrent IO exercisers causing cluquorumd to not be scheduled to run.
Several potential remidies were proposed, including:
- quorumd's nice level was 1 lower than the other daemons.
- scale the cluster lock timeout dymanically based on the tunable parameters
govering the quorumd algorithm to declare a node down.
- breakup a svcmgr loop over 99 services, so it doesn't hold the lock too long.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1.Heavy load of multiple independent disk exercisers
2.Enabling debug level logging seemed to exacerbate the problem.
Actual Results: shootdown
Expected Results: Quorumd should have been scheduled to run.
I also got a similar scenario of quorumd starvation as follows:
An external FC raid array configured as a single disk. This was partitioned up
into 16 slices of /dev/sdb. The power switch type was set to watchdog. On each
cluster member the following was running:
- the cluster software itself
- 5 separate dd processes to different partitions (each repeatedly run out of a
while loop script)
- clustat -i 1
- a script in a while loop would run `cat /proc/slabinfo` once a second
After about a half an hour each system just rebooted (not at the same time,
about 15 minutes appart). Larry and I looked for traces of an oops, but none
could be found (in /var/log/messages upon reboot). Therefore, we concluded that
the watchdog timer was exploding after the 10 second interval had elapsed.
Fixes in pool, awaiting testing by other developers.