From Bugzilla Helper:
User-Agent: Mozilla/4.78 [en] (X11; U; Linux 2.4.7-10 i686)

Description of problem:
clumanager 1.0.9

Tom established a scenario where a cluster shootdown occurred because multiple concurrent IO exercisers kept cluquorumd from being scheduled to run.

Several potential remedies were proposed, including:
- quorumd's nice level was 1 lower than the other daemons.
- scale the cluster lock timeout dynamically based on the tunable parameters governing the quorumd algorithm to declare a node down.
- break up a svcmgr loop over 99 services, so it doesn't hold the lock too long.

Version-Release number of selected component (if applicable):

How reproducible:
Sometimes

Steps to Reproduce:
1. Heavy load of multiple independent disk exercisers
2. Enabling debug level logging seemed to exacerbate the problem.
3.

Actual Results: shootdown

Expected Results: Quorumd should have been scheduled to run.

Additional info:
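For illustration only, the proposal to break up the svcmgr loop over 99 services might look roughly like the sketch below: the services are handled in small batches and the lock is dropped between batches so another daemon can take it. This is not clumanager code. The names `check_services` and `check_one`, the batch size of 10, and the use of a `threading.Lock` as a stand-in for the cluster lock are all assumptions.

```python
import threading


def check_services(services, lock, check_one, batch_size=10):
    """Run check_one on every service, holding the lock one batch at a time.

    Hypothetical helper: returns how many times the lock was acquired.
    """
    acquisitions = 0
    for start in range(0, len(services), batch_size):
        batch = services[start:start + batch_size]
        with lock:  # held for at most batch_size services
            acquisitions += 1
            for svc in batch:
                check_one(svc)
        # The lock is released here, so a waiting thread can acquire it
        # before the next batch starts.
    return acquisitions
```

With 99 services and a batch size of 10, the lock would be taken and released 10 times instead of being held once across the whole loop. Whether that is enough to let quorumd run in time would still depend on scheduling under load.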
I also got a similar scenario of quorumd starvation as follows:

An external FC raid array configured as a single disk. This was partitioned up into 16 slices of /dev/sdb. The power switch type was set to watchdog.

On each cluster member the following was running:
- the cluster software itself
- 5 separate dd processes to different partitions (each repeatedly run out of a while loop script)
- clustat -i 1
- a script in a while loop that ran `cat /proc/slabinfo` once a second

After about half an hour each system just rebooted (not at the same time, about 15 minutes apart). Larry and I looked for traces of an oops, but none could be found (in /var/log/messages upon reboot). Therefore, we concluded that the watchdog timer was firing after the 10 second interval had elapsed.
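To make the watchdog reasoning concrete, here is a small sketch of the timing involved, assuming the timer is reset each time the daemon gets to run and fires if more than 10 seconds pass without a reset. The function name `first_watchdog_expiry` and the timestamps are hypothetical, not taken from the cluster software or from these test systems.

```python
def first_watchdog_expiry(reset_times, timeout=10.0):
    """Return the time at which the watchdog would fire, or None.

    reset_times: sorted timestamps (seconds) at which the timer was reset.
    Assumption: the timer fires only when the gap between two resets
    is strictly greater than the timeout.
    """
    for prev, cur in zip(reset_times, reset_times[1:]):
        if cur - prev > timeout:
            return prev + timeout
    return None
```

For example, resets at 0, 2, 15 and 17 seconds leave a 13 second gap after the reset at 2, so the timer would fire at 12 seconds, which would match a reboot with no oops in the logs.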
Fixes in pool, awaiting testing by other developers.