Description of problem: clurgmgrd hangs on startup after failover Version-Release number of selected component (if applicable): 2 node cluster (RHEL5) with quorum disk cman: 2.0.64-1.0.1 rgmanager: 2.0.24-1 clurgmgrd is built from RHEL50 with "mark quorum disk node as dead" patch: + member_set_state(0, 0); /* Mark qdisk as dead */ Steps to Reproduce: - the cluster is up and running - kill Node1 (reboot -f) - when Node1 comes back up, rgmanager hangs on startup Actual results: - clustat reports "Timed out waiting for a response from Resource Group Manager" on Node1 (Node2 has no problems). - rebooting Node2 "unlocks" frozen Node1 rgmanager - attempts to shutdown rgmanager on *either* node result into zombie clurgmgrd. Looks like clurgmgrd gets stuck waiting on some kernel event (hard reset is the only option at this point) - the problem correlates with number of "script" resources in cluster.conf. This problem never happens with only one service/one script resource defined Additional info: This is the debug output from clurgmgrd after Node1 is rebooted. [2900] info: I am node #2 [2900] notice: Resource Group Manager Starting [2900] info: Loading Service Data [2900] debug: Loading Resource Rules [2900] debug: 10 rules loaded [2900] debug: Building Resource Trees [2900] debug: 13 resources defined [2900] debug: Loading Failover Domains [2900] debug: 1 domains defined [2900] info: Initializing Services [2900] debug: Initializing service:mysql [2900] debug: Initializing service:test <info> Executing /etc/init.d/mysqld stop <info> Executing /etc/init.d/mysqld stop Stopping MySQL: Stopping MySQL: [ OK ] [ OK ] <info> Executing /root/test/test1.sh stop <info> Executing /root/test/test5.sh stop test is running test is running <info> Executing /root/test/test2.sh stop <info> Executing /root/test/test6.sh stop test is running test is running <info> Executing /root/test/test3.sh stop test is running <info> Executing /root/test/test7.sh stop test is running <info> Executing /root/test/test4.sh stop <info> Executing /root/test/test8.sh stop test is running test is running <debug> 192.168.152.8 is not configured <info> /dev/sdb is not mounted [2900] info: Services Initialized [2900] debug: Event: Port Opened cman_is_listening: Device or resource busy [2900] info: State change: Local UP [2900] info: State change: db02 UP [2900] info: State change: /dev/sdc UP [2900] debug: Sending service states to CTX0x1ffa7fa0 no key for rg="service:mysql" no key for rg="service:test" [2900] debug: Sending node states to CTX0x1ffaa470 [2900] debug: Sending service states to CTX0x1ffa7fa0 no key for rg="service:mysql" no key for rg="service:test" [2900] debug: Sending node states to CTX0x1ffaa470 no key for rg="service:mysql" no key for rg="service:test" [2900] debug: Sending service states to CTX0x1ffa7fa0 no key for rg="service:mysql" no key for rg="service:test" [2900] debug: Sending node states to CTX0x1ffaa470 [2900] debug: Sending service states to CTX0x1ffa7fa0 no key for rg="service:mysql" no key for rg="service:test" [2900] debug: Sending node states to CTX0x1ffaa470 no key for rg="service:mysql" no key for rg="service:test" [2900] debug: Sending service states to CTX0x1ffa7fa0 no key for rg="service:mysql" no key for rg="service:test" [2900] debug: Sending node states to CTX0x1ffaa470 [2900] debug: Sending service states to CTX0x1ffa7fa0 no key for rg="service:mysql" no key for rg="service:test" [2900] debug: Sending node states to CTX0x1ffaa470 no key for rg="service:mysql" no key for rg="service:test" [2900] debug: Sending service states to CTX0x1ffa7fa0 no key for rg="service:mysql" no key for rg="service:test" [2900] debug: Sending node states to CTX0x1ffaa470 [2900] debug: Sending service states to CTX0x1ffa7fa0 no key for rg="service:mysql" no key for rg="service:test" [2900] debug: Sending node states to CTX0x1ffaa470 no key for rg="service:mysql" no key for rg="service:test" [2900] debug: Sending service states to CTX0x1ffa7fa0 no key for rg="service:mysql" no key for rg="service:test" [2900] debug: Sending node states to CTX0x1ffaa470 [2900] debug: Sending service states to CTX0x1ffa7fa0 no key for rg="service:mysql" no key for rg="service:test" [2900] debug: Sending node states to CTX0x1ffaa470 no key for rg="service:mysql" no key for rg="service:test" [2900] debug: Sending service states to CTX0x1ffa7fa0 no key for rg="service:mysql" no key for rg="service:test" [2900] debug: Sending node states to CTX0x1ffaa470 [2900] debug: Sending service states to CTX0x1ffa7fa0 no key for rg="service:mysql" no key for rg="service:test" [2900] debug: Sending node states to CTX0x1ffaa470 no key for rg="service:mysql" no key for rg="service:test" [2900] debug: Sending service states to CTX0x1ffa7fa0 no key for rg="service:mysql" no key for rg="service:test" [2900] debug: Sending node states to CTX0x1ffaa470 [2900] debug: Membership Change Event [2900] info: State change: db02 DOWN [2900] info: State change: /dev/sdc UP [2900] debug: Sending service states to CTX0x1ffa7fa0 no key for rg="service:mysql" no key for rg="service:test" [2900] debug: Sending node states to CTX0x1ffaa470 [3391] debug: Evaluating RG service:mysql, state stopped, owner none [3503] notice: Starting stopped service service:mysql [3391] debug: Evaluating RG service:test, state stopped, owner none [3391] debug: Event (1:2:1) Processed [3504] notice: Starting stopped service service:test [3391] debug: Evaluating RG service:mysql, state starting, owner db01 [3391] debug: Evaluating RG service:test, state starting, owner db01 [3391] debug: Event (0:1:1) Processed [3391] debug: Evaluating RG service:mysql, state starting, owner db01 [3391] debug: Evaluating RG service:test, state starting, owner db01 [3391] debug: Event (0:0:1) Processed [3391] debug: Evaluating RG service:mysql, state starting, owner db01 [3391] debug: Evaluating RG service:test, state starting, owner db01 [3391] debug: Event (0:1:0) Processed [3391] debug: Evaluating RG service:mysql, state starting, owner db01 [3391] debug: Evaluating RG service:test, state starting, owner db01 [3391] debug: Event (0:0:1) Processed <info> Executing /etc/init.d/mysqld start [2900] debug: Sending service states to CTX0x1ffa4ce0 [2900] debug: Sending node states to CTX0x1ffad3b0 Initializmounting /dev/sdb on /data> <debug> mount -t ext3 /dev/sdb /data [2900] debug: Sending service states to CTX0x1ffa4ce0 [2900] debug: Sending node states to CTX0x1ffa4ce0 Installing all prepared tables <debug> Link for eth0: Detected <info> Adding IPv4 address 192.168.152.8 to eth0 <debug> Sending gratuitous ARP: 192.168.152.8 00:0c:29:19:78:a8 brd ff:ff:ff:ff:ff:ff Fill help tables To start mysqld at boot time you have to copy support-files/mysql.server to the right place for your system PLEASE REMEMBER TO SET A PASSWORD FOR THE MySQL root USER ! To do so, start the server, then issue the following commands: /usr/bin/mysqladmin -u root password 'new-password' /usr/bin/mysqladmin -u root -h db01.23andme.com password 'new-password' See the manual for more instructions. NOTE: If you are upgrading from a MySQL <= 3.22.10 you should run the /usr/bin/mysql_fix_privilege_tables. Otherwise you will not be able to use the new GRANT command! You can start the MySQL daemon with: cd /usr ; /usr/bin/mysqld_safe & You can test the MySQL daemon with the benchmarks in the 'sql-bench' directory: cd sql-bench ; perl run-all-tests Please report any problems with the /usr/bin/mysqlbug script! The latest information about MySQL is available on the web at http://www.mysql.com Support MySQL by buying support/licenses at http://shop.mysql.com [ OK ] rdisc: no process killed <info> Executing /etc/init.d/mysqld start [2900] debug: Sending service states to CTX0x1ffa4ce0 [2900] debug: Sending node states to CTX0x1ffa4ce0 [2900] debug: Sending service states to CTX0x1ffa4ce0 [2900] debug: Sending node states to CTX0x1ffa4ce0 Starting MySQL: Starting MySQL: [ OK ] [ OK ] <info> Executing /root/test/test5.sh start test is running <info> Executing /root/test/test1.sh start test is running <info> Executing /root/test/test6.sh start test is running <info> Executing /root/test/test2.sh start test is running <info> Executing /root/test/test7.sh start test is running <info> Executing /root/test/test3.sh start test is running <info> Executing /root/test/test8.sh start test is running [3504] notice: Service service:test started <info> Executing /root/test/test4.sh start [2900] debug: Sending service states to CTX0x1ffaa470 [2900] debug: Sending node states to CTX0x1ffaa470 test is running [3503] notice: Service service:mysql started [2900] debug: Sending service states to CTX0x1ffaa470 [2900] debug: Sending node states to CTX0x1ffaa470 <info> Executing /etc/init.d/mysqld status [2900] debug: Sending service states to CTX0x1ffa8ab0 [2900] debug: Sending node states to CTX0x1ffa7b30 mysqld (pid 3837) is running... <info> Executing /root/test/test5.sh status test is running <info> Executing /root/test/test6.sh status test is running <info> Executing /root/test/test7.sh status test is running <info> Executing /root/test/test8.sh status <debug> Checking 192.168.152.8, Level 10 test is running <debug> 192.168.152.8 present on eth0 <debug> Link for eth0: Detected <debug> Link detected on eth0 <debug> Local ping to 192.168.152.8 succeeded <info> Executing /etc/init.d/mysqld status [2900] debug: Sending service states to CTX0x1ffa8ab0 [2900] debug: Sending node states to CTX0x1ffaa470 mysqld (pid 3837) is running... <info> Executing /root/test/test1.sh status test is running <info> Executing /root/test/test2.sh status test is running <info> Executing /root/test/test3.sh status test is running <info> Executing /root/test/test4.sh status test is running [2900] debug: Sending service states to CTX0x1ffa7b30 [2900] debug: Sending node states to CTX0x1ffa8ab0 <debug> Checking 192.168.152.8, Level 0 <debug> 192.168.152.8 present on eth0 <debug> Link for eth0: Detected <debug> Link detected on eth0 [2900] debug: Sending service states to CTX0x1ffaa470 [2900] debug: Sending node states to CTX0x1ffaa470 [2900] debug: Sending service states to CTX0x1ffaa470 [2900] debug: Sending node states to CTX0x1ffaa470 [2900] debug: Sending service states to CTX0x1ffaa470 [2900] debug: Sending node states to CTX0x1ffaa470 <info> Executing /etc/init.d/mysqld status mysqld (pid 3837) is running... [2900] debug: Sending service states to CTX0x1ffaa470 [2900] debug: Sending node states to CTX0x1ffaa470 [2900] notice: Shutting down [4805] notice: Stopping service service:mysql [4806] notice: Stopping service service:test <info> Executing /etc/init.d/mysqld stop <info> Executing /etc/init.d/mysqld stop [2900] debug: Sending service states to CTX0x1ffad730 [2900] debug: Sending node states to CTX0x1ffad730 [2900] debug: Sending service states to CTX0x1ffad730 [2900] debug: Sending node states to CTX0x1ffad730 Stopping MySQL: [ OK ] Stopping MySQL: [ OK ] <info> Executing /root/test/test1.sh stop test is running <info> Executing /root/test/test5.sh stop test is running <info> Executing /root/test/test2.sh stop test is running <info> Executing /root/test/test6.sh stop test is running <info> Executing /root/test/test3.sh stop test is running <info> Executing /root/test/test7.sh stop test is running <info> Executing /root/test/test4.sh stop test is running <info> Executing /root/test/test8.sh stop test is running [4806] notice: Service service:test is stopped [5083] debug: Resource groups locked; not evaluating [2900] debug: Sending service states to CTX0x1ffaa470 [2900] debug: Sending node states to CTX0x1ffaa470 <info> Removing IPv4 address 192.168.152.8 from eth0 Warning: Executing wildcard deletion to stay compatible with old scripts. Explicitly specify the prefix length ( 192.168.152.8/32) to avoid this warning. This special behaviour is likely to disappear in further releases, fix your scripts! rdisc: no process killed [2900] debug: Sending service states to CTX0x1ffaa470 [2900] debug: Sending node states to CTX0x1ffaa470 [2900] debug: Sending service states to CTX0x1ffaa470 [2900] debug: Sending node states to CTX0x1ffaa470 <info> unmounting /data [4805] notice: Service service:mysql is stopped [5252] debug: Resource groups locked; not evaluating [2900] debug: Sending service states to CTX0x1ffaa470 [2900] notice: Shutdown complete, exiting [root@db01 ~]#
Looks like "rg_lock(service:mysql) @ groups.c:101" waits indefinitely on some kernel event. {3965} rg_lock(service:mysql) @ groups.c:101 [3460] debug: Sending service states to CTX0xa2a7fd0 no key for rg="service:mysql" no key for rg="service:test" [3460] debug: Sending node states to CTX0xa2a7fd0 [3460] debug: Sending service states to CTX0xa2a7fd0 no key for rg="service:mysql" no key for rg="service:test" [3460] debug: Sending node states to CTX0xa2a7fd0 no key for rg="service:mysql" no key for rg="service:test"
Are you running 5.0.z or 5.1 ?
centos 5.0 Linux db01.test 2.6.18-8.1.15.el5 #1 SMP Mon Oct 22 08:32:28 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux cman-2.0.64-1.0.1.el5 rgmanager-2.0.24-1.el5.centos
You should get the 5.1 openais + cman + rgmanager packages at a minimum. DLM doesn't work with node-0 either in 5.0 due to a bug in dlm_controld. lvm2-cluster also has trouble.
All cluster version 5 defects should be reported under red hat enterprise linux 5 product name - not cluster suite.
This might also be like bug #327721
Thanks, I will try 5.1 and see if I can reproduce it there.
Upgraded to 5.1 and the problem went away. Closing the bug. Thanks.