Bug 366301 - clurgmgrd hangs on startup after failover
Summary: clurgmgrd hangs on startup after failover
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: rgmanager
Version: 5.0
Hardware: x86_64
OS: Linux
low
medium
Target Milestone: ---
: ---
Assignee: Lon Hohberger
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2007-11-05 01:18 UTC by Alex Kompel
Modified: 2009-04-16 22:55 UTC (History)
1 user (show)

Fixed In Version: RHEL 5.1
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-01-30 18:59:52 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)

Description Alex Kompel 2007-11-05 01:18:24 UTC
Description of problem:
clurgmgrd hangs on startup after failover

Version-Release number of selected component (if applicable):
2 node cluster (RHEL5) with quorum disk
cman: 2.0.64-1.0.1
rgmanager: 2.0.24-1
clurgmgrd is built from RHEL50 with "mark quorum disk node as dead" patch:
+ member_set_state(0, 0);         /* Mark qdisk as dead */

Steps to Reproduce:
- the cluster is up and running
- kill Node1 (reboot -f)
- when Node1 comes back up,  rgmanager hangs on startup
  
Actual results:
- clustat reports "Timed out waiting for a response from Resource Group Manager"
on Node1 (Node2 has no problems).
- rebooting Node2 "unlocks" frozen Node1 rgmanager
- attempts to shutdown rgmanager on *either* node result into zombie clurgmgrd.
Looks like clurgmgrd gets stuck waiting on some kernel event (hard reset is the
only option at this point)
- the problem correlates with number of "script" resources in cluster.conf. This
problem never happens with only one service/one script resource defined


Additional info:
This is the debug output from clurgmgrd after Node1 is rebooted.

[2900] info: I am node #2
[2900] notice: Resource Group Manager Starting
[2900] info: Loading Service Data
[2900] debug: Loading Resource Rules
[2900] debug: 10 rules loaded
[2900] debug: Building Resource Trees
[2900] debug: 13 resources defined
[2900] debug: Loading Failover Domains
[2900] debug: 1 domains defined
[2900] info: Initializing Services
[2900] debug: Initializing service:mysql
[2900] debug: Initializing service:test
<info>   Executing /etc/init.d/mysqld stop
<info>   Executing /etc/init.d/mysqld stop
Stopping MySQL:  Stopping MySQL:                           [  OK  ]
                                                           [  OK  ]
<info>   Executing /root/test/test1.sh stop
<info>   Executing /root/test/test5.sh stop
test is running
test is running
<info>   Executing /root/test/test2.sh stop
<info>   Executing /root/test/test6.sh stop
test is running
test is running
<info>   Executing /root/test/test3.sh stop
test is running
<info>   Executing /root/test/test7.sh stop
test is running
<info>   Executing /root/test/test4.sh stop
<info>   Executing /root/test/test8.sh stop
test is running
test is running
<debug>  192.168.152.8 is not configured
<info>   /dev/sdb is not mounted
[2900] info: Services Initialized
[2900] debug: Event: Port Opened
cman_is_listening: Device or resource busy
[2900] info: State change: Local UP
[2900] info: State change: db02 UP
[2900] info: State change: /dev/sdc UP
[2900] debug: Sending service states to CTX0x1ffa7fa0
no key for rg="service:mysql"
no key for rg="service:test"
[2900] debug: Sending node states to CTX0x1ffaa470
[2900] debug: Sending service states to CTX0x1ffa7fa0
no key for rg="service:mysql"
no key for rg="service:test"
[2900] debug: Sending node states to CTX0x1ffaa470
no key for rg="service:mysql"
no key for rg="service:test"
[2900] debug: Sending service states to CTX0x1ffa7fa0
no key for rg="service:mysql"
no key for rg="service:test"
[2900] debug: Sending node states to CTX0x1ffaa470
[2900] debug: Sending service states to CTX0x1ffa7fa0
no key for rg="service:mysql"
no key for rg="service:test"
[2900] debug: Sending node states to CTX0x1ffaa470
no key for rg="service:mysql"
no key for rg="service:test"
[2900] debug: Sending service states to CTX0x1ffa7fa0
no key for rg="service:mysql"
no key for rg="service:test"
[2900] debug: Sending node states to CTX0x1ffaa470
[2900] debug: Sending service states to CTX0x1ffa7fa0
no key for rg="service:mysql"
no key for rg="service:test"
[2900] debug: Sending node states to CTX0x1ffaa470
no key for rg="service:mysql"
no key for rg="service:test"
[2900] debug: Sending service states to CTX0x1ffa7fa0
no key for rg="service:mysql"
no key for rg="service:test"
[2900] debug: Sending node states to CTX0x1ffaa470
[2900] debug: Sending service states to CTX0x1ffa7fa0
no key for rg="service:mysql"
no key for rg="service:test"
[2900] debug: Sending node states to CTX0x1ffaa470
no key for rg="service:mysql"
no key for rg="service:test"
[2900] debug: Sending service states to CTX0x1ffa7fa0
no key for rg="service:mysql"
no key for rg="service:test"
[2900] debug: Sending node states to CTX0x1ffaa470
[2900] debug: Sending service states to CTX0x1ffa7fa0
no key for rg="service:mysql"
no key for rg="service:test"
[2900] debug: Sending node states to CTX0x1ffaa470
no key for rg="service:mysql"
no key for rg="service:test"
[2900] debug: Sending service states to CTX0x1ffa7fa0
no key for rg="service:mysql"
no key for rg="service:test"
[2900] debug: Sending node states to CTX0x1ffaa470
[2900] debug: Sending service states to CTX0x1ffa7fa0
no key for rg="service:mysql"
no key for rg="service:test"
[2900] debug: Sending node states to CTX0x1ffaa470
no key for rg="service:mysql"
no key for rg="service:test"
[2900] debug: Sending service states to CTX0x1ffa7fa0
no key for rg="service:mysql"
no key for rg="service:test"
[2900] debug: Sending node states to CTX0x1ffaa470
[2900] debug: Membership Change Event
[2900] info: State change: db02 DOWN
[2900] info: State change: /dev/sdc UP
[2900] debug: Sending service states to CTX0x1ffa7fa0
no key for rg="service:mysql"
no key for rg="service:test"
[2900] debug: Sending node states to CTX0x1ffaa470
[3391] debug: Evaluating RG service:mysql, state stopped, owner none
[3503] notice: Starting stopped service service:mysql
[3391] debug: Evaluating RG service:test, state stopped, owner none
[3391] debug: Event (1:2:1) Processed
[3504] notice: Starting stopped service service:test
[3391] debug: Evaluating RG service:mysql, state starting, owner db01
[3391] debug: Evaluating RG service:test, state starting, owner db01
[3391] debug: Event (0:1:1) Processed
[3391] debug: Evaluating RG service:mysql, state starting, owner db01
[3391] debug: Evaluating RG service:test, state starting, owner db01
[3391] debug: Event (0:0:1) Processed
[3391] debug: Evaluating RG service:mysql, state starting, owner db01
[3391] debug: Evaluating RG service:test, state starting, owner db01
[3391] debug: Event (0:1:0) Processed
[3391] debug: Evaluating RG service:mysql, state starting, owner db01
[3391] debug: Evaluating RG service:test, state starting, owner db01
[3391] debug: Event (0:0:1) Processed
<info>   Executing /etc/init.d/mysqld start
[2900] debug: Sending service states to CTX0x1ffa4ce0
[2900] debug: Sending node states to CTX0x1ffad3b0
Initializmounting /dev/sdb on /data>
<debug>  mount -t ext3  /dev/sdb /data
[2900] debug: Sending service states to CTX0x1ffa4ce0
[2900] debug: Sending node states to CTX0x1ffa4ce0
Installing all prepared tables
<debug>  Link for eth0: Detected
<info>   Adding IPv4 address 192.168.152.8 to eth0
<debug>  Sending gratuitous ARP: 192.168.152.8 00:0c:29:19:78:a8 brd

ff:ff:ff:ff:ff:ff
Fill help tables

To start mysqld at boot time you have to copy support-files/mysql.server
to the right place for your system

PLEASE REMEMBER TO SET A PASSWORD FOR THE MySQL root USER !
To do so, start the server, then issue the following commands:
/usr/bin/mysqladmin -u root password 'new-password'
/usr/bin/mysqladmin -u root -h db01.23andme.com password 'new-password'
See the manual for more instructions.

NOTE:  If you are upgrading from a MySQL <= 3.22.10 you should run
the /usr/bin/mysql_fix_privilege_tables. Otherwise you will not be
able to use the new GRANT command!

You can start the MySQL daemon with:
cd /usr ; /usr/bin/mysqld_safe &

You can test the MySQL daemon with the benchmarks in the 'sql-bench'

directory:
cd sql-bench ; perl run-all-tests

Please report any problems with the /usr/bin/mysqlbug script!

The latest information about MySQL is available on the web at
http://www.mysql.com
Support MySQL by buying support/licenses at http://shop.mysql.com
                                                           [  OK  ]
rdisc: no process killed
<info>   Executing /etc/init.d/mysqld start
[2900] debug: Sending service states to CTX0x1ffa4ce0
[2900] debug: Sending node states to CTX0x1ffa4ce0
[2900] debug: Sending service states to CTX0x1ffa4ce0
[2900] debug: Sending node states to CTX0x1ffa4ce0
Starting MySQL:  Starting MySQL:                           [  OK  ]
                                                           [  OK  ]
<info>   Executing /root/test/test5.sh start
test is running
<info>   Executing /root/test/test1.sh start
test is running
<info>   Executing /root/test/test6.sh start
test is running
<info>   Executing /root/test/test2.sh start
test is running
<info>   Executing /root/test/test7.sh start
test is running
<info>   Executing /root/test/test3.sh start
test is running
<info>   Executing /root/test/test8.sh start
test is running
[3504] notice: Service service:test started
<info>   Executing /root/test/test4.sh start
[2900] debug: Sending service states to CTX0x1ffaa470
[2900] debug: Sending node states to CTX0x1ffaa470
test is running
[3503] notice: Service service:mysql started
[2900] debug: Sending service states to CTX0x1ffaa470
[2900] debug: Sending node states to CTX0x1ffaa470
<info>   Executing /etc/init.d/mysqld status
[2900] debug: Sending service states to CTX0x1ffa8ab0
[2900] debug: Sending node states to CTX0x1ffa7b30
mysqld (pid 3837) is running...
<info>   Executing /root/test/test5.sh status
test is running
<info>   Executing /root/test/test6.sh status
test is running
<info>   Executing /root/test/test7.sh status
test is running
<info>   Executing /root/test/test8.sh status
<debug>  Checking 192.168.152.8, Level 10
test is running
<debug>  192.168.152.8 present on eth0
<debug>  Link for eth0: Detected
<debug>  Link detected on eth0
<debug>  Local ping to 192.168.152.8 succeeded
<info>   Executing /etc/init.d/mysqld status
[2900] debug: Sending service states to CTX0x1ffa8ab0
[2900] debug: Sending node states to CTX0x1ffaa470
mysqld (pid 3837) is running...
<info>   Executing /root/test/test1.sh status
test is running
<info>   Executing /root/test/test2.sh status
test is running
<info>   Executing /root/test/test3.sh status
test is running
<info>   Executing /root/test/test4.sh status
test is running
[2900] debug: Sending service states to CTX0x1ffa7b30
[2900] debug: Sending node states to CTX0x1ffa8ab0
<debug>  Checking 192.168.152.8, Level 0
<debug>  192.168.152.8 present on eth0
<debug>  Link for eth0: Detected
<debug>  Link detected on eth0
[2900] debug: Sending service states to CTX0x1ffaa470
[2900] debug: Sending node states to CTX0x1ffaa470
[2900] debug: Sending service states to CTX0x1ffaa470
[2900] debug: Sending node states to CTX0x1ffaa470
[2900] debug: Sending service states to CTX0x1ffaa470
[2900] debug: Sending node states to CTX0x1ffaa470
<info>   Executing /etc/init.d/mysqld status
mysqld (pid 3837) is running...
[2900] debug: Sending service states to CTX0x1ffaa470
[2900] debug: Sending node states to CTX0x1ffaa470
[2900] notice: Shutting down
[4805] notice: Stopping service service:mysql
[4806] notice: Stopping service service:test
<info>   Executing /etc/init.d/mysqld stop
<info>   Executing /etc/init.d/mysqld stop
[2900] debug: Sending service states to CTX0x1ffad730
[2900] debug: Sending node states to CTX0x1ffad730
[2900] debug: Sending service states to CTX0x1ffad730
[2900] debug: Sending node states to CTX0x1ffad730
Stopping MySQL:                                            [  OK  ]
Stopping MySQL:                                            [  OK  ]
<info>   Executing /root/test/test1.sh stop
test is running
<info>   Executing /root/test/test5.sh stop
test is running
<info>   Executing /root/test/test2.sh stop
test is running
<info>   Executing /root/test/test6.sh stop
test is running
<info>   Executing /root/test/test3.sh stop
test is running
<info>   Executing /root/test/test7.sh stop
test is running
<info>   Executing /root/test/test4.sh stop
test is running
<info>   Executing /root/test/test8.sh stop
test is running
[4806] notice: Service service:test is stopped
[5083] debug: Resource groups locked; not evaluating
[2900] debug: Sending service states to CTX0x1ffaa470
[2900] debug: Sending node states to CTX0x1ffaa470
<info>   Removing IPv4 address 192.168.152.8 from eth0
Warning: Executing wildcard deletion to stay compatible with old scripts.
         Explicitly specify the prefix length ( 192.168.152.8/32) to avoid

this warning.
         This special behaviour is likely to disappear in further releases,
         fix your scripts!
rdisc: no process killed
[2900] debug: Sending service states to CTX0x1ffaa470
[2900] debug: Sending node states to CTX0x1ffaa470
[2900] debug: Sending service states to CTX0x1ffaa470
[2900] debug: Sending node states to CTX0x1ffaa470


<info>   unmounting /data
[4805] notice: Service service:mysql is stopped
[5252] debug: Resource groups locked; not evaluating
[2900] debug: Sending service states to CTX0x1ffaa470
[2900] notice: Shutdown complete, exiting
[root@db01 ~]#

Comment 1 Alex Kompel 2007-11-05 23:23:31 UTC
Looks like "rg_lock(service:mysql) @ groups.c:101" waits indefinitely on some
kernel event.

{3965} rg_lock(service:mysql) @ groups.c:101
[3460] debug: Sending service states to CTX0xa2a7fd0
no key for rg="service:mysql"
no key for rg="service:test"
[3460] debug: Sending node states to CTX0xa2a7fd0
[3460] debug: Sending service states to CTX0xa2a7fd0
no key for rg="service:mysql"
no key for rg="service:test"
[3460] debug: Sending node states to CTX0xa2a7fd0
no key for rg="service:mysql"
no key for rg="service:test"

Comment 2 Lon Hohberger 2007-11-14 19:08:03 UTC
Are you running 5.0.z or 5.1 ?

Comment 3 Alex Kompel 2007-11-14 19:16:15 UTC
centos 5.0
Linux db01.test 2.6.18-8.1.15.el5 #1 SMP Mon Oct 22 08:32:28 EDT 2007 x86_64
x86_64 x86_64 GNU/Linux
cman-2.0.64-1.0.1.el5
rgmanager-2.0.24-1.el5.centos

Comment 4 Lon Hohberger 2007-11-14 19:25:49 UTC
You should get the 5.1 openais + cman + rgmanager packages at a minimum.  DLM
doesn't work with node-0 either in 5.0 due to a bug in dlm_controld. 
lvm2-cluster also has trouble.


Comment 5 Kiersten (Kerri) Anderson 2007-11-19 19:57:27 UTC
All cluster version 5 defects should be reported under red hat enterprise linux
5 product name - not cluster suite.

Comment 6 Lon Hohberger 2007-11-20 15:55:31 UTC
This might also be like bug #327721



Comment 7 Alex Kompel 2007-11-20 20:21:29 UTC
Thanks, I will try 5.1 and see if I can reproduce it there.


Comment 8 Alex Kompel 2008-01-30 18:38:19 UTC
Upgraded to 5.1 and the problem went away. Closing the bug. Thanks.



Note You need to log in before you can comment on or make changes to this bug.