Bug 207144
Summary: | rgmanager stuck in join/update state (cause: I/O wait on stdout/err) | ||
---|---|---|---|
Product: | [Retired] Red Hat Cluster Suite | Reporter: | Lenny Maiorani <lenny> |
Component: | rgmanager | Assignee: | Lon Hohberger <lhh> |
Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 4 | CC: | cluster-maint |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | RHBA-2007-0149 | Doc Type: | Bug Fix |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2007-05-10 21:17:37 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Attachments: |
Description
Lenny Maiorani
2006-09-19 17:54:56 UTC
More info possibly related: frii01 was fenced by frii02. When frii01 came back up, rgmanager on frii02 never completed marking frii01 as joining. When rgmanager on frii02 was restarted, it joined properly. Here are some logs from /var/log/messages:

Sep 14 21:20:29 frii01 clurgmgrd[15602]: <info> Logged in SG "usrm::manager"
Sep 14 21:20:29 frii01 clurgmgrd[15602]: <info> Magma Event: Membership Change
Sep 14 21:20:29 frii01 clurgmgrd[15602]: <info> State change: Local UP
Sep 14 21:20:29 frii01 clurgmgrd[15602]: <info> State change: frii02 UP
Sep 14 21:20:30 frii01 clurgmgrd[15602]: <info> Magma Event: Membership Change
Sep 14 21:20:30 frii01 clurgmgrd[15602]: <info> State change: frii02 DOWN
Sep 14 18:54:36 frii02 clurgmgrd[12763]: <info> Magma Event: Membership Change
Sep 14 18:54:36 frii02 clurgmgrd[12763]: <info> State change: frii01 DOWN
Sep 14 21:20:17 frii02 clurgmgrd[12763]: <err> #37: Error receiving message header
Sep 14 21:20:17 frii02 clurgmgrd[12763]: <notice> Shutting down
Sep 14 21:20:17 frii02 clurgmgrd[12763]: <notice> Stopping service 172.16.107.225
Sep 14 21:20:17 frii02 clurgmgrd[12763]: <notice> Stopping service 172.16.106.225
Sep 14 21:20:17 frii02 clurgmgrd[12763]: <notice> Stopping service 172.16.105.225
Sep 14 21:20:17 frii02 clurgmgrd[12763]: <notice> Stopping service 172.16.104.225
Sep 14 21:20:17 frii02 clurgmgrd: [12763]: <info> Removing IPv4 address 172.16.106.225/24 from eth0.106
Sep 14 21:20:17 frii02 clurgmgrd[12763]: <notice> Stopping service 172.16.103.225
Sep 14 21:20:17 frii02 clurgmgrd: [12763]: <info> Removing IPv4 address 172.16.107.225/24 from eth0.107
Sep 14 21:20:17 frii02 clurgmgrd[12763]: <notice> Stopping service 172.16.102.225
Sep 14 21:20:17 frii02 clurgmgrd: [12763]: <info> Removing IPv4 address 172.16.105.225/24 from eth0.105
Sep 14 21:20:17 frii02 clurgmgrd[12763]: <notice> Stopping service 172.16.101.225
Sep 14 21:20:17 frii02 clurgmgrd: [12763]: <info> Removing IPv4 address 172.16.104.225/24 from eth0.104
Sep 14 21:20:17 frii02 clurgmgrd[12763]: <notice> Stopping service frii-mgmt
Sep 14 21:20:17 frii02 clurgmgrd: [12763]: <info> Removing IPv4 address 172.16.102.225/24 from eth0.102
Sep 14 21:20:17 frii02 clurgmgrd: [12763]: <info> Removing IPv4 address 172.16.103.225/24 from eth5
Sep 14 21:20:18 frii02 clurgmgrd[12763]: <notice> Stopping service snapshot
Sep 14 21:20:18 frii02 clurgmgrd: [12763]: <info> Removing IPv4 address 172.16.101.225/24 from eth0.101
Sep 14 21:20:18 frii02 clurgmgrd: [12763]: <info> Removing IPv4 address 172.16.103.226/24 from eth0.103
Sep 14 21:20:18 frii02 clurgmgrd[12763]: <notice> Stopping service email_notifier
Sep 14 21:20:18 frii02 clurgmgrd: [12763]: <info> Executing /mnt/crosswalk/snapshot/run_snap stop
Sep 14 21:20:18 frii02 clurgmgrd[12763]: <notice> Stopping service nvbackup
Sep 14 21:20:18 frii02 clurgmgrd: [12763]: <info> Executing /opt/crosswalk/email_notifier/run_email_notifier stop
Sep 14 21:20:18 frii02 clurgmgrd[12763]: <notice> Service snapshot is stopped
Sep 14 21:20:18 frii02 clurgmgrd: [12763]: <info> Executing /mnt/igridbackup/nvbackup/nvmain stop
Sep 14 21:20:18 frii02 clurgmgrd[12763]: <notice> Service email_notifier is stopped
Sep 14 21:20:19 frii02 clurgmgrd[12763]: <notice> Service nvbackup is stopped
Sep 14 21:20:27 frii02 clurgmgrd[12763]: <notice> Service 172.16.106.225 is stopped
Sep 14 21:20:27 frii02 clurgmgrd[12763]: <notice> Service 172.16.107.225 is stopped
Sep 14 21:20:27 frii02 clurgmgrd[12763]: <notice> Service 172.16.105.225 is stopped
Sep 14 21:20:28 frii02 clurgmgrd[12763]: <notice> Service 172.16.104.225 is stopped
Sep 14 21:20:28 frii02 clurgmgrd[12763]: <notice> Service 172.16.102.225 is stopped
Sep 14 21:20:28 frii02 clurgmgrd[12763]: <notice> Service 172.16.103.225 is stopped
Sep 14 21:20:28 frii02 clurgmgrd[12763]: <notice> Service 172.16.101.225 is stopped
Sep 14 21:20:28 frii02 clurgmgrd[12763]: <notice> Service frii-mgmt is stopped
Sep 14 21:20:31 frii02 clurgmgrd[12763]: <notice> Shutdown complete, exiting

Side note: the kernel is 2.6.9-34 with a patch recently introduced from bz #199673.

Thanks for all of the information. I'm wondering if this is just a simple join synchronization problem in the SM magma plugin.

Well, it is something we are seeing VERY regularly after a fence.

Another information point... we changed deadnode_timeout to 64 and hello_timer to 5.

Figured out that one of the services being run by rgmanager was printing to STDERR and STDOUT. This was causing rgmanager some problems. As soon as we took the printf's out of that service, the problem stopped. I guess the bug isn't really that rgmanager doesn't start up, but rather that it is not properly handling an interrupt.

Created attachment 137328 [details]
Switches logging in rgmanager / clulog utility to use non-blocking output for clulog() calls
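As a rough illustration of the idea behind this attachment (not the patch itself), the following minimal sketch shows what "non-blocking output for clulog() calls" could look like: the logging file descriptors are marked O_NONBLOCK so that a write to a full stdout/stderr pipe drops the message instead of stalling the daemon. The helper names `set_nonblock` and `log_nonblock` are hypothetical and only for illustration.

```c
/*
 * Hypothetical sketch (not the actual rgmanager patch): mark a file
 * descriptor non-blocking so a log write cannot hang the daemon when
 * nothing is draining its stdout/stderr pipe.
 */
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>

static int set_nonblock(int fd)            /* hypothetical helper */
{
	int flags = fcntl(fd, F_GETFL, 0);
	if (flags < 0)
		return -1;
	return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Best-effort log write: if the pipe is full, discard the message
 * rather than blocking the calling thread. */
static void log_nonblock(int fd, const char *msg)
{
	ssize_t n = write(fd, msg, strlen(msg));
	if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
		/* output would block; drop it instead of stalling */
	}
}

int main(void)
{
	set_nonblock(STDOUT_FILENO);
	set_nonblock(STDERR_FILENO);
	log_nonblock(STDOUT_FILENO, "<info> example log line\n");
	return 0;
}
```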
Created attachment 137594 [details]
stdout and stderr from rgmanager
New info: We are able to reproduce this by doing 2 consecutive fences to the
same node. Attached are some logs of STDOUT and STDERR from rgmanager on each
node.
Story: node1 was fenced, and node2 is the failover domain for all node1 services. clustat shows rgmanager on node1 as started, but it is still in the "join" state according to /proc/cluster/services.
Created attachment 137773 [details]
Use nonblocking output and also do not block in recv() in the main thread
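To sketch the second half of that description ("do not block in recv() in the main thread"): the backtraces below show the main event loop stuck in recv() inside msg_peek(). One way to avoid that, shown here purely as an assumed illustration and not as the actual patch, is to guard the peek with select() and MSG_DONTWAIT so the event loop can give up after a timeout. The function name `msg_peek_nonblock` is hypothetical; rgmanager's real msg_peek() lives in message.c.

```c
/*
 * Hypothetical sketch: peek at a message header without letting the
 * main event loop hang forever in recv().
 */
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/select.h>
#include <errno.h>

/* Returns bytes peeked, 0 if nothing arrived within the timeout,
 * or -1 on error. */
ssize_t msg_peek_nonblock(int sockfd, void *buf, size_t count,
                          int timeout_sec)
{
	fd_set rfds;
	struct timeval tv = { .tv_sec = timeout_sec, .tv_usec = 0 };

	FD_ZERO(&rfds);
	FD_SET(sockfd, &rfds);

	/* Wait at most timeout_sec for data instead of blocking forever. */
	if (select(sockfd + 1, &rfds, NULL, NULL, &tv) <= 0)
		return 0;

	/* MSG_PEEK leaves the header in the queue; MSG_DONTWAIT keeps us
	 * from blocking even if another reader raced us after select(). */
	ssize_t n = recv(sockfd, buf, count, MSG_PEEK | MSG_DONTWAIT);
	if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
		return 0;
	return n;
}
```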
Fenced node1, and when it attempts to rejoin, we are stuck. Here are backtraces and /proc/cluster/services info from each node:

Node 1:
#0  0x0000003162bbf0a6 in __select_nocancel () from /lib64/tls/libc.so.6
#1  0x0000002a9558be30 in sm_wait_join_complete (p=0x542e00) at sm.c:242
#2  0x0000002a9558c1f7 in sm_login (self=0x542d20, fd=4, groupname=0x4173d0 "usrm::manager") at sm.c:362
#3  0x00000033e5503930 in cp_login (cpp=0x542d20, fd=4, groupname=0x4173d0 "usrm::manager") at plugin.c:708
#4  0x00000033e5502945 in clu_login (fd=4, groupname=0x4173d0 "usrm::manager") at global.c:295
#5  0x0000000000406489 in main (argc=2, argv=0x7fbffffd58) at main.c:814
#6  0x0000003162b1c4bb in __libc_start_main () from /lib64/tls/libc.so.6
#7  0x0000000000403a3a in _start ()
#8  0x0000007fbffffd48 in ?? ()
#9  0x000000000000001c in ?? ()
#10 0x0000000000000002 in ?? ()
#11 0x0000007fbffffeeb in ?? ()
#12 0x0000007fbffffeff in ?? ()
#13 0x0000000000000000 in ?? ()

No /proc/cluster/services info for usrm.

Node 2:
#0  0x0000003c1560b52a in recv () from /lib64/tls/libpthread.so.0
#1  0x00000039c9a02539 in msg_peek (sockfd=12, buf=0x7fbffffaf0, count=20) at message.c:853
#2  0x00000000004055a6 in dispatch_msg (fd=12, nodeid=2) at main.c:346
#3  0x0000000000405de2 in event_loop (clusterfd=4) at main.c:548
#4  0x0000000000406533 in main (argc=2, argv=0x7fbffffd58) at main.c:839
#5  0x0000003c14d1c4bb in __libc_start_main () from /lib64/tls/libc.so.6
#6  0x0000000000403a3a in _start ()
#7  0x0000007fbffffd48 in ?? ()
#8  0x000000000000001c in ?? ()
#9  0x0000000000000002 in ?? ()
#10 0x0000007fbffffeeb in ?? ()
#11 0x0000007fbffffeff in ?? ()
#12 0x0000000000000000 in ?? ()

User: "usrm::manager" 16 19 update U-4,1,1 [3 2 1]

Node 3:
#0  0x0000003f5af0b52a in recv () from /lib64/tls/libpthread.so.0
#1  0x0000003e92802539 in msg_peek (sockfd=9, buf=0x7fbffffaf0, count=20) at message.c:853
#2  0x00000000004055a6 in dispatch_msg (fd=9, nodeid=3) at main.c:346
#3  0x0000000000405de2 in event_loop (clusterfd=4) at main.c:548
#4  0x0000000000406533 in main (argc=2, argv=0x7fbffffd58) at main.c:839
#5  0x0000003f5a61c4bb in __libc_start_main () from /lib64/tls/libc.so.6
#6  0x0000000000403a3a in _start ()
#7  0x0000007fbffffd48 in ?? ()
#8  0x000000000000001c in ?? ()
#9  0x0000000000000002 in ?? ()
#10 0x0000007fbffffeeb in ?? ()
#11 0x0000007fbffffeff in ?? ()
#12 0x0000000000000000 in ?? ()

User: "usrm::manager" 16 18 update U-4,1,1 [3 2 1]

This patch has solved the problem. Tested it a few times and the re-join happens each time.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0149.html