Bug 512799
| Field | Value |
|---|---|
| Summary | Two-Node cluster does not function properly if a node is shutdown. |
| Product | Red Hat Enterprise Linux 5 |
| Component | cman |
| Version | 5.3 |
| Hardware | x86_64 |
| OS | Linux |
| Status | CLOSED DUPLICATE |
| Severity | urgent |
| Priority | low |
| Target Milestone | rc |
| Target Release | --- |
| Reporter | Everett Bennett <everett.bennett> |
| Assignee | David Teigland <teigland> |
| QA Contact | Cluster QE <mspqa-list> |
| CC | adas, bmarzins, cluster-maint, edamato, everett.bennett, lhh, mwaite, rpeterso, swhiteho, teigland |
| Doc Type | Bug Fix |
| Last Closed | 2009-07-22 20:27:54 UTC |
Description
Everett Bennett
2009-07-20 19:44:03 UTC
Jul 20 15:26:09 ibmblade05 openais[3837]: [TOTEM] entering GATHER state from 11.
Jul 20 15:26:09 ibmblade05 openais[3837]: [TOTEM] Creating commit token because I am the rep.
Jul 20 15:26:09 ibmblade05 openais[3837]: [TOTEM] Saving state aru f high seq received f
Jul 20 15:26:09 ibmblade05 openais[3837]: [TOTEM] Storing new sequence id for ring 140
Jul 20 15:26:09 ibmblade05 openais[3837]: [TOTEM] entering COMMIT state.
Jul 20 15:26:09 ibmblade05 openais[3837]: [TOTEM] entering RECOVERY state.
Jul 20 15:26:09 ibmblade05 openais[3837]: [TOTEM] position [0] member 158.155.4.146:
Jul 20 15:26:09 ibmblade05 openais[3837]: [TOTEM] previous ring seq 316 rep 158.155.4.146
Jul 20 15:26:09 ibmblade05 openais[3837]: [TOTEM] aru f high delivered f received flag 1
Jul 20 15:26:09 ibmblade05 openais[3837]: [TOTEM] position [1] member 158.155.4.148:
Jul 20 15:26:09 ibmblade05 openais[3837]: [TOTEM] previous ring seq 316 rep 158.155.4.148
Jul 20 15:26:09 ibmblade05 openais[3837]: [TOTEM] aru a high delivered a received flag 1
Jul 20 15:26:09 ibmblade05 openais[3837]: [TOTEM] Did not need to originate any messages in recovery.
Jul 20 15:26:09 ibmblade05 openais[3837]: [TOTEM] Sending initial ORF token
Jul 20 15:26:09 ibmblade05 clurgmgrd[5073]: <debug> Membership Change Event
Jul 20 15:26:09 ibmblade05 openais[3837]: [CLM ] CLM CONFIGURATION CHANGE
Jul 20 15:26:09 ibmblade05 openais[3837]: [CLM ] New Configuration:
Jul 20 15:26:09 ibmblade05 openais[3837]: [CLM ] r(0) ip(158.155.4.146)
Jul 20 15:26:09 ibmblade05 clurgmgrd[5073]: <debug> Node 1 is not listening
Jul 20 15:26:09 ibmblade05 openais[3837]: [CLM ] Members Left:
Jul 20 15:26:09 ibmblade05 clurgmgrd[5073]: <debug> Membership Change Event
Jul 20 15:26:09 ibmblade05 openais[3837]: [CLM ] Members Joined:
Jul 20 15:26:09 ibmblade05 clurgmgrd[5073]: <debug> Node 1 is not listening
Jul 20 15:26:09 ibmblade05 openais[3837]: [CLM ] CLM CONFIGURATION CHANGE
Jul 20 15:26:09 ibmblade05 clurgmgrd[5073]: <debug> Membership Change Event
Jul 20 15:26:09 ibmblade05 openais[3837]: [CLM ] New Configuration:
Jul 20 15:26:09 ibmblade05 clurgmgrd[5073]: <debug> Node 1 is not listening
Jul 20 15:26:09 ibmblade05 openais[3837]: [CLM ] r(0) ip(158.155.4.146)
Jul 20 15:26:09 ibmblade05 openais[3837]: [CLM ] r(0) ip(158.155.4.148)
Jul 20 15:26:09 ibmblade05 openais[3837]: [CLM ] Members Left:
Jul 20 15:26:09 ibmblade05 openais[3837]: [CLM ] Members Joined:
Jul 20 15:26:09 ibmblade05 openais[3837]: [CLM ] r(0) ip(158.155.4.148)
Jul 20 15:26:09 ibmblade05 openais[3837]: [SYNC ] This node is within the primary component and will provide service.
Jul 20 15:26:09 ibmblade05 openais[3837]: [TOTEM] entering OPERATIONAL state.
Jul 20 15:26:09 ibmblade05 openais[3837]: [CLM ] got nodejoin message 158.155.4.146
Jul 20 15:26:09 ibmblade05 openais[3837]: [CLM ] got nodejoin message 158.155.4.148
Jul 20 15:26:09 ibmblade05 openais[3837]: [CPG ] got joinlist message from node 2
Jul 20 15:26:47 ibmblade05 clurgmgrd[5073]: <debug> Event: Port Opened
Jul 20 15:26:47 ibmblade05 clurgmgrd[5073]: <info> State change: ibmblade07 UP

*** After the node ibmblade07 is shut down, the application starts on ibmblade05 but never completes.
ibmblade05> clustat
Cluster Status for ime2nodecl @ Mon Jul 20 15:29:40 2009
Member Status: Quorate

 Member Name         ID   Status
 ------ ----         ---- ------
 ibmblade07             1 Offline
 ibmblade05             2 Online, Local, RG-Master

 Service Name         Owner (Last)     State
 ------- ----         ----- ------     -----
 service:imeappl01    ibmblade05
 service:imeslave     (ibmblade05)

ibmblade05> ps -fucgi
UID        PID  PPID  C STIME TTY          TIME CMD
cgi      11374 11165  0 14:59 ?        00:00:00 cgi_daemon -u cgi cgi_startup start
cgi      11375 11374  0 14:59 ?        00:00:00 /opt/dcs/usr/local/bin/bash /opt/dcs_6.0/bin/cgi_startup start
cgi      11537 11375  0 14:59 ?        00:00:00 /opt/dcs/usr/local/bin/bash /opt/dcs_6.0/bin/cgi_startup start -nolog
cgi      11916     1  0 14:59 ?        00:00:00 /opt/dcs_6.0/appst/bin/task_manager
cgi      11917     1  0 14:59 ?        00:00:00 cgi_mlog -s 10000000 -o /var/dcs_6.0/mlog/task_manager.log
cgi      11994     1  0 14:59 ?        00:00:00 /opt/dcs_6.0/dcs/bin/dcs_tape_rpc_serv
cgi      11995     1  0 14:59 ?        00:00:00 cgi_mlog -s 10000000 -o /var/dcs_6.0/mlog/dcs_tape_rpc_serv.log
cgi      12044 11537  0 14:59 ?        00:00:00 /opt/dcs/usr/local/bin/bash /opt/dcs_6.0/uio/files/startup.sh check_db
cgi      12056 12044  0 14:59 ?        00:00:00 /usr/local/bin/bash /opt/dcs_6.0/bin/cgi_check_databases -no_server UIO_DB_PATH:uio_maint_values_db
cgi      12057 12056  0 14:59 ?        00:00:00 /usr/local/bin/bash /opt/dcs_6.0/bin/cgi_check_databases -no_server UIO_DB_PATH:uio_maint_values_db
cgi      12058 12057  0 14:59 ?        00:00:00 uio_check UIO_DB_PATH:uio_maint_values_db

After the 1st node, ibmblade07, is booted back up, the application remains in the same state:

ibmblade05> clustat
Cluster Status for ime2nodecl @ Mon Jul 20 15:54:35 2009
Member Status: Quorate

 Member Name         ID   Status
 ------ ----         ---- ------
 ibmblade07             1 Online, RG-Worker
 ibmblade05             2 Online, Local, RG-Master

 Service Name         Owner (Last)     State
 ------- ----         ----- ------     -----
 service:imeappl01    ibmblade05       starting
 service:imeslave     (ibmblade05)     stopped

ibmblade05> ps -fucgi
UID        PID  PPID  C STIME TTY          TIME CMD
cgi      11374 11165  0 14:59 ?        00:00:00 cgi_daemon -u cgi cgi_start
cgi      11375 11374  0 14:59 ?        00:00:00 /opt/dcs/usr/local/bin/bash
cgi      11537 11375  0 14:59 ?        00:00:00 /opt/dcs/usr/local/bin/bash
cgi      11916     1  0 14:59 ?        00:00:00 /opt/dcs_6.0/appst/bin/task
cgi      11917     1  0 14:59 ?        00:00:00 cgi_mlog -s 10000000 -o /va
cgi      11994     1  0 14:59 ?        00:00:00 /opt/dcs_6.0/dcs/bin/dcs_ta
cgi      11995     1  0 14:59 ?        00:00:00 cgi_mlog -s 10000000 -o /va
cgi      12044 11537  0 14:59 ?        00:00:00 /opt/dcs/usr/local/bin/bash
cgi      12056 12044  0 14:59 ?        00:00:00 /usr/local/bin/bash /opt/dc
cgi      12057 12056  0 14:59 ?        00:00:00 /usr/local/bin/bash /opt/dc
cgi      12058 12057  0 14:59 ?        00:00:00 uio_check UIO_DB_PATH:uio_m
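For what it's worth, uninterruptible (D state) processes and the kernel wait channel they are blocked on can be listed directly; this is a generic sketch (not part of the original report) using only standard ps and awk:

    # list processes stuck in uninterruptible sleep and where in the kernel they wait
    ps -eo pid,stat,wchan:32,args | awk 'NR==1 || $2 ~ /^D/'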
Created attachment 354384 [details]
Cluster configuration file for the 2-node cluster
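The attachment itself is not reproduced inline. As an illustration only, here is a minimal sketch of the kind of two-node cluster.conf this report is about; the cluster name, node names, and the gfs_controld line come from output quoted later in this bug, while the fence device and the <rm> contents are placeholders rather than the actual configuration:

    <?xml version="1.0"?>
    <cluster name="ime2nodecl" config_version="5">
      <!-- two_node="1" with expected_votes="1" lets a single surviving node keep
           quorum, matching the "Flags: 2node" shown by cman_tool status below -->
      <cman two_node="1" expected_votes="1"/>
      <!-- a post_join_delay of 10 seconds is suggested at the end of this report -->
      <fence_daemon post_fail_delay="0" post_join_delay="10"/>
      <clusternodes>
        <clusternode name="ibmblade07" nodeid="1" votes="1">
          <fence>
            <method name="1">
              <device name="blade_fence"/> <!-- placeholder; per-node options omitted -->
            </method>
          </fence>
        </clusternode>
        <clusternode name="ibmblade05" nodeid="2" votes="1">
          <fence>
            <method name="1">
              <device name="blade_fence"/> <!-- placeholder; per-node options omitted -->
            </method>
          </fence>
        </clusternode>
      </clusternodes>
      <fencedevices>
        <!-- placeholder device; the real agent and options depend on the hardware -->
        <fencedevice name="blade_fence" agent="fence_bladecenter" ipaddr="..." login="..." passwd="..."/>
      </fencedevices>
      <gfs_controld plock_ownership="0" plock_rate_limit="0"/>
      <rm>
        <!-- failover domains, resources, and the imeappl01/imeslave services go here -->
      </rm>
    </cluster>

With two_node="1" the surviving node stays quorate on its own, which is exactly why working fencing matters in this setup: the survivor must be able to fence its peer before it can safely recover the GFS/GFS2 journals and continue.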
Kernel level:

ibmblade05> uname -a
Linux ibmblade05 2.6.18-128.1.10.0.1.el5 #1 SMP Thu May 7 17:49:20 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux

Processes stuck in disk-wait can't be killed (even with -9). There are a number of reasons this can occur. The service is stuck in the 'starting' state, which is probably correct. Need:

- output of 'cman_tool nodes'
- output of 'group_tool ls'
- sysrq-trace (echo t > /proc/sysrq-trigger)
- full logs (after taking the sysrq-trace)

Oh, and:

- output of 'cman_tool status'
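A minimal sketch of collecting exactly that information on the surviving node (the output file names are arbitrary; the commands are the ones requested above):

    # run as root on the node that is still a cluster member
    cman_tool nodes  > /tmp/cman_nodes.out
    cman_tool status > /tmp/cman_status.out
    group_tool ls    > /tmp/group_ls.out

    # dump every task's kernel stack into the kernel log, then capture the logs
    echo t > /proc/sysrq-trigger
    cp /var/log/messages /tmp/messages.after-sysrq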
Created attachment 354503 [details]
Current logs from running node, ibmblade07
ibmblade07> ps -fucgi
UID        PID  PPID  C STIME TTY          TIME CMD
cgi        615     1  0 11:21 ?        00:00:00 /opt/dcs_6.0/appst/bin/task_manager
cgi        616     1  0 11:21 ?        00:00:00 cgi_mlog -s 10000000 -o /var/dcs_6.0/mlo
cgi        694     1  0 11:21 ?        00:00:00 /opt/dcs_6.0/dcs/bin/dcs_tape_rpc_serv
cgi        695     1  0 11:21 ?        00:00:00 cgi_mlog -s 10000000 -o /var/dcs_6.0/mlo
cgi        751 32690  0 11:21 ?        00:00:00 /opt/dcs/usr/local/bin/bash /opt/dcs_6.0
cgi        765   751  0 11:21 ?        00:00:00 /usr/local/bin/bash /opt/dcs_6.0/bin/cgi
cgi        766   765  0 11:21 ?        00:00:00 /usr/local/bin/bash /opt/dcs_6.0/bin/cgi
cgi        767   766  0 11:21 ?        00:00:00 uio_check UIO_DB_PATH:uio_maint_values_d
cgi      32527 32318  0 11:21 ?        00:00:00 cgi_daemon -u cgi cgi_startup start
cgi      32528 32527  0 11:21 ?        00:00:00 /opt/dcs/usr/local/bin/bash /opt/dcs_6.0
cgi      32690 32528  0 11:21 ?        00:00:00 /opt/dcs/usr/local/bin/bash /opt/dcs_6.0

ibmblade07> clustat
Cluster Status for ime2nodecl @ Tue Jul 21 11:32:37 2009
Member Status: Quorate

 Member Name         ID   Status
 ------ ----         ---- ------
 ibmblade07             1 Online, Local, RG-Master
 ibmblade05             2 Offline

 Service Name         Owner (Last)     State
 ------- ----         ----- ------     -----
 service:imeappl01    ibmblade07       starting
 service:imeslave     (ibmblade07)     stopped

ibmblade07> cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M    336   2009-07-21 11:15:57  ibmblade07
   2   X    344                        ibmblade05

ibmblade07> group_tool ls
type             level name       id       state
fence            0     default    00010002 none
[1]
dlm              1     clvmd      00020002 none
[1]
dlm              1     appl01     00040002 none
[1]
dlm              1     data01     00060002 none
[1]
dlm              1     data02     00080002 none
[1]
dlm              1     data03     000a0002 none
[1]
dlm              1     rgmanager  000b0002 none
[1]
gfs              2     appl01     00030002 none
[1]
gfs              2     data01     00050002 none
[1]
gfs              2     data02     00070002 none
[1]
gfs              2     data03     00090002 none
[1]

ibmblade07> cman_tool status
Version: 6.1.0
Config Version: 5
Cluster Name: ime2nodecl
Cluster Id: 39062
Cluster Member: Yes
Cluster Generation: 348
Membership state: Cluster-Member
Nodes: 1
Expected votes: 1
Total votes: 1
Quorum: 1
Active subsystems: 9
Flags: 2node Dirty
Ports Bound: 0 11 177
Node name: ibmblade07
Node ID: 1
Multicast addresses: 239.192.152.47
Node addresses: 158.155.4.148

Ok, so the cluster states all look fine. The DLM isn't blocked on fencing, and none of the GFS groups are in recovery (which they typically are when they block I/O and so forth). It sounds like some of your processes are waiting for something to occur on the GFS (or GFS2) side of things, and that this isn't a problem with rgmanager, which appears to be operating.

Looking at your logs, you have a process in the disk-wait state, presumably one that should have completed starting some time ago:

uio_check     D ffff810001014f00     0   767    766                     (NOTLB)
 ffff81082079fe08 0000000000000086 ffff81082079fd68 0000000000000092
 0000000000000000 0000000000000002 ffff81081a33b100 ffff81011cb9d080
 000007da158b7269 000000000001a5e8 ffff81081a33b2e8 0000000280088904
Call Trace:
 [<ffffffff885b1a69>] :dlm:dlm_posix_lock+0x172/0x210
 [<ffffffff8009dbfb>] autoremove_wake_function+0x0/0x2e
 [<ffffffff885e0c72>] :gfs2:gfs2_lock+0xc3/0xcf
 [<ffffffff80039903>] fcntl_setlk+0x11e/0x273
 [<ffffffff800db57d>] sys_fstatfs+0x57/0x68
 [<ffffffff800b4702>] audit_syscall_entry+0x16e/0x1a1
 [<ffffffff8002e25b>] sys_fcntl+0x269/0x2dc
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

My guess is that it's stuck because of one of these GFS2 quotad threads (I cut out the stack traces from the first three because they're all the same):

gfs2_quotad   D ffff810001014f00     0 27119    281         27131 27118 (L-TLB)
...
gfs2_quotad   D ffff810001014f00     0 27145    281         27155 27144 (L-TLB)
...
gfs2_quotad   D ffff810001014f00     0 27175    281         27186 27174 (L-TLB)
...
gfs2_quotad   D ffff810001014f00     0 27199    281         27227 27198 (L-TLB)
 ffff81080deabe00 0000000000000046 ffff81080fccc000 ffffffff885d5fe5
 ffff81080e2f7990 000000000000000a ffff81080de940c0 ffff81011cb9d080
 00000804ab028d60 0000000000000709 ffff81080de942a8 00000002885eb311
Call Trace:
 [<ffffffff885d5fe5>] :gfs2:gfs2_glock_dq+0x117/0x132
 [<ffffffff8006395c>] schedule_timeout+0x8a/0xad
 [<ffffffff80095053>] process_timeout+0x0/0x5
 [<ffffffff885e6e75>] :gfs2:gfs2_quotad+0x194/0x1c2
 [<ffffffff8009dbfb>] autoremove_wake_function+0x0/0x2e
 [<ffffffff885e6ce1>] :gfs2:gfs2_quotad+0x0/0x1c2
 [<ffffffff8009d9e3>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032383>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff8009d9e3>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032285>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

Rgmanager also is not internally "stuck" (if it was, you would not be able to run 'clustat'), leading me to believe that if a bug exists, it is within GFS2 or the DLM and not rgmanager. We'll need:

- dlm_tool lockdump [lockspace]
- dlm_tool lockdebug [lockspace]

... for each file system lockspace in use. I think this is your list of lockspaces, but you should double-check to be sure. There is one file system lockspace for each GFS/GFS2 mount.

* appl01
* data01
* data02
* data03

So, exactly how do I do this?

ibmblade07> gfs2_tool df /dcs/appl01
/dcs/appl01:
  SB lock proto = "lock_dlm"
  SB lock table = "ime2nodecl:appl01"
  SB ondisk format = 1801
  SB multihost format = 1900
  Block size = 4096
  Journals = 8
  Resource Groups = 550
  Mounted lock proto = "lock_dlm"
  Mounted lock table = "ime2nodecl:appl01"
  Mounted host data = "jid=1:id=196610:first=0"
  Journal number = 1
  Lock module flags = 0
  Local flocks = FALSE
  Local caching = FALSE

  Type           Total          Used           Free           use%
  ------------------------------------------------------------------------
  data           35709248       814235         34895013       2%
  inodes         34907929       12916          34895013       0%

ibmblade07> dlm_tool lockdump appl01
can't open /sys/kernel/debug/dlm/appl01_locks: No such file or directory
ibmblade07> dlm_tool lockdump ime2nodecl:appl01
can't open /sys/kernel/debug/dlm/ime2nodecl:appl01_locks: No such file or directory
ibmblade07> dlm_tool lockdump ime2nodecl
can't open /sys/kernel/debug/dlm/ime2nodecl_locks: No such file or directory
ibmblade07> ls -al /sys/kernel/debug/

I have IBM RDAC drivers to deal with, so if I need to boot off a debug kernel, I will need to load the RDAC drivers into the debug kernel. That could take some time. Since I am recreating this issue on Dev servers rather than the customer site, I can update to the latest RPMs and attempt to recreate the issue for further research as well. Just let me know.
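A note on the failed lockdump attempts above: dlm_tool reads its lock dumps from debugfs, and the empty /sys/kernel/debug listing suggests debugfs simply isn't mounted; mounting it should not require a debug kernel. A minimal sketch, assuming the running kernel exposes the DLM debug files once debugfs is mounted (which the stock RHEL 5 kernel is expected to do):

    # make the DLM debug files visible, then retry the dump
    mount -t debugfs none /sys/kernel/debug
    ls /sys/kernel/debug/dlm/          # should now list <lockspace>_locks files
    dlm_tool lockdump appl01           # lockspace name only, no "ime2nodecl:" prefix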
The process stuck in dlm_posix_lock may be a result of bug 474163, which is a bug in the plock_ownership mode.

Is cman-2.0.100 available for RHEL 5.3? So, what would you like for me to do at this time?

I'd suggest using plock_ownership="0" for now to avoid bug 474163; try the tests with that.

Well, we rely on file locks. Is this going to be a problem? We have processes on 2+ nodes accessing files at the same time.

plock_ownership is just a performance-related tunable. It improves performance on file locks that are not shared between nodes, but happen to just be used (and frequently reused) from a single node. For file locks that are used from multiple nodes it actually reduces performance somewhat.

When we first started testing RHEL Clustering, we were using GFS. Now with GFS2, is the line in cluster.conf needed?

ibmblade05> grep plock /etc/cluster/cluster.conf
        <gfs_controld plock_ownership="0" plock_rate_limit="0"/>

Setting plock_ownership to '0' seems to work.

ibmblade05> clustat
Cluster Status for ime2nodecl @ Tue Jul 21 15:26:04 2009
Member Status: Quorate

 Member Name         ID   Status
 ------ ----         ---- ------
 ibmblade07             1 Offline
 ibmblade05             2 Online, Local, RG-Master

 Service Name         Owner (Last)     State
 ------- ----         ----- ------     -----
 service:imeappl01    ibmblade05       started
 service:imeslave     (ibmblade05)     stopped

The gfs_controld daemon and the <gfs_controld/> cluster.conf config are used for both gfs1 and gfs2.

It seems to fix the issue. Should we set plock_ownership to '0' for 2-node clusters or all clusters?

plock_ownership should be 0 for all clusters.

Is this the recommended solution, or should I tell the customer there may be a fix in the near future?

I'd call this a recommended configuration. An important detail I forgot to mention is that plock_ownership cannot be changed without stopping the cluster, changing cluster.conf on all nodes, then restarting the cluster. It's possible that the application would benefit in the future by enabling the plock_ownership optimization. It all depends on how it uses posix locks. If instances of the app on different nodes tend to mostly take their own separate locks, and repeatedly lock/unlock the same ones, then ownership can help. But if the same locks are most often taken from multiple nodes (as you'd expect from an app sharing files), then the optimization wouldn't likely help.

Thanks for the info. One other question: do you have any idea why the 'halted' node was not 'fenced'? What should you expect in a 2-node configuration? Are there other configuration settings required to make 'fencing' work in a 2-node configuration?

(In reply to comment #24)
> Thanks for the info.
> One other question, do you have any idea why the 'halted' node was not
> 'fenced'? What should you expect in a 2-node configuration? Are there other
> configuration settings required to make 'fencing' work in a 2-node
> configuration?

Never mind. It appears to be working properly now. Can you make any other recommendations for a 2-node cluster? Can you review the attached cluster.conf and see if any other 'settings' may be of use?

I found this reference:

- http://www.mail-archive.com/linux-cluster@redhat.com/msg06573.html

Also, mounting the fs with the nodiratime option and setting:

    <dlm plock_ownership="1" plock_rate_limit="0"/>
    <gfs_controld plock_rate_limit="0"/>

in cluster.conf reduced rm -r time on the directory referenced below from ~ 9 mins to ~ 5 mins.

Also, from the dlm_controld man page, you have this setting to consider:

    <dlm>
      <lockspace name="foo" nodir="1">
        <master name="node01" weight="2"/>
        <master name="node02" weight="1"/>
      </lockspace>
    </dlm>

From your experiences, what could you suggest for other tuning considerations?

gfs should always be mounted with noatime (nodiratime adds nothing more); atime hurts performance immensely. The plock settings will have no effect on fs operations apart from file locking. The dlm lockspace/master settings are for specialized configurations and can be ignored; it was an area of research that never produced any general or practical results for users. The attached cluster.conf looks fine (I can't really comment on the <rm> section), but I'd suggest a larger post_join_delay, perhaps 10 seconds, unless you've found a reason to keep it lower.

*** This bug has been marked as a duplicate of bug 474163 ***