Bug 214290
Summary: | send plock error -1 in gfs_controld logs
---|---
Product: | Red Hat Enterprise Linux 5
Component: | openais
Version: | 5.0
Hardware: | All
OS: | Linux
Status: | CLOSED CURRENTRELEASE
Severity: | medium
Priority: | medium
Reporter: | Abhijith Das <adas>
Assignee: | Steven Dake <sdake>
QA Contact: | Cluster QE <mspqa-list>
CC: | ccaulfie, cluster-maint, kanderso, rkenna, teigland
Fixed In Version: | 5.0.0
Doc Type: | Bug Fix
Last Closed: | 2006-11-28 21:35:03 UTC
Description
Abhijith Das
2006-11-06 22:10:43 UTC
cman_tool nodes output:

[root@camel ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M  58804   2006-11-06 15:28:26  camel
   2   M  58804   2006-11-06 15:28:26  merit
   3   M  58804   2006-11-06 15:28:26  winston
   4   M  58804   2006-11-06 15:28:26  kool
   5   M  58804   2006-11-06 15:28:26  salem

[root@merit ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M  58804   2006-11-06 15:28:36  camel
   2   M  58764   2006-11-06 15:17:10  merit
   3   M  58792   2006-11-06 15:27:00  winston
   4   M  58780   2006-11-06 15:18:17  kool
   5   M  58776   2006-11-06 15:18:11  salem

[root@winston ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M  58804   2006-11-06 15:29:28  camel
   2   M  58792   2006-11-06 15:27:52  merit
   3   M  58784   2006-11-06 15:27:52  winston
   4   M  58792   2006-11-06 15:27:52  kool
   5   M  58792   2006-11-06 15:27:52  salem

[root@kool ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M  58804   2006-11-06 15:30:46  camel
   2   M  58780   2006-11-06 15:20:28  merit
   3   M  58792   2006-11-06 15:29:11  winston
   4   M  58780   2006-11-06 15:20:28  kool
   5   M  58780   2006-11-06 15:20:28  salem

[root@salem ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M  58804   2006-11-06 15:32:27  camel
   2   M  58776   2006-11-06 15:22:03  merit
   3   M  58792   2006-11-06 15:30:51  winston
   4   M  58780   2006-11-06 15:22:08  kool
   5   M  58764   2006-11-06 15:22:03  salem

group_tool -v output:

[root@camel ~]# group_tool -v
type   level  name     id        state            node id  local_done
fence  0      default  00010003  none
[1 2 3 4 5]
dlm    1      clvmd    00010001  none
[1 2 3 4 5]

[root@merit ~]# group_tool -v
type   level  name     id        state            node id  local_done
fence  0      default  00010003  none
[1 2 3 4 5]
dlm    1      clvmd    00010001  none
[1 2 3 4 5]
dlm    1      soot     00020002  FAIL_START_WAIT  1 100030003 0
[2 4 5]
dlm    1      ash      00040002  FAIL_START_WAIT  1 100030003 0
[2 4 5]
dlm    1      cancer   00060002  FAIL_START_WAIT  1 100030003 0
[2 4 5]
gfs    2      soot     00010002  FAIL_START_WAIT  1 100030003 0
[2 4 5]
gfs    2      ash      00030002  FAIL_START_WAIT  1 100030003 0
[2 4 5]
gfs    2      cancer   00050002  FAIL_START_WAIT  1 100030003 0
[2 4 5]

[root@winston ~]# group_tool -v
type   level  name     id        state            node id  local_done
fence  0      default  00010003  none
[1 2 3 4 5]
dlm    1      clvmd    00010001  none
[1 2 3 4 5]
gfs    2      soot     00000000  JOIN_STOP_WAIT   3 300040001 1
[2 3 4 5]

[root@kool ~]# group_tool -v
type   level  name     id        state            node id  local_done
fence  0      default  00010003  none
[1 2 3 4 5]
dlm    1      clvmd    00010001  none
[1 2 3 4 5]
dlm    1      soot     00020002  FAIL_START_WAIT  1 100030003 1
[2 4 5]
dlm    1      ash      00040002  FAIL_START_WAIT  1 100030003 1
[2 4 5]
dlm    1      cancer   00060002  FAIL_START_WAIT  1 100030003 1
[2 4 5]
gfs    2      soot     00010002  FAIL_START_WAIT  1 100030003 1
[2 4 5]
gfs    2      ash      00030002  FAIL_START_WAIT  1 100030003 1
[2 4 5]
gfs    2      cancer   00050002  FAIL_START_WAIT  1 100030003 1
[2 4 5]

[root@salem ~]# group_tool -v
type   level  name     id        state            node id  local_done
fence  0      default  00010003  none
[1 2 3 4 5]
dlm    1      clvmd    00010001  none
[1 2 3 4 5]
dlm    1      soot     00020002  FAIL_START_WAIT  1 100030003 1
[2 4 5]
dlm    1      ash      00040002  FAIL_START_WAIT  1 100030003 1
[2 4 5]
dlm    1      cancer   00060002  FAIL_START_WAIT  1 100030003 1
[2 4 5]
gfs    2      soot     00010002  FAIL_START_WAIT  1 100030003 1
[2 4 5]
gfs    2      ash      00030002  FAIL_START_WAIT  1 100030003 1
[2 4 5]
gfs    2      cancer   00050002  FAIL_START_WAIT  1 100030003 1
[2 4 5]

cman_tool status output:

[root@camel ~]# cman_tool status
Version: 6.0.1
Config Version: 1
Cluster Name: smoke
Cluster Id: 3471
Cluster Member: Yes
Cluster Generation: 58804
Membership state: Cluster-Member
Nodes: 5
Expected votes: 5
Total votes: 5
Quorum: 3
Active subsystems: 7
Flags:
Ports Bound: 0 11
Node name: camel
Node ID: 1
Multicast addresses: 239.192.13.156
Node addresses: 10.15.89.52

[root@merit ~]# cman_tool status
Version: 6.0.1
Config Version: 1
Cluster Name: smoke
Cluster Id: 3471
Cluster Member: Yes
Cluster Generation: 58804
Membership state: Cluster-Member
Nodes: 5
Expected votes: 5
Total votes: 5
Quorum: 3
Active subsystems: 6
Flags:
Ports Bound: 0 11
Node name: merit
Node ID: 2
Multicast addresses: 239.192.13.156
Node addresses: 10.15.89.54

[root@winston ~]# cman_tool status
Version: 6.0.1
Config Version: 1
Cluster Name: smoke
Cluster Id: 3471
Cluster Member: Yes
Cluster Generation: 58804
Membership state: Cluster-Member
Nodes: 5
Expected votes: 5
Total votes: 5
Quorum: 3
Active subsystems: 7
Flags:
Ports Bound: 0 11
Node name: winston
Node ID: 3
Multicast addresses: 239.192.13.156
Node addresses: 10.15.89.53

[root@kool ~]# cman_tool status
Version: 6.0.1
Config Version: 1
Cluster Name: smoke
Cluster Id: 3471
Cluster Member: Yes
Cluster Generation: 58804
Membership state: Cluster-Member
Nodes: 5
Expected votes: 5
Total votes: 5
Quorum: 3
Active subsystems: 7
Flags:
Ports Bound: 0 11
Node name: kool
Node ID: 4
Multicast addresses: 239.192.13.156
Node addresses: 10.15.89.56

[root@salem ~]# cman_tool status
Version: 6.0.1
Config Version: 1
Cluster Name: smoke
Cluster Id: 3471
Cluster Member: Yes
Cluster Generation: 58804
Membership state: Cluster-Member
Nodes: 5
Expected votes: 5
Total votes: 5
Quorum: 3
Active subsystems: 7
Flags:
Ports Bound: 0 11
Node name: salem
Node ID: 5
Multicast addresses: 239.192.13.156
Node addresses: 10.15.89.57

Created attachment 140519 [details]
group_tool dump on all nodes
Created attachment 140520 [details]
/var/log/messages on the smoke cluster nodes:
Beta2 Blocker proposed. Problem found when running revolver on the smoke cluster in the same configuration as the QE release criteria. Test is being restarted to collect more information.

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux release. Product Management has requested further review of this request by Red Hat Engineering. This request is not yet committed for inclusion in release.

I believe I have identified the cause of the weird configuration changes experienced in the QA and engineering labs.

Every time a new processor is added or an existing processor is removed from the configuration, a new round of consensus gathering (membership determination) occurs. This is what revolver is exercising.

A join message is sent upon entering the gather phase of the protocol, and a join timer is started (100 msec timeout). If consensus is not reached within 100 msec, the join timeout expires, a new join message is sent, and the join timer is restarted. This means that the join message is sent every 100 msec until consensus is reached.

To bound the time allowed for reaching consensus, a consensus timer is started on entry to the gather phase. When this consensus timer expires, any processors with which consensus could not be reached are added to the failed list, gather is entered again, and the process is repeated until a membership is formed.

In this case, overload of the network can cause the join message to be lost by some of the processors in the network. The consensus timeout is configured to 200 msec, so the join message is resent only once (for a total of two times), apart from any new join message sent because a processor identified a new potential member. This causes a node to be excluded from the membership.

A short while later, the nodes multicast a message. If the message is from a different ring, the membership protocol is started (gather entered, see above). This time around, however, consensus is reached because the messages or retries get through to the nodes. This behavior confuses upper level components.

The consensus timeout must be less than the token timeout. The ratio of the consensus timeout to the join timeout determines the number of times a join message is resent. I suggest changing the join timeout to 60 msec and the consensus timeout to 4800 msec to offer the greatest possibility of forming a new configuration under network overload conditions. These changes are made in the cman parser, so cman must be rebuilt after it is modified.

I have also identified a discrepancy between the specification and the code which causes mp5 to fail after 30-45 mins. I don't completely understand the scenario, but I'd defer to the specification and the fact that mp5 now runs properly. This patch is a one-liner and must have been missed in a previous commit, because it is in one of my work trees under which I ran mp5 for several days without failure. This requires a rebuild of the openais package.

cman defaults changed:

Checking in ais.c;
/cvs/cluster/cluster/cman/daemon/ais.c,v  <--  ais.c
new revision: 1.45; previous revision: 1.44
done

fixed in openais-0.80.1-15.el5 and some newer version of cman
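
The retry arithmetic in the analysis above can be sketched in a few lines of Python. This is only an illustration of the reasoning, not openais or cman code: the function names are hypothetical, and the loss model (each join message dropped independently with a fixed probability, here an arbitrary 0.5) is an assumption that does not come from the bug report.

```python
def join_sends(join_timeout_ms: int, consensus_timeout_ms: int) -> int:
    """Count join messages sent during one gather phase.

    One join is sent on entering gather (t = 0) and another each time the
    join timer fires, until the consensus timer expires. Extra joins sent
    because a new potential member was identified are not counted.
    """
    if join_timeout_ms <= 0 or consensus_timeout_ms <= 0:
        raise ValueError("timeouts must be positive")
    # Integer ceiling of consensus / join: sends happen at t = 0, j, 2j, ... < c.
    return -(-consensus_timeout_ms // join_timeout_ms)


def all_joins_lost(p_loss: float, sends: int) -> float:
    """Probability that every join is lost, ASSUMING independent losses."""
    return p_loss ** sends


if __name__ == "__main__":
    # (join, consensus) in msec: the old values and the suggested values.
    for join_ms, consensus_ms in [(100, 200), (60, 4800)]:
        sends = join_sends(join_ms, consensus_ms)
        # 0.5 is a made-up loss rate used only to compare the two settings.
        print(join_ms, consensus_ms, sends, all_joins_lost(0.5, sends))
```

With the old values (100 msec join, 200 msec consensus) a node gets two chances to have its join heard in each gather phase; with the suggested values (60 msec join, 4800 msec consensus) it gets 80, which is why the change makes exclusion under a briefly overloaded network far less likely.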