Bug 516758

Summary: rgmanager: local_node_name does not check if magma_tool failed.
Product: [Retired] Red Hat Cluster Suite Reporter: Eduardo Damato <edamato>
Component: rgmanagerAssignee: Lon Hohberger <lhh>
Status: CLOSED ERRATA QA Contact: Cluster QE <mspqa-list>
Severity: low Docs Contact:
Priority: low    
Version: 4CC: cfeist, cluster-maint, djansa, fnadge, iannis, sbradley, tao
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: rgmanager-1.9.88-1.el4 Doc Type: Bug Fix
Doc Text:
Previously, the function local_node_name in /resources/utils/member_util.sh did not properly check if magma_tool failed and could return an empty string. With this update, this issue is resolved.
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-02-16 15:07:14 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Eduardo Damato 2009-08-11 12:26:27 UTC
Description of problem:

The function local_node_name in /resources/utils/member_util.sh does not check whether magma_tool failed and can return an empty string.

Version-Release number of selected component (if applicable):

rgmanager-1.9.87-1.el4

How reproducible:

Every time magma_tool returns an error; deterministic.

Steps to Reproduce:
1. Activate HALVM with a proper configuration.
2. Run: cman_tool leave force
3. Note the errors in the log claiming the HALVM configuration is wrong.
  
Actual results:

When magma_tool fails (and one could argue that at that point the cluster is not working at all), scripts may misinterpret the empty value returned by local_node_name.

Expected results:

local_node_name should return 2 if there is a problem processing the output of magma_tool.
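A minimal sketch of such a guard is below; the exact magma_tool invocation and the surrounding function body are assumptions based on the description, not the shipped member_util.sh code:

```shell
# Hypothetical guarded local_node_name (sketch only; the real
# member_util.sh and the exact magma_tool subcommand may differ).
local_node_name() {
    name=$(magma_tool localname 2>/dev/null)
    # Return 2 if magma_tool failed or produced no output, so that
    # callers never mistake an empty string for a valid node name.
    if [ $? -ne 0 ] || [ -z "$name" ]; then
        return 2
    fi
    echo "$name"
}
```

With a check like this, callers can distinguish "lookup failed" (status 2) from a legitimate node name on stdout.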

Additional info:

This issue is low priority because it happens most often on a cluster node that is already outside the cluster and/or cannot access magma.

The problem can occur in many situations, notably when a machine has been disconnected from the cluster. Such a node will certainly be fenced, but the errors below give the wrong impression that the cluster setup itself is wrong:

Aug 10 09:06:49 node2 clurgmgrd: [7365]: <err> WARNING: An improper setup can cause data corruption! 
Aug 10 09:06:49 node2 clurgmgrd: [7365]: <err> HA LVM:  Improper setup detected 

(sanitized input)


Aug 10 09:06:29 node2 kernel: CMAN: sendmsg failed: -22 
Aug 10 09:06:49 node2 last message repeated 4 times 
Aug 10 09:06:49 node2 kernel: CMAN: No functional network interfaces, leaving cluster 
Aug 10 09:06:49 node2 kernel: CMAN: sendmsg failed: -22 
Aug 10 09:06:49 node2 kernel: CMAN: sendmsg failed: -22 
Aug 10 09:06:49 node2 kernel: CMAN: we are leaving the cluster. 
Aug 10 09:06:49 node2 kernel: WARNING: dlm_emergency_shutdown 
Aug 10 09:06:49 node2 clurgmgrd[7365]: <warning> #67: Shutting down uncleanly 
Aug 10 09:06:49 node2 clurgmgrd[7365]: <debug> Emergency stop of cluster_cible_BDD 
Aug 10 09:06:49 node2 ccsd[7262]: Cluster manager shutdown.  Attemping to reconnect... 
Aug 10 09:06:49 node2 kernel: WARNING: dlm_emergency_shutdown finished 1 
Aug 10 09:06:49 node2 kernel: SM: 00000003 sm_stop: SG still joined 
Aug 10 09:06:49 node2 udev[12286]: removing device node '/dev/misc/dlm_Magma' 
Aug 10 09:06:49 node2 udevd[2353]: udev done! 
Aug 10 09:06:49 node2 ccsd[7262]: Cluster is not quorate.  Refusing connection. 
Aug 10 09:06:49 node2 ccsd[7262]: Error while processing connect: Connection refused 
Aug 10 09:06:49 node2 clurgmgrd: [7365]: <err> stop: Could not match /dev/VG/LV with a real device 
Aug 10 09:06:49 node2 clurgmgrd[7365]: <notice> stop on fs "FS" returned 2 (invalid argument(s)) 
Aug 10 09:06:49 node2 ccsd[7262]: Cluster is not quorate.  Refusing connection. 
Aug 10 09:06:49 node2 ccsd[7262]: Error while processing connect: Connection refused 
Aug 10 09:06:49 node2 clurgmgrd: [7365]: <err> HA LVM:  Improper setup detected 
Aug 10 09:06:49 node2 ccsd[7262]: Cluster is not quorate.  Refusing connection. 
Aug 10 09:06:49 node2 ccsd[7262]: Error while processing connect: Connection refused 
Aug 10 09:06:49 node2 clurgmgrd: [7365]: <err> - @ missing from "volume_list" in lvm.conf 
Aug 10 09:06:49 node2 ccsd[7262]: Cluster is not quorate.  Refusing connection. 
Aug 10 09:06:49 node2 clurgmgrd: [7365]: <err> WARNING: An improper setup can cause data corruption! 
Aug 10 09:06:50 node2 ccsd[7262]: Cluster is not quorate.  Refusing connection. 
Aug 10 09:06:50 node2 clurgmgrd: [7365]: <err> Unable to determine cluster node name 
Aug 10 09:06:50 node2 ccsd[7262]: Cluster is not quorate.  Refusing connection. 
.. 
Aug 10 09:06:51 node2 qdiskd[7336]: <err> cman_dispatch: Host is down 
Aug 10 09:06:51 node2 qdiskd[7336]: <err> Halting qdisk operations 
 
Impact is therefore low.

Comment 2 Eduardo Damato 2009-08-11 12:31:12 UTC
Created attachment 357019 [details]
initial patch to return 2 when magma_tool fails.


Proposing the attached patch to fix the problem. Arguably, HALVM should also perform input sanity checks and reject the output of local_node_name when it is empty.
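A caller-side sanity check along those lines could look like the sketch below; the function name and error message are illustrative, not taken from the HA LVM agent or the attached patch:

```shell
# Illustrative guard (hypothetical name): take a candidate node name
# and fail with status 2 when it is empty, instead of letting an
# empty string flow into the HA LVM setup checks and trigger the
# misleading "Improper setup detected" path.
require_node_name() {
    if [ -z "$1" ]; then
        echo "HA LVM: Unable to determine cluster node name" >&2
        return 2
    fi
    echo "$1"
}
```

The agent would then run something like: node=$(local_node_name) && node=$(require_node_name "$node") || exit 2, before comparing the name against its configuration.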

Eduardo.

Comment 10 Florian Nadge 2011-01-03 14:06:21 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Previously, the function local_node_name in /resources/utils/member_util.sh did not properly check if magma_tool failed and could return an empty string. With this update, this issue is resolved.

Comment 11 errata-xmlrpc 2011-02-16 15:07:14 UTC
An advisory has been issued which should help resolve the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0264.html