Bug 516758

Summary: rgmanager: local_node_name does not check if magma_tool failed.
Product: [Retired] Red Hat Cluster Suite Reporter: Eduardo Damato <edamato>
Component: rgmanagerAssignee: Lon Hohberger <lhh>
Status: CLOSED ERRATA QA Contact: Cluster QE <mspqa-list>
Severity: low Docs Contact:
Priority: low    
Version: 4CC: cfeist, cluster-maint, djansa, fnadge, iannis, sbradley, tao
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: rgmanager-1.9.88-1.el4 Doc Type: Bug Fix
Doc Text:
Previously, the function local_node_name in /resources/utils/member_util.sh did not properly check if magma_tool failed and could return an empty string. With this update, this issue is resolved.
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-02-16 15:07:14 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Eduardo Damato 2009-08-11 12:26:27 UTC
Description of problem:

The function local_node_name in /resources/utils/member_util.sh does not check whether magma_tool failed and can return an empty string.

Version-Release number of selected component (if applicable):

rgmanager-1.9.87-1.el4

How reproducible:

Every time magma_tool returns an error; deterministic.

Steps to Reproduce:
1. Activate HALVM with a proper configuration.
2. Run: cman_tool leave force
3. Note the errors in the log claiming the HALVM configuration is wrong.
  
Actual results:

When magma_tool fails (and one could argue that at that point the cluster is not working at all), scripts may misinterpret the empty value returned by local_node_name.

Expected results:

local_node_name should return 2 if there is a problem processing the output of magma_tool.
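A minimal sketch of such a guard is below; the exact magma_tool invocation and the surrounding function body are assumptions based on the description, not the shipped member_util.sh code:

```shell
# Hypothetical guarded local_node_name (sketch only; the real
# member_util.sh and the exact magma_tool subcommand may differ).
local_node_name() {
    name=$(magma_tool localname 2>/dev/null)
    # Return 2 if magma_tool failed or produced no output, so that
    # callers never mistake an empty string for a valid node name.
    if [ $? -ne 0 ] || [ -z "$name" ]; then
        return 2
    fi
    echo "$name"
}
```

With a check like this, callers can distinguish "lookup failed" (status 2) from a legitimate node name on stdout.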

Additional info:

This issue is low priority because it happens most often on a cluster node that is already outside the cluster and/or cannot access magma.

The problem can occur in many situations, notably when a machine has been disconnected from the cluster. Such a node will certainly be fenced, but the errors below give the wrong impression that the cluster setup itself is wrong:

Aug 10 09:06:49 node2 clurgmgrd: [7365]: <err> WARNING: An improper setup can cause data corruption! 
Aug 10 09:06:49 node2 clurgmgrd: [7365]: <err> HA LVM:  Improper setup detected 

(sanitized input)


Aug 10 09:06:29 node2 kernel: CMAN: sendmsg failed: -22 
Aug 10 09:06:49 node2 last message repeated 4 times 
Aug 10 09:06:49 node2 kernel: CMAN: No functional network interfaces, leaving cluster 
Aug 10 09:06:49 node2 kernel: CMAN: sendmsg failed: -22 
Aug 10 09:06:49 node2 kernel: CMAN: sendmsg failed: -22 
Aug 10 09:06:49 node2 kernel: CMAN: we are leaving the cluster. 
Aug 10 09:06:49 node2 kernel: WARNING: dlm_emergency_shutdown 
Aug 10 09:06:49 node2 clurgmgrd[7365]: <warning> #67: Shutting down uncleanly 
Aug 10 09:06:49 node2 clurgmgrd[7365]: <debug> Emergency stop of cluster_cible_BDD 
Aug 10 09:06:49 node2 ccsd[7262]: Cluster manager shutdown.  Attemping to reconnect... 
Aug 10 09:06:49 node2 kernel: WARNING: dlm_emergency_shutdown finished 1 
Aug 10 09:06:49 node2 kernel: SM: 00000003 sm_stop: SG still joined 
Aug 10 09:06:49 node2 udev[12286]: removing device node '/dev/misc/dlm_Magma' 
Aug 10 09:06:49 node2 udevd[2353]: udev done! 
Aug 10 09:06:49 node2 ccsd[7262]: Cluster is not quorate.  Refusing connection. 
Aug 10 09:06:49 node2 ccsd[7262]: Error while processing connect: Connection refused 
Aug 10 09:06:49 node2 clurgmgrd: [7365]: <err> stop: Could not match /dev/VG/LV with a real device 
Aug 10 09:06:49 node2 clurgmgrd[7365]: <notice> stop on fs "FS" returned 2 (invalid argument(s)) 
Aug 10 09:06:49 node2 ccsd[7262]: Cluster is not quorate.  Refusing connection. 
Aug 10 09:06:49 node2 ccsd[7262]: Error while processing connect: Connection refused 
Aug 10 09:06:49 node2 clurgmgrd: [7365]: <err> HA LVM:  Improper setup detected 
Aug 10 09:06:49 node2 ccsd[7262]: Cluster is not quorate.  Refusing connection. 
Aug 10 09:06:49 node2 ccsd[7262]: Error while processing connect: Connection refused 
Aug 10 09:06:49 node2 clurgmgrd: [7365]: <err> - @ missing from "volume_list" in lvm.conf 
Aug 10 09:06:49 node2 ccsd[7262]: Cluster is not quorate.  Refusing connection. 
Aug 10 09:06:49 node2 clurgmgrd: [7365]: <err> WARNING: An improper setup can cause data corruption! 
Aug 10 09:06:50 node2 ccsd[7262]: Cluster is not quorate.  Refusing connection. 
Aug 10 09:06:50 node2 clurgmgrd: [7365]: <err> Unable to determine cluster node name 
Aug 10 09:06:50 node2 ccsd[7262]: Cluster is not quorate.  Refusing connection. 
.. 
Aug 10 09:06:51 node2 qdiskd[7336]: <err> cman_dispatch: Host is down 
Aug 10 09:06:51 node2 qdiskd[7336]: <err> Halting qdisk operations 
 
Impact is therefore low.

Comment 2 Eduardo Damato 2009-08-11 12:31:12 UTC
Created attachment 357019 [details]
initial patch to return 2 when magma_tool fails.


Proposing the attached patch to fix the problem. Arguably, HALVM should also perform input sanity checks and reject the output of local_node_name when it is empty.
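A caller-side sanity check along those lines could look like the sketch below; the function name and error message are illustrative, not taken from the HA LVM agent or the attached patch:

```shell
# Illustrative guard (hypothetical name): take a candidate node name
# and fail with status 2 when it is empty, instead of letting an
# empty string flow into the HA LVM setup checks and trigger the
# misleading "Improper setup detected" path.
require_node_name() {
    if [ -z "$1" ]; then
        echo "HA LVM: Unable to determine cluster node name" >&2
        return 2
    fi
    echo "$1"
}
```

The agent would then run something like: node=$(local_node_name) && node=$(require_node_name "$node") || exit 2, before comparing the name against its configuration.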

Eduardo.

Comment 10 Florian Nadge 2011-01-03 14:06:21 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Previously, the function local_node_name in /resources/utils/member_util.sh did not properly check if magma_tool failed and could return an empty string. With this update, this issue is resolved.

Comment 11 errata-xmlrpc 2011-02-16 15:07:14 UTC
An advisory has been issued which should help resolve the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0264.html