Bug 1658557

Summary: After node eviction, during the stopping of rgmanager, the errors "unable to determine cluster node name" and "HA LVM: Improper setup detected" were logged.
Product: Red Hat Enterprise Linux 6
Reporter: SUNGTM <t.sungtm>
Component: cluster
Assignee: Christine Caulfield <ccaulfie>
Status: CLOSED WONTFIX
QA Contact: cluster-qe <cluster-qe>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 6.5
CC: ccaulfie, cluster-maint, cww, rpeterso, teigland
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-06-18 19:07:24 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description SUNGTM 2018-12-12 12:13:48 UTC
Description of problem:

When cluster node A detects that the other node, B, has missed qdisk updates, it sends an eviction notice to B. On node B, cman is killed, as shown in the logs. When rgmanager was subsequently stopped, the errors "unable to find cluster node name" and "HA LVM: Improper setup detected" were logged.


Please confirm that once the cman components have been killed on the affected node, no corosync/cman commands will work. Is that correct?


We examined the RHEL HA resource agent shell scripts to verify the origin of the following error messages found in the log:

rgmanager[92694]: [lvm] HA LVM:  Improper setup detected
rgmanager[92704]: [lvm] * @ missing from "volume_list" in lvm.conf
rgmanager[92719]: [lvm] Owner of VG_DB/lv_db is not in the cluster


Per the source listing of /usr/share/cluster/lvm.sh, if 'volume_list' does not match the cluster member name, the script logs the errors shown above:

    if ! lvm dumpconfig activation/volume_list | grep $(local_node_name); then
            ocf_log err "HA LVM:  Improper setup detected"
            ocf_log err "* @$(local_node_name) missing from \"volume_list\" in lvm.conf"
            return $OCF_ERR_GENERIC
    fi
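The failure mode can be demonstrated in isolation. The snippet below is a hypothetical illustration, not part of lvm.sh: the local_node_name stub simulates the evicted node, where the function prints nothing. The empty command substitution leaves grep with no pattern, grep exits non-zero, and the "Improper setup" branch fires even though volume_list is correct.

```shell
# Stub simulating cman being down: local_node_name prints nothing.
local_node_name() { return 1; }

# grep receives no pattern because $(local_node_name) expands to nothing,
# so it fails, and the error branch is taken despite a correct volume_list.
if ! echo 'volume_list = [ "@node-b" ]' | grep $(local_node_name) 2>/dev/null; then
    echo "HA LVM:  Improper setup detected"
fi
```

This explains why the logged message shows "* @ missing" with an empty node name after the "@".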


The cluster member name is obtained from the local_node_name function in /usr/share/cluster/utils/member_util.sh:

    local_node_name()
    {
           ...
           ...

            if which cman_tool &> /dev/null; then
                    # Use cman_tool

                    line=$(cman_tool status | grep -i "Node name: $1")
                    [ -n "$line" ] || return 1
                    echo ${line/*name: /}
                    return 0
            fi
           ...
           ...

            return 1
    }
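The lookup failure on the evicted node can be sketched as follows. The cman_tool stub here is an assumption standing in for the real binary after cman has been killed: it cannot connect, so no "Node name:" line is produced and the caller gets an empty result.

```shell
# Stub simulating cman_tool after cman has been killed (an assumption):
# it prints only a connection error and fails.
cman_tool() { echo "cman_tool: Cannot open connection to cman" >&2; return 1; }

# The grep finds no "Node name:" line, so $line is empty and
# local_node_name would hit its "return 1" path, printing nothing.
line=$(cman_tool status 2>/dev/null | grep -i "Node name:")
if [ -z "$line" ]; then
    echo "local_node_name returns 1: no node name available"
fi
```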


Because node A detected that qdisk updates from B were missing, A evicted B, which caused the cman components on B to be killed.

A:
qdiskd[5904]: Writing eviction notice for node 2
qdiskd[5904]: Node 2 evicted

B:

corosync[5860]: cman killed by node 1 because we were killed by cman_tool or other application
rgmanager[8032]: #67: Shutting down uncleanly
fenced[6619]: cluster is down, exiting


Because the cman components were killed on DCN-02, no cman-related commands work (cman_tool, ccs, etc.) [please confirm this!], and the other cluster processes were also exiting.
Before the rgmanager process exited (uncleanly), it attempted to stop all cluster resources.
Stopping the FS resource succeeded first. During the LVM stop, the cluster member name was queried via the cman_tool command, which produced no output. As a result, the HA-LVM messages shown above were logged on the DCN-02 node.


Although the "HA LVM: Improper setup detected" error may seem expected given the current resource agent script (lvm.sh), it is arguably a bug in the context of LVM resource deactivation: the cman/corosync components have already been killed, so executing their commands will always yield no output.
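One way the check could distinguish the two situations is sketched below. This is an assumption about a possible fix, not the shipped resource agent code; the ocf_log and local_node_name stubs simulate the evicted node.

```shell
# Stub for the OCF logging helper (assumption for this sketch).
ocf_log() { echo "$1: $2"; }
# Stub simulating cman being dead: no node name is printed.
local_node_name() { return 1; }

# Defensive variant of the lvm.sh check: fail with a distinct message when
# the node name cannot be determined, instead of misreporting volume_list.
node=$(local_node_name)
if [ -z "$node" ]; then
    ocf_log err "HA LVM: cannot determine local node name; cluster stack may be down"
elif ! lvm dumpconfig activation/volume_list | grep "$node" >/dev/null; then
    ocf_log err "HA LVM: Improper setup detected"
fi
```

With this guard, a stop on an evicted node would log the "cannot determine local node name" message rather than the misleading "Improper setup detected" one.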



Version-Release number of selected component (if applicable):

resource-agents-3.9.2-40.el6.x86_64

How reproducible:


Steps to Reproduce:
1. Node A evicts Node B when B misses qdisk updates
2. Cman components were killed on Node B
3. rgmanager shutdown starts. The file system resource stops successfully, but stopping the HA-LVM resource (i.e., deactivation) fails because the cman processes were killed before rgmanager stopped during the node eviction.



Actual results:

rgmanager[92694]: [lvm] HA LVM:  Improper setup detected
rgmanager[92704]: [lvm] * @ missing from "volume_list" in lvm.conf

Expected results:


HA-LVM resource should be successfully stopped.

Additional info:

Comment 2 Chris Williams 2019-06-18 19:07:24 UTC
When Red Hat shipped 6.8 on May 10, 2016, Red Hat Enterprise Linux 6 entered the Maintenance Support 1 Phase.

https://access.redhat.com/support/policy/updates/errata#Maintenance_Support_1_Phase

That means only "Critical and Important Security errata advisories (RHSAs) and Urgent Priority Bug Fix errata advisories (RHBAs) may be released". RHEL 6 is now in Maintenance Phase 2, and this BZ does not appear to meet Maintenance Support 2 Phase criteria, so it is being closed WONTFIX. If this is critical for your environment, please open a case in the Red Hat Customer Portal (https://access.redhat.com), provide a thorough business justification, and ask that the BZ be re-opened for consideration in the next minor release.