Bug 1658557 - After node eviction, while rgmanager was stopping, the error "unable to determine cluster node name, HA LVM: Improper setup detected" was logged.
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: cluster
Version: 6.5
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Christine Caulfield
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-12-12 12:13 UTC by SUNGTM
Modified: 2019-06-18 19:07 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-18 19:07:24 UTC
Target Upstream Version:



Description SUNGTM 2018-12-12 12:13:48 UTC
Description of problem:

When cluster node A detects that node B has missed qdisk updates, it sends an eviction notice to B. On node B, the logs show that cman was killed. When rgmanager was then stopped, the error "unable to find cluster node name, HA LVM: Improper setup detected" was logged.


Please confirm that once the cman components have been killed on the affected node, no corosync/cman commands will work. Is that correct?


We checked the source of the RHEL HA resource agent shell scripts to trace the following error messages found in the log:

rgmanager[92694]: [lvm] HA LVM:  Improper setup detected
rgmanager[92704]: [lvm] * @ missing from "volume_list" in lvm.conf
rgmanager[92719]: [lvm] Owner of VG_DB/lv_db is not in the cluster


According to the source in /usr/share/cluster/lvm.sh, if 'volume_list' does not contain the cluster member name, the script logs the errors shown above:

    if ! lvm dumpconfig activation/volume_list | grep $(local_node_name); then
                    ocf_log err "HA LVM:  Improper setup detected"
                    ocf_log err "* @$(local_node_name) missing from \"volume_list\" in lvm.conf"
                    return $OCF_ERR_GENERIC
     fi
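
To illustrate why an empty node name leads into this branch, here is a minimal sketch. It is not the code from lvm.sh: `local_node_name` is stubbed to print nothing (an assumption standing in for the case where cman is down), `fake_dumpconfig` is a hypothetical replacement for `lvm dumpconfig`, and the grep output is discarded for readability. Only the control flow mirrors the snippet above.

```shell
#!/bin/bash
# Hypothetical stand-ins; neither function is taken from the resource agent.
local_node_name() { return 1; }   # prints nothing, as assumed when cman is down
fake_dumpconfig() { echo 'volume_list=["vg_root", "@node2"]'; }

check_volume_list() {
    # Same shape as the check in lvm.sh. With an empty substitution, grep is
    # called without a pattern and fails, so the negated test succeeds.
    if ! fake_dumpconfig | grep $(local_node_name) >/dev/null 2>&1; then
        echo "* @$(local_node_name) missing from \"volume_list\" in lvm.conf"
        return 1
    fi
    return 0
}

check_volume_list || echo "volume_list check failed"
```

With the stub returning an empty name, the message contains a bare "@", which matches the `* @ missing from "volume_list"` line seen in the log.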


The cluster member name is obtained from the function local_node_name, defined in /usr/share/cluster/utils/member_util.sh:

    local_node_name()
    {
           ...
           ...

            if which cman_tool &> /dev/null; then
                    # Use cman_tool

                    line=$(cman_tool status | grep -i "Node name: $1")
                    [ -n "$line" ] || return 1
                    echo ${line/*name: /}
                    return 0
            fi
           ...
           ...

            return 1
    }
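
The parsing step in that function can be tried in isolation. The sketch below is an illustration only: `sample_status` is a hypothetical, heavily shortened stand-in for `cman_tool status` output, and `extract_node_name` is a made-up helper that applies the same grep and parameter expansion to text passed as an argument.

```shell
#!/bin/bash
# Hypothetical sample; real "cman_tool status" output has many more fields.
sample_status() { printf 'Nodes: 2\nNode name: node2\nNode ID: 2\n'; }

extract_node_name() {
    # $1 is the text to parse (empty when the status command prints nothing).
    local line
    line=$(printf '%s\n' "$1" | grep -i "Node name: ")
    [ -n "$line" ] || return 1
    # Strip everything up to and including "name: ".
    echo "${line/*name: /}"
}

extract_node_name "$(sample_status)"
extract_node_name "" || echo "no node name available"
```

When the input is empty, the function returns 1 and prints nothing, which is the situation described for the evicted node.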


Because A had already detected the missed qdisk updates from B, A evicted B, which caused the cman components on B to be killed.

A:
qdiskd[5904]: Writing eviction notice for node 2
qdiskd[5904]: Node 2 evicted

B:

corosync[5860]: cman killed by node 1 because we were killed by cman_tool or other application
rgmanager[8032]: #67: Shutting down uncleanly
fenced[6619]: cluster is down, exiting


Because the cman components were killed on DCN-02, all cman-related commands (cman_tool, ccs, etc.) stop working [PLEASE CONFIRM THIS], and the other processes were exiting as well.
Before the rgmanager process exited (uncleanly), it tried to stop all cluster resources.
Stopping the FS resource succeeded first. While the LVM resource was being stopped, the cluster member name was looked up through the cman_tool command, which produced no output. This caused the HA-LVM messages shown above to be logged on the DCN-02 node.


Although the 'HA LVM: Improper setup detected' error may be the expected behaviour of the existing resource agent script (lvm.sh), it may be a bug in the context of LVM resource deactivation: the cman/corosync components have already been killed, so any command that depends on them will always return nothing.
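
Purely as an illustration of the behaviour being asked for, the sketch below shows one way a stop path could tolerate a missing node name. It is an assumption, not code that exists in resource-agents: `should_verify_volume_list` is a hypothetical helper, and whether skipping the check is safe for HA-LVM would need to be confirmed by the maintainers.

```shell
#!/bin/bash
# Hypothetical helper, not part of lvm.sh or member_util.sh.
# $1: the action being run ("start" or "stop"); $2: the resolved node name.
should_verify_volume_list() {
    local action=$1 node_name=$2
    if [ "$action" = "stop" ] && [ -z "$node_name" ]; then
        # Membership is gone, so the "@node" tag cannot be checked meaningfully.
        return 1
    fi
    return 0
}

if should_verify_volume_list stop ""; then
    echo "verify volume_list"
else
    echo "skip volume_list check, continue deactivation"
fi
```

In this sketch the check is still enforced on start, and on stop whenever a node name is available.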



Version-Release number of selected component (if applicable):

resource-agents-3.9.2-40.el6.x86_64

How reproducible:


Steps to Reproduce:
1. Node A evicts Node B when B misses qdisk updates.
2. The cman components are killed on Node B.
3. rgmanager starts shutting down. The file system resource is stopped successfully, but stopping (i.e. deactivating) the HA-LVM resource fails because the 'cman' processes were killed before rgmanager stopped during the node eviction.



Actual results:

rgmanager[92694]: [lvm] HA LVM:  Improper setup detected
rgmanager[92704]: [lvm] * @ missing from "volume_list" in lvm.conf

Expected results:


HA-LVM resource should be successfully stopped.

Additional info:

Comment 2 Chris Williams 2019-06-18 19:07:24 UTC
When Red Hat shipped 6.8 on May 10, 2016 Red Hat Enterprise Linux 6 entered Maintenance Support 1 Phase.

https://access.redhat.com/support/policy/updates/errata#Maintenance_Support_1_Phase

That means only "Critical and Important Security errata advisories (RHSAs) and Urgent Priority Bug Fix errata advisories (RHBAs) may be released". RHEL 6 is now in Maintenance Support 2 Phase, and this BZ does not appear to meet the Maintenance Support 2 Phase criteria, so it is being closed WONTFIX. If this is critical for your environment, please open a case in the Red Hat Customer Portal, https://access.redhat.com, provide a thorough business justification, and ask that the BZ be re-opened for consideration in the next minor release.

