
Bug 1079207

Summary: clustat -i 0 rarely segfaults in flag_rgmanager_nodes() while cluster is repetitively shutting down and starting up
Product: Red Hat Enterprise Linux 6
Reporter: Frantisek Reznicek <freznice>
Component: rgmanager
Assignee: Ryan McCabe <rmccabe>
Status: CLOSED WONTFIX
QA Contact: cluster-qe <cluster-qe>
Severity: low
Docs Contact:
Priority: low
Version: 6.5
CC: cfeist, cluster-maint, esammons, fdinitto, jruemker, mjuricek
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: rgmanager-3.0.12.1-32.el6
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-11-07 21:38:43 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
  cluster.conf (flags: none)

Description Frantisek Reznicek 2014-03-21 08:36:32 UTC
Description of problem:

clustat -i 0 rarely segfaults in flag_rgmanager_nodes() while cluster is repetitively shutting down and starting up.

During layered product testing (MRG/M), I very rarely see clustat -i 0 segfault while the cluster is being toggled up and down.

About 5 segfaults were observed out of thousands of executions.

Cluster Status for dtests_ha @ Fri Mar 21 09:21:16 2014
Member Status: Quorate

Resource Group Manager not running; no service information available.

Membership information not available
Segmentation fault (core dumped)

The core file analysis showed the following:

[root@localhost ~]# gdb `which clustat` core.18989
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-60.el6_4.1)
...
Reading symbols from /usr/sbin/clustat...Reading symbols done.
[New Thread 18989]
Missing separate debuginfo for ...
Core was generated by `clustat -i 0'.
Program terminated with signal 11, Segmentation fault.
#0  flag_rgmanager_nodes (argc=<value optimized out>, argv=<value optimized out>)
    at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/utils/clustat.c:145
145			for (n = 0; n < cml->cml_count; n++) {
Missing separate debuginfos, use: debuginfo-install clusterlib-3.0.12.1-59.el6_5.1.x86_64 corosynclib-1.4.1-17.el6.x86_64 glibc-2.12-1.132.el6.x86_64 libxml2-2.7.6-14.el6.x86_64 ncurses-libs-5.7-3.20090208.el6.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) info thr
* 1 Thread 0x7f569b2a8700 (LWP 18989)  flag_rgmanager_nodes (argc=<value optimized out>, 
    argv=<value optimized out>)
    at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/utils/clustat.c:145
(gdb) t a a bt

Thread 1 (Thread 0x7f569b2a8700 (LWP 18989)):
#0  flag_rgmanager_nodes (argc=<value optimized out>, argv=<value optimized out>)
    at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/utils/clustat.c:145
#1  main (argc=<value optimized out>, argv=<value optimized out>)
    at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/utils/clustat.c:1182
(gdb) quit
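
Given the "Membership information not available" line printed just before the crash, the member list used by that loop is presumably NULL when the membership query races with the cluster shutdown. A minimal, hypothetical sketch of that suspected failure mode and a possible guard (only the cml_count field name comes from the backtrace; the struct and function names below are made up for illustration):

/* Hypothetical sketch, not the actual clustat.c code. */
#include <stddef.h>

struct member_list {
	int cml_count;
	/* ... per-node entries ... */
};

static void flag_nodes_sketch(struct member_list *cml)
{
	int n;

	/* While the cluster is shutting down, the membership query can fail
	 * and hand back NULL; dereferencing cml->cml_count then segfaults,
	 * which would match the crash at clustat.c:145. */
	if (cml == NULL)
		return;		/* possible guard */

	for (n = 0; n < cml->cml_count; n++) {
		/* mark node n as running rgmanager ... */
	}
}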


Version-Release number of selected component (if applicable):
rgmanager-3.0.12.1-19.el6.x86_64

How reproducible:
<0.1% (very difficult)

Steps to Reproduce:
1. Set up clustering for 3 nodes and start the cluster.
2. On cluster nodes A, B, and C, execute clustat -i 0.
3. Stop the cluster using node A.
4. Run the script below on a foreign node N to toggle the cluster in a loop, commanding node A.
5. Nodes B and C eventually dump core while the script is running.
   
Actual results:
  clustat segfaults.

Expected results:
  clustat should not segfault.

Additional info (reproduction script):


while true; do
  ssh root.2.103 'ccs -h $(hostname) -p ricci --stopall'
  ssh root.2.103 'source /tmp/foo.sh;detect_cman_state 30'
  if [ $? == 1 ]; then
    true
  else
    cat /tmp/clustat.log
    break
  fi
  ssh root.2.103 'ccs -h $(hostname) -p ricci --startall'
  ssh root.2.103 'source /tmp/foo.sh;detect_cman_state 30'
  if [ $? == 0 ]; then
    true
  else
    cat /tmp/clustat.log
    break
  fi
done



# in root.2.103:/tmp/foo.sh
# detect_cman_state <to>
function detect_cman_state ()
{
  local int_ts=${SECONDS}
  
  local int_to=60
  [ -n "$1" ] && int_to=$1
  
  local int_ret=5
  
  while true; do
    clustat > /tmp/clustat.log 2>/dev/null
    if [ "$?" != "0" ]; then
      echo "clustat.down $((${SECONDS} - ${int_ts})) secs"
      int_ret=1
      break
    else
      local started_cnt=$(grep 'service:qpidd_' /tmp/clustat.log | grep -c started)
      if [ "${started_cnt}" == "4" ]; then
        echo "clustat.up $((${SECONDS} - ${int_ts})) secs"
        int_ret=0
        break
      fi
    fi
    if [ "${SECONDS}" -gt "$(( ${int_ts} + ${int_to} ))"  ]; then
      int_ret=2
      break
    fi
    sleep 1
  done
  return ${int_ret}
}

Comment 2 Frantisek Reznicek 2014-03-21 08:43:48 UTC
Created attachment 877162 [details]
cluster.conf

The cluster configuration is as follows (although it may not be the key point):
 * three nodes A, B, C
 * all with attached cluster.conf
 * fence_virtd fencing used
 * all machines (A, B, C) are at least 6.5 x86_64, as is the VM provider (N)
 * fencing propagated through separate libvirt isolated network 192.168.10.0/24
 * cluster data propagated through separate libvirt isolated network 192.168.6.0/24
 * all iptables down
 * selinux in Enforcing

Comment 3 Frantisek Reznicek 2014-04-01 08:14:25 UTC
Raising reproducibility to 2%.

Optimized steps to reproduce:
 1. Start the cluster.
 2. Have clustat -i 0 running on all cluster nodes.
 3. Stop the cluster (ccs ... --stopall).
 4. Stop the cluster again (ccs ... --stopall).
 5. Force-kill all RHCS daemons (cman/corosync, ricci, rgmanager).
 6. Start the cluster (ccs ... --startall).
 7. If there is no clustat crash, go to step 3.

Last attempts with rgmanager-3.0.12.1-19.el6.i686.
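
For reference, a rough sketch of a loop driving steps 3-7 (the node names, the daemon kill list, and the core-file check are assumptions; the ccs/ricci options follow the script in the description):

# Rough, hypothetical sketch only.
while true; do
  ccs -h nodeA -p ricci --stopall                          # step 3
  ccs -h nodeA -p ricci --stopall                          # step 4: stop again
  for h in nodeA nodeB nodeC; do
    # step 5: force-kill leftover daemons (cman runs as corosync here),
    # then restart ricci so ccs can still reach the node
    ssh "root@$h" 'killall -9 rgmanager corosync ricci 2>/dev/null; service ricci start'
  done
  ccs -h nodeA -p ricci --startall                         # step 6
  sleep 30                                                 # let the cluster settle
  # step 7: stop looping once one of the clustat -i 0 instances dumps core
  ssh root@nodeB 'ls ~/core.* 2>/dev/null' && break
  ssh root@nodeC 'ls ~/core.* 2>/dev/null' && break
done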

Comment 4 Frantisek Reznicek 2014-04-01 10:49:48 UTC
The latest crashes seen on RHEL 6.5 Server i686 show the same backtrace:

Core was generated by `clustat -i 0'.
Program terminated with signal 11, Segmentation fault.
#0  0x0804b36c in flag_rgmanager_nodes (argc=3, argv=0xbf894954)
    at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/utils/clustat.c:145
145			for (n = 0; n < cml->cml_count; n++) {
Missing separate debuginfos, use: debuginfo-install clusterlib-3.0.12.1-59.el6_5.1.i686 corosynclib-1.4.1-17.el6.i686 ...
(gdb) t a a bt

Thread 1 (Thread 0xb77ce6c0 (LWP 27533)):
#0  0x0804b36c in flag_rgmanager_nodes (argc=3, argv=0xbf894954)
    at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/utils/clustat.c:145
#1  main (argc=3, argv=0xbf894954)
    at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/utils/clustat.c:1182

Comment 6 Frantisek Reznicek 2014-07-15 13:50:01 UTC
The issue is still happening even with the latest rgmanager bits on RHEL 6.5 Server x86_64.

# rpm -q rgmanager
rgmanager-3.0.12.1-19.el6.x86_64
# gdb `which clustat` core.30900 
...
Loaded symbols for /usr/lib64/libcoroipcc.so.4.0.0
Core was generated by `clustat -i0'.
Program terminated with signal 11, Segmentation fault.
#0  flag_rgmanager_nodes (argc=<value optimized out>, argv=<value optimized out>)
    at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/utils/clustat.c:145
145			for (n = 0; n < cml->cml_count; n++) {
(gdb) bt
#0  flag_rgmanager_nodes (argc=<value optimized out>, argv=<value optimized out>)
    at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/utils/clustat.c:145
#1  main (argc=<value optimized out>, argv=<value optimized out>)
    at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/utils/clustat.c:1182
(gdb) quit


Comment 2 still has the correct reproduction steps.

Raising reproducibility to ~50%, as it is easy to trigger this issue just by having terminals open with clustat -i0 (on all nodes) and, in another one, looping ccs -h <> --startall and --stopall.

Comment 14 John Ruemker 2016-08-02 20:29:11 UTC
At least one of the backtraces (comment #6) and the general description of the reproducer are in line with the issue that was just recently fixed in 6.8.z and is set to be fixed in 6.9: https://bugzilla.redhat.com/show_bug.cgi?id=1228170. Can someone who was able to reproduce this try again using the fixed version there?

Comment 22 Chris Feist 2017-11-07 21:38:43 UTC
Red Hat Enterprise Linux 6 is in the Production 3 Phase. During the Production 3 Phase, Critical impact Security Advisories (RHSAs) and selected Urgent Priority Bug Fix Advisories (RHBAs) may be released as they become available.

The official life cycle policy can be reviewed here:

http://redhat.com/rhel/lifecycle

This issue does not meet the inclusion criteria for the Production 3 Phase and will be marked as CLOSED/WONTFIX. If this remains a critical requirement, please contact Red Hat Customer Support to request a re-evaluation of the issue, citing a clear business justification. Note that a strong business justification will be required for re-evaluation. Red Hat Customer Support can be contacted via the Red Hat Customer Portal at the following URL:

https://access.redhat.com/