Bug 1079207
| Summary: | clustat -i 0 rarely segfaults in flag_rgmanager_nodes() while cluster is repetitively shutting down and starting up | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Frantisek Reznicek <freznice> |
| Component: | rgmanager | Assignee: | Ryan McCabe <rmccabe> |
| Status: | CLOSED WONTFIX | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | low | Docs Contact: | |
| Priority: | low | | |
| Version: | 6.5 | CC: | cfeist, cluster-maint, esammons, fdinitto, jruemker, mjuricek |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | rgmanager-3.0.12.1-32.el6 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-11-07 21:38:43 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Created attachment 877162 [details]
cluster.conf
The cluster configuration is as follows (although it may not be the key point):
* three nodes A, B, C
* all with attached cluster.conf
* fence_virtd fencing used
* all machines (A, B, C) at least 6.5 x86_64, as well as the VM provider (N)
* fencing propagated through separate libvirt isolated network 192.168.10.0/24
* cluster data propagated through separate libvirt isolated network 192.168.6.0/24
* all iptables down
* selinux in Enforcing
Raising reproducibility to 2%. Optimized steps to reproduce:
1. start the cluster
2. have clustat -i 0 running on all cluster nodes
3. stop the cluster (ccs ... --stopall)
4. stop the cluster again (ccs ... --stopall)
5. force-kill all RHCS daemons (cman/corosync, ricci, rgmanager)
6. start the cluster (ccs ... --startall)
7. if (no clustat crash) goto step 3

Last attempts were with rgmanager-3.0.12.1-19.el6.i686. The latest crashes seen on RHEL 6.5 Server i686 all show the same backtrace:
```
Core was generated by `clustat -i 0'.
Program terminated with signal 11, Segmentation fault.
#0  0x0804b36c in flag_rgmanager_nodes (argc=3, argv=0xbf894954)
    at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/utils/clustat.c:145
145             for (n = 0; n < cml->cml_count; n++) {
Missing separate debuginfos, use: debuginfo-install clusterlib-3.0.12.1-59.el6_5.1.i686 corosynclib-1.4.1-17.el6.i686 ...
(gdb) t a a bt

Thread 1 (Thread 0xb77ce6c0 (LWP 27533)):
#0  0x0804b36c in flag_rgmanager_nodes (argc=3, argv=0xbf894954)
    at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/utils/clustat.c:145
#1  main (argc=3, argv=0xbf894954)
    at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/utils/clustat.c:1182
```
The issue still happens even with the latest rgmanager bits on RHEL 6.5 Server x86_64.
```
# rpm -q rgmanager
rgmanager-3.0.12.1-19.el6.x86_64
# gdb `which clustat` core.30900
...
Loaded symbols for /usr/lib64/libcoroipcc.so.4.0.0
Core was generated by `clustat -i0'.
Program terminated with signal 11, Segmentation fault.
#0  flag_rgmanager_nodes (argc=<value optimized out>, argv=<value optimized out>)
    at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/utils/clustat.c:145
145             for (n = 0; n < cml->cml_count; n++) {
(gdb) bt
#0  flag_rgmanager_nodes (argc=<value optimized out>, argv=<value optimized out>)
    at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/utils/clustat.c:145
#1  main (argc=<value optimized out>, argv=<value optimized out>)
    at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/utils/clustat.c:1182
(gdb) quit
```
Comment 2 still contains the correct reproduction steps.
Raising reproducibility to ~50%, as it is easy to trigger this issue just by having terminals open with clustat -i0 (on all nodes) while another terminal loops ccs -h <> --startall and --stopall.
At least one of the backtraces (comment #6) and the general description of the reproducer are in line with the issue that was recently fixed in 6.8.z and is set to be fixed in 6.9: https://bugzilla.redhat.com/show_bug.cgi?id=1228170. Can someone who was able to reproduce this try again using the fixed version from there?

Red Hat Enterprise Linux 6 is in the Production 3 Phase. During the Production 3 Phase, Critical impact Security Advisories (RHSAs) and selected Urgent Priority Bug Fix Advisories (RHBAs) may be released as they become available. The official life cycle policy can be reviewed here: http://redhat.com/rhel/lifecycle

This issue does not meet the inclusion criteria for the Production 3 Phase and will be marked as CLOSED/WONTFIX. If this remains a critical requirement, please contact Red Hat Customer Support to request a re-evaluation of the issue, citing a clear business justification. Note that a strong business justification will be required for re-evaluation. Red Hat Customer Support can be contacted via the Red Hat Customer Portal at the following URL: https://access.redhat.com/
Description of problem:
clustat -i 0 rarely segfaults in flag_rgmanager_nodes() while the cluster is repetitively shutting down and starting up. During layered product testing (MRG/M) I very rarely see clustat -i 0 segfault while toggling the cluster up and down. About 5 segfaults were seen over thousands of executions.

```
Cluster Status for dtests_ha @ Fri Mar 21 09:21:16 2014
Member Status: Quorate
Resource Group Manager not running; no service information available.
Membership information not available
Segmentation fault (core dumped)
```

The core file analysis showed the following:

```
[root@localhost ~]# gdb `which clustat` core.18989
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-60.el6_4.1)
...
Reading symbols from /usr/sbin/clustat...Reading symbols done.
[New Thread 18989]
Missing separate debuginfo for ...
Core was generated by `clustat -i 0'.
Program terminated with signal 11, Segmentation fault.
#0  flag_rgmanager_nodes (argc=<value optimized out>, argv=<value optimized out>)
    at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/utils/clustat.c:145
145             for (n = 0; n < cml->cml_count; n++) {
Missing separate debuginfos, use: debuginfo-install clusterlib-3.0.12.1-59.el6_5.1.x86_64 corosynclib-1.4.1-17.el6.x86_64 glibc-2.12-1.132.el6.x86_64 libxml2-2.7.6-14.el6.x86_64 ncurses-libs-5.7-3.20090208.el6.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) info thr
* 1 Thread 0x7f569b2a8700 (LWP 18989)  flag_rgmanager_nodes (argc=<value optimized out>, argv=<value optimized out>)
    at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/utils/clustat.c:145
(gdb) t a a bt

Thread 1 (Thread 0x7f569b2a8700 (LWP 18989)):
#0  flag_rgmanager_nodes (argc=<value optimized out>, argv=<value optimized out>)
    at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/utils/clustat.c:145
#1  main (argc=<value optimized out>, argv=<value optimized out>)
    at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/utils/clustat.c:1182
(gdb) quit
```

Version-Release number of selected component (if applicable):
rgmanager-3.0.12.1-19.el6.x86_64

How reproducible:
<0.1% (very difficult)

Steps to Reproduce:
1. set up clustering for 3 nodes, start the cluster
2. on cluster nodes A, B, C execute clustat -i 0
3. stop the cluster using node A
4. run the script below on foreign node N to toggle the cluster in a loop, commanding node A
5. nodes B and C eventually core dump at the moment of script execution

Actual results:
clustat segfaults.

Expected results:
clustat should not segfault.

Additional info (reproduction script):

```
while true; do
    ssh root.2.103 'ccs -h $(hostname) -p ricci --stopall'
    ssh root.2.103 'source /tmp/foo.sh; detect_cman_state 30'
    if [ $? == 1 ]; then
        true
    else
        cat /tmp/clustat.log
        break
    fi
    ssh root.2.103 'ccs -h $(hostname) -p ricci --startall'
    ssh root.2.103 'source /tmp/foo.sh; detect_cman_state 30'
    if [ $? == 0 ]; then
        true
    else
        cat /tmp/clustat.log
        break
    fi
done
```

```
# in root.2.103:/tmp/foo.sh
# detect_cman_state <to>
function detect_cman_state () {
    local int_ts=${SECONDS}
    local int_to=60
    [ -n "$1" ] && int_to=$1
    local int_ret=5
    while true; do
        # equivalent to the original `&> /tmp/clustat.log 2>/dev/null`:
        # stdout to the log, stderr discarded
        clustat > /tmp/clustat.log 2>/dev/null
        if [ "$?" != "0" ]; then
            echo "clustat.down $((${SECONDS} - ${int_ts})) secs"
            int_ret=1
            break
        else
            local started_cnt=$(grep 'service:qpidd_' /tmp/clustat.log | grep -c started)
            if [ "${started_cnt}" == "4" ]; then
                echo "clustat.up $((${SECONDS} - ${int_ts})) secs"
                int_ret=0
                break
            fi
        fi
        if [ "${SECONDS}" -gt "$(( ${int_ts} + ${int_to} ))" ]; then
            int_ret=2
            break
        fi
        sleep 1
    done
    return ${int_ret}
}
```