Bug 863940 - Don't call sync_* funcs for unloaded services OR currently unloading services
Summary: Don't call sync_* funcs for unloaded services OR currently unloading services
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: corosync
Version: 6.3
Hardware: All
OS: All
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Jan Friesse
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks: 830799 895654
 
Reported: 2012-10-08 06:43 UTC by Jan Friesse
Modified: 2013-02-21 07:50 UTC

Fixed In Version: corosync-1.4.1-13.el6
Doc Type: Bug Fix
Doc Text:
Cause: Corosync is stopped on multiple nodes. Consequence: Occasionally, corosync segfaults. Fix: The patches ensure that the sync service doesn't call callbacks on unloaded services. Result: Corosync no longer segfaults.
Clone Of:
Environment:
Last Closed: 2013-02-21 07:50:56 UTC
Target Upstream Version:


Attachments
2012-10-18-0003-Make-service_build-contain-correct-number-of-msgs (1.02 KB, patch)
2012-10-18 08:38 UTC, Jan Friesse
2012-10-18-0002-Handle-sync-and-service-unload-correctly (9.05 KB, patch)
2012-10-18 08:39 UTC, Jan Friesse
2012-10-18-0001-Don-t-call-sync_-funcs-for-unloaded-services (4.82 KB, patch)
2012-10-18 08:39 UTC, Jan Friesse


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2013:0497 normal SHIPPED_LIVE corosync bug fix and enhancement update 2013-02-20 21:18:24 UTC

Description Jan Friesse 2012-10-08 06:43:40 UTC
Description of problem:
SSIA

Version-Release number of selected component (if applicable):
Everywhere

How reproducible:
0.01%

Steps to Reproduce:

First apply the patches from https://bugzilla.redhat.com/show_bug.cgi?id=830799. Because they introduce additional traffic in the sync_* functions, sync is a little slower, so the chance of calling a sync_* function on an already unloaded service is higher, resulting in a SEGFAULT in the CPG service.

Now run the CTD StopAll test (or an equivalent, i.e. something that just starts corosync on many nodes and stops corosync on many nodes in a cycle).

  
Actual results:
Segfault when unloading service (corosync exit)

Expected results:
No Segfault.

Additional info:
Patches fed7fc23e14e098dbb52842a4c79879a376f6ded and 6f6988afff632c6c5068becc855aa4a37a656183 are in upstream. They must be backported.

Comment 1 Jan Friesse 2012-10-18 08:38:53 UTC
Created attachment 629238 [details]
2012-10-18-0003-Make-service_build-contain-correct-number-of-msgs


Make service_build contain correct number of msgs

Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
(backported from commit a273be58ae97f192712661b2f5a19f8d89183065)

Comment 2 Jan Friesse 2012-10-18 08:39:22 UTC
Created attachment 629239 [details]
2012-10-18-0002-Handle-sync-and-service-unload-correctly


Handle sync and service unload correctly

When sync has started and a service is unloaded in the meantime, sync can
call sync_* functions on the unloaded service.

Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
(backported from commit 6f6988afff632c6c5068becc855aa4a37a656183)
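
The race described in this commit message can be illustrated with a minimal sketch. It is not the code from the patch: `sync_entry`, `sync_list_add`, `sync_notify_unload`, `sync_process_all` and `counting_process` are hypothetical names invented for illustration. The idea shown is that the unload path marks the entry in the in-progress sync list before the service goes away, so later sync steps skip it instead of calling into an unloaded service.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SYNC_LIST_MAX 8

/* Hypothetical entry in the list used by an in-progress sync round. */
struct sync_entry {
	int service_id;
	bool active;                /* false once the service is unloaded */
	int (*sync_process)(void);  /* handler owned by the service */
};

static struct sync_entry sync_list[SYNC_LIST_MAX];
static size_t sync_list_len;

/* Example handler that only counts how often it was called. */
static int process_calls;
static int counting_process(void)
{
	process_calls++;
	return 0;
}

/* Register a handler for the current round; returns -1 if the list is full. */
static int sync_list_add(int service_id, int (*process)(void))
{
	if (sync_list_len >= SYNC_LIST_MAX) {
		return -1;
	}
	sync_list[sync_list_len].service_id = service_id;
	sync_list[sync_list_len].active = true;
	sync_list[sync_list_len].sync_process = process;
	sync_list_len++;
	return 0;
}

/* Hypothetical hook: the unload path calls this before freeing the service. */
static void sync_notify_unload(int service_id)
{
	for (size_t i = 0; i < sync_list_len; i++) {
		if (sync_list[i].service_id == service_id) {
			sync_list[i].active = false;
		}
	}
}

/* Call sync_process on every entry that is still active; return the count. */
static size_t sync_process_all(void)
{
	size_t called = 0;

	for (size_t i = 0; i < sync_list_len; i++) {
		if (!sync_list[i].active) {
			continue;  /* service was unloaded after the round started */
		}
		sync_list[i].sync_process();
		called++;
	}
	return called;
}
```

In this sketch, a caller would register two services with `sync_list_add`, call `sync_notify_unload` for one of them while the round is running, and then see `sync_process_all` invoke only the remaining handler.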

Comment 3 Jan Friesse 2012-10-18 08:39:41 UTC
Created attachment 629240 [details]
2012-10-18-0001-Don-t-call-sync_-funcs-for-unloaded-services


Don't call sync_* funcs for unloaded services

When a service is unloaded, sync shouldn't call its sync_init|process|activate
and abort functions. This happens very rarely, but while all services are
being unloaded, totem can recreate the membership and bad things can happen
(the service is unloaded, so already freed memory may be accessed,
...)

The solution is to fetch the services' sync handlers every time the
service list is built instead of using a precreated one.

Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
(backported from commit fed7fc23e14e098dbb52842a4c79879a376f6ded)
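
The approach in this commit message (fetch the handlers each time the service list is built) can be sketched as follows. This is a simplified illustration, not the patched corosync code: `sync_callbacks`, `loaded_services`, `build_service_list` and `run_sync_round` are hypothetical names, and a NULL slot in the table is assumed to mean "service unloaded".

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SERVICES 4

/* Hypothetical stand-in for the sync handlers a service exports. */
struct sync_callbacks {
	const char *name;
	void (*sync_init)(void);
};

/* Hypothetical service table; a NULL slot means the service is unloaded. */
static const struct sync_callbacks *loaded_services[MAX_SERVICES];

/* Example handler that only counts how often it was called. */
static int init_calls;
static void count_init(void)
{
	init_calls++;
}

static const struct sync_callbacks cpg_cb = { "cpg", count_init };
static const struct sync_callbacks cfg_cb = { "cfg", count_init };

/* Build the list from the current table instead of a copy made at startup. */
static size_t build_service_list(const struct sync_callbacks **out, size_t cap)
{
	size_t n = 0;

	for (size_t i = 0; i < MAX_SERVICES && n < cap; i++) {
		if (loaded_services[i] != NULL) {
			out[n++] = loaded_services[i];
		}
	}
	return n;
}

/* One sync round: rebuild the list, then call sync_init on each entry. */
static size_t run_sync_round(void)
{
	const struct sync_callbacks *list[MAX_SERVICES];
	size_t n = build_service_list(list, MAX_SERVICES);

	for (size_t i = 0; i < n; i++) {
		list[i]->sync_init();
	}
	return n;
}
```

Because the list is rebuilt at the start of every round, a service whose slot was cleared between two rounds is never called in the later round. A list built once at startup would still hold the stale pointer.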

Comment 4 Jan Friesse 2012-10-18 08:44:24 UTC
"Unit" test:
https://github.com/jfriesse/csts/commit/0ce085de54ddefb249a684baf2079bbd815f5135

Before running the unit test, apply the https://bugzilla.redhat.com/show_bug.cgi?id=830799 patches to reproduce the segfault easily. The test is quite reliable, but because it's a race condition it depends on the HW (number of cores, ...). On one set of HW (VMs) the success rate is 50%; on another (again VMs) the success rate is ~10%.

Comment 8 errata-xmlrpc 2013-02-21 07:50:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-0497.html

