Bug 863940

Summary: Don't call sync_* funcs for unloaded services OR currently unloading services
Product: Red Hat Enterprise Linux 6 Reporter: Jan Friesse <jfriesse>
Component: corosync Assignee: Jan Friesse <jfriesse>
Status: CLOSED ERRATA QA Contact: Cluster QE <mspqa-list>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 6.3 CC: jkortus, sdake, tlavigne
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: All   
Whiteboard:
Fixed In Version: corosync-1.4.1-13.el6 Doc Type: Bug Fix
Doc Text:
Cause: Corosync is stopped on multiple nodes. Consequence: Occasionally, corosync segfaults. Fix: The patches ensure that the sync service doesn't call callbacks on unloaded services. Result: Corosync no longer segfaults.
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-02-21 07:50:56 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 830799, 895654    
Attachments:
Description Flags
2012-10-18-0003-Make-service_build-contain-correct-number-of-msgs none
2012-10-18-0002-Handle-sync-and-service-unload-correctly none
2012-10-18-0001-Don-t-call-sync_-funcs-for-unloaded-services none

Description Jan Friesse 2012-10-08 06:43:40 UTC
Description of problem:
SSIA (the summary says it all)

Version-Release number of selected component (if applicable):
Everywhere

How reproducible:
0.01%

Steps to Reproduce:

First apply the patches from https://bugzilla.redhat.com/show_bug.cgi?id=830799. Because they introduce additional traffic in the sync_* functions, sync is a little slower, so the chance of calling a sync_* function on an already unloaded service is higher, resulting in a SEGFAULT in the CPG service.

Now run the CTD StopAll test (or an equivalent, that is, anything that repeatedly starts corosync on many nodes and then stops it on many nodes in a cycle).

  
Actual results:
Segfault when unloading service (corosync exit)

Expected results:
No Segfault.

Additional info:
Patches fed7fc23e14e098dbb52842a4c79879a376f6ded and 6f6988afff632c6c5068becc855aa4a37a656183 are already upstream. They must be backported.

Comment 1 Jan Friesse 2012-10-18 08:38:53 UTC
Created attachment 629238 [details]
2012-10-18-0003-Make-service_build-contain-correct-number-of-msgs


Make service_build contain correct number of msgs

Signed-off-by: Jan Friesse <jfriesse>
Reviewed-by: Fabio M. Di Nitto <fdinitto>
(backported from commit a273be58ae97f192712661b2f5a19f8d89183065)

Comment 2 Jan Friesse 2012-10-18 08:39:22 UTC
Created attachment 629239 [details]
2012-10-18-0002-Handle-sync-and-service-unload-correctly


Handle sync and service unload correctly

When sync has started and a service is unloaded in the meantime, sync can
call sync_* functions on the unloaded service.

Signed-off-by: Jan Friesse <jfriesse>
Reviewed-by: Fabio M. Di Nitto <fdinitto>
(backported from commit 6f6988afff632c6c5068becc855aa4a37a656183)
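
The patch itself is not reproduced in this report. As a rough illustration of the situation the commit message above describes, the following is a minimal sketch, assuming a sync round that walks a list of entries one step at a time. All names here (sync_entry, sync_state, sync_service_unloaded, sync_step) are hypothetical and are not the corosync API. When a service is unloaded mid-round, its entry is marked so that later steps skip it instead of calling into unloaded code.

```c
#include <assert.h>
#include <stddef.h>

#define SYNC_MAX 8

/* Hypothetical entry in an in-progress sync round (not a corosync type). */
struct sync_entry {
    unsigned int service_id;
    int loaded;                  /* cleared if the service is unloaded mid-sync */
    void (*sync_process)(void);  /* callback owned by the service */
};

/* Hypothetical state of one sync round. */
struct sync_state {
    struct sync_entry entries[SYNC_MAX];
    size_t count;    /* number of valid entries */
    size_t current;  /* index of the next entry to process */
};

/* Counting callback used only to observe which entries were processed. */
static int process_calls;
static void counting_process(void) { process_calls++; }

/* Called when a service is unloaded: invalidate its entry in the running round. */
static void sync_service_unloaded(struct sync_state *st, unsigned int service_id)
{
    assert(st != NULL);
    for (size_t i = 0; i < st->count; i++) {
        if (st->entries[i].service_id == service_id) {
            st->entries[i].loaded = 0;
            st->entries[i].sync_process = NULL;
        }
    }
}

/* Process the next entry: 1 = callback called, 0 = skipped, -1 = round finished. */
static int sync_step(struct sync_state *st)
{
    assert(st != NULL);
    if (st->current >= st->count) {
        return -1;
    }
    struct sync_entry *e = &st->entries[st->current++];
    if (!e->loaded || e->sync_process == NULL) {
        return 0;  /* service went away during the round: do not call into it */
    }
    e->sync_process();
    return 1;
}
```

In this sketch, unloading service 2 after the first step makes the second sync_step() return 0 rather than invoking a stale callback.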

Comment 3 Jan Friesse 2012-10-18 08:39:41 UTC
Created attachment 629240 [details]
2012-10-18-0001-Don-t-call-sync_-funcs-for-unloaded-services


Don't call sync_* funcs for unloaded services

When a service is unloaded, sync shouldn't call the sync_init|process|activate
and abort functions. This happens very rarely, but while all services are
being unloaded, totem can recreate the membership and bad things can happen
(the service is unloaded, so already freed memory may be accessed,
...)

The solution is to fetch the services' sync handlers every time we
build the service list instead of using a precreated one.

Signed-off-by: Jan Friesse <jfriesse>
Reviewed-by: Steven Dake <sdake>
(backported from commit fed7fc23e14e098dbb52842a4c79879a376f6ded)
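
The idea in the commit message above (rebuild the list from the live service table each time, rather than reusing a precreated one) can be sketched as follows. This is only an illustration under assumptions: service_table, sync_handlers, build_sync_list and run_sync_round are hypothetical names, and a NULL slot is assumed to mean "unloaded". It is not the actual corosync code.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SERVICES 4

/* Hypothetical set of sync handlers exported by a service. */
struct sync_handlers {
    const char *name;
    void (*sync_init)(void);
};

/* Hypothetical live service table: a NULL slot means the service is unloaded. */
static const struct sync_handlers *service_table[MAX_SERVICES];

/* Counting callback used only to observe how many services were called. */
static int init_calls;
static void count_init(void) { init_calls++; }

/* Build the sync list from the live table on every call, skipping unloaded slots. */
static size_t build_sync_list(const struct sync_handlers **out, size_t cap)
{
    size_t n = 0;

    assert(out != NULL);
    for (size_t i = 0; i < MAX_SERVICES && n < cap; i++) {
        if (service_table[i] != NULL && service_table[i]->sync_init != NULL) {
            out[n++] = service_table[i];
        }
    }
    return n;
}

/* One sync round: fetch a fresh list, then call sync_init on each loaded service. */
static int run_sync_round(void)
{
    const struct sync_handlers *list[MAX_SERVICES];
    size_t n = build_sync_list(list, MAX_SERVICES);

    for (size_t i = 0; i < n; i++) {
        list[i]->sync_init();
    }
    return (int)n;
}
```

Because the list is fetched at the start of every round, a service whose slot was cleared between two rounds is simply absent from the second one. A list built once at startup would still hold the stale pointer.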

Comment 4 Jan Friesse 2012-10-18 08:44:24 UTC
"Unit" test:
https://github.com/jfriesse/csts/commit/0ce085de54ddefb249a684baf2079bbd815f5135

Before running the unit test, apply the https://bugzilla.redhat.com/show_bug.cgi?id=830799 patches to reproduce the segfault easily. The test is quite reliable, but depends on the HW (number of cores, ...) because it's a race condition. On one HW (VMs) the success rate is 50%; on another (again a VM) the success rate is ~10%.

Comment 8 errata-xmlrpc 2013-02-21 07:50:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-0497.html