Bug 1022050

Summary: libqb-0.16.0 breaks pacemaker-1.1.8
Product: Red Hat Enterprise Linux 6 Reporter: Matthew Mosesohn <mmosesohn>
Component: libqb Assignee: David Vossel <dvossel>
Status: CLOSED CURRENTRELEASE QA Contact: Cluster QE <mspqa-list>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 6.4 CC: cluster-maint, contact, fdinitto, toracat
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-01-09 14:33:30 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Attachments:
Description Flags
crm_report data none

Description Matthew Mosesohn 2013-10-22 14:40:31 UTC
Description of problem:
libqb-0.16.0 was released accidentally, breaking corosync and pacemaker deployments

Version-Release number of selected component (if applicable):
libqb-0.16.0-1.el6_4.1.x86_64
pacemaker-1.1.8-7.el6.x86_64
corosync-1.4.1-15.el6_4.1.x86_64

Additional info:
Steps to reproduce are a bit tricky to share here, but I can post crm_report data. This may severely impact production deployments that accidentally run yum update, because it prevents any changes to the existing configuration.


Developers Fabio Di Nitto and David Vossel confirmed that this release was accidental and that it will remain a problem until either the package is recalled from the RHEL HA repository or pacemaker and corosync updates are released.

Comment 1 Matthew Mosesohn 2013-10-22 14:45:43 UTC
Created attachment 815024 [details]
crm_report data

Comment 3 David Vossel 2013-10-22 14:58:39 UTC
The crm_report is lacking the pacemaker log portion.  I need to see that to understand why this is failing.

Comment 4 David Vossel 2013-10-22 20:04:10 UTC
I was able to reproduce this.

This issue has to do with not properly removing the ipc server's client connections from mainloop in pacemaker.  There was a race condition in pacemaker that caused mainloop to dispatch an fd that libqb had already told us to remove.
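
For illustration only, here is a minimal toy model of that kind of race, written in Python. It is not the upstream patch, and every name in it (ToyMainloop, run_demo, the fd numbers 3 and 4) is hypothetical rather than a pacemaker, GLib, or libqb API. It assumes a loop that snapshots the ready fds for one iteration, so a callback that removes another fd mid-batch leaves a stale entry behind unless the loop re-checks before dispatching.

```python
# Toy model only: these names are hypothetical and are NOT the pacemaker,
# GLib mainloop, or libqb APIs.

class ToyMainloop:
    def __init__(self, check_removed):
        self.sources = {}            # fd -> callback currently registered
        self.check_removed = check_removed
        self.stale_dispatches = []   # fds dispatched after they were removed

    def add(self, fd, callback):
        self.sources[fd] = callback

    def remove(self, fd):
        # Stands in for the IPC library telling us to stop polling this fd.
        self.sources.pop(fd, None)

    def run_once(self, ready_fds):
        # Snapshot the callbacks for every ready fd, like one poll() batch.
        batch = [(fd, self.sources[fd]) for fd in ready_fds if fd in self.sources]
        for fd, callback in batch:
            if fd not in self.sources:
                if self.check_removed:
                    continue         # fixed behaviour: skip the removed fd
                self.stale_dispatches.append(fd)
            callback(self, fd)


def run_demo(check_removed):
    loop = ToyMainloop(check_removed)
    handled = []

    def on_disconnect(loop, fd):
        handled.append(fd)
        loop.remove(4)               # tearing down fd 3 also removes fd 4

    def on_data(loop, fd):
        handled.append(fd)

    loop.add(3, on_disconnect)
    loop.add(4, on_data)
    loop.run_once([3, 4])            # both fds are ready in the same batch
    return handled, loop.stale_dispatches


if __name__ == "__main__":
    print("without check:", run_demo(False))
    print("with check:   ", run_demo(True))
```

Without the check, fd 4 is still dispatched even though it was removed earlier in the same batch. With the check, it is skipped.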

Here is the related upstream patch that resolved this in pacemaker.
https://github.com/ClusterLabs/pacemaker/commit/0628021134835ea1c683d0d70cef2a4112e08404

Libqb's example server implementation received this same change.
https://github.com/ClusterLabs/libqb/commit/1e1397fb22c04e46197873b116c6798892a29ee3

I'm not entirely sure why this appeared stable with the old version of libqb.  There were a couple of libqb reference leaks that were discovered between 0.14.4 and 0.16.0.  My guess is that those leaks covered up the issue.  When I fixed the leaks, the mainloop issue likely appeared, causing me to fix that as well.

The end result here is that libqb 0.16.0 should only be used with pacemaker 1.1.10 or greater.
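
A rough way to check an installed pacemaker version against that constraint is sketched below in Python. The helper names are hypothetical, and the parsing assumes a plain dotted numeric version with an optional "-release" suffix, as in the package names listed in the description. Note that comparing the raw strings would give the wrong answer, since "1.1.8" sorts after "1.1.10" as text.

```python
# Hypothetical helpers for illustration; not part of pacemaker or libqb.

def parse_version(version):
    # "1.1.8-7.el6" -> (1, 1, 8); the RPM release suffix is ignored.
    upstream = version.split("-")[0]
    return tuple(int(part) for part in upstream.split("."))


def pacemaker_ok_for_libqb_0_16(pacemaker_version):
    # Per the comment above: libqb 0.16.0 needs pacemaker 1.1.10 or greater.
    return parse_version(pacemaker_version) >= (1, 1, 10)


if __name__ == "__main__":
    for candidate in ("1.1.8-7.el6", "1.1.10"):
        print(candidate, pacemaker_ok_for_libqb_0_16(candidate))
```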

-- Vossel

Comment 5 RHEL Program Management 2013-10-26 17:14:54 UTC
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.