Bug 689816

Summary: cmirror does not handle (POLLHUP|POLLERR|POLLNVAL)
Product: Red Hat Enterprise Linux 6
Reporter: Jonathan Earl Brassow <jbrassow>
Component: lvm2
Assignee: Jonathan Earl Brassow <jbrassow>
Status: CLOSED WONTFIX
QA Contact: Corey Marthaler <cmarthal>
Severity: medium
Docs Contact:
Priority: medium
Version: 6.1
CC: agk, dwysocha, heinzm, jbrassow, prajnoha, prockai, thornber, zkabelac
Target Milestone: rc
Target Release: 6.2
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-10-15 22:07:13 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Bug Depends On:
Bug Blocks: 756082

Description Jonathan Earl Brassow 2011-03-22 14:21:51 UTC
If cman/corosync is restarted while cmirror is running, cmirror will never reconnect.
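
For illustration only, here is a minimal sketch of how a daemon's poll() loop can notice that the peer on a descriptor has gone away, which is the condition named in the summary. The helper name check_fd and its return codes are hypothetical and are not taken from the cmirrord source; the sketch uses only the standard poll(2) interface.

```c
#include <poll.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from cmirrord): poll a single descriptor.
 * Returns  1 if data is ready to read,
 *          0 if the timeout expired with nothing to do,
 *         -1 if poll() failed or the descriptor is unusable
 *            (POLLERR, POLLNVAL, or POLLHUP with no data left).
 */
static int check_fd(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    int r = poll(&pfd, 1, timeout_ms);

    if (r < 0)
        return -1;              /* poll() itself failed */
    if (r == 0)
        return 0;               /* timeout */
    if (pfd.revents & (POLLERR | POLLNVAL))
        return -1;              /* error or invalid descriptor */
    if (pfd.revents & POLLIN)
        return 1;               /* drain pending data before acting on a hangup */
    if (pfd.revents & POLLHUP)
        return -1;              /* peer closed the connection */
    return 0;
}
```

A caller that gets -1 would treat the connection as lost rather than keep polling it; what to do next (reconnect or refuse) is a separate policy decision.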

Comment 2 RHEL Program Management 2011-04-04 01:47:49 UTC
Since RHEL 6.1 External Beta has begun, and this bug remains
unresolved, it has been rejected as it is not proposed as
exception or blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.

Comment 3 Corey Marthaler 2011-06-03 17:53:55 UTC
Adding QA ack for 6.2.

However, Devel will need to provide unit testing results before this bug can
ultimately be verified by QA.

Comment 4 Corey Marthaler 2011-08-18 15:59:18 UTC
This issue still exists in the latest 6.2 build; however, I see that it isn't currently included in the 6.2 lvm2 errata.


[root@hayes-01 ~]# service clvmd start
Starting clvmd: 
Activating VG(s): 
[HANG]


hayes-01:
device-mapper: dm-log-userspace: [pERyFOX6] Request timed out: [5/201681] - retrying
Aug 18 10:54:33 hayes-01 kernel: device-mapper: dm-log-userspace: [pERyFOX6] Request timed out: [5/201681] - retrying
device-mapper: dm-log-userspace: [pERyFOX6] Request timed out: [5/201682] - retrying


hayes-03:
Aug 18 10:50:12 hayes-03 cmirrord[8772]: [pERyFOX6] Failed to open checkpoint for 1: SA_AIS_ERR_LIBRARY
Aug 18 10:50:11 hayes-03 kernel: device-mapper: dm-log-userspace: [pERyFOX6] Request timed out: [13/1919] - retrying
Aug 18 10:50:12 hayes-03 cmirrord[8772]: [pERyFOX6] Failed to export checkpoint for 1


2.6.32-188.el6.x86_64

lvm2-2.02.87-1.el6    BUILT: Fri Aug 12 06:11:57 CDT 2011
lvm2-libs-2.02.87-1.el6    BUILT: Fri Aug 12 06:11:57 CDT 2011
lvm2-cluster-2.02.87-1.el6    BUILT: Fri Aug 12 06:11:57 CDT 2011
udev-147-2.37.el6    BUILT: Wed Aug 10 07:48:15 CDT 2011
device-mapper-1.02.66-1.el6    BUILT: Fri Aug 12 06:11:57 CDT 2011
device-mapper-libs-1.02.66-1.el6    BUILT: Fri Aug 12 06:11:57 CDT 2011
device-mapper-event-1.02.66-1.el6    BUILT: Fri Aug 12 06:11:57 CDT 2011
device-mapper-event-libs-1.02.66-1.el6    BUILT: Fri Aug 12 06:11:57 CDT 2011
cmirror-2.02.87-1.el6    BUILT: Fri Aug 12 06:11:57 CDT 2011

Comment 5 Jonathan Earl Brassow 2011-08-18 17:16:23 UTC
Hmmm, this is going to have to wait.  The changes are much more involved and intrusive than I thought.  It is not simply a matter of responding to SIGHUP and reconnecting to corosync.  There is live state on the system that must be transmitted - probably via checkpoint.  It is much different from the start-up scenario, where incoming nodes do not already have a live impression of the system.

Additionally, this bug was filed (by me) to see if we could better handle a situation that isn't allowed in the first place - that is, shutting down a service that cmirrord depends on without first shutting down cmirrord.

I'm pushing this out to 6.3 and changing the scope of this bug.  If we are trying to protect against the scenario where everything is shut down except cmirrord, and then those things are restarted and cmirrord is expected to work, then I will simply check whether there are any active logs and, if there are none, re-form the connection with corosync.  If there are active logs, then the reconnect will be refused.  This simplified handling should be more than sufficient for a scenario that is not allowed.
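
The simplified policy described above could be sketched roughly as follows. This is an assumption-laden illustration, not the cmirrord implementation: struct log_entry, count_active_logs and decide_reconnect are hypothetical names, and the real daemon tracks its logs differently.

```c
#include <stddef.h>

/* Hypothetical outcome of a reconnect request. */
enum reconnect_decision { RECONNECT_ALLOWED, RECONNECT_REFUSED };

/* Hypothetical stand-in for the daemon's per-log bookkeeping. */
struct log_entry {
    const char *uuid;   /* log identifier, e.g. "pERyFOX6" */
    int active;         /* non-zero while the log is in use */
};

/* Count the logs that are still active. */
static size_t count_active_logs(const struct log_entry *logs, size_t n)
{
    size_t i, active = 0;

    for (i = 0; i < n; i++)
        if (logs[i].active)
            active++;
    return active;
}

/*
 * Allow re-forming the cluster connection only when no log is active;
 * with live logs there is state that would have to be transferred,
 * so the reconnect is refused.
 */
static enum reconnect_decision
decide_reconnect(const struct log_entry *logs, size_t n)
{
    return count_active_logs(logs, n) == 0 ? RECONNECT_ALLOWED
                                           : RECONNECT_REFUSED;
}
```

Keeping the decision separate from the actual reconnect call makes the rule easy to unit test without a running cluster stack.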