Bug 689816 - cmirror does not handle (POLLHUP|POLLERR|POLLINVAL)
Summary: cmirror does not handle (POLLHUP|POLLERR|POLLINVAL)
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: lvm2
Version: 6.1
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: 6.2
Assignee: Jonathan Earl Brassow
QA Contact: Corey Marthaler
URL:
Whiteboard:
Depends On:
Blocks: 756082
 
Reported: 2011-03-22 14:21 UTC by Jonathan Earl Brassow
Modified: 2012-10-15 22:07 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-10-15 22:07:13 UTC
Target Upstream Version:



Description Jonathan Earl Brassow 2011-03-22 14:21:51 UTC
If cman/corosync is restarted while cmirror is running, cmirror will never reconnect.
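For illustration, the kind of check the summary refers to can be sketched as below. This is a minimal sketch, not cmirrord's actual code: `check_fd` and the `fd_state` values are hypothetical names, and the descriptor is assumed to be whatever cmirrord polls for cluster traffic. Note that the POSIX flag is spelled POLLNVAL. poll() reports POLLHUP, POLLERR and POLLNVAL in revents even when they are not requested in events, so a loop that only tests POLLIN never notices that the peer went away.

```c
#include <errno.h>
#include <poll.h>

/* Hypothetical result codes for this sketch; not taken from cmirrord. */
enum fd_state { FD_IDLE = 0, FD_READABLE = 1, FD_LOST = 2 };

/*
 * Poll a single descriptor and classify the outcome.
 * POLLHUP, POLLERR and POLLNVAL are always reported in revents,
 * whether or not they were requested, so test them explicitly.
 */
enum fd_state check_fd(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
	int r = poll(&pfd, 1, timeout_ms);

	if (r < 0)
		/* Interrupted by a signal: report idle so the caller polls again. */
		return (errno == EINTR) ? FD_IDLE : FD_LOST;
	if (r == 0)
		return FD_IDLE;		/* timeout, nothing happened */

	/*
	 * Error conditions are checked before POLLIN, so data still queued
	 * on a hung-up descriptor is dropped in this sketch.
	 */
	if (pfd.revents & (POLLHUP | POLLERR | POLLNVAL))
		return FD_LOST;
	if (pfd.revents & POLLIN)
		return FD_READABLE;
	return FD_IDLE;
}
```

A caller that gets FD_LOST would close the descriptor and decide whether to attempt a reconnect; a real daemon may prefer to drain pending data before treating the connection as lost.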

Comment 2 RHEL Program Management 2011-04-04 01:47:49 UTC
Since RHEL 6.1 External Beta has begun, and this bug remains
unresolved, it has been rejected as it is not proposed as
exception or blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.

Comment 3 Corey Marthaler 2011-06-03 17:53:55 UTC
Adding QA ack for 6.2.

However, devel will need to provide unit test results before this bug can
ultimately be verified by QA.

Comment 4 Corey Marthaler 2011-08-18 15:59:18 UTC
This issue still exists in the latest 6.2 build; however, I see that it isn't currently included in the 6.2 lvm2 errata.


[root@hayes-01 ~]# service clvmd start
Starting clvmd: 
Activating VG(s): 
[HANG]


hayes-01:
device-mapper: dm-log-userspace: [pERyFOX6] Request timed out: [5/201681] - retrying
Aug 18 10:54:33 hayes-01 kernel: device-mapper: dm-log-userspace: [pERyFOX6] Request timed out: [5/201681] - retrying
device-mapper: dm-log-userspace: [pERyFOX6] Request timed out: [5/201682] - retrying


hayes-03:
Aug 18 10:50:12 hayes-03 cmirrord[8772]: [pERyFOX6] Failed to open checkpoint for 1: SA_AIS_ERR_LIBRARY
Aug 18 10:50:11 hayes-03 kernel: device-mapper: dm-log-userspace: [pERyFOX6] Request timed out: [13/1919] - retrying
Aug 18 10:50:12 hayes-03 cmirrord[8772]: [pERyFOX6] Failed to export checkpoint for 1


2.6.32-188.el6.x86_64

lvm2-2.02.87-1.el6    BUILT: Fri Aug 12 06:11:57 CDT 2011
lvm2-libs-2.02.87-1.el6    BUILT: Fri Aug 12 06:11:57 CDT 2011
lvm2-cluster-2.02.87-1.el6    BUILT: Fri Aug 12 06:11:57 CDT 2011
udev-147-2.37.el6    BUILT: Wed Aug 10 07:48:15 CDT 2011
device-mapper-1.02.66-1.el6    BUILT: Fri Aug 12 06:11:57 CDT 2011
device-mapper-libs-1.02.66-1.el6    BUILT: Fri Aug 12 06:11:57 CDT 2011
device-mapper-event-1.02.66-1.el6    BUILT: Fri Aug 12 06:11:57 CDT 2011
device-mapper-event-libs-1.02.66-1.el6    BUILT: Fri Aug 12 06:11:57 CDT 2011
cmirror-2.02.87-1.el6    BUILT: Fri Aug 12 06:11:57 CDT 2011

Comment 5 Jonathan Earl Brassow 2011-08-18 17:16:23 UTC
Hmmm, this is going to have to wait.  The changes are much more involved and intrusive than I thought.  It is not simply a matter of responding to SIGHUP and reconnecting to corosync.  There is live state on the system that must be transmitted, probably via checkpoint.  That is very different from the start-up scenario, where incoming nodes do not already have a live impression of the system.

Additionally, this bug was filed (by me) to see if we could better handle a situation that isn't allowed in the first place - that is, shutting down a service that cmirrord depends on without first shutting down cmirrord.

I'm pushing this out to 6.3 and changing the scope of this bug.  If we are trying to protect against the scenario where everything except cmirrord is shut down, then restarted, and cmirrord is expected to keep working, then I will simply check whether there are any active logs and, if there are none, re-form the connection with corosync.  If there are active logs, the reconnect will be refused.  This simplified handling should be more than sufficient for a scenario that is not allowed.
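The simplified policy described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `try_reconnect`, `connect_fn`, the result codes, and the two stub functions are hypothetical names, not anything from cmirrord, and the stubs stand in for whatever call actually re-establishes the corosync connection.

```c
#include <stddef.h>

/* Hypothetical result codes for this sketch. */
enum reconnect_result { RECONNECT_OK, RECONNECT_REFUSED, RECONNECT_FAILED };

/* Hypothetical hook that re-establishes the cluster connection; 0 on success. */
typedef int (*connect_fn)(void);

/*
 * Reconnect only when no logs are active.  With active logs there is
 * live state that would have to be transferred, so the reconnect is
 * refused rather than attempted.
 */
enum reconnect_result try_reconnect(size_t active_logs, connect_fn do_connect)
{
	if (active_logs > 0)
		return RECONNECT_REFUSED;
	return (do_connect() == 0) ? RECONNECT_OK : RECONNECT_FAILED;
}

/* Stand-ins for the real connection routine, used only to exercise the sketch. */
int stub_connect_ok(void)   { return 0; }
int stub_connect_fail(void) { return -1; }
```

With these stubs, `try_reconnect(0, stub_connect_ok)` yields RECONNECT_OK, while any nonzero log count yields RECONNECT_REFUSED without calling the hook at all.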

