Bug 1035418
| Summary: | cmirrord: cpg_dispatch failed: CS_ERR_BAD_HANDLE | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Corey Marthaler <cmarthal> |
| Component: | lvm2 | Assignee: | Jonathan Earl Brassow <jbrassow> |
| lvm2 sub component: | Default / Unclassified | QA Contact: | Cluster QE <mspqa-list> |
| Status: | CLOSED CURRENTRELEASE | Docs Contact: | |
| Severity: | medium | | |
| Priority: | unspecified | CC: | agk, heinzm, jbrassow, msnitzer, nperic, prajnoha, prockai, thornber, zkabelac |
| Version: | 7.0 | | |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | lvm2-2.02.105-14.el7 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2014-06-13 10:11:19 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1075262 | | |
Description
Corey Marthaler
2013-11-27 17:21:46 UTC
I believe I have a fix for this. cpg_dispatch is being run after the node has left the group. We can check the flags to see if LEAVING and INVALID are set and, if so, simply skip the cpg_dispatch. This should eliminate the message.

Patch committed upstream:

commit 52aa3dbcabe85b38b51c68c27fb2397eb0fb1efd
Author: Jonathan Brassow <jbrassow>
Date: Wed Mar 5 10:44:20 2014 -0600

    cmirrord: Clean-up stray warning message

    cmirrord polls for messages on the kernel and cluster interfaces.
    Sometimes it is possible for messages to be received on the cluster
    interface and be waiting for processing while the node is in the
    process of leaving the cluster group. When this happens, the messages
    received on the cluster interface are attempted to be dispatched, but
    an error is returned because the connection is no longer valid. It is
    a harmless situation. So, if we get the specific error
    (CS_ERR_BAD_HANDLE) and we know that we have left the group, then
    simply don't print the message.

This still exists in the latest rpms.

[root@host-003 ~]# grep CS_ERR_BAD_HANDLE /var/log/messages
Mar 17 18:48:24 host-003 cmirrord[17493]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 17 18:48:28 host-003 cmirrord[17493]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 17 18:48:34 host-003 cmirrord[17493]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 17 18:48:52 host-003 cmirrord[17493]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 17 18:49:51 host-003 cmirrord[17493]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 17 18:50:33 host-003 cmirrord[17493]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 17 18:50:43 host-003 cmirrord[17493]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 17 18:51:14 host-003 cmirrord[17493]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 17 18:51:48 host-003 cmirrord[17493]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 17 18:52:22 host-003 cmirrord[17493]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 17 18:52:45 host-003 cmirrord[17493]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 17 18:53:06 host-003 cmirrord[17493]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 17 18:53:18 host-003 cmirrord[17493]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 17 18:53:26 host-003 cmirrord[17493]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 17 18:54:00 host-003 cmirrord[17493]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 17 18:54:26 host-003 cmirrord[17493]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 17 18:54:40 host-003 cmirrord[17493]: cpg_dispatch failed: CS_ERR_BAD_HANDLE

[root@host-003 ~]# /usr/tests/sts-rhel7.0/lvm2/bin/lvm_rpms
3.10.0-112.el7.x86_64
lvm2-2.02.105-12.el7 BUILT: Wed Mar 12 10:49:52 CDT 2014
lvm2-libs-2.02.105-12.el7 BUILT: Wed Mar 12 10:49:52 CDT 2014
lvm2-cluster-2.02.105-12.el7 BUILT: Wed Mar 12 10:49:52 CDT 2014
device-mapper-1.02.84-12.el7 BUILT: Wed Mar 12 10:49:52 CDT 2014
device-mapper-libs-1.02.84-12.el7 BUILT: Wed Mar 12 10:49:52 CDT 2014
device-mapper-event-1.02.84-12.el7 BUILT: Wed Mar 12 10:49:52 CDT 2014
device-mapper-event-libs-1.02.84-12.el7 BUILT: Wed Mar 12 10:49:52 CDT 2014
device-mapper-persistent-data-0.2.8-4.el7 BUILT: Fri Jan 24 14:28:55 CST 2014
cmirror-2.02.105-12.el7 BUILT: Wed Mar 12 10:49:52 CDT 2014
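For context, the guard described in that first commit amounts to something along the following lines. This is only a minimal sketch: the struct, flag, and helper names (struct clog_cpg, LEAVING, INVALID, log_error, do_cluster_work) are assumptions for illustration, not code copied from the cmirrord source.

```c
/* Hedged sketch of the first fix: skip the dispatch (and the warning)
 * once we know this node is on its way out of the CPG group.
 * All names below are illustrative assumptions, not cmirrord's code. */
#include <stdint.h>
#include <stdio.h>
#include <corosync/cpg.h>

#define LEAVING 0x1	/* cpg_leave() has been issued for this log */
#define INVALID 0x2	/* log is no longer valid for cluster traffic */

/* Stand-in for cmirrord's logging helper. */
#define log_error(fmt, ...) fprintf(stderr, fmt "\n", ##__VA_ARGS__)

struct clog_cpg {
	cpg_handle_t handle;	/* CPG connection for this mirror log */
	uint32_t state;		/* LEAVING / INVALID flags */
};

static void do_cluster_work(struct clog_cpg *entry)
{
	cs_error_t r;

	/* First fix, as described above: if the node is already leaving
	 * and the log is invalid, skip cpg_dispatch() entirely. */
	if (entry->state & (LEAVING | INVALID))
		return;

	r = cpg_dispatch(entry->handle, CS_DISPATCH_ALL);
	if (r != CS_OK)
		log_error("cpg_dispatch failed: %d", r);
}
```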
The previous commit did not cover all of the possible cases during the process of a node leaving the cluster. The following upstream commit finishes the job.

commit bdd7baeab39f7ef61817b2c48008852aefdbcdb0
Author: Jonathan Brassow <jbrassow>
Date: Wed Mar 19 14:43:00 2014 -0500

    cmirrord: Clean-up stray warning message (attempt #2)

    There are two types of CPG communications in a corosync cluster:
    messages and state transitions. Cmirrord processes the state
    transitions first. When a cluster mirror issues a POSTSUSPEND, it
    signals the end of cluster communication with the rest of the nodes
    in the cluster. The POSTSUSPEND marks the last communication of the
    'message' type that will go around the cluster. The node then calls
    cpg_leave which causes a final 'state transition' communication to
    all of the nodes. Once the out-going node receives its own state
    transition notice from the cluster, it finalizes the leave. At this
    point, the state of the log is 'INVALID'; but it is possible that
    there remains some cluster traffic that was queued up behind the
    state transition that still wants to be processed. It is harmless to
    attempt to dispatch any remaining messages - they won't be delivered
    because the node is no longer in the cluster. However, there was a
    warning message that was being printed in this case that is now
    removed by this patch. The failure of the dispatch created a false
    positive condition that triggered the message.
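In terms of the same hedged sketch as above (same assumed names; the function name below is likewise hypothetical), the effect of this second clean-up is to stop warning about this specific, harmless dispatch failure while still reporting anything else:

```c
/* Continuation of the sketch above (same assumed definitions).
 * After cpg_leave() has completed and the log has been marked INVALID,
 * cluster traffic queued behind the final state transition may still be
 * dispatched; that dispatch fails with CS_ERR_BAD_HANDLE and is harmless,
 * so it is silently ignored rather than logged. */
static void flush_remaining_cluster_work(struct clog_cpg *entry)
{
	cs_error_t r = cpg_dispatch(entry->handle, CS_DISPATCH_ALL);

	if (r == CS_OK)
		return;

	/* We already left the group; the queued traffic simply cannot be
	 * delivered to this node any more. */
	if (r == CS_ERR_BAD_HANDLE && (entry->state & (LEAVING | INVALID)))
		return;

	log_error("cpg_dispatch failed: %d", r);
}
```

Any other dispatch error is still logged, so genuine CPG problems remain visible.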
I just ran the 'revolution' test with a leg/log fault policy of "allocate", and am getting the errors on all the nodes:

Mar 27 16:37:27 virt-010 lvm[31056]: No longer monitoring mirror device revolution_9-mirror_1 for events.
Mar 27 16:37:27 virt-010 cmirrord[30645]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 27 16:37:30 virt-010 lvm[31056]: Monitoring mirror device revolution_9-mirror_1 for events.
Mar 27 16:37:30 virt-010 lvm[31056]: revolution_9-mirror_1 is now in-sync.
Mar 27 16:37:32 virt-010 lvm[31056]: No longer monitoring mirror device revolution_9-mirror_1 for events.
Mar 27 16:37:32 virt-010 cmirrord[30645]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 27 16:37:42 virt-010 lvm[31056]: Monitoring mirror device revolution_9-mirror_1 for events.
Mar 27 16:37:42 virt-010 lvm[31056]: revolution_9-mirror_1 is now in-sync.

[root@virt-010 ~]# grep CS_ERR_BAD_HANDLE /var/log/messages
Mar 27 16:33:21 virt-010 cmirrord[30645]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 27 16:37:26 virt-010 cmirrord[30645]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 27 16:37:27 virt-010 cmirrord[30645]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 27 16:37:32 virt-010 cmirrord[30645]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 27 16:39:51 virt-010 cmirrord[30645]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 27 16:39:52 virt-010 cmirrord[30645]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 27 16:39:58 virt-010 cmirrord[30645]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 27 16:42:25 virt-010 cmirrord[30645]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 27 16:42:25 virt-010 cmirrord[30645]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 27 16:42:25 virt-010 cmirrord[30645]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 27 16:43:33 virt-010 cmirrord[30645]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 27 16:43:35 virt-010 cmirrord[30645]: cpg_dispatch failed: CS_ERR_BAD_HANDLE
Mar 27 16:43:40 virt-010 cmirrord[30645]: cpg_dispatch failed: CS_ERR_BAD_HANDLE

Versions tested on:

3.10.0-113.el7.x86_64
lvm2-2.02.105-14.el7 BUILT: Wed Mar 26 14:29:41 CET 2014
lvm2-libs-2.02.105-14.el7 BUILT: Wed Mar 26 14:29:41 CET 2014
lvm2-cluster-2.02.105-14.el7 BUILT: Wed Mar 26 14:29:41 CET 2014
device-mapper-1.02.84-14.el7 BUILT: Wed Mar 26 14:29:41 CET 2014
device-mapper-libs-1.02.84-14.el7 BUILT: Wed Mar 26 14:29:41 CET 2014
device-mapper-event-1.02.84-14.el7 BUILT: Wed Mar 26 14:29:41 CET 2014
device-mapper-event-libs-1.02.84-14.el7 BUILT: Wed Mar 26 14:29:41 CET 2014
device-mapper-persistent-data-0.2.8-5.el7 BUILT: Sat Mar 1 02:15:56 CET 2014
cmirror-2.02.105-14.el7 BUILT: Wed Mar 26 14:29:41 CET 2014

I wasn't receiving any of these messages when I was testing, but I didn't test both leg and log as "allocate". The only thing I can think of now is that a 'resume' command comes in before the state transition phase is processed as part of the preceding 'postsuspend'. If that were the case, I would have to move the point at which the kernel is given acknowledgement of the processing of the 'postsuspend' command. I'm not sure I'm willing to do that with such a short window in the RHEL 7.0 timeframe and such a low priority for the bug. I will at least make sure I can reproduce with the same RPMs.

After an hour of running revolution_9, I can't reproduce with the default fault policies. Also, please remember to post the command used to produce the bug. In this case, my unsuccessful attempt was from:

./revolution_9 -l /usr/tests/sts-rhel7.0/ -r /usr/tests/sts-rhel7.0/ -R ~/bp.xml

I will try with the '-F' flag ("allocate" policies for leg and log).

OK, that's another hour of running without reproducing, this time with the following:

./revolution_9 -l /usr/tests/sts-rhel7.0/ -r /usr/tests/sts-rhel7.0/ -R ~/bp.xml -F

Could you please retest? Are you sure you haven't upgraded but left an old cmirrord running?
Here is what I am running with:

[root]# /usr/tests/sts-rhel7.0/lvm2/bin/lvm_rpms
3.10.0-115.el7.x86_64
lvm2-2.02.105-14.el7 BUILT: Wed Mar 26 08:29:41 CDT 2014
lvm2-libs-2.02.105-14.el7 BUILT: Wed Mar 26 08:29:41 CDT 2014
lvm2-cluster-2.02.105-14.el7 BUILT: Wed Mar 26 08:29:41 CDT 2014
device-mapper-1.02.84-14.el7 BUILT: Wed Mar 26 08:29:41 CDT 2014
device-mapper-libs-1.02.84-14.el7 BUILT: Wed Mar 26 08:29:41 CDT 2014
device-mapper-event-1.02.84-14.el7 BUILT: Wed Mar 26 08:29:41 CDT 2014
device-mapper-event-libs-1.02.84-14.el7 BUILT: Wed Mar 26 08:29:41 CDT 2014
device-mapper-persistent-data-0.2.8-5.el7 BUILT: Fri Feb 28 19:15:56 CST 2014
cmirror-2.02.105-14.el7 BUILT: Wed Mar 26 08:29:41 CDT 2014

That may well be the case. I did the upgrade, but I cannot remember now whether I restarted the cluster. I will re-test now after a reboot (just in case something is left over) and post back the results as soon as possible.

We should probably add an automatic restart on update like we already have for clvmd. The only question is what happens if the restart lands in the middle of a cmirror operation - this must be done gracefully (or cmirror must be able to continue with the operation once it has been restarted).

The messages are no longer showing up, although I ran into different issues. This bug can be closed, since the fix works once the cluster is restarted after the update (I hadn't realized that the resource was not restarted/refreshed after the update). Yeah, it would be good if we could somehow replace the cmirrord process without the node being fenced off or causing mirror corruption. Anyway, I can now mark this VERIFIED with the same packages as in Comment #8.

This request was resolved in Red Hat Enterprise Linux 7.0. Contact your manager or support representative in case you have further questions about the request.