Bug 1247846
| Summary: | epoll_ctl(del): Bad file descriptor (9) | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Sam McLeod <mailinglists> |
| Component: | libqb | Assignee: | Christine Caulfield <ccaulfie> |
| Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 7.3 | CC: | ccaulfie, cluster-maint, jkortus, kgaillot, mailinglists, mkelly |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-11-04 04:40:27 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Interestingly, some of the files in /dev/shm are owned by root and others by hacluster - not sure if this is expected or not.

root@s1-san6:/dev/shm # ls -lah
total 39M
drwxrwxrwt 2 root root 1.9K Jul 29 15:03 .
drwxr-xr-x 23 root root 3.9K Jul 29 09:28 ..
-rw-rw---- 1 hacluster haclient 132K Jul 29 09:11 qb-attrd-event-2378-2381-9-data
-rw-rw---- 1 hacluster haclient 8.1K Jul 29 09:11 qb-attrd-event-2378-2381-9-header
-rw-rw---- 1 hacluster haclient 132K Jul 29 09:11 qb-attrd-request-2378-2381-9-data
-rw-rw---- 1 hacluster haclient 8.1K Jul 29 09:11 qb-attrd-request-2378-2381-9-header
-rw-rw---- 1 hacluster haclient 132K Jul 29 09:11 qb-attrd-response-2378-2381-9-data
-rw-rw---- 1 hacluster haclient 8.1K Jul 29 09:11 qb-attrd-response-2378-2381-9-header
-rw------- 1 root root 1.1M Jul 29 09:11 qb-cfg-event-1500-2281-18-data
-rw------- 1 root root 8.1K Jul 29 09:11 qb-cfg-event-1500-2281-18-header
-rw------- 1 root root 1.1M Jul 29 09:11 qb-cfg-request-1500-2281-18-data
-rw------- 1 root root 8.1K Jul 29 09:11 qb-cfg-request-1500-2281-18-header
-rw------- 1 root root 1.1M Jul 29 09:11 qb-cfg-response-1500-2281-18-data
-rw------- 1 root root 8.1K Jul 29 09:11 qb-cfg-response-1500-2281-18-header
-rw------- 1 hacluster haclient 516K Jul 29 09:11 qb-cib_rw-event-2372-2373-11-data
-rw------- 1 hacluster haclient 8.1K Jul 29 09:11 qb-cib_rw-event-2372-2373-11-header
-rw-rw---- 1 hacluster haclient 516K Jul 29 09:11 qb-cib_rw-event-2372-2378-12-data
-rw-rw---- 1 hacluster haclient 8.1K Jul 29 09:11 qb-cib_rw-event-2372-2378-12-header
-rw------- 1 hacluster haclient 516K Jul 29 09:11 qb-cib_rw-request-2372-2373-11-data
-rw------- 1 hacluster haclient 8.1K Jul 29 09:11 qb-cib_rw-request-2372-2373-11-header
-rw-rw---- 1 hacluster haclient 516K Jul 29 09:11 qb-cib_rw-request-2372-2378-12-data
-rw-rw---- 1 hacluster haclient 8.1K Jul 29 09:11 qb-cib_rw-request-2372-2378-12-header
-rw------- 1 hacluster haclient 516K Jul 29 09:11 qb-cib_rw-response-2372-2373-11-data
-rw------- 1 hacluster haclient 8.1K Jul 29 09:11 qb-cib_rw-response-2372-2373-11-header
-rw-rw---- 1 hacluster haclient 516K Jul 29 09:11 qb-cib_rw-response-2372-2378-12-data
-rw-rw---- 1 hacluster haclient 8.1K Jul 29 09:11 qb-cib_rw-response-2372-2378-12-header
-rw-rw---- 1 hacluster haclient 516K Jul 29 09:11 qb-cib_shm-event-2372-2381-10-data
-rw-rw---- 1 hacluster haclient 8.1K Jul 29 09:11 qb-cib_shm-event-2372-2381-10-header
-rw-rw---- 1 hacluster haclient 516K Jul 29 09:51 qb-cib_shm-request-2372-2381-10-data
-rw-rw---- 1 hacluster haclient 8.1K Jul 29 09:11 qb-cib_shm-request-2372-2381-10-header
-rw-rw---- 1 hacluster haclient 516K Jul 29 09:11 qb-cib_shm-response-2372-2381-10-data
-rw-rw---- 1 hacluster haclient 8.1K Jul 29 09:11 qb-cib_shm-response-2372-2381-10-header
-rw------- 1 root root 8.1M Jul 29 11:02 qb-corosync-1379-blackbox-data
-rw------- 1 root root 8.1K Jul 29 09:11 qb-corosync-1379-blackbox-header
-rw------- 1 root root 1.1M Jul 29 09:11 qb-cpg-event-1500-2281-19-data
-rw------- 1 root root 8.1K Jul 29 09:11 qb-cpg-event-1500-2281-19-header
-rw------- 1 hacluster haclient 1.1M Jul 29 09:11 qb-cpg-event-1500-2372-25-data
-rw------- 1 hacluster haclient 8.1K Jul 29 09:11 qb-cpg-event-1500-2372-25-header
-rw------- 1 root root 1.1M Jul 29 09:11 qb-cpg-event-1500-2373-21-data
-rw------- 1 root root 8.1K Jul 29 09:11 qb-cpg-event-1500-2373-21-header
-rw------- 1 hacluster haclient 1.1M Jul 29 09:11 qb-cpg-event-1500-2378-23-data
-rw------- 1 hacluster haclient 8.1K Jul 29 09:11 qb-cpg-event-1500-2378-23-header
-rw------- 1 hacluster haclient 1.1M Jul 29 09:11 qb-cpg-event-1500-2381-22-data
-rw------- 1 hacluster haclient 8.1K Jul 29 09:11 qb-cpg-event-1500-2381-22-header
-rw------- 1 root root 1.1M Jul 29 09:11 qb-cpg-request-1500-2281-19-data
-rw------- 1 root root 8.1K Jul 29 09:11 qb-cpg-request-1500-2281-19-header
-rw------- 1 hacluster haclient 1.1M Jul 29 09:51 qb-cpg-request-1500-2372-25-data
-rw------- 1 hacluster haclient 8.1K Jul 29 09:11 qb-cpg-request-1500-2372-25-header
-rw------- 1 root root 1.1M Jul 29 09:11 qb-cpg-request-1500-2373-21-data
-rw------- 1 root root 8.1K Jul 29 09:11 qb-cpg-request-1500-2373-21-header
-rw------- 1 hacluster haclient 1.1M Jul 29 10:26 qb-cpg-request-1500-2378-23-data
-rw------- 1 hacluster haclient 8.1K Jul 29 09:11 qb-cpg-request-1500-2378-23-header
-rw------- 1 hacluster haclient 1.1M Jul 29 09:51 qb-cpg-request-1500-2381-22-data
-rw------- 1 hacluster haclient 8.1K Jul 29 09:11 qb-cpg-request-1500-2381-22-header
-rw------- 1 root root 1.1M Jul 29 09:11 qb-cpg-response-1500-2281-19-data
-rw------- 1 root root 8.1K Jul 29 09:11 qb-cpg-response-1500-2281-19-header
-rw------- 1 hacluster haclient 1.1M Jul 29 09:11 qb-cpg-response-1500-2372-25-data
-rw------- 1 hacluster haclient 8.1K Jul 29 09:11 qb-cpg-response-1500-2372-25-header
-rw------- 1 root root 1.1M Jul 29 09:11 qb-cpg-response-1500-2373-21-data
-rw------- 1 root root 8.1K Jul 29 09:11 qb-cpg-response-1500-2373-21-header
-rw------- 1 hacluster haclient 1.1M Jul 29 09:11 qb-cpg-response-1500-2378-23-data
-rw------- 1 hacluster haclient 8.1K Jul 29 09:11 qb-cpg-response-1500-2378-23-header
-rw------- 1 hacluster haclient 1.1M Jul 29 09:11 qb-cpg-response-1500-2381-22-data
-rw------- 1 hacluster haclient 8.1K Jul 29 09:11 qb-cpg-response-1500-2381-22-header
-rw-rw---- 1 hacluster haclient 132K Jul 29 09:11 qb-lrmd-event-2377-2381-7-data
-rw-rw---- 1 hacluster haclient 8.1K Jul 29 09:11 qb-lrmd-event-2377-2381-7-header
-rw-rw---- 1 hacluster haclient 132K Jul 29 09:51 qb-lrmd-request-2377-2381-7-data
-rw-rw---- 1 hacluster haclient 8.1K Jul 29 09:11 qb-lrmd-request-2377-2381-7-header
-rw-rw---- 1 hacluster haclient 132K Jul 29 09:11 qb-lrmd-response-2377-2381-7-data
-rw-rw---- 1 hacluster haclient 8.1K Jul 29 09:11 qb-lrmd-response-2377-2381-7-header
-rw------- 1 root root 1.1M Jul 29 09:11 qb-quorum-event-1500-2281-20-data
-rw------- 1 root root 8.1K Jul 29 09:11 qb-quorum-event-1500-2281-20-header
-rw------- 1 hacluster haclient 1.1M Jul 29 09:11 qb-quorum-event-1500-2381-24-data
-rw------- 1 hacluster haclient 8.1K Jul 29 09:11 qb-quorum-event-1500-2381-24-header
-rw------- 1 root root 1.1M Jul 29 09:11 qb-quorum-request-1500-2281-20-data
-rw------- 1 root root 8.1K Jul 29 09:11 qb-quorum-request-1500-2281-20-header
-rw------- 1 hacluster haclient 1.1M Jul 29 09:11 qb-quorum-request-1500-2381-24-data
-rw------- 1 hacluster haclient 8.1K Jul 29 09:11 qb-quorum-request-1500-2381-24-header
-rw------- 1 root root 1.1M Jul 29 09:11 qb-quorum-response-1500-2281-20-data
-rw------- 1 root root 8.1K Jul 29 09:11 qb-quorum-response-1500-2281-20-header
-rw------- 1 hacluster haclient 1.1M Jul 29 09:11 qb-quorum-response-1500-2381-24-data
-rw------- 1 hacluster haclient 8.1K Jul 29 09:11 qb-quorum-response-1500-2381-24-header
-rw------- 1 root root 132K Jul 29 09:11 qb-stonith-ng-event-2373-2377-10-data
-rw------- 1 root root 8.1K Jul 29 09:11 qb-stonith-ng-event-2373-2377-10-header
-rw-rw---- 1 hacluster haclient 132K Jul 29 09:11 qb-stonith-ng-event-2373-2381-9-data
-rw-rw---- 1 hacluster haclient 8.1K Jul 29 09:11 qb-stonith-ng-event-2373-2381-9-header
-rw------- 1 root root 132K Jul 29 09:11 qb-stonith-ng-request-2373-2377-10-data
-rw------- 1 root root 8.1K Jul 29 09:11 qb-stonith-ng-request-2373-2377-10-header
-rw-rw---- 1 hacluster haclient 132K Jul 29 09:11 qb-stonith-ng-request-2373-2381-9-data
-rw-rw---- 1 hacluster haclient 8.1K Jul 29 09:11 qb-stonith-ng-request-2373-2381-9-header
-rw------- 1 root root 132K Jul 29 09:11 qb-stonith-ng-response-2373-2377-10-data
-rw------- 1 root root 8.1K Jul 29 09:11 qb-stonith-ng-response-2373-2377-10-header
-rw-rw---- 1 hacluster haclient 132K Jul 29 09:11 qb-stonith-ng-response-2373-2381-9-data
-rw-rw---- 1 hacluster haclient 8.1K Jul 29 09:11 qb-stonith-ng-response-2373-2381-9-header

Different owners (hacluster vs root) are expected and not a problem (each file is created by the client, so if the client is running as hacluster, the file is owned by hacluster). The epoll_ctl message is also not a big deal, but may be confusing. Sadly, corosync has no way to handle these error messages, so I'm reassigning to the libqb developers to propose a solution.

This certainly seems to be a libqb bug. libqb closes the fd before removing it from the epoll list.
This is a patch I'm testing. It just swaps over the close and epoll_del:
diff --git a/lib/ipc_shm.c b/lib/ipc_shm.c
index 5eaf245..a038d8d 100644
--- a/lib/ipc_shm.c
+++ b/lib/ipc_shm.c
@@ -228,8 +228,8 @@ qb_ipcs_shm_disconnect(struct qb_ipcs_connection *c)
 	if (c->state == QB_IPCS_CONNECTION_ESTABLISHED ||
 	    c->state == QB_IPCS_CONNECTION_ACTIVE) {
 		if (c->setup.u.us.sock > 0) {
-			qb_ipcc_us_sock_close(c->setup.u.us.sock);
 			(void)c->service->poll_fns.dispatch_del(c->setup.u.us.sock);
+			qb_ipcc_us_sock_close(c->setup.u.us.sock);
 			c->setup.u.us.sock = -1;
 		}
 	}
Patch looks correct. The dispatch_del() function is provided by the client app, and it's a bit of a cruel prank to pass them a closed file descriptor. :)

commit 0f68c8c8283786c271624a3955c8adac878191f9
Author: Christine Caulfield <ccaulfie>
Date: Thu Sep 24 16:04:28 2015 +0100

    ipc: Don't send the dispatch_del() function a closed fd
This will be covered by the rebase BZ #1299968.

This bug was accidentally moved from POST to MODIFIED via an error in automation; please see mmccune with any questions.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2363.html
Created attachment 1057213: corosync config and logs

Description of problem:
Corosync logs the error "epoll_ctl(del): Bad file descriptor (9)" when a client disconnects and its shared-memory ringbuffers are closed.

Example:
739467:Jul 29 14:09:52 [12727] s1-san6 crm_node: debug: qb_rb_open_2: shm size:1048589; real_size:1052672; rb->word_size:263168
739468:Jul 29 14:09:52 [1379] s1-san6 corosync debug [QB ] qb_ipcs_disconnect(1500-12725-26) state:2
739469:Jul 29 14:09:52 [12725] s1-san6 crm_node: debug: qb_rb_close: Closing ringbuffer: /dev/shm/qb-cpg-response-1500-12725-26-header
739470:Jul 29 14:09:52 [1379] s1-san6 corosync debug [QB ] epoll_ctl(del): Bad file descriptor (9)
739471:Jul 29 14:09:52 [1379] s1-san6 corosync debug [MAIN ] cs_ipcs_connection_closed()
739472:Jul 29 14:09:52 [1379] s1-san6 corosync debug [CPG ] exit_fn for conn=0x5654f588a0c0
739473:Jul 29 14:09:52 [12725] s1-san6 crm_node: debug: qb_rb_close: Closing ringbuffer: /dev/shm/qb-cpg-event-1500-12725-26-header
739474:Jul 29 14:09:52 [1379] s1-san6 corosync debug [MAIN ] cs_ipcs_connection_destroyed()

Version-Release number of selected component (if applicable):
- Pacemaker 1.1.12
- Corosync 2.3.4
- CentOS Linux release 7.1.1503 (Core)
- STONITH disabled for debugging purposes
- System is not in production (yet)

How reproducible:
- Can reproduce on all four nodes on two separate clusters.
- Already occurred 12,240 times today:

root@s1-san6:~ # zgrep -ni "descriptor" /var/log/cluster/corosync.log|wc -l
12240