Bug 1483730
Summary: | [GSS] glusterfsd (brick) process crashed | |||
---|---|---|---|---|
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Raghavendra Bhat <rabhat> | |
Component: | rpc | Assignee: | Raghavendra G <rgowdapp> | |
Status: | CLOSED ERRATA | QA Contact: | Vinayak Papnoi <vpapnoi> | |
Severity: | unspecified | Docs Contact: | ||
Priority: | unspecified | |||
Version: | rhgs-3.1 | CC: | amukherj, mchangir, nbalacha, rgowdapp, rhs-bugs, sheggodu, storage-qa-internal, vdas | |
Target Milestone: | --- | |||
Target Release: | RHGS 3.4.0 | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | rebase | |||
Fixed In Version: | glusterfs-3.12.2-1 | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 1486134 | Environment: | ||
Last Closed: | 2018-09-04 06:35:11 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1486134, 1489296, 1489297, 1489298, 1503135 |
Comment 4
Raghavendra G
2017-08-24 03:40:42 UTC

Failure of the assertion (slot->fd == fd) means that the slot is allocated to a different fd. Slot deallocation happens only after receiving a POLLERR event. This means we got at least one other, parallel POLLERR event on the socket. However, we register the socket with EPOLLONESHOT, which means it has to be explicitly added back through epoll_ctl to receive more events. Normally we do this once the handler completes processing the current event. But event_select_on_epoll is one asynchronous codepath where the socket can be added back for polling while an event on the same socket is still being processed. event_select_on_epoll does check whether an event is being processed, in the form of slot->in_handler, but this check is not sufficient to prevent parallel events, as slot->in_handler is not incremented atomically with respect to reception of the event. This means the following hypothetical sequence of events can happen:

* epoll_wait returns with a POLLIN - say POLLIN1 - on a socket (sock1) associated with slot s1.
* An event_select_on called from __socket_ioq_churn in the request/reply/msg submission codepath (as opposed to __socket_ioq_churn called as part of POLLOUT handling - we cannot receive a POLLOUT due to EPOLLONESHOT) adds sock1 back for polling.
* Since sock1 was added back for polling in step 2 and our polling is level-triggered, another thread picks up another POLLIN - say POLLIN2 - on sock1.

Similarly, we can argue for more than one POLLERR event too. However, every event being processed holds a reference on the slot, and the slot is deallocated only when its refcount drops to 0; I cannot think of a way this can happen even with parallel POLLERR events. So, as of now the RCA is unknown. While making event_unregister_epoll_common do nothing (just return) when (slot->fd != fd) seems to be the only sane fix, I would like to figure out the RCA before doing so. It would be better to avoid running into such a situation than to handle it.

Some observations from reading the code:

* A slot won't be deallocated, and hence the fd associated with it won't change, as long as it holds a positive refcount. Since we increment the refcount by one before calling the handler, the slot's fd won't change till the handler returns.
* From __socket_reset:

      event_unregister_close (this->ctx->event_pool, priv->sock, priv->idx);
      priv->sock = -1;
      priv->idx = -1;
      priv->connected = -1;

As can be seen above, priv->sock is passed as an argument to event_unregister_close and is set to -1 after that. So, I guess we got more than one POLLERR event on the socket: the first event set priv->sock to -1, and the second event resulted in this crash during event_unregister_close, as slot->fd != -1. Since socket_event_handler logs at DEBUG log level, the logs cannot help me confirm this hypothesis.

As to parallel POLLERR events, comment #5 explains a possible scenario where this can happen. A more comprehensive analysis can be found at https://bugzilla.redhat.com/show_bug.cgi?id=1486134#c3

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607
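To make the EPOLLONESHOT/re-arm interplay in comment 4 easier to follow, here is a minimal, hypothetical C sketch. It is not GlusterFS code; names such as demo_slot_t, demo_slot_register, demo_slot_rearm and demo_slot_handle_event are made up for illustration. It only shows the pattern the analysis relies on: an fd registered with EPOLLONESHOT is disabled after each delivered event and must be explicitly re-armed with epoll_ctl(EPOLL_CTL_MOD), and an in_handler-style guard on an asynchronous re-arm path only helps once the handler has actually set the flag, so a re-arm issued before that can still let another thread receive a parallel event on the same fd.

```c
#include <pthread.h>
#include <stdint.h>
#include <sys/epoll.h>

/* Illustrative slot, loosely modelled on the idea described in comment 4:
 * one registered fd plus a flag that says "a handler is currently
 * processing an event on this fd".  Hypothetical names, not GlusterFS. */
typedef struct {
    int             fd;          /* socket registered with epoll           */
    int             in_handler;  /* non-zero while an event is in progress */
    pthread_mutex_t lock;        /* protects in_handler and the re-arm     */
} demo_slot_t;

/* Register fd with EPOLLONESHOT: after one event is delivered, the fd is
 * disabled until it is explicitly re-armed with EPOLL_CTL_MOD. */
static int
demo_slot_register (int epfd, demo_slot_t *slot, uint32_t events)
{
    struct epoll_event ev = { 0 };

    ev.events   = events | EPOLLONESHOT;
    ev.data.ptr = slot;

    return epoll_ctl (epfd, EPOLL_CTL_ADD, slot->fd, &ev);
}

/* Re-arm the fd, the analogue of the asynchronous event_select_on path.
 * Skipping the re-arm while in_handler is set is meant to prevent a second
 * thread from receiving a parallel event on the same fd.  Note that this
 * only helps once the handler has actually set the flag: a re-arm issued
 * in the window between epoll_wait() delivering the event and the handler
 * setting in_handler still re-enables delivery, which is the race the
 * analysis above points at. */
static int
demo_slot_rearm (int epfd, demo_slot_t *slot, uint32_t events)
{
    int ret = 0;

    pthread_mutex_lock (&slot->lock);
    if (!slot->in_handler) {
        struct epoll_event ev = { 0 };

        ev.events   = events | EPOLLONESHOT;
        ev.data.ptr = slot;

        ret = epoll_ctl (epfd, EPOLL_CTL_MOD, slot->fd, &ev);
    }
    pthread_mutex_unlock (&slot->lock);

    return ret;
}

/* Event handler: mark the slot busy for the duration of the processing,
 * then re-arm only once the work is done - the "normal" re-arm path
 * mentioned in the comment. */
static void
demo_slot_handle_event (int epfd, demo_slot_t *slot, uint32_t revents)
{
    pthread_mutex_lock (&slot->lock);
    slot->in_handler = 1;
    pthread_mutex_unlock (&slot->lock);

    if (revents & (EPOLLERR | EPOLLHUP)) {
        /* teardown path, analogous to __socket_reset/event_unregister_close */
    } else {
        /* read/write on slot->fd ... */
    }

    pthread_mutex_lock (&slot->lock);
    slot->in_handler = 0;
    pthread_mutex_unlock (&slot->lock);

    demo_slot_rearm (epfd, slot, EPOLLIN);
}
```

The sketch deliberately keeps the guard and the epoll_ctl call under one lock; even so, as the comment notes, the remaining window before the handler sets the flag is why the in_handler check alone does not rule out parallel POLLIN/POLLERR events.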
Failure of assertion (slot->fd == fd) means that slot is allocated to a different fd. Slot deallocation happens only after receiving a POLLERR event. This means we got at least one another parallel POLLERR event on the socket. However, we register socket with EPOLLONESHOT, which means it has to be explicitly added back through epoll_ctl to receive more events. Normally we do this once the handler completes processing of current event. But event_select_on_epoll is one asynchronous codepath where socket can be added back for polling while an event on the same socket is being processed. event_select_on_epoll has a check whether an event is being processed in the form of slot->in_handler. But this check is not sufficient enough to prevent parallel events as slot->in_handler is not atomically incremented with respect to reception of the event. This means following imaginary sequence of events can happen: * epoll_wait returns with a POLLIN - say POLLIN1 - on a socket (sock1) associated with slot s1. * an event_select_on called from __socket_ioq_churn which was called in request/reply/msg submission codepath (as opposed to __socket_ioq_churn called as part of POLLOUT handling - we cannot receive a POLLOUT due to EPOLLONESHOT) adds back sock1 for polling. * since sock1 was added back for polling in step 2 and our polling is level-triggered, another thread picks up a POLLIN - say POLLIN2 - event. Similarly we can argue for more than one POLLERR events too. However, every event process gets a reference on slot and slot is deallocated only when refcount goes to 0 and I cannot think of way this can happen even with parallel POLLERR events. So, as of now RCA is unknown. While making event_unregister_epoll_common returning doing nothing when (slot->fd != fd) seems to be the only sane fix, I would like to figure out the RCA before doing so. It would be better if we can avoid running into such a situation than handling it. Some observations by reading code: * A slot won't be deallocated and hence the fd associated with it won't change till there is a positive non-zero refcount. Since we increment refcount by one before calling handler, slot's fd won't be changed till handler returns. * From __socket_reset, event_unregister_close (this->ctx->event_pool, priv->sock, priv->idx); priv->sock = -1; priv->idx = -1; priv->connected = -1; As can be seen above, priv->sock that is passed as argument to event_unregister_close and it is set to -1 post that. So, I guess we got more than one POLLERR event on the socket. The first event set priv->sock to -1 and the second event resulted in this crash during event_unregister_close as slot->fd != -1. Since socket_event_handler logs in DEBUG log-level, logs cannot help me to confirm this hypothesis. As to parallel POLLERR events, comment #5 explains a possible scenario where this can happen. A more comprehensive analysis can be found at https://bugzilla.redhat.com/show_bug.cgi?id=1486134#c3 Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2607 |