Bug 1438966
| Summary: | Multiple bricks WILL crash after TCP port probing | | |
|---|---|---|---|
| Product: | [Community] GlusterFS | Reporter: | Skyler Vock <skyler.vock> |
| Component: | rpc | Assignee: | Milind Changire <mchangir> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | mainline | CC: | amukherj, bugs, kaushal, khiremat, nbalacha, ndevos, oleksandr, pasik, pkarampu, rcyriac, rgowdapp, skoduri, skyler.vock |
| Target Milestone: | --- | Keywords: | Triaged |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | glusterfs-3.12.0 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 1353561 | | |
| : | 1442535, 1449169, 1449191 | Environment: | |
| Last Closed: | 2017-09-05 17:26:58 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | fuse |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1442535, 1449169, 1449191 | | |
Description
Skyler Vock
2017-04-04 22:40:45 UTC
I've been running a test on glusterfs-3.10.1-1.el7.x86_64 for over half a day now, and have not noticed any problem yet. This is what is running (and will be running for a longer period):

- single brick volume
- mounted with fuse, and recursively cp'ing and rm'ing /usr in a loop
- running 'nmap -p49152 127.0.0.1' in a loop

Is this bug still reproducible with the current version of Gluster for you?

Niels, I assume this is the same bug I've already described in 1353561. The backtrace looks the same. My setup there was distributed-replicated, 2 replicas, 5 bricks on each replica. Sad it was closed without proper investigation :(. Unfortunately, I no longer have access to my setup.

(In reply to Niels de Vos from comment #1)
> I've been running a test on glusterfs-3.10.1-1.el7.x86_64 for over half a
> day now, and have not noticed any problem yet. This is what is running (and
> will be running for a longer period):
>
> - single brick volume
> - mounted with fuse, and recursively cp'ing and rm'ing /usr in a loop
> - running 'nmap -p49152 127.0.0.1' in a loop
>
> Is this bug still reproducible with the current version of Gluster for you?

Niels,

This is a difficult bug to replicate. We see the outage in production anywhere between once a week and once every 3 weeks. For us, the outage occurs ONLY during a 'find <path> -depth -type f -name "*" -mtime +31 -delete' command in a backup cron. To speed up the test process, we have a recursive script building the environment, running the backup cron, then destroying the environment. While this is running on a Gluster client, we repeatedly execute 'while true; do nmap -Pn -sT -p49150-49160 <ip>; done' on a node outside of the Gluster architecture, without glusterfs installed. This is similar to a polling system such as OpenNMS.

- one volume
- replicated
- two clients
- three servers, one brick per server
- mounted with fuse

Our test takes anywhere between an hour and 24 hours to reproduce the issue. We can consistently reproduce it, and we continue to see this bug in production. Let me know if you have other questions.
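For reference, the probe side of the reproducers above can be approximated without nmap. The sketch below is a minimal stand-in, not part of either reported setup: it repeatedly connects to a single example brick port on loopback and tears the connection down abortively (SO_LINGER with a zero timeout), so the brick's readv() sees "Connection reset by peer", which is the trigger described in the fix further down. The address, port, and pacing are illustrative, and the abortive close is an assumption about what a connect scan looks like to the brick, not a claim about nmap internals.

```c
/* Illustrative TCP probe loop: connect to an example brick port and close
 * abortively so the peer observes an RST (ECONNRESET on its next read).
 * Address/port are examples only, not values from the reporters' setups. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(49152);              /* example brick port */
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

    struct linger lg = { .l_onoff = 1, .l_linger = 0 };  /* close() sends RST */

    for (;;) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) {
            perror("socket");
            return 1;
        }
        setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg));
        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
            perror("connect");                 /* port may simply be closed */
        close(fd);                             /* abortive close: RST, no FIN */
        usleep(10000);                         /* ~100 probes per second */
    }
}
```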
REVIEW: https://review.gluster.org/17139 (rpc: fix transport add/remove race on port probing) posted (#1) for review on master by Milind Changire (mchangir)
REVIEW: https://review.gluster.org/17139 (rpc: fix transport add/remove race on port probing) posted (#2) for review on master by Milind Changire (mchangir)
REVIEW: https://review.gluster.org/17139 (rpc: fix transport add/remove race on port probing) posted (#3) for review on master by Milind Changire (mchangir)
REVIEW: https://review.gluster.org/17139 (rpc: fix transport add/remove race on port probing) posted (#4) for review on master by Milind Changire (mchangir)
REVIEW: https://review.gluster.org/17139 (rpc: fix transport add/remove race on port probing) posted (#5) for review on master by Milind Changire (mchangir)
REVIEW: https://review.gluster.org/17139 (rpc: fix transport add/remove race on port probing) posted (#6) for review on master by Milind Changire (mchangir)
REVIEW: https://review.gluster.org/17139 (rpc: fix transport add/remove race on port probing) posted (#7) for review on master by Milind Changire (mchangir)
REVIEW: https://review.gluster.org/17139 (rpc: fix transport add/remove race on port probing) posted (#8) for review on master by Milind Changire (mchangir)

COMMIT: https://review.gluster.org/17139 committed in master by Jeff Darcy (jeff.us)

------

commit 4f7ef3020edcc75cdeb22d8da8a1484f9db77ac9
Author: Milind Changire <mchangir>
Date: Wed May 3 10:51:16 2017 +0530

rpc: fix transport add/remove race on port probing

Problem:
Spurious __gf_free() assertion failures are seen all over the place, with header->magic being overwritten, when running port probing tests with 'nmap'.

Solution:
Fix the sequence of:
1. add the accept()ed socket connection fd to the epoll set
2. add the newly created rpc_transport_t object to the RPCSVC service list
The correct sequence is #2 followed by #1.

Reason:
Adding the new fd returned by accept() to the epoll set causes an epoll_wait() to return immediately with a POLLIN event. This races ahead to a readv() which returns with errno:104 (Connection reset by peer) during port probing with 'nmap'. The error is then handled by the POLLERR code, which removes the new transport object from the RPCSVC service list and later unrefs and destroys the rpc transport object. socket_server_event_handler() then catches up and registers the unref'd/destroyed rpc transport object. This later manifests as assertion failures in __gf_free() with the header->magic field botched due to invalid address references. None of this results in a segmentation fault, since the address space continues to be mapped into the process and the pages are still referenced elsewhere.

As a further note: this race happens only in the accept() codepath. Only in this codepath does the notify refer to two transports: 1. the listener transport, and 2. the newly accepted transport. All other notifies refer to only one transport, i.e. the transport/socket on which the event is received. Since epoll is ONE_SHOT, another event won't arrive on the same socket until the current event is processed. However, in the accept() codepath the current event (ACCEPT) and the new event (POLLIN/POLLERR) arrive on two different sockets: 1. ACCEPT on the listener socket, and 2. POLLIN/POLLERR on the newly registered socket. Note also that these two events are handled in different thread contexts.

Cleanup:
The critical section in socket_server_event_handler() has been removed. Instead, an additional ref on new_trans is used to avoid the ref/unref race when notifying RPCSVC.
Change-Id: I4417924bc9e6277d24bd1a1c5bcb7445bcb226a3
BUG: 1438966
Signed-off-by: Milind Changire <mchangir>
Reviewed-on: https://review.gluster.org/17139
Smoke: Gluster Build System <jenkins.org>
NetBSD-regression: NetBSD Build System <jenkins.org>
CentOS-regression: Gluster Build System <jenkins.org>
Reviewed-by: Amar Tumballi <amarts>
Reviewed-by: Oleksandr Natalenko <oleksandr>
Reviewed-by: Jeff Darcy <jeff.us>

This bug is being closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.12.0, please open a new bug report.

glusterfs-3.12.0 has been announced on the Gluster mailing lists [1], and packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-September/000082.html
[2] https://www.gluster.org/pipermail/gluster-users/
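To make the ordering argument in the commit message above concrete, here is a minimal, self-contained sketch of the same discipline: publish the new connection object and take an extra reference before arming its fd in epoll, so a racing error-path unref in another thread cannot destroy the object while the accept path is still using it. All names here (conn_t, registry, conn_ref, conn_unref, handle_accept) are illustrative and are not the GlusterFS rpc-transport API; the actual fix lives in socket_server_event_handler() in the patch linked above.

```c
/* Minimal sketch (not GlusterFS code) of the corrected accept-path ordering:
 * publish the connection and hold an extra ref *before* arming epoll. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

typedef struct conn {
    int fd;
    int refs;
    pthread_mutex_t lock;
} conn_t;

static conn_t *conn_new(int fd)
{
    conn_t *c = calloc(1, sizeof(*c));
    c->fd = fd;
    c->refs = 1;                        /* reference owned by the registry */
    pthread_mutex_init(&c->lock, NULL);
    return c;
}

static void conn_ref(conn_t *c)
{
    pthread_mutex_lock(&c->lock);
    c->refs++;
    pthread_mutex_unlock(&c->lock);
}

static void conn_unref(conn_t *c)
{
    pthread_mutex_lock(&c->lock);
    int refs = --c->refs;
    pthread_mutex_unlock(&c->lock);
    if (refs == 0) {                    /* last reference: safe to destroy */
        close(c->fd);
        pthread_mutex_destroy(&c->lock);
        free(c);
    }
}

/* Accept path, in the corrected order described in the commit message. */
static void handle_accept(int epfd, conn_t **registry, int newfd)
{
    conn_t *c = conn_new(newfd);

    *registry = c;                      /* step 2 first: publish to the
                                           service list before epoll can race */
    conn_ref(c);                        /* extra ref held across notification,
                                           so an error-path unref elsewhere
                                           cannot free the object under us */

    struct epoll_event ev = { .events = EPOLLIN | EPOLLONESHOT,
                              .data.ptr = c };
    epoll_ctl(epfd, EPOLL_CTL_ADD, newfd, &ev);   /* step 1 last: only now can
                                                     a poller thread see it */

    printf("notified service about fd %d\n", c->fd); /* stand-in for notify */
    conn_unref(c);                      /* drop the extra ref */
}

int main(void)
{
    int epfd = epoll_create1(0);
    int fds[2];
    conn_t *registry = NULL;

    /* A socketpair stands in for an accept()ed connection. */
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0)
        return 1;

    handle_accept(epfd, &registry, fds[0]);

    /* Teardown path: drop the registry's reference. */
    conn_unref(registry);
    close(fds[1]);
    close(epfd);
    return 0;
}
```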