Bug 1488354
| Summary: | gluster-blockd process crashed and core generated | |||
|---|---|---|---|---|
| Product: | [Community] GlusterFS | Reporter: | Pranith Kumar K <pkarampu> | |
| Component: | sharding | Assignee: | Pranith Kumar K <pkarampu> | |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | bugs <bugs> | |
| Severity: | urgent | Docs Contact: | ||
| Priority: | unspecified | |||
| Version: | mainline | CC: | amukherj, bugs, kdhananj, knarra, kramdoss, rhs-bugs, storage-qa-internal | |
| Target Milestone: | --- | |||
| Target Release: | --- | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | glusterfs-3.13.0 | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | ||
| Clone Of: | 1488152 | |||
| : | 1488387 (view as bug list) | Environment: | ||
| Last Closed: | 2017-12-08 17:39:41 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | 1488152, 1488381 | |||
| Bug Blocks: | 1488387, 1488391 | |||
REVIEW: https://review.gluster.org/18203 (features/shard: Increment counts in locks) posted (#1) for review on master by Pranith Kumar Karampuri (pkarampu)

COMMIT: https://review.gluster.org/18203 committed in master by Pranith Kumar Karampuri (pkarampu)

------

commit e50fc8f4e7eb51386f47bea9e6ca8d8490c09003
Author: Pranith Kumar K <pkarampu>
Date:   Tue Sep 5 13:30:53 2017 +0530

    features/shard: Increment counts in locks

    Problem:
    Because create_count/eexist_count are incremented without locks,
    all the shards may not be created, as call_count will be less than
    what it needs to be. This can lead to a crash in
    shard_common_inode_write_do() because the inode on which we want to
    do fd_anonymous() is NULL.

    Fix:
    Increment the counts inside frame->lock.

    Change-Id: Ibc87dcb1021e9f4ac2929f662da07aa7662ab0d6
    BUG: 1488354
    Signed-off-by: Pranith Kumar K <pkarampu>
    Reviewed-on: https://review.gluster.org/18203
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Krutika Dhananjay <kdhananj>
    CentOS-regression: Gluster Build System <jenkins.org>

This bug is being closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.13.0, please open a new bug report.

glusterfs-3.13.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-December/000087.html
[2] https://www.gluster.org/pipermail/gluster-users/
(gdb) bt
#0  0x00007f549eaf3c30 in pthread_mutex_lock () from /lib64/libpthread.so.0
#1  0x00007f549e207f15 in fd_anonymous () from /lib64/libglusterfs.so.0
#2  0x00007f54869d1927 in shard_common_inode_write_do () from /usr/lib64/glusterfs/3.8.4/xlator/features/shard.so
#3  0x00007f54869d1c7d in shard_common_inode_write_post_mknod_handler () from /usr/lib64/glusterfs/3.8.4/xlator/features/shard.so
#4  0x00007f54869ca77f in shard_common_mknod_cbk () from /usr/lib64/glusterfs/3.8.4/xlator/features/shard.so
#5  0x00007f5486c1164b in dht_newfile_cbk () from /usr/lib64/glusterfs/3.8.4/xlator/cluster/distribute.so
#6  0x00007f5486e71ab1 in afr_mknod_unwind () from /usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so
#7  0x00007f5486e73eeb in __afr_dir_write_cbk () from /usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so
#8  0x00007f5486e7482d in afr_mknod_wind_cbk () from /usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so
#9  0x00007f54870f6168 in client3_3_mknod_cbk () from /usr/lib64/glusterfs/3.8.4/xlator/protocol/client.so
#10 0x00007f549dfac840 in rpc_clnt_handle_reply () from /lib64/libgfrpc.so.0
#11 0x00007f549dfacb27 in rpc_clnt_notify () from /lib64/libgfrpc.so.0
#12 0x00007f549dfa89e3 in rpc_transport_notify () from /lib64/libgfrpc.so.0
#13 0x00007f5490be63d6 in socket_event_poll_in () from /usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so
#14 0x00007f5490be897c in socket_event_handler () from /usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so
#15 0x00007f549e23e1e6 in event_dispatch_epoll_worker () from /lib64/libglusterfs.so.0
#16 0x00007f549eaf1e25 in start_thread () from /lib64/libpthread.so.0
#17 0x00007f549d8b034d in clone () from /lib64/libc.so.6

Based on the core file, the only way this can happen is if not all of the shards were created.
(gdb) fr 2
#2  0x00007f54869d1927 in shard_common_inode_write_do (frame=0x7f548c0dbbe0, this=0x7f54800120d0) at shard.c:3883
3883            anon_fd = fd_anonymous (local->inode_list[i]);
(gdb) p i
$1 = 255
(gdb) p local->inode_list[i]
$2 = (inode_t *) 0x0
(gdb) p lical->inode_list[i-1]
No symbol "lical" in current context.
(gdb) p local->inode_list[i-1]
$3 = (inode_t *) 0x7f5474765440
(gdb) p local->offset
$4 = 0
(gdb) p local->num_blocks
$5 = 256

Based on this data, I went through the code and found two races:
1) In shard_common_mknod_cbk(), local->eexist_count is incremented without frame->lock.
2) In shard_common_lookup_shards_cbk(), local->create_count is incremented without frame->lock.

This can leave the counts lower than they need to be, so mknod is done on only 255 shards instead of 256.