+++ This bug was initially created as a clone of Bug #1188242 +++

Description of problem:
=======================
The volume was FUSE-mounted on the client, and iozone was run on 10 files in parallel using the command below. The gluster client process crashed, and a subsequent cd into the mount point fails with a "Transport endpoint is not connected" message.

for i in `seq 1 10`; do /opt/iozone3_430/src/current/iozone -az -i0 -i1 & done

Version-Release number of selected component (if applicable):
=============================================================
glusterfs 3.7dev built on Feb 2 2015 01:04:49

Package Information:
====================
Downloaded from : http://download.gluster.org/pub/gluster/glusterfs/nightly/glusterfs/epel-6-x86_64/glusterfs-3.7dev-0.555.gite927623.autobuild/

How reproducible:
=================
100%

Steps to Reproduce:
===================
1. Create a FUSE mount.
2. Run iozone in parallel:
   for i in `seq 1 10`; do ./iozone3_430/src/current/iozone -az -i0 -i1 & done

Number of volumes:
==================
1

Volume Names:
=============
testvol

Volume on which the particular issue is seen [ if applicable ]:
===============================================================
testvol

Type of volumes:
================
Disperse (1 x (4 + 2))

Volume options if available:
============================
[root@dhcp37-178 ~]# gluster volume get testvol all
Option                                  Value
------                                  -----
cluster.lookup-unhashed                 on
cluster.min-free-disk                   10%
cluster.min-free-inodes                 5%
cluster.rebalance-stats                 off
cluster.subvols-per-directory           (null)
cluster.readdir-optimize                off
cluster.rsync-hash-regex                (null)
cluster.extra-hash-regex                (null)
cluster.dht-xattr-name                  trusted.glusterfs.dht
cluster.randomize-hash-range-by-gfid    off
cluster.local-volume-name               (null)
cluster.weighted-rebalance              on
cluster.switch-pattern                  (null)
cluster.entry-change-log                on
cluster.read-subvolume                  (null)
cluster.read-subvolume-index            -1
cluster.read-hash-mode                  1
cluster.background-self-heal-count      16
cluster.metadata-self-heal              on
cluster.data-self-heal                  on
cluster.entry-self-heal                 on
cluster.self-heal-daemon                on
cluster.self-heal-window-size           1
cluster.data-change-log                 on
cluster.metadata-change-log             on
cluster.data-self-heal-algorithm        (null)
cluster.eager-lock                      on
cluster.quorum-type                     none
cluster.quorum-count                    (null)
cluster.choose-local                    true
cluster.self-heal-readdir-size          1KB
cluster.post-op-delay-secs              1
cluster.ensure-durability               on
cluster.stripe-block-size               128KB
cluster.stripe-coalesce                 true
diagnostics.latency-measurement         off
diagnostics.dump-fd-stats               off
diagnostics.count-fop-hits              off
diagnostics.brick-log-level             INFO
diagnostics.client-log-level            INFO
diagnostics.brick-sys-log-level         CRITICAL
diagnostics.client-sys-log-level        CRITICAL
diagnostics.brick-logger                (null)
diagnostics.client-logger               (null)
diagnostics.brick-log-format            (null)
diagnostics.client-log-format           (null)
diagnostics.brick-log-buf-size          5
diagnostics.client-log-buf-size         5
diagnostics.brick-log-flush-timeout     120
diagnostics.client-log-flush-timeout    120
performance.cache-max-file-size         0
performance.cache-min-file-size         0
performance.cache-refresh-timeout       1
performance.cache-priority
performance.cache-size                  32MB
performance.io-thread-count             16
performance.high-prio-threads           16
performance.normal-prio-threads         16
performance.low-prio-threads            16
performance.least-prio-threads          1
performance.enable-least-priority       on
performance.least-rate-limit            0
performance.cache-size                  128MB
performance.flush-behind                on
performance.nfs.flush-behind            on
performance.write-behind-window-size    1MB
performance.nfs.write-behind-window-size 1MB
performance.strict-o-direct             off
performance.nfs.strict-o-direct         off
performance.strict-write-ordering       off
performance.nfs.strict-write-ordering   off
performance.lazy-open                   yes
performance.read-after-open             no
performance.read-ahead-page-count       4
performance.md-cache-timeout            1
features.encryption                     off
encryption.master-key                   (null)
encryption.data-key-size                256
encryption.block-size                   4096
network.frame-timeout                   1800
network.ping-timeout                    42
network.tcp-window-size                 (null)
features.lock-heal                      off
features.grace-timeout                  10
network.remote-dio                      disable
network.tcp-window-size                 (null)
network.inode-lru-limit                 16384
auth.allow                              *
auth.reject                             (null)
transport.keepalive                     (null)
server.allow-insecure                   (null)
server.root-squash                      off
server.anonuid                          65534
server.anongid                          65534
server.statedump-path                   /var/run/gluster
server.outstanding-rpc-limit            64
features.lock-heal                      off
features.grace-timeout                  (null)
server.ssl                              (null)
auth.ssl-allow                          *
server.manage-gids                      off
client.send-gids                        on
server.gid-timeout                      2
server.own-thread                       (null)
performance.write-behind                on
performance.read-ahead                  on
performance.readdir-ahead               off
performance.io-cache                    on
performance.quick-read                  on
performance.open-behind                 on
performance.stat-prefetch               on
performance.client-io-threads           off
performance.nfs.write-behind            on
performance.nfs.read-ahead              off
performance.nfs.io-cache                off
performance.nfs.quick-read              off
performance.nfs.stat-prefetch           off
performance.nfs.io-threads              off
performance.force-readdirp              true
features.file-snapshot                  off
features.uss                            off
features.snapshot-directory             .snaps
features.show-snapshot-directory        off
network.compression                     off
network.compression.window-size         -15
network.compression.mem-level           8
network.compression.min-size            0
network.compression.compression-level   -1
network.compression.debug               false
features.limit-usage                    (null)
features.quota-timeout                  0
features.default-soft-limit             80%
features.soft-timeout                   60
features.hard-timeout                   5
features.alert-time                     86400
features.quota-deem-statfs              off
geo-replication.indexing                off
geo-replication.indexing                off
geo-replication.ignore-pid-check        off
geo-replication.ignore-pid-check        off
features.quota                          on
debug.trace                             off
debug.log-history                       no
debug.log-file                          no
debug.exclude-ops                       (null)
debug.include-ops                       (null)
debug.error-gen                         off
debug.error-failure                     (null)
debug.error-number                      (null)
debug.random-failure                    off
debug.error-fops                        (null)
nfs.enable-ino32                        no
nfs.mem-factor                          15
nfs.export-dirs                         on
nfs.export-volumes                      on
nfs.addr-namelookup                     off
nfs.dynamic-volumes                     off
nfs.register-with-portmap               on
nfs.outstanding-rpc-limit               16
nfs.port                                2049
nfs.rpc-auth-unix                       on
nfs.rpc-auth-null                       on
nfs.rpc-auth-allow                      all
nfs.rpc-auth-reject                     none
nfs.ports-insecure                      off
nfs.trusted-sync                        off
nfs.trusted-write                       off
nfs.volume-access                       read-write
nfs.export-dir
nfs.disable                             false
nfs.nlm                                 on
nfs.acl                                 on
nfs.mount-udp                           off
nfs.mount-rmtab                         /var/lib/glusterd/nfs/rmtab
nfs.rpc-statd                           /sbin/rpc.statd
nfs.server-aux-gids                     off
nfs.drc                                 off
nfs.drc-size                            0x20000
nfs.read-size                           (1 * 1048576ULL)
nfs.write-size                          (1 * 1048576ULL)
nfs.readdir-size                        (1 * 1048576ULL)
features.read-only                      off
features.worm                           off
storage.linux-aio                       off
storage.batch-fsync-mode                reverse-fsync
storage.batch-fsync-delay-usec          0
storage.owner-uid                       -1
storage.owner-gid                       -1
storage.node-uuid-pathinfo              off
storage.health-check-interval           30
storage.build-pgfid                     off
storage.bd-aio                          off
cluster.server-quorum-type              off
cluster.server-quorum-ratio             0
changelog.changelog                     off
changelog.changelog-dir                 (null)
changelog.encoding                      ascii
changelog.rollover-time                 15
changelog.fsync-interval                5
changelog.changelog-barrier-timeout     120
features.barrier                        disable
features.barrier-timeout                120
locks.trace                             disable
cluster.disperse-self-heal-daemon       enable
[root@dhcp37-178 ~]#

Output of gluster volume info:
==============================
[root@dhcp37-178 ~]# gluster v info

Volume Name: testvol
Type: Disperse
Volume ID: ad1a31fb-2e69-4d5d-9ae0-d057879b8fd5
Status: Started
Number of Bricks: 1 x (4 + 2) = 6
Transport-type: tcp
Bricks:
Brick1: dhcp37-120:/var/run/gluster/snaps/1e9ced492e2048cf9f906f45a4869238/brick1/b1
Brick2: dhcp37-208:/var/run/gluster/snaps/1e9ced492e2048cf9f906f45a4869238/brick2/b1
Brick3: dhcp37-178:/var/run/gluster/snaps/1e9ced492e2048cf9f906f45a4869238/brick3/b1
Brick4: dhcp37-183:/var/run/gluster/snaps/1e9ced492e2048cf9f906f45a4869238/brick4/b1
Brick5: dhcp37-120:/var/run/gluster/snaps/1e9ced492e2048cf9f906f45a4869238/brick5/b2
Brick6: dhcp37-208:/var/run/gluster/snaps/1e9ced492e2048cf9f906f45a4869238/brick6/b2
Options Reconfigured:
features.uss: off
features.quota: on
[root@dhcp37-178 ~]#

Output of gluster volume status:
================================
[root@dhcp37-178 ~]# gluster v status
Status of volume: testvol
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick dhcp37-120:/var/run/gluster/snaps/1e9ced492e2048cf9f906f45a4869238/brick1/b1      49156   Y       3225
Brick dhcp37-208:/var/run/gluster/snaps/1e9ced492e2048cf9f906f45a4869238/brick2/b1      49167   Y       3238
Brick dhcp37-178:/var/run/gluster/snaps/1e9ced492e2048cf9f906f45a4869238/brick3/b1      49166   Y       3192
Brick dhcp37-183:/var/run/gluster/snaps/1e9ced492e2048cf9f906f45a4869238/brick4/b1      49166   Y       3173
Brick dhcp37-120:/var/run/gluster/snaps/1e9ced492e2048cf9f906f45a4869238/brick5/b2      49157   Y       3236
Brick dhcp37-208:/var/run/gluster/snaps/1e9ced492e2048cf9f906f45a4869238/brick6/b2      49168   Y       3249
NFS Server on localhost                                 2049    Y       3206
Quota Daemon on localhost                               N/A     Y       3221
NFS Server on dhcp37-208                                2049    Y       3262
Quota Daemon on dhcp37-208                              N/A     Y       3276
NFS Server on dhcp37-183                                2049    Y       3186
Quota Daemon on dhcp37-183                              N/A     Y       3199
NFS Server on 10.70.37.120                              2049    Y       3250
Quota Daemon on 10.70.37.120                            N/A     Y       3263

Task Status of Volume testvol
------------------------------------------------------------------------------
There are no active volume tasks

[root@dhcp37-178 ~]#

Actual results:
===============
The gluster client process crashed.

Expected results:
=================
The client should not crash.

Additional info:
================
Attaching the client mount log.
--- Additional comment from Bhaskarakiran on 2015-02-24 06:33:12 EST ---

--- Additional comment from Bhaskarakiran on 2015-02-24 06:34:39 EST ---

Log snippet:
============
pending frames:
frame : type(1) op(LOOKUP)
frame : type(1) op(LOOKUP)
frame : type(1) op(FTRUNCATE)
frame : type(0) op(0)
frame : type(1) op(UNLINK)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(1) op(FLUSH)
frame : type(1) op(STAT)
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2015-02-24 11:41:47
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.7dev
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb6)[0x306ae20aa6]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x33f)[0x306ae3bdcf]
/lib64/libc.so.6[0x342d4326a0]
/usr/lib64/glusterfs/3.7dev/xlator/cluster/distribute.so(dht_writev_cbk+0x268)[0x7f300993cbf8]
/usr/lib64/libglusterfs.so.0(default_writev_cbk+0xcc)[0x306ae2e5ec]
/usr/lib64/glusterfs/3.7dev/xlator/cluster/disperse.so(ec_manager_writev+0x10d)[0x7f3009b8647d]
/usr/lib64/glusterfs/3.7dev/xlator/cluster/disperse.so(__ec_manager+0x34)[0x7f3009b6a654]
/usr/lib64/glusterfs/3.7dev/xlator/cluster/disperse.so(ec_resume+0x91)[0x7f3009b6a461]
/usr/lib64/glusterfs/3.7dev/xlator/cluster/disperse.so(ec_combine+0x196)[0x7f3009b88fa6]
/usr/lib64/glusterfs/3.7dev/xlator/cluster/disperse.so(ec_writev_cbk+0x27b)[0x7f3009b844bb]
/usr/lib64/glusterfs/3.7dev/xlator/protocol/client.so(client3_3_writev_cbk+0x6cc)[0x7f3009de301c]
/usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa5)[0x306aa0ea65]
/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x142)[0x306aa0ff02]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x28)[0x306aa0b5f8]
/usr/lib64/glusterfs/3.7dev/rpc-transport/socket.so(+0x9759)[0x7f30103fc759]
/usr/lib64/glusterfs/3.7dev/rpc-transport/socket.so(+0xb1bd)[0x7f30103fe1bd]
/usr/lib64/libglusterfs.so.0[0x306ae78ffc]
/lib64/libpthread.so.0[0x342d8079d1]
/lib64/libc.so.6(clone+0x6d)[0x342d4e89dd]
---------

--- Additional comment from Ashish Pandey on 2015-03-03 23:59:25 EST ---

dht_fsync_cbk() is called with op_ret = -1, op_errno = 2 (ENOENT), and both prebuf and postbuf NULL. Inside dht_fsync_cbk(), the error-handling check skips the op_errno = ENOENT case (if (op_ret == -1 && !dht_inode_missing(op_errno))), which lets control fall through to if (IS_DHT_MIGRATION_PHASE1 (postbuf)). The IS_DHT_MIGRATION_PHASE1 macro accesses the file attributes through the postbuf pointer, which is NULL; this leads to the crash.

The change for bug 960843 excluded op_errno = ENOENT from the error handling. We need to investigate the reason for skipping the ENOENT case, and also modify the macro definitions to handle NULL pointers properly.

--- Additional comment from Pranith Kumar K on 2015-03-09 02:44:05 EDT ---

Ashish,
I just realized that fsync on an active fd should never return ESTALE/ENOENT, as the fd is already open on the file. Why is EC returning this error? Could this be an EC bug after all?
Pranith

--- Additional comment from Anand Avati on 2015-04-09 08:21:47 EDT ---

REVIEW: http://review.gluster.org/10176 (cluster/ec: Use fd instead of loc for get_size_version) posted (#1) for review on master by Ashish Pandey (aspandey)

--- Additional comment from Anand Avati on 2015-04-13 07:19:29 EDT ---

REVIEW: http://review.gluster.org/10176 (cluster/ec: Use fd instead of loc for get_size_version) posted (#2) for review on master by Ashish Pandey (aspandey)

--- Additional comment from Anand Avati on 2015-04-13 07:19:32 EDT ---

REVIEW: http://review.gluster.org/10218 (Comments implemeted) posted (#1) for review on master by Ashish Pandey (aspandey)

--- Additional comment from Anand Avati on 2015-04-14 05:45:23 EDT ---

REVIEW: http://review.gluster.org/10176 (cluster/ec: Use fd instead of loc for get_size_version) posted (#3) for review on master by Ashish Pandey (aspandey)

--- Additional comment from Anand Avati on 2015-04-28 02:06:23 EDT ---

REVIEW: http://review.gluster.org/10176 (cluster/ec: Use fd instead of loc for get_size_version) posted (#4) for review on master by Ashish Pandey (aspandey)

--- Additional comment from Anand Avati on 2015-05-01 11:04:55 EDT ---

REVIEW: http://review.gluster.org/10176 (cluster/ec: Use fd instead of loc for get_size_version) posted (#5) for review on master by Ashish Pandey (aspandey)

--- Additional comment from Anand Avati on 2015-05-03 07:46:40 EDT ---

REVIEW: http://review.gluster.org/10176 (cluster/ec: Use fd instead of loc for get_size_version) posted (#6) for review on master by Ashish Pandey (aspandey)

--- Additional comment from Anand Avati on 2015-05-04 07:37:02 EDT ---

REVIEW: http://review.gluster.org/10176 (cluster/ec: Use fd instead of loc for get_size_version) posted (#7) for review on master by Ashish Pandey (aspandey)

--- Additional comment from Anand Avati on 2015-05-04 22:43:51 EDT ---

COMMIT: http://review.gluster.org/10176 committed in master by Pranith Kumar Karampuri (pkarampu)

------

commit 582b252e3a418ee332cf3d4b1a415520e242b599
Author: Ashish Pandey <aspandey>
Date:   Thu Apr 9 17:27:46 2015 +0530

    cluster/ec: Use fd instead of loc for get_size_version

    Change-Id: Ia7d43cb3b222db34ecb0e35424f1766715ed8e6a
    BUG: 1188242
    Signed-off-by: Ashish Pandey <aspandey>
    Reviewed-on: http://review.gluster.org/10176
    Reviewed-by: Xavier Hernandez <xhernandez>
    Tested-by: Gluster Build System <jenkins.com>
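The control-flow flaw Ashish describes above can be sketched as a small Python model of the C logic; the flag value and return strings below are illustrative, not the actual GlusterFS source:

```python
import errno

# Illustrative flag; the real macro checks DHT migration state in the iatt.
DHT_MIGRATION_PHASE1 = 0x1

def is_dht_migration_phase1(postbuf):
    # Models IS_DHT_MIGRATION_PHASE1: it dereferences postbuf
    # unconditionally, so a NULL (here: None) postbuf crashes.
    return bool(postbuf["ia_flags"] & DHT_MIGRATION_PHASE1)

def dht_fsync_cbk(op_ret, op_errno, postbuf):
    # The check from bug 960843 exempts ENOENT/ESTALE from error
    # handling (modeling !dht_inode_missing(op_errno)) ...
    if op_ret == -1 and op_errno not in (errno.ENOENT, errno.ESTALE):
        return "unwind-error"
    # ... so control reaches the migration check with postbuf == None.
    # Guarding the pointer, as the comment suggests, avoids the crash.
    if postbuf is not None and is_dht_migration_phase1(postbuf):
        return "retry-migration"
    return "unwind-ok"
```

Without the `postbuf is not None` guard, the ENOENT path would index into `None`, the Python analogue of the reported SIGSEGV.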
REVIEW: http://review.gluster.org/10626 (Adding 64 bits in "version" key of extended attributes. First 64 bits (Left) represents Data version. Last 64 bits (right) represents Meta Data version.) posted (#1) for review on release-3.7 by Ashish Pandey (aspandey)
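The xattr layout described in the review above (first 64 bits for the data version, last 64 bits for the metadata version) can be sketched as follows; the helper names are hypothetical and big-endian packing is an assumption:

```python
import struct

def pack_version_xattr(data_version, metadata_version):
    # Hypothetical helper: 16-byte xattr value, first 8 bytes = data
    # version, last 8 bytes = metadata version (big-endian assumed).
    return struct.pack(">QQ", data_version, metadata_version)

def unpack_version_xattr(value):
    # Inverse of pack_version_xattr: split the 16-byte value back
    # into the two 64-bit counters.
    data_version, metadata_version = struct.unpack(">QQ", value)
    return data_version, metadata_version
```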
REVIEW: http://review.gluster.org/10625 (cluster/ec: Use fd instead of loc for get_size_version) posted (#3) for review on release-3.7 by Ashish Pandey (aspandey)
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.7.0, please open a new bug report.

glusterfs-3.7.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/10939
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user
Patch http://review.gluster.org/#/c/11326/ fixes this issue in the quota xlator.
COMMIT: http://review.gluster.org/11326 committed in release-3.7 by Raghavendra G (rgowdapp)

------

commit 4673b50ecf8ed55b7d8bde55e9580cfde748ef0a
Author: vmallika <vmallika>
Date:   Thu Jun 18 12:02:50 2015 +0530

    quota: allow writes when with ENOENT/ESTALE on active fd

    This is a backport of http://review.gluster.org/#/c/11307/

    > We may get ENOENT/ESTALE in case of below scenario
    >     fd = open file.txt
    >     unlink file.txt
    >     write on fd
    > Here build_ancestry can fail as the file is removed.
    > For now ignore ENOENT/ESTALE on active fd with
    > writev and fallocate.
    > We need to re-visit this code once we understand
    > how other file-system behave in this scenario
    >
    > Below patch fixes the issue in DHT:
    > http://review.gluster.org/#/c/11097
    >
    > Change-Id: I7be683583b808c280e3ea2ddd036c1558a6d53e5
    > BUG: 1188242
    > Signed-off-by: vmallika <vmallika>

    Change-Id: Ic836d200689fe6f27d4675bc0ff89063b7dc3882
    BUG: 1219358
    Signed-off-by: vmallika <vmallika>
    Reviewed-on: http://review.gluster.org/11326
    Tested-by: NetBSD Build System <jenkins.org>
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Raghavendra G <rgowdapp>
    Tested-by: Raghavendra G <rgowdapp>
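The scenario quoted in the commit message relies on standard POSIX behaviour: a write on an already-open fd succeeds even after the file is unlinked, while path-based lookups fail with ENOENT. A minimal sketch of that behaviour on a local filesystem:

```python
import os
import tempfile

# On POSIX, unlink removes the directory entry but the inode survives
# while an fd holds it open, so the write itself succeeds; only
# path-based lookups (as in build_ancestry) see ENOENT.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "file.txt")
fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
os.unlink(path)                            # name is gone ...
written = os.write(fd, b"still writable")  # ... but the fd still works
os.close(fd)
os.rmdir(tmpdir)
```

This is why the patch treats ENOENT/ESTALE on an active fd as ignorable for writev and fallocate rather than failing the fop.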
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.7.3, please open a new bug report.

glusterfs-3.7.3 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/12078
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user