Description of problem:
Docker container going zombie (Zsl) after an undetermined issue with glusterfs/fuse. The container bind-mounts a directory on a gluster distributed-replicated volume.

Version-Release number of selected component (if applicable):
glusterfs 3.10.1 / Docker 1.12.6 / Kubernetes 1.6.1

How reproducible:
Bind-mount (host-mount) a volume into a container, pointing to a directory residing on a glusterfs-mounted filesystem.

Steps to Reproduce:
1. Launch a container in Docker/Kubernetes.
2. Host-mount a directory from a gluster filesystem on the host.
3. Produce IO load on the host.
(A command-level sketch of these steps follows at the end of this report.)

Actual results:
The container with the bind-mount will eventually go into a zombified state. The following trace is output to the kernel log:

[220525.696482] Call Trace:
[220525.697342]  [<ffffffff8168bbb9>] schedule+0x29/0x70
[220525.698210]  [<ffffffffa07e653d>] __fuse_request_send+0x13d/0x2c0 [fuse]
[220525.699081]  [<ffffffff810b17d0>] ? wake_up_atomic_t+0x30/0x30
[220525.699947]  [<ffffffffa07e66d2>] fuse_request_send+0x12/0x20 [fuse]
[220525.700813]  [<ffffffffa07f09c2>] fuse_fsync_common+0x1e2/0x230 [fuse]
[220525.701689]  [<ffffffffa07f0a21>] fuse_fsync+0x11/0x20 [fuse]
[220525.702560]  [<ffffffff8122ffb5>] do_fsync+0x65/0xa0
[220525.703428]  [<ffffffff812302a3>] SyS_fdatasync+0x13/0x20
[220525.704296]  [<ffffffff81696b09>] system_call_fastpath+0x16/0x1b
[220645.685650] INFO: task etcd:4083 blocked for more than 120 seconds.
[220645.686660] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[220645.687532] etcd            D 0000000000000000     0  4083   4011 0x00000084
[220645.688395]  ffff881fdf7bfe20 0000000000000086 ffff881ffefdaf10 ffff881fdf7bffd8
[220645.689272]  ffff881fdf7bffd8 ffff881fdf7bffd8 ffff881ffefdaf10 ffff881726280320
[220645.690145]  ffff883ffd69f000 ffff881fdf7bfe50 ffff881726280400 0000000000000000

The gluster volume and glusterd logs are mostly unremarkable; the last entries relating to the volume in question are:

[2017-04-23 00:05:34.188124] I [MSGID: 106143] [glusterd-pmap.c:277:pmap_registry_bind] 0-pmap: adding brick /data/glust-bricks/sdi-mnt/brickmnt on port 49153
[2017-04-23 13:30:57.919250] W [MSGID: 101095] [xlator.c:162:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/3.10.1/xlator/features/ganesha.so: cannot open shared object file: No such file or directory

Expected results:
A container using gluster storage does not freeze and go into the Zsl state.

Additional info:
Volume config:

Volume Name: hostmnt
Type: Striped-Replicate
Volume ID: 70ae0467-8ac5-414f-9634-d831aebbda59
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: infc1001:/data/glust-bricks/sdh-mnt/brickmnt
Brick2: infc1002:/data/glust-bricks/sdh-mnt/brickmnt
Brick3: infc1003:/data/glust-bricks/sdh-mnt/brickmnt
Brick4: infc1004:/data/glust-bricks/sdh-mnt/brickmnt
Brick5: infc1001:/data/glust-bricks/sdi-mnt/brickmnt
Brick6: infc1002:/data/glust-bricks/sdi-mnt/brickmnt
Brick7: infc1003:/data/glust-bricks/sdi-mnt/brickmnt
Brick8: infc1004:/data/glust-bricks/sdi-mnt/brickmnt
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
nfs.disable: on
transport.address-family: inet
performance.cache-size: 256MB

The volume appears to be fine on examination after the container enters the crashed state.
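For reference, a minimal command-level sketch of the reproduction steps above. The server/volume names, mount point, container image, and IO sizes are illustrative assumptions, not taken from the affected environment:

# 1. Mount the gluster volume on the host (server and volume names are examples).
mount -t glusterfs infc1001:/hostmnt /mnt/hostmnt

# 2. Launch a container that bind-mounts a directory from the gluster mount.
mkdir -p /mnt/hostmnt/testdir
docker run -d --name gluster-test -v /mnt/hostmnt/testdir:/data centos:7 sleep infinity

# 3. Produce IO load on the host against the gluster mount.
dd if=/dev/zero of=/mnt/hostmnt/testdir/load.bin bs=1M count=4096 oflag=direct &

# Meanwhile, have the container fsync to the bind mount; this exercises the
# same fdatasync path that appears blocked in the kernel trace above.
docker exec gluster-test sh -c 'dd if=/dev/zero of=/data/f.bin bs=1M count=100 conv=fdatasync'

# Once the hang occurs, the container process shows up in Zsl state:
ps axo pid,stat,comm | awk '$2 ~ /^Z/'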
Upon further examination, commands run against this volume (cp, rsync, etc.) appear to stall after performing roughly 100k of IO. A reboot appears to be the only way to restore the mount to normal operation.
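A quick way to confirm the stall and see where tasks are blocked (using the same illustrative mount point as above; the stack dump assumes sysrq is enabled on the host):

# Copies against the gluster mount hang; timeout makes the stall visible
# (exit code 124 means cp was killed after stalling).
timeout 60 cp /var/log/messages /mnt/hostmnt/testdir/; echo "cp exit code: $?"

# List tasks stuck in uninterruptible sleep (D) or zombie (Z) state.
ps axo pid,stat,wchan:32,comm | awk '$2 ~ /^[DZ]/'

# Dump kernel stacks of blocked tasks to the kernel log; hung tasks should
# show the same __fuse_request_send path as the trace in the description.
echo w > /proc/sysrq-trigger
dmesg | tail -n 60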
This bug is reported against a version of Gluster that is no longer maintained (or has been EOL'd). See https://www.gluster.org/release-schedule/ for the versions currently maintained. As a result, this bug is being closed. If the bug persists on a maintained version of gluster or against the mainline gluster repository, please request that it be reopened and set the Version field appropriately.