Bug 1445401 - Bind-mounting a GlusterFS into a docker container leads to zombified container and call trace when the host sees IO load.
Summary: Bind-mounting a GlusterFS into a docker container leads to zombified container and call trace when the host sees IO load.
Keywords:
Status: CLOSED EOL
Alias: None
Product: GlusterFS
Classification: Community
Component: fuse
Version: 3.10
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Csaba Henk
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-04-25 15:14 UTC by Will Boege
Modified: 2018-06-20 18:27 UTC
CC: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-06-20 18:27:03 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Will Boege 2017-04-25 15:14:23 UTC
Description of problem:

A Docker container goes zombie (Zsl) after an undetermined issue with glusterfs/fuse. The container bind-mounts a directory on a gluster distributed-replicated volume.

Version-Release number of selected component (if applicable):

glusterfs 3.10.1 / Docker version 1.12.6 / Kubernetes 1.6.1

How reproducible:

Bind-mount (host-mount) a volume into a container, pointing at a directory residing on a glusterfs-mounted filesystem (a command-level sketch follows the steps below).

Steps to Reproduce:
1. Launch a container in Docker/Kubernetes.
2. Host-mount a directory from a gluster filesystem on the host into the container.
3. Produce I/O load on the host.
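For illustration, a minimal command-level sketch of the above (the mount point, container image, and file names are assumptions for the example; the server and volume names are taken from the volume info further down):

# Mount the gluster volume on the host.
mount -t glusterfs infc1001:/hostmnt /mnt/hostmnt

# Bind-mount a directory from that mount into a container.
docker run -d --name repro -v /mnt/hostmnt/data:/data busybox sleep 86400

# Produce I/O load on the host against the same gluster mount.
dd if=/dev/zero of=/mnt/hostmnt/loadfile bs=1M count=1024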

Actual results:

The container with the bind-mount will eventually go into a zombified state. The following trace is written to the kernel log:

[220525.696482] Call Trace:
[220525.697342]  [<ffffffff8168bbb9>] schedule+0x29/0x70
[220525.698210]  [<ffffffffa07e653d>] __fuse_request_send+0x13d/0x2c0 [fuse]
[220525.699081]  [<ffffffff810b17d0>] ? wake_up_atomic_t+0x30/0x30
[220525.699947]  [<ffffffffa07e66d2>] fuse_request_send+0x12/0x20 [fuse]
[220525.700813]  [<ffffffffa07f09c2>] fuse_fsync_common+0x1e2/0x230 [fuse]
[220525.701689]  [<ffffffffa07f0a21>] fuse_fsync+0x11/0x20 [fuse]
[220525.702560]  [<ffffffff8122ffb5>] do_fsync+0x65/0xa0
[220525.703428]  [<ffffffff812302a3>] SyS_fdatasync+0x13/0x20
[220525.704296]  [<ffffffff81696b09>] system_call_fastpath+0x16/0x1b
[220645.685650] INFO: task etcd:4083 blocked for more than 120 seconds.
[220645.686660] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[220645.687532] etcd            D 0000000000000000     0  4083   4011 0x00000084
[220645.688395]  ffff881fdf7bfe20 0000000000000086 ffff881ffefdaf10 ffff881fdf7bffd8
[220645.689272]  ffff881fdf7bffd8 ffff881fdf7bffd8 ffff881ffefdaf10 ffff881726280320
[220645.690145]  ffff883ffd69f000 ffff881fdf7bfe50 ffff881726280400 0000000000000000
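The stack shows the etcd process blocked in fdatasync(), inside __fuse_request_send, i.e. apparently waiting for the gluster FUSE client to answer an FSYNC request. Not part of the original report, but the same fdatasync path over the mount can be exercised from the host with something like this (the path is an assumption):

# dd issues fdatasync() on the output file before exiting, which goes
# through fuse_fsync_common just like the blocked etcd call above.
dd if=/dev/zero of=/mnt/hostmnt/fsync-test bs=4k count=1 conv=fdatasync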

Gluster volume and glusterd logs are mostly unremarkable; the last entries relating to the volume in question are:

[2017-04-23 00:05:34.188124] I [MSGID: 106143] [glusterd-pmap.c:277:pmap_registry_bind] 0-pmap: adding brick /data/glust-bricks/sdi-mnt/brickmnt on port 49153
[2017-04-23 13:30:57.919250] W [MSGID: 101095] [xlator.c:162:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/3.10.1/xlator/features/ganesha.so: cannot open shared object file: No such file or directory

Expected results:

The container using gluster storage does not freeze and go into the Zsl state.

Additional info:

Volume config:

Volume Name: hostmnt
Type: Striped-Replicate
Volume ID: 70ae0467-8ac5-414f-9634-d831aebbda59
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: infc1001:/data/glust-bricks/sdh-mnt/brickmnt
Brick2: infc1002:/data/glust-bricks/sdh-mnt/brickmnt
Brick3: infc1003:/data/glust-bricks/sdh-mnt/brickmnt
Brick4: infc1004:/data/glust-bricks/sdh-mnt/brickmnt
Brick5: infc1001:/data/glust-bricks/sdi-mnt/brickmnt
Brick6: infc1002:/data/glust-bricks/sdi-mnt/brickmnt
Brick7: infc1003:/data/glust-bricks/sdi-mnt/brickmnt
Brick8: infc1004:/data/glust-bricks/sdi-mnt/brickmnt
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
nfs.disable: on
transport.address-family: inet
performance.cache-size: 256MB


Volume appears to be fine on examination after the container enters a crashed state.
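The exact commands used for that examination are not recorded in this report; checking volume health from one of the servers would typically look something like:

gluster volume info hostmnt
gluster volume status hostmnt detail
gluster volume heal hostmnt info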

Comment 2 Will Boege 2017-04-25 16:20:35 UTC
Upon further examination, commands run against this volume (cp, rsync, etc.) appear to stall after performing ~100k of IO. A reboot seems to be the only way to restore normal operation.
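Not from the original report, but a generic way to confirm this state from the host is to look for processes stuck in uninterruptible sleep (D state) and for hung-task warnings like the one quoted in the description:

# Processes in uninterruptible sleep, typically parked in a fuse wait channel.
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /D/'

# Hung-task warnings in the kernel log.
dmesg | grep -i "blocked for more than"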

Comment 3 Shyamsundar 2018-06-20 18:27:03 UTC
This bug is reported against a version of Gluster that is no longer maintained (or has been EOL'd). See https://www.gluster.org/release-schedule/ for the versions currently maintained.

As a result this bug is being closed.

If the bug persists on a maintained version of gluster or against the mainline gluster repository, please request that it be reopened and that the Version field be set appropriately.

