Description of problem:
Adding redundancy (replication) to a distributed volume eventually blows up the FUSE mountpoint. Note: I started with a distributed volume and never tested whether the same thing happens with a single-brick volume. Additionally, until the mountpoint dies, ingesting files into it becomes significantly slower (from XX MB/s to YY KB/s). Looking at the resources of the machines involved shows no bottleneck, so this is not simply healing eating bandwidth/CPU.

How reproducible:
Not 100% deterministic, but it will fail eventually. Just access some files ('find' helps).

Steps to Reproduce:
1. Create a distributed volume STORAGE with two servers, each having one brick.
2. Add some files to it (everything is smooth).
3. 'add-brick STORAGE replica 2 ...' with two additional servers, each having one brick.

Actual results:
The mountpoint becomes slow and eventually blows up with: 'Transport endpoint is not connected (107)'

Expected results:
The ingest continues uninterrupted (the mountpoint stays alive). A slight slowdown would be OK because of the added replication (more bandwidth needed).

Additional info:
Please specify which logfiles would be helpful. I had this issue on 2/2 setups, so I bet you can replicate it. There seems to be no significant healing traffic between the nodes (Z MBit/s) and no high load. A 'gluster volume heal STORAGE info' times out after 10 minutes with return code 146. When I mount the volume again, it is missing complete directories; those are still present on the brick filesystems.
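For reference, this is roughly the command sequence behind the steps above (hostnames match my setup, the brick and mountpoint paths are placeholders):

# step 1: plain distributed volume across two servers, one brick each
gluster volume create STORAGE storage01:/bricks/b1 storage02:/bricks/b1
gluster volume start STORAGE
mount -t glusterfs storage01:/STORAGE /mnt/storage

# step 2: ingest data into the mountpoint (in my case an rsync from another box)
rsync -a /source/data/ /mnt/storage/

# step 3: convert to distribute-replicate by adding two more bricks
gluster volume add-brick STORAGE replica 2 storage05:/bricks/b1 storage06:/bricks/b1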
Doing a 'rebalance fix-layout' helps bring back (probably) all the files. However, I can't see any progress for hours (the file/traffic counters stay at zero), and the mountpoint is still unreliable and blows up very often. Over the last night I was finally able to see at least some traffic between the nodes, and the new bricks are slowly filling up. Will it be flaky the whole time?
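For completeness, this is roughly what I ran for the fix-layout and how I check its progress (volume name as above):

# re-hash existing directories so they also cover the new bricks
gluster volume rebalance STORAGE fix-layout start

# watch the counters (these stayed at zero for hours in my case)
gluster volume rebalance STORAGE status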
Please provide the logs of the mount point (/var/log/glusterfs/<path-to-mnt>.log) and the logs of the bricks (/var/log/glusterfs/bricks/...) from all the storage servers. From the steps to reproduce, can you explain if there was much I/O happening on the volume? Were new files being added constantly while converting the volume from distribute to distribute-replicate?
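Something like the following, run on the client and on each storage server, should capture all of those logs in one go (archive name is just a suggestion):

# collects the mountpoint log and the brick logs under /var/log/glusterfs
tar czf /tmp/glusterfs-logs-$(hostname).tar.gz /var/log/glusterfs/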
Sent logs and server metrics to Niels in private.
I did some more research on this. There seem to be some broken files in the gluster mountpoint; whenever you stat them, the mountpoint disconnects:

# ls -lisa ./8634:
ls: cannot access ./8634/copy: Software caused connection abort
ls: cannot access ./8634/random.part: Transport endpoint is not connected
ls: reading directory ./8634: Transport endpoint is not connected
total 215040
?????????? ? ?    ?            ?            ? copy
-rw-r--r-- 3 root root 104857600 Jan  4  2013 file
-rw-r--r-- 1 root root  10485760 Jan  4  2013 file.1st_copy
-rw-r--r-- 3 root root 104857600 Jan  4  2013 file.2nd_copy
?????????? ? ?    ?            ?            ? random.part

On the bricks this directory looks like:

storage01:
total 419840
25769803936      0 drwxr-xr-x 2 root root       110 Sep  7 16:46 .
         99      0 drwxr-xr-x 6 root root       165 Aug 29 08:52 ..
25769804012 102400 -rw-r--r-- 4 root root 104857600 Jan  4  2013 copy
25769804012 102400 -rw-r--r-- 4 root root 104857600 Jan  4  2013 file
25769804011  10240 -rw-r--r-- 2 root root  10485760 Jan  4  2013 file.1st_copy
25769804012 102400 -rw-r--r-- 4 root root 104857600 Jan  4  2013 file.2nd_copy
25769804013 102400 -rw-r--r-- 2 root root 104857600 Jan  4  2013 random.part

storage02:
total 215040
21474836640      0 drwxr-xr-x 2 root root        87 Sep  7 16:46 .
85899346016      0 drwxr-xr-x 6 root root        73 Aug 29 08:52 ..
21474836716      0 ---------T 2 root root         0 Jan 19 21:25 copy
21474836715 102400 -rw-r--r-- 2 root root 104857600 Jan  4  2013 file.big
21474836718  10240 -rw-r--r-- 2 root root  10485760 Jan  4  2013 file.small
21474836719 102400 -rw-r--r-- 2 root root 104857600 Jan  4  2013 random.full

storage05:
total 0
77322450916 0 drwxr-xr-x 2 root root 110 Jan 28 07:45 .
62290717518 0 drwxr-xr-x 6 root root 165 Jan 27 15:29 ..
77322572228 0 -rw-r--r-- 4 root root   0 Jan 27 15:30 copy
77322572228 0 -rw-r--r-- 4 root root   0 Jan 27 15:30 file
77322572227 0 -rw-r--r-- 2 root root   0 Jan 27 15:30 file.1st_copy
77322572228 0 -rw-r--r-- 4 root root   0 Jan 27 15:30 file.2nd_copy
77322572237 0 -rw-r--r-- 2 root root   0 Jan 28 07:45 random.part

storage06:
total 0
75162023687 0 drwxr-xr-x 2 root root  87 Jan 28 07:45 .
53687280280 0 drwxr-xr-x 6 root root  73 Jan 27 10:04 ..
75163018369 0 ---------T 2 root root   0 Jan 27 15:30 copy
75163018372 0 -rw-r--r-- 2 root root   0 Jan 28 07:45 file.big
75163018373 0 -rw-r--r-- 2 root root   0 Jan 28 07:45 file.small
75163018374 0 -rw-r--r-- 2 root root   0 Jan 28 07:45 random.full

For comparison, this is the directory on the source I rsynced this from:

total 634888
 4294967424      4 drwxr-xr-x 2 root root      4096 Sep  7 16:46 .
60467001539      4 drwxr-xr-x 6 root root      4096 Aug 29 08:52 ..
 4294968218 102400 -rw-r--r-- 3 root root 104857600 Jan  4  2013 copy
 4294968218 102400 -rw-r--r-- 3 root root 104857600 Jan  4  2013 file
 4294968221  10240 -rw-r--r-- 1 root root  10485760 Jan  4  2013 file.1st_copy
 4294968218 102400 -rw-r--r-- 3 root root 104857600 Jan  4  2013 file.2nd_copy
 4294968223 102400 -rw-r--r-- 1 root root 104857600 Jan  4  2013 file.big
 4294968225  10240 -rw-r--r-- 1 root root  10485760 Jan  4  2013 file.small
 4294968227 102400 -rw-r--r-- 1 root root 104857600 Jan  4  2013 random.full
 4294968235 102400 -rw-r--r-- 1 root root 104857600 Jan  4  2013 random.part

There are lots of locations with such unreadable, stat-crashing files.
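In case it helps with debugging: I can also dump the extended attributes of those zero-byte mode-1000 ('---------T') entries, which look like DHT link-to files, directly from a brick, e.g. (the brick path below is a placeholder for my real one):

# run on the storage server, against the brick filesystem, not the mountpoint
getfattr -d -m . -e hex /bricks/b1/8634/copy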
Some things happened behind the curtain, and I'd like to sum them up here for future reference. The issue appears to be solved by upgrading to 3.6.2. Unfortunately, that broke the volume completely: it didn't start until I manually upgraded the volume config on all nodes using 'glusterd --xlator-option *.upgrade=on -N' after killing glusterd. The error log message for the latter problem was:

09:48:04.312800] E [glusterd-handshake.c:771:__server_getspec] 0-glusterd: Unable to stat /var/lib/glusterd/vols/... (No such file or directory)

Thanks to Pranith for the help.
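For future readers, the sequence I used on each node was roughly the following (shown with 'service'; use whatever your init system provides):

# stop the management daemon first
service glusterd stop          # or kill the running glusterd process

# one-shot regeneration of the volume configs for the new version; -N keeps it in the foreground
glusterd --xlator-option '*.upgrade=on' -N

# start glusterd again
service glusterd start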