Bug 1185950

Summary: adding replication to a distributed volume makes the volume unavailable
Product: [Community] GlusterFS
Component: replicate
Version: 3.6.1
Hardware: x86_64
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: urgent
Priority: high
Keywords: Triaged
Reporter: pille <pille+redhat+bugzilla>
Assignee: bugs <bugs>
CC: bugs, gluster-bugs, jbyers, ndevos, pille+redhat+bugzilla
Doc Type: Bug Fix
Type: Bug
Last Closed: 2015-02-18 09:32:44 UTC

Description pille 2015-01-26 16:36:47 UTC
Description of problem:
Adding redundancy (replication) to a distributed volume blows up the mountpoint (FUSE).
Note: I started with a distributed volume and never tried whether the same happens with a single-brick volume.
Additionally, until the mountpoint fails, ingesting files into it becomes significantly slower (from XX MB/s to YY KB/s). Looking at the resources of the machines involved doesn't show any bottleneck, so this is not simply healing consuming bandwidth/CPU.

How reproducible:
Not 100% deterministic, but it will fail eventually; just accessing some files triggers it ('find' helps).

Steps to Reproduce:
1. create distributed volume STORAGE with two servers each having one brick
2. add some files to it (everything is smooth)
3. 'add-brick STORAGE replica 2 ...' with two additional servers, each having one brick (a hedged command sketch follows below)
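
For reference, a minimal sketch of these steps as gluster CLI commands, not the exact commands used here; the hostnames follow the servers shown in comment 5, while the brick path (/bricks/storage), mount target, and rsync source are assumptions:

    # 1. distributed volume: one brick on each of two servers
    gluster volume create STORAGE storage01:/bricks/storage storage02:/bricks/storage
    gluster volume start STORAGE
    mount -t glusterfs storage01:/STORAGE /mnt/storage

    # 2. ingest some files through the FUSE mountpoint
    rsync -a /some/source/ /mnt/storage/

    # 3. convert to distribute-replicate: replica 2, one new brick per existing brick
    gluster volume add-brick STORAGE replica 2 storage05:/bricks/storage storage06:/bricks/storage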

Actual results:
The mountpoint becomes slow and eventually blows up with: 'Transport endpoint is not connected (107)'

Expected results:
Ingest continues uninterrupted (the mountpoint stays alive).
A slight slowdown would be OK because of the added replication (more bandwidth used for the extra copies).

Additional info:
Please specify which logfiles would be helpful.
I had this issue on 2/2 setups, so I bet you can reproduce it.
There seems to be no significant healing traffic between the nodes (Z MBit/s) and no high load.
'gluster volume heal STORAGE info' times out after 10 minutes with return code 146.
When I mount this volume again, it is missing complete directories; those are still present on the brick filesystems.
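
A hedged sketch of those two checks (the mount target, brick path, and directory name are illustrative assumptions):

    # capture the heal-info exit status; observed here: hangs ~10 minutes, then returns 146
    gluster volume heal STORAGE info; echo "exit code: $?"

    # a directory missing from the client view is still present on a brick
    ls /mnt/storage/some-missing-dir       # fails or shows nothing on the mountpoint
    ls /bricks/storage/some-missing-dir    # contents still visible directly on the brick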

Comment 1 pille 2015-01-27 11:30:00 UTC
Doing a 'rebalance fix-layout' helps bring back (probably) all the files. However, I can't see any progress for hours (the file/traffic counters stay at zero), and the mountpoint is still unreliable and blows up very often.
Over the last night I was able to see at least some traffic between the nodes, and the new bricks are filling slowly. Will it be flaky the whole time?
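
For reference, a minimal sketch of that fix-layout run (standard gluster CLI; only the volume name is taken from this report):

    # make existing directories aware of the newly added bricks
    gluster volume rebalance STORAGE fix-layout start

    # check progress of the fix-layout run
    gluster volume rebalance STORAGE status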

Comment 2 Niels de Vos 2015-01-27 12:30:28 UTC
Please provide the logs of the mount point (/var/log/glusterfs/<path-to-mnt>.log) and the logs of the bricks (/var/log/glusterfs/bricks/...) from all the storage servers.

From the steps to reproduce, can you explain if there was much I/O happening on the volume? Were new files being added constantly while converting the volume from distribute to distribute-replicate?
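
A minimal sketch for collecting those logs on each machine (the log paths are the defaults named above; the archive name is illustrative):

    # run on the client and on every storage server
    tar czf /tmp/glusterfs-logs-$(hostname).tar.gz \
        /var/log/glusterfs/*.log \
        /var/log/glusterfs/bricks/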

Comment 3 pille 2015-01-28 14:57:23 UTC
Sent logs and server metrics to Niels in private.

Comment 5 pille 2015-01-29 15:10:22 UTC
I did some more research on this. There seem to be some broken files in the gluster mountpoint; whenever you stat them, the mountpoint disconnects:

# ls -lisa

./8634:
ls: cannot access ./8634/copy: Software caused connection abort
ls: cannot access ./8634/random.part: Transport endpoint is not connected
ls: reading directory ./8634: Transport endpoint is not connected
total 215040
?????????? ? ?    ?            ?            ? copy
-rw-r--r-- 3 root root 104857600 Jan  4  2013 file
-rw-r--r-- 1 root root  10485760 Jan  4  2013 file.1st_copy 
-rw-r--r-- 3 root root 104857600 Jan  4  2013 file.2nd_copy 
?????????? ? ?    ?            ?            ? random.part

On the bricks this directory looks like:
storage01:
    total 419840
    25769803936      0 drwxr-xr-x 2 root root       110 Sep  7 16:46 .
             99      0 drwxr-xr-x 6 root root       165 Aug 29 08:52 ..
    25769804012 102400 -rw-r--r-- 4 root root 104857600 Jan  4  2013 copy
    25769804012 102400 -rw-r--r-- 4 root root 104857600 Jan  4  2013 file
    25769804011  10240 -rw-r--r-- 2 root root  10485760 Jan  4  2013 file.1st_copy
    25769804012 102400 -rw-r--r-- 4 root root 104857600 Jan  4  2013 file.2nd_copy
    25769804013 102400 -rw-r--r-- 2 root root 104857600 Jan  4  2013 random.part
storage02:
    total 215040
    21474836640      0 drwxr-xr-x 2 root root        87 Sep  7 16:46 .
    85899346016      0 drwxr-xr-x 6 root root        73 Aug 29 08:52 ..
    21474836716      0 ---------T 2 root root         0 Jan 19 21:25 copy
    21474836715 102400 -rw-r--r-- 2 root root 104857600 Jan  4  2013 file.big
    21474836718  10240 -rw-r--r-- 2 root root  10485760 Jan  4  2013 file.small
    21474836719 102400 -rw-r--r-- 2 root root 104857600 Jan  4  2013 random.full
storage05:
    total 0
    77322450916 0 drwxr-xr-x 2 root root 110 Jan 28 07:45 .
    62290717518 0 drwxr-xr-x 6 root root 165 Jan 27 15:29 ..
    77322572228 0 -rw-r--r-- 4 root root   0 Jan 27 15:30 copy
    77322572228 0 -rw-r--r-- 4 root root   0 Jan 27 15:30 file
    77322572227 0 -rw-r--r-- 2 root root   0 Jan 27 15:30 file.1st_copy
    77322572228 0 -rw-r--r-- 4 root root   0 Jan 27 15:30 file.2nd_copy
    77322572237 0 -rw-r--r-- 2 root root   0 Jan 28 07:45 random.part
storage06:
    total 0
    75162023687 0 drwxr-xr-x 2 root root 87 Jan 28 07:45 .
    53687280280 0 drwxr-xr-x 6 root root 73 Jan 27 10:04 ..
    75163018369 0 ---------T 2 root root  0 Jan 27 15:30 copy
    75163018372 0 -rw-r--r-- 2 root root  0 Jan 28 07:45 file.big
    75163018373 0 -rw-r--r-- 2 root root  0 Jan 28 07:45 file.small
    75163018374 0 -rw-r--r-- 2 root root  0 Jan 28 07:45 random.full

For comparison, this is the directory on the source I rsynced this from:
total 634888
 4294967424      4 drwxr-xr-x 2 root root      4096 Sep  7 16:46 .
60467001539      4 drwxr-xr-x 6 root root      4096 Aug 29 08:52 ..
 4294968218 102400 -rw-r--r-- 3 root root 104857600 Jan  4  2013 copy
 4294968218 102400 -rw-r--r-- 3 root root 104857600 Jan  4  2013 file
 4294968221  10240 -rw-r--r-- 1 root root  10485760 Jan  4  2013 file.1st_copy
 4294968218 102400 -rw-r--r-- 3 root root 104857600 Jan  4  2013 file.2nd_copy
 4294968223 102400 -rw-r--r-- 1 root root 104857600 Jan  4  2013 file.big
 4294968225  10240 -rw-r--r-- 1 root root  10485760 Jan  4  2013 file.small
 4294968227 102400 -rw-r--r-- 1 root root 104857600 Jan  4  2013 random.full
 4294968235 102400 -rw-r--r-- 1 root root 104857600 Jan  4  2013 random.part


There are lots of locations where those unreadable, stat-crashing files exist.

Comment 7 pille 2015-02-18 09:32:44 UTC
Some things happened behind the curtain, and I'd like to sum them up here for future reference.

The issue appears to be solved by upgrading to 3.6.2.
Unfortunately, the upgrade broke the volume completely: it didn't start until I killed glusterd on all nodes and manually upgraded the volume config using 'glusterd --xlator-option *.upgrade=on -N'.
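
A minimal sketch of that recovery step (the upgrade invocation is the one quoted above; the service command is an assumption and depends on the distribution's init system):

    # on every storage server
    pkill glusterd                                # stop the management daemon; brick processes keep running
    glusterd --xlator-option '*.upgrade=on' -N    # regenerate the volume config in the foreground, then exit
    service glusterd start                        # or: systemctl start glusterd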

The error log message for the latter problem was:
09:48:04.312800] E [glusterd-handshake.c:771:__server_getspec] 0-glusterd: Unable to stat /var/lib/glusterd/vols/... (No such file or directory)

Thanks to Pranith for the help.