Bug 1185950
| Summary: | adding replication to a distributed volume makes the volume unavailable | ||
|---|---|---|---|
| Product: | [Community] GlusterFS | Reporter: | pille <pille+redhat+bugzilla> |
| Component: | replicate | Assignee: | bugs <bugs> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | |
| Severity: | urgent | Docs Contact: | |
| Priority: | high | ||
| Version: | 3.6.1 | CC: | bugs, gluster-bugs, jbyers, ndevos, pille+redhat+bugzilla |
| Target Milestone: | --- | Keywords: | Triaged |
| Target Release: | --- | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2015-02-18 09:32:44 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
pille
2015-01-26 16:36:47 UTC
Doing a 'rebalance fix-layout' helps bring back (probably) all of the files. However, I can't see any progress for hours (the file/traffic counters stay at zero), the mountpoint is still unreliable, and it blows up very often. Over the last night I was able to see at least some traffic between the nodes, and the new bricks are filling slowly. Will it be flaky the whole time?

Please provide the logs of the mount point (/var/log/glusterfs/<path-to-mnt>.log) and the logs of the bricks (/var/log/glusterfs/bricks/...) from all the storage servers. From the steps to reproduce, can you explain whether there was much I/O happening on the volume? Were new files being added constantly while converting the volume from distribute to distribute-replicate?

Sent logs and server metrics to Niels in private. I did some more research on this: there seem to be some broken files in the gluster mountpoint. Whenever you stat them, the mountpoint disconnects:
# ls -lisa
./8634:
ls: cannot access ./8634/copy: Software caused connection abort
ls: cannot access ./8634/random.part: Transport endpoint is not connected
ls: reading directory ./8634: Transport endpoint is not connected
total 215040
?????????? ? ? ? ? ? copy
-rw-r--r-- 3 root root 104857600 Jan 4 2013 file
-rw-r--r-- 1 root root 10485760 Jan 4 2013 file.1st_copy
-rw-r--r-- 3 root root 104857600 Jan 4 2013 file.2nd_copy
?????????? ? ? ? ? ? random.part
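When the mountpoint blows up like this ('Transport endpoint is not connected', entries shown as '??????????'), one common way to get a usable mount back until the underlying problem is fixed is a lazy unmount followed by a remount of the native client. This is only a minimal sketch; the mountpoint /mnt/data and the volume name 'myvol' on storage01 are placeholders, since the report does not name them:
# lazily detach the dead FUSE mount, then remount the volume with the native client
umount -l /mnt/data
mount -t glusterfs storage01:/myvol /mnt/data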
On the bricks, this directory looks like:
storage01:
total 419840
25769803936 0 drwxr-xr-x 2 root root 110 Sep 7 16:46 .
99 0 drwxr-xr-x 6 root root 165 Aug 29 08:52 ..
25769804012 102400 -rw-r--r-- 4 root root 104857600 Jan 4 2013 copy
25769804012 102400 -rw-r--r-- 4 root root 104857600 Jan 4 2013 file
25769804011 10240 -rw-r--r-- 2 root root 10485760 Jan 4 2013 file.1st_copy
25769804012 102400 -rw-r--r-- 4 root root 104857600 Jan 4 2013 file.2nd_copy
25769804013 102400 -rw-r--r-- 2 root root 104857600 Jan 4 2013 random.part
storage02:
total 215040
21474836640 0 drwxr-xr-x 2 root root 87 Sep 7 16:46 .
85899346016 0 drwxr-xr-x 6 root root 73 Aug 29 08:52 ..
21474836716 0 ---------T 2 root root 0 Jan 19 21:25 copy
21474836715 102400 -rw-r--r-- 2 root root 104857600 Jan 4 2013 file.big
21474836718 10240 -rw-r--r-- 2 root root 10485760 Jan 4 2013 file.small
21474836719 102400 -rw-r--r-- 2 root root 104857600 Jan 4 2013 random.full
storage05:
total 0
77322450916 0 drwxr-xr-x 2 root root 110 Jan 28 07:45 .
62290717518 0 drwxr-xr-x 6 root root 165 Jan 27 15:29 ..
77322572228 0 -rw-r--r-- 4 root root 0 Jan 27 15:30 copy
77322572228 0 -rw-r--r-- 4 root root 0 Jan 27 15:30 file
77322572227 0 -rw-r--r-- 2 root root 0 Jan 27 15:30 file.1st_copy
77322572228 0 -rw-r--r-- 4 root root 0 Jan 27 15:30 file.2nd_copy
77322572237 0 -rw-r--r-- 2 root root 0 Jan 28 07:45 random.part
storage06:
total 0
75162023687 0 drwxr-xr-x 2 root root 87 Jan 28 07:45 .
53687280280 0 drwxr-xr-x 6 root root 73 Jan 27 10:04 ..
75163018369 0 ---------T 2 root root 0 Jan 27 15:30 copy
75163018372 0 -rw-r--r-- 2 root root 0 Jan 28 07:45 file.big
75163018373 0 -rw-r--r-- 2 root root 0 Jan 28 07:45 file.small
75163018374 0 -rw-r--r-- 2 root root 0 Jan 28 07:45 random.full
For comparison, this is the same directory on the source I rsynced the data from:
total 634888
4294967424 4 drwxr-xr-x 2 root root 4096 Sep 7 16:46 .
60467001539 4 drwxr-xr-x 6 root root 4096 Aug 29 08:52 ..
4294968218 102400 -rw-r--r-- 3 root root 104857600 Jan 4 2013 copy
4294968218 102400 -rw-r--r-- 3 root root 104857600 Jan 4 2013 file
4294968221 10240 -rw-r--r-- 1 root root 10485760 Jan 4 2013 file.1st_copy
4294968218 102400 -rw-r--r-- 3 root root 104857600 Jan 4 2013 file.2nd_copy
4294968223 102400 -rw-r--r-- 1 root root 104857600 Jan 4 2013 file.big
4294968225 10240 -rw-r--r-- 1 root root 10485760 Jan 4 2013 file.small
4294968227 102400 -rw-r--r-- 1 root root 104857600 Jan 4 2013 random.full
4294968235 102400 -rw-r--r-- 1 root root 104857600 Jan 4 2013 random.part
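The zero-byte, mode '---------T' entries on storage02 and storage06 look like DHT link files, which normally carry a trusted.glusterfs.dht.linkto xattr pointing at the subvolume that holds the real data; a stale or missing target there would be one possible explanation for the '??????????' entries on the mount. One way to check is to dump the xattrs directly on a brick; the brick path below is a placeholder:
# run on a storage server against the brick directory, not against the FUSE mount
getfattr -d -m . -e hex /data/brick/myvol/8634/copy
# compare trusted.gfid and trusted.glusterfs.dht.linkto for the same file across all bricks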
There are lots of locations where these unreadable, stat-crashing files show up.
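Since the rebalance counters stayed at zero while the new replica bricks were filling only slowly, one way to watch progress from the CLI is the rebalance status and heal info commands; 'myvol' is again a placeholder for the real volume name:
# per-node scanned/rebalanced counters for the fix-layout run
gluster volume rebalance myvol status
# entries still pending self-heal onto the newly added replica bricks
gluster volume heal myvol info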
So some things happened behind the curtain, and I'd like to sum them up here for future reference. Apparently the issue is solved by upgrading to 3.6.2. Unfortunately, that upgrade broke the volume completely: it didn't start until I killed glusterd and manually upgraded the volume configuration on all nodes using 'glusterd --xlator-option *.upgrade=on -N'. The error log message for the latter problem was:

09:48:04.312800] E [glusterd-handshake.c:771:__server_getspec] 0-glusterd: Unable to stat /var/lib/glusterd/vols/... (No such file or directory)

Thanks to Pranith for the help.
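A sketch of the recovery sequence described above, assuming glusterd is managed by systemd (use the distribution's init scripts otherwise); the wildcard is quoted so the shell does not expand it:
# stop the management daemon first
systemctl stop glusterd
# regenerate the volume configuration in the new on-disk format;
# -N runs glusterd in the foreground for this one-off regeneration
glusterd --xlator-option '*.upgrade=on' -N
# restart the daemon and check that the volume comes back up
systemctl start glusterd
gluster volume status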