Bug 1185950
| Summary: | adding replication to a distributed volume makes the volume unavailable | | |
|---|---|---|---|
| Product: | [Community] GlusterFS | Reporter: | pille <pille+redhat+bugzilla> |
| Component: | replicate | Assignee: | bugs <bugs> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | |
| Severity: | urgent | Docs Contact: | |
| Priority: | high | | |
| Version: | 3.6.1 | CC: | bugs, gluster-bugs, jbyers, ndevos, pille+redhat+bugzilla |
| Target Milestone: | --- | Keywords: | Triaged |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2015-02-18 09:32:44 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
pille 2015-01-26 16:36:47 UTC

Doing a 'rebalance fix-layout' helps bring back (probably) all of the files. However, I can't see any progress for hours (the file/traffic counters stay at zero), and the mount point is still unreliable and blows up very often. Over the last night I was able to see at least some traffic between the nodes, and the new bricks are filling slowly. Will it be flaky the whole time?
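A minimal sketch (not from the original report) of the gluster CLI commands behind the step and the counters mentioned above; `<VOLNAME>` is a placeholder for the real volume name:

```sh
# Placeholder volume name -- substitute the real one.
VOLNAME=myvol

# Recalculate the directory layout so the newly added bricks take part in DHT
# (the 'rebalance fix-layout' step referred to above). Note that fix-layout
# only fixes layouts; it does not migrate existing file data by itself.
gluster volume rebalance $VOLNAME fix-layout start

# The per-node file/size counters referred to above appear in the status output.
gluster volume rebalance $VOLNAME status
```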
Please provide the logs of the mount point (/var/log/glusterfs/<path-to-mnt>.log) and the logs of the bricks (/var/log/glusterfs/bricks/...) from all the storage servers. From the steps to reproduce, can you explain whether there was much I/O happening on the volume? Were new files being added constantly while converting the volume from distribute to distribute-replicate?

Sent logs and server metrics to Niels in private. I did some more research on this: there seem to be some broken files in the gluster mount point. Whenever you stat them, the mount point disconnects:

    # ls -lisa ./8634:
    ls: cannot access ./8634/copy: Software caused connection abort
    ls: cannot access ./8634/random.part: Transport endpoint is not connected
    ls: reading directory ./8634: Transport endpoint is not connected
    total 215040
    ?????????? ? ?    ?         ?            ? copy
    -rw-r--r-- 3 root root 104857600 Jan  4  2013 file
    -rw-r--r-- 1 root root  10485760 Jan  4  2013 file.1st_copy
    -rw-r--r-- 3 root root 104857600 Jan  4  2013 file.2nd_copy
    ?????????? ? ?    ?         ?            ? random.part

On the bricks this directory looks like:

storage01:

    total 419840
    25769803936      0 drwxr-xr-x 2 root root       110 Sep  7 16:46 .
             99      0 drwxr-xr-x 6 root root       165 Aug 29 08:52 ..
    25769804012 102400 -rw-r--r-- 4 root root 104857600 Jan  4  2013 copy
    25769804012 102400 -rw-r--r-- 4 root root 104857600 Jan  4  2013 file
    25769804011  10240 -rw-r--r-- 2 root root  10485760 Jan  4  2013 file.1st_copy
    25769804012 102400 -rw-r--r-- 4 root root 104857600 Jan  4  2013 file.2nd_copy
    25769804013 102400 -rw-r--r-- 2 root root 104857600 Jan  4  2013 random.part

storage02:

    total 215040
    21474836640      0 drwxr-xr-x 2 root root        87 Sep  7 16:46 .
    85899346016      0 drwxr-xr-x 6 root root        73 Aug 29 08:52 ..
    21474836716      0 ---------T 2 root root         0 Jan 19 21:25 copy
    21474836715 102400 -rw-r--r-- 2 root root 104857600 Jan  4  2013 file.big
    21474836718  10240 -rw-r--r-- 2 root root  10485760 Jan  4  2013 file.small
    21474836719 102400 -rw-r--r-- 2 root root 104857600 Jan  4  2013 random.full

storage05:

    total 0
    77322450916 0 drwxr-xr-x 2 root root 110 Jan 28 07:45 .
    62290717518 0 drwxr-xr-x 6 root root 165 Jan 27 15:29 ..
    77322572228 0 -rw-r--r-- 4 root root   0 Jan 27 15:30 copy
    77322572228 0 -rw-r--r-- 4 root root   0 Jan 27 15:30 file
    77322572227 0 -rw-r--r-- 2 root root   0 Jan 27 15:30 file.1st_copy
    77322572228 0 -rw-r--r-- 4 root root   0 Jan 27 15:30 file.2nd_copy
    77322572237 0 -rw-r--r-- 2 root root   0 Jan 28 07:45 random.part

storage06:

    total 0
    75162023687 0 drwxr-xr-x 2 root root 87 Jan 28 07:45 .
    53687280280 0 drwxr-xr-x 6 root root 73 Jan 27 10:04 ..
    75163018369 0 ---------T 2 root root  0 Jan 27 15:30 copy
    75163018372 0 -rw-r--r-- 2 root root  0 Jan 28 07:45 file.big
    75163018373 0 -rw-r--r-- 2 root root  0 Jan 28 07:45 file.small
    75163018374 0 -rw-r--r-- 2 root root  0 Jan 28 07:45 random.full

For comparison, this is the directory on the source I rsynced this from:

    total 634888
     4294967424      4 drwxr-xr-x 2 root root      4096 Sep  7 16:46 .
    60467001539      4 drwxr-xr-x 6 root root      4096 Aug 29 08:52 ..
     4294968218 102400 -rw-r--r-- 3 root root 104857600 Jan  4  2013 copy
     4294968218 102400 -rw-r--r-- 3 root root 104857600 Jan  4  2013 file
     4294968221  10240 -rw-r--r-- 1 root root  10485760 Jan  4  2013 file.1st_copy
     4294968218 102400 -rw-r--r-- 3 root root 104857600 Jan  4  2013 file.2nd_copy
     4294968223 102400 -rw-r--r-- 1 root root 104857600 Jan  4  2013 file.big
     4294968225  10240 -rw-r--r-- 1 root root  10485760 Jan  4  2013 file.small
     4294968227 102400 -rw-r--r-- 1 root root 104857600 Jan  4  2013 random.full
     4294968235 102400 -rw-r--r-- 1 root root 104857600 Jan  4  2013 random.part
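A diagnostic sketch (not from the original report) for listings like the ones above: the 0-byte, mode `---------T` entries on a brick are what GlusterFS normally creates as DHT link files, and the on-brick extended attributes can be compared across servers to see where DHT believes the data lives. The brick path below is a placeholder; run as root on each storage server.

```sh
# Placeholder brick path -- substitute the real export directory of the brick.
BRICK=/export/brick1

# Dump the GlusterFS xattrs of one of the suspicious files on this brick.
# Values are hex-encoded; attributes worth comparing across bricks include:
#   trusted.gfid                  -- should be identical on every brick holding the file
#   trusted.glusterfs.dht.linkto  -- present on the 0-byte "---------T" link files,
#                                    names the subvolume that holds the real data
getfattr -d -m . -e hex $BRICK/8634/copy
```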
There are lots of locations where those unreadable, stat-crashing files exist.

So, some things happened behind the curtain, and I'd like to sum them up here for future reference. The issue seems to be solved by upgrading to 3.6.2. Unfortunately, that upgrade broke the volume completely: it didn't start unless I manually upgraded the volume config on all nodes using 'glusterd --xlator-option *.upgrade=on -N' after killing glusterd. The error log message for the latter problem was:

    09:48:04.312800] E [glusterd-handshake.c:771:__server_getspec] 0-glusterd: Unable to stat /var/lib/glusterd/vols/... (No such file or directory)

Thanks to Pranith for the help.
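A minimal sketch (not from the original report) of the recovery sequence described in the last paragraph, assuming SysV/systemd-style service names and a placeholder volume name; the only command taken verbatim from the report is the volfile upgrade:

```sh
# Run on every storage node after upgrading the packages to glusterfs 3.6.2.
pkill glusterd                         # stop the management daemon first

# Regenerate the volume configuration (volfiles) and exit; this is the command
# quoted in the report. Quoting the option keeps the shell from globbing '*'.
glusterd --xlator-option '*.upgrade=on' -N

service glusterd start                 # or: systemctl start glusterd

# Placeholder volume name; only needed if the volume does not come back up on its own.
gluster volume start <VOLNAME>
```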