Bug 770478 - volume replace-brick unstable
Product: GlusterFS
Classification: Community
Component: glusterd
Hardware: x86_64 Linux
Priority: high, Severity: high
Assigned To: krishnan parthasarathi
Reported: 2011-12-26 18:06 EST by Matt Harris
Modified: 2015-11-03 18:04 EST (History)
Doc Type: Bug Fix
Last Closed: 2012-07-24 06:21:29 EDT

Attachments
Logs from all four servers, as well as some commands on cloud0 that are representative of the environment and the problem. (47 bytes, text/plain)
2012-01-03 18:54 EST, Matt Harris
Comment (70.01 KB, text/plain)
2011-12-26 18:06 EST, Matt Harris

Description Matt Harris 2011-12-26 18:06:17 EST
Created attachment 915393 [details]

(This comment was longer than 65,535 characters and has been moved to an attachment by Red Hat Bugzilla).
Comment 1 krishnan parthasarathi 2011-12-26 23:49:11 EST
Could you attach the source brick log, the destination brick log, and the glusterd log files from both the machine where the command was issued and the machine hosting the source brick?
Having the entire log file helps build context while analysing the issue.
Comment 2 Matt Harris 2012-01-03 18:54:37 EST
Created attachment 550577 [details]
Logs from all four servers, as well as some commands on cloud0 that are representative of the environment and the problem.
Comment 3 krishnan parthasarathi 2012-01-05 02:11:01 EST
Matt, I am unable to access the logs from the link you have provided. Could you check that and let me know?
Comment 4 Matt Harris 2012-01-05 13:01:32 EST
Krishnan, please try again.  I have adjusted our firewall and you should now be able to download the files.

Comment 5 krishnan parthasarathi 2012-01-10 11:46:49 EST
Matt, from the glusterd log files it appears that you are using version 3.2.4. The issue you are facing was fixed in version 3.2.5 (tracked by another bug raised by you). See https://bugzilla.redhat.com/show_bug.cgi?id=GLUSTER-3709  
Let us know if you are seeing the same issue in 3.2.5 as well.
Comment 6 Matt Harris 2012-01-10 12:17:20 EST
I am running 3.2.5 on every brick.  The issue I am reporting here is not the same as the issue that was fixed in 3709, although it appears that the issue reported in 3709 has not been fixed either.

The issue I am reporting here is that the replace-brick operation does not complete.  I have gone to great lengths to isolate possible network instability causes, including tracking down and replacing a bad patch cable.  The issue I am reporting and the logs I have attached are all from after upgrading from 3.1.6 to 3.2.5, re-creating the volume, and verifying that the network layer is functioning properly.

I downloaded the Ubuntu .deb files from http://download.gluster.com/pub/gluster/glusterfs/3.2/3.2.5/Ubuntu/11.10/

gluster -v reports 3.2.5 on every brick.
Comment 7 krishnan parthasarathi 2012-01-11 02:41:30 EST

Let me clarify: I am not suggesting that the issue you are facing is the same as 3709. I suspect that 3709 is getting in the way of determining what the problem here is. From the description of this problem, it appears that the glusterd peers begin to disagree about the current 'state' of replace-brick (inferred from Step 7 of "Steps to reproduce").

I see the following lines in the logs that you have attached to the bug. This leads me to believe that your glusterd version is 3.2.4. The fix for 3709 is available in glusterd version 3.2.5 onwards. We need to find out why an 'older' version of glusterd is running. Once that is identified, we can check whether the issue is seen when using _glusterd_ version 3.2.5. (Try glusterd -v)

root@trantor:~/rb_unstable/gluster# grep "3.2.4" fs{1,2,3}/opt/gluster3/var/log/glusterfs/opt-gluster3-etc-glusterfs-glusterd.vol.log 
fs1/opt/gluster3/var/log/glusterfs/opt-gluster3-etc-glusterfs-glusterd.vol.log:[2011-11-10 11:57:40.849643] I [glusterfsd.c:1493:main] 0-/opt/gluster3/sbin/glusterd: Started Running /opt/gluster3/sbin/glusterd version 3.2.4
fs1/opt/gluster3/var/log/glusterfs/opt-gluster3-etc-glusterfs-glusterd.vol.log:[2011-11-10 11:57:40.854528] E [rpc-transport.c:677:rpc_transport_load] 0-rpc-transport: /opt/gluster3/lib/glusterfs/3.2.4/rpc-transport/rdma.so: cannot open shared object file: No such file or directory
fs2/opt/gluster3/var/log/glusterfs/opt-gluster3-etc-glusterfs-glusterd.vol.log:[2011-11-10 11:57:29.724708] I [glusterfsd.c:1493:main] 0-/opt/gluster3/sbin/glusterd: Started Running /opt/gluster3/sbin/glusterd version 3.2.4
fs2/opt/gluster3/var/log/glusterfs/opt-gluster3-etc-glusterfs-glusterd.vol.log:[2011-11-10 11:57:29.736698] E [rpc-transport.c:677:rpc_transport_load] 0-rpc-transport: /opt/gluster3/lib/glusterfs/3.2.4/rpc-transport/rdma.so: cannot open shared object file: No such file or directory
fs3/opt/gluster3/var/log/glusterfs/opt-gluster3-etc-glusterfs-glusterd.vol.log:[2011-11-10 11:48:31.250081] I [glusterfsd.c:1493:main] 0-/opt/gluster3/sbin/glusterd: Started Running /opt/gluster3/sbin/glusterd version 3.2.4
fs3/opt/gluster3/var/log/glusterfs/opt-gluster3-etc-glusterfs-glusterd.vol.log:[2011-11-10 11:48:31.259753] E [rpc-transport.c:677:rpc_transport_load] 0-rpc-transport: /opt/gluster3/lib/glusterfs/3.2.4/rpc-transport/rdma.so: cannot open shared object file: No such file or directory
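
Because the matches above are all dated 2011-11-10, it can help to look only at the most recent startup line in each glusterd log instead of every line that mentions a version string. What follows is a minimal Python sketch, assuming the log line format shown in the grep output above. The function name `latest_started_version` is illustrative and is not part of GlusterFS.

```python
import re
import sys
from typing import Iterable, Optional, Tuple

# Assumed format, taken from the grep output above, e.g.:
# [2011-11-10 11:57:40.849643] I [glusterfsd.c:1493:main] 0-...: Started Running /path/glusterd version 3.2.4
# The pattern is not anchored, so grep-style "file:" prefixes are tolerated.
STARTUP_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\].*"
    r"Started Running \S+ version (?P<version>\S+)"
)


def latest_started_version(lines: Iterable[str]) -> Optional[Tuple[str, str]]:
    """Return (timestamp, version) of the most recent startup line, or None."""
    latest = None
    for line in lines:
        match = STARTUP_RE.search(line)
        if match is None:
            continue  # not a startup line (for example the rdma.so error)
        entry = (match.group("ts"), match.group("version"))
        # Assumes fixed-width timestamps, which sort chronologically as strings.
        if latest is None or entry[0] >= latest[0]:
            latest = entry
    return latest


if __name__ == "__main__":
    # Usage: python latest_version.py <glusterd log file> [more log files ...]
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            print(path, latest_started_version(handle))
```

If the newest startup line in a log still reports 3.2.4, the daemon on that server was last started from the older binary; if no line matches, the log probably does not cover a glusterd start at all.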
Comment 8 Matt Harris 2012-01-11 09:19:42 EST

The lines you're quoting above appear nowhere in the logs I sent.  Please verify that you are working from the same files I am.

Comment 9 krishnan parthasarathi 2012-01-14 03:08:03 EST
Sorry, let me check that again.

Comment 10 krishnan parthasarathi 2012-01-24 08:43:24 EST
The source brick has crashed after migrating some amount of data, due to the same problem as seen in https://bugzilla.redhat.com/show_bug.cgi?id=GLUSTER-3790

Today, glusterd doesn't change its replace-brick running state even if the source brick has crashed. An alternative to removing the rb* files is to restart the source brick using "gluster volume start <volname> force". You can then issue the replace-brick subcommands as you wish. I agree that this is still an inconvenience. We could probably add an "abort force", analogous to "commit force", which would allow the user to abort the replace-brick operation even when the source brick is unreachable. I will update the bug on how we proceed with this.
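
The workaround described above amounts to two commands run in order. The Python sketch below only builds and prints them, it does not execute anything. It assumes the 3.2-era CLI form `gluster volume replace-brick <volname> <source-brick> <destination-brick> <subcommand>` with the subcommands start, pause, abort, status and commit; check this against `gluster volume help` on the installed version. The function name, volume name and brick paths are hypothetical.

```python
import shlex
from typing import List

# Assumption: the subcommand set of the 3.2-era replace-brick CLI.
ALLOWED_SUBCOMMANDS = {"start", "pause", "abort", "status", "commit"}


def recovery_commands(volname: str, src_brick: str, dst_brick: str,
                      subcommand: str = "abort") -> List[List[str]]:
    """Build the command sequence for the workaround; nothing is executed."""
    if subcommand not in ALLOWED_SUBCOMMANDS:
        raise ValueError(f"unsupported replace-brick subcommand: {subcommand!r}")
    return [
        # Step 1: restart the crashed source brick process.
        ["gluster", "volume", "start", volname, "force"],
        # Step 2: issue the desired replace-brick subcommand.
        ["gluster", "volume", "replace-brick", volname,
         src_brick, dst_brick, subcommand],
    ]


if __name__ == "__main__":
    # Hypothetical volume name and brick paths, for illustration only.
    for argv in recovery_commands("vol0", "fs1:/export/brick", "fs4:/export/brick"):
        print(shlex.join(argv))
```

Printing the commands first leaves room to review them before running them by hand on the server that hosts the source brick.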
Comment 11 krishnan parthasarathi 2012-07-24 06:21:29 EDT
A lot of fixes have gone into afr (the xlator on which replace-brick heavily relies) and into glusterd since this bug was raised. Closing the bug, as replace-brick is working fine on the master branch. If the issue is found to occur on HEAD of master, please re-open.
