Bug 770478

Summary: volume replace-brick unstable
Product: [Community] GlusterFS
Reporter: Matt Harris <matthaeus.harris>
Component: glusterd
Assignee: krishnan parthasarathi <kparthas>
Status: CLOSED WORKSFORME
QA Contact:
Severity: high
Docs Contact:
Priority: high
Version: 3.2.5
CC: gluster-bugs, kparthas, nsathyan
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2012-07-24 06:21:29 EDT
Attachments:
Logs from all four servers, as well as some commands on cloud0 that are representative of the environment and the problem. (Flags: none)
Comment (Flags: none)

Description Matt Harris 2011-12-26 18:06:17 EST
Created attachment 915393
Comment

(This comment was longer than 65,535 characters and has been moved to an attachment by Red Hat Bugzilla).
Comment 1 krishnan parthasarathi 2011-12-26 23:49:11 EST
Matt, 
Could you attach the source brick log, the destination brick log, and the glusterd logs from both the machine where the command was issued and the machine that hosts the source brick?
Having the entire log files helps build context while analysing the issue.
Comment 2 Matt Harris 2012-01-03 18:54:37 EST
Created attachment 550577
Logs from all four servers, as well as some commands on cloud0 that are representative of the environment and the problem.
Comment 3 krishnan parthasarathi 2012-01-05 02:11:01 EST
Matt, I am unable to access the logs from the link you have provided. Could you check that and let me know?
Comment 4 Matt Harris 2012-01-05 13:01:32 EST
Krishnan, please try again.  I have adjusted our firewall and you should now be able to download the files.

Thanks!
Comment 5 krishnan parthasarathi 2012-01-10 11:46:49 EST
Matt, from the glusterd log files it appears that you are using version 3.2.4. The issue you are facing was fixed in version 3.2.5 (tracked by another bug raised by you). See https://bugzilla.redhat.com/show_bug.cgi?id=GLUSTER-3709  
Let us know if you are seeing the same issue in 3.2.5 as well.
Comment 6 Matt Harris 2012-01-10 12:17:20 EST
I am running 3.2.5 on every brick.  The issue I am reporting here is not the same as the issue that was fixed in 3709, although it appears that the issue reported in 3709 has not been fixed either.

The issue I am reporting here is that the replace-brick operation does not complete.  I have gone to great lengths to isolate possible network instability causes, including tracking down and replacing a bad patch cable.  The issue I am reporting and the logs I have attached are all from after upgrading from 3.1.6 to 3.2.5, re-creating the volume, and verifying that the network layer is functioning properly.

I downloaded the Ubuntu .deb files from http://download.gluster.com/pub/gluster/glusterfs/3.2/3.2.5/Ubuntu/11.10/

gluster -v reports 3.2.5 on every brick.
Comment 7 krishnan parthasarathi 2012-01-11 02:41:30 EST
Matt,

Let me clarify: I am not suggesting that the issue you are facing is the same as 3709. I suspect, however, that 3709 is getting in the way of determining what the problem here is. From the description of this problem, it appears that the glusterd peers begin to disagree about the current 'state' of replace-brick (inferred from Step 7 of "Steps to reproduce").

I see the following lines in the logs that you have attached to the bug. This leads me to believe that your glusterd version is 3.2.4. The fix for 3709 is available in glusterd version 3.2.5 onwards. We need to find out why an 'older' version of glusterd is running. Once that is identified, we can check whether the issue is seen when using _glusterd_ version 3.2.5. (Try glusterd -v)

<log-output>
root@trantor:~/rb_unstable/gluster# grep "3.2.4" fs{1,2,3}/opt/gluster3/var/log/glusterfs/opt-gluster3-etc-glusterfs-glusterd.vol.log 
fs1/opt/gluster3/var/log/glusterfs/opt-gluster3-etc-glusterfs-glusterd.vol.log:[2011-11-10 11:57:40.849643] I [glusterfsd.c:1493:main] 0-/opt/gluster3/sbin/glusterd: Started Running /opt/gluster3/sbin/glusterd version 3.2.4
fs1/opt/gluster3/var/log/glusterfs/opt-gluster3-etc-glusterfs-glusterd.vol.log:[2011-11-10 11:57:40.854528] E [rpc-transport.c:677:rpc_transport_load] 0-rpc-transport: /opt/gluster3/lib/glusterfs/3.2.4/rpc-transport/rdma.so: cannot open shared object file: No such file or directory
fs2/opt/gluster3/var/log/glusterfs/opt-gluster3-etc-glusterfs-glusterd.vol.log:[2011-11-10 11:57:29.724708] I [glusterfsd.c:1493:main] 0-/opt/gluster3/sbin/glusterd: Started Running /opt/gluster3/sbin/glusterd version 3.2.4
fs2/opt/gluster3/var/log/glusterfs/opt-gluster3-etc-glusterfs-glusterd.vol.log:[2011-11-10 11:57:29.736698] E [rpc-transport.c:677:rpc_transport_load] 0-rpc-transport: /opt/gluster3/lib/glusterfs/3.2.4/rpc-transport/rdma.so: cannot open shared object file: No such file or directory
fs3/opt/gluster3/var/log/glusterfs/opt-gluster3-etc-glusterfs-glusterd.vol.log:[2011-11-10 11:48:31.250081] I [glusterfsd.c:1493:main] 0-/opt/gluster3/sbin/glusterd: Started Running /opt/gluster3/sbin/glusterd version 3.2.4
fs3/opt/gluster3/var/log/glusterfs/opt-gluster3-etc-glusterfs-glusterd.vol.log:[2011-11-10 11:48:31.259753] E [rpc-transport.c:677:rpc_transport_load] 0-rpc-transport: /opt/gluster3/lib/glusterfs/3.2.4/rpc-transport/rdma.so: cannot open shared object file: No such file or directory
</log-output>
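A glusterd log can hold startup lines from more than one run, so a plain grep for a version string does not show which version started most recently. The following is a minimal sketch, assuming the "Started Running ... version X" line format shown in the excerpt above; the function name latest_glusterd_version is hypothetical and not part of GlusterFS.

```shell
#!/bin/sh
# Hypothetical helper (not part of GlusterFS): print the version recorded in
# the LAST "Started Running ... version X" line of a glusterd log file.
# Assumes the startup line format shown in the log excerpt above.
latest_glusterd_version() {
    grep 'Started Running' "$1" \
        | tail -n 1 \
        | sed -n 's/.* version \([0-9][0-9.]*\).*/\1/p'
}
```

Running it against each of the glusterd.vol.log files named in the grep output, and comparing the result with what glusterd -v prints on the same host, would show whether the 3.2.4 entries come from the current run or from an earlier one. It prints nothing if the log has no startup line.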
Comment 8 Matt Harris 2012-01-11 09:19:42 EST
Krishnan,

The lines you're quoting above appear nowhere in the logs I sent.  Please verify that you are working from the same files I am.

Matt
Comment 9 krishnan parthasarathi 2012-01-14 03:08:03 EST
Matt,
Sorry, let me check that again.

Krish
Comment 10 krishnan parthasarathi 2012-01-24 08:43:24 EST
Matt,
The source brick crashed after migrating some amount of data, due to the same problem as seen in https://bugzilla.redhat.com/show_bug.cgi?id=GLUSTER-3790

Today, glusterd doesn't change its replace-brick running state even if the source brick has crashed. An alternative to removing the rb* files would be to restart the source brick using "gluster volume start <volname> force". After that, you can issue the replace-brick subcommands as you wish. I agree that this is still an inconvenience. We could probably have an "abort force", like "commit force", which would allow the user to abort the replace-brick operation even when the source brick is unreachable. I will update the bug on how we proceed with this.
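
The workaround described above can be sketched as a short shell helper. This is only an illustration under stated assumptions: the function name recover_replace_brick and the GLUSTER override variable are hypothetical, and the arguments stand in for your own volume name and the source and destination bricks. It force-starts the volume so the crashed source brick comes back, then queries the replace-brick status.

```shell
#!/bin/sh
# Hypothetical helper (not part of GlusterFS). GLUSTER is an assumed override
# so the commands can be dry-run, e.g. with GLUSTER=echo.
recover_replace_brick() {
    vol=$1   # volume name
    src=$2   # source brick, as host:/path
    dst=$3   # destination brick, as host:/path

    # Restart the crashed source brick, as suggested in the comment above.
    ${GLUSTER:-gluster} volume start "$vol" force || return 1

    # With the source brick back, replace-brick subcommands can be issued again.
    ${GLUSTER:-gluster} volume replace-brick "$vol" "$src" "$dst" status
}
```

Setting GLUSTER=echo before calling the function prints the two commands instead of running them, which is a cheap way to review them first. Replacing status with abort or start in the last command follows the same pattern.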
Comment 11 krishnan parthasarathi 2012-07-24 06:21:29 EDT
A lot of fixes have gone into afr (the xlator on which replace-brick heavily relies) and into glusterd since this bug was raised. Closing the bug, as replace-brick is working fine on the master branch. If the issue is found to occur on the HEAD of master, please re-open.