770478 – volume replace-brick unstable

Status:	CLOSED WORKSFORME
Alias:	None
Product:	GlusterFS
Classification:	Community
Component:	glusterd
Version:	3.2.5
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Assignee:	krishnan parthasarathi

Reported:	2011-12-26 23:06 UTC by Matt Harris
Modified:	2015-11-03 23:04 UTC
CC List:	3 users
Last Closed:	2012-07-24 10:21:29 UTC
Regression:	---
Mount Type:	---
Documentation:	---

Attachments
Logs from all four servers, as well as some commands on cloud0 that are representative of the environment and the problem. (47 bytes, text/plain) 2012-01-03 23:54 UTC, Matt Harris
Comment (70.01 KB, text/plain) 2011-12-26 23:06 UTC, Matt Harris

Description Matt Harris 2011-12-26 23:06:17 UTC

Created attachment 915393
Comment

(This comment was longer than 65,535 characters and has been moved to an attachment by Red Hat Bugzilla).

Comment 1 krishnan parthasarathi 2011-12-27 04:49:11 UTC

Matt, 
Could you attach the source brick log, the destination brick log, and the glusterd log files from both the machine where the command was issued and the machine that hosts the source brick?
Having the entire log file helps build context while analysing the issue.

Comment 2 Matt Harris 2012-01-03 23:54:37 UTC

Created attachment 550577
Logs from all four servers, as well as some commands on cloud0 that are representative of the environment and the problem.

Comment 3 krishnan parthasarathi 2012-01-05 07:11:01 UTC

Matt, I am unable to access the logs from the link you have provided. Could you check that and let me know?

Comment 4 Matt Harris 2012-01-05 18:01:32 UTC

Krishnan, please try again.  I have adjusted our firewall and you should now be able to download the files.

Thanks!

Comment 5 krishnan parthasarathi 2012-01-10 16:46:49 UTC

Matt, from the glusterd log files it appears that you are using version 3.2.4. The issue you are facing was fixed in version 3.2.5 (tracked by another bug raised by you). See https://bugzilla.redhat.com/show_bug.cgi?id=GLUSTER-3709  
Let us know if you are seeing the same issue in 3.2.5 as well.

Comment 6 Matt Harris 2012-01-10 17:17:20 UTC

I am running 3.2.5 on every brick.  The issue I am reporting here is not the same as the issue that was fixed in 3709, although it appears that the issue reported in 3709 has not been fixed either.

The issue I am reporting here is that the replace-brick operation does not complete.  I have gone to great lengths to isolate possible network instability causes, including tracking down and replacing a bad patch cable.  The issue I am reporting and the logs I have attached are all from after upgrading from 3.1.6 to 3.2.5, re-creating the volume, and verifying that the network layer is functioning properly.

I downloaded the Ubuntu .deb files from http://download.gluster.com/pub/gluster/glusterfs/3.2/3.2.5/Ubuntu/11.10/

gluster -v reports 3.2.5 on every brick.

Comment 7 krishnan parthasarathi 2012-01-11 07:41:30 UTC

Matt,

Let me clarify: I am not suggesting that the issue you are facing is the same as 3709. I suspect that 3709 is getting in the way of determining what the problem is here. From the description of this problem, it appears that the glusterd peers begin to disagree about the current 'state' of replace-brick (inferred from Step 7 of "Steps to reproduce").

I see the following lines in the logs that you attached to the bug, which lead me to believe that your glusterd version is 3.2.4. The fix for 3709 is available in glusterd version 3.2.5 onwards. We need to find out why an 'older' version of glusterd is running. Once that is identified, we can check whether the issue is seen when using _glusterd_ version 3.2.5. (Try glusterd -v)

<log-output>
root@trantor:~/rb_unstable/gluster# grep "3.2.4" fs{1,2,3}/opt/gluster3/var/log/glusterfs/opt-gluster3-etc-glusterfs-glusterd.vol.log 
fs1/opt/gluster3/var/log/glusterfs/opt-gluster3-etc-glusterfs-glusterd.vol.log:[2011-11-10 11:57:40.849643] I [glusterfsd.c:1493:main] 0-/opt/gluster3/sbin/glusterd: Started Running /opt/gluster3/sbin/glusterd version 3.2.4
fs1/opt/gluster3/var/log/glusterfs/opt-gluster3-etc-glusterfs-glusterd.vol.log:[2011-11-10 11:57:40.854528] E [rpc-transport.c:677:rpc_transport_load] 0-rpc-transport: /opt/gluster3/lib/glusterfs/3.2.4/rpc-transport/rdma.so: cannot open shared object file: No such file or directory
fs2/opt/gluster3/var/log/glusterfs/opt-gluster3-etc-glusterfs-glusterd.vol.log:[2011-11-10 11:57:29.724708] I [glusterfsd.c:1493:main] 0-/opt/gluster3/sbin/glusterd: Started Running /opt/gluster3/sbin/glusterd version 3.2.4
fs2/opt/gluster3/var/log/glusterfs/opt-gluster3-etc-glusterfs-glusterd.vol.log:[2011-11-10 11:57:29.736698] E [rpc-transport.c:677:rpc_transport_load] 0-rpc-transport: /opt/gluster3/lib/glusterfs/3.2.4/rpc-transport/rdma.so: cannot open shared object file: No such file or directory
fs3/opt/gluster3/var/log/glusterfs/opt-gluster3-etc-glusterfs-glusterd.vol.log:[2011-11-10 11:48:31.250081] I [glusterfsd.c:1493:main] 0-/opt/gluster3/sbin/glusterd: Started Running /opt/gluster3/sbin/glusterd version 3.2.4
fs3/opt/gluster3/var/log/glusterfs/opt-gluster3-etc-glusterfs-glusterd.vol.log:[2011-11-10 11:48:31.259753] E [rpc-transport.c:677:rpc_transport_load] 0-rpc-transport: /opt/gluster3/lib/glusterfs/3.2.4/rpc-transport/rdma.so: cannot open shared object file: No such file or directory
</log-output>
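A grep for a version string matches every start-up line in the log, including ones written before an upgrade, so it may help to look only at the most recent start-up entry in each file. The following is a minimal sketch, assuming the start-up lines keep the "Started Running <binary> version <x.y.z>" format shown above; the function names are illustrative and not part of GlusterFS.

```python
import re
from datetime import datetime

# Matches glusterd start-up lines of the form quoted in this bug, e.g.
# [2011-11-10 11:57:40.849643] I [...] ...: Started Running /path/glusterd version 3.2.4
# The pattern is searched, not anchored, so grep output prefixed with
# "filename:" is accepted as well.
STARTUP_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\].*"
    r"Started Running (?P<binary>\S+) version (?P<version>\S+)"
)


def startup_versions(lines):
    """Return (timestamp, binary, version) for each start-up line, oldest first."""
    found = []
    for line in lines:
        match = STARTUP_RE.search(line)
        if match:
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
            found.append((ts, match.group("binary"), match.group("version")))
    return sorted(found)


def latest_version(lines):
    """Version named by the most recent start-up line, or None if none is found."""
    found = startup_versions(lines)
    return found[-1][2] if found else None
```

Calling latest_version(open(path)) on each glusterd log shows which version last started on that host. An old entry naming 3.2.4 followed by a newer one naming 3.2.5 would only mean the log predates the upgrade.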

Comment 8 Matt Harris 2012-01-11 14:19:42 UTC

Krishnan,

The lines you're quoting above appear nowhere in the logs I sent.  Please verify that you are working from the same files I am.

Matt

Comment 9 krishnan parthasarathi 2012-01-14 08:08:03 UTC

Matt,
Sorry, let me check that again.

Krish

Comment 10 krishnan parthasarathi 2012-01-24 13:43:24 UTC

Matt,
The source brick crashed after migrating some amount of data, due to the same problem as seen in https://bugzilla.redhat.com/show_bug.cgi?id=GLUSTER-3790

Today, glusterd doesn't change its replace-brick running state even if the source brick has crashed. An alternative to removing the rb* files would be to restart the source brick using "gluster volume start <volname> force". You can then issue the replace-brick subcommands as you wish. I agree that this is still an inconvenience. We could probably add an "abort force", like "commit force", which would allow the user to abort the replace-brick operation even when the source brick is unreachable. I will update the bug on how we proceed with this.
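The workaround can be sketched as a two-step command sequence: force-start the volume so that the crashed source brick comes back, then issue a replace-brick subcommand. The helper below only builds the commands and does not run them. The volume name and brick paths in the usage lines are hypothetical placeholders, and the subcommand list is an assumption based on the 3.2-era CLI.

```python
import shlex

# Assumed replace-brick subcommands of the 3.2-era gluster CLI.
SUBCOMMANDS = ("start", "pause", "abort", "status", "commit")


def recovery_commands(volname, src_brick, dst_brick, action="status"):
    """Build (but do not run) the CLI calls for the workaround described above."""
    if action not in SUBCOMMANDS:
        raise ValueError("unknown replace-brick subcommand: %r" % (action,))
    return [
        # Step 1: restart the crashed source brick.
        ["gluster", "volume", "start", volname, "force"],
        # Step 2: the replace-brick subcommand the user wants to issue.
        ["gluster", "volume", "replace-brick", volname, src_brick, dst_brick, action],
    ]


def as_shell(cmd):
    """Render one argv list as a copy-pasteable shell line."""
    return " ".join(shlex.quote(arg) for arg in cmd)


# Hypothetical volume and brick names, for illustration only.
for cmd in recovery_commands("vol0", "fs1:/export/brick1", "fs2:/export/brick1", "abort"):
    print(as_shell(cmd))
```

Building the argv lists separately from running them makes it easy to review the exact commands before passing them to subprocess.run or pasting them into a shell on the node.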

Comment 11 krishnan parthasarathi 2012-07-24 10:21:29 UTC

A lot of fixes have gone into afr (the xlator on which replace-brick relies heavily) and into glusterd since this bug was raised. Closing the bug, as replace-brick is working fine on the master branch. If the issue is found in HEAD of master, please re-open.
