Bug 1262964

Summary: Cannot access volume when network down
Product: [Community] GlusterFS
Reporter: Huy VU <huy.vu>
Component: replicate
Assignee: Ravishankar N <ravishankar>
Status: CLOSED INSUFFICIENT_DATA
Severity: low
Priority: low
Version: 3.6.3
CC: amukherj, bugs, huy.vu, pkarampu, ravishankar
Target Milestone: ---
Keywords: Reopened, Triaged
Target Release: ---
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Last Closed: 2016-06-22 10:31:46 UTC
Type: Bug
Attachments:
gluster logs from node1
gluster logs from node2

Description Huy VU 2015-09-14 18:56:31 UTC
Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Atin Mukherjee 2015-09-15 04:40:31 UTC
Please provide a description of the problem along with the logs; a one-liner statement in the bug doesn't help developers understand the exact issue.

Comment 2 Kaleb KEITHLEY 2015-09-15 12:30:32 UTC
If the network is down, pretty much nothing can work.

Closing this bug. If you wish to provide the requested information and/or can explain why you still think it's a bug in gluster when the network is down, then you may reopen this bug by changing the Status to New.

Comment 3 Huy VU 2015-09-15 13:23:11 UTC
Created attachment 1073653 [details]
gluster logs from node1

gluster log from node1

Comment 4 Huy VU 2015-09-15 13:24:07 UTC
Created attachment 1073654 [details]
gluster logs from node2

gluster logs from node2. NIC was brought down manually at around 9am.

Comment 5 Huy VU 2015-09-15 13:31:24 UTC
Description of problem:


Version-Release number of selected component (if applicable):
glusterfs-api-3.6.5-1.el6.x86_64
glusterfs-server-3.6.5-1.el6.x86_64
glusterfs-3.6.5-1.el6.x86_64
glusterfs-cli-3.6.5-1.el6.x86_64
glusterfs-fuse-3.6.5-1.el6.x86_64
glusterfs-libs-3.6.5-1.el6.x86_64


How reproducible:


Steps to Reproduce:
1. Create a 2-node replicated volume (a minimal setup sketch follows these steps).
2. Verify that replication works both ways.
3. Bring down the NIC on node 2 using the command: ifconfig eth0 down
4. Access the gluster volume on either node.
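
A minimal sketch of the setup assumed by these steps (brick hosts and paths are taken from the volume info in comment 9; the mount point is an assumption):

# on node 1, with both peers already probed
gluster volume create gv0 replica 2 10.35.29.222:/data/brick1/gv0 10.35.29.223:/data/brick1/gv0
gluster volume start gv0
# on each node, mount the volume locally before testing
mount -t glusterfs localhost:/gv0 /mnt/glusterd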

Actual results:
Any command that accesses the gluster volume freezes for about 30 seconds.
After about 30 seconds, the command proceeds as expected.
Any change to a file on the volume on either node while the NIC of node 2 is down can cause the volume to become split-brained.
The split-brain did not resolve itself after the NIC of node 2 was returned to service (even after 1 hour).


Expected results:
Access to the volume on either node should not be impeded when the NIC of node 2 is brought down.
Changes to a file on either node that do not result in conflicts should not cause split-brain after the NIC returns to service.


Additional info:

Comment 6 Huy VU 2015-09-15 13:32:58 UTC
(In reply to Huy VU from comment #0)
> Description of problem:
> 
> 
> Version-Release number of selected component (if applicable):
> 
> 
> How reproducible:
> 
> 
> Steps to Reproduce:
> 1.
> 2.
> 3.
> 
> Actual results:
> 
> 
> Expected results:
> 
> 
> Additional info:

Sorry about the lack of info. I didn't think I had pressed the Save Changes button. Please see the additional info below.

Comment 7 Huy VU 2015-09-15 13:40:35 UTC
Information has been provided. Please review.

Comment 8 Huy VU 2015-09-15 13:42:42 UTC
Manually launching heal on node 1:
[root@huysnpmvm10 glusterd]# gluster volume heal gv0
Launching heal operation to perform index self heal on volume gv0 has been successful
Use heal info commands to check status
[root@huysnpmvm10 glusterd]# gluster volume heal gv0 info
Brick huysnpmvm10:/data/brick1/gv0/
/ - Is in split-brain

/testfile.txt
Number of entries: 2

Brick huysnpmvm11:/data/brick1/gv0/
/ - Is in split-brain

Number of entries: 1

[root@huysnpmvm10 glusterd]# gluster volume heal gv0
Launching heal operation to perform index self heal on volume gv0 has been successful
Use heal info commands to check status
[root@huysnpmvm10 glusterd]# gluster volume heal gv0 info
Brick huysnpmvm10:/data/brick1/gv0/
/ - Is in split-brain

/testfile.txt
Number of entries: 2

Brick huysnpmvm11:/data/brick1/gv0/
/ - Is in split-brain

Number of entries: 1
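
For what it's worth, heal info also has a mode that lists only the entries in split-brain, which may be easier to read here (a sketch; output format varies by release):

gluster volume heal gv0 info split-brain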


Logs from glfsheal-gv0.log:


[2015-09-15 13:38:51.181240] I [dht-shared.c:337:dht_init_regex] 0-gv0-dht: using regex rsync-hash-regex = ^\.(.+)\.[^.]+$
[2015-09-15 13:38:51.185147] I [glfs-master.c:93:notify] 0-gfapi: New graph 68757973-6e70-6d76-6d31-302d34353433 (0) coming up
[2015-09-15 13:38:51.185284] I [client.c:2280:notify] 0-gv0-client-0: parent translators are ready, attempting connect on transport
[2015-09-15 13:38:51.185725] I [client.c:2280:notify] 0-gv0-client-1: parent translators are ready, attempting connect on transport
[2015-09-15 13:38:51.186722] I [rpc-clnt.c:1761:rpc_clnt_reconfig] 0-gv0-client-0: changing port to 49152 (from 0)
[2015-09-15 13:38:51.187540] I [client-handshake.c:1413:select_server_supported_programs] 0-gv0-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2015-09-15 13:38:51.187632] I [rpc-clnt.c:1761:rpc_clnt_reconfig] 0-gv0-client-1: changing port to 49152 (from 0)
[2015-09-15 13:38:51.188093] I [client-handshake.c:1200:client_setvolume_cbk] 0-gv0-client-0: Connected to gv0-client-0, attached to remote volume '/data/brick1/gv0'.
[2015-09-15 13:38:51.188150] I [client-handshake.c:1210:client_setvolume_cbk] 0-gv0-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2015-09-15 13:38:51.188294] I [MSGID: 108005] [afr-common.c:3686:afr_notify] 0-gv0-replicate-0: Subvolume 'gv0-client-0' came back up; going online.
[2015-09-15 13:38:51.188598] I [client-handshake.c:188:client_set_lk_version_cbk] 0-gv0-client-0: Server lk version = 1
[2015-09-15 13:38:51.188871] I [client-handshake.c:1413:select_server_supported_programs] 0-gv0-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2015-09-15 13:38:51.189454] I [client-handshake.c:1200:client_setvolume_cbk] 0-gv0-client-1: Connected to gv0-client-1, attached to remote volume '/data/brick1/gv0'.
[2015-09-15 13:38:51.189692] I [client-handshake.c:1210:client_setvolume_cbk] 0-gv0-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2015-09-15 13:38:51.198011] I [client-handshake.c:188:client_set_lk_version_cbk] 0-gv0-client-1: Server lk version = 1
[2015-09-15 13:38:53.210926] I [afr-self-heal-entry.c:561:afr_selfheal_entry_do] 0-gv0-replicate-0: performing entry selfheal on 00000000-0000-0000-0000-000000000001
[2015-09-15 13:38:53.215174] E [afr-self-heal-entry.c:246:afr_selfheal_detect_gfid_and_type_mismatch] 0-gv0-replicate-0: Gfid mismatch detected for <00000000-0000-0000-0000-000000000001/testfile.txt>, e2a9f622-2038-40fd-b133-7093c9953db5 on gv0-client-1 and 241e0aea-c6b3-43f1-bd3b-6c59bfd40bcf on gv0-client-0. Skipping conservative merge on the file.
[2015-09-15 13:38:53.218606] W [afr-common.c:1803:afr_discover_done] 0-gv0-replicate-0: no read subvols for /
[2015-09-15 13:38:53.218811] I [afr-common.c:1491:afr_local_discovery_cbk] 0-gv0-replicate-0: selecting local read_child gv0-client-0
[2015-09-15 13:38:53.219146] W [afr-common.c:1803:afr_discover_done] 0-gv0-replicate-0: no read subvols for /
[2015-09-15 13:38:53.219551] I [glfs-resolve.c:836:__glfs_active_subvol] 0-gv0: switched to graph 68757973-6e70-6d76-6d31-302d34353433 (0)

Comment 9 Huy VU 2015-09-15 14:39:27 UTC
(In reply to Huy VU from comment #5)
> Description of problem:
> 
> 
> Version-Release number of selected component (if applicable):
> glusterfs-api-3.6.5-1.el6.x86_64
> glusterfs-server-3.6.5-1.el6.x86_64
> glusterfs-3.6.5-1.el6.x86_64
> glusterfs-cli-3.6.5-1.el6.x86_64
> glusterfs-fuse-3.6.5-1.el6.x86_64
> glusterfs-libs-3.6.5-1.el6.x86_64
> 
> 
> How reproducible:
> 
> 
> Steps to Reproduce:
> 1.Create a 2-node to replicate a volume.
> 2.Verify that replication works both ways
> 3.Bring down the NIC on node 2 using the command: ifconfig eth0 down
> 4.Access the gluster volume on either node
> 
Volume info:
Node 1:
[root@huysnpmvm10 glusterd]# gluster volume info

Volume Name: gv0
Type: Replicate
Volume ID: 2d189bdb-d657-4d7c-9556-6e7676b35ea3
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 10.35.29.222:/data/brick1/gv0
Brick2: 10.35.29.223:/data/brick1/gv0

Node 2:
[root@huysnpmvm11 glusterd]# gluster volume info

Volume Name: gv0
Type: Replicate
Volume ID: 2d189bdb-d657-4d7c-9556-6e7676b35ea3
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 10.35.29.222:/data/brick1/gv0
Brick2: 10.35.29.223:/data/brick1/gv0


> Actual results:
> Any command line command that accesses the gluster volume freezes for about
> 30 seconds.
> After about 30 seconds, the command proceeds as expected.
> Any change to a file on the volume on any node while the NIC of node 2 is
> down can cause the volume to be split brained
> Split brain behaviour did not resolve itself after the NIC of node 2 was
> returned to service (even after 1 hour of doing so)
> 
> 
> Expected results:
> Access of the volume on either node should not be impeded when The NIC of
> node 2 was brought down
> Changes to a file on either node that do not result in conflicts should not
> cause split brain after the NIC returns to service
> 
> 
> Additional info:

Comment 10 Atin Mukherjee 2015-09-16 04:17:02 UTC
Ravi,

Could you provide your inputs here?

Thanks,
Atin

Comment 11 Ravishankar N 2015-09-16 05:52:13 UTC
1. Regarding "Changes to a file on either node that do not result in conflicts should not cause split brain after the NIC returns to service" :

The steps described are the very steps that result in split-brain of a file, because the mount on each node can see only itself and not the other. Once a file gets into a split-brained state, there is no way of automagically getting out of it; you need manual intervention to resolve split-brains.

For gluster 3.6 or lower, use https://github.com/gluster/glusterdocs/blob/master/Troubleshooting/split-brain.md to resolve split-brains from the back-end bricks.

For 3.7 upwards, you can use the gluster CLI commands from the server, or a combination of getfattr/setfattr commands from the mount, to resolve split-brain. Usage is documented at https://github.com/gluster/glusterfs-specs/blob/master/done/Features/heal-info-and-split-brain-resolution.md
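
As an illustration of the 3.7 CLI (a sketch using the volume and file names from this report; note these commands resolve data/metadata split-brains, while gfid split-brains like the one logged in comment 8 generally still need the backend-brick procedure):

# keep the copy on node 1's brick as the source
gluster volume heal gv0 split-brain source-brick 10.35.29.222:/data/brick1/gv0 /testfile.txt
# or keep whichever copy is bigger
gluster volume heal gv0 split-brain bigger-file /testfile.txt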


2. Regarding the 30-second hang, I think that is expected behaviour (ping-timeout?). Feel free to re-assign to the appropriate component if it is a bug.
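
If the wait is a problem in practice, the timeout is tunable per volume (a sketch, not a recommendation; very low values can cause spurious disconnects under transient network load):

gluster volume set gv0 network.ping-timeout 10
gluster volume info gv0   # the changed value appears under "Options Reconfigured"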

Comment 12 Atin Mukherjee 2015-09-16 06:57:09 UTC
(In reply to Ravishankar N from comment #11)
> 1. Regarding "Changes to a file on either node that do not result in
> conflicts should not cause split brain after the NIC returns to service" :
> 
> The steps that are described are the very steps that result in the
> split-brain of a file because the mount on each node can see itself and not
> the other. Once a file gets into a split-brained state, there is no way of
> automagically getting out of it. You need manual intervention to resolve
> split-brains.
> 
> For gluster 3.6 or lower, use
> https://github.com/gluster/glusterdocs/blob/master/Troubleshooting/split-
> brain.md to resolve split-brains from the back-end bricks.
> 
> For 3.7 upwards, you can use  the gluster CLI commands from the server (or)
> a combination of get/seetfattr commands from the mount to resolve
> split-brain. Usage is documented at
> https://github.com/gluster/glusterfs-specs/blob/master/done/Features/heal-
> info-and-split-brain-resolution.md
> 
> 
> 2. Regarding the 30 seconds hang, I think that is expected behaviour
> (ping-timeout?). Feel free to re-assign to appropriate component if it is a
> bug.

Thanks, Ravi, for the explanation. Closing this bug as this is expected behaviour.

Comment 13 Huy VU 2015-09-16 12:32:18 UTC
Ravi,
Thank you for your explanation.

(In reply to Ravishankar N from comment #11)
> 1. Regarding "Changes to a file on either node that do not result in
> conflicts should not cause split brain after the NIC returns to service" :
> 
> The steps that are described are the very steps that result in the
> split-brain of a file because the mount on each node can see itself and not

So that I understand this clearly: A single change to a file while one node in the cluster is down will result in split-brain when the node recovers. Is this true?

> the other. Once a file gets into a split-brained state, there is no way of
> automagically getting out of it. You need manual intervention to resolve
> split-brains.
> 
> For gluster 3.6 or lower, use
> https://github.com/gluster/glusterdocs/blob/master/Troubleshooting/split-
> brain.md to resolve split-brains from the back-end bricks.
> 
> For 3.7 upwards, you can use  the gluster CLI commands from the server (or)
> a combination of get/seetfattr commands from the mount to resolve
> split-brain. Usage is documented at
> https://github.com/gluster/glusterfs-specs/blob/master/done/Features/heal-
> info-and-split-brain-resolution.md
> 
> 
> 2. Regarding the 30 seconds hang, I think that is expected behaviour
> (ping-timeout?). Feel free to re-assign to appropriate component if it is a
> bug.

So what we are saying, then, is that if a node in the cluster goes down, all clients will see a 30-second hang? Is this true?

Comment 14 Ravishankar N 2015-09-16 12:53:48 UTC
Hi VU,

Writing to the same file from the clients on both nodes, while each node can see only itself and not the other, can result in split-brain. Is that not what you did? If not, I may have misunderstood the steps.

Comment 15 Huy VU 2015-09-16 13:06:15 UTC
(In reply to Ravishankar N from comment #14)
> Hi VU,
> 
> Writing to the same file from the clients on both nodes where the node can
> only see itself and not the other can result in split-brains. Is that not
> what you did? If not I may have misunderstood the steps.

Ravi,
I am sorry for not making the steps clearer.

I used vi to add a few lines to the file on one node while the NIC card of the other node was forced down. Then I brought the NIC card up. That was enough to cause split brain.

I am also interested in knowing why there was a 30 second hang on both nodes when the NIC card was brought down.

NOTE: I tested directly on the two nodes. i.e. the vi command was run directly on node 1. I don't think this should have any bearing on the behaviour.

Comment 16 Ravishankar N 2015-09-16 16:17:16 UTC
(In reply to Huy VU from comment #15)
> (In reply to Ravishankar N from comment #14)
> > Hi VU,
> > 
> > Writing to the same file from the clients on both nodes where the node can
> > only see itself and not the other can result in split-brains. Is that not
> > what you did? If not I may have misunderstood the steps.
> 
> Ravi,
> I am sorry for not making the steps clearer.
> 
> I used vi to add a few lines to the file on one node while the NIC card of
> the other node was forced down. Then I brought the NIC card up. That was
> enough to cause split brain.

Ah, when you edit using vi, it creates a new swap file (with a different gfid) and renames it to the original file. But when node 2 comes up, it should be healed from node 1. Instead, it is trying to do a conservative merge, which means some kind of modification was done from the mount on node 2 while its eth0 was down. But you say that isn't the case. Let me look at the logs and figure it out.
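
One hedged way to confirm the gfid mismatch directly on the back-end bricks (run on each node, using the brick path from the volume info; requires the attr package):

getfattr -d -m . -e hex /data/brick1/gv0/testfile.txt
# compare the trusted.gfid values printed on node 1 and node 2; differing values indicate gfid split-brain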

> 
> I am also interested in knowing why there was a 30 second hang on both nodes
> when the NIC card was brought down.

When you brought the interface down, I'm guessing the mount on node 1 was not notified immediately (unlike the case when the brick process is killed, etc., in which case the mount immediately gets a disconnect event for that brick), so it waits until the network.ping-timeout value expires (42 seconds by default).
> 
> NOTE: I tested directly on the two nodes. i.e. the vi command was run
> directly on node 1. I don't think this should have any bearing on the
> behaviour.

Comment 17 Ravishankar N 2015-09-17 03:23:47 UTC
So I see that the file has been edited from node 2 as well. 

`grep -rne "renaming /testfile.txt"  mnt-glusterd.log` shows entries from the mount log of both nodes, leading to gfid split-brain. Is that right?

Comment 18 Atin Mukherjee 2015-09-17 11:52:29 UTC
(In reply to Huy VU from comment #15)
> (In reply to Ravishankar N from comment #14)
> > Hi VU,
> > 
> > Writing to the same file from the clients on both nodes where the node can
> > only see itself and not the other can result in split-brains. Is that not
> > what you did? If not I may have misunderstood the steps.
> 
> Ravi,
> I am sorry for not making the steps clearer.
> 
> I used vi to add a few lines to the file on one node while the NIC card of
> the other node was forced down. Then I brought the NIC card up. That was
> enough to cause split brain.
> 
> I am also interested in knowing why there was a 30 second hang on both nodes
> when the NIC card was brought down.
glusterd's ping-timeout value is 30 seconds. So if a node or the network goes faulty, the other parties may not get a disconnect notification back, as TCP keep-alive never guarantees that. It is the application's responsibility to have heartbeats to detect such failures. In this case the timeout happened after 30 seconds, and during that interval you observed the hang, which is expected.
Hope this clarifies the point here.

Ravi,

Since I do not see this behaviour as an issue, I am moving the component back to replicate.
> 
> NOTE: I tested directly on the two nodes. i.e. the vi command was run
> directly on node 1. I don't think this should have any bearing on the
> behaviour.

Comment 19 Huy VU 2015-09-17 12:40:09 UTC
(In reply to Ravishankar N from comment #17)
> So I see that the file has been edited from node 2 as well. 
> 
> `grep -rne "renaming /testfile.txt"  mnt-glusterd.log` shows entries from
> the mount log of both nodes, leading to gfid split-brain. Is that right?

Hello Ravi,

I did a number of tests at different times. Some tests had me editing the file on node 1; some on node 2; some on both. When you grep for 'renaming', please do so on both sets of logs (node 1's and node 2's) and compare the timestamps of the log entries.
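
For example, assuming the two log sets are extracted into node1/ and node2/ directories, something like the following would line the rename events up by timestamp:

grep -ne "renaming /testfile.txt" node1/mnt-glusterd.log node2/mnt-glusterd.log
# the bracketed UTC timestamps on the matched lines show whether renames from both mounts fall inside the same outage window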

Comment 20 Huy VU 2015-09-17 12:49:12 UTC
(In reply to Atin Mukherjee from comment #18)
> (In reply to Huy VU from comment #15)
> > (In reply to Ravishankar N from comment #14)
> > > Hi VU,
> > > 
> > > Writing to the same file from the clients on both nodes where the node can
> > > only see itself and not the other can result in split-brains. Is that not
> > > what you did? If not I may have misunderstood the steps.
> > 
> > Ravi,
> > I am sorry for not making the steps clearer.
> > 
> > I used vi to add a few lines to the file on one node while the NIC card of
> > the other node was forced down. Then I brought the NIC card up. That was
> > enough to cause split brain.
> > 
> > I am also interested in knowing why there was a 30 second hang on both nodes
> > when the NIC card was brought down.
> glusterd's ping time out value is 30 secs. So in case of any node/network
> going faulty the other parties may not get a disconnect notification back as
> tcp keep alive never guarantees that. Its the application's responsibilities
> to have heart beats to detect such failures. In this case after 30 secs the
> time out happened and during that interval you observed the hang which is
> expected.
> Hope this clarifies the point here.
> 
> Ravi,
> 
> since I do not see this behaviour as an issue. moving the component back to
> replicate
> > 
> > NOTE: I tested directly on the two nodes. i.e. the vi command was run
> > directly on node 1. I don't think this should have any bearing on the
> > behaviour.
Atin,

If this behaviour is as intended, then so be it; you can close this bug. However, I would think it's an area for improvement. The current behaviour can be described as synchronous replication. If there were a control flag letting us choose asynchronous replication, that would improve performance tremendously.

Comment 21 Ravishankar N 2015-09-17 15:00:04 UTC
(In reply to Huy VU from comment #19)
> (In reply to Ravishankar N from comment #17)
> > So I see that the file has been edited from node 2 as well. 
> > 
> > `grep -rne "renaming /testfile.txt"  mnt-glusterd.log` shows entries from
> > the mount log of both nodes, leading to gfid split-brain. Is that right?
> 
> Hello Ravi,
> 
> I did a number of tests at different times. Some tests had me editing the
> file on node 1; some one node 2; some on both.

Hi VU,

I know that the timestamps are different. If you do modification operations from different nodes while each node cannot see the other, it will result in split-brain.

Are you consistently able to reproduce the issue with the steps you described? (I am not.) If yes, then please upload the logs from that fresh test setup (it makes debugging easier). Here is what I tried:

1. Create a 1x2 volume and mount it on both nodes; create a file from either node's mount.
2. Bring eth0 down on node 2.
3. Edit the file from node 1's mount.
4. Bring node 2 back up.
5. Launch heal.

No split-brain observed.

Note that if you do any modifications on node 2 (even to another file) between steps 2 and 4, the parent directories end up in entry split-brain and a conservative merge is attempted, which fails to heal the file edited in step 3 due to the gfid mismatch. This is what I think happened in your case.
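
For reference, a rough shell transcript of the repro steps above (mount point and file name are assumptions):

# on each node
mount -t glusterfs localhost:/gv0 /mnt/gv0
# node 1
echo "hello" > /mnt/gv0/testfile.txt
# node 2
ifconfig eth0 down
# node 1: edit the file while node 2 is unreachable
vi /mnt/gv0/testfile.txt
# node 2: restore the interface, then heal
ifconfig eth0 up
gluster volume heal gv0
gluster volume heal gv0 info   # expected: no split-brain for this sequence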

 When you grep for 'renaming',
> please do so on both sets of logs (node 1's and node 2's) and compare the
> timestamps of the logs.

Comment 22 Red Hat Bugzilla 2023-09-14 03:05:17 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days