Bug 1089758

Summary: KVM+Qemu + libgfapi: problem dealing with failover of replica bricks causing disk corruption and vm failure.
Product: [Community] GlusterFS
Component: replicate
Version: 3.4.3
Hardware: x86_64
OS: Linux
Status: CLOSED EOL
Severity: high
Priority: unspecified
Reporter: Paul <ppquant>
Assignee: Pranith Kumar K <pkarampu>
QA Contact: SATHEESARAN <sasundar>
CC: bugs, c.affolter, dietmar, dshetty, garret.cook, gluster-bugs, joe, sasundar
Target Milestone: ---
Target Release: ---
Type: Bug
Doc Type: Bug Fix
Last Closed: 2015-10-07 13:59:36 UTC
Attachments:
  - log from the brick process with "DEBUG"
  - qemu libgfapi log
  - log from brick on server 14
  - log from brick on server 15

Description Paul 2014-04-21 22:01:09 UTC
Description of problem:

libgfapi (as used by KVM/QEMU) fails to handle the failure of bricks in a replica 2 volume properly. As a result, KVM loses access to the disk images, causing lock-ups and disk corruption at the VM level.

Version-Release number of selected component (if applicable):
GlusterFS 3.4.3
KVM 1.7.1

How reproducible:
Create a replica 2 volume holding KVM disk image(s). Start KVM, fail the first brick, bring it back and let it heal, then fail the second brick. At this point KVM loses access to its disks.
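
For reference, a rough sketch of the kind of setup described here. This is a hypothetical reconstruction: the IPs, brick paths, volume name and image path are taken from the logs below where available, and the QEMU flags are only illustrative.

# replica-2 volume backing the VM images
gluster volume create gtest replica 2 10.0.0.14:/var/gtest/brick 10.0.0.15:/var/gtest/brick
gluster volume start gtest

# boot the guest through QEMU's gluster:// (libgfapi) block backend
qemu-system-x86_64 -enable-kvm -m 2048 \
    -drive file=gluster://10.0.0.14/gtest/images/1002/vm-1002-disk-1.qcow2,if=virtio,format=qcow2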

Steps to Reproduce:
1. Create a replica 2 Gluster volume and use it to hold virtual machine disk images.
2. Start KVM with libgfapi backend (gluster://....)
3. Kill (or reboot) first brick from the replica set.
4. Bring the first brick back into the volume and issue a heal operation via 'gluster volume heal <volume>'
5. Wait for the self-heal to complete (see the heal-status sketch after these steps)
6. Check gfid on the bricks to make sure they're healed:

brick 1:

# getfattr -m . -d -e hex vm-1002-disk-1.qcow2 
# file: vm-1002-disk-1.qcow2
trusted.afr.gtest-client-0=0x000000000000000000000000
trusted.afr.gtest-client-1=0x000000000000000000000000
trusted.gfid=0x898f359f15014896a3d7673c438f79d1

brick 2: 

# getfattr -m . -d -e hex vm-1002-disk-1.qcow2
# file: vm-1002-disk-1.qcow2
trusted.afr.gtest-client-0=0x000000000000000000000000
trusted.afr.gtest-client-1=0x000000000000000000000000
trusted.gfid=0x898f359f15014896a3d7673c438f79d1

7. Now stop (or kill) the glusterfsd process on the second brick.

8. Try using the KVM virtual machine. Disks will start throwing errors.
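
As referenced in step 5, one way to confirm that the heal has actually finished before failing the second brick (a hedged sketch; the volume name, brick path and image path are taken from the logs in this report):

# should report no entries once self-heal is done
gluster volume heal gtest info

# on each brick, the trusted.afr.* changelogs for the image should be all zeros
getfattr -m . -d -e hex /var/gtest/brick/images/1002/vm-1002-disk-1.qcow2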


Actual results: 
The KVM machine loses access to its disks.


Expected results:
The KVM machine should continue to have access to its disks.


Additional info:
Attached are logs from libgfapi (the client) running in KVM.

Start KVM 

[2014-04-21 20:18:09.219715] I [socket.c:3480:socket_init] 0-gfapi: SSL support is NOT enabled
[2014-04-21 20:18:09.219758] I [socket.c:3495:socket_init] 0-gfapi: using system polling thread
[2014-04-21 20:18:09.227508] I [socket.c:3480:socket_init] 0-gtest-client-1: SSL support is NOT enabled
[2014-04-21 20:18:09.227530] I [socket.c:3495:socket_init] 0-gtest-client-1: using system polling thread
[2014-04-21 20:18:09.228142] I [socket.c:3480:socket_init] 0-gtest-client-0: SSL support is NOT enabled
[2014-04-21 20:18:09.228155] I [socket.c:3495:socket_init] 0-gtest-client-0: using system polling thread
[2014-04-21 20:18:09.228178] I [glfs-master.c:92:notify] 0-gfapi: New graph 7631342d-3334-3636-3131-2d323031342f (0) coming up
[2014-04-21 20:18:09.228193] I [client.c:2155:notify] 0-gtest-client-0: parent translators are ready, attempting connect on transport
[2014-04-21 20:18:09.228646] I [client.c:2155:notify] 0-gtest-client-1: parent translators are ready, attempting connect on transport
[2014-04-21 20:18:09.229356] I [rpc-clnt.c:1676:rpc_clnt_reconfig] 0-gtest-client-0: changing port to 49153 (from 0)
[2014-04-21 20:18:09.229416] W [socket.c:514:__socket_rwv] 0-gtest-client-0: readv failed (No data available)
[2014-04-21 20:18:09.229960] I [client-handshake.c:1659:select_server_supported_programs] 0-gtest-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2014-04-21 20:18:09.230060] I [rpc-clnt.c:1676:rpc_clnt_reconfig] 0-gtest-client-1: changing port to 49153 (from 0)
[2014-04-21 20:18:09.230099] W [socket.c:514:__socket_rwv] 0-gtest-client-1: readv failed (No data available)
[2014-04-21 20:18:09.230380] I [client-handshake.c:1456:client_setvolume_cbk] 0-gtest-client-0: Connected to 10.0.0.14:49153, attached to remote volume '/var/gtest/brick'.
[2014-04-21 20:18:09.230395] I [client-handshake.c:1468:client_setvolume_cbk] 0-gtest-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2014-04-21 20:18:09.230443] I [afr-common.c:3698:afr_notify] 0-gtest-replicate-0: Subvolume 'gtest-client-0' came back up; going online.
[2014-04-21 20:18:09.230598] I [client-handshake.c:450:client_set_lk_version_cbk] 0-gtest-client-0: Server lk version = 1
[2014-04-21 20:18:09.230922] I [client-handshake.c:1659:select_server_supported_programs] 0-gtest-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2014-04-21 20:18:09.231290] I [client-handshake.c:1456:client_setvolume_cbk] 0-gtest-client-1: Connected to 10.0.0.15:49153, attached to remote volume '/var/gtest/brick'.
[2014-04-21 20:18:09.231311] I [client-handshake.c:1468:client_setvolume_cbk] 0-gtest-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2014-04-21 20:18:09.247979] I [client-handshake.c:450:client_set_lk_version_cbk] 0-gtest-client-1: Server lk version = 1
[2014-04-21 20:18:09.248451] I [afr-common.c:2057:afr_set_root_inode_on_first_lookup] 0-gtest-replicate-0: added root inode
[2014-04-21 20:18:09.249045] I [afr-common.c:2120:afr_discovery_cbk] 0-gtest-replicate-0: selecting local read_child gtest-client-0
[2014-04-21 20:18:09.249125] I [glfs-resolve.c:788:__glfs_active_subvol] 0-gtest: switched to graph 7631342d-3334-3636-3131-2d323031342f (0)


Kill glusterfsd:

[2014-04-21 20:21:35.291871] W [socket.c:514:__socket_rwv] 0-gtest-client-0: readv failed (No data available)
[2014-04-21 20:21:35.291914] W [socket.c:1962:__socket_proto_state_machine] 0-gtest-client-0: reading from socket failed. Error (No data available), peer (10.0.0.14:49153)
[2014-04-21 20:21:35.291972] I [client.c:2098:client_rpc_notify] 0-gtest-client-0: disconnected
[2014-04-21 20:21:46.242636] I [rpc-clnt.c:1676:rpc_clnt_reconfig] 0-gtest-client-0: changing port to 49153 (from 0)
[2014-04-21 20:21:46.242721] W [socket.c:514:__socket_rwv] 0-gtest-client-0: readv failed (No data available)
[2014-04-21 20:21:46.243047] E [socket.c:2157:socket_connect_finish] 0-gtest-client-0: connection to 10.0.0.14:49153 failed (Connection refused)
[2014-04-21 20:21:46.243073] W [socket.c:514:__socket_rwv] 0-gtest-client-0: readv failed (No data available)
[2014-04-21 20:21:50.243299] W [socket.c:514:__socket_rwv] 0-gtest-client-0: readv failed (No data available)

On restart of gluster server: 

[2014-04-21 20:23:26.266552] W [socket.c:514:__socket_rwv] 0-gtest-client-0: readv failed (No data available)
[2014-04-21 20:23:27.566449] W [socket.c:514:__socket_rwv] 0-gfapi: readv failed (No data available)
[2014-04-21 20:23:27.566483] W [socket.c:1962:__socket_proto_state_machine] 0-gfapi: reading from socket failed. Error (No data available), peer (127.0.0.1:24007)
[2014-04-21 20:23:29.267245] W [socket.c:514:__socket_rwv] 0-gtest-client-0: readv failed (No data available)
[2014-04-21 20:23:32.268085] W [socket.c:514:__socket_rwv] 0-gtest-client-0: readv failed (No data available)
[2014-04-21 20:23:35.269082] I [client-handshake.c:1659:select_server_supported_programs] 0-gtest-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2014-04-21 20:23:35.269343] I [client-handshake.c:1456:client_setvolume_cbk] 0-gtest-client-0: Connected to 10.0.0.14:49153, attached to remote volume '/var/gtest/brick'.
[2014-04-21 20:23:35.269373] I [client-handshake.c:1468:client_setvolume_cbk] 0-gtest-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2014-04-21 20:23:35.269391] I [client-handshake.c:1308:client_post_handshake] 0-gtest-client-0: 1 fds open - Delaying child_up until they are re-opened
[2014-04-21 20:23:35.269854] I [client-handshake.c:930:client_child_up_reopen_done] 0-gtest-client-0: last fd open'd/lock-self-heal'd - notifying CHILD-UP
[2014-04-21 20:23:35.269997] I [client-handshake.c:450:client_set_lk_version_cbk] 0-gtest-client-0: Server lk version = 1
[2014-04-21 20:23:38.269808] I [glfs-mgmt.c:429:mgmt_getspec_cbk] 0-gfapi: No change in volfile, continuing

On killing the second replica (after the heal completed), KVM starts throwing disk errors.

[2014-04-21 20:25:51.542494] W [socket.c:514:__socket_rwv] 0-gtest-client-1: readv failed (No data available)
[2014-04-21 20:25:51.542546] W [socket.c:1962:__socket_proto_state_machine] 0-gtest-client-1: reading from socket failed. Error (No data available), peer (10.0.0.15:49153)
[2014-04-21 20:25:51.542591] I [client.c:2098:client_rpc_notify] 0-gtest-client-1: disconnected
[2014-04-21 20:26:02.283871] E [client-handshake.c:1742:client_query_portmap_cbk] 0-gtest-client-1: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2014-04-21 20:26:02.283925] W [socket.c:514:__socket_rwv] 0-gtest-client-1: readv failed (No data available)
[2014-04-21 20:26:02.283945] I [client.c:2098:client_rpc_notify] 0-gtest-client-1: disconnected
[2014-04-21 20:26:05.284560] W [socket.c:514:__socket_rwv] 0-gtest-client-1: readv failed (No data available)
[2014-04-21 20:26:05.284607] I [client.c:2098:client_rpc_notify] 0-gtest-client-1: disconnected
[2014-04-21 20:26:08.285137] W [socket.c:514:__socket_rwv] 0-gtest-client-1: readv failed (No data available)
[2014-04-21 20:26:08.285185] I [client.c:2098:client_rpc_notify] 0-gtest-client-1: disconnected
[2014-04-21 20:26:11.285524] W [socket.c:514:__socket_rwv] 0-gtest-client-1: readv failed (No data available)
[2014-04-21 20:26:11.285591] I [client.c:2098:client_rpc_notify] 0-gtest-client-1: disconnected
[2014-04-21 20:26:14.286501] W [socket.c:514:__socket_rwv] 0-gtest-client-1: readv failed (No data available)
[2014-04-21 20:26:14.286552] I [client.c:2098:client_rpc_notify] 0-gtest-client-1: disconnected
[2014-04-21 20:26:17.287190] W [socket.c:514:__socket_rwv] 0-gtest-client-1: readv failed (No data available)
[2014-04-21 20:26:17.287238] I [client.c:2098:client_rpc_notify] 0-gtest-client-1: disconnected
[2014-04-21 20:26:20.287539] W [socket.c:514:__socket_rwv] 0-gtest-client-1: readv failed (No data available)
[2014-04-21 20:26:20.287589] I [client.c:2098:client_rpc_notify] 0-gtest-client-1: disconnected
[2014-04-21 20:26:22.735155] E [afr-self-heal-common.c:2260:afr_self_heal_completion_cbk] 0-gtest-replicate-0: background  meta-data data self-heal failed on /images/1002/vm-1002-disk-1.qcow2
[2014-04-21 20:26:23.288242] W [socket.c:514:__socket_rwv] 0-gtest-client-1: readv failed (No data available)
[2014-04-21 20:26:23.288281] I [client.c:2098:client_rpc_notify] 0-gtest-client-1: disconnected

Comment 1 Pranith Kumar K 2014-04-23 04:48:45 UTC
Paul,
     I don't have a KVM setup that uses libgfapi on my machine (I don't want to disturb the setup I have :-( ). Would it be possible for you to recreate this issue with some instrumentation (logs that could indicate what the problem is)?

Pranith

Comment 2 Joe Julian 2014-04-23 05:00:15 UTC
I don't understand how it's hitting a heal completion callback without starting a background self heal, especially since the xattrs are clean.

Comment 3 Pranith Kumar K 2014-04-23 05:23:46 UTC
Joe,
    I am not sure either. Unfortunately, the 3.4.2 self-heal completion log is at DEBUG level. I wanted to make sure this issue is not caused by the fresh-copy cache maintained in AFR going bad for that file. As part of the instrumentation, I plan to move that log message up to INFO and add some logging about the current contents of the fresh-copy cache.

Pranith

Comment 4 Paul 2014-04-24 12:36:50 UTC
Pranith,

if you want, I can replicate the issue with logs at DEBUG level. I made a custom kvm+qemu build where I can raise the level higher just for this purpose.

This is happening with gluster 3.4.3.

It is not related to this particular file. I can reproduce it every time, on different VMs (both Linux and Windows).

Comment 5 Paul 2014-04-24 17:56:04 UTC
I'm adding logs made with level 8 in libgfapi logging (DEBUG):

On the "client" qemu I see:

RESTART FIRST
[2014-04-24 17:44:23.172864] W [socket.c:514:__socket_rwv] 0-gtest-client-0: readv failed (No data available)
[2014-04-24 17:44:26.173499] W [socket.c:514:__socket_rwv] 0-gtest-client-0: readv failed (No data available)
[2014-04-24 17:44:29.174208] W [socket.c:514:__socket_rwv] 0-gtest-client-0: readv failed (No data available)
[2014-04-24 17:44:32.174848] W [socket.c:514:__socket_rwv] 0-gtest-client-0: readv failed (No data available)
[2014-04-24 17:44:32.344589] W [socket.c:514:__socket_rwv] 0-gfapi: readv failed (No data available)
[2014-04-24 17:44:32.344619] W [socket.c:1962:__socket_proto_state_machine] 0-gfapi: reading from socket failed. Error (No data available), peer (127.0.0.1:24007)
[2014-04-24 17:44:35.175559] W [socket.c:514:__socket_rwv] 0-gtest-client-0: readv failed (No data available)
[2014-04-24 17:44:38.176280] W [socket.c:514:__socket_rwv] 0-gtest-client-0: readv failed (No data available)
[2014-04-24 17:44:41.177154] I [client-handshake.c:1659:select_server_supported_programs] 0-gtest-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2014-04-24 17:44:41.177506] I [client-handshake.c:1456:client_setvolume_cbk] 0-gtest-client-0: Connected to 10.0.0.234:49153, attached to remote volume '/var/gtest/brick'.
[2014-04-24 17:44:41.177530] I [client-handshake.c:1468:client_setvolume_cbk] 0-gtest-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2014-04-24 17:44:41.177545] I [client-handshake.c:1308:client_post_handshake] 0-gtest-client-0: 1 fds open - Delaying child_up until they are re-opened
[2014-04-24 17:44:41.177930] I [client-handshake.c:930:client_child_up_reopen_done] 0-gtest-client-0: last fd open'd/lock-self-heal'd - notifying CHILD-UP
[2014-04-24 17:44:41.178017] I [client-handshake.c:450:client_set_lk_version_cbk] 0-gtest-client-0: Server lk version = 1
[2014-04-24 17:44:43.177790] I [glfs-mgmt.c:429:mgmt_getspec_cbk] 0-gfapi: No change in volfile, continuing
KILL SECOND
[2014-04-24 17:46:24.523825] W [socket.c:514:__socket_rwv] 0-gtest-client-1: readv failed (No data available)
[2014-04-24 17:46:24.523860] W [socket.c:1962:__socket_proto_state_machine] 0-gtest-client-1: reading from socket failed. Error (No data available), peer (10.0.0.235:49153)
[2014-04-24 17:46:24.523914] I [client.c:2098:client_rpc_notify] 0-gtest-client-1: disconnected
[2014-04-24 17:46:29.896721] E [afr-self-heal-data.c:986:afr_sh_data_fxattrop_cbk] 0-gtest-replicate-0: /images/1002/vm-1002-disk-1.qcow2, inspecting change log succeeded on < 2 children
[2014-04-24 17:46:29.896898] E [afr-self-heal-common.c:2260:afr_self_heal_completion_cbk] 0-gtest-replicate-0: background  meta-data data self-heal failed on /images/1002/vm-1002-disk-1.qcow2
[2014-04-24 17:46:35.189287] I [rpc-clnt.c:1676:rpc_clnt_reconfig] 0-gtest-client-1: changing port to 49153 (from 0)
[2014-04-24 17:46:35.189345] W [socket.c:514:__socket_rwv] 0-gtest-client-1: readv failed (No data available)
[2014-04-24 17:46:35.189880] E [socket.c:2157:socket_connect_finish] 0-gtest-client-1: connection to 10.0.0.235:49153 failed (Connection refused)
[2014-04-24 17:46:35.189909] W [socket.c:514:__socket_rwv] 0-gtest-client-1: readv failed (No data available)
[2014-04-24 17:46:39.189475] W [socket.c:514:__socket_rwv] 0-gtest-client-1: readv failed (No data available)
[2014-04-24 17:46:42.190320] W [socket.c:514:__socket_rwv] 0-gtest-client-1: readv failed (No data available)
[2014-04-24 17:46:45.191116] W [socket.c:514:__socket_rwv] 0-gtest-client-1: readv failed (No data available)
[2014-04-24 17:46:48.191872] W [socket.c:514:__socket_rwv] 0-gtest-client-1: readv failed (No data available)
[2014-04-24 17:46:51.192616] W [socket.c:514:__socket_rwv] 0-gtest-client-1: readv failed (No data available)
[2014-04-24 17:46:54.193382] W [socket.c:514:__socket_rwv] 0-gtest-client-1: readv failed (No data available)
[2014-04-24 17:46:57.194083] W [socket.c:514:__socket_rwv] 0-gtest-client-1: readv failed (No data available)
[2014-04-24 17:47:00.194876] W [socket.c:514:__socket_rwv] 0-gtest-client-1: readv failed (No data available)
[2014-04-24 17:47:03.195461] W [socket.c:514:__socket_rwv] 0-gtest-client-1: readv failed (No data available)
[2014-04-24 17:47:06.196200] W [socket.c:514:__socket_rwv] 0-gtest-client-1: readv failed (No data available)
[2014-04-24 17:47:09.196922] W [socket.c:514:__socket_rwv] 0-gtest-client-1: readv failed (No data available)
[2014-04-24 17:47:12.197596] W [socket.c:514:__socket_rwv] 0-gtest-client-1: readv failed (No data available)
[2014-04-24 17:47:15.198298] W [socket.c:514:__socket_rwv] 0-gtest-client-1: readv failed (No data available)
[2014-04-24 17:47:18.199093] W [socket.c:514:__socket_rwv] 0-gtest-client-1: readv failed (No data available)
[2014-04-24 17:47:21.199767] W [socket.c:514:__socket_rwv] 0-gtest-client-1: readv failed (No data available)
[2014-04-24 17:47:24.200578] W [socket.c:514:__socket_rwv] 0-gtest-client-1: readv failed (No data available)


On the "second brick" in the brick log I am attaching to this the file 'var-gtest-brick.log.save2.gz' as it's big for cut and paste.

Comment 6 Paul 2014-04-24 17:56:57 UTC
Created attachment 889400 [details]
log from the brick process with "DEBUG"

Comment 7 Pranith Kumar K 2014-04-24 18:00:33 UTC
Paul,
   Would it be possible to attach all the logs, please? Just compress them and add them to the bug. I will take a look at them in the morning and respond.

Thanks a lot for your help :-)
Pranith

Comment 8 Joe Julian 2014-04-24 20:07:36 UTC
That doesn't look like debug logging on the client. Try "gluster volume set $vol diagnostics.client-log-level DEBUG"
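
A rough sketch of what that looks like end to end (the volume name "gtest" is assumed from the logs; 'volume reset' is just one way to undo the setting afterwards):

# raise the client-side (libgfapi/FUSE) log level for the volume
gluster volume set gtest diagnostics.client-log-level DEBUG

# ... reproduce the brick failover and capture the client log ...

# put the log level back to its default
gluster volume reset gtest diagnostics.client-log-level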

Comment 9 Paul 2014-04-24 20:58:45 UTC
Created attachment 889439 [details]
qemu libgfapi log

Comment 10 Paul 2014-04-24 20:59:49 UTC
Created attachment 889440 [details]
log from brick on server 14

Comment 11 Paul 2014-04-24 21:00:17 UTC
Created attachment 889441 [details]
log from brick on server 15

Comment 12 Paul 2014-04-24 21:02:55 UTC
Joe,

thank you for pointing out the DEBUG logging issue. I was under the impression that raising the log level in the libgfapi call would be honoured.

Instead, DEBUG logging also needed to be enabled on the volume.

Now there is a much better log for the client, and perhaps we have a winner:

[2014-04-24 20:50:46.521040] D [afr-transaction.c:440:afr_transaction_rm_stale_children] 0-gtest-replicate-0: Possible split-brain for 898f359f-1501-4896-a3d7-673c438f79d1
[2014-04-24 20:50:46.603189] D [afr-transaction.c:440:afr_transaction_rm_stale_children] 0-gtest-replicate-0: Possible split-brain for 898f359f-1501-4896-a3d7-673c438f79d1
[2014-04-24 20:50:52.743692] D [afr-transaction.c:440:afr_transaction_rm_stale_children] 0-gtest-replicate-0: Possible split-brain for 898f359f-1501-4896-a3d7-673c438f79d1
[2014-04-24 20:50:52.782858] D [afr-transaction.c:440:afr_transaction_rm_stale_children] 0-gtest-replicate-0: Possible split-brain for 898f359f-1501-4896-a3d7-673c438f79d1
[2014-04-24 20:50:52.782939] D [afr-transaction.c:440:afr_transaction_rm_stale_children] 0-gtest-replicate-0: Possible split-brain for 898f359f-1501-4896-a3d7-673c438f79d1
[2014-04-24 20:50:52.784064] D [afr-transaction.c:440:afr_transaction_rm_stale_children] 0-gtest-replicate-0: Possible split-brain for 898f359f-1501-4896-a3d7-673c438f79d1
[2014-04-24 20:50:52.785425] D [afr-transaction.c:440:afr_transaction_rm_stale_children] 0-gtest-replicate-0: Possible split-brain for 898f359f-1501-4896-a3d7-673c438f79d1
[2014-04-24 20:50:52.857037] D [afr-transaction.c:440:afr_transaction_rm_stale_children] 0-gtest-replicate-0: Possible split-brain for 898f359f-1501-4896-a3d7-673c438f79d1
[2014-04-24 20:50:52.934445] D [afr-transaction.c:440:afr_transaction_rm_stale_children] 0-gtest-replicate-0: Possible split-brain for 898f359f-1501-4896-a3d7-673c438f79d1
[2014-04-24 20:51:06.140865] D [afr-transaction.c:440:afr_transaction_rm_stale_children] 0-gtest-replicate-0: Possible split-brain for 898f359f-1501-4896-a3d7-673c438f79d1
[2014-04-24 20:51:06.212544] D [afr-transaction.c:440:afr_transaction_rm_stale_children] 0-gtest-replicate-0: Possible split-brain for 898f359f-1501-4896-a3d7-673c438f79d1
[2014-04-24 20:51:20.625001] D [afr-transaction.c:440:afr_transaction_rm_stale_children] 0-gtest-replicate-0: Possible split-brain for 898f359f-1501-4896-a3d7-673c438f79d1


It seems that AFR thinks there is a split-brain, even though the heal process had finished at that time (as can be seen from the brick logs).

I hope that we're getting closer to the solution now.

Comment 13 Paul 2014-04-25 08:21:39 UTC
Comment on attachment 889439 [details]
qemu libgfapi log

This is a log from the client (libgfapi) made with 

diagnostics.client-log-level DEBUG and log level GF_LOG_DEBUG (8) from inside qemu.

Comment 14 Pranith Kumar K 2014-04-26 20:09:26 UTC
Paul,
    I think I found the root cause. In 3.4, AFR's self-heal could not hold a blocking lock on the entire file, which would have blocked writes from the mount. This is what causes the problem: the mount (here, gfapi) is not able to acquire the latest changelogs on the file, so its cache of fresh children is not updated. I think this is fixed in 3.5.

Following logs prove that:
[2014-04-24 20:49:34.738769] D [afr-self-heal-data.c:1145:afr_sh_data_post_nonblocking_inodelk_cbk] 0-gtest-replicate-0: Non Blocking data inodelks failed for /images/1002/vm-1002-disk-1.qcow2. by 44a1631e8d7f0000
[2014-04-24 20:49:34.738813] D [afr-self-heal-data.c:336:afr_sh_data_fail] 0-gtest-replicate-0: finishing failed data selfheal of /images/1002/vm-1002-disk-1.qcow2

You can read more about it at:
http://review.gluster.org/5100

I need to check what other patches need to be backported for this to work.
This I found is a must: http://review.gluster.org/5099

Let me also ask Vijay/Avati (the maintainers) about any plans for the next 3.4 release.
I will do the backporting work on Monday (India time).

Do you think it is possible for you to verify the changes?

Pranith

Comment 15 Paul 2014-04-30 06:39:23 UTC
Once you have a clean patch set to apply on 3.4.3, I can review the changes and report back.

Comment 16 Paul 2014-04-30 07:56:40 UTC
Pranith,

I did a quick test with Gluster 3.5.0 and the problem is present there too. The client fails in exactly the same manner as under 3.4.3.

Maybe this will help you in searching for the problem.

Comment 17 Pranith Kumar K 2014-04-30 08:54:37 UTC
Thanks, Paul. I think I should create the setup and try to re-create it myself. Give me a day and I will do that.

Comment 18 Paul 2014-05-10 10:47:02 UTC
Any news on this bug?

Comment 19 Pranith Kumar K 2014-05-13 06:02:44 UTC
I am able to re-create the bug even with a non-gfapi (FUSE) mount. It gives the same DEBUG logs, as if the fd went into split-brain:
[2014-04-24 20:51:41.220990] D [afr-common.c:745:afr_get_call_child] 0-gtest-replicate-0: Returning -5, call_child: 0, last_index: 0

5 is errno for Input/Output error.

Xattrs suggest everything is fine:
tests/bugs/../include.rc: line 119: read: read error: 5: Input/output error
not ok 17 
getfattr: Removing leading '/' from absolute path names
# file: d/backends/patchy0/a
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a64656661756c745f743a733000
trusted.afr.patchy-client-0=0x000000000000000000000000
trusted.afr.patchy-client-1=0x000000000000000000000000
trusted.gfid=0x0f9053882b364dfabf06bcad1adbf295

# file: d/backends/patchy1/a
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a64656661756c745f743a733000
trusted.afr.patchy-client-0=0x000000000000000000000000
trusted.afr.patchy-client-1=0x000000000000000000000000
trusted.gfid=0x0f9053882b364dfabf06bcad1adbf295

This is the script:
#!/bin/bash

. $(dirname $0)/../include.rc
. $(dirname $0)/../volume.rc

cleanup;
TEST glusterd
TEST pidof glusterd

TEST $CLI volume create $V0 replica 2 $H0:$B0/${V0}{0,1}
TEST $CLI volume set $V0 performance.quick-read off
TEST $CLI volume set $V0 performance.io-cache off
TEST $CLI volume set $V0 performance.write-behind off
TEST $CLI volume set $V0 performance.stat-prefetch off
TEST $CLI volume set $V0 performance.read-ahead off
TEST $CLI volume set $V0 cluster.background-self-heal-count 0
TEST $CLI volume set $V0 client-log-level DEBUG
TEST $CLI volume start $V0
TEST glusterfs --entry-timeout=0 --attribute-timeout=0 -s $H0 --volfile-id=$V0 $M0 --direct-io-mode=enable
touch $M0/a

exec 5<>$M0/a #Open file in read/write mode

kill_brick $V0 $H0 $B0/${V0}0
echo "hi" >&5 #Make sure fd witnessed failure of brick
TEST $CLI volume start $V0 force
EXPECT_WITHIN 20 "1" afr_child_up_status $V0 0

sleep 10 #Make sure file is healed by self-heal-daemon and not mount
kill_brick $V0 $H0 $B0/${V0}1
echo "bye" >&5 #Make sure fd witnessed failure of second brick
TEST $CLI volume start $V0 force
EXPECT_WITHIN 20 "1" afr_child_up_status $V0 1

TEST read -u 5 line #Now mount thinks there is no good copy, this should fail.
echo $line
exec 5<&-

getfattr -d -m. -e hex $B0/${V0}{0,1}/a

cleanup;

Comment 20 Pranith Kumar K 2014-05-13 06:11:13 UTC
Tested the same case with 3.5 and it fails with the same error. I will check whether it is present in master as well and update.

tests/bugs/../include.rc: line 149: read: read error: 5: Input/output error
not ok 17 
getfattr: Removing leading '/' from absolute path names
# file: d/backends/patchy0/a
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a64656661756c745f743a733000
trusted.afr.patchy-client-0=0x000000000000000000000000
trusted.afr.patchy-client-1=0x000000000000000000000000
trusted.gfid=0x5c77313ce6694b06a6a9706e3a137ebc

# file: d/backends/patchy1/a
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a64656661756c745f743a733000
trusted.afr.patchy-client-0=0x000000000000000000000000
trusted.afr.patchy-client-1=0x000000000000000000000000
trusted.gfid=0x5c77313ce6694b06a6a9706e3a137ebc
Failed 1/17 subtests 

Test Summary Report
-------------------
tests/bugs/bug-1089758.t (Wstat: 0 Tests: 17 Failed: 1)
  Failed test:  17
Files=1, Tests=17, 47 wallclock secs ( 0.03 usr  0.01 sys +  0.90 cusr  0.55 csys =  1.49 CPU)
Result: FAIL

root@pranithk-laptop - /home/pk1/workspace/gerrit-repo ((detached from v3.5.0))

Comment 21 Pranith Kumar K 2014-05-13 07:21:40 UTC
On master the test fails with EOF which is what happens normally. Which is the write behavior. So I will start debugging the issue now that I have a consistent reproducer.

Comment 22 Pranith Kumar K 2014-05-13 07:22:33 UTC
(In reply to Pranith Kumar K from comment #21)
> On master the test fails with EOF which is what happens normally. Which is
> the write behavior. So I will start debugging the issue now that I have a
> consistent reproducer.

s/write behavior/'read' behavior' of the open file.

Comment 23 Pranith Kumar K 2014-05-14 04:33:45 UTC
Paul,
    Could you try the test case on 3.5 by disabling self-heal-daemon?
Command to do that is:
gluster volume set <volname> cluster.self-heal-daemon off

Pranith

Comment 24 Pranith Kumar K 2014-05-14 07:08:47 UTC
Paul,
   Wait. Don't. I need to do some more testing.

Pranith

Comment 25 Paul 2014-06-10 22:33:44 UTC
Pranith, 

any news on how your testing/fixing is going? I hope you have not given up on this bug.

Comment 26 Pranith Kumar K 2014-06-11 02:32:36 UTC
Paul,
    Could you try the same test with self-heal-daemon set to off, please?

Pranith

Comment 27 Pranith Kumar K 2014-06-11 15:19:04 UTC
You need to use 3.5.0 for performing this test.
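
Putting comments 23, 26 and 27 together, the requested test is roughly the following (a hedged sketch; the volume name "gtest" is assumed from the reporter's logs):

# on 3.5.0: disable the self-heal daemon so only client-side healing is in play
gluster volume set gtest cluster.self-heal-daemon off

# ... repeat steps 3-8 from the original description ...

# re-enable the self-heal daemon afterwards
gluster volume set gtest cluster.self-heal-daemon on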

Comment 28 Niels de Vos 2015-05-17 21:58:38 UTC
GlusterFS 3.7.0 has been released (http://www.gluster.org/pipermail/gluster-users/2015-May/021901.html), and the Gluster project maintains N-2 supported releases. The last two releases before 3.7 are still maintained, at the moment these are 3.6 and 3.5.

This bug has been filed against the 3.4 release, and will not get fixed in a 3.4 version any more. Please verify if newer versions are affected with the reported problem. If that is the case, update the bug with a note, and update the version if you can. In case updating the version is not possible, leave a comment in this bug report with the version you tested, and set the "Need additional information the selected bugs from" below the comment box to "bugs".

If there is no response by the end of the month, this bug will get automatically closed.

Comment 29 Kaleb KEITHLEY 2015-10-07 13:59:36 UTC
GlusterFS 3.4.x has reached end-of-life.

If this bug still exists in a later release please reopen this and change the version or open a new bug.