Bug 1181048 - lockless lookup causes disk to be kicked out
Summary: lockless lookup causes disk to be kicked out
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: GlusterFS
Classification: Community
Component: disperse
Version: 3.6.1
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Xavi Hernandez
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-01-12 09:45 UTC by lidi
Modified: 2015-12-01 16:45 UTC
CC List: 4 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2015-09-18 03:13:31 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description lidi 2015-01-12 09:45:50 UTC
Version-Release number of selected component (if applicable):
mainline and any version with disperse

How reproducible:


Steps to Reproduce:
1. create a disperse volume with 3 disks, mount it to /cluster2/test
2. mkdir -p /cluster2/test/a/b/c
3. on one console execute 'for ((;;));do dd if=/dev/zero of=/cluster2/test/a/b/c/aa bs=1M count=200;done'
4. on another execute 'ls /cluster2/test/a/b/c/aa' many times
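
For reference, the steps above can be run end to end with something like the following. This is only a sketch; the server name and brick paths are assumptions and need to be adapted to the real setup.

# Sketch of the reproduction; 'server' and the brick paths are illustrative.
gluster volume create test disperse 3 server:/bricks/test{1..3} force
gluster volume start test
mount -t glusterfs server:/test /cluster2/test
mkdir -p /cluster2/test/a/b/c
# console 1: rewrite the file in a loop
for ((;;)); do dd if=/dev/zero of=/cluster2/test/a/b/c/aa bs=1M count=200; done
# console 2: stat the same file repeatedly
for ((;;)); do ls /cluster2/test/a/b/c/aa; done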

Actual results:
one disk was kicked out



Additional info:
[2015-01-12 09:08:02.050709] W [ec-combine.c:801:ec_combine_check] 0-test-disperse-0: Mismatching xdata in answers of 'LOOKUP'
[2015-01-12 09:08:02.052371] W [ec-common.c:162:ec_check_status] 0-test-disperse-0: Operation failed on some subvolumes (up=7, mask=7, remaining=0, good=5, bad=2)
[2015-01-12 09:08:02.052479] I [ec-heal.c:546:ec_heal_init] 0-ec: Healing '/a/b/c/aa', gfid 658bbea7-c18d-48e5-9a54-21549bdb31eb
[2015-01-12 09:08:02.054985] W [ec-combine.c:76:ec_iatt_combine] 0-test-disperse-0: Failed to combine iatt (inode: 11120550027160400363-11120550027160400363, links: 1-1, uid: 0-0, gid: 0-0, rdev: 0-0, size: 104857600-0, mode: 100644-100644)
[2015-01-12 09:08:02.055086] N [ec-generic.c:819:ec_combine_lookup] 0-test-disperse-0: Mismatching iatt in answers of 'GF_FOP_LOOKUP'
[2015-01-12 09:08:32.741658] W [ec-combine.c:801:ec_combine_check] 0-test-disperse-0: Mismatching xdata in answers of 'LOOKUP'
[2015-01-12 09:08:32.742744] W [ec-combine.c:801:ec_combine_check] 0-test-disperse-0: Mismatching xdata in answers of 'LOOKUP'
[2015-01-12 09:08:32.742803] W [ec-common.c:162:ec_check_status] 0-test-disperse-0: Operation failed on some subvolumes (up=7, mask=7, remaining=0, good=6, bad=1)

Comment 1 Xavi Hernandez 2015-01-13 12:50:03 UTC
I've repeated your test and all bricks are healthy after several iterations. What error do you get?

Are you saying that some brick is bad based only on the log contents? If so, these logs are normal when a regular lookup is made while another operation that modifies the file is taking place. The root cause is that initial lookups are sent asynchronously to all bricks without taking any lock, so while the file is being modified the lookup can receive inconsistent data, which is reported as a warning. When self-heal then checks the file, it sees there is nothing wrong and does nothing.

Comment 2 lidi 2015-01-14 07:49:45 UTC
Yes, in this case the file can be healed. But if the disperse volume is already degraded, it will corrupt the file.

Comment 3 Xavi Hernandez 2015-01-14 08:26:02 UTC
In this case the volume has not been degraded at any moment. These warnings are caused only by an initial lookup. The apparent discrepancy triggers a self-heal. Self-heal has enough information to lock the inode and repeat the lookup. That lookup succeeds and self-heal finishes without doing anything else.

Any other operation on the file uses locks, so there shouldn't be any problem.

This is exactly what I've tested:

gluster volume create test disperse 3 server:/bricks/test{1..3} force
gluster volume start test
mount -t glusterfs server:/test /gluster/test
mkdir -p /gluster/test/a/b/c
while true; do dd if=/dev/zero of=/gluster/test/a/b/c/aa bs=1M count=200; done

on another session:

while true; do ls -l /gluster/test/a/b/c/aa; sleep 1; done

After some time running both commands without any error I've stopped them.

getfattr -m. -e hex -d /bricks/test{1..3}/a/b/c/aa
getfattr: Removing leading '/' from absolute path names
# file: bricks/test1/a/b/c/aa
trusted.ec.config=0x0000080301000200
trusted.ec.size=0x0000000000000000
trusted.ec.version=0x0000000000003208
trusted.gfid=0xe662cae312924039918520bf1be8a1ac

# file: bricks/test2/a/b/c/aa
trusted.ec.config=0x0000080301000200
trusted.ec.size=0x0000000000000000
trusted.ec.version=0x0000000000003208
trusted.gfid=0xe662cae312924039918520bf1be8a1ac

# file: bricks/test3/a/b/c/aa
trusted.ec.config=0x0000080301000200
trusted.ec.size=0x0000000000000000
trusted.ec.version=0x0000000000003208
trusted.gfid=0xe662cae312924039918520bf1be8a1ac

As you can see, the file on all three bricks is healthy and fully synchronized.

The logs report some warnings, but no real self-heal has been executed.

Are you seeing something different?

Comment 4 lidi 2015-01-14 08:29:01 UTC
Sorry, it seems the problem is not what I described. I will test it again and tell you the results later.

Comment 5 lidi 2015-01-14 09:14:05 UTC
I tested another case; this is the kind of scenario where I found the problem.

Steps to Reproduce:
1. create a disperse volume with 3 disks (3 virtual machines, each with 1 disk), mount it to /cluster2/test
2. on first console execute 'for ((;;));do dd if=/dev/zero of=/cluster2/test/aa bs=1M count=200;done'
3. on second execute 'for ((;;));do dd if=/dev/zero of=/cluster2/test/bb bs=1M count=200;done'
4. on third execute 'for ((;;));do dd of=/dev/null of=/cluster2/test/aa bs=1M count=200;done'
5. shutdown one virtual machine

Actual results:
after a while:

console 1 reports "dd: opening `/cluster2/test/aa': Input/output error"
console 2 reports "dd: opening `/cluster2/test/bb': Input/output error"

Additional info:
[2015-01-14 08:52:01.332216] W [ec-combine.c:801:ec_combine_check] 0-test-disperse-0: Mismatching xdata in answers of 'LOOKUP'
[2015-01-14 08:52:01.333679] W [ec-common.c:397:ec_child_select] 0-test-disperse-0: Executing operation with some subvolumes unavailable (4)
[2015-01-14 08:52:01.334035] W [ec-combine.c:801:ec_combine_check] 0-test-disperse-0: Mismatching xdata in answers of 'LOOKUP'
[2015-01-14 08:52:01.335555] W [ec-common.c:397:ec_child_select] 0-test-disperse-0: Executing operation with some subvolumes unavailable (4)
[2015-01-14 08:52:01.335921] W [ec-combine.c:801:ec_combine_check] 0-test-disperse-0: Mismatching xdata in answers of 'LOOKUP'
[2015-01-14 08:52:01.337420] W [ec-common.c:397:ec_child_select] 0-test-disperse-0: Executing operation with some subvolumes unavailable (4)
[2015-01-14 08:52:01.337771] W [ec-combine.c:801:ec_combine_check] 0-test-disperse-0: Mismatching xdata in answers of 'LOOKUP'
[2015-01-14 08:52:01.339272] W [ec-common.c:397:ec_child_select] 0-test-disperse-0: Executing operation with some subvolumes unavailable (4)
[2015-01-14 08:52:01.339624] W [ec-combine.c:801:ec_combine_check] 0-test-disperse-0: Mismatching xdata in answers of 'LOOKUP'
[2015-01-14 08:52:01.341095] W [ec-common.c:397:ec_child_select] 0-test-disperse-0: Executing operation with some subvolumes unavailable (4)
[2015-01-14 08:52:01.341429] W [ec-combine.c:801:ec_combine_check] 0-test-disperse-0: Mismatching xdata in answers of 'LOOKUP'
[2015-01-14 08:52:01.341481] W [fuse-bridge.c:780:fuse_attr_cbk] 0-digioceanfs-fuse: 61238: LOOKUP() / => -1 (Input/output error)
[2015-01-14 08:52:01.343026] W [ec-common.c:397:ec_child_select] 0-test-disperse-0: Executing operation with some subvolumes unavailable (4)
[2015-01-14 08:52:01.343354] W [ec-combine.c:801:ec_combine_check] 0-test-disperse-0: Mismatching xdata in answers of 'LOOKUP'
[2015-01-14 08:52:01.344831] W [ec-common.c:397:ec_child_select] 0-test-disperse-0: Executing operation with some subvolumes unavailable (4)
[2015-01-14 08:52:01.345172] W [ec-combine.c:801:ec_combine_check] 0-test-disperse-0: Mismatching xdata in answers of 'LOOKUP'
[2015-01-14 08:52:01.346700] W [ec-common.c:397:ec_child_select] 0-test-disperse-0: Executing operation with some subvolumes unavailable (4)
[2015-01-14 08:52:01.347030] W [ec-combine.c:801:ec_combine_check] 0-test-disperse-0: Mismatching xdata in answers of 'LOOKUP'
[2015-01-14 08:52:01.348520] W [ec-common.c:397:ec_child_select] 0-test-disperse-0: Executing operation with some subvolumes unavailable (4)
[2015-01-14 08:52:01.348852] W [ec-combine.c:801:ec_combine_check] 0-test-disperse-0: Mismatching xdata in answers of 'LOOKUP'
[2015-01-14 08:52:01.350356] W [ec-common.c:397:ec_child_select] 0-test-disperse-0: Executing operation with some subvolumes unavailable (4)
[2015-01-14 08:52:01.350696] W [ec-combine.c:801:ec_combine_check] 0-test-disperse-0: Mismatching xdata in answers of 'LOOKUP'
[2015-01-14 08:52:01.352201] W [ec-common.c:397:ec_child_select] 0-test-disperse-0: Executing operation with some subvolumes unavailable (4)
[2015-01-14 08:52:01.352534] W [ec-combine.c:801:ec_combine_check] 0-test-disperse-0: Mismatching xdata in answers of 'LOOKUP'
[2015-01-14 08:52:01.354057] W [ec-common.c:397:ec_child_select] 0-test-disperse-0: Executing operation with some subvolumes unavailable (4)
[2015-01-14 08:52:01.354400] W [ec-combine.c:801:ec_combine_check] 0-test-disperse-0: Mismatching xdata in answers of 'LOOKUP'
[2015-01-14 08:52:01.355878] W [ec-common.c:397:ec_child_select] 0-test-disperse-0: Executing operation with some subvolumes unavailable (4)
[2015-01-14 08:52:01.356224] W [ec-combine.c:801:ec_combine_check] 0-test-disperse-0: Mismatching xdata in answers of 'LOOKUP'
[2015-01-14 08:52:01.357752] W [ec-common.c:397:ec_child_select] 0-test-disperse-0: Executing operation with some subvolumes unavailable (4)
[2015-01-14 08:52:01.358085] W [ec-combine.c:801:ec_combine_check] 0-test-disperse-0: Mismatching xdata in answers of 'LOOKUP'
[2015-01-14 08:52:01.359564] W [ec-common.c:397:ec_child_select] 0-test-disperse-0: Executing operation with some subvolumes unavailable (4)
[2015-01-14 08:52:01.359898] W [ec-combine.c:801:ec_combine_check] 0-test-disperse-0: Mismatching xdata in answers of 'LOOKUP'
[2015-01-14 08:52:01.361530] W [ec-common.c:397:ec_child_select] 0-test-disperse-0: Executing operation with some subvolumes unavailable (4)
[2015-01-14 08:52:01.361859] W [ec-combine.c:801:ec_combine_check] 0-test-disperse-0: Mismatching xdata in answers of 'LOOKUP'
[2015-01-14 08:52:01.363362] W [ec-common.c:397:ec_child_select] 0-test-disperse-0: Executing operation with some subvolumes unavailable (4)

Comment 6 lidi 2015-01-14 09:18:27 UTC
The xattrs of the root directory are different.


[root@node-1 ~]# getfattr -m. -de hex /sdb
getfattr: Removing leading '/' from absolute path names
# file: sdb
trusted.digioceanfs.dht=0x000000010000000000000000ffffffff
trusted.digioceanfs.volume-id=0xecf40c3853534203b308a1bd04720214
trusted.ec.version=0x0000000000000004
trusted.gfid=0x00000000000000000000000000000001


[root@node-2 ~]# getfattr -m. -de hex /sdb
getfattr: Removing leading '/' from absolute path names
# file: sdb
trusted.digioceanfs.volume-id=0xecf40c3853534203b308a1bd04720214
trusted.ec.version=0x0000000000000002
trusted.gfid=0x00000000000000000000000000000001

Comment 7 Xavi Hernandez 2015-01-14 09:20:45 UTC
What exactly is the command executed on the third server? What you have written makes no sense and doesn't work. Is it a read or a write?

I also assume that you are stopping the third node, right?

Comment 8 lidi 2015-01-14 09:24:31 UTC
for ((;;));do dd of=/dev/null if=/cluster2/test/aa bs=1M count=200;done

It's a read.

Comment 9 lidi 2015-01-14 09:25:52 UTC
Yes, I executed the three commands on node-1 and shut down node-3.

Comment 10 Xavi Hernandez 2015-01-14 09:53:03 UTC
I've repeated the test with the only difference that all three bricks are on the same server and I killed one of the glusterfsd processes.

After killing the brick everything continues working normally.

I'm using the latest master. Can you test this with the latest master and check whether there is really a difference between having the bricks on the same node or not?

Comment 11 lidi 2015-01-14 09:57:54 UTC
ok, I'll do that.

Comment 12 lidi 2015-01-15 03:49:06 UTC
When I used the latest master for this test on one virtual machine, I found another phenomenon.

After killing one brick, consoles 1 and 3 work well. Console 2 seems to be deadlocked and reports nothing.

[root@node-1 ~]# getfattr -m. -de hex /sd{b,c,d}/bb
getfattr: Removing leading '/' from absolute path names
# file: sdb/bb
trusted.ec.config=0x0000080301000200
trusted.ec.size=0x0000000000000000
trusted.ec.version=0x0000000000000c82
trusted.gfid=0x5db7082224804e00816bc03309449577

# file: sdc/bb
trusted.ec.config=0x0000080301000200
trusted.ec.size=0x0000000000000000
trusted.ec.version=0x0000000000000c82
trusted.gfid=0x5db7082224804e00816bc03309449577

# file: sdd/bb
trusted.ec.config=0x0000080301000200
trusted.ec.size=0x000000000c800000
trusted.ec.version=0x0000000000000c81
trusted.gfid=0x5db7082224804e00816bc03309449577



[2015-01-15 03:19:13.662595] W [socket.c:611:__socket_rwv] 0-test-client-2: readv on 10.10.21.20:49154 failed (Connection reset by peer)
[2015-01-15 03:19:13.663052] E [rpc-clnt.c:362:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1ab)[0x7f75d023c4db] (--> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1cb)[0x7f75d001237b] (--> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f75d001248e] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xab)[0x7f75d001255b] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x15f)[0x7f75d0012b1f] ))))) 0-test-client-2: forced unwinding frame type(GlusterFS 3.3) op(FTRUNCATE(24)) called at 2015-01-15 03:18:48.958574 (xid=0x508d)
[2015-01-15 03:19:13.663096] W [client-rpc-fops.c:1421:client3_3_ftruncate_cbk] 0-test-client-2: remote operation failed: Transport endpoint is not connected
[2015-01-15 03:19:13.663120] W [ec-common.c:162:ec_check_status] 0-test-disperse-0: Operation failed on some subvolumes (up=7, mask=7, remaining=0, good=3, bad=4)
[2015-01-15 03:19:13.663187] I [ec-heal.c:546:ec_heal_init] 0-ec: Healing '/aa', gfid 269a9bc0-7a85-404f-a126-af878520847b
[2015-01-15 03:19:13.666965] I [socket.c:3301:socket_submit_request] 0-test-client-2: not connected (priv->connected = 0)
[2015-01-15 03:19:13.667001] W [rpc-clnt.c:1565:rpc_clnt_submit] 0-test-client-2: failed to submit rpc-request (XID: 0x5296 Program: GlusterFS 3.3, ProgVers: 330, Proc: 31) to rpc-transport (test-client-2)
[2015-01-15 03:19:13.667029] E [client-rpc-fops.c:1630:client3_3_entrylk_cbk] 0-test-client-2: remote operation failed: Transport endpoint is not connected
[2015-01-15 03:19:13.667131] W [ec-common.c:397:ec_child_select] 0-test-disperse-0: Executing operation with some subvolumes unavailable (4)
[2015-01-15 03:19:13.667451] E [rpc-clnt.c:362:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1ab)[0x7f75d023c4db] (--> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1cb)[0x7f75d001237b] (--> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f75d001248e] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xab)[0x7f75d001255b] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x15f)[0x7f75d0012b1f] ))))) 0-test-client-2: forced unwinding frame type(GlusterFS 3.3) op(FTRUNCATE(24)) called at 2015-01-15 03:18:49.325357 (xid=0x5294)
[2015-01-15 03:19:13.667508] W [client-rpc-fops.c:1421:client3_3_ftruncate_cbk] 0-test-client-2: remote operation failed: Transport endpoint is not connected
[2015-01-15 03:19:13.667544] W [ec-common.c:162:ec_check_status] 0-test-disperse-0: Operation failed on some subvolumes (up=7, mask=7, remaining=0, good=3, bad=4)
[2015-01-15 03:19:13.667583] I [ec-heal.c:546:ec_heal_init] 0-ec: Healing '/bb', gfid 5db70822-2480-4e00-816b-c03309449577
[2015-01-15 03:19:13.667679] W [rpc-clnt.c:1565:rpc_clnt_submit] 0-test-client-2: failed to submit rpc-request (XID: 0x5297 Program: GlusterFS 3.3, ProgVers: 330, Proc: 31) to rpc-transport (test-client-2)
[2015-01-15 03:19:13.667717] E [client-rpc-fops.c:1630:client3_3_entrylk_cbk] 0-test-client-2: remote operation failed: Transport endpoint is not connected
[2015-01-15 03:19:13.667979] W [rpc-clnt.c:1565:rpc_clnt_submit] 0-test-client-2: failed to submit rpc-request (XID: 0x5298 Program: GlusterFS 3.3, ProgVers: 330, Proc: 33) to rpc-transport (test-client-2)
[2015-01-15 03:19:13.668031] W [client-rpc-fops.c:1743:client3_3_xattrop_cbk] 0-test-client-2: remote operation failed: Success. Path: /bb (5db70822-2480-4e00-816b-c03309449577)
[2015-01-15 03:19:13.668098] W [rpc-clnt.c:1565:rpc_clnt_submit] 0-test-client-2: failed to submit rpc-request (XID: 0x5299 Program: GlusterFS 3.3, ProgVers: 330, Proc: 29) to rpc-transport (test-client-2)
[2015-01-15 03:19:13.668135] E [client-rpc-fops.c:1526:client3_3_inodelk_cbk] 0-test-client-2: remote operation failed: Transport endpoint is not connected
[2015-01-15 03:19:13.668373] E [rpc-clnt.c:362:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1ab)[0x7f75d023c4db] (--> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1cb)[0x7f75d001237b] (--> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f75d001248e] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xab)[0x7f75d001255b] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x15f)[0x7f75d0012b1f] ))))) 0-test-client-2: forced unwinding frame type(GF-DUMP) op(NULL(2)) called at 2015-01-15 03:18:59.744160 (xid=0x5295)
[2015-01-15 03:19:13.668421] W [rpc-clnt-ping.c:154:rpc_clnt_ping_cbk] 0-test-client-2: socket disconnected
[2015-01-15 03:19:13.668452] I [client.c:2215:client_rpc_notify] 0-test-client-2: disconnected from test-client-2. Client process will keep trying to connect to glusterd until brick's port is available
[2015-01-15 03:19:13.668663] W [ec-common.c:397:ec_child_select] 0-test-disperse-0: Executing operation with some subvolumes unavailable (4)

Comment 13 lidi 2015-01-15 07:02:05 UTC
I tested this several times. I can't reproduce it every time; sometimes everything goes well.

Comment 14 Xavi Hernandez 2015-01-15 16:33:32 UTC
I've tested it several times and I haven't been able to reproduce the problem.

Do you rebuild the volume before each test, or do you reuse the old one?
If you reuse the old one, do you stop everything and then restart it, or do you simply bring the killed brick back online while the 'dd' loops are running?

Please give me as much information as you can about the precise details of your test.

Comment 15 lidi 2015-01-16 02:17:28 UTC
I never reuse the old one. I wrote a script that automatically rebuilds the volume between tests. This is really hard to reproduce; I tried dozens of times and only reproduced it twice. It may involve some race.

I test it in a virtual environment, which may make the problem easier to reproduce.
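
A minimal sketch of such a rebuild script, assuming three nodes node-{1..3}, each with a brick at /sdb, and the volume named 'test' (these names are illustrative):

# Sketch: tear down and recreate the disperse volume between test runs.
umount /cluster2/test 2>/dev/null
gluster --mode=script volume stop test force
gluster --mode=script volume delete test
# clean the brick directories on each node (ssh access assumed)
for n in node-1 node-2 node-3; do
    ssh $n 'rm -rf /sdb/* /sdb/.glusterfs; setfattr -x trusted.glusterfs.volume-id /sdb 2>/dev/null'
done
gluster volume create test disperse 3 node-{1,2,3}:/sdb force
gluster volume start test
mount -t glusterfs node-1:/test /cluster2/test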

Comment 16 Xavi Hernandez 2015-01-16 11:29:35 UTC
I'll do more tests.

Comment 17 lidi 2015-01-20 06:22:18 UTC
Both of the test scenarios above are hard to reproduce. I'm using another test case that reproduces an error similar to the one in comment 5.

I'm using the latest master for this test.

1. create a disperse volume with 3 disks (all disks in the same virtual machine), mount it to /cluster2/test
2. from console 1 execute 'for ((;;));do dd if=/dev/zero of=/cluster2/test/aa bs=1M count=20;done'
3. kill one brick (see the sketch after this list)
4. from console 2 execute 'ls /cluster2/test' many times
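
Step 3 ("kill one brick") can be done, for example, by looking up the brick's glusterfsd PID. A sketch, assuming the volume is named 'test' and the third brick path contains 'test3':

gluster volume status test                 # lists the PID of each brick's glusterfsd
kill -9 $(pgrep -f 'glusterfsd.*test3')    # kill the process serving one brick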

Actual results:
Sometimes console 2 reports the following.
 
[root@node-1 ~]# ls /cluster2/test/
ls: cannot access /cluster2/test/aa: Input/output error
aa

Additional info:
[2015-01-20 05:41:48.881650] W [ec-common.c:397:ec_child_select] 0-test-disperse-0: Executing operation with some subvolumes unavailable (4)
[2015-01-20 05:41:48.882630] W [ec-common.c:397:ec_child_select] 0-test-disperse-0: Executing operation with some subvolumes unavailable (4)
[2015-01-20 05:41:48.883332] W [ec-combine.c:801:ec_combine_check] 0-test-disperse-0: Mismatching xdata in answers of 'LOOKUP'
[2015-01-20 05:41:48.883404] I [dht-layout.c:682:dht_layout_normalize] 0-test-dht: Found anomalies in /aa (gfid = a9463c35-e6c2-4a54-9dfc-4c7d68c78096). Holes=1 overlaps=0
[2015-01-20 05:41:48.883517] W [fuse-resolve.c:147:fuse_resolve_gfid_cbk] 0-fuse: a9463c35-e6c2-4a54-9dfc-4c7d68c78096: failed to resolve (Input/output error)
[2015-01-20 05:41:48.883547] E [fuse-bridge.c:809:fuse_getattr_resume] 0-glusterfs-fuse: 30887: GETATTR 15520940 (a9463c35-e6c2-4a54-9dfc-4c7d68c78096) resolution failed
[2015-01-20 05:41:49.212596] W [ec-common.c:397:ec_child_select] 0-test-disperse-0: Executing operation with some subvolumes unavailable (4)
[2015-01-20 05:41:49.217569] W [ec-common.c:397:ec_child_select] 0-test-disperse-0: Executing operation with some subvolumes unavailable (4)
[2015-01-20 05:41:49.624787] W [ec-common.c:397:ec_child_select] 0-test-disperse-0: Executing operation with some subvolumes unavailable (4)
[2015-01-20 05:41:50.491950] W [ec-common.c:397:ec_child_select] 0-test-disperse-0: Executing operation with some subvolumes unavailable (4)


Assuming the lookup were executed as part of some other operation, like writev or readv, I suppose it would cause that operation to fail.

Comment 18 lidi 2015-01-20 06:29:53 UTC
This case is very easy to reproduce.

Comment 19 Xavi Hernandez 2015-01-20 08:41:35 UTC
(In reply to lidi from comment #13)
> I tested this several times. I can't reproduce it every time; sometimes
> everything goes well.

I've tried your test all night. After more than 1000 repetitions, I haven't detected a single failure.

I'll try your last scenario.

Comment 20 Xavi Hernandez 2015-01-21 09:22:14 UTC
After many tests I haven't seen any Input/Output error; however, I've detected one 'ls' getting stuck. I'll try to determine the cause and see if it could be the same problem you are seeing.

Comment 21 lidi 2015-01-23 06:33:53 UTC
I have a hypothesis, described below.

1. There's a disperse volume with 3 disks (A, B and C).

2. On the first session a write is executing; all the data has been written, so the next step is to update size and version.

3. At this time, a lockless lookup is executed on the second session, and its return value indicates that disk B is broken, so disk B is marked as bad in the inode or fd.

4. Before self-heal runs, disk C is killed or disconnected, so nothing gets healed.

5. Then the size/version update is dispatched to disks A and C (ec->xl_up has not been updated yet), so in the end only disk A is updated.

I just don't know how to verify this. Do you think it could happen?
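
One rough way to check for this after the fact would be to compare the EC metadata of the file on every brick, as in comment 12. A sketch, assuming the bricks are mounted at /sdb, /sdc and /sdd and the file is 'aa':

# Dump trusted.ec.size / trusted.ec.version per brick; if the hypothesis holds,
# one brick should diverge from the other two after the failure.
for b in /sdb /sdc /sdd; do
    echo "== $b =="
    getfattr --absolute-names -n trusted.ec.size -e hex $b/aa
    getfattr --absolute-names -n trusted.ec.version -e hex $b/aa
done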

Comment 22 Xavi Hernandez 2015-01-23 09:41:27 UTC
At point 3 it's possible that a lockless lookup returns inconsistent data; however, in that case the inode and fd are not updated (because the inconsistency might be a false positive). This means that step 5 would still send the update to all three bricks.

The scenario you propose would make all sessions fail: writes from session 1 and ls from session 2 would both fail, and that is not what is happening.

It could happen, though I haven't been able to reproduce it yet, that a lockless lookup made by the second session, while a write is being processed by the first session and brick C is down, returns inconsistent data:

1. Brick C is down
2. A write is initiated by session 1 to bricks A and B
3. Brick A completes the write, but brick B is still pending
4. A lockless lookup is initiated by session 2 to bricks A and B
5. Brick A returns file information with size Sa
6. Brick B returns file information with size Sb (Sb < Sa)
7. No data from brick C

In a 3:1 configuration (3 bricks with redundancy 1) at least 2 consistent answers are required to return the data. In this case we don't have quorum and an EIO can be returned. However, the next ls would probably succeed; only an occasional EIO would be seen.
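
Just to illustrate the counting rule (a rough sketch; it inspects on-disk xattrs rather than the in-flight lookup answers, and the brick paths are assumptions):

# With 3 bricks and redundancy 1, at least 2 answers must match to have quorum.
getfattr --absolute-names -n trusted.ec.size -e hex /bricks/test{1..3}/aa 2>/dev/null \
  | awk -F= '/^trusted.ec.size/ { n[$2]++ }
             END { best = 0; for (v in n) if (n[v] > best) best = n[v];
                   print (best >= 2 ? "quorum met" : "no quorum, EIO possible") }'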

Is this what you are seeing?

Comment 23 lidi 2015-01-23 09:50:54 UTC
Yes, I understand the scenario in comment 17. I was just supposing that the scenario in comment 5 is why the xattrs of the root directory are different.

Comment 24 Xavi Hernandez 2015-01-26 08:19:47 UTC
(In reply to lidi from comment #23)
> Yes, I understand the scenario in comment 17. I was just supposing that the
> scenario in comment 5 is why the xattrs of the root directory are different.

No. The possible inconsistency in the lockless lookup cannot explain the discrepancy in the xattrs of the root directory. All operations that modify data are made under a lock.

Comment 25 lidi 2015-01-26 10:43:15 UTC
I found that ec_open is also lockless, and it updates the fd. Could this cause problems?

Comment 26 Xavi Hernandez 2015-01-26 10:58:29 UTC
(In reply to lidi from comment #25)
> I found that ec_open is also lockless, and it updates the fd. Could this
> cause problems?

open does not modify anything. At worst it could cause future writes to the file not to be sent to some bricks (which is fine if the file cannot be opened on some brick), but it shouldn't be related to the discrepancy in directory versions.

Comment 27 lidi 2015-01-26 10:59:30 UTC
ec_inodelk and ec_entrylk are also lockless, and they may update the inode.

Comment 28 lidi 2015-01-26 11:03:20 UTC
In the case of comment 21, just replace the lookup with an open (or inodelk or entrylk). Could this case cause problems?

Comment 29 Xavi Hernandez 2015-01-26 11:52:32 UTC
(In reply to lidi from comment #27)
> ec_inodelk and ec_entrylk are also lockless, and they may update the inode.

Obviously inodelk and entrylk are lockless because these are the functions that take the lock.

I don't see how open, inodelk or entrylk could cause any problem.

Comment 30 Pranith Kumar K 2015-08-05 00:39:08 UTC
lidi,
      Could you let us know if you are still seeing this issue on the 3.7.3 release of Gluster?

Pranith

Comment 31 Pranith Kumar K 2015-09-18 03:13:31 UTC
Lidi,
   We are not able to re-create this with the given steps in the latest releases. We are not seeing these problems even in our labs where heavy tests are performed. For now I am inclined to close this. Feel free to re-open if you are seeing this with some other steps in the latest releases.

Thanks a lot for your time for making ec better!
Pranith

