Bug 1615224

Summary: Fix spurious failures in tests/bugs/ec/ec-1468261.t
Product: [Community] GlusterFS
Reporter: Ashish Pandey <aspandey>
Component: disperse
Assignee: Ashish Pandey <aspandey>
Status: CLOSED UPSTREAM
QA Contact:
Severity: high
Docs Contact:
Priority: high
Version: mainline
CC: bugs
Target Milestone: ---
Flags: aspandey: needinfo-
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-03-12 12:31:19 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Ashish Pandey 2018-08-13 05:13:01 UTC
Description of problem:

Failure reported:

23:03:05 ok 34, LINENUM:79
23:03:05 not ok 35 Got "5" instead of "6", LINENUM:80
23:03:05 FAILED COMMAND: 6 ec_child_up_count patchy 0
23:03:05 not ok 36 Got "1298" instead of "^0$", LINENUM:83
23:03:05 FAILED COMMAND: ^0$ get_pending_heal_count patchy
23:03:05 ok 37, LINENUM:86
23:03:05 ok 38, LINENUM:87
23:03:05 not ok 39 Got "3" instead of "4", LINENUM:88
---------------------------- 
When I look at the glustershd log, I can see that there is an issue while starting the volume by force to restart the killed bricks. 
The bricks are not getting connected. 
I am seeing the following logs in glustershd: 
================================== 
[2018-08-06 23:05:45.077699] I [MSGID: 101016] [glusterfs3.h:739:dict_to_xdr] 0-dict: key 'trusted.ec.size' is would not be sent on wire in future [Invalid argument] 
[2018-08-06 23:05:45.077724] I [MSGID: 101016] [glusterfs3.h:739:dict_to_xdr] 0-dict: key 'trusted.ec.dirty' is would not be sent on wire in future [Invalid argument] 
[2018-08-06 23:05:45.077744] I [MSGID: 101016] [glusterfs3.h:739:dict_to_xdr] 0-dict: key 'trusted.ec.version' is would not be sent on wire in future [Invalid argument] 
[2018-08-06 23:05:46.695719] I [rpc-clnt.c:2087:rpc_clnt_reconfig] 0-patchy-client-1: changing port to 49152 (from 0) 
[2018-08-06 23:05:46.699766] W [MSGID: 114043] [client-handshake.c:1061:client_setvolume_cbk] 0-patchy-client-1: failed to set the volume [Resource temporarily unavailable] 
[2018-08-06 23:05:46.699809] W [MSGID: 114007] [client-handshake.c:1090:client_setvolume_cbk] 0-patchy-client-1: failed to get 'process-uuid' from reply dict [Invalid argument] 
[2018-08-06 23:05:46.699833] E [MSGID: 114044] [client-handshake.c:1096:client_setvolume_cbk] 0-patchy-client-1: SETVOLUME on remote-host failed: cleanup flag is set for xlator.  Try again later [Resource temporarily unavailable] 
[2018-08-06 23:05:46.699855] I [MSGID: 114051] [client-handshake.c:1201:client_setvolume_cbk] 0-patchy-client-1: sending CHILD_CONNECTING event 
[2018-08-06 23:05:46.699920] I [MSGID: 114018] [client.c:2255:client_rpc_notify] 0-patchy-client-1: disconnected from patchy-client-1. Client process will keep trying to connect to glusterd until brick's port is available 
[2018-08-06 23:05:50.702806] I [rpc-clnt.c:2087:rpc_clnt_reconfig] 0-patchy-client-1: changing port to 49152 (from 0) 
[2018-08-06 23:05:50.706726] W [MSGID: 114043] [client-handshake.c:1061:client_setvolume_cbk] 0-patchy-client-1: failed to set the volume [Resource temporarily unavailable] 
[2018-08-06 23:05:50.706783] W [MSGID: 114007] [client-handshake.c:1090:client_setvolume_cbk] 0-patchy-client-1: failed to get 'process-uuid' from reply dict [Invalid argument] 
[2018-08-06 23:05:50.706808] E [MSGID: 114044] [client-handshake.c:1096:client_setvolume_cbk] 0-patchy-client-1: SETVOLUME on remote-host failed: cleanup flag is set for xlator.  Try again later [Resource temporarily unavailable] 
[2018-08-06 23:05:50.706831] I [MSGID: 114051] [client-handshake.c:1201:client_setvolume_cbk] 0-patchy-client-1: sending CHILD_CONNECTING event 
[2018-08-06 23:05:50.706904] I [MSGID: 114018] [client.c:2255:client_rpc_notify] 0-patchy-client-1: disconnected from patchy-client-1. Client process will keep trying to connect to glusterd until brick's port is available 
[2018-08-06 23:05:54.713490] I [rpc-clnt.c:2087:rpc_clnt_reconfig] 0-patchy-client-1: changing port to 49152 (from 0) 
[2018-08-06 23:05:54.717417] W [MSGID: 114043] [client-handshake.c:1061:client_setvolume_cbk] 0-patchy-client-1: failed to set the volume [Resource temporarily unavailable] 
[2018-08-06 23:05:54.717483] W [MSGID: 114007] [client-handshake.c:1090:client_setvolume_cbk] 0-patchy-client-1: failed to get 'process-uuid' from reply dict [Invalid argument] 
[2018-08-06 23:05:54.717508] E [MSGID: 114044] [client-handshake.c:1096:client_setvolume_cbk] 0-patchy-client-1: SETVOLUME on remote-host failed: cleanup flag is set for xlator.  Try again later [Resource temporarily unavailable] 
[2018-08-06 23:05:54.717530] I [MSGID: 114051] [client-handshake.c:1201:client_setvolume_cbk] 0-patchy-client-1: sending CHILD_CONNECTING event 
[2018-08-06 23:05:54.717605] I [MSGID: 114018] [client.c:2255:client_rpc_notify] 0-patchy-client-1: disconnected from patchy-client-1. Client process will keep trying to connect to glusterd until brick's port is available 
[2018-08-06 23:05:58.204494]:++++++++++ G_LOG:./tests/basic/ec/ec-1468261.t: TEST: 83 ^0$ get_pending_heal_count patchy ++++++++++ 
There are many more such logs during this interval. 
============================ 
Time at which the test at line no 80 started: 
[2018-08-06 23:05:38.652297]:++++++++++ G_LOG:./tests/basic/ec/ec-1468261.t: TEST: 80 6 ec_child_up_count patchy 0 ++++++++++ 
Time at which the next test, which checks the pending heal count, started: 
[2018-08-06 23:05:58.204494]:++++++++++ G_LOG:./tests/basic/ec/ec-1468261.t: TEST: 83 ^0$ get_pending_heal_count patchy ++++++++++ 
The time gap is 20 seconds, which is exactly how long we wait in our tests for all the connections to come up. 
78 #Bring up the down bricks 
79 TEST $CLI volume start $V0 force 
80 EXPECT_WITHIN $CHILD_UP_TIMEOUT "6" ec_child_up_count $V0 0 
81 
82 #Wait for heal to complete 
83 EXPECT_WITHIN $HEAL_TIMEOUT "^0$" get_pending_heal_count $V0 
In those 20 seconds the bricks did not come up, which is why the test at line no 80 failed. And because the bricks did not come up, heal did not happen, which led to the failure of the test at line no 83. 
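
For reference, EXPECT_WITHIN re-runs its command until the output matches the expected value or the timeout expires; once the timeout (20 seconds here) runs out without all six children coming up, the step is marked "not ok" and the script moves straight on to the heal-count check. A minimal sketch of that polling behaviour, using only the helpers shown above (illustrative only, not the test framework's actual implementation):

# Rough equivalent of: EXPECT_WITHIN $CHILD_UP_TIMEOUT "6" ec_child_up_count $V0 0
remaining=$CHILD_UP_TIMEOUT              # 20 seconds in this run
while [ "$remaining" -gt 0 ]; do
    up=$(ec_child_up_count $V0 0)        # how many EC children the mount sees as up
    [ "$up" = "6" ] && break             # all bricks reconnected; the check passes
    sleep 1
    remaining=$((remaining - 1))
done
[ "$up" = "6" ] || echo "not ok: only $up of 6 children came up within $CHILD_UP_TIMEOUT seconds"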




Comment 1 Ashish Pandey 2018-08-13 05:14:23 UTC

While part of the problem should be fixed by https://review.gluster.org/#/c/glusterfs/+/20657/,
we should also add a check to make sure that all the brick connections have been established with shd before healing starts.
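
A minimal sketch of what such a test-side check could look like, assuming a hypothetical helper ec_child_up_count_shd that reports how many bricks the self-heal daemon itself sees as up (the helper name is an assumption for illustration, not the actual patch):

#Bring up the down bricks
TEST $CLI volume start $V0 force
EXPECT_WITHIN $CHILD_UP_TIMEOUT "6" ec_child_up_count $V0 0
# Hypothetical: also wait until shd, not just the mount, has connections to all six bricks
EXPECT_WITHIN $CHILD_UP_TIMEOUT "6" ec_child_up_count_shd $V0 0

#Wait for heal to complete
EXPECT_WITHIN $HEAL_TIMEOUT "^0$" get_pending_heal_count $V0

With such a wait in place, the heal-count check would only start once shd can actually reach every brick, closing the window described in the description above.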

Comment 2 Amar Tumballi 2019-06-14 10:23:12 UTC
The mentioned patch is merged. Should we be closing the issue?

Comment 3 Worker Ant 2020-03-12 12:31:19 UTC
This bug has been moved to https://github.com/gluster/glusterfs/issues/899 and will be tracked there from now on. Visit the GitHub issue URL for further details.