Bug 1615224 - Fix spurious failures in tests/bugs/ec/ec-1468261.t [NEEDINFO]
Summary: Fix spurious failures in tests/bugs/ec/ec-1468261.t
Keywords:
Status: NEW
Alias: None
Product: GlusterFS
Classification: Community
Component: disperse
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Ashish Pandey
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-08-13 05:13 UTC by Ashish Pandey
Modified: 2019-08-31 17:59 UTC
CC List: 1 user

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
atumball: needinfo? (aspandey)



Description Ashish Pandey 2018-08-13 05:13:01 UTC
Description of problem:

Failure reported  - 

23:03:05 ok 34, LINENUM:79
23:03:05 not ok 35 Got "5" instead of "6", LINENUM:80
23:03:05 FAILED COMMAND: 6 ec_child_up_count patchy 0
23:03:05 not ok 36 Got "1298" instead of "^0$", LINENUM:83
23:03:05 FAILED COMMAND: ^0$ get_pending_heal_count patchy
23:03:05 ok 37, LINENUM:86
23:03:05 ok 38, LINENUM:87
23:03:05 not ok 39 Got "3" instead of "4", LINENUM:88
---------------------------- 
When I look at the glustershd log, I can see that there is an issue while force-starting the volume to restart the killed bricks.
The bricks are not getting connected.
I am seeing the following logs in glustershd:
================================== 
[2018-08-06 23:05:45.077699] I [MSGID: 101016] [glusterfs3.h:739:dict_to_xdr] 0-dict: key 'trusted.ec.size' is would not be sent on wire in future [Invalid argument] 
[2018-08-06 23:05:45.077724] I [MSGID: 101016] [glusterfs3.h:739:dict_to_xdr] 0-dict: key 'trusted.ec.dirty' is would not be sent on wire in future [Invalid argument] 
[2018-08-06 23:05:45.077744] I [MSGID: 101016] [glusterfs3.h:739:dict_to_xdr] 0-dict: key 'trusted.ec.version' is would not be sent on wire in future [Invalid argument] 
[2018-08-06 23:05:46.695719] I [rpc-clnt.c:2087:rpc_clnt_reconfig] 0-patchy-client-1: changing port to 49152 (from 0) 
[2018-08-06 23:05:46.699766] W [MSGID: 114043] [client-handshake.c:1061:client_setvolume_cbk] 0-patchy-client-1: failed to set the volume [Resource temporarily unavailable] 
[2018-08-06 23:05:46.699809] W [MSGID: 114007] [client-handshake.c:1090:client_setvolume_cbk] 0-patchy-client-1: failed to get 'process-uuid' from reply dict [Invalid argument] 
[2018-08-06 23:05:46.699833] E [MSGID: 114044] [client-handshake.c:1096:client_setvolume_cbk] 0-patchy-client-1: SETVOLUME on remote-host failed: cleanup flag is set for xlator.  Try again later [Resource temporarily unavailable] 
[2018-08-06 23:05:46.699855] I [MSGID: 114051] [client-handshake.c:1201:client_setvolume_cbk] 0-patchy-client-1: sending CHILD_CONNECTING event 
[2018-08-06 23:05:46.699920] I [MSGID: 114018] [client.c:2255:client_rpc_notify] 0-patchy-client-1: disconnected from patchy-client-1. Client process will keep trying to connect to glusterd until brick's port is available 
[2018-08-06 23:05:50.702806] I [rpc-clnt.c:2087:rpc_clnt_reconfig] 0-patchy-client-1: changing port to 49152 (from 0) 
[2018-08-06 23:05:50.706726] W [MSGID: 114043] [client-handshake.c:1061:client_setvolume_cbk] 0-patchy-client-1: failed to set the volume [Resource temporarily unavailable] 
[2018-08-06 23:05:50.706783] W [MSGID: 114007] [client-handshake.c:1090:client_setvolume_cbk] 0-patchy-client-1: failed to get 'process-uuid' from reply dict [Invalid argument] 
[2018-08-06 23:05:50.706808] E [MSGID: 114044] [client-handshake.c:1096:client_setvolume_cbk] 0-patchy-client-1: SETVOLUME on remote-host failed: cleanup flag is set for xlator.  Try again later [Resource temporarily unavailable] 
[2018-08-06 23:05:50.706831] I [MSGID: 114051] [client-handshake.c:1201:client_setvolume_cbk] 0-patchy-client-1: sending CHILD_CONNECTING event 
[2018-08-06 23:05:50.706904] I [MSGID: 114018] [client.c:2255:client_rpc_notify] 0-patchy-client-1: disconnected from patchy-client-1. Client process will keep trying to connect to glusterd until brick's port is available 
[2018-08-06 23:05:54.713490] I [rpc-clnt.c:2087:rpc_clnt_reconfig] 0-patchy-client-1: changing port to 49152 (from 0) 
[2018-08-06 23:05:54.717417] W [MSGID: 114043] [client-handshake.c:1061:client_setvolume_cbk] 0-patchy-client-1: failed to set the volume [Resource temporarily unavailable] 
[2018-08-06 23:05:54.717483] W [MSGID: 114007] [client-handshake.c:1090:client_setvolume_cbk] 0-patchy-client-1: failed to get 'process-uuid' from reply dict [Invalid argument] 
[2018-08-06 23:05:54.717508] E [MSGID: 114044] [client-handshake.c:1096:client_setvolume_cbk] 0-patchy-client-1: SETVOLUME on remote-host failed: cleanup flag is set for xlator.  Try again later [Resource temporarily unavailable] 
[2018-08-06 23:05:54.717530] I [MSGID: 114051] [client-handshake.c:1201:client_setvolume_cbk] 0-patchy-client-1: sending CHILD_CONNECTING event 
[2018-08-06 23:05:54.717605] I [MSGID: 114018] [client.c:2255:client_rpc_notify] 0-patchy-client-1: disconnected from patchy-client-1. Client process will keep trying to connect to glusterd until brick's port is available 
[2018-08-06 23:05:58.204494]:++++++++++ G_LOG:./tests/basic/ec/ec-1468261.t: TEST: 83 ^0$ get_pending_heal_count patchy ++++++++++ 
There are many more such logs during this period.
============================ 
Time at which the test at line 80 started:
[2018-08-06 23:05:38.652297]:++++++++++ G_LOG:./tests/basic/ec/ec-1468261.t: TEST: 80 6 ec_child_up_count patchy 0 ++++++++++ 
Time at which the next test (get pending heal count) started:
[2018-08-06 23:05:58.204494]:++++++++++ G_LOG:./tests/basic/ec/ec-1468261.t: TEST: 83 ^0$ get_pending_heal_count patchy ++++++++++ 
The time gap is 20 seconds, which is exactly how long we wait for all the connections to come up in our tests.
78 #Bring up the down bricks 
79 TEST $CLI volume start $V0 force 
80 EXPECT_WITHIN $CHILD_UP_TIMEOUT "6" ec_child_up_count $V0 0 
81 
82 #Wait for heal to complete 
83 EXPECT_WITHIN $HEAL_TIMEOUT "^0$" get_pending_heal_count $V0 
In these 20 seconds, the bricks did not come up; that is why the test on line 80 failed. Since the bricks did not come up, heal did not happen, and that led to the failure of the test on line 83.
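For context, EXPECT_WITHIN is essentially a poll-until-timeout loop: it keeps re-running the probe command until its output matches the expected pattern or the timeout expires. A minimal sketch of that behavior (an illustration with a hypothetical expect_within helper, not the actual implementation from tests/include.rc):

```shell
#!/bin/sh
# Minimal sketch of an EXPECT_WITHIN-style poll loop. Illustration only;
# not the real helper from tests/include.rc.
# Usage: expect_within <timeout-seconds> <regex> <command> [args...]
# Re-runs the command once per second until its output matches the regex
# or the timeout elapses; returns non-zero on timeout.
expect_within() {
    timeout="$1"; pattern="$2"; shift 2
    elapsed=0
    while [ "$elapsed" -le "$timeout" ]; do
        result=$("$@")
        if printf '%s\n' "$result" | grep -Eq "$pattern"; then
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    echo "Got \"$result\" instead of \"$pattern\"" >&2
    return 1
}

# Hypothetical invocation mirroring line 80 of the test:
#   expect_within 20 '^6$' ec_child_up_count patchy 0
```

With this shape, a failure like "Got \"5\" instead of \"6\"" means the probe never matched within the whole window, which is exactly what the 20-second gap in the logs shows.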



Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Ashish Pandey 2018-08-13 05:14:23 UTC

While part of the problem should be fixed by https://review.gluster.org/#/c/glusterfs/+/20657/,
we should also add a check to make sure that all the brick connections have been established with shd before healing starts.
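Expressed in the test's own terms, such a gate could be an extra EXPECT_WITHIN placed before the heal-count check. The sketch below assumes a hypothetical ec_child_up_count_shd helper that reports how many bricks the self-heal daemon is connected to; it is not taken from the actual patch:

```shell
#Bring up the down bricks
TEST $CLI volume start $V0 force
EXPECT_WITHIN $CHILD_UP_TIMEOUT "6" ec_child_up_count $V0 0

#Do not check pending heals until shd itself sees all 6 bricks
EXPECT_WITHIN $CHILD_UP_TIMEOUT "6" ec_child_up_count_shd $V0 0

#Wait for heal to complete
EXPECT_WITHIN $HEAL_TIMEOUT "^0$" get_pending_heal_count $V0
```

The point of the extra line is that the client-side up count can read 6 while shd's own client xlators are still retrying SETVOLUME, as the logs above show.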

Comment 2 Amar Tumballi 2019-06-14 10:23:12 UTC
The mentioned patch is merged. Should we be closing the issue?

