Bug 1615224 - Fix spurious failures in tests/bugs/ec/ec-1468261.t [NEEDINFO]
Summary: Fix spurious failures in tests/bugs/ec/ec-1468261.t
Keywords:
Status: NEW
Alias: None
Product: GlusterFS
Classification: Community
Component: disperse
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Ashish Pandey
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-08-13 05:13 UTC by Ashish Pandey
Modified: 2019-08-31 17:59 UTC
CC List: 1 user

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
atumball: needinfo? (aspandey)



Description Ashish Pandey 2018-08-13 05:13:01 UTC
Description of problem:

Failure reported  - 

23:03:05 ok 34, LINENUM:79
23:03:05 not ok 35 Got "5" instead of "6", LINENUM:80
23:03:05 FAILED COMMAND: 6 ec_child_up_count patchy 0
23:03:05 not ok 36 Got "1298" instead of "^0$", LINENUM:83
23:03:05 FAILED COMMAND: ^0$ get_pending_heal_count patchy
23:03:05 ok 37, LINENUM:86
23:03:05 ok 38, LINENUM:87
23:03:05 not ok 39 Got "3" instead of "4", LINENUM:88
---------------------------- 
When I look at the glustershd log, I can see that there is an issue while force-starting the volume to restart the killed bricks.
The bricks are not getting connected.
I am seeing the following logs in glustershd:
================================== 
[2018-08-06 23:05:45.077699] I [MSGID: 101016] [glusterfs3.h:739:dict_to_xdr] 0-dict: key 'trusted.ec.size' is would not be sent on wire in future [Invalid argument] 
[2018-08-06 23:05:45.077724] I [MSGID: 101016] [glusterfs3.h:739:dict_to_xdr] 0-dict: key 'trusted.ec.dirty' is would not be sent on wire in future [Invalid argument] 
[2018-08-06 23:05:45.077744] I [MSGID: 101016] [glusterfs3.h:739:dict_to_xdr] 0-dict: key 'trusted.ec.version' is would not be sent on wire in future [Invalid argument] 
[2018-08-06 23:05:46.695719] I [rpc-clnt.c:2087:rpc_clnt_reconfig] 0-patchy-client-1: changing port to 49152 (from 0) 
[2018-08-06 23:05:46.699766] W [MSGID: 114043] [client-handshake.c:1061:client_setvolume_cbk] 0-patchy-client-1: failed to set the volume [Resource temporarily unavailable] 
[2018-08-06 23:05:46.699809] W [MSGID: 114007] [client-handshake.c:1090:client_setvolume_cbk] 0-patchy-client-1: failed to get 'process-uuid' from reply dict [Invalid argument] 
[2018-08-06 23:05:46.699833] E [MSGID: 114044] [client-handshake.c:1096:client_setvolume_cbk] 0-patchy-client-1: SETVOLUME on remote-host failed: cleanup flag is set for xlator.  Try again later [Resource temporarily unavailable] 
[2018-08-06 23:05:46.699855] I [MSGID: 114051] [client-handshake.c:1201:client_setvolume_cbk] 0-patchy-client-1: sending CHILD_CONNECTING event 
[2018-08-06 23:05:46.699920] I [MSGID: 114018] [client.c:2255:client_rpc_notify] 0-patchy-client-1: disconnected from patchy-client-1. Client process will keep trying to connect to glusterd until brick's port is available 
[2018-08-06 23:05:50.702806] I [rpc-clnt.c:2087:rpc_clnt_reconfig] 0-patchy-client-1: changing port to 49152 (from 0) 
[2018-08-06 23:05:50.706726] W [MSGID: 114043] [client-handshake.c:1061:client_setvolume_cbk] 0-patchy-client-1: failed to set the volume [Resource temporarily unavailable] 
[2018-08-06 23:05:50.706783] W [MSGID: 114007] [client-handshake.c:1090:client_setvolume_cbk] 0-patchy-client-1: failed to get 'process-uuid' from reply dict [Invalid argument] 
[2018-08-06 23:05:50.706808] E [MSGID: 114044] [client-handshake.c:1096:client_setvolume_cbk] 0-patchy-client-1: SETVOLUME on remote-host failed: cleanup flag is set for xlator.  Try again later [Resource temporarily unavailable] 
[2018-08-06 23:05:50.706831] I [MSGID: 114051] [client-handshake.c:1201:client_setvolume_cbk] 0-patchy-client-1: sending CHILD_CONNECTING event 
[2018-08-06 23:05:50.706904] I [MSGID: 114018] [client.c:2255:client_rpc_notify] 0-patchy-client-1: disconnected from patchy-client-1. Client process will keep trying to connect to glusterd until brick's port is available 
[2018-08-06 23:05:54.713490] I [rpc-clnt.c:2087:rpc_clnt_reconfig] 0-patchy-client-1: changing port to 49152 (from 0) 
[2018-08-06 23:05:54.717417] W [MSGID: 114043] [client-handshake.c:1061:client_setvolume_cbk] 0-patchy-client-1: failed to set the volume [Resource temporarily unavailable] 
[2018-08-06 23:05:54.717483] W [MSGID: 114007] [client-handshake.c:1090:client_setvolume_cbk] 0-patchy-client-1: failed to get 'process-uuid' from reply dict [Invalid argument] 
[2018-08-06 23:05:54.717508] E [MSGID: 114044] [client-handshake.c:1096:client_setvolume_cbk] 0-patchy-client-1: SETVOLUME on remote-host failed: cleanup flag is set for xlator.  Try again later [Resource temporarily unavailable] 
[2018-08-06 23:05:54.717530] I [MSGID: 114051] [client-handshake.c:1201:client_setvolume_cbk] 0-patchy-client-1: sending CHILD_CONNECTING event 
[2018-08-06 23:05:54.717605] I [MSGID: 114018] [client.c:2255:client_rpc_notify] 0-patchy-client-1: disconnected from patchy-client-1. Client process will keep trying to connect to glusterd until brick's port is available 
[2018-08-06 23:05:58.204494]:++++++++++ G_LOG:./tests/basic/ec/ec-1468261.t: TEST: 83 ^0$ get_pending_heal_count patchy ++++++++++ 
There are many more such logs during this period.
============================ 
Time at which the test at line 80 started:
[2018-08-06 23:05:38.652297]:++++++++++ G_LOG:./tests/basic/ec/ec-1468261.t: TEST: 80 6 ec_child_up_count patchy 0 ++++++++++ 
Time at which the next test (get pending heal count) started:
[2018-08-06 23:05:58.204494]:++++++++++ G_LOG:./tests/basic/ec/ec-1468261.t: TEST: 83 ^0$ get_pending_heal_count patchy ++++++++++ 
The time gap is 20 seconds, which is exactly how long we wait for all the connections to come up in our tests.
78 #Bring up the down bricks 
79 TEST $CLI volume start $V0 force 
80 EXPECT_WITHIN $CHILD_UP_TIMEOUT "6" ec_child_up_count $V0 0 
81 
82 #Wait for heal to complete 
83 EXPECT_WITHIN $HEAL_TIMEOUT "^0$" get_pending_heal_count $V0 
In these 20 seconds, the bricks did not come up; that is why the test on line 80 failed. Since the bricks did not come up, heal did not happen, and that led to the failure of the test on line 83.
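For context, EXPECT_WITHIN is essentially a poll-until-timeout loop: it keeps re-running the probe command until its output matches the expected pattern or the timeout expires. A minimal sketch of that behavior (an illustration with a hypothetical expect_within helper, not the actual implementation from tests/include.rc):

```shell
#!/bin/sh
# Minimal sketch of an EXPECT_WITHIN-style poll loop. Illustration only;
# not the real helper from tests/include.rc.
# Usage: expect_within <timeout-seconds> <regex> <command> [args...]
# Re-runs the command once per second until its output matches the regex
# or the timeout elapses; returns non-zero on timeout.
expect_within() {
    timeout="$1"; pattern="$2"; shift 2
    elapsed=0
    while [ "$elapsed" -le "$timeout" ]; do
        result=$("$@")
        if printf '%s\n' "$result" | grep -Eq "$pattern"; then
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    echo "Got \"$result\" instead of \"$pattern\"" >&2
    return 1
}

# Hypothetical invocation mirroring line 80 of the test:
#   expect_within 20 '^6$' ec_child_up_count patchy 0
```

With this shape, a failure like "Got \"5\" instead of \"6\"" means the probe never matched within the whole window, which is exactly what the 20-second gap in the logs shows.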



Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Ashish Pandey 2018-08-13 05:14:23 UTC

While part of the problem should be fixed by https://review.gluster.org/#/c/glusterfs/+/20657/,
we should also add a check to make sure that all the brick connections have been established with shd before healing starts.
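Expressed in the test's own terms, such a gate could be an extra EXPECT_WITHIN placed before the heal-count check. The sketch below assumes a hypothetical ec_child_up_count_shd helper that reports how many bricks the self-heal daemon is connected to; it is not taken from the actual patch:

```shell
#Bring up the down bricks
TEST $CLI volume start $V0 force
EXPECT_WITHIN $CHILD_UP_TIMEOUT "6" ec_child_up_count $V0 0

#Do not check pending heals until shd itself sees all 6 bricks
EXPECT_WITHIN $CHILD_UP_TIMEOUT "6" ec_child_up_count_shd $V0 0

#Wait for heal to complete
EXPECT_WITHIN $HEAL_TIMEOUT "^0$" get_pending_heal_count $V0
```

The point of the extra line is that the client-side up count can read 6 while shd's own client xlators are still retrying SETVOLUME, as the logs above show.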

Comment 2 Amar Tumballi 2019-06-14 10:23:12 UTC
The mentioned patch is merged. Should we be closing the issue?

