Bug 1658451

Summary: Mountpoint not accessible for a few seconds when bricks are brought down to max redundancy after reset brick
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Upasana <ubansal>
Component: disperse
Assignee: Ashish Pandey <aspandey>
Status: CLOSED WORKSFORME
QA Contact: Nag Pavan Chilakam <nchilaka>
Severity: high
Docs Contact:
Priority: unspecified
Version: rhgs-3.4
CC: pkarampu, rhinduja, rhs-bugs, storage-qa-internal, ubansal, vavuthu
Target Milestone: ---
Keywords: Automation, ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1658472
Environment:
Last Closed: 2020-02-04 06:18:56 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1658472    

Description Upasana 2018-12-12 07:32:49 UTC
Description of problem:
========================
Wrote an automation script for reset-brick on an EC volume; it was failing 2 out of 11 runs while collecting the arequal checksum after bringing down bricks to maximum redundancy.
So an "ls -lrt /mnt" was added just before the arequal step, and the logs show that the mount point is not accessible at that moment (diagnostic commands for that state are sketched after the log excerpt):

2018-12-12 12:28:11,941 INFO (run) root.35.11 (cp): ls -lrt /mnt
2018-12-12 12:28:11,941 DEBUG (_get_ssh_connection) Retrieved connection from cache: root.35.11

2018-12-12 12:28:12,432 INFO (_log_results) RETCODE (root.35.11): 1
2018-12-12 12:28:12,433 INFO (_log_results) STDOUT (root.35.11)...
total 0
d?????????? ? ?    ?    ?            ? testvol_dispersed_glusterfs
drwxr-xr-x. 2 root root 6 Dec 11 02:22 tmp

2018-12-12 12:28:12,433 INFO (_log_results) STDERR (root.35.11)...
ls: cannot access /mnt/testvol_dispersed_glusterfs: Transport endpoint is not connected
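
The "Transport endpoint is not connected" error means the FUSE client considers the volume unreachable at that instant, even though only up to the redundancy count of bricks was taken down, so the volume should still be serviceable. A minimal sketch of what to check in that window (the volume name testvol_dispersed is an assumption based on the mount-point name, and the client log path assumes the usual mount-path naming; adjust both to the actual setup):

# on a server node: how many bricks are actually online at that moment
gluster volume status testvol_dispersed

# on the client: the FUSE mount's own view of the disconnect
grep -i disconnect /var/log/glusterfs/mnt-testvol_dispersed_glusterfs.log | tail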




Version-Release number of selected component (if applicable):
=============================================================
3.4


How reproducible:
================
Downstream - 2 out of 11 runs
Upstream - 2 out of 2 runs

Steps to Reproduce:
===================
Create an EC volume and mount it
        - Start IO on dir2 of the volume mount point
        - Reset brick start
        - Check if brick is offline
        - Reset brick with destination same as source, with force, while IOs are running
        - Validate IOs and wait for them to complete on dir2
        - Remove dir2
        - Create 5 directories and 5 files in dir1 of the mount point
        - Rename all files inside dir1 at the mount point
        - Create softlinks and hardlinks of the files in dir1 of the mount point
        - Delete all files in one of the dirs inside dir1
        - Change permissions and ownership (chmod, chown, chgrp)
        - Create tiny, small, medium and large file
        - Create IOs
        - Validate IOs and wait for them to complete
        - Calculate arequal before killing the brick
        - Get brick from Volume
        - Reset brick
        - Check if brick is offline
        - Reset brick by giving a different source and dst node --> Fails (Expected)
        - Reset brick by giving dst and source same without force --> Fails (Expected)
        - Obtain hostname
        - Reset brick with dst and source same, with force, using hostname --> Successful
        - Monitor heal completion
        - Bring down other bricks to max redundancy
        - Get arequal after bringing down bricks (see the command sketch after this list)
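
The reset-brick and brick-down portion of the sequence maps to roughly the following gluster CLI calls (a manual sketch, not the automation itself; the volume name, hostname and brick paths are placeholders):

# take one brick offline for reset
gluster volume reset-brick testvol_dispersed server1:/bricks/brick0 start
gluster volume status testvol_dispersed        # the brick should now show Online: N

# same source and destination, forced (the variant that succeeds)
gluster volume reset-brick testvol_dispersed server1:/bricks/brick0 server1:/bricks/brick0 commit force
gluster volume heal testvol_dispersed info     # wait until no entries are left to heal

# bring down additional bricks up to the redundancy count by killing their brick processes
gluster volume status testvol_dispersed        # note the PIDs of the bricks to kill
kill -9 <brick-pid>                            # repeat for up to 'redundancy' bricks per subvolume

# the step that intermittently fails right afterwards
ls -lrt /mnt
arequal-checksum -p /mnt/testvol_dispersed_glusterfs -i .trashcan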


Actual results:
================
Getting the arequal checksum fails with the error below:
2018-12-12 12:28:12,435 INFO (run_async) root.35.11 (cp): arequal-checksum -p /mnt/testvol_dispersed_glusterfs -i .trashcan
2018-12-12 12:28:12,436 DEBUG (_get_ssh_connection) Retrieved connection from cache: root.35.11
2018-12-12 12:28:13,117 INFO (_log_results) RETCODE (root.35.11): 1
2018-12-12 12:28:13,119 INFO (_log_results) STDERR (root.35.11)...
ftw (-p) returned -1 (Transport endpoint is not connected), terminating
2018-12-12 12:28:13,119 ERROR (collect_mounts_arequal) Collecting arequal-checksum failed on 10.70.35.11:/mnt/testvol_dispersed_glusterfs


Expected results:
=================
The mount point should remain accessible and the arequal checksum should be collected successfully.


Additional info:
=================
This issue is seen only for a few seconds, after which the mount point becomes accessible again, so it is very difficult to reproduce manually.

Tried a couple of times but was not able to reproduce the issue manually on downstream
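
Since the window lasts only a few seconds, one way to catch it by hand is to poll the mount point in a tight loop immediately after killing the bricks (a sketch; the 60-second budget and one-second interval are arbitrary choices):

MOUNT=/mnt/testvol_dispersed_glusterfs
# poll once per second for a minute; log a timestamp whenever the mount is unreachable
for i in $(seq 1 60); do
    ls "$MOUNT" > /dev/null 2>&1 || echo "$(date +%T) $MOUNT not accessible"
    sleep 1
done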