Bug 1658451

Summary: Mountpoint not accessible for a few seconds when bricks are brought down to max redundancy after reset brick
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Upasana <ubansal>
Component: disperse
Assignee: Ashish Pandey <aspandey>
Status: CLOSED WORKSFORME
QA Contact: Nag Pavan Chilakam <nchilaka>
Severity: high
Docs Contact:
Priority: unspecified
Version: rhgs-3.4
CC: pkarampu, rhinduja, rhs-bugs, storage-qa-internal, ubansal, vavuthu
Target Milestone: ---
Keywords: Automation, ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1658472
Environment:
Last Closed: 2020-02-04 06:18:56 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1658472    

Description Upasana 2018-12-12 07:32:49 UTC
Description of problem:
========================
Wrote an automation script for reset-brick on an EC volume; it was failing 2 out of 11 runs while collecting the arequal checksum after bringing down bricks to maximum redundancy.
So an "ls -lrt /mnt" was added just before the arequal step, and the logs show that the mount point is not accessible at that moment (diagnostic commands for that state are sketched after the log excerpt):

2018-12-12 12:28:11,941 INFO (run) root.35.11 (cp): ls -lrt /mnt
2018-12-12 12:28:11,941 DEBUG (_get_ssh_connection) Retrieved connection from cache: root.35.11

2018-12-12 12:28:12,432 INFO (_log_results) RETCODE (root.35.11): 1
2018-12-12 12:28:12,433 INFO (_log_results) STDOUT (root.35.11)...
total 0
d?????????? ? ?    ?    ?            ? testvol_dispersed_glusterfs
drwxr-xr-x. 2 root root 6 Dec 11 02:22 tmp

2018-12-12 12:28:12,433 INFO (_log_results) STDERR (root.35.11)...
ls: cannot access /mnt/testvol_dispersed_glusterfs: Transport endpoint is not connected
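
The "Transport endpoint is not connected" error means the FUSE client considers the volume unreachable at that instant, even though only up to the redundancy count of bricks was taken down, so the volume should still be serviceable. A minimal sketch of what to check in that window (the volume name testvol_dispersed is an assumption based on the mount-point name, and the client log path assumes the usual mount-path naming; adjust both to the actual setup):

# on a server node: how many bricks are actually online at that moment
gluster volume status testvol_dispersed

# on the client: the FUSE mount's own view of the disconnect
grep -i disconnect /var/log/glusterfs/mnt-testvol_dispersed_glusterfs.log | tail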




Version-Release number of selected component (if applicable):
=============================================================
3.4


How reproducible:
================
Downstream - 2 out of 11 runs
Upstream - 2 out of 2 runs

Steps to Reproduce:
===================
Create an EC volume and mount it
        - Start IO on dir2 of the volume mount point
        - Reset brick start
        - Check if brick is offline
        - Reset brick with destination same as source, with force, while IOs are running
        - Validate IOs and wait for them to complete on dir2
        - Remove dir2
        - Create 5 directories and 5 files in dir1 of the mount point
        - Rename all files inside dir1 at the mount point
        - Create softlinks and hardlinks of the files in dir1 of the mount point
        - Delete all files in one of the dirs inside dir1
        - Change permissions and ownership (chmod, chown, chgrp)
        - Create tiny, small, medium and large file
        - Create IOs
        - Validate IOs and wait for them to complete
        - Calculate arequal before killing the brick
        - Get brick from Volume
        - Reset brick
        - Check if brick is offline
        - Reset brick by giving a different source and dst node --> Fails (Expected)
        - Reset brick by giving dst and source same without force --> Fails (Expected)
        - Obtain hostname
        - Reset brick with dst and source same, with force, using hostname --> Successful
        - Monitor heal completion
        - Bring down other bricks to max redundancy
        - Get arequal after bringing down bricks (see the command sketch after this list)
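
The reset-brick and brick-down portion of the sequence maps to roughly the following gluster CLI calls (a manual sketch, not the automation itself; the volume name, hostname and brick paths are placeholders):

# take one brick offline for reset
gluster volume reset-brick testvol_dispersed server1:/bricks/brick0 start
gluster volume status testvol_dispersed        # the brick should now show Online: N

# same source and destination, forced (the variant that succeeds)
gluster volume reset-brick testvol_dispersed server1:/bricks/brick0 server1:/bricks/brick0 commit force
gluster volume heal testvol_dispersed info     # wait until no entries are left to heal

# bring down additional bricks up to the redundancy count by killing their brick processes
gluster volume status testvol_dispersed        # note the PIDs of the bricks to kill
kill -9 <brick-pid>                            # repeat for up to 'redundancy' bricks per subvolume

# the step that intermittently fails right afterwards
ls -lrt /mnt
arequal-checksum -p /mnt/testvol_dispersed_glusterfs -i .trashcan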


Actual results:
================
Getting the arequal checksum fails with the error below:
2018-12-12 12:28:12,435 INFO (run_async) root.35.11 (cp): arequal-checksum -p /mnt/testvol_dispersed_glusterfs -i .trashcan
2018-12-12 12:28:12,436 DEBUG (_get_ssh_connection) Retrieved connection from cache: root.35.11
2018-12-12 12:28:13,117 INFO (_log_results) RETCODE (root.35.11): 1
2018-12-12 12:28:13,119 INFO (_log_results) STDERR (root.35.11)...
ftw (-p) returned -1 (Transport endpoint is not connected), terminating
2018-12-12 12:28:13,119 ERROR (collect_mounts_arequal) Collecting arequal-checksum failed on 10.70.35.11:/mnt/testvol_dispersed_glusterfs


Expected results:
=================
The mount point should remain accessible and the arequal checksum should be collected successfully.


Additional info:
=================
This issue is seen only for a few seconds, after which the mount point becomes accessible again, so it is very difficult to reproduce manually.

Tried a couple of times but was not able to reproduce the issue manually on downstream
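
Since the window lasts only a few seconds, one way to catch it by hand is to poll the mount point in a tight loop immediately after killing the bricks (a sketch; the 60-second budget and one-second interval are arbitrary choices):

MOUNT=/mnt/testvol_dispersed_glusterfs
# poll once per second for a minute; log a timestamp whenever the mount is unreachable
for i in $(seq 1 60); do
    ls "$MOUNT" > /dev/null 2>&1 || echo "$(date +%T) $MOUNT not accessible"
    sleep 1
done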