Bug 1658451 - Mountpoint not accessible for a few seconds when bricks are brought down to max redundancy after reset brick
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: disperse
Version: rhgs-3.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Ashish Pandey
QA Contact: Nag Pavan Chilakam
URL:
Whiteboard:
Depends On:
Blocks: 1658472
 
Reported: 2018-12-12 07:32 UTC by Upasana
Modified: 2020-02-04 06:18 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Cloned to: 1658472
Environment:
Last Closed: 2020-02-04 06:18:56 UTC
Embargoed:



Description Upasana 2018-12-12 07:32:49 UTC
Description of problem:
========================
Wrote an automation script for reset brick on an EC volume; it was failing 2 out of 11 runs while getting the arequal after bringing bricks down to max redundancy.
So an "ls -lrt /mnt" was added before getting the arequal at that point, and the logs show that the mount point is not accessible:

2018-12-12 12:28:11,941 INFO (run) root.35.11 (cp): ls -lrt /mnt
2018-12-12 12:28:11,941 DEBUG (_get_ssh_connection) Retrieved connection from cache: root.35.11

2018-12-12 12:28:12,432 INFO (_log_results) RETCODE (root.35.11): 1
2018-12-12 12:28:12,433 INFO (_log_results) STDOUT (root.35.11)...
total 0
d?????????? ? ?    ?    ?            ? testvol_dispersed_glusterfs
drwxr-xr-x. 2 root root 6 Dec 11 02:22 tmp

2018-12-12 12:28:12,433 INFO (_log_results) STDERR (root.35.11)...
ls: cannot access /mnt/testvol_dispersed_glusterfs: Transport endpoint is not connected




Version-Release number of selected component (if applicable):
=============================================================
3.4


How reproducible:
================
Downstream - 2/11
Upstream - 2/2

Steps to Reproduce:
===================
Create an EC volume and mount the volume
        - Create IO on dir2 of the volume mountpoint
        - Reset brick start
        - Check if the brick is offline
        - Reset brick with destination same as source, with force, while IO is running
        - Validate the IO on dir2 and wait for it to complete
        - Remove dir2
        - Create 5 directories and 5 files in dir1 of the mountpoint
        - Rename all files inside dir1 at the mountpoint
        - Create softlinks and hardlinks of the files in dir1 of the mountpoint
        - Delete all files in one of the dirs inside dir1
        - Run chmod, chown and chgrp on the files
        - Create tiny, small, medium and large files
        - Create IO
        - Validate the IO and wait for it to complete
        - Calculate arequal before killing the brick
        - Get a brick from the volume
        - Reset brick
        - Check if the brick is offline
        - Reset brick giving different source and destination nodes --> fails (expected)
        - Reset brick giving the same destination and source without force --> fails (expected)
        - Obtain the hostname
        - Reset brick with destination same as source, with force, using the hostname --> successful
        - Monitor heal completion
        - Bring down other bricks to max redundancy
        - Get arequal after bringing down the bricks (a rough CLI sketch of this sequence follows)
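
A rough CLI sketch of the sequence above, for reference only. It assumes a 4+2 disperse volume named testvol_dispersed with placeholder hostnames (server1..server6) and brick paths; these names are illustrative and not the exact values used by the automation run:

# Create and mount the EC volume (4+2 layout assumed)
gluster volume create testvol_dispersed disperse-data 4 redundancy 2 \
    server{1..6}:/bricks/brick0/testvol_dispersed force
gluster volume start testvol_dispersed
mount -t glusterfs server1:/testvol_dispersed /mnt/testvol_dispersed_glusterfs

# Reset one brick with destination same as source, with force
gluster volume reset-brick testvol_dispersed server1:/bricks/brick0/testvol_dispersed start
gluster volume status testvol_dispersed   # the brick should now show as offline
gluster volume reset-brick testvol_dispersed \
    server1:/bricks/brick0/testvol_dispersed \
    server1:/bricks/brick0/testvol_dispersed commit force

# Monitor heal completion until 'Number of entries' is 0 for all bricks
gluster volume heal testvol_dispersed info

# Bring down other bricks up to max redundancy (2 for a 4+2 volume):
# note the brick PIDs from 'gluster volume status' and kill them (placeholders below)
kill -15 <brick-pid-1> <brick-pid-2>

# On the client, this is the step that intermittently fails with ENOTCONN
ls -lrt /mnt
arequal-checksum -p /mnt/testvol_dispersed_glusterfs -i .trashcan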


Actual results:
================
Getting the arequal fails with the error below:
2018-12-12 12:28:12,435 INFO (run_async) root.35.11 (cp): arequal-checksum -p /mnt/testvol_dispersed_glusterfs -i .trashcan
2018-12-12 12:28:12,436 DEBUG (_get_ssh_connection) Retrieved connection from cache: root.35.11
2018-12-12 12:28:13,117 INFO (_log_results) RETCODE (root.35.11): 1
2018-12-12 12:28:13,119 INFO (_log_results) STDERR (root.35.11)...
ftw (-p) returned -1 (Transport endpoint is not connected), terminating
2018-12-12 12:28:13,119 ERROR (collect_mounts_arequal) Collecting arequal-checksum failed on 10.70.35.11:/mnt/testvol_dispersed_glusterfs


Expected results:
=================
Getting the arequal should pass; the mount point should stay accessible after bricks are brought down to max redundancy.


Additional info:
=================
This issue is seen only for a few seconds, after which the mount point becomes accessible again, so it is very difficult to reproduce manually.

Tried a couple of times but was not able to reproduce the issue manually on downstream.
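
Since the inaccessible window lasts only a few seconds, a small polling loop on the client may make it easier to catch manually. This is only a sketch, assuming the same mount path as above:

# Run on the client while bricks are being brought down from another terminal;
# a failing stat corresponds to the transient "Transport endpoint is not
# connected" window reported above.
while true; do
    if ! stat /mnt/testvol_dispersed_glusterfs >/dev/null 2>&1; then
        echo "$(date -u +%FT%TZ) mount point not accessible"
    fi
    sleep 0.5
done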

