Bug 2232121

Summary: [EC] Commvault RCA: Commvault backups fail when writing to bricks
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Andrew Robinson <anrobins>
Component: disperse
Assignee: Sheetal Pamecha <spamecha>
Status: ASSIGNED
QA Contact:
Severity: high
Docs Contact:
Priority: unspecified
Version: rhgs-3.5
CC: abhishku, spamecha
Target Milestone: ---
Flags: abhishku: needinfo? (spamecha)
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Andrew Robinson 2023-08-15 13:09:34 UTC
Before you record your issue, ensure you are using the latest version of Gluster.


Provide the version-release number of the selected component (if applicable):

glusterfs-6.0-63.el7rhgs.x86_64

Have you searched the Bugzilla archives for the same or similar reported issues?



Did you run an SoS report with the Insights tool?



Have you discovered any workarounds?
If not, read the troubleshooting documentation to help solve your issue:
https://mojo.redhat.com/groups/gss-gluster (Gluster features and their troubleshooting)
https://access.redhat.com/articles/1365073 (specific debug data that needs to be collected for GlusterFS to help troubleshooting)



Please provide the below Mandatory Information:
1 - gluster v <volname> info
2 - gluster v <volname> heal info
3 - gluster v <volname> status
4 - Access method: FUSE mount/SMB/nfs-ganesha/OCS?
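For reference, the three mandatory gluster outputs above can be gathered in one pass from any node of the trusted pool. A minimal sketch (the volume name is a placeholder; `mandatory_info_cmds` and `collect` are illustrative helpers, not part of any gluster tooling):

```python
import subprocess

def mandatory_info_cmds(volname):
    # The three mandatory diagnostics listed above, in long-form CLI syntax.
    return [
        ["gluster", "volume", "info", volname],
        ["gluster", "volume", "heal", volname, "info"],
        ["gluster", "volume", "status", volname],
    ]

def collect(volname):
    # Requires the gluster CLI; run on a node of the trusted storage pool.
    return {
        " ".join(cmd): subprocess.run(
            cmd, capture_output=True, text=True, check=False
        ).stdout
        for cmd in mandatory_info_cmds(volname)
    }
```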



Describe the issue (please be as detailed as possible, provide log snippets, and include a timestamp of when the issue is seen):

glusterfs-6.0-63.el7rhgs.x86_64

From the support case description:

~~~
What are you experiencing? What are you expecting to happen?
Backups that write to the gluster disk intermittently error out with "Failed to mount the disk media in library"

Define the value or impact to you or the business
backups are impacted

Where are you experiencing this behavior? What environment?
From the commvault log, we could find that the backup job failed to get gluster media to write the data

2742 5231 08/12 04:36:40 6943840 [DM_BASE    ] 23080429--1 Failed to get a media to write the backup data 
2742 5231 08/12 04:36:40 6943840 [DM_RECEIVER] 23080429--1 DataReceiver::InitWriter: DataWriter Init failed for media_group [3957]
2742 5231 08/12 04:36:42 ####### [DSBACKUP   ] ERROR: DataReceiver reported Initialization Failure
2742 5231 08/12 04:36:42 ####### [DSBACKUP   ] Error During DataMover Initialization Type: 16 SubT'

- During this timestamp, the gluster logs show "Transport endpoint is not connected" errors.
 
glusterfs\ws-glus_69.log
*********************************************
The message "W [MSGID: 122053] [ec-common.c:331:ec_check_status] 0-CHBSP1_devid_69-disperse-3: Operation failed on 1 of 6 subvolumes.(up=111111, mask=111110, remaining=000000, good=111110, bad=000001,(Least significant bit represents first client/brick of subvol), FOP : 'STAT' failed on '/Folder_D6XXFP_10.18.2021_17.26/CV_MAGNETIC' with gfid e4695067-407f-473e-8d25-007b0352c9f1. Parent FOP: No Parent)" repeated 12 times between [2023-08-12 04:22:40.956384] and [2023-08-12 04:23:01.069339]
[2023-08-12 04:26:24.283028] E [rpc-clnt.c:183:call_bail] 0-CHBSP1_devid_69-client-18: bailing out frame type(GlusterFS 4.x v1), op(INODELK(29)), xid = 0x1481fd21, unique = 8844720681, sent = 2023-08-12 03:56:22.148967, timeout = 1800 for 10.166.168.149:49159
[2023-08-12 04:26:24.283090] E [MSGID: 114031] [client-rpc-fops_v2.c:1346:client4_0_inodelk_cbk] 0-CHBSP1_devid_69-client-18: remote operation failed [Transport endpoint is not connected]
[2023-08-12 04:28:54.341795] E [rpc-clnt.c:183:call_bail] 0-CHBSP1_devid_69-client-18: bailing out frame type(GlusterFS 4.x v1), op(INODELK(29)), xid = 0x148200c6, unique = 8844741275, sent = 2023-08-12 03:58:52.184769, timeout = 1800 for 10.166.168.149:49159
[2023-08-12 04:28:54.341839] E [MSGID: 114031] [client-rpc-fops_v2.c:1346:client4_0_inodelk_cbk] 0-CHBSP1_devid_69-client-18: remote operation failed [Transport endpoint is not connected]
[2023-08-12 04:28:54.341907] E [rpc-clnt.c:183:call_bail] 0-CHBSP1_devid_69-client-18: bailing out frame type(GlusterFS 4.x v1), op(INODELK(29)), xid = 0x148200bf, unique = 8844741274, sent = 2023-08-12 03:58:51.983738, timeout = 1800 for 10.166.168.149:49159
[2023-08-12 04:28:54.341916] E [MSGID: 114031] [client-rpc-fops_v2.c:1346:client4_0_inodelk_cbk] 0-CHBSP1_devid_69-client-18: remote operation failed [Transport endpoint is not connected]
[2023-08-12 04:56:25.050003] E [rpc-clnt.c:183:call_bail] 0-CHBSP1_devid_69-client-19: bailing out frame type(GlusterFS 4.x v1), op(INODELK(29)), xid = 0x147fbb39, unique = 8844720681, sent = 2023-08-12 04
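The ec_check_status bitmasks above already identify the failing brick: the message states the least significant bit is the first brick of the disperse subvolume, so bad=000001 on disperse-3 with 6 bricks per subvolume points at the subvolume's first brick. Assuming client translators are numbered sequentially across subvolumes (which matches this volume: disperse-3 would cover client-18 through client-23), that is client-18, consistent with the call_bail lines. A small sketch of that decoding (helper names are illustrative):

```python
def set_bits(mask):
    # Indices of set bits in an EC status mask such as "bad=000001".
    # The least significant (rightmost) bit is the first brick of the
    # disperse subvolume, so index 0 means the subvolume's first brick.
    return [i for i, b in enumerate(reversed(mask)) if b == "1"]

def client_indices(subvol, width, bits):
    # Map local brick indices to client-N translator numbers as seen in
    # the mount log, assuming sequential numbering across subvolumes.
    return [subvol * width + i for i in bits]

# Values from the ec_check_status message: disperse-3, 6 bricks, bad=000001
bad_bricks = set_bits("000001")          # [0] -> first brick of disperse-3
print(client_indices(3, 6, bad_bricks))  # [18] -> ...-client-18
```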


When does this behavior occur? Frequency? Repeatedly? At certain times?
- Need to find why transport endpoint errors are reported, and how to resolve them.
- Validated that there are no brick failures and not many pending heals.
~~~

The logs for the volume mount (ws-glus-69.log) on each gluster node show many of these errors:

0-CHBSP1_devid_69-client-18: remote operation failed [Transport endpoint is not connected]
0-CHBSP1_devid_69-client-19: remote operation failed [Transport endpoint is not connected]

The failing translators are client-18 and client-19 on each node.
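To map client-18 and client-19 to concrete bricks, the client volfile (on RHGS typically under /var/lib/glusterd/vols/<volname>/) has one "volume <volname>-client-N" section per brick with its remote-host and remote-subvolume options; the call_bail lines above already tie client-18 to 10.166.168.149:49159, so that host's brick is the first suspect. A hedged lookup sketch (the sample volfile fragment and brick path are illustrative, not taken from this case):

```python
import re

def brick_for_client(volfile_text, client_name):
    # Return (host, brick_path) for a protocol/client section named
    # like "...-client-18" in a glusterfs client volfile, or None.
    section = re.search(
        r"volume \S*-" + re.escape(client_name) + r"\n(.*?)end-volume",
        volfile_text, re.S)
    if not section:
        return None
    body = section.group(1)
    host = re.search(r"option remote-host (\S+)", body).group(1)
    path = re.search(r"option remote-subvolume (\S+)", body).group(1)
    return host, path

# Illustrative volfile fragment (the brick path is made up):
sample = """volume CHBSP1_devid_69-client-18
    type protocol/client
    option remote-host 10.166.168.149
    option remote-subvolume /rhgs/brick19/data
end-volume
"""
print(brick_for_client(sample, "client-18"))
```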


Is this issue reproducible? If yes, share more details:

glusterfs-6.0-63.el7rhgs.x86_64
Steps to Reproduce:
1.
2.
3.
Actual results:
 
Expected results:
 
Any Additional info:

It appears the problem is communication with two bricks. The customer wants to know how to prevent the problem in the future. I want to know which bricks are being pointed out and whether there is anything wrong with those two bricks.