Bug 2232121 - [EC] Commvault RCA Commvault backups fail in writing to bricks [NEEDINFO]
Summary: [EC] Commvault RCA Commvault backups fail in writing to bricks
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: disperse
Version: rhgs-3.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Sheetal Pamecha
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2023-08-15 13:09 UTC by Andrew Robinson
Modified: 2023-08-16 09:18 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:
abhishku: needinfo? (spamecha)



Description Andrew Robinson 2023-08-15 13:09:34 UTC
Version-Release number of selected component (if applicable):

glusterfs-6.0-63.el7rhgs.x86_64

Please provide the below Mandatory Information:
1 - gluster v <volname> info
2 - gluster v <volname> heal info
3 - gluster v <volname> status
4 - Fuse Mount/SMB/nfs-ganesha/OCS ???



Describe the issue (please be as detailed as possible, with timestamps and log snippets):

From the support case description:

~~~
What are you experiencing? What are you expecting to happen?
Backups that write to the Gluster disk error out intermittently with "Failed to mount the disk media in library".

Define the value or impact to you or the business
backups are impacted

Where are you experiencing this behavior? What environment?
From the Commvault log, we can see that the backup job failed to get Gluster media to write the data:

2742 5231 08/12 04:36:40 6943840 [DM_BASE    ] 23080429--1 Failed to get a media to write the backup data 
2742 5231 08/12 04:36:40 6943840 [DM_RECEIVER] 23080429--1 DataReceiver::InitWriter: DataWriter Init failed for media_group [3957]
2742 5231 08/12 04:36:42 ####### [DSBACKUP   ] ERROR: DataReceiver reported Initialization Failure
2742 5231 08/12 04:36:42 ####### [DSBACKUP   ] Error During DataMover Initialization Type: 16 SubT'

- Around this timestamp, the Gluster logs show "Transport endpoint is not connected" errors.
 
glusterfs\ws-glus_69.log
*********************************************
The message "W [MSGID: 122053] [ec-common.c:331:ec_check_status] 0-CHBSP1_devid_69-disperse-3: Operation failed on 1 of 6 subvolumes.(up=111111, mask=111110, remaining=000000, good=111110, bad=000001,(Least significant bit represents first client/brick of subvol), FOP : 'STAT' failed on '/Folder_D6XXFP_10.18.2021_17.26/CV_MAGNETIC' with gfid e4695067-407f-473e-8d25-007b0352c9f1. Parent FOP: No Parent)" repeated 12 times between [2023-08-12 04:22:40.956384] and [2023-08-12 04:23:01.069339]
[2023-08-12 04:26:24.283028] E [rpc-clnt.c:183:call_bail] 0-CHBSP1_devid_69-client-18: bailing out frame type(GlusterFS 4.x v1), op(INODELK(29)), xid = 0x1481fd21, unique = 8844720681, sent = 2023-08-12 03:56:22.148967, timeout = 1800 for 10.166.168.149:49159
[2023-08-12 04:26:24.283090] E [MSGID: 114031] [client-rpc-fops_v2.c:1346:client4_0_inodelk_cbk] 0-CHBSP1_devid_69-client-18: remote operation failed [Transport endpoint is not connected]
[2023-08-12 04:28:54.341795] E [rpc-clnt.c:183:call_bail] 0-CHBSP1_devid_69-client-18: bailing out frame type(GlusterFS 4.x v1), op(INODELK(29)), xid = 0x148200c6, unique = 8844741275, sent = 2023-08-12 03:58:52.184769, timeout = 1800 for 10.166.168.149:49159
[2023-08-12 04:28:54.341839] E [MSGID: 114031] [client-rpc-fops_v2.c:1346:client4_0_inodelk_cbk] 0-CHBSP1_devid_69-client-18: remote operation failed [Transport endpoint is not connected]
[2023-08-12 04:28:54.341907] E [rpc-clnt.c:183:call_bail] 0-CHBSP1_devid_69-client-18: bailing out frame type(GlusterFS 4.x v1), op(INODELK(29)), xid = 0x148200bf, unique = 8844741274, sent = 2023-08-12 03:58:51.983738, timeout = 1800 for 10.166.168.149:49159
[2023-08-12 04:28:54.341916] E [MSGID: 114031] [client-rpc-fops_v2.c:1346:client4_0_inodelk_cbk] 0-CHBSP1_devid_69-client-18: remote operation failed [Transport endpoint is not connected]
[2023-08-12 04:56:25.050003] E [rpc-clnt.c:183:call_bail] 0-CHBSP1_devid_69-client-19: bailing out frame type(GlusterFS 4.x v1), op(INODELK(29)), xid = 0x147fbb39, unique = 8844720681, sent = 2023-08-12 04


When does this behavior occur? Frequency? Repeatedly? At certain times?
- We need to find out why the "Transport endpoint is not connected" errors are reported, and how to resolve them.
- We validated that there are no brick failures, and there are not many pending heals.
~~~
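The ec_check_status line above encodes brick health as bitmasks (up, mask, good, bad), and the message itself states that the least significant bit represents the first client/brick of the subvolume. A small sketch (plain Python, not Gluster code) to decode which brick indices a mask refers to:

```python
# Decode the bitmask fields from an ec_check_status log line, e.g.
# "up=111111, mask=111110, ..., bad=000001" on a 6-brick disperse subvol.
# The rightmost character is the least significant bit = first brick.

def bricks_from_mask(mask: str) -> list:
    """Return 0-based brick indices whose bit is set in the mask string."""
    return [i for i, bit in enumerate(reversed(mask)) if bit == "1"]

# Values taken from the log excerpt above (disperse-3 subvolume).
print(bricks_from_mask("000001"))  # bad  -> [0]: first brick of the subvol
print(bricks_from_mask("111110"))  # good -> [1, 2, 3, 4, 5]
```

So for disperse-3, the STAT failures are on the first brick of that subvolume; `gluster v <volname> info` lists the bricks in the same order as the client indices.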

The mount log ws-glus-69.log on each Gluster node shows many of these errors:

0-CHBSP1_devid_69-client-18: remote operation failed [Transport endpoint is not connected]
0-CHBSP1_devid_69-client-19: remote operation failed [Transport endpoint is not connected]

On every node, the failures are against client-18 and client-19.
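To map a client xlator name such as CHBSP1_devid_69-client-18 back to a physical brick, the FUSE client volfile (kept under /var/lib/glusterd/vols/<volname>/ on the servers) defines each protocol/client translator with remote-host and remote-subvolume options. A rough sketch of that lookup, using a hypothetical volfile excerpt (the brick path is invented for illustration; only the host 10.166.168.149 appears in the logs above):

```python
import re

# Hypothetical minimal excerpt of a FUSE client volfile. The brick
# path /gluster/brick19/data is an invented example.
VOLFILE = """
volume CHBSP1_devid_69-client-18
    type protocol/client
    option remote-host 10.166.168.149
    option remote-subvolume /gluster/brick19/data
end-volume
"""

def client_to_brick(volfile: str, client: str) -> str:
    """Return 'host:path' for the named protocol/client xlator."""
    block = re.search(
        rf"volume {re.escape(client)}\n(.*?)end-volume", volfile, re.S
    )
    if not block:
        raise KeyError(client)
    host = re.search(r"option remote-host (\S+)", block.group(1)).group(1)
    path = re.search(r"option remote-subvolume (\S+)", block.group(1)).group(1)
    return f"{host}:{path}"

print(client_to_brick(VOLFILE, "CHBSP1_devid_69-client-18"))
# -> 10.166.168.149:/gluster/brick19/data
```

Comparing the resolved host:path against `gluster v <volname> status` would show whether those two bricks (or their network path) are the common factor.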


Is this issue reproducible? If yes, share more details.:

Seen on glusterfs-6.0-63.el7rhgs.x86_64; exact steps to reproduce are not known.
Any Additional info:

It appears the problem is communication with two bricks. The customer wants to know how to prevent the problem in the future. I want to know which bricks are being pointed at, and whether there is anything wrong with those two bricks.
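For what it's worth, the call_bail entries are consistent with the logged 1800-second frame timeout (network.frame-timeout defaults to 30 minutes): the INODELK frame sent at 03:56:22 bails at 04:26:24. A quick arithmetic check of those two timestamps:

```python
from datetime import datetime, timedelta

# Timestamps copied from the first call_bail line in the log excerpt.
sent = datetime.fromisoformat("2023-08-12 03:56:22.148967")
bailed = datetime.fromisoformat("2023-08-12 04:26:24.283028")

# The log reports timeout = 1800 seconds for this frame.
elapsed = bailed - sent
print(elapsed >= timedelta(seconds=1800))  # True: the frame timed out
```

That suggests the INODELK requests to client-18/client-19 were never answered for a full 30 minutes before the client gave up, pointing at those two bricks (or the network path to 10.166.168.149) rather than at the EC translator itself.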

