Bug 1540282 - [GSS] EBADF errors filling up /var on all storage nodes
Summary: [GSS] EBADF errors filling up /var on all storage nodes
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: core
Version: rhgs-3.3
Hardware: All
OS: Linux
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Ravishankar N
QA Contact: Rahul Hinduja
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-01-30 16:59 UTC by Pan Ousley
Modified: 2021-06-10 14:24 UTC (History)
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-02-26 13:27:17 UTC
Embargoed:



Description Pan Ousley 2018-01-30 16:59:06 UTC
Description of problem: 

EBADF errors are filling up /var (brick logs) on all storage nodes almost every day, and the same errors appear in the client logs. This has been happening since around Nov 27, 2017, which is roughly when the customer patched their systems.

The volumes are all 3x2=6 dist-rep volumes. I will post the gluster v info in a private comment. In this configuration there are multiple bricks running from the same XFS filesystem. The customer is aware that this is not a recommended configuration, but I'm not sure whether it is contributing to this issue.

We collected straces, and our analysis led us to believe that the issue is with the application. We saw that a file was opened with the O_WRONLY flag (open for writing only) and that the application then attempted to read from it without closing and reopening it in the interim. We were able to catch one of these invalid read requests and map it back to the open request that returned the file descriptor.
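
For reference, here is a minimal standalone sketch of the syscall pattern we saw in the straces. This is not taken from the customer's application, and the file path is only an illustration; it just shows that a read() on a descriptor opened O_WRONLY fails with EBADF:

/* Sketch of the pattern from the straces: open write-only, then
 * try to read from the same fd. The read() fails with EBADF
 * because the descriptor was not opened for reading.
 * The path below is illustrative only. */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    char buf[16];

    /* Open write-only, as the application did. */
    int fd = open("/tmp/ebadf-demo", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Reading from a write-only descriptor fails with EBADF. */
    ssize_t n = read(fd, buf, sizeof(buf));
    if (n < 0)
        printf("read failed: %s\n", strerror(errno));

    close(fd);
    return 0;
}

Running this prints "read failed: Bad file descriptor" (EBADF), the same errno that is filling the logs.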

That analysis led us to believe this is not a gluster bug. However, the customer was not able to isolate the issue to one application, and it continues to happen, which is heavily impacting production. Most recently the errors were seen while a cluster job was running tar and gzip. We need to determine whether this is an issue within gluster so that we can mitigate the problem.


Version-Release number of selected component (if applicable):

glusterfs-3.8.4-44.el7rhgs.x86_64 on the server
glusterfs-3.8.4-52.el7_4.x86_64 on the client


Additional info:

We have collected sosreports and straces which are available in collab-shell. I will post the details in a private comment. Please let me know if there is anything else I can provide that would help.

