Bug 1540282

Summary: [GSS] EBADF errors filling up /var on all storage nodes
Product: [Red Hat Storage] Red Hat Gluster Storage
Component: core
Version: rhgs-3.3
Hardware: All
OS: Linux
Status: CLOSED NOTABUG
Severity: urgent
Priority: high
Reporter: Pan Ousley <pousley>
Assignee: Ravishankar N <ravishankar>
QA Contact: Rahul Hinduja <rhinduja>
Docs Contact:
CC: atumball, kdhananj, pousley, ravishankar, rhs-bugs, storage-qa-internal
Target Milestone: ---
Target Release: ---
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-02-26 13:27:17 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Pan Ousley 2018-01-30 16:59:06 UTC
Description of problem: 

EBADF errors are filling up /var (via the brick logs) on all storage nodes almost every day, and the same errors are appearing in the client logs. This has been happening since around Nov 27, 2017, which is when the customer patched their systems.

The volumes are all 3x2=6 distributed-replicate volumes. I will post the gluster v info in a private comment. In this configuration, multiple bricks run from the same XFS filesystem. The customer is aware that this is not a recommended configuration, but I am not sure whether it is contributing to this issue.

We collected straces, and our analysis led us to believe that the issue is with the application. We saw a file being opened with the O_WRONLY flag (open for writing only) and the application then attempting to read from it, without the file being closed in the interim. We were able to catch one of these invalid read requests and map it back to the open request that returned the file descriptor.
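For reference, below is a minimal standalone sketch of the syscall pattern we believe we saw in the straces (the path and file name are made up for illustration). A read() on a descriptor that was opened O_WRONLY fails with EBADF even on a local filesystem, with no gluster involved:

/* Minimal sketch: read() on a write-only descriptor fails with EBADF.
 * The file path here is purely illustrative. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char buf[16];
    int fd = open("/tmp/ebadf-demo.txt", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* The descriptor was opened write-only, so this read is invalid. */
    ssize_t n = read(fd, buf, sizeof(buf));
    if (n < 0)
        printf("read failed: %s (errno=%d)\n", strerror(errno), errno);
        /* Expected: read failed: Bad file descriptor (errno=9) */

    close(fd);
    return 0;
}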

This led us to believe that this is not a gluster bug. However, the customer has not been able to isolate the issue to a single application, and it continues to happen, which is heavily impacting production. Most recently the errors were seen while a cluster job was running tar and gzip. We need to identify whether this is an issue within gluster so that we can mitigate the problem.


Version-Release number of selected component (if applicable):

glusterfs-3.8.4-44.el7rhgs.x86_64 on the server
glusterfs-3.8.4-52.el7_4.x86_64 on the client


Additional info:

We have collected sosreports and straces, which are available in collab-shell. I will post the details in a private comment. Please let me know if there is anything else I can provide that would help.