Bug 1396258

Summary: Metrics not functional with Cassandra showing error message: ERROR 18:35:35 Exception in thread Thread[CompactionExecutor:4,1,main] org.apache.cassandra.io.FSReadError: java.io.IOException: Stale file handle
Product: OpenShift Container Platform
Reporter: Eric Jones <erjones>
Component: Hawkular
Assignee: Matt Wringe <mwringe>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Peng Li <penli>
Severity: low
Docs Contact:
Priority: medium
Version: 3.3.0
CC: aos-bugs, bchilds, bfields, eboyd, eparis, erjones, hchen, mwringe, swatt, swhiteho
Target Milestone: ---
Keywords: Unconfirmed
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-02-21 22:40:51 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description Eric Jones 2016-11-17 19:33:39 UTC
Description of problem:
Metrics were not functional. The customer checked the pod logs for the various components and saw the following at the start of the log:

ERROR 18:35:35 Exception in thread Thread[CompactionExecutor:4,1,main] org.apache.cassandra.io.FSReadError: java.io.IOException: Stale file handle

Unfortunately the customer does have log rollover set up, so the logs I am attaching shortly might not contain everything. If anything is missing, please let me know and we will attempt to recreate the behavior and capture the full logs separately.

Version-Release number of selected component (if applicable):
All of the images are 3.3.0

Comment 4 Matt Wringe 2016-11-17 21:29:13 UTC
The 'stale file handle' is most likely a filesystem issue with NFS.

I am assigning this to the storage component as they probably have more expertise on the best way to handle this.

Comment 6 Steve Watt 2016-11-22 15:24:07 UTC
Outside of Kubernetes/OpenShift, Cassandra expects (and is designed) to use non-shared local disks for each node. If you're going to run Cassandra in Kubernetes/OpenShift with network disks, I'd recommend you still use non-shared (ReadWriteOnce: iSCSI, Ceph RBD, FC, EBS, etc.) storage and not shared (ReadWriteMany: GlusterFS, NFS, etc.) storage. Net net, don't use NFS.

Comment 8 Matt Wringe 2016-11-22 15:48:37 UTC
(In reply to Steve Watt from comment #6)
> Outside of Kubernetes/OpenShift, Cassandra expects (and is designed) to use
> non-shared local disks for each node. If you're going to run Cassandra in
> Kubernetes/OpenShift with network disks, I'd recommend you still use
> non-shared (ReadWriteOnce: iSCSI, Ceph RBD, FC, EBS, etc.) storage and not
> shared (ReadWriteMany: GlusterFS, NFS, etc.) storage. Net net, don't use
> NFS.

Yes, and it is also recommended to run this on a dedicated machine with SSDs only. It is, of course, recommended to run it in an environment where it will have the best performance.

Excuse me if I am wrong, but I believe the 'Stale file handle' issue here is a filesystem issue and is independent of the application using it.

Comment 11 J. Bruce Fields 2016-11-22 22:59:56 UTC
On an ordinary filesystem removing an open file just removes that file's name, but leaves the file allocated on disk and still usable until the last user of the file closes it.  On NFS, if a client removes a file, then that file really goes away, and users on other clients receive ESTALE on further attempts to do IO to it.
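
To illustrate the difference, here is a minimal Python sketch (mine, not from the report) of the local-filesystem behavior; performing step 3 from a different NFS client after the remove is what produces the ESTALE:

    import os
    import tempfile

    path = os.path.join(tempfile.mkdtemp(), "data.log")
    with open(path, "w") as w:
        w.write("hello\n")

    f = open(path)   # 1. hold the file open
    os.unlink(path)  # 2. remove the file's only name
    print(f.read())  # 3. succeeds on a local filesystem: the inode survives
                     #    until the last close; issued from a different NFS
                     #    client, this IO would instead fail with OSError
                     #    errno.ESTALE ("Stale file handle")
    f.close()        # 4. only now is the on-disk storage released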

So I don't understand exactly what's happening here, but it's likely this is expected behavior; possible workarounds might be:

- stop removing the file out from under the user (maybe you can wait until you're sure nobody is using it), or
- teach the application accessing the file to recover from ESTALE somehow (e.g. if it's writing to a log file that was moved to another filesystem and deleted, maybe it just needs to close, open a new file, and retry the write); a sketch of that pattern follows this list.
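
For the second option, a hypothetical sketch (mine; the affected component here is Cassandra, which is Java, so this Python only shows the general pattern) of a writer that reopens and retries on ESTALE:

    import errno
    import os

    def append_line(path, line, retries=1):
        # Append a line, reopening the file if our handle has gone stale
        # (e.g. the file was removed or replaced via another NFS client).
        for attempt in range(retries + 1):
            fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
            try:
                os.write(fd, line.encode())
                return
            except OSError as e:
                if e.errno != errno.ESTALE or attempt == retries:
                    raise
                # Stale handle: close it (in finally) and loop around to
                # reopen a fresh file at the same path.
            finally:
                os.close(fd)

Whether this kind of recovery is feasible depends on the application; nobody is going to retrofit it into Cassandra's storage engine, which is another reason the advice in comment #6 to avoid NFS stands.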

Comment 13 Matt Wringe 2017-02-21 19:52:40 UTC
Sorry, this issue kind of fell through the cracks.

Did unmounting and remounting the filesystem work in this case?

Has this issue occurred more than once?

Comment 14 Eric Jones 2017-02-21 22:29:00 UTC
@Matt, the last I heard from the customer was that they were looking to move away from NFS.

I never got confirmation on either question.

Comment 15 Matt Wringe 2017-02-21 22:40:51 UTC
I will close this issue as 'INSUFFICIENT_DATA': there is not much for us to go on here, and we don't know whether something else was mucking around with NFS outside of Cassandra.

If this issue can be reproduced, please re-open it and describe the steps that cause it to happen.