Description of problem:
Metrics were not functional. The customer checked the pod logs for the various components and saw the following at the start of the log:

ERROR 18:35:35 Exception in thread Thread[CompactionExcecutor:4,1,main] org.apache.cassandra.io.FSReadError: java.io.IOException: Stale file handle

Unfortunately the customer has log rollover set up, so the logs I am attaching shortly might not contain everything. Please let me know and we will attempt to recreate the behavior and capture all of the logs separately.

Version-Release number of selected component (if applicable):
All of the images are 3.3.0
The 'stale file handle' error is most likely a filesystem issue with NFS. I am assigning this to the storage component, as they probably have more expertise on the best way to handle it.
Outside of Kubernetes/OpenShift, Cassandra expects (and is designed) to use non-shared local disks for each node. If you're going to run Cassandra in Kubernetes/OpenShift with network disks, I'd recommend you still use non-shared (ReadWriteOnce - like iSCSI, Ceph RBD, FC, EBS, etc.) storage and not shared (ReadWriteMany - GlusterFS, NFS, etc.) storage. Net net, don't use NFS.
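For illustration, a PersistentVolumeClaim requesting non-shared storage would look something like this (the claim name and size are placeholders, not taken from the customer's environment):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cassandra-data          # placeholder name
spec:
  accessModes:
    - ReadWriteOnce             # non-shared: iSCSI, Ceph RBD, FC, EBS, etc.
  resources:
    requests:
      storage: 100Gi            # placeholder size
```

A ReadWriteMany claim (backed by GlusterFS, NFS, etc.) is what the advice above is steering away from for Cassandra.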
(In reply to Steve Watt from comment #6)
> Outside of Kubernetes/OpenShift, Cassandra expects (and is designed) to use
> non-shared local disks for each node. If you're going to run Cassandra in
> Kubernetes/OpenShift with network disks, I'd recommend you still use
> non-shared (ReadWriteOnce - like iSCSI, Ceph RBD, FC, EBS, etc.) storage
> and not shared (ReadWriteMany - GlusterFS, NFS, etc.) storage. Net net,
> don't use NFS.

Yes, and it is also recommended to run this on a dedicated machine with SSDs only. It is of course recommended to run it in an environment where it will have the best performance. Excuse me if I am wrong, but I believe the 'Stale file handle' issue here is a filesystem issue and is independent of the application using it.
On an ordinary filesystem, removing an open file just removes that file's name, but leaves the file allocated on disk and still usable until the last user of the file closes it. On NFS, if a client removes a file, then that file really goes away, and users on other clients receive ESTALE on further attempts to do IO to it.

So I don't understand exactly what's happening here, but it's likely this is expected behavior. Possible workarounds might be:

- stop removing the file out from under the user - maybe you can wait until you're sure nobody is using it? or
- teach the application accessing the file to recover from ESTALE somehow (e.g. if it's writing to a log file that was moved to another filesystem and deleted, maybe it just needs to close, open a new file, and retry the write).
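As a rough sketch of the second workaround: in Java, ESTALE typically surfaces as an IOException whose message contains "Stale file handle", so an application-level writer could catch that, reopen the file, and retry once. The `LogSink` interface and class names below are hypothetical, purely for illustration - this is not how Cassandra itself handles it.

```java
import java.io.IOException;

// Hypothetical abstraction over a file being written to (e.g. a log file).
interface LogSink {
    void write(String line) throws IOException;
    void reopen() throws IOException;
}

// Sketch of ESTALE recovery: on a stale-handle error, close/reopen the
// underlying file and retry the write once; rethrow anything else.
class EstaleRetryingWriter {
    private final LogSink sink;

    EstaleRetryingWriter(LogSink sink) {
        this.sink = sink;
    }

    void writeLine(String line) throws IOException {
        try {
            sink.write(line);
        } catch (IOException e) {
            String msg = e.getMessage();
            if (msg != null && msg.contains("Stale file handle")) {
                sink.reopen();     // drop the stale handle, get a fresh one
                sink.write(line);  // retry exactly once
            } else {
                throw e;           // unrelated IO error: propagate
            }
        }
    }
}
```

This only papers over the symptom, of course; the real fix is to stop deleting files that are still in use, or to avoid NFS entirely as recommended above.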
Sorry, this issue kind of fell through the cracks. Did unmounting and remounting the filesystem work in this case? Has this issue occurred more than once?
@Matt, the last I heard from the customer was that they were looking to move away from NFS. I got no confirmation about any of it.
I will close this issue as 'INSUFFICIENT_DATA'; there is not much for us to go on here, and we don't know if something else was mucking around with NFS outside of Cassandra. If this issue can be reproduced, please re-open and explain what was done to cause it to happen.