Bug 1644322
Summary: | flooding log with "glusterfs-fuse: read from /dev/fuse returned -1 (Operation not permitted)" | ||
---|---|---|---|
Product: | [Community] GlusterFS | Reporter: | Christian Lohmaier <lohmaier+rhbz> |
Component: | geo-replication | Assignee: | Csaba Henk <csaba> |
Status: | CLOSED NEXTRELEASE | QA Contact: | |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | mainline | CC: | atumball, bugs, lohmaier+rhbz, pasik, sacharya, sunkumar |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2019-08-08 06:11:45 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description — Christian Lohmaier, 2018-10-30 14:06:43 UTC
> Please confirm one thing: does the glusterfs client producing the "read from /dev/fuse returned -1 (Operation not permitted)" flood recover and get back to a normal operational state? I wonder whether it is a transient overloaded state in the kernel or a non-recoverable faulty state. (As far as I understand you, it should be the former; please let me know if my understanding is correct.) And if so, is there anything else that can be said about the circumstances? How often does it manage to recover, how long does the faulty state last, and is there anything you can observe about the system state when it sets in, while it holds, and when it ceases?

Yes, there is a possibility of it recovering, but never when it manages to fill up the ~60 GB of free disk space on /var first, which unfortunately is the case more often than not. If it fills the disk, the other geo-replication sessions also go to a faulty state. So if it cannot recover within 10-15 minutes, it likely won't (as the disk is filled up with the log spam). I'd say we hit it about once a week.

Nothing special about the system state as far as I can tell; at least no ramp-up of resource usage. If there is anything, it comes and goes in a flash. There is no effect on the rest of the system, apart from /var being full and the other geo-replication sessions suffering from that. The geo-replication sessions where it occurred last time were in history changelog mode, but I am not sure whether that is a coincidence.

I think bug #1643716 might be related, as the problem seems more likely to trigger after a failure caused by that, i.e. when the geo-replication session keeps relaunching a glusterfs mount and switches from Faulty to Initializing. But that might as well be a red herring, since the recovery method used so far is to truncate the logs. At least that was the case the last time, where I didn't throw away the whole log.
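Since the reporter's only recovery method so far is truncating the logs by hand, a logrotate rule capping them could serve as a stopgap until the fix lands. A minimal sketch; the log path, size, and rotation count here are illustrative assumptions, not taken from the report — adjust them to the actual layout:

```
/var/log/glusterfs/geo-replication/*.log {
    size 100M       # rotate as soon as a log exceeds 100 MB
    rotate 2        # keep at most two old copies
    compress
    copytruncate    # truncate in place; gluster keeps the log fd open
    missingok
}
```

`copytruncate` matters here: the gluster processes hold the log file open, so a plain rename-based rotation would leave them writing to the rotated file and /var would still fill up.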
The usage pattern on the geo-replicated volume is as follows: rsnapshot creates backups from other hosts via rsync, and those backups are then rotated using hardlinks in the directories .sync, daily.[0-6], and weekly.[0-3]. That is, it rsyncs to .sync, then rotates:

```shell
mv daily.6 _delete.$pid
mv daily.5 daily.6
# (...)
cp -al .sync daily.0
rm -r _delete.$pid
```

Thus most of the files are hardlinks. Unfortunately I cannot offer a 100% reliable way to trigger the problem. HTH.

REVIEW: https://review.gluster.org/22494 (fuse: rate limit reading from fuse device upon receiving EPERM) posted (#1) for review on master by Csaba Henk

REVIEW: https://review.gluster.org/22494 (fuse: rate limit reading from fuse device upon receiving EPERM) merged (#7) on master by Amar Tumballi

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.
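The merged patch (review 22494) addresses the flood by rate-limiting reads from /dev/fuse after an EPERM, rather than retrying and logging at full speed. The actual fix lives in glusterfs's C fuse bridge; the sketch below only illustrates the general idea in Python, and all names and backoff values in it are illustrative assumptions, not glusterfs's: on EPERM, wait with a bounded exponential backoff before retrying, and reset the backoff after a successful read.

```python
import errno
import time

# Illustrative backoff parameters (not the values used by glusterfs).
BASE_DELAY = 0.01   # seconds to wait after the first EPERM
MAX_DELAY = 1.0     # cap so the reader stays responsive once the kernel recovers

def next_delay(current):
    """Double the wait after each consecutive EPERM, capped at MAX_DELAY."""
    if current == 0:
        return BASE_DELAY
    return min(current * 2, MAX_DELAY)

def read_loop(read_fn, handle_fn, max_iterations=100):
    """Read requests via read_fn, rate-limiting retries when it raises EPERM.

    read_fn stands in for reading from /dev/fuse (returns a request, or None
    at end of input); handle_fn processes one request. On EPERM we back off
    instead of spinning, which is what flooded the log in this bug.
    """
    delay = 0
    for _ in range(max_iterations):
        try:
            req = read_fn()
        except OSError as e:
            if e.errno == errno.EPERM:
                delay = next_delay(delay)
                time.sleep(delay)  # bounded backoff instead of a tight retry loop
                continue
            raise  # any other error is not rate-limited here
        delay = 0  # successful read: reset the backoff
        if req is None:
            break
        handle_fn(req)
```

With a reader that fails twice with EPERM and then yields one request, the loop sleeps briefly twice and still delivers the request, instead of logging an error for every failed read.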