Description of problem:

FUSE clients hit a ping timeout on all connections to the bricks in the cluster. After that, operations keep returning ENOTCONN, or commands run from a shell on the mount hang. The FUSE clients also do not appear to reconnect to the bricks after detecting this failure, i.e. 'ss' and 'netstat' show no connections from the container to those bricks.

The container uses host-based networking and is based on Docker, and it intermittently loses its network interface.

A fresh FUSE mount of the same volume from the same client container works cleanly.

Version-Release number of selected component (if applicable):

RHGS 3.1 bits running in a CentOS container on a CoreOS base.

How reproducible:

Quite reproducible. It takes some time (roughly 2-3 hours) to reproduce, but it occurs quite consistently in the customer environment.

Steps to Reproduce:

In the customer setup, the test is as simple as setting up the environment described above and running fio on the FUSE mount; after some time the problem manifests itself.

<<These steps seem to be the relevant ones at present, short of recreating the entire customer environment in house>>
NOTE: These steps are not tested.

- Bricks on different containers across different nodes (could even be the same node)
- Client on a different container on a different node
- Some client operations ongoing (simple operations are fine, e.g. a simple fio workload)
- The client container needs to *lose* its network and react with a ping timeout
- In the customer environment the logs show that after the first ping timeout the client recovers after 30 minutes; very soon after that another ping timeout occurs, and the client does not recover from that point on.
- The network *loss* on the client could be induced with iptables: block all traffic to and from the client until the ping timeout is observed in the logs against all bricks, then remove the iptables filter, and repeat this exercise a few times.

Actual results:

After some time the FUSE mount hangs and shows no evidence of connections being made to the bricks.

Expected results:

The FUSE client should recover from the ping timeout and the network failure once the network is up and running again, and reconnect to the bricks.
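The iptables step above could be scripted roughly as follows. This is only a sketch and, like the steps themselves, untested against a real cluster. It makes several assumptions that are not part of the report: it drops traffic per brick address rather than all client traffic, it waits a fixed number of seconds instead of watching the client log for the ping timeout, and the function names (iptables_rules, run, flap_network) and the example addresses are hypothetical. It defaults to a dry run that only prints the commands; actually applying them needs root on the client.

```python
import subprocess
import time


def iptables_rules(brick_ips, action):
    """Return iptables argv lists that insert ("-I") or delete ("-D") DROP rules."""
    if action not in ("-I", "-D"):
        raise ValueError("action must be '-I' (insert) or '-D' (delete)")
    rules = []
    for ip in brick_ips:
        # Drop packets arriving from the brick and packets sent to it.
        rules.append(["iptables", action, "INPUT", "-s", ip, "-j", "DROP"])
        rules.append(["iptables", action, "OUTPUT", "-d", ip, "-j", "DROP"])
    return rules


def run(rules, dry_run=True):
    """Print each command; execute it only when dry_run is False (needs root)."""
    for argv in rules:
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


def flap_network(brick_ips, outage_seconds, cycles, dry_run=True):
    """Block and then unblock brick traffic, repeated `cycles` times."""
    for _ in range(cycles):
        run(iptables_rules(brick_ips, "-I"), dry_run)
        if not dry_run:
            # Assumption: a fixed wait stands in for "until the ping timeout
            # is seen in the logs"; it must outlast the configured timeout.
            time.sleep(outage_seconds)
        run(iptables_rules(brick_ips, "-D"), dry_run)


if __name__ == "__main__":
    # Hypothetical brick addresses (documentation range), dry run only.
    flap_network(["192.0.2.10", "192.0.2.11"], outage_seconds=120, cycles=3)
```

Running the file as-is prints the insert and delete commands for three cycles without touching the firewall; passing dry_run=False applies them.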
https://code.engineering.redhat.com/gerrit/#/c/59832/
Ignore previous link. Correct one is below. https://code.engineering.redhat.com/gerrit/#/c/60584/1
Can we get the doc text for this?
Updated the doc text as requested.
Verified as fixed in RHGS container image rhgs-server-rhel7:3.1.2-6 (which has glusterfs-3.7.5-16). Repeated the steps described in comment #0. The FUSE client hang is no longer observed.
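One way to check for the symptom from comment #0 (no connections to the bricks visible in 'ss') is to parse the output of `ss -tn` on the client. The sketch below is an illustration only and is not how the verification above was done: the helper names and the sample output are made up, and it assumes the usual `ss -tn` column layout (State, Recv-Q, Send-Q, local address:port, peer address:port).

```python
def established_peers(ss_output):
    """Return the set of peer IPs that have an ESTAB row in `ss -tn` output."""
    peers = set()
    for line in ss_output.splitlines():
        fields = line.split()
        # Data rows look like: State Recv-Q Send-Q Local:Port Peer:Port
        if len(fields) < 5 or fields[0] != "ESTAB":
            continue
        host, _, _port = fields[4].rpartition(":")
        peers.add(host.strip("[]"))  # strip brackets around IPv6 addresses
    return peers


def missing_bricks(ss_output, brick_ips):
    """Return the brick IPs that have no established connection."""
    return sorted(set(brick_ips) - established_peers(ss_output))


# Made-up sample output; the addresses and ports are illustrative only.
SAMPLE = """\
State  Recv-Q Send-Q Local Address:Port   Peer Address:Port
ESTAB  0      0      192.0.2.50:1021      192.0.2.10:49152
"""

if __name__ == "__main__":
    print(missing_bricks(SAMPLE, ["192.0.2.10", "192.0.2.11"]))
```

An empty result after the network is restored would indicate that the client has reconnected to every listed brick.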
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-0193.html
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days