We have a 5-node Gluster cluster running 4.1.7. The Gluster service runs as a container on each host, with the host's `/srv/brick` directory mounted into it. Several of our volumes are set up as dispersed volumes, and the client for Gluster is a Kubernetes cluster. One of the apps we have running is a devpi server.

After the devpi server had been running for a couple of days, we noticed that the Gluster servers were unresponsive. Trying to ssh to any of the nodes gave an error about too many open files, and we eventually had to reboot all of the servers to recover them. The next day we checked again and saw that the glusterfs process responsible for the devpi volume had 3 million files open (as seen with `sudo lsof -a -p <pid> | wc -l`). Stopping the container did not free the file descriptors; only stopping and starting the volume would release them. However, as soon as devpi started serving files again, the open FD count started rising. I was able to narrow it down to writes to files.

Here are the steps to reproduce:

1. Create a Gluster dispersed volume with a quota:

```
gluster volume create fd-test disperse 5 redundancy 2 srv1:/path srv2:/path srv3:/path srv4:/path srv5:/path
gluster volume quota fd-test enable
gluster volume quota fd-test limit-usage / 1GB
```

2. Mount the volume on a host, and run this simple script inside the Gluster volume:

```
#!/bin/bash
while [ 1 -eq 1 ]
do
  echo "something\n" > file.txt
  sleep 1
done
```

3. From any one of the Gluster nodes, find the PID of the Gluster process for the volume, and count its FDs (here, every 5 seconds):

```
admin@gfs-01:~$ sudo lsof -a -p 11606 | wc -l
26
admin@gfs-01:~$ sudo lsof -a -p 11606 | wc -l
30
admin@gfs-01:~$ sudo lsof -a -p 11606 | wc -l
35
admin@gfs-01:~$ sudo lsof -a -p 11606 | wc -l
40
```

If you take out the `sleep`, the FD count jumps by thousands every second.
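As an aside, once the leak gets into the millions, `lsof` itself becomes slow. A small helper we used for polling (our own hypothetical script, not a Gluster tool) counts FDs by listing `/proc/<pid>/fd` instead, which is much cheaper:

```shell
#!/bin/bash
# Hypothetical helper (not part of Gluster): count a process's open file
# descriptors by listing /proc/<pid>/fd. Linux-only; listing another
# user's process may require root.
fd_count() {
    ls "/proc/$1/fd" 2>/dev/null | wc -l
}

# Example polling loop, equivalent to re-running lsof every 5 seconds:
#   while :; do echo "$(date +%T) $(fd_count 11606)"; sleep 5; done
```

On a leaking brick process the printed count climbs steadily, matching what the `lsof | wc -l` session above shows.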
If you view the actual FDs without the `wc` command, they are almost all the same file:

```
glusterfs 11606 root 1935w REG 8,17 18944 53215266 /srv/brick/fd-test/.glusterfs/2e/4c/2e4c7104-02c4-4ac9-b611-7290938a6e3f
glusterfs 11606 root 1936w REG 8,17 18944 53215266 /srv/brick/fd-test/.glusterfs/2e/4c/2e4c7104-02c4-4ac9-b611-7290938a6e3f
glusterfs 11606 root 1937w REG 8,17 18944 53215266 /srv/brick/fd-test/.glusterfs/2e/4c/2e4c7104-02c4-4ac9-b611-7290938a6e3f
glusterfs 11606 root 1938w REG 8,17 18944 53215266 /srv/brick/fd-test/.glusterfs/2e/4c/2e4c7104-02c4-4ac9-b611-7290938a6e3f
glusterfs 11606 root 1939w REG 8,17 18944 53215266 /srv/brick/fd-test/.glusterfs/2e/4c/2e4c7104-02c4-4ac9-b611-7290938a6e3f
glusterfs 11606 root 1940w REG 8,17 18944 53215266 /srv/brick/fd-test/.glusterfs/2e/4c/2e4c7104-02c4-4ac9-b611-7290938a6e3f
glusterfs 11606 root 1941w REG 8,17 18944 53215266 /srv/brick/fd-test/.glusterfs/2e/4c/2e4c7104-02c4-4ac9-b611-7290938a6e3f
glusterfs 11606 root 1942w REG 8,17 18944 53215266 /srv/brick/fd-test/.glusterfs/2e/4c/2e4c7104-02c4-4ac9-b611-7290938a6e3f
glusterfs 11606 root 1943w REG 8,17 18944 53215266 /srv/brick/fd-test/.glusterfs/2e/4c/2e4c7104-02c4-4ac9-b611-7290938a6e3f
glusterfs 11606 root 1944w REG 8,17 18944 53215266 /srv/brick/fd-test/.glusterfs/2e/4c/2e4c7104-02c4-4ac9-b611-7290938a6e3f
glusterfs 11606 root 1945w REG 8,17 18944 53215266 /srv/brick/fd-test/.glusterfs/2e/4c/2e4c7104-02c4-4ac9-b611-7290938a6e3f
glusterfs 11606 root 1946w REG 8,17 18944 53215266 /srv/brick/fd-test/.glusterfs/2e/4c/2e4c7104-02c4-4ac9-b611-7290938a6e3f
glusterfs 11606 root 1947w REG 8,17 18944 53215266 /srv/brick/fd-test/.glusterfs/2e/4c/2e4c7104-02c4-4ac9-b611-7290938a6e3f
```

The container itself does not see any open FDs; they are visible only on the Gluster host. We tried creating a replicated volume and moving the devpi data to it, and that worked fine without leaving open FDs (a constant 90 open), so the problem appears to be specific to dispersed mode.
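To see at a glance that the leaked descriptors all point at one gfid file, the `lsof` output can be grouped by its NAME column. A sketch of the one-off helper we used (our own script, not a Gluster command):

```shell
# Hypothetical helper: read lsof output on stdin, skip the header line,
# and print "count path" pairs with the most-duplicated file first.
summarize_fds() {
    awk 'NR > 1 { print $NF }' | sort | uniq -c | sort -rn
}

# Usage against the brick PID from the session above:
#   sudo lsof -a -p 11606 | summarize_fds | head
```

In our case the top entry was the single `.glusterfs/2e/4c/...` gfid file with a count in the millions, while a healthy brick shows only single-digit counts per path.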
Hi, I tried to reproduce this issue on the latest master, but could not see any rise in the number of open FDs. There was nothing suspicious on my setup. Could you please try to reproduce it with the latest release of Gluster and let us know if you still see the issue? --- Ashish
We are using the Docker version of Gluster, and the latest version available is 4.1.7; the image was last updated 4 months ago (https://hub.docker.com/r/gluster/gluster-centos/tags). I see there is also a glusterd2-nightly Docker image. I presume this is the version you used? Is it a drop-in replacement for the gluster-centos Docker image?
This bug has been moved to https://github.com/gluster/glusterfs/issues/925 and will be tracked there from now on. Visit the GitHub issue for further details.