Description of problem:
=======================
Tierd is running, but after stopping and starting the volume the tier status on most of the nodes in the cluster is shown as failed.

For example, the tier processes on localhost:

[root@dhcp37-165 glusterfs]# ps aux | grep tier
root     12829 74.6 71.2 4987944 2765140 ?   Ssl  Dec01 2046:18 /usr/sbin/glusterfs -s localhost --volfile-id rebalance/tiervolume --xlator-option *dht.use-readdirp=yes --xlator-option *dht.lookup-unhashed=yes --xlator-option *dht.assert-no-child-down=yes --xlator-option *replicate*.data-self-heal=off --xlator-option *replicate*.metadata-self-heal=off --xlator-option *replicate*.entry-self-heal=off --xlator-option *dht.readdir-optimize=on --xlator-option *tier-dht.xattr-name=trusted.tier.tier-dht --xlator-option *dht.rebalance-cmd=6 --xlator-option *dht.node-uuid=fb12984d-c631-4364-bf4e-aa91f9ea76fb --xlator-option *dht.commit-hash=3001417369 --socket-file /var/run/gluster/gluster-tier-bdb6ee8c-4410-4f0f-8714-8d4a3ff5812c.sock --pid-file /var/lib/glusterd/vols/tiervolume/tier/fb12984d-c631-4364-bf4e-aa91f9ea76fb.pid -l /var/log/glusterfs/tiervolume-tier.log
root     16484  7.8  1.1 1442396   46492 ?   Ssl  00:35    0:35 /usr/sbin/glusterfsd -s 10.70.37.165 --volfile-id tiervolume.10.70.37.165.rhs-brick3-tiervolume_hot -p /var/lib/glusterd/vols/tiervolume/run/10.70.37.165-rhs-brick3-tiervolume_hot.pid -S /var/run/gluster/4f46770e383fab1ee7789ff7a656a342.socket --brick-name /rhs/brick3/tiervolume_hot -l /var/log/glusterfs/bricks/rhs-brick3-tiervolume_hot.log --xlator-option *-posix.glusterd-uuid=fb12984d-c631-4364-bf4e-aa91f9ea76fb --brick-port 49153 --xlator-option tiervolume-server.listen-port=49153
root     16503 47.9  2.3 1577596   89472 ?   Ssl  00:35    3:35 /usr/sbin/glusterfsd -s 10.70.37.165 --volfile-id tiervolume.10.70.37.165.rhs-brick1-tiervolume_ct-disp1 -p /var/lib/glusterd/vols/tiervolume/run/10.70.37.165-rhs-brick1-tiervolume_ct-disp1.pid -S /var/run/gluster/4cdd38c5ea86fe823baaa5dcde1b4b57.socket --brick-name /rhs/brick1/tiervolume_ct-disp1 -l /var/log/glusterfs/bricks/rhs-brick1-tiervolume_ct-disp1.log --xlator-option *-posix.glusterd-uuid=fb12984d-c631-4364-bf4e-aa91f9ea76fb --brick-port 49152 --xlator-option tiervolume-server.listen-port=49152
root     16655  0.0  0.0  112648     956 pts/0 S+ 00:42    0:00 grep --color=auto tier
[root@dhcp37-165 glusterfs]#

There are two glusterfsd processes, one for each brick on this node, and one process for tierd.
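On any node whose status shows failed, the mismatch can be cross-checked by comparing the pid recorded in the tierd pid file (the --pid-file path from the ps output above) against the process table; a minimal shell sketch:

    PIDFILE=/var/lib/glusterd/vols/tiervolume/tier/fb12984d-c631-4364-bf4e-aa91f9ea76fb.pid
    # kill -0 only checks that the pid exists; it does not signal the process
    kill -0 "$(cat $PIDFILE)" && echo "tierd (pid $(cat $PIDFILE)) is still running"
    gluster volume rebal tiervolume status   # ...yet this node is reported as failed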
Tier status is shown as follows:
================================

[root@dhcp37-165 glusterfs]# gluster volume rebal tiervolume status
Node          Rebalanced-files   size     scanned   failures   skipped   status        run time in secs
---------     ----------------   ------   -------   --------   -------   -----------   ----------------
localhost     780                0Bytes   1161977   0          0         failed        164607.00
10.70.37.133  9725               0Bytes   11176     307        0         failed        164423.00
10.70.37.160  0                  0Bytes   0         0          0         in progress   482.00
10.70.37.158  9791               0Bytes   11031     1          0         failed        164370.00
10.70.37.110  606                0Bytes   1194893   0          0         failed        164242.00
10.70.37.155  0                  0Bytes   0         0          0         in progress   482.00
10.70.37.99   833                0Bytes   1673312   0          0         failed        164607.00
10.70.37.88   9790               0Bytes   11524     1          0         failed        164291.00
10.70.37.112  0                  0Bytes   0         0          0         in progress   482.00
10.70.37.199  9839               0Bytes   11836     172        0         failed        164285.00
10.70.37.162  0                  0Bytes   0         0          0         in progress   482.00
10.70.37.87   9885               0Bytes   12501     127        0         failed        164225.00
volume rebalance: tiervolume: success
[root@dhcp37-165 glusterfs]#

Logs report the following:
==========================
[2015-12-02 19:04:26.687972] E [MSGID: 114058] [client-handshake.c:1524:client_query_portmap_cbk] 0-tiervolume-client-13: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2015-12-02 19:04:26.688162] I [MSGID: 114018] [client.c:2042:client_rpc_notify] 0-tiervolume-client-13: disconnected from tiervolume-client-13. Client process will keep trying to connect to glusterd until brick's port is available
[2015-12-02 19:04:26.690100] E [MSGID: 114058] [client-handshake.c:1524:client_query_portmap_cbk] 0-tiervolume-client-15: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2015-12-02 19:04:26.690220] I [MSGID: 114018] [client.c:2042:client_rpc_notify] 0-tiervolume-client-15: disconnected from tiervolume-client-15. Client process will keep trying to connect to glusterd until brick's port is available
[2015-12-02 19:04:26.693118] E [MSGID: 114058] [client-handshake.c:1524:client_query_portmap_cbk] 0-tiervolume-client-19: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2015-12-02 19:04:26.693223] I [MSGID: 114018] [client.c:2042:client_rpc_notify] 0-tiervolume-client-19: disconnected from tiervolume-client-19. Client process will keep trying to connect to glusterd until brick's port is available
[2015-12-02 19:04:26.695809] E [MSGID: 114058] [client-handshake.c:1524:client_query_portmap_cbk] 0-tiervolume-client-7: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2015-12-02 19:04:26.695914] I [MSGID: 114018] [client.c:2042:client_rpc_notify] 0-tiervolume-client-7: disconnected from tiervolume-client-7. Client process will keep trying to connect to glusterd until brick's port is available
[2015-12-02 19:04:26.698311] E [MSGID: 114058] [client-handshake.c:1524:client_query_portmap_cbk] 0-tiervolume-client-1: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2015-12-02 19:04:26.698413] I [MSGID: 114018] [client.c:2042:client_rpc_notify] 0-tiervolume-client-1: disconnected from tiervolume-client-1. Client process will keep trying to connect to glusterd until brick's port is available

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.7.5-8.el7rhgs.x86_64

How reproducible:
=================
1/1

Steps carried:
==============
1. 12 node cluster
2. Hot tier {6x2}, Cold tier {2x(4+2)}
3. Mounted the volume on 7.2, 7.1 and 6.7 clients
4. A huge set of data (148GB) was created on the volume
5. Stopped the volume (no data creation or IO was in progress at this time)
6. Started the volume
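For reference, steps 5 and 6 reduce to the following CLI calls (volume name as in the report); the status command is the same one whose output is shown above:

    gluster volume stop tiervolume       # answer the confirmation prompt; no client IO at this point
    gluster volume start tiervolume
    gluster volume rebal tiervolume status   # most nodes now report "failed"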
Partial RCA: Tier will start only when all of dht's children are up. In this case the tier process started running once the children came up, and it then spawned promotion/demotion threads in the first cycle. During this time the network got disconnected and a child went down. When one child is down, the tier/rebalance process dies only after completing its current thread, but it updates the status to failed immediately. That is why the status already shows failed even though the process is still running; once the spawned thread completes, the process dies. What we have not yet root-caused is why the network connection was interrupted; it might be because of the memory leak issue.
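The child-down/disconnect events can be located in the tier daemon log (the -l path from the ps output in the description); the message strings are the ones quoted in the log excerpt above:

    grep -E "client_query_portmap_cbk|disconnected from tiervolume-client" \
        /var/log/glusterfs/tiervolume-tier.log | tail -n 20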
The readv fails, which disconnects the client from glusterd. readv can fail when the connection is down, which brings the child down and fails the rebalance. As the client is not able to communicate with glusterd, the status is marked as failed.
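To see whether the bricks (the dht children) were actually down at that point, the standard status command can be run on a server node, as the log message itself suggests:

    gluster volume status tiervolume   # the "Online" column shows which brick processes are down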
RCA: When we do a volume stop, we get a child-down event and mark the process as failed. But if a migration is under way, we wait until it finishes. So in the case of the tier daemon, if there is a large number of files to migrate, the migration thread takes some time to finish even after it has been marked for killing; only once it finishes the list is the process killed. If the volume is started again before the old process dies, glusterd tries to start the tier daemon again, but since it is still running, starting the daemon is simply skipped. The process then dies after the migration thread returns.
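A rough way to observe this race from the shell, assuming the same pid-file path as in the description (a hypothetical session, not from the original report):

    OLD_PID=$(cat /var/lib/glusterd/vols/tiervolume/tier/fb12984d-c631-4364-bf4e-aa91f9ea76fb.pid)
    gluster volume stop tiervolume      # answer the confirmation prompt
    gluster volume start tiervolume     # restart before the old tierd exits
    # If the old migration thread is still draining its file list, the old pid
    # is still alive and glusterd skips spawning a fresh tier daemon:
    ps -p "$OLD_PID" -o pid,etime,cmd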
The tiered volume was brought down while it was in the middle of a migration and brought back up before all the files marked for migration had been migrated. Once the volume was back up, the migration continued and the status still remained correct (in progress), so I couldn't reproduce the issue. Still, there was another problem: a file that was being migrated during the volume stop remains in the hot tier and does not move to the cold tier. The md5sum of the file is fine, so there is no issue other than it not getting demoted. The bug is not reproducible.
Here what we need to make sure of is that there is a large number of files to migrate in one cycle:
1) Create a large number of files.
2) Make all of them eligible for promotion or demotion in a single cycle.
3) Just as the cycle starts with that huge list, stop the volume and start it again (see the sketch below).
If that doesn't work, I think we can reproduce it using gdb with a small change in the code.
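A possible way to set this up from the shell, assuming the volume is mounted at /mnt/tiervolume and that the tier frequency options are available on this build (the file count, mount path and values below are illustrative assumptions, not from the report):

    # 1) create a large number of small files on the mount
    for i in $(seq 1 100000); do
        dd if=/dev/zero of=/mnt/tiervolume/file.$i bs=64k count=1 2>/dev/null
    done
    # 2) shorten the demote cycle so the whole set is picked up in one cycle
    gluster volume set tiervolume cluster.tier-demote-frequency 120
    # 3) just as the cycle starts, stop and immediately restart the volume
    gluster volume stop tiervolume      # answer the confirmation prompt
    gluster volume start tiervolume
    gluster volume rebal tiervolume status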
The fix on master: http://review.gluster.org/#/c/13646/
It is in 3.8 through rebase.
As tier is not being actively developed, I'm closing this bug. Feel free to reopen it if necessary.