Description of problem:
=====================
I brought down one brick of each replica pair in the hot tier and found that a file that was already being promoted/demoted did not proceed any further. It stayed stuck in that state for more than half an hour. In an ideal scenario (where all bricks were up), the same file took about 3-4 minutes to migrate.

[root@dhcp37-202 ~]# gluster v status nagvol
Status of volume: nagvol
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Hot Bricks:
Brick 10.70.37.120:/rhs/brick7/nagvol_hot   49156     0          Y       32513
Brick 10.70.37.60:/rhs/brick7/nagvol_hot    N/A       N/A        N       N/A
Brick 10.70.37.69:/rhs/brick7/nagvol_hot    49156     0          Y       32442
Brick 10.70.37.101:/rhs/brick7/nagvol_hot   N/A       N/A        N       N/A
Brick 10.70.35.163:/rhs/brick7/nagvol_hot   49156     0          Y       617
Brick 10.70.35.173:/rhs/brick7/nagvol_hot   N/A       N/A        N       N/A
Brick 10.70.35.232:/rhs/brick7/nagvol_hot   49156     0          Y       32361
Brick 10.70.35.176:/rhs/brick7/nagvol_hot   N/A       N/A        N       N/A
Brick 10.70.35.222:/rhs/brick7/nagvol_hot   49155     0          Y       22713
Brick 10.70.35.155:/rhs/brick7/nagvol_hot   N/A       N/A        N       N/A
Brick 10.70.37.195:/rhs/brick7/nagvol_hot   N/A       N/A        N       N/A
Brick 10.70.37.202:/rhs/brick7/nagvol_hot   49156     0          Y       26275
Cold Bricks:
Brick 10.70.37.202:/rhs/brick1/nagvol       49152     0          Y       16950
Brick 10.70.37.195:/rhs/brick1/nagvol       49152     0          Y       16702
Brick 10.70.35.155:/rhs/brick1/nagvol       49152     0          Y       13578
Brick 10.70.35.222:/rhs/brick1/nagvol       49152     0          Y       13546
Brick 10.70.35.108:/rhs/brick1/nagvol       49152     0          Y       4675
Brick 10.70.35.44:/rhs/brick1/nagvol        49152     0          Y       12288
Brick 10.70.35.89:/rhs/brick1/nagvol        49152     0          Y       2668
Brick 10.70.35.231:/rhs/brick1/nagvol       49152     0          Y       22810
Brick 10.70.35.176:/rhs/brick1/nagvol       49152     0          Y       22781
Brick 10.70.35.232:/rhs/brick1/nagvol       49152     0          Y       22783
Brick 10.70.35.173:/rhs/brick1/nagvol       49152     0          Y       22795
Brick 10.70.35.163:/rhs/brick1/nagvol       49152     0          Y       22805
Brick 10.70.37.101:/rhs/brick1/nagvol       49152     0          Y       22847
Brick 10.70.37.69:/rhs/brick1/nagvol        49152     0          Y       22847
Brick 10.70.37.60:/rhs/brick1/nagvol        49152     0          Y       22895
Brick 10.70.37.120:/rhs/brick1/nagvol       49152     0          Y       22916
Brick 10.70.37.202:/rhs/brick2/nagvol       49153     0          Y       16969
Brick 10.70.37.195:/rhs/brick2/nagvol       49153     0          Y       16721
Brick 10.70.35.155:/rhs/brick2/nagvol       49153     0          Y       13597
Brick 10.70.35.222:/rhs/brick2/nagvol       49153     0          Y       13565
Brick 10.70.35.108:/rhs/brick2/nagvol       49153     0          Y       4694
Brick 10.70.35.44:/rhs/brick2/nagvol        49153     0          Y       12307
Brick 10.70.35.89:/rhs/brick2/nagvol        49153     0          Y       2683
Brick 10.70.35.231:/rhs/brick2/nagvol       49153     0          Y       22829
NFS Server on localhost                     2049      0          Y       20158
Self-heal Daemon on localhost               N/A       N/A        Y       20166
Quota Daemon on localhost                   N/A       N/A        Y       20174
NFS Server on 10.70.37.195                  2049      0          Y       19012
Self-heal Daemon on 10.70.37.195            N/A       N/A        Y       19020
Quota Daemon on 10.70.37.195                N/A       N/A        Y       19028
NFS Server on 10.70.37.69                   2049      0          Y       26129
Self-heal Daemon on 10.70.37.69             N/A       N/A        Y       26137
Quota Daemon on 10.70.37.69                 N/A       N/A        Y       26145
NFS Server on 10.70.35.231                  2049      0          Y       25739
Self-heal Daemon on 10.70.35.231            N/A       N/A        Y       25747
Quota Daemon on 10.70.35.231                N/A       N/A        Y       25755
NFS Server on 10.70.37.101                  2049      0          Y       6099
Self-heal Daemon on 10.70.37.101            N/A       N/A        Y       6107
Quota Daemon on 10.70.37.101                N/A       N/A        Y       6115
NFS Server on 10.70.37.120                  2049      0          Y       26317
Self-heal Daemon on 10.70.37.120            N/A       N/A        Y       26325
Quota Daemon on 10.70.37.120                N/A       N/A        Y       26333
NFS Server on 10.70.37.60                   2049      0          Y       31342
Self-heal Daemon on 10.70.37.60             N/A       N/A        Y       31350
Quota Daemon on 10.70.37.60                 N/A       N/A        Y       31358
NFS Server on 10.70.35.232                  2049      0          Y       26640
Self-heal Daemon on 10.70.35.232            N/A       N/A        Y       26648
Quota Daemon on 10.70.35.232                N/A       N/A        Y       26656
NFS Server on 10.70.35.163                  2049      0          Y       26759
Self-heal Daemon on 10.70.35.163            N/A       N/A        Y       26767
Quota Daemon on 10.70.35.163                N/A       N/A        Y       26775
NFS Server on 10.70.35.155                  2049      0          Y       16144
Self-heal Daemon on 10.70.35.155            N/A       N/A        Y       16152
Quota Daemon on 10.70.35.155                N/A       N/A        Y       16160
NFS Server on 10.70.35.44                   2049      0          Y       15159
Self-heal Daemon on 10.70.35.44             N/A       N/A        Y       15167
Quota Daemon on 10.70.35.44                 N/A       N/A        Y       15175
NFS Server on 10.70.35.108                  2049      0          Y       7435
Self-heal Daemon on 10.70.35.108            N/A       N/A        Y       7443
Quota Daemon on 10.70.35.108                N/A       N/A        Y       7451
NFS Server on 10.70.35.173                  2049      0          Y       26912
Self-heal Daemon on 10.70.35.173            N/A       N/A        Y       26920
Quota Daemon on 10.70.35.173                N/A       N/A        Y       26928
NFS Server on 10.70.35.176                  2049      0          Y       26492
Self-heal Daemon on 10.70.35.176            N/A       N/A        Y       26500
Quota Daemon on 10.70.35.176                N/A       N/A        Y       26508
NFS Server on 10.70.35.222                  2049      0          Y       16478
Self-heal Daemon on 10.70.35.222            N/A       N/A        Y       16486
Quota Daemon on 10.70.35.222                N/A       N/A        Y       16494
NFS Server on 10.70.35.89                   2049      0          Y       6458
Self-heal Daemon on 10.70.35.89             N/A       N/A        Y       6466
Quota Daemon on 10.70.35.89                 N/A       N/A        Y       6474

Task Status of Volume nagvol
------------------------------------------------------------------------------
Task                 : Tier migration
ID                   : 0870550a-70ba-4cd1-98da-b456059bd6cc
Status               : in progress

Version-Release number of selected component (if applicable):
=============
3.7.5-17

Steps to Reproduce:
1. On the 16-node setup, have promotes/demotes happening.
2. Observe a file that is being demoted. Ideally, pick a big file of a few GB, as it takes time to migrate.
3. While the demote is in progress, bring down one brick of each hot replica pair.

The demote got stuck at that point.

IOs being done: Nothing much

Note: I was still able to access the file, but I do not know whether it held the data I expected, that is, whether it was corrupt or not. The file was created by the dd command with random input, so reading it with cat does not help in assessing its integrity.

sosreports can be found on the server nodes.
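A checksum taken before the migration and compared with one taken afterwards would answer the integrity question that cat cannot. The following is only a minimal sketch of that comparison, assuming the file is readable at some path on a client mount; the helper names `file_md5` and `same_content` are made up for illustration and are not part of any Gluster tooling.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in fixed-size chunks.

    Chunked reads keep memory use flat even for multi-GB dd files.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns b"" at EOF.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True when both files hash to the same MD5 digest."""
    return file_md5(path_a) == file_md5(path_b)
```

In practice the digest would be recorded right after the dd run and compared with the digest of the same file once the migration finishes or stalls; a mismatch would point to corruption rather than a mere stall.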
Refer to this node:

[root@dhcp37-101 glusterfs]# gluster v status nagvol
Status of volume: nagvol
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Hot Bricks:
Brick 10.70.37.120:/rhs/brick7/nagvol_hot   49156     0          Y       32513
Brick 10.70.37.60:/rhs/brick7/nagvol_hot    N/A       N/A        N       N/A
Brick 10.70.37.69:/rhs/brick7/nagvol_hot    49156     0          Y       32442
Brick 10.70.37.101:/rhs/brick7/nagvol_hot   N/A       N/A        N       N/A
Brick 10.70.35.163:/rhs/brick7/nagvol_hot   49156     0          Y       617
Brick 10.70.35.173:/rhs/brick7/nagvol_hot   N/A       N/A        N       N/A
Brick 10.70.35.232:/rhs/brick7/nagvol_hot   49156     0          Y       32361
Brick 10.70.35.176:/rhs/brick7/nagvol_hot   N/A       N/A        N       N/A
Brick 10.70.35.222:/rhs/brick7/nagvol_hot   49155     0          Y       22713
Brick 10.70.35.155:/rhs/brick7/nagvol_hot   N/A       N/A        N       N/A
Brick 10.70.37.195:/rhs/brick7/nagvol_hot   N/A       N/A        N       N/A
Brick 10.70.37.202:/rhs/brick7/nagvol_hot   49156     0          Y       26275
Cold Bricks:
Brick 10.70.37.202:/rhs/brick1/nagvol       49152     0          Y       16950
Brick 10.70.37.195:/rhs/brick1/nagvol       49152     0          Y       16702
Brick 10.70.35.155:/rhs/brick1/nagvol       49152     0          Y       13578
Brick 10.70.35.222:/rhs/brick1/nagvol       49152     0          Y       13546
Brick 10.70.35.108:/rhs/brick1/nagvol       49152     0          Y       4675
Brick 10.70.35.44:/rhs/brick1/nagvol        49152     0          Y       12288
Brick 10.70.35.89:/rhs/brick1/nagvol        49152     0          Y       2668
Brick 10.70.35.231:/rhs/brick1/nagvol       49152     0          Y       22810
Brick 10.70.35.176:/rhs/brick1/nagvol       49152     0          Y       22781
Brick 10.70.35.232:/rhs/brick1/nagvol       49152     0          Y       22783
Brick 10.70.35.173:/rhs/brick1/nagvol       49152     0          Y       22795
Brick 10.70.35.163:/rhs/brick1/nagvol       49152     0          Y       22805
Brick 10.70.37.101:/rhs/brick1/nagvol       49152     0          Y       22847
Brick 10.70.37.69:/rhs/brick1/nagvol        49152     0          Y       22847
Brick 10.70.37.60:/rhs/brick1/nagvol        49152     0          Y       22895
Brick 10.70.37.120:/rhs/brick1/nagvol       49152     0          Y       22916
Brick 10.70.37.202:/rhs/brick2/nagvol       49153     0          Y       16969
Brick 10.70.37.195:/rhs/brick2/nagvol       49153     0          Y       16721
Brick 10.70.35.155:/rhs/brick2/nagvol       49153     0          Y       13597
Brick 10.70.35.222:/rhs/brick2/nagvol       49153     0          Y       13565
Brick 10.70.35.108:/rhs/brick2/nagvol       49153     0          Y       4694
Brick 10.70.35.44:/rhs/brick2/nagvol        49153     0          Y       12307
Brick 10.70.35.89:/rhs/brick2/nagvol        49153     0          Y       2683
Brick 10.70.35.231:/rhs/brick2/nagvol       49153     0          Y       22829
NFS Server on localhost                     2049      0          Y       6099
Self-heal Daemon on localhost               N/A       N/A        Y       6107
Quota Daemon on localhost                   N/A       N/A        Y       6115
NFS Server on 10.70.37.69                   2049      0          Y       26129
Self-heal Daemon on 10.70.37.69             N/A       N/A        Y       26137
Quota Daemon on 10.70.37.69                 N/A       N/A        Y       26145
NFS Server on 10.70.37.195                  2049      0          Y       19012
Self-heal Daemon on 10.70.37.195            N/A       N/A        Y       19020
Quota Daemon on 10.70.37.195                N/A       N/A        Y       19028
NFS Server on 10.70.37.60                   2049      0          Y       31342
Self-heal Daemon on 10.70.37.60             N/A       N/A        Y       31350
Quota Daemon on 10.70.37.60                 N/A       N/A        Y       31358
NFS Server on 10.70.37.120                  2049      0          Y       26317
Self-heal Daemon on 10.70.37.120            N/A       N/A        Y       26325
Quota Daemon on 10.70.37.120                N/A       N/A        Y       26333
NFS Server on dhcp37-202.lab.eng.blr.redhat
.com                                        2049      0          Y       20158
Self-heal Daemon on dhcp37-202.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       20166
Quota Daemon on dhcp37-202.lab.eng.blr.redh
at.com                                      N/A       N/A        Y       20174
NFS Server on 10.70.35.108                  2049      0          Y       7435
Self-heal Daemon on 10.70.35.108            N/A       N/A        Y       7443
Quota Daemon on 10.70.35.108                N/A       N/A        Y       7451
NFS Server on 10.70.35.232                  2049      0          Y       26640
Self-heal Daemon on 10.70.35.232            N/A       N/A        Y       26648
Quota Daemon on 10.70.35.232                N/A       N/A        Y       26656
NFS Server on 10.70.35.176                  2049      0          Y       26492
Self-heal Daemon on 10.70.35.176            N/A       N/A        Y       26500
Quota Daemon on 10.70.35.176                N/A       N/A        Y       26508
NFS Server on 10.70.35.173                  2049      0          Y       26912
Self-heal Daemon on 10.70.35.173            N/A       N/A        Y       26920
Quota Daemon on 10.70.35.173                N/A       N/A        Y       26928
NFS Server on 10.70.35.44                   2049      0          Y       15159
Self-heal Daemon on 10.70.35.44             N/A       N/A        Y       15167
Quota Daemon on 10.70.35.44                 N/A       N/A        Y       15175
NFS Server on 10.70.35.155                  2049      0          Y       16144
Self-heal Daemon on 10.70.35.155            N/A       N/A        Y       16152
Quota Daemon on 10.70.35.155                N/A       N/A        Y       16160
NFS Server on 10.70.35.89                   2049      0          Y       6458
Self-heal Daemon on 10.70.35.89             N/A       N/A        Y       6466
Quota Daemon on 10.70.35.89                 N/A       N/A        Y       6474
NFS Server on 10.70.35.231                  2049      0          Y       25739
Self-heal Daemon on 10.70.35.231            N/A       N/A        Y       25747
Quota Daemon on 10.70.35.231                N/A       N/A        Y       25755
NFS Server on 10.70.35.222                  2049      0          Y       16478
Self-heal Daemon on 10.70.35.222            N/A       N/A        Y       16486
Quota Daemon on 10.70.35.222                N/A       N/A        Y       16494
NFS Server on 10.70.35.163                  2049      0          Y       26759
Self-heal Daemon on 10.70.35.163            N/A       N/A        Y       26767
Quota Daemon on 10.70.35.163                N/A       N/A        Y       26775

Task Status of Volume nagvol
------------------------------------------------------------------------------
Task                 : Tier migration
ID                   : 0870550a-70ba-4cd1-98da-b456059bd6cc
Status               : in progress

[root@dhcp37-101 glusterfs]# less /var/log/glusterfs/nagvol-tier.log
[root@dhcp37-101 glusterfs]#
[root@dhcp37-101 glusterfs]#
[root@dhcp37-101 glusterfs]# md5sum ddfile.21
md5sum: ddfile.21: No such file or directory
[root@dhcp37-101 glusterfs]# cd /rhs/brick1/nagvol/ddcommand
[root@dhcp37-101 ddcommand]# md5sum ddfile.21
md5sum: ddfile.21: No such file or directory
[root@dhcp37-101 ddcommand]# ls
ddfile.1    ddfile.13  ddfile.18  ddfile.22  ddfile.26  ddfile.3  ddlogme2.txt
ddfile.10   ddfile.15  ddfile.2   ddfile.23  ddfile.27  ddfile.8  ddlogme.txt
ddfile.102  ddfile.16  ddfile.20  ddfile.25  ddfile.29  ddfile.9
[root@dhcp37-101 ddcommand]# ls -l
total 2250020
-rw-r--r--. 2 root root 192000000 Jan 27 11:01 ddfile.1
-rw-r--r--. 2 root root 192000000 Jan 27 11:24 ddfile.10
-rw-r--r--. 2 root root 192000000 Jan 25 19:38 ddfile.102
-rw-r-Sr-T. 2 root root 192000000 Jan 27 11:32 ddfile.13
-rw-r--r--. 2 root root 192000000 Jan 27 11:37 ddfile.15
---------T. 2 root root         0 Jan 27 11:37 ddfile.16
---------T. 2 root root         0 Jan 27 11:43 ddfile.18
---------T. 2 root root         0 Jan 27 11:01 ddfile.2
-rw-r--r--. 2 root root 192000000 Jan 27 11:52 ddfile.20
-rw-r--r--. 2 root root 192000000 Jan 27 11:57 ddfile.22
---------T. 2 root root         0 Jan 27 11:57 ddfile.23
-rw-r--r--. 2 root root 192000000 Jan 27 12:05 ddfile.25
-rw-r--r--. 2 root root 192000000 Jan 27 12:08 ddfile.26
---------T. 2 root root         0 Jan 27 12:08 ddfile.27
-rw-r--r--. 2 root root 192000000 Jan 27 12:15 ddfile.29
---------T. 2 root root         0 Jan 27 11:03 ddfile.3
-rw-r--r--. 2 root root 192000000 Jan 27 11:19 ddfile.8
-rw-r--r--. 2 root root 192000000 Jan 27 11:22 ddfile.9
---------T. 2 root root         0 Jan 27 10:58 ddlogme2.txt
-rw-r--r--. 2 root root       512 Jan 25 20:51 ddlogme.txt
[root@dhcp37-101 ddcommand]# ls -l ^C
[root@dhcp37-101 ddcommand]# ls -l
total 2250020
-rw-r--r--. 2 root root 192000000 Jan 27 11:01 ddfile.1
-rw-r--r--. 2 root root 192000000 Jan 27 11:24 ddfile.10
-rw-r--r--. 2 root root 192000000 Jan 25 19:38 ddfile.102
-rw-r-Sr-T. 2 root root 192000000 Jan 27 11:32 ddfile.13
-rw-r--r--. 2 root root 192000000 Jan 27 11:37 ddfile.15
---------T. 2 root root         0 Jan 27 11:37 ddfile.16
---------T. 2 root root         0 Jan 27 11:43 ddfile.18
---------T. 2 root root         0 Jan 27 11:01 ddfile.2
-rw-r--r--. 2 root root 192000000 Jan 27 11:52 ddfile.20
-rw-r--r--. 2 root root 192000000 Jan 27 11:57 ddfile.22
---------T. 2 root root         0 Jan 27 11:57 ddfile.23
-rw-r--r--. 2 root root 192000000 Jan 27 12:05 ddfile.25
-rw-r--r--. 2 root root 192000000 Jan 27 12:08 ddfile.26
---------T. 2 root root         0 Jan 27 12:08 ddfile.27
-rw-r--r--. 2 root root 192000000 Jan 27 12:15 ddfile.29
---------T. 2 root root         0 Jan 27 11:03 ddfile.3
-rw-r--r--. 2 root root 192000000 Jan 27 11:19 ddfile.8
-rw-r--r--. 2 root root 192000000 Jan 27 11:22 ddfile.9
---------T. 2 root root         0 Jan 27 10:58 ddlogme2.txt
-rw-r--r--. 2 root root       512 Jan 25 20:51 ddlogme.txt
[root@dhcp37-101 ddcommand]# ls -lrth /rhs/brick*/nagvol*/ddcommand/ddfile.21
-rw-r-Sr-T. 2 root root 1.5G Jan 27 11:54 /rhs/brick7/nagvol_hot/ddcommand/ddfile.21
[root@dhcp37-101 ddcommand]# ls -l
total 2250020
-rw-r--r--. 2 root root 192000000 Jan 27 11:01 ddfile.1
-rw-r--r--. 2 root root 192000000 Jan 27 11:24 ddfile.10
-rw-r--r--. 2 root root 192000000 Jan 25 19:38 ddfile.102
-rw-r-Sr-T. 2 root root 192000000 Jan 27 11:32 ddfile.13
-rw-r--r--. 2 root root 192000000 Jan 27 11:37 ddfile.15
---------T. 2 root root         0 Jan 27 11:37 ddfile.16
---------T. 2 root root         0 Jan 27 11:43 ddfile.18
---------T. 2 root root         0 Jan 27 11:01 ddfile.2
-rw-r--r--. 2 root root 192000000 Jan 27 11:52 ddfile.20
-rw-r--r--. 2 root root 192000000 Jan 27 11:57 ddfile.22
---------T. 2 root root         0 Jan 27 11:57 ddfile.23
-rw-r--r--. 2 root root 192000000 Jan 27 12:05 ddfile.25
-rw-r--r--. 2 root root 192000000 Jan 27 12:08 ddfile.26
---------T. 2 root root         0 Jan 27 12:08 ddfile.27
-rw-r--r--. 2 root root 192000000 Jan 27 12:15 ddfile.29
---------T. 2 root root         0 Jan 27 11:03 ddfile.3
-rw-r--r--. 2 root root 192000000 Jan 27 11:19 ddfile.8
-rw-r--r--. 2 root root 192000000 Jan 27 11:22 ddfile.9
---------T. 2 root root         0 Jan 27 10:58 ddlogme2.txt
-rw-r--r--. 2 root root       512 Jan 25 20:51 ddlogme.txt
[root@dhcp37-101 ddcommand]# ls -lrth /rhs/brick*/nagvol*/ddcommand/ddfile.21
-rw-r-Sr-T. 2 root root 1.5G Jan 27 11:54 /rhs/brick7/nagvol_hot/ddcommand/ddfile.21
[root@dhcp37-101 ddcommand]# pwd
/rhs/brick1/nagvol/ddcommand
[root@dhcp37-101 ddcommand]# ls -lrth /rhs/brick*/nagvol*/ddcommand/ddfile.21
-rw-r-Sr-T. 2 root root 1.5G Jan 27 11:54 /rhs/brick7/nagvol_hot/ddcommand/ddfile.21
[root@dhcp37-101 ddcommand]#
[root@dhcp37-101 ddcommand]#
[root@dhcp37-101 ddcommand]# cd /rhs/brick7/nagvol_hot/ddcommand/
[root@dhcp37-101 ddcommand]# md5sum ddfile.21
0d8ea8feb6d830fed9ee0703d73aea1f  ddfile.21
[root@dhcp37-101 ddcommand]# #sosreport
[root@dhcp37-101 ddcommand]# sosreport

sosreport (version 3.2)

This command will collect diagnostic and configuration information from
this Red Hat Enterprise Linux system and installed applications.

An archive containing the collected information will be generated in
/var/tmp and may be provided to a Red Hat support representative.

Any information provided to Red Hat will be treated in accordance with
the published support policies at:

  https://access.redhat.com/support/

The generated archive may contain data considered sensitive and its
content should be reviewed by the originating organization before being
passed to any third party.

No changes will be made to system configuration.

Press ENTER to continue, or CTRL-C to quit.

Please enter your first initial and last name [dhcp37-101.lab.eng.blr.redhat.com]:
Please enter the case id that you are generating this report for []: brickdown_file_demote_stuck

 Setting up archive ...
 Setting up plugins ...
 Running plugins. Please wait ...

  Running 89/89: yum...
Creating compressed archive...

Your sosreport has been generated and saved in:
  /var/tmp/sosreport-dhcp37-101.lab.eng.blr.redhat.com.brickdownfiledemotestuck-20160129190259.tar.xz

The checksum is: 98401ea77116695635c5ff51cad3b808

Please send this file to your support representative.

[root@dhcp37-101 ddcommand]# exit
logout
Connection to 10.70.37.101 closed.
bash-4.3$
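With this many rows, the bricks that are down are easy to miss by eye. As a rough sketch only, the `gluster v status` text can be filtered for brick rows whose Online column is `N`. The function name `offline_bricks` is hypothetical, and the sample rows are copied from the output in this report. Rows whose name is wrapped across two lines by the CLI are not handled here.

```python
def offline_bricks(status_text):
    """Return the host:path of every brick row whose Online column is 'N'.

    Assumes one brick per line in the form
    'Brick <host>:<path> <tcp port> <rdma port> <online> <pid>'.
    """
    down = []
    for line in status_text.splitlines():
        fields = line.split()
        # Six whitespace-separated fields; index 4 is the Online column.
        if len(fields) == 6 and fields[0] == "Brick" and fields[4] == "N":
            down.append(fields[1])
    return down


# Sample rows taken from the status output in this report.
SAMPLE = """\
Hot Bricks:
Brick 10.70.37.120:/rhs/brick7/nagvol_hot   49156     0          Y       32513
Brick 10.70.37.60:/rhs/brick7/nagvol_hot    N/A       N/A        N       N/A
Brick 10.70.37.69:/rhs/brick7/nagvol_hot    49156     0          Y       32442
Brick 10.70.37.101:/rhs/brick7/nagvol_hot   N/A       N/A        N       N/A
NFS Server on localhost                     2049      0          Y       6099
"""

if __name__ == "__main__":
    for brick in offline_bricks(SAMPLE):
        print(brick)
```

Run against the full listing, this should report the seven hot bricks shown with `N` in the Online column.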
(In reply to nchilaka from comment #0)
> Description of problem:
> =====================
> I brought down one brick of each replica pair in hot tier, and found that a
> file which was already promoting/demotion didn't proceed further. It was
> stuck in that state for more than half-an-hour.

Most likely the disconnect was not detected until a call-bail happened (note that the call-bail timeout is 30 minutes). This looks more like an rpc/transport issue. If the logs are still available, we can grep through the log files to check whether call_bail happened for any fops.
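The log check suggested above could be done with a plain substring search. A minimal sketch follows, assuming the logs live under /var/log/glusterfs (the directory of the nagvol-tier.log opened in the earlier session) and that a bail-out is logged with the string `call_bail`, as the comment implies. The function name `find_call_bail` is invented for illustration, and no particular log line format is assumed.

```python
import glob


def find_call_bail(log_paths, needle="call_bail"):
    """Return (path, line number, line) for every log line containing needle."""
    hits = []
    for path in log_paths:
        # errors="replace" keeps the scan going past any undecodable bytes.
        with open(path, "r", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if needle in line:
                    hits.append((path, number, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    # The directory is an assumption; adjust it to where the logs were saved.
    for path, number, line in find_call_bail(glob.glob("/var/log/glusterfs/*.log")):
        print(f"{path}:{number}: {line}")
```

An empty result would suggest that no call bailed out in the captured window, which would weaken the call-bail explanation; any hit would give the timestamp to line up against the time the bricks were brought down.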
Thank you for your bug report. We are no longer working on any improvements for Tier. This bug will be set to CLOSED WONTFIX to reflect this. Please reopen if the RFE is deemed critical.