Description of problem:
=======================
In a 2 x 2 distributed-replicate volume, simulated a brick crash so that the brick went offline, then tried to replace the crashed brick with a new brick. "replace-brick <old_brick> <new_brick> commit force" failed:

root@mia [Jul-13-2015-19:16:19] >gluster v replace-brick testvol rhsauto038.lab.eng.blr.redhat.com:/bricks/brick0/testvol_brick3 rhsauto038.lab.eng.blr.redhat.com:/bricks/brick1/testvol_brick3 commit force
volume replace-brick: failed: Commit failed on rhsauto038.lab.eng.blr.redhat.com. Please check log file for details.

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.7.1-9.el6rhs.x86_64

How reproducible:
=================
Often

Steps to Reproduce:
===================
1. Create a 2 x 2 distributed-replicate volume and start it.
2. Create a fuse mount and create a few files/dirs.
3. Simulate a brick crash on one of the bricks; the brick process should go offline.
4. Replace the crashed brick with a new brick.

Actual results:
===============
The replace-brick command failed to replace the brick. The volume info gets changed, but staging on the new brick's node fails.

Expected results:
=================
replace-brick on an offline brick should be successful.
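The steps above can be sketched as a shell session. The gluster commands are taken from this report and need a live cluster, so they appear as comments only; the brick "crash" itself is an uncatchable SIGKILL, demonstrated here on a stand-in process (a plain sleep) so the sketch is self-contained. Hostnames, brick paths, and the mount point are illustrative:

```shell
# 1-2. On a real cluster (create/start the 2 x 2 volume, fuse-mount it):
#   gluster volume create testvol replica 2 \
#       rhsauto017.lab.eng.blr.redhat.com:/bricks/brick0/testvol_brick0 ... \
#       rhsauto038.lab.eng.blr.redhat.com:/bricks/brick0/testvol_brick3
#   gluster volume start testvol
#   mount -t glusterfs rhsauto017.lab.eng.blr.redhat.com:/testvol /mnt/testvol
#   (then create a few files/dirs on the mount)

# 3. Simulate the brick crash: SIGKILL the brick process so it dies with no
#    cleanup. A `sleep` stands in for the real glusterfsd brick process here.
sleep 300 &
brick_pid=$!
kill -9 "$brick_pid"            # immediate termination, no signal handler runs
wait "$brick_pid" 2>/dev/null
echo "brick process exited with status $?"   # 137 = 128 + SIGKILL(9)

# 4. Replace the dead brick with a new one (real cluster):
#   gluster volume replace-brick testvol \
#       rhsauto038.lab.eng.blr.redhat.com:/bricks/brick0/testvol_brick3 \
#       rhsauto038.lab.eng.blr.redhat.com:/bricks/brick1/testvol_brick3 commit force
```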
Additional info:
================

root@mia [Jul-13-2015-18:49:16] >gluster v info

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 5ddff23f-2f07-42e4-91da-c08bbb8f0e7c
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: rhsauto017.lab.eng.blr.redhat.com:/bricks/brick0/testvol_brick0
Brick2: rhsauto020.lab.eng.blr.redhat.com:/bricks/brick0/testvol_brick1
Brick3: rhsauto021.lab.eng.blr.redhat.com:/bricks/brick0/testvol_brick2
Brick4: rhsauto038.lab.eng.blr.redhat.com:/bricks/brick0/testvol_brick3
Options Reconfigured:
performance.readdir-ahead: on

root@mia [Jul-13-2015-18:49:21] >
root@mia [Jul-13-2015-18:49:26] >gluster v quota testvol enable
volume quota : success
root@mia [Jul-13-2015-18:49:40] >gluster volume quota testvol limit-usage / 500GB
volume quota : success
root@mia [Jul-13-2015-18:49:51] >gluster v info

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 5ddff23f-2f07-42e4-91da-c08bbb8f0e7c
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: rhsauto017.lab.eng.blr.redhat.com:/bricks/brick0/testvol_brick0
Brick2: rhsauto020.lab.eng.blr.redhat.com:/bricks/brick0/testvol_brick1
Brick3: rhsauto021.lab.eng.blr.redhat.com:/bricks/brick0/testvol_brick2
Brick4: rhsauto038.lab.eng.blr.redhat.com:/bricks/brick0/testvol_brick3
Options Reconfigured:
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
performance.readdir-ahead: on

root@mia [Jul-13-2015-18:49:55] >gluster v status
Status of volume: testvol
Gluster process                                                          TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick rhsauto017.lab.eng.blr.redhat.com:/bricks/brick0/testvol_brick0    49152     0          Y       13755
Brick rhsauto020.lab.eng.blr.redhat.com:/bricks/brick0/testvol_brick1    49152     0          Y       13186
Brick rhsauto021.lab.eng.blr.redhat.com:/bricks/brick0/testvol_brick2    49152     0          Y       29286
Brick rhsauto038.lab.eng.blr.redhat.com:/bricks/brick0/testvol_brick3    49152     0          Y       2370
NFS Server on localhost                                                  2049      0          Y       19711
Self-heal Daemon on localhost                                            N/A       N/A        Y       19717
Quota Daemon on localhost                                                N/A       N/A        Y       21195
NFS Server on rhsauto021.lab.eng.blr.redhat.com                          2049      0          Y       29307
Self-heal Daemon on rhsauto021.lab.eng.blr.redhat.com                    N/A       N/A        Y       29314
Quota Daemon on rhsauto021.lab.eng.blr.redhat.com                        N/A       N/A        Y       30737
NFS Server on rhsauto017.lab.eng.blr.redhat.com                          2049      0          Y       13777
Self-heal Daemon on rhsauto017.lab.eng.blr.redhat.com                    N/A       N/A        Y       13783
Quota Daemon on rhsauto017.lab.eng.blr.redhat.com                        N/A       N/A        Y       15221
NFS Server on rhsauto020.lab.eng.blr.redhat.com                          2049      0          Y       13208
Self-heal Daemon on rhsauto020.lab.eng.blr.redhat.com                    N/A       N/A        Y       13215
Quota Daemon on rhsauto020.lab.eng.blr.redhat.com                        N/A       N/A        Y       14663
NFS Server on rhsauto038.lab.eng.blr.redhat.com                          2049      0          Y       2390
Self-heal Daemon on rhsauto038.lab.eng.blr.redhat.com                    N/A       N/A        Y       2399
Quota Daemon on rhsauto038.lab.eng.blr.redhat.com                        N/A       N/A        Y       3827

Task Status of Volume testvol
------------------------------------------------------------------------------
There are no active volume tasks

root@mia [Jul-13-2015-18:50:00] >gluster volume set testvol features.bitrot on
volume set: success
root@mia [Jul-13-2015-18:50:10] >gluster volume set testvol features.uss on
volume set: success
root@mia [Jul-13-2015-18:50:20] >gluster v info

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 5ddff23f-2f07-42e4-91da-c08bbb8f0e7c
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: rhsauto017.lab.eng.blr.redhat.com:/bricks/brick0/testvol_brick0
Brick2: rhsauto020.lab.eng.blr.redhat.com:/bricks/brick0/testvol_brick1
Brick3: rhsauto021.lab.eng.blr.redhat.com:/bricks/brick0/testvol_brick2
Brick4: rhsauto038.lab.eng.blr.redhat.com:/bricks/brick0/testvol_brick3
Options Reconfigured:
features.uss: on
features.bitrot: on
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
performance.readdir-ahead: on

root@mia [Jul-13-2015-18:50:32] >gluster v status
Status of volume: testvol
Gluster process                                                          TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick rhsauto017.lab.eng.blr.redhat.com:/bricks/brick0/testvol_brick0    49152     0          Y       13755
Brick rhsauto020.lab.eng.blr.redhat.com:/bricks/brick0/testvol_brick1    49152     0          Y       13186
Brick rhsauto021.lab.eng.blr.redhat.com:/bricks/brick0/testvol_brick2    49152     0          Y       29286
Brick rhsauto038.lab.eng.blr.redhat.com:/bricks/brick0/testvol_brick3    49152     0          Y       2370
Snapshot Daemon on localhost                                             49154     0          Y       21312
NFS Server on localhost                                                  2049      0          Y       21329
Self-heal Daemon on localhost                                            N/A       N/A        Y       19717
Quota Daemon on localhost                                                N/A       N/A        Y       21195
Bitrot Daemon on localhost                                               N/A       N/A        N       N/A
Scrubber Daemon on localhost                                             N/A       N/A        N       N/A
Snapshot Daemon on rhsauto017.lab.eng.blr.redhat.com                     49153     0          Y       15299
NFS Server on rhsauto017.lab.eng.blr.redhat.com                          2049      0          Y       15307
Self-heal Daemon on rhsauto017.lab.eng.blr.redhat.com                    N/A       N/A        Y       13783
Quota Daemon on rhsauto017.lab.eng.blr.redhat.com                        N/A       N/A        Y       15221
Bitrot Daemon on rhsauto017.lab.eng.blr.redhat.com                       N/A       N/A        N       N/A
Scrubber Daemon on rhsauto017.lab.eng.blr.redhat.com                     N/A       N/A        N       N/A
Snapshot Daemon on rhsauto038.lab.eng.blr.redhat.com                     49153     0          Y       3892
NFS Server on rhsauto038.lab.eng.blr.redhat.com                          2049      0          Y       3900
Self-heal Daemon on rhsauto038.lab.eng.blr.redhat.com                    N/A       N/A        Y       2399
Quota Daemon on rhsauto038.lab.eng.blr.redhat.com                        N/A       N/A        Y       3827
Bitrot Daemon on rhsauto038.lab.eng.blr.redhat.com                       N/A       N/A        N       N/A
Scrubber Daemon on rhsauto038.lab.eng.blr.redhat.com                     N/A       N/A        N       N/A
Snapshot Daemon on rhsauto021.lab.eng.blr.redhat.com                     49153     0          Y       30811
NFS Server on rhsauto021.lab.eng.blr.redhat.com                          2049      0          Y       30819
Self-heal Daemon on rhsauto021.lab.eng.blr.redhat.com                    N/A       N/A        Y       29314
Quota Daemon on rhsauto021.lab.eng.blr.redhat.com                        N/A       N/A        Y       30737
Bitrot Daemon on rhsauto021.lab.eng.blr.redhat.com                       N/A       N/A        N       N/A
Scrubber Daemon on rhsauto021.lab.eng.blr.redhat.com                     N/A       N/A        N       N/A
Snapshot Daemon on rhsauto020.lab.eng.blr.redhat.com                     49153     0          Y       14732
NFS Server on rhsauto020.lab.eng.blr.redhat.com                          2049      0          Y       14740
Self-heal Daemon on rhsauto020.lab.eng.blr.redhat.com                    N/A       N/A        Y       13215
Quota Daemon on rhsauto020.lab.eng.blr.redhat.com                        N/A       N/A        Y       14663
Bitrot Daemon on rhsauto020.lab.eng.blr.redhat.com                       N/A       N/A        N       N/A
Scrubber Daemon on rhsauto020.lab.eng.blr.redhat.com                     N/A       N/A        N       N/A

Task Status of Volume testvol
------------------------------------------------------------------------------
There are no active volume tasks

root@mia [Jul-13-2015-18:50:35] >
root@mia [Jul-13-2015-18:50:37] >
root@mia [Jul-13-2015-18:51:52] >
root@mia [Jul-13-2015-18:53:30] >gluster v status
Status of volume: testvol
Gluster process                                                          TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick rhsauto017.lab.eng.blr.redhat.com:/bricks/brick0/testvol_brick0    N/A       N/A        N       N/A
Brick rhsauto020.lab.eng.blr.redhat.com:/bricks/brick0/testvol_brick1    49152     0          Y       13186
Brick rhsauto021.lab.eng.blr.redhat.com:/bricks/brick0/testvol_brick2    49152     0          Y       29286
Brick rhsauto038.lab.eng.blr.redhat.com:/bricks/brick0/testvol_brick3    49152     0          Y       2370
Snapshot Daemon on localhost                                             49154     0          Y       21312
NFS Server on localhost                                                  2049      0          Y       21329
Self-heal Daemon on localhost                                            N/A       N/A        Y       19717
Quota Daemon on localhost                                                N/A       N/A        Y       21195
Bitrot Daemon on localhost                                               N/A       N/A        N       N/A
Scrubber Daemon on localhost                                             N/A       N/A        N       N/A
Snapshot Daemon on rhsauto020.lab.eng.blr.redhat.com                     49153     0          Y       14732
NFS Server on rhsauto020.lab.eng.blr.redhat.com                          2049      0          Y       14740
Self-heal Daemon on rhsauto020.lab.eng.blr.redhat.com                    N/A       N/A        Y       13215
Quota Daemon on rhsauto020.lab.eng.blr.redhat.com                        N/A       N/A        Y       14663
Bitrot Daemon on rhsauto020.lab.eng.blr.redhat.com                       N/A       N/A        N       N/A
Scrubber Daemon on rhsauto020.lab.eng.blr.redhat.com                     N/A       N/A        N       N/A
Snapshot Daemon on rhsauto017.lab.eng.blr.redhat.com                     49153     0          Y       15299
NFS Server on rhsauto017.lab.eng.blr.redhat.com                          2049      0          Y       15307
Self-heal Daemon on rhsauto017.lab.eng.blr.redhat.com                    N/A       N/A        Y       13783
Quota Daemon on rhsauto017.lab.eng.blr.redhat.com                        N/A       N/A        Y       15221
Bitrot Daemon on rhsauto017.lab.eng.blr.redhat.com                       N/A       N/A        N       N/A
Scrubber Daemon on rhsauto017.lab.eng.blr.redhat.com                     N/A       N/A        N       N/A
Snapshot Daemon on rhsauto021.lab.eng.blr.redhat.com                     49153     0          Y       30811
NFS Server on rhsauto021.lab.eng.blr.redhat.com                          2049      0          Y       30819
Self-heal Daemon on rhsauto021.lab.eng.blr.redhat.com                    N/A       N/A        Y       29314
Quota Daemon on rhsauto021.lab.eng.blr.redhat.com                        N/A       N/A        Y       30737
Bitrot Daemon on rhsauto021.lab.eng.blr.redhat.com                       N/A       N/A        N       N/A
Scrubber Daemon on rhsauto021.lab.eng.blr.redhat.com                     N/A       N/A        N       N/A
Snapshot Daemon on rhsauto038.lab.eng.blr.redhat.com                     49153     0          Y       3892
NFS Server on rhsauto038.lab.eng.blr.redhat.com                          2049      0          Y       3900
Self-heal Daemon on rhsauto038.lab.eng.blr.redhat.com                    N/A       N/A        Y       2399
Quota Daemon on rhsauto038.lab.eng.blr.redhat.com                        N/A       N/A        Y       3827
Bitrot Daemon on rhsauto038.lab.eng.blr.redhat.com                       N/A       N/A        N       N/A
Scrubber Daemon on rhsauto038.lab.eng.blr.redhat.com                     N/A       N/A        N       N/A

Task Status of Volume testvol
------------------------------------------------------------------------------
There are no active volume tasks

root@mia [Jul-13-2015-18:53:31] >
root@mia [Jul-13-2015-18:53:38] >
root@mia [Jul-13-2015-18:56:23] >gluster v replace-brick testvol rhsauto017.lab.eng.blr.redhat.com:/bricks/brick0/testvol_brick0 rhsauto017.lab.eng.blr.redhat.com:/bricks/brick2/testvol_brick0 commit force
volume replace-brick: failed: Commit failed on rhsauto017.lab.eng.blr.redhat.com. Please check log file for details.
root@mia [Jul-13-2015-18:56:44] >
root@mia [Jul-13-2015-18:57:25] >
root@mia [Jul-13-2015-18:57:25] >gluster v info

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 5ddff23f-2f07-42e4-91da-c08bbb8f0e7c
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: rhsauto017.lab.eng.blr.redhat.com:/bricks/brick2/testvol_brick0
Brick2: rhsauto020.lab.eng.blr.redhat.com:/bricks/brick0/testvol_brick1
Brick3: rhsauto021.lab.eng.blr.redhat.com:/bricks/brick0/testvol_brick2
Brick4: rhsauto038.lab.eng.blr.redhat.com:/bricks/brick0/testvol_brick3
Options Reconfigured:
features.uss: on
features.bitrot: on
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
performance.readdir-ahead: on

root@mia [Jul-13-2015-18:57:28] >gluster v status
Status of volume: testvol
Gluster process                                                          TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick rhsauto017.lab.eng.blr.redhat.com:/bricks/brick0/testvol_brick0    N/A       N/A        N       N/A
Brick rhsauto020.lab.eng.blr.redhat.com:/bricks/brick0/testvol_brick1    49152     0          Y       13186
Brick rhsauto021.lab.eng.blr.redhat.com:/bricks/brick0/testvol_brick2    49152     0          Y       29286
Brick rhsauto038.lab.eng.blr.redhat.com:/bricks/brick0/testvol_brick3    49152     0          Y       2370
Snapshot Daemon on localhost                                             49154     0          Y       21312
NFS Server on localhost                                                  2049      0          Y       21558
Self-heal Daemon on localhost                                            N/A       N/A        Y       21565
Quota Daemon on localhost                                                N/A       N/A        Y       21571
Bitrot Daemon on localhost                                               N/A       N/A        N       N/A
Scrubber Daemon on localhost                                             N/A       N/A        N       N/A
Snapshot Daemon on rhsauto017.lab.eng.blr.redhat.com                     49153     0          Y       15299
NFS Server on rhsauto017.lab.eng.blr.redhat.com                          2049      0          Y       15307
Self-heal Daemon on rhsauto017.lab.eng.blr.redhat.com                    N/A       N/A        Y       13783
Quota Daemon on rhsauto017.lab.eng.blr.redhat.com                        N/A       N/A        Y       15221
Bitrot Daemon on rhsauto017.lab.eng.blr.redhat.com                       N/A       N/A        N       N/A
Scrubber Daemon on rhsauto017.lab.eng.blr.redhat.com                     N/A       N/A        N       N/A
Snapshot Daemon on rhsauto020.lab.eng.blr.redhat.com                     49153     0          Y       14732
NFS Server on rhsauto020.lab.eng.blr.redhat.com                          2049      0          Y       14876
Self-heal Daemon on rhsauto020.lab.eng.blr.redhat.com                    N/A       N/A        Y       14881
Quota Daemon on rhsauto020.lab.eng.blr.redhat.com                        N/A       N/A        Y       14893
Bitrot Daemon on rhsauto020.lab.eng.blr.redhat.com                       N/A       N/A        Y       14899
Scrubber Daemon on rhsauto020.lab.eng.blr.redhat.com                     N/A       N/A        Y       14905
Snapshot Daemon on rhsauto038.lab.eng.blr.redhat.com                     49153     0          Y       3892
NFS Server on rhsauto038.lab.eng.blr.redhat.com                          2049      0          Y       4040
Self-heal Daemon on rhsauto038.lab.eng.blr.redhat.com                    N/A       N/A        Y       4051
Quota Daemon on rhsauto038.lab.eng.blr.redhat.com                        N/A       N/A        Y       4050
Bitrot Daemon on rhsauto038.lab.eng.blr.redhat.com                       N/A       N/A        Y       4062
Scrubber Daemon on rhsauto038.lab.eng.blr.redhat.com                     N/A       N/A        Y       4068
Snapshot Daemon on rhsauto021.lab.eng.blr.redhat.com                     49153     0          Y       30811
NFS Server on rhsauto021.lab.eng.blr.redhat.com                          2049      0          Y       30951
Self-heal Daemon on rhsauto021.lab.eng.blr.redhat.com                    N/A       N/A        Y       30959
Quota Daemon on rhsauto021.lab.eng.blr.redhat.com                        N/A       N/A        Y       30966
Bitrot Daemon on rhsauto021.lab.eng.blr.redhat.com                       N/A       N/A        Y       30975
Scrubber Daemon on rhsauto021.lab.eng.blr.redhat.com                     N/A       N/A        Y       30982

Task Status of Volume testvol
------------------------------------------------------------------------------
There are no active volume tasks

root@mia [Jul-13-2015-18:59:31] >
root@mia [Jul-13-2015-19:01:45] >
root@mia [Jul-13-2015-19:01:45] >gluster v info

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 5ddff23f-2f07-42e4-91da-c08bbb8f0e7c
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: rhsauto017.lab.eng.blr.redhat.com:/bricks/brick2/testvol_brick0
Brick2: rhsauto020.lab.eng.blr.redhat.com:/bricks/brick0/testvol_brick1
Brick3: rhsauto021.lab.eng.blr.redhat.com:/bricks/brick0/testvol_brick2
Brick4: rhsauto038.lab.eng.blr.redhat.com:/bricks/brick0/testvol_brick3
Options Reconfigured:
features.uss: on
features.bitrot: on
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
performance.readdir-ahead: on
I notice that the BitD and Scrubber daemons are not up on all the nodes in the cluster. Also, could you upload sosreports or the glusterd log files from all the nodes?
The bug was caused by patch http://review.gluster.org/10101 (commit f9ebf5ab3cbec423f75e64c25385125d4b65e31b). On the downstream rhgs-3.1 branch I reverted this patch to check whether the failure still occurs: with the revert, I could replace the brick successfully even when it was down, and I saw the failure only after applying the mentioned patch back to the branch. I'm RCA'ing it; meanwhile, adding a need-info on the author to confirm whether the suspicion is correct.
SOS Report : http://rhsqe-repo.lab.eng.blr.redhat.com/bugs_necessary_info/1242543/
(In reply to comment 2)
>> I notice that the BitD and Scrubber are not UP in all the nodes in the cluster.
>> And also could you upload - sosreports or glusterd log files on all the nodes ?

Bitrot was enabled with:

root@mia [Jul-13-2015-18:50:00] >gluster volume set testvol features.bitrot on
volume set: success

gluster does not support enabling bitrot through "volume set"; the valid command is "gluster volume bitrot <VOLNAME> enable/disable". Enabling it via "gluster volume set testvol features.bitrot on" can leave the bitrot and scrubber daemons crashed, which would explain why they are not up on all nodes. The supported bitrot commands are:

# gluster v help | grep bitrot
volume bitrot <VOLNAME> {enable|disable} |
volume bitrot <volname> scrub-throttle {lazy|normal|aggressive} |
volume bitrot <volname> scrub-frequency {hourly|daily|weekly|biweekly|monthly} |
volume bitrot <volname> scrub {pause|resume} - Bitrot translator specific operation.
For more information about the bitrot commands, type 'man gluster'.

Will continue RCA of the replace-brick failure.
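Concretely, for the volume in this report the supported sequence would be the following command fragment (it needs a live cluster; the scrub-frequency and scrub-throttle values shown are just example choices from the help output, not settings used in this bug):

```
# Supported way to enable bitrot, instead of "volume set testvol features.bitrot on":
gluster volume bitrot testvol enable

# Optional scrubber tuning, using values listed in the help output above:
gluster volume bitrot testvol scrub-frequency daily
gluster volume bitrot testvol scrub-throttle lazy
```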
RCA'ed and patch posted upstream for review : http://review.gluster.org/#/c/11651/
Without the patch http://review.gluster.org/#/c/11651/, replace-brick commit force of a dead brick is successful for me, so this seems to be a different issue: I am able to replace a dead brick without hitting any problem on both the upstream and downstream branches. Will analyze this issue further.
(In reply to Gaurav Kumar Garg from comment #7)
> without this patch http://review.gluster.org/#/c/11651/ replace-brick
> commit force of dead brick is successful.
> It seems different issue. i am able to do replace-brick of dead brick
> successfully without facing any problem in both upstream and downstream
> branch.
>
> will further analysis this issue.

Did you kill the brick with SIGTERM? I could easily reproduce this.
In response to comment #4: Shwetha, I'm unable to get the SOS reports; the link returns "Forbidden", saying I do not have permission to access it. I'm not sure whether Sas would face the same issue. Could you attach the glusterd logs here to the bug?
(In reply to Atin Mukherjee from comment #8)
> (In reply to Gaurav Kumar Garg from comment #7)
> > without this patch http://review.gluster.org/#/c/11651/ replace-brick
> > commit force of dead brick is successful.
> > It seems different issue. i am able to do replace-brick of dead brick
> > successfully without facing any problem in both upstream and downstream
> > branch.
> >
> > will further analysis this issue.
>
> Did you kill the brick with SIGTERM? I could easily reproduce this.

Yes, I killed the brick using:
# kill -9 pidof_brick_process
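The distinction matters because SIGTERM (the default signal sent by kill) can be caught, giving the process a chance to clean up before exiting, whereas SIGKILL (kill -9, as used here) cannot be trapped or ignored, so the process dies with no cleanup at all. A minimal sketch of the difference, using stand-in shell processes rather than a real brick process:

```shell
# A process that traps SIGTERM, runs cleanup, and exits 0 -- loosely
# analogous to a graceful shutdown.
sh -c 'trap "echo cleanup; exit 0" TERM; while :; do sleep 1; done' &
pid=$!
sleep 1                         # give the trap time to be installed
kill "$pid"                     # SIGTERM: the trap runs
wait "$pid"
echo "SIGTERM exit status: $?"  # 0: the process exited on its own terms

# SIGKILL cannot be trapped: no handler, no cleanup.
sleep 300 &
pid=$!
kill -9 "$pid"
wait "$pid" 2>/dev/null
echo "SIGKILL exit status: $?"  # 137 = 128 + 9
```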
Patch on rhgs-3.1 : https://code.engineering.redhat.com/gerrit/#/c/52938/
Moving back to MODIFIED; the bug was moved to ON_QA by the errata tool.
Verified the bug on "glusterfs-3.7.1-10.el6rhs.x86_64". Bug is fixed. Moving the bug to verified state.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2015-1495.html