Description of problem:
=======================
EC volume: Self heal is not working properly on EC volume

Version-Release number of selected component (if applicable):
==============================================================
glusterfs-3.7.1-13

How reproducible:

Steps to Reproduce:
===================
1. Create a disperse volume with 6 bricks and mount it on a client using NFS.
2. On the client, un-tar the Linux kernel; while the un-tar is in progress, bring down two bricks.
3. After a few minutes, bring the downed bricks back up using "gluster vol start force".
4. After a few more minutes, bring down another two bricks.
5. After the un-tar completes, bring those bricks back up using "gluster vol start force".
6. While deleting the un-tarred files, rm returns Input/output errors; checking the self-heal status shows that the heal failed.

(A shell sketch of these steps, with assumed mount point and tarball paths, follows the log location at the end of this report.)

Actual results:
===============
Self heal is not healing all files on ECVOL.

Expected results:
=================
Self heal should heal all files on ECVOL once all bricks are up.

Additional info:
================
[root@rhs-client9 var]# gluster vol status ECVOL
Status of volume: ECVOL
Gluster process                                              TCP Port  RDMA Port  Online  Pid
----------------------------------------------------------------------------------------------
Brick rhs-client4.lab.eng.blr.redhat.com:/rhs/brick1/ECVOL   49164     0          Y       7531
Brick rhs-client4.lab.eng.blr.redhat.com:/rhs/brick2/ECVOL   49165     0          Y       7549
Brick rhs-client9.lab.eng.blr.redhat.com:/rhs/brick1/ECVOL   49165     0          Y       393
Brick rhs-client9.lab.eng.blr.redhat.com:/rhs/brick2/ECVOL   49166     0          Y       411
Brick rhs-client39.lab.eng.blr.redhat.com:/rhs/brick1/ECVOL  49155     0          Y       29237
Brick rhs-client39.lab.eng.blr.redhat.com:/rhs/brick2/ECVOL  49156     0          Y       29255
NFS Server on localhost                                      2049      0          Y       14440
Self-heal Daemon on localhost                                N/A       N/A        Y       14448
NFS Server on rhs-client39.lab.eng.blr.redhat.com            2049      0          Y       29274
Self-heal Daemon on rhs-client39.lab.eng.blr.redhat.com      N/A       N/A        Y       29283
NFS Server on rhs-client4.lab.eng.blr.redhat.com             2049      0          Y       15067
Self-heal Daemon on rhs-client4.lab.eng.blr.redhat.com       N/A       N/A        Y       15075

Task Status of Volume ECVOL
------------------------------------------------------------------------------
There are no active volume tasks
Logs are available at /home/repo/sosreports/bug.1258914
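For reference, below is a minimal shell sketch of the reproduction steps above. The volume name, brick paths, and host names are taken from the status output in the description; the mount point (/mnt/ec), the kernel tarball location, and the use of kill -9 on the brick PIDs to take bricks down are assumptions made only for illustration.

=========================================================
# create and start the 4+2 disperse volume ("force" because each host carries two bricks)
gluster volume create ECVOL disperse 6 redundancy 2 \
    rhs-client4.lab.eng.blr.redhat.com:/rhs/brick1/ECVOL \
    rhs-client4.lab.eng.blr.redhat.com:/rhs/brick2/ECVOL \
    rhs-client9.lab.eng.blr.redhat.com:/rhs/brick1/ECVOL \
    rhs-client9.lab.eng.blr.redhat.com:/rhs/brick2/ECVOL \
    rhs-client39.lab.eng.blr.redhat.com:/rhs/brick1/ECVOL \
    rhs-client39.lab.eng.blr.redhat.com:/rhs/brick2/ECVOL force
gluster volume start ECVOL

# on the client: mount over gluster NFS (NFSv3) and start the un-tar
mount -t nfs -o vers=3 rhs-client9.lab.eng.blr.redhat.com:/ECVOL /mnt/ec
cd /mnt/ec && tar xf /root/linux-2.6.39.tar.bz2 &

# step 2: while the un-tar runs, bring down two bricks by killing the
# brick PIDs shown by "gluster volume status ECVOL"
kill -9 <pid-of-brick-1> <pid-of-brick-2>

# step 3: after a few minutes, bring the bricks back up
gluster volume start ECVOL force

# step 4: after a few more minutes, bring down another two bricks
kill -9 <pid-of-brick-3> <pid-of-brick-4>

# step 5: after the un-tar finishes, bring everything back up
gluster volume start ECVOL force

# step 6: deleting the tree is where the Input/output error appears
rm -rf /mnt/ec/linux-2.6.39
=========================================================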
I was able to recreate this. However, this is actually expected behavior and not a bug.

If I/O is in progress on a file and 2 bricks are killed, the volume keeps working, since a 4+2 disperse volume tolerates the loss of any 2 bricks. When those bricks come back up, the self-heal daemon needs some time to pick the file up and complete the heal. If, while that heal is still running and I/O is still in progress on the file, the next 2 bricks (which were acting as the source for the heal) are killed, the file versions *could* end up in an inconsistent state, because the heal of the first 2 bricks never finished. This can lead to the I/O error while deleting the file, as ec can no longer find the minimum number of healthy bricks for that particular file.

In the steps to reproduce, steps 3 and 4 are the important ones. The bricks in step 4 were killed only a FEW minutes after bringing the bricks back in step 3, and those few minutes might not be enough to complete the heal process and bring the volume back to a resilient state.

Following are the xattrs for one of the files giving the I/O error (a sketch for comparing these xattrs across all bricks follows the volume info at the end of this comment):
=========================================================
[root@aspandey:/mnt/gfs]#
[root@aspandey:/mnt/gfs]# rm -rf linux-2.6.39/
rm: cannot remove ‘linux-2.6.39/arch’: Input/output error
[root@aspandey:/mnt/gfs]# rm -rf linux-2.6.39/
rm: cannot remove ‘linux-2.6.39/’: Directory not empty
=========================================================
[root@rhs-client4 ~]# getfattr -d -m. -e hex /brick/gluster/a{1..2}/linux-2.6.39/arch
getfattr: /brick/gluster/a1/linux-2.6.39/arch: No such file or directory
getfattr: /brick/gluster/a2/linux-2.6.39/arch: No such file or directory
=========================================================
[root@rhs-client39 ~]# getfattr -d -m. -e hex /brick/gluster/a{3..4}/linux-2.6.39/arch
getfattr: Removing leading '/' from absolute path names
# file: brick/gluster/a3/linux-2.6.39/arch
security.selinux=0x73797374656d5f753a6f626a6563745f723a64656661756c745f743a733000
trusted.ec.dirty=0x00000000000000020000000000000002
trusted.ec.version=0x00000000000000340000000000000039
trusted.gfid=0x39b6963aa23e4aaa8ab1be6dc5ef83a4
trusted.glusterfs.dht=0x000000010000000000000000ffffffff

# file: brick/gluster/a4/linux-2.6.39/arch
security.selinux=0x73797374656d5f753a6f626a6563745f723a64656661756c745f743a733000
trusted.ec.dirty=0x00000000000000020000000000000002
trusted.ec.version=0x00000000000000340000000000000039
trusted.gfid=0x39b6963aa23e4aaa8ab1be6dc5ef83a4
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
=========================================================
[root@rhs-client40 ~]# getfattr -d -m. -e hex /brick/gluster/a{5..6}/linux-2.6.39/arch
getfattr: Removing leading '/' from absolute path names
# file: brick/gluster/a5/linux-2.6.39/arch
security.selinux=0x73797374656d5f753a6f626a6563745f723a64656661756c745f743a733000
trusted.ec.dirty=0x00000000000000000000000000000000
trusted.ec.version=0x00000000000000320000000000000037
trusted.gfid=0x39b6963aa23e4aaa8ab1be6dc5ef83a4
trusted.glusterfs.dht=0x000000010000000000000000ffffffff

# file: brick/gluster/a6/linux-2.6.39/arch
security.selinux=0x73797374656d5f753a6f626a6563745f723a64656661756c745f743a733000
trusted.ec.dirty=0x00000000000000000000000000000000
trusted.ec.version=0x00000000000000320000000000000037
trusted.gfid=0x39b6963aa23e4aaa8ab1be6dc5ef83a4
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
=========================================================
[root@rhs-client4 ~]# gluster v info disp

Volume Name: disp
Type: Disperse
Volume ID: 502f664f-437f-4e34-ba48-1cd9d0f4f357
Status: Started
Number of Bricks: 1 x (4 + 2) = 6
Transport-type: tcp
Bricks:
Brick1: rhs-client4.lab.eng.blr.redhat.com:/brick/gluster/a1
Brick2: rhs-client4.lab.eng.blr.redhat.com:/brick/gluster/a2
Brick3: rhs-client39.lab.eng.blr.redhat.com:/brick/gluster/a3
Brick4: rhs-client39.lab.eng.blr.redhat.com:/brick/gluster/a4
Brick5: rhs-client40.lab.eng.blr.redhat.com:/brick/gluster/a5
Brick6: rhs-client40.lab.eng.blr.redhat.com:/brick/gluster/a6
Options Reconfigured:
performance.readdir-ahead: on
[root@rhs-client4 ~]#
=========================================================
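To make the inconsistency above easier to spot, here is a small sketch (not part of the original report) that dumps the ec xattrs of one file on every brick of the "disp" volume and then lists what the heal daemon still has pending. The brick paths are taken from the volume info above; the script name, its relative-path argument, and password-less ssh to the brick hosts are assumptions for illustration only.

=========================================================
#!/bin/bash
# check-ec-xattrs.sh (hypothetical helper): compare trusted.ec.version and
# trusted.ec.dirty of one file across all bricks of the "disp" volume.
# Usage: ./check-ec-xattrs.sh linux-2.6.39/arch

REL_PATH="$1"

BRICKS="
rhs-client4.lab.eng.blr.redhat.com:/brick/gluster/a1
rhs-client4.lab.eng.blr.redhat.com:/brick/gluster/a2
rhs-client39.lab.eng.blr.redhat.com:/brick/gluster/a3
rhs-client39.lab.eng.blr.redhat.com:/brick/gluster/a4
rhs-client40.lab.eng.blr.redhat.com:/brick/gluster/a5
rhs-client40.lab.eng.blr.redhat.com:/brick/gluster/a6
"

for brick in $BRICKS; do
    host=${brick%%:*}
    path=${brick#*:}
    echo "=== $brick ==="
    # a missing entry on a brick (as on a1/a2 above) matters as much as a
    # mismatching version, so keep stderr in the output
    ssh "$host" "getfattr -d -m. -e hex $path/$REL_PATH 2>&1"
    echo
done

# entries the self-heal daemon still considers pending (run on any server node)
gluster volume heal disp info
=========================================================

In a healthy state all six bricks carry the same trusted.ec.version and an all-zero trusted.ec.dirty. In the output above, a1/a2 have no copy of the entry at all, a3/a4 are at version ...34/...39 with a non-zero dirty counter, and a5/a6 are at ...32/...37, so no four bricks agree on a version, and a 4+2 dispersed entry cannot be served with fewer than four consistent answers, which matches the explanation above.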