Bug 1807384 - [AFR] Heal not happening after disk cleanup and self-heal of files/dirs is done to simulate disk replacement
Summary: [AFR] Heal not happening after disk cleanup and self-heal of files/dirs is done to simulate disk replacement
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: GlusterFS
Classification: Community
Component: replicate
Version: mainline
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-02-26 09:06 UTC by Kshithij Iyer
Modified: 2020-03-12 12:21 UTC
CC: 2 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2020-03-12 12:21:55 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Kshithij Iyer 2020-02-26 09:06:38 UTC
Description of problem:
While running patch [1], which performs the steps listed under "Steps to Reproduce" below, it was observed that the arequal checksums of the two bricks were different, as shown below:
################################################################################
Checksum of the brick from which the data was removed
################################################################################
arequal-checksum -p /mnt/vol0/testvol_replicated_brick2 -i .glusterfs -i .landfill -i .trashcan

Entry counts
Regular files   : 14
Directories     : 3
Symbolic links  : 0
Other           : 0
Total           : 17

Metadata checksums
Regular files   : 3e9
Directories     : 24d74c
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : c4a0e0fd92dba41dc446cc3b33287983
Directories     : 300002e01
Symbolic links  : 0
Other           : 0
Total           : e62cc5a1f3f39f
################################################################################
Checksum of the brick where data wasn't removed
################################################################################
arequal-checksum -p /mnt/vol0/testvol_replicated_brick1 -i .glusterfs -i .landfill -i .trashcan

Entry counts
Regular files   : 16500
Directories     : 11
Symbolic links  : 0
Other           : 0
Total           : 16511

Metadata checksums
Regular files   : 3e9
Directories     : 24d74c
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : 6b72772e37d757ad53453c4aafed344c
Directories     : 301002f01
Symbolic links  : 0
Other           : 0
Total           : 38374b67993a4ce0
################################################################################

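For reference, the checksums above were collected per brick with the arequal-checksum tool; a minimal manual equivalent is sketched below (brick paths are the ones from this report, the output file names are illustrative):

# Run on each brick node, excluding gluster-internal directories,
# exactly as in the outputs above:
arequal-checksum -p /mnt/vol0/testvol_replicated_brick1 -i .glusterfs -i .landfill -i .trashcan > /tmp/brick1.arequal
arequal-checksum -p /mnt/vol0/testvol_replicated_brick2 -i .glusterfs -i .landfill -i .trashcan > /tmp/brick2.arequal

# After copying both outputs to one node, a plain diff shows the mismatch
# in entry counts and checksums:
diff /tmp/brick1.arequal /tmp/brick2.arequal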

This means that heal wasn't completing on the node where the data was removed. However, when heal info was checked before comparing the checksums, it showed no entries to be healed on any of the bricks:
################################################################################
2020-02-25 12:15:34,416 INFO (run) root.2.161 (cp): gluster volume heal testvol_replicated info --xml
2020-02-25 12:15:34,416 DEBUG (_get_ssh_connection) Retrieved connection from cache: root.2.161
2020-02-25 12:15:34,618 INFO (_log_results) RETCODE (root.2.161): 0
2020-02-25 12:15:34,619 DEBUG (_log_results) STDOUT (root.2.161)...
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<cliOutput>
  <healInfo>
    <bricks>
      <brick hostUuid="112835ce-16ed-43e1-a758-c104c78ff782">
        <name>172.19.2.161:/mnt/vol0/testvol_replicated_brick0</name>
        <status>Connected</status>
        <numberOfEntries>0</numberOfEntries>
      </brick>
      <brick hostUuid="3fdae765-7a1f-4ae5-99c1-ea7b24768554">
        <name>172.19.2.153:/mnt/vol0/testvol_replicated_brick1</name>
        <status>Connected</status>
        <numberOfEntries>0</numberOfEntries>
      </brick>
      <brick hostUuid="a3877a65-2963-423c-8e9f-95ceb07f907d">
        <name>172.19.2.164:/mnt/vol0/testvol_replicated_brick2</name>
        <status>Connected</status>
        <numberOfEntries>0</numberOfEntries>
      </brick>
    </bricks>
  </healInfo>
  <opRet>0</opRet>
  <opErrno>0</opErrno>
  <opErrstr/>
</cliOutput>
################################################################################
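For completeness, the per-brick counts above can be reproduced from the shell with the same heal-info command shown in the log; the grep step is only an illustration (the test framework parses the XML itself):

# Same heal-info command as in the log above (volume name from this report):
gluster volume heal testvol_replicated info --xml > /tmp/heal_info.xml

# List the <numberOfEntries> reported for each brick:
grep numberOfEntries /tmp/heal_info.xml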


Version-Release number of selected component (if applicable):
glusterfs 20200220.a0e0890

How reproducible:
2/2

Steps to Reproduce:
- Create a replicated or distributed-replicated volume
- Create a directory on the mount point and write files/dirs to it
- Create another set of files (1K files)
- While the creation of files/dirs is in progress, kill one brick
- Remove the contents of the killed brick (simulating disk replacement)
- While the I/O is still in progress, restart glusterd on the nodes
  where disk replacement was simulated to bring the bricks back online
- Start volume heal
- Wait for the I/O to complete
- Verify whether the files are self-healed
- Calculate arequals of the mount point and all the bricks (a rough
  manual equivalent is sketched after this list)
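A rough shell sketch of the manual equivalent of the steps above (the test itself is automated in patch [1]; the volume name and brick path are taken from this report, the brick PID is a placeholder):

# While client I/O is running, kill one brick process
# (look up its PID with 'gluster volume status testvol_replicated'):
kill -9 <brick-pid>

# Simulate disk replacement by wiping the killed brick's backend directory:
rm -rf /mnt/vol0/testvol_replicated_brick2/* /mnt/vol0/testvol_replicated_brick2/.glusterfs

# Restart glusterd on that node to bring the brick back online:
systemctl restart glusterd

# Trigger heal and monitor it until no entries remain:
gluster volume heal testvol_replicated
gluster volume heal testvol_replicated info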

Actual results:
Arequal checksums differ across bricks for replicated volumes and aren't consistent for distributed-replicated volumes.

Expected results:
Arequal checksums should be the same for replicated volumes and consistent for distributed-replicated volumes.

Additional info:
This issue wasn't observed in gluster 6.0 builds.

Reference links:
[1] https://review.gluster.org/#/c/glusto-tests/+/20378/
[2] https://ci.centos.org/job/gluster_glusto-patch-check/2053/artifact/glustomain.log

Comment 1 Worker Ant 2020-03-12 12:21:55 UTC
This bug has been moved to https://github.com/gluster/glusterfs/issues/881 and will be tracked there from now on. Visit the GitHub issue URL for further details.

