Bug 1680560 - systemic: entries not getting healed at all post reboot of a node
Summary: systemic: entries not getting healed at all post reboot of a node
Keywords:
Status: CLOSED DUPLICATE of bug 1593242
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: replicate
Version: rhgs-3.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Karthik U S
QA Contact: Nag Pavan Chilakam
URL:
Whiteboard:
Depends On: 1593242
Blocks:
 
Reported: 2019-02-25 10:25 UTC by Nag Pavan Chilakam
Modified: 2020-04-28 04:57 UTC
CC: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-04-28 04:57:00 UTC
Embargoed:



Description Nag Pavan Chilakam 2019-02-25 10:25:00 UTC
Description of problem:
========================
On my non-functional (systemic) test setup, I was running tests to validate the RPC fixes.

I rebooted one of the server nodes and saw that, post reboot, a couple of entries are not getting healed at all.



IO details are in the "Nag Cluster IOs" worksheet:
https://docs.google.com/spreadsheets/d/17Yf9ZRWnWOpbRyFQ2ZYxAAlp9I_yarzKZdjN8idBJM0/edit#gid=1472913705

[root@rhs-client19 ~]# gluster v heal rpcx3 info
Brick rhs-client19.lab.eng.blr.redhat.com:/gluster/brick1/rpcx3
Status: Connected
Number of entries: 0

Brick rhs-client25.lab.eng.blr.redhat.com:/gluster/brick1/rpcx3
Status: Connected
Number of entries: 0

Brick rhs-client32.lab.eng.blr.redhat.com:/gluster/brick1/rpcx3
Status: Connected
Number of entries: 0

Brick rhs-client25.lab.eng.blr.redhat.com:/gluster/brick2/rpcx3
Status: Connected
Number of entries: 0

Brick rhs-client32.lab.eng.blr.redhat.com:/gluster/brick2/rpcx3
Status: Connected
Number of entries: 0

Brick rhs-client38.lab.eng.blr.redhat.com:/gluster/brick2/rpcx3
Status: Connected
Number of entries: 0

Brick rhs-client32.lab.eng.blr.redhat.com:/gluster/brick3/rpcx3
/IOs/samedir/level1.1/level2.1/level3.11/level4.35 
/IOs/samedir/level1.1/level2.1/level3.11/level4.35/level5.53 
Status: Connected
Number of entries: 2

Brick rhs-client38.lab.eng.blr.redhat.com:/gluster/brick3/rpcx3
/IOs/samedir/level1.1/level2.1/level3.11/level4.35 
/IOs/samedir/level1.1/level2.1/level3.11/level4.35/level5.53 
Status: Connected
Number of entries: 2

Brick rhs-client19.lab.eng.blr.redhat.com:/gluster/brick3/rpcx3
/IOs/samedir/level1.1/level2.1/level3.11/level4.35/level5.53 
Status: Connected
Number of entries: 1
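
To narrow down why these two entries never heal, the AFR changelog xattrs on each replica can be dumped directly on the bricks and compared. A minimal sketch using the standard getfattr tool, with the brick and entry paths taken from the heal info output above (the exact trusted.afr.rpcx3-client-* key names depend on the brick indices of the volume):

# Run on rhs-client32, rhs-client38 and rhs-client19; non-zero
# trusted.afr.rpcx3-client-* values mark heals pending against that replica.
getfattr -d -m . -e hex /gluster/brick3/rpcx3/IOs/samedir/level1.1/level2.1/level3.11/level4.35
getfattr -d -m . -e hex /gluster/brick3/rpcx3/IOs/samedir/level1.1/level2.1/level3.11/level4.35/level5.53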

[root@rhs-client19 ~]# gluster v info
 
Volume Name: rpcx3
Type: Distributed-Replicate
Volume ID: f7532c65-63d0-4e4a-a5b5-c95238635eff
Status: Started
Snapshot Count: 0
Number of Bricks: 3 x 3 = 9
Transport-type: tcp
Bricks:
Brick1: rhs-client19.lab.eng.blr.redhat.com:/gluster/brick1/rpcx3
Brick2: rhs-client25.lab.eng.blr.redhat.com:/gluster/brick1/rpcx3
Brick3: rhs-client32.lab.eng.blr.redhat.com:/gluster/brick1/rpcx3
Brick4: rhs-client25.lab.eng.blr.redhat.com:/gluster/brick2/rpcx3
Brick5: rhs-client32.lab.eng.blr.redhat.com:/gluster/brick2/rpcx3
Brick6: rhs-client38.lab.eng.blr.redhat.com:/gluster/brick2/rpcx3
Brick7: rhs-client32.lab.eng.blr.redhat.com:/gluster/brick3/rpcx3
Brick8: rhs-client38.lab.eng.blr.redhat.com:/gluster/brick3/rpcx3
Brick9: rhs-client19.lab.eng.blr.redhat.com:/gluster/brick3/rpcx3
Options Reconfigured:
diagnostics.client-log-level: INFO
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
diagnostics.latency-measurement: on
diagnostics.count-fop-hits: on
features.uss: enable
features.quota: on
features.inode-quota: on
features.quota-deem-statfs: on
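
As an additional cross-check (not recorded above), the heal can be retriggered manually and the pending entries re-examined; these are standard gluster CLI commands:

# Kick off an index heal and re-check the pending entries
gluster volume heal rpcx3
gluster volume heal rpcx3 info

# If the index heal makes no progress, a full heal crawls every brick
gluster volume heal rpcx3 full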


Version-Release number of selected component (if applicable):
=================
3.12.2-43

How reproducible:
=================
Hit once on the systemic setup.

Steps to Reproduce:
=================
1. Created a single 3x3 volume on a 4-node setup (brick multiplexing disabled, all other settings at their defaults).
2. Mounted the volume on 8 clients and triggered the IOs below, keeping them running for about 2 days:
IOs:
1) Linux untar from all mounts ---> note that on 4 clients this was run as a non-root user, after enabling access through ACLs
2) collecting resource-consumption stats and appending them to individual files on the mount point
3) continuous lookups from all clients
4) creation of the same deep directory path from all 8 clients in parallel
3. After a day or so, as a random health check, started creating zero-byte files (~2.5 million) from a new client.
4. After 2 days, enabled quota and USS.
5. Rebooted one node.
6. Observed that even after 3 days, the 2 entries had still not healed.
(A rough sketch of the deep-directory and zero-byte-file workload is shown below.)
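
A rough sketch of the deep-directory and zero-byte-file parts of the workload, assuming a FUSE mount at /mnt/rpcx3; the mount path, file names, and counts are illustrative, not the exact values used:

# Same deep directory path created from all 8 clients in parallel
mkdir -p /mnt/rpcx3/IOs/samedir/level1.1/level2.1/level3.11/level4.35/level5.53

# ~2.5 million zero-byte files created from a separate client
mkdir -p /mnt/rpcx3/zerobyte
for i in $(seq 1 2500000); do touch /mnt/rpcx3/zerobyte/file.$i; done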

