Bug 1299752

Summary: One of the bricks on hot tier doesn't heal after node failure/recovery
Product: Red Hat Gluster Storage
Reporter: krishnaram Karthick <kramdoss>
Component: replicate
Assignee: Ravishankar N <ravishankar>
Status: CLOSED NOTABUG
QA Contact: spandura
Severity: high
Docs Contact:
Priority: unspecified
Version: rhgs-3.1
CC: kramdoss, pkarampu, rhs-bugs, smohan, storage-qa-internal
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Flags: kramdoss: needinfo-
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-02-24 09:26:01 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description krishnaram Karthick 2016-01-19 08:33:20 UTC
Description of problem:

On a tiered volume with the hot tier set as replica, one of the bricks doesn't heal after node failure/recovery. The node turned unresponsive and a hard reboot was performed. CPU utilization of the glusterfsd process (for the affected node's hot-tier brick) is at 200%, and many error messages are seen in the brick and tier logs.

We have yet to determine what triggered this issue and how the system ended up in this state.

Volume Name: reg-test-cycle1
Type: Tier
Volume ID: f4b57f6b-f54b-4e46-834e-1a3ee2718a57
Status: Started
Number of Bricks: 20
Transport-type: tcp
Hot Tier :
Hot Tier Type : Distributed-Replicate
Number of Bricks: 4 x 2 = 8
Brick1: 10.70.37.132:/rhs/brick13/leg2
Brick2: 10.70.37.121:/rhs/brick14/leg2
Brick3: 10.70.37.77:/rhs/brick13/leg2
Brick4: 10.70.37.140:/rhs/brick14/leg2
Brick5: 10.70.42.149:/rhs/brick6/leg2
Brick6: 10.70.43.3:/rhs/brick6/leg2
Brick7: 10.70.43.141:/rhs/brick6/leg2 --> source
Brick8: 10.70.42.45:/rhs/brick6/leg2 --> sink
Cold Tier:
Cold Tier Type : Distributed-Disperse
Number of Bricks: 2 x (4 + 2) = 12
Brick9: 10.70.37.132:/rhs/brick11/leg1
Brick10: 10.70.37.121:/rhs/brick11/leg1
Brick11: 10.70.37.77:/rhs/brick11/leg1
Brick12: 10.70.37.140:/rhs/brick11/leg1
Brick13: 10.70.42.149:/rhs/brick3/leg1
Brick14: 10.70.43.3:/rhs/brick3/leg1
Brick15: 10.70.43.141:/rhs/brick3/leg1
Brick16: 10.70.42.45:/rhs/brick3/leg1
Brick17: 10.70.37.132:/rhs/brick12/leg1
Brick18: 10.70.37.121:/rhs/brick12/leg1
Brick19: 10.70.37.77:/rhs/brick12/leg1
Brick20: 10.70.37.140:/rhs/brick12/leg1
Options Reconfigured:
performance.readdir-ahead: on
features.quota: on
features.inode-quota: on
features.quota-deem-statfs: on
features.ctr-enabled: on
cluster.tier-mode: cache
cluster.watermark-low: 31
cluster.watermark-hi: 40
diagnostics.latency-measurement: on
diagnostics.count-fop-hits: on


Version-Release number of selected component (if applicable):
glusterfs-3.7.5-15.el7rhgs.x86_64

How reproducible:
Unable to determine if this is reproducible 

Steps to Reproduce:
There are no exact steps to reproduce this issue

Actual results:
The glusterfsd process consumes 200% CPU and the heal does not happen on the hot-tier brick of one of the nodes
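The CPU figure above is the kind of thing one would typically confirm on the affected node with a process listing sorted by CPU usage. A minimal sketch (standard procps `ps`; not specific to this cluster):

```shell
#!/bin/sh
# List the top CPU-consuming processes; on the affected node, glusterfsd
# for the hot-tier brick would be expected near the top at ~200%.
ps -eo pid,pcpu,comm --sort=-pcpu | head -n 5
```

`pgrep glusterfsd` can then map the busy PID back to the brick via `gluster volume status`.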

Expected results:
No brick failure, high CPU consumption, or heal failure

Additional info:
sosreport on the affected node seems to hang; I'll upload the necessary logs manually from that node.
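When sosreport hangs, the manual collection mentioned above usually amounts to tarring up the gluster log directory. A hedged sketch, assuming the default RHGS log location `/var/log/glusterfs` (an assumption; the demo falls back to a temp directory on hosts without gluster installed):

```shell
#!/bin/sh
# Bundle gluster logs manually when sosreport is unusable.
# /var/log/glusterfs is the default log directory; this path is an assumption.
LOGDIR=/var/log/glusterfs
# Fallback so the sketch is runnable on a machine without gluster:
[ -d "$LOGDIR" ] || { LOGDIR=$(mktemp -d); : > "$LOGDIR/demo.log"; }
OUT=/tmp/gluster-logs.tar.gz
tar -C "$(dirname "$LOGDIR")" -czf "$OUT" "$(basename "$LOGDIR")"
echo "collected logs from $LOGDIR into $OUT"
```

The brick and tier logs referenced in the description live under that directory (e.g. `bricks/` for per-brick logs), so one archive captures everything the triage needs.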