Bug 1299752 - One of the bricks on hot tier doesn't heal after node failure/recovery
Summary: One of the bricks on hot tier doesn't heal after node failure/recovery
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: replicate
Version: rhgs-3.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Ravishankar N
QA Contact: spandura
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-01-19 08:33 UTC by krishnaram Karthick
Modified: 2016-09-17 14:18 UTC
CC: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-02-24 09:26:01 UTC
kramdoss: needinfo-



Description krishnaram Karthick 2016-01-19 08:33:20 UTC
Description of problem:

On a tiered volume whose hot tier is configured as a replica, one of the bricks does not heal after a node failure and recovery. The node had turned unresponsive and was recovered with a hard reboot. CPU utilization of the glusterfsd process serving the affected node's hot tier brick is at 200%, and a large number of error messages appear in the brick and tier logs.

We have yet to determine what triggered this issue and how the system ended up in this state.
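
For reference, the heal state of the affected replica pair can be inspected with the standard heal commands (a sketch only; the volume name is taken from the output below, and this output was not captured at report time):

# gluster volume heal reg-test-cycle1 info
# gluster volume heal reg-test-cycle1 info split-brain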

Volume Name: reg-test-cycle1
Type: Tier
Volume ID: f4b57f6b-f54b-4e46-834e-1a3ee2718a57
Status: Started
Number of Bricks: 20
Transport-type: tcp
Hot Tier :
Hot Tier Type : Distributed-Replicate
Number of Bricks: 4 x 2 = 8
Brick1: 10.70.37.132:/rhs/brick13/leg2
Brick2: 10.70.37.121:/rhs/brick14/leg2
Brick3: 10.70.37.77:/rhs/brick13/leg2
Brick4: 10.70.37.140:/rhs/brick14/leg2
Brick5: 10.70.42.149:/rhs/brick6/leg2
Brick6: 10.70.43.3:/rhs/brick6/leg2
Brick7: 10.70.43.141:/rhs/brick6/leg2 --> source
Brick8: 10.70.42.45:/rhs/brick6/leg2 --> sink
Cold Tier:
Cold Tier Type : Distributed-Disperse
Number of Bricks: 2 x (4 + 2) = 12
Brick9: 10.70.37.132:/rhs/brick11/leg1
Brick10: 10.70.37.121:/rhs/brick11/leg1
Brick11: 10.70.37.77:/rhs/brick11/leg1
Brick12: 10.70.37.140:/rhs/brick11/leg1
Brick13: 10.70.42.149:/rhs/brick3/leg1
Brick14: 10.70.43.3:/rhs/brick3/leg1
Brick15: 10.70.43.141:/rhs/brick3/leg1
Brick16: 10.70.42.45:/rhs/brick3/leg1
Brick17: 10.70.37.132:/rhs/brick12/leg1
Brick18: 10.70.37.121:/rhs/brick12/leg1
Brick19: 10.70.37.77:/rhs/brick12/leg1
Brick20: 10.70.37.140:/rhs/brick12/leg1
Options Reconfigured:
performance.readdir-ahead: on
features.quota: on
features.inode-quota: on
features.quota-deem-statfs: on
features.ctr-enabled: on
cluster.tier-mode: cache
cluster.watermark-low: 31
cluster.watermark-hi: 40
diagnostics.latency-measurement: on
diagnostics.count-fop-hits: on
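
To confirm which glusterfsd instance is spinning, the PID of the affected brick can be read from volume status and watched with top (a sketch; the PID placeholder is whatever the status output reports for the sink brick above):

# gluster volume status reg-test-cycle1
# top -p <glusterfsd PID from the status output>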


Version-Release number of selected component (if applicable):
glusterfs-3.7.5-15.el7rhgs.x86_64

How reproducible:
Unable to determine whether this is reproducible.

Steps to Reproduce:
There are no exact steps to reproduce this issue.

Actual results:
The glusterfsd process consumes 200% CPU, and the heal does not happen on the hot tier brick of one of the nodes.
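
If the high CPU state recurs, a statedump of the volume may show where the brick process is spinning (a sketch; dumps typically land under /var/run/gluster on the brick node by default):

# gluster volume statedump reg-test-cycle1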

Expected results:
No brick failure, high CPU consumption, or heal failure.

Additional info:
sosreport on the affected node seems to hang; I'll attach the necessary logs manually from that node.
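
As a fallback while sosreport hangs, the Gluster logs can be collected by hand (a sketch; /var/log/glusterfs is the default log directory):

# tar czf /tmp/gluster-logs-$(hostname).tar.gz /var/log/glusterfs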

