Description of problem:
On a tiered volume with the hot tier configured as replica, one of the bricks does not heal after a node failure/recovery. The node turned unresponsive and a hard reboot was performed. CPU utilization of the glusterfsd process (for the affected node's hot tier brick) is at 200%, and many error messages are seen in the brick and tier logs. We have yet to determine what triggered this issue and how the system ended up in this state.

Volume Name: reg-test-cycle1
Type: Tier
Volume ID: f4b57f6b-f54b-4e46-834e-1a3ee2718a57
Status: Started
Number of Bricks: 20
Transport-type: tcp
Hot Tier :
Hot Tier Type : Distributed-Replicate
Number of Bricks: 4 x 2 = 8
Brick1: 10.70.37.132:/rhs/brick13/leg2
Brick2: 10.70.37.121:/rhs/brick14/leg2
Brick3: 10.70.37.77:/rhs/brick13/leg2
Brick4: 10.70.37.140:/rhs/brick14/leg2
Brick5: 10.70.42.149:/rhs/brick6/leg2
Brick6: 10.70.43.3:/rhs/brick6/leg2
Brick7: 10.70.43.141:/rhs/brick6/leg2 --> source
Brick8: 10.70.42.45:/rhs/brick6/leg2 --> sink
Cold Tier:
Cold Tier Type : Distributed-Disperse
Number of Bricks: 2 x (4 + 2) = 12
Brick9: 10.70.37.132:/rhs/brick11/leg1
Brick10: 10.70.37.121:/rhs/brick11/leg1
Brick11: 10.70.37.77:/rhs/brick11/leg1
Brick12: 10.70.37.140:/rhs/brick11/leg1
Brick13: 10.70.42.149:/rhs/brick3/leg1
Brick14: 10.70.43.3:/rhs/brick3/leg1
Brick15: 10.70.43.141:/rhs/brick3/leg1
Brick16: 10.70.42.45:/rhs/brick3/leg1
Brick17: 10.70.37.132:/rhs/brick12/leg1
Brick18: 10.70.37.121:/rhs/brick12/leg1
Brick19: 10.70.37.77:/rhs/brick12/leg1
Brick20: 10.70.37.140:/rhs/brick12/leg1
Options Reconfigured:
performance.readdir-ahead: on
features.quota: on
features.inode-quota: on
features.quota-deem-statfs: on
features.ctr-enabled: on
cluster.tier-mode: cache
cluster.watermark-low: 31
cluster.watermark-hi: 40
diagnostics.latency-measurement: on
diagnostics.count-fop-hits: on

Version-Release number of selected component (if applicable):
glusterfs-3.7.5-15.el7rhgs.x86_64

How reproducible:
Unable to determine if this is reproducible
Steps to Reproduce:
There are no exact steps to reproduce this issue.

Actual results:
The glusterfsd process consumes 200% CPU, and the heal does not happen on the hot tier brick of one of the nodes.

Expected results:
No brick failure, high CPU consumption, or heal failure.

Additional info:
sosreport on the affected node seems to hang; I'll upload the necessary logs manually from that node.
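For reference, the symptoms above can be checked on the affected node with commands along these lines. This is a sketch, not the exact commands used: the volume name and brick path are taken from the volume info above, but the log file paths follow the usual /var/log/glusterfs naming convention and may differ on this build.

```shell
# List entries pending heal on the volume (covers the hot tier replica bricks):
gluster volume heal reg-test-cycle1 info

# Confirm the brick process and its CPU usage on the affected node
# (the brick path is Brick8 from the volume info above):
ps -C glusterfsd -o pid,pcpu,args | grep '/rhs/brick6/leg2'

# Look at recent brick and tier log messages for the errors mentioned
# (log file names assumed from the standard naming convention):
tail -n 100 /var/log/glusterfs/bricks/rhs-brick6-leg2.log
tail -n 100 /var/log/glusterfs/reg-test-cycle1-tier.log
```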