Bug 1427419

Summary: Warning messages logged when an offline brick of an EC volume comes back up are difficult for the end user to understand.
Product: [Community] GlusterFS
Reporter: Sunil Kumar Acharya <sheggodu>
Component: disperse
Assignee: Sunil Kumar Acharya <sheggodu>
Status: CLOSED CURRENTRELEASE
QA Contact:
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.8
CC: aflyhorse, amukherj, aspandey, bsrirama, bugs, nchilaka, rhs-bugs, sheggodu, storage-qa-internal
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: glusterfs-3.8.10
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1409202
Environment:
Last Closed: 2017-03-18 10:52:28 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1408361, 1409202, 1435592
Bug Blocks: 1414347, 1427089

Comment 1 Worker Ant 2017-02-28 07:40:59 UTC
REVIEW: https://review.gluster.org/16781 (cluster/ec: Fixing log message) posted (#1) for review on release-3.8 by Sunil Kumar Acharya (sheggodu)

Comment 2 Ashish Pandey 2017-02-28 07:55:23 UTC
Description of problem:
=======================
When any brick of an EC volume goes down and comes back up while I/O is in progress, the warning messages below appear in the self-heal daemon log (shd log). The end user cannot tell which subvolumes have the problem, because the subvolumes are printed as hexadecimal bitmask values, so the user has to do a lot of math to work out which subvolumes are meant (see the decoding sketch after the log excerpt below).

These warning messages need to be improved so that the end user can understand them.



[2016-12-23 04:52:00.658995] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-Disperse1-disperse-0: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=3E, bad=1)
[2016-12-23 04:52:00.659085] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-Disperse1-disperse-0: Heal failed [Invalid argument]
[2016-12-23 04:52:00.812666] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-Disperse1-disperse-0: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=3E, bad=1)
[2016-12-23 04:52:00.812709] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-Disperse1-disperse-0: Heal failed [Invalid argument]
[2016-12-23 04:52:01.053575] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-Disperse1-disperse-0: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=3E, bad=1)
[2016-12-23 04:52:01.053651] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-Disperse1-disperse-0: Heal failed [Invalid argument]
[2016-12-23 04:52:01.059907] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-Disperse1-disperse-0: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=3E, bad=1)
[2016-12-23 04:52:01.059983] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-Disperse1-disperse-0: Heal failed [Invalid argument]
[2016-12-23 04:52:01.085491] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-Disperse1-disperse-0: Operation failed on some subvolumes
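
The up/mask/good/bad fields are bitmasks over the subvolumes of the disperse set, printed in hexadecimal, where (assuming bit i stands for subvolume i) each set bit marks one subvolume. Purely as an illustration of the arithmetic the user currently has to do by hand, here is a minimal shell sketch that decodes the values from the log above (the 6-brick count and the masks 3E/1 are taken from the log; nothing here is produced by any gluster tool):

#!/bin/bash
# Decode EC subvolume bitmasks into per-brick states.
# Values taken from the log above: good=3E, bad=1 on a 6-brick disperse set.
# Assumption: bit i corresponds to subvolume i of Disperse1-disperse-0.
good=0x3E
bad=0x1
bricks=6

for ((i = 0; i < bricks; i++)); do
    if (( (good >> i) & 1 )); then
        echo "subvolume $i: good"
    elif (( (bad >> i) & 1 )); then
        echo "subvolume $i: bad (needs heal)"
    fi
done

Running this shows subvolume 0 as bad and subvolumes 1-5 as good, which is the information the improved warning message should state directly instead of leaving the hex values to be decoded.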


Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.8.4-9.el6rhs.x86_64.

 
How reproducible:
=================
Always


Steps to Reproduce:
===================
1. Set up a basic EC volume with the recommended configuration.
2. FUSE-mount the volume.
3. Bring one brick down and start I/O on the mount point.
4. After I/O has been running for some time, bring the offline brick back up with a forced volume start (volume start force); a command-line sketch follows this list.
5. Check the self-heal daemon logs for the warning messages shown above.
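
For illustration, a rough command-line version of the steps above, assuming a hypothetical 4+2 disperse volume named ecvol on a host named server1 with bricks under /bricks (all names and paths are placeholders, not taken from this report):

# 1. Create and start a recommended 4+2 EC volume (placeholder names/paths;
#    'force' only because all bricks sit on one host in this sketch).
gluster volume create ecvol disperse 6 redundancy 2 \
    server1:/bricks/b1 server1:/bricks/b2 server1:/bricks/b3 \
    server1:/bricks/b4 server1:/bricks/b5 server1:/bricks/b6 force
gluster volume start ecvol

# 2. FUSE-mount the volume.
mkdir -p /mnt/ecvol
mount -t glusterfs server1:/ecvol /mnt/ecvol

# 3. Bring one brick down (note its PID from 'gluster volume status'),
#    then start I/O on the mount point.
gluster volume status ecvol
kill <brick-pid>                     # <brick-pid> is a placeholder
dd if=/dev/zero of=/mnt/ecvol/testfile bs=1M count=1024 &

# 4. After some I/O has happened, bring the offline brick back up.
gluster volume start ecvol force

# 5. Check the self-heal daemon log for the warnings shown above
#    (typically /var/log/glusterfs/glustershd.log).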

Actual results:
===============
Warning messages logged when an offline brick of an EC volume comes back up are difficult for the end user to understand.

Expected results:
=================
The warning messages logged when an offline brick of an EC volume comes back up should be improved so that the end user can understand them.

Comment 3 Worker Ant 2017-03-10 08:33:11 UTC
COMMIT: https://review.gluster.org/16781 committed in release-3.8 by jiffin tony Thottan (jthottan) 
------
commit a76304cd434028215de39cf3b45672cc7ec6ca70
Author: Sunil Kumar H G <sheggodu>
Date:   Fri Dec 30 14:11:15 2016 +0530

    cluster/ec: Fixing log message
    
    Updating the warning message with details to improve
    user understanding.
    
    >BUG: 1409202
    >Change-Id: I001f8d5c01c97fff1e4e1a3a84b62e17c025c520
    >Signed-off-by: Sunil Kumar H G <sheggodu>
    >Reviewed-on: http://review.gluster.org/16315
    >Tested-by: Sunil Kumar Acharya
    >Smoke: Gluster Build System <jenkins.org>
    >NetBSD-regression: NetBSD Build System <jenkins.org>
    >CentOS-regression: Gluster Build System <jenkins.org>
    >Reviewed-by: Xavier Hernandez <xhernandez>
    
    BUG: 1427419
    Change-Id: I34a869d7cd7630881c897e0e4ecac367cd2820f9
    Signed-off-by: Sunil Kumar Acharya <sheggodu>
    Reviewed-on: https://review.gluster.org/16781
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Ashish Pandey <aspandey>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: jiffin tony Thottan <jthottan>

Comment 4 Niels de Vos 2017-03-18 10:52:28 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.8.10, please open a new bug report.

glusterfs-3.8.10 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-March/000068.html
[2] https://www.gluster.org/pipermail/gluster-users/