Bug 1409202 - Warning messages logged when an offline EC volume brick comes back up are difficult for end users to understand.
Summary: Warning messages logged when an offline EC volume brick comes back up are difficult for end users to understand.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: disperse
Version: mainline
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Sunil Kumar Acharya
QA Contact:
URL:
Whiteboard:
Depends On: 1408361
Blocks: 1414347 1427089 1427419 1435592
 
Reported: 2016-12-30 08:43 UTC by Ashish Pandey
Modified: 2017-03-24 10:24 UTC
CC List: 9 users

Fixed In Version: glusterfs-3.10.0
Clone Of: 1408361
: 1414347 1427089 1427419 1435592
Environment:
Last Closed: 2017-03-06 17:41:22 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Comment 1 Sunil Kumar Acharya 2017-01-03 12:06:07 UTC
Description of problem:
=======================
When any brick of an EC volume goes down and comes back up while IO is in progress, the warning messages below appear in the self-heal daemon log (shd log). The end user cannot tell which subvolumes are affected because the subvolumes are printed as hexadecimal values, and decoding them requires a fair amount of bit arithmetic (illustrated in the sketch after the log excerpt below).

These warning messages should be improved so that end users can understand them.



[2016-12-23 04:52:00.658995] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-Disperse1-disperse-0: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=3E, bad=1)
[2016-12-23 04:52:00.659085] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-Disperse1-disperse-0: Heal failed [Invalid argument]
[2016-12-23 04:52:00.812666] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-Disperse1-disperse-0: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=3E, bad=1)
[2016-12-23 04:52:00.812709] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-Disperse1-disperse-0: Heal failed [Invalid argument]
[2016-12-23 04:52:01.053575] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-Disperse1-disperse-0: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=3E, bad=1)
[2016-12-23 04:52:01.053651] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-Disperse1-disperse-0: Heal failed [Invalid argument]
[2016-12-23 04:52:01.059907] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-Disperse1-disperse-0: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=3E, bad=1)
[2016-12-23 04:52:01.059983] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-Disperse1-disperse-0: Heal failed [Invalid argument]
[2016-12-23 04:52:01.085491] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-Disperse1-disperse-0: Operation failed on some subvolumes
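
Each hexadecimal value in these messages is a bitmask with one bit per brick of the disperse set (up=3F suggests six bricks here). A minimal standalone C sketch of the decoding the end user currently has to do by hand is shown below; it is only an illustration, not GlusterFS code, and the assumption that bit 0 maps to the first brick follows the explanation given in comment 6.

/* Illustrative sketch only (not GlusterFS code): decode the "bad" mask
 * from the warning above, assuming bit 0 maps to the first brick of a
 * six-brick (e.g. 4+2) disperse set. */
#include <stdio.h>

int main(void)
{
    unsigned int bad = 0x1; /* "bad=1" from the log above */
    int bricks = 6;         /* up=3F implies six bricks */
    int i;

    for (i = 0; i < bricks; i++)
        if (bad & (1u << i))
            printf("brick %d is bad\n", i + 1); /* prints: brick 1 is bad */
    return 0;
}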


Version-Release number of selected component (if applicable):
=============================================================

 
How reproducible:
=================
Always


Steps to Reproduce:
===================
1. Have a basic recommended EC volume setup.
2. Fuse mount the volume.
3. Bring one brick down and start IO on the mount point.
4. After some IO has happened, bring the offline brick back up using volume start force.
5. Check the self-heal daemon logs for the warning messages mentioned above.

Actual results:
===============
Warning messages logged when an offline EC volume brick comes back up are difficult for end users to understand.

Expected results:
=================
The warning messages logged when an offline EC volume brick comes back up should be improved so that end users can understand them.

Comment 2 Worker Ant 2017-01-03 12:07:07 UTC
REVIEW: http://review.gluster.org/16315 (cluster/ec: Fixing log message) posted (#2) for review on master by Anonymous Coward

Comment 3 Worker Ant 2017-01-05 09:48:19 UTC
REVIEW: http://review.gluster.org/16315 (cluster/ec: Fixing log message) posted (#3) for review on master by Sunil Kumar Acharya

Comment 4 Worker Ant 2017-01-09 07:34:44 UTC
COMMIT: http://review.gluster.org/16315 committed in master by Xavier Hernandez (xhernandez) 
------
commit cc55be619830bc64544a1044f05367b8be9421bc
Author: Sunil Kumar H G <sheggodu>
Date:   Fri Dec 30 14:11:15 2016 +0530

    cluster/ec: Fixing log message
    
    Updating the warning message with details to improve
    user understanding.
    
    BUG: 1409202
    Change-Id: I001f8d5c01c97fff1e4e1a3a84b62e17c025c520
    Signed-off-by: Sunil Kumar H G <sheggodu>
    Reviewed-on: http://review.gluster.org/16315
    Tested-by: Sunil Kumar Acharya
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Xavier Hernandez <xhernandez>

Comment 5 Chen Chen 2017-02-16 03:04:06 UTC
I'm currently facing a flood of EC errors, while "volume status" says all bricks are online. I could not fully understand what "ec_bin()" is doing, so while waiting for this patch to be built and released, could anyone point me to a document on how this is calculated?

my log excerpt:
> [2017-02-16 03:02:40.455644] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-mainvol-disperse-1: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=39, bad=6)
> [2017-02-16 03:02:40.455684] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-mainvol-disperse-1: Heal failed [Invalid argument]

Comment 6 Sunil Kumar Acharya 2017-02-16 06:01:54 UTC
ec_bin() converts the hexadecimal values shown in the message to bit flags.

For example, your log currently shows:

[2017-02-16 03:02:40.455644] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-mainvol-disperse-1: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=39, bad=6)

After applying the patch it would look something like:

[--------------------------] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-mainvol-disperse-1: Operation failed on some subvolumes (up=111111, mask=111111, remaining=000000, good=111001, bad=000110)


This indicates that the 2nd and 3rd bricks (counting flags from right to left) in 0-mainvol-disperse-1 are bad.
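
For reference, the conversion from such a mask to the bit-flag string shown above can be sketched as a small standalone C program. This is only an illustration of the idea, not the actual GlusterFS ec_bin() implementation; the six-bit width is an assumption based on up=3F.

/* Illustrative sketch only (not the actual GlusterFS ec_bin() code):
 * render a subvolume mask as a fixed-width bit-flag string. */
#include <stdio.h>

/* Write `count` bits of `mask` into `out` as '0'/'1', most significant bit first. */
static void mask_to_bits(unsigned int mask, int count, char *out)
{
    int i;

    for (i = 0; i < count; i++)
        out[i] = (mask & (1u << (count - 1 - i))) ? '1' : '0';
    out[count] = '\0';
}

int main(void)
{
    char good[33], bad[33];

    mask_to_bits(0x39, 6, good); /* good=39 from the log above */
    mask_to_bits(0x06, 6, bad);  /* bad=6 from the log above */
    printf("good=%s, bad=%s\n", good, bad); /* good=111001, bad=000110 */
    return 0;
}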

Comment 7 Chen Chen 2017-02-16 08:55:52 UTC
Thank you for your explanation.

But since my volume is a 4+2 EC volume, shouldn't it fix the bad blocks itself? Or does it mean the underlying filesystem is corrupted, so it cannot do anything other than report it?

Many thanks.

Comment 8 Sunil Kumar Acharya 2017-02-16 09:45:52 UTC
Yes, the data should get healed. Please perform a basic sanity check of the volume.

Comment 9 Shyamsundar 2017-03-06 17:41:22 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.10.0, please open a new bug report.

glusterfs-3.10.0 has been announced on the Gluster mailing lists [1], and packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-users/2017-February/030119.html
[2] https://www.gluster.org/pipermail/gluster-users/

