Description of problem:
=======================
When any of the EC volume bricks goes down and comes back up while IO is in progress, the warning messages below appear in the self-heal daemon log (shd log). The subvolume bitmasks are printed as hexadecimal values, so the end user cannot tell which subvolumes are affected without doing the bit arithmetic by hand. These warning messages need to be improved so the end user can understand them.

[2016-12-23 04:52:00.658995] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-Disperse1-disperse-0: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=3E, bad=1)
[2016-12-23 04:52:00.659085] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-Disperse1-disperse-0: Heal failed [Invalid argument]
[2016-12-23 04:52:00.812666] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-Disperse1-disperse-0: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=3E, bad=1)
[2016-12-23 04:52:00.812709] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-Disperse1-disperse-0: Heal failed [Invalid argument]
[2016-12-23 04:52:01.053575] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-Disperse1-disperse-0: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=3E, bad=1)
[2016-12-23 04:52:01.053651] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-Disperse1-disperse-0: Heal failed [Invalid argument]
[2016-12-23 04:52:01.059907] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-Disperse1-disperse-0: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=3E, bad=1)
[2016-12-23 04:52:01.059983] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-Disperse1-disperse-0: Heal failed [Invalid argument]
[2016-12-23 04:52:01.085491] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-Disperse1-disperse-0: Operation failed on some subvolumes

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.8.4-9.el6rhs.x86_64

How reproducible:
=================
Always

Steps to Reproduce:
===================
1. Set up a basic, recommended EC volume.
2. Fuse mount the volume.
3. Bring one brick down and start IO on the mount point.
4. After IO has run for some time, bring the offline brick back up with "gluster volume start force".
5. Check the self-heal daemon logs for the warning messages shown above.

Actual results:
===============
The warning messages logged when an offline EC volume brick comes back up are difficult for the end user to understand.

Expected results:
=================
Improve the warning messages logged when an offline EC volume brick comes back up so that the end user can understand them.

Additional info:
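For readers decoding the old-style hexadecimal masks by hand: on a 6-brick (4+2) disperse subvolume, up=3F is binary 111111 (all six bricks participating), good=3E is 111110, and bad=1 is 000001, i.e. exactly one brick failed the operation. The snippet below is a minimal, self-contained sketch of that conversion; it is illustrative only (not the GlusterFS code), and the 6-brick count, the helper name mask_to_bin, and the bit-to-brick ordering are assumptions.

#include <stdio.h>
#include <stdint.h>

/* Illustrative helper (not GlusterFS code): render a subvolume bitmask
 * as a binary string, e.g. 0x3E on a 6-brick volume -> "111110". */
static void mask_to_bin(uintptr_t mask, int bricks, char *out)
{
    for (int i = 0; i < bricks; i++)
        out[i] = (mask & ((uintptr_t)1 << (bricks - 1 - i))) ? '1' : '0';
    out[bricks] = '\0';
}

int main(void)
{
    /* Values taken from the warning above, assuming a 6-brick (4+2)
     * disperse subvolume. */
    uintptr_t up = 0x3F, good = 0x3E, bad = 0x1;
    int bricks = 6;
    char buf[sizeof(uintptr_t) * 8 + 1];

    mask_to_bin(up, bricks, buf);
    printf("up   = %s\n", buf);   /* 111111: all six bricks are up              */
    mask_to_bin(good, bricks, buf);
    printf("good = %s\n", buf);   /* 111110: one brick did not succeed          */
    mask_to_bin(bad, bricks, buf);
    printf("bad  = %s\n", buf);   /* 000001: the single failed brick's bit      */
    return 0;
}

The improved message seen in the verification comments below performs a similar conversion in the log line itself, so the user no longer has to translate the hex values.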
Not a blocker for 3.2; can be deferred.
The fix for the issue has been ported downstream as part of: https://code.engineering.redhat.com/gerrit/#/c/95621/
Comment 6 of bug 1409202 explains the expected output.
I am not seeing any such logs (not even the logs reported as part of the bz). Can you confirm what I am missing? Otherwise I may have to fail the bz.

[2017-05-02 09:50:11.000192] W [MSGID: 114031] [client-rpc-fops.c:1102:client3_3_getxattr_cbk] 0-ecv-client-0: remote operation failed. Path: / (00000000-0000-0000-0000-000000000001). Key: trusted.glusterfs.pathinfo [Transport endpoint is not connected]
[2017-05-02 09:50:11.000431] W [MSGID: 114031] [client-rpc-fops.c:1102:client3_3_getxattr_cbk] 0-ecv-client-0: remote operation failed. Path: / (00000000-0000-0000-0000-000000000001). Key: glusterfs.xattrop_index_gfid [Transport endpoint is not connected]
[2017-05-02 09:50:11.000470] W [MSGID: 122032] [ec-heald.c:252:ec_shd_index_sweep] 0-ecv-disperse-0: unable to get index-dir on ecv-client-0
[2017-05-02 09:50:11.005913] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (1)
[2017-05-02 09:50:12.038091] W [MSGID: 122006] [ec-combine.c:206:ec_iatt_combine] 0-ecv-disperse-0: Failed to combine iatt (inode: 9530459126316708641-9530459126316708641, links: 1-1, uid: 0-0, gid: 0-0, rdev: 0-0, size: 1883136-1884672, mode: 100644-100644)
[2017-05-02 09:50:12.431567] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (1)
[2017-05-02 09:50:14.830091] W [glusterfsd.c:1288:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dc5) [0x7f561366bdc5] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x7f5614d01ef5] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x7f5614d01d6b] ) 0-: received signum (15), shutting down
[2017-05-02 09:51:19.007132] W [MSGID: 122006] [ec-combine.c:206:ec_iatt_combine] 0-ecv-disperse-0: Failed to combine iatt (inode: 12856018975539067138-12856018975539067138, links: 1-1, uid: 0-0, gid: 0-0, rdev: 0-0, size: 872448-871936, mode: 100644-100644)
[2017-05-02 09:51:19.007270] W [MSGID: 122006] [ec-combine.c:206:ec_iatt_combine] 0-ecv-disperse-0: Failed to combine iatt (inode: 12856018975539067138-12856018975539067138, links: 1-1, uid: 0-0, gid: 0-0, rdev: 0-0, size: 872448-871936, mode: 100644-100644)
[2017-05-02 09:51:19.007704] W [MSGID: 122006] [ec-combine.c:206:ec_iatt_combine] 0-ecv-disperse-0: Failed to combine iatt (inode: 12856018975539067138-12856018975539067138, link
I am not sure what was missed, but I am able to see the message in the mount log by following the steps given above for recreating the issue.

*************************
/var/log/glusterfs/LAB-fuse_mounts-mount.log:[2017-05-02 10:50:17.040814] W [MSGID: 122053] [ec-common.c:161:ec_check_status] 0-ec-vol-disperse-0: Operation failed on 1 of 3 subvolumes.(up=111, mask=011, remaining=000, good=011, bad=100)
/var/log/glusterfs/LAB-fuse_mounts-mount.log:The message "W [MSGID: 122053] [ec-common.c:161:ec_check_status] 0-ec-vol-disperse-0: Operation failed on 1 of 3 subvolumes.(up=111, mask=011, remaining=000, good=011, bad=100)" repeated 27 times between [2017-05-02 10:50:17.040814] and [2017-05-02 10:50:17.263437]
*************************
on_qa validation: Moving to verified as the fix works.

Note: The logs are seen in the mount logs rather than in the shd logs mentioned in the description of this bz (https://bugzilla.redhat.com/show_bug.cgi?id=1408361#c0).

Before the fix (checked on the 3.2 GA build, 3.8.4-18):
[2017-05-02 11:56:33.808629] W [MSGID: 122053] [ec-common.c:154:ec_check_status] 0-ec-disperse-0: Operation failed on some subvolumes (up=3F, mask=3E, remaining=0, good=3E, bad=1)
[2017-05-02 11:56:35.324088] I [MSGID: 122058] [ec-heal.c:2380:ec_heal_do] 0-ec-disperse-0: /dir1/file.32: name heal successful on 3F
The message "I [MSGID: 122058] [ec-heal.c:2380:ec_heal_do] 0-ec-disperse-0: /dir1/file.32: name heal successful on 3F" repeated 68 times between [2017-05-02 11:56:35.324088] and [2017-05-02 11:56:36.178563]
[2017-05-02 11:56:36.180624] W [MSGID: 122053] [ec-common.c:154:ec_check_status] 0-ec-disperse-0: Operation failed on some subvolumes (up=3F, mask=3E, remaining=0, good=3E, bad=1)
[2017-05-02 11:56:36.193575] I [MSGID: 122058] [ec-heal.c:2380:ec_heal_do] 0-ec-disperse-0: /dir1/file.32: name heal successful on 3F

With the fix (3.8.4-24), the logs are more meaningful, as mentioned in comment#9:
==================
2: name heal successful on 3FF
The message "I [MSGID: 122058] [ec-heal.c:2380:ec_heal_do] 0-ecv-disperse-0: /dir3/filo.22: name heal successful on 3FF" repeated 4 times between [2017-05-02 09:49:04.591874] and [2017-05-02 09:49:04.654420]
[2017-05-02 09:49:04.665916] W [MSGID: 122006] [ec-combine.c:206:ec_iatt_combine] 0-ecv-disperse-0: Failed to combine iatt (inode: 12307727039240654507-12307727039240654507, links: 1-1, uid: 0-0, gid: 0-0, rdev: 0-0, size: 0-869376, mode: 100644-100644)
[2017-05-02 09:49:04.666009] N [MSGID: 122029] [ec-combine.c:93:ec_combine_write] 0-ecv-disperse-0: Mismatching iatt in answers of 'WRITE'
[2017-05-02 09:49:04.666481] W [MSGID: 122053] [ec-common.c:161:ec_check_status] 0-ecv-disperse-0: Operation failed on 1 of 10 subvolumes.(up=1111111111, mask=1111111111, remaining=0000000000, good=1111111110, bad=0000000001)
[2017-05-02 09:49:04.668391] I [MSGID: 122058] [ec-heal.c:2380:ec_heal_do] 0-ecv-disperse-0: /dir3/filo.22: name heal successful on 3FF
The message "I [MSGID: 122058] [ec-heal.c:2380:ec_heal_do] 0-ecv-disperse-0: /dir3/filo.22: name heal successful on 3FF" repeated 108 times between [2017-05-02 09:49:04.668391] and [2017-05-02 09:49:06.947593]
[2017-05-02 09:49:06.960188] W [MSGID: 122053] [ec-common.c:161:ec_check_status] 0-ecv-disperse-0: Operation failed on 1 of 10 subvolumes.(up=1111111111, mask=1111111110, remaining=0000000000, good=1111111110, bad=0000000001)
[2017-05-02 09:49:06.992557] I [MSGID: 122058] [ec-heal.c:2380:ec_heal_do] 0-ecv-disperse-0: /dir3/filo.22: name heal successful on 3FF
The message "I [MSGID: 122058] [ec-heal.c:2380:ec_heal_do] 0-ecv-disperse-0: /dir3/filo.22: name heal successful on 3FF" repeated 3 times between [2017-05-02 09:49:06.992557] and [2017-05-02 09:49:07.057389]
[2017-05-02 09:49:07.080095] W [MSGID: 122053] [ec-common.c:161:ec_check_status] 0-ecv-disperse-0: Operation failed on 1 of 10 subvolumes.(up=1111111111, mask=1111111110, remaining=0000000000, good=1111111110, bad=0000000001)
[2017-05-02 09:49:07.083458] I [MSGID: 122058] [ec-heal.c:2380:ec_heal_do] 0-ecv-disperse-0: /dir3/filo.22: name heal successful on 3FF
The message "W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (1)" repeated 61 times between [2017-05-02 09:48:25.989120] and [2017-05-02 09:49:03.021721]
The message "I [MSGID: 122058] [ec-heal.c:2380:ec_heal_do] 0-ecv-disperse-0: /dir3/filo.22: name heal successful on 3FF" repeated 15 times between [2017-05-02 09:49:07.083458] and [2017-05-02 09:49:07.401948]
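As a side note on the improved format: the "Operation failed on 1 of 10 subvolumes" count can be obtained by counting the set bits in the bad mask. The snippet below is a minimal, hypothetical sketch of that calculation (not the actual fix); the mask values and the 10-brick total are taken from the verification logs above, and the helper name count_bits is an assumption.

#include <stdio.h>
#include <stdint.h>

/* Illustrative only: count set bits in a subvolume mask to report
 * "failed on X of Y subvolumes", in the spirit of the improved
 * ec_check_status message. */
static int count_bits(uintptr_t mask)
{
    int n = 0;
    while (mask) {
        n += mask & 1;
        mask >>= 1;
    }
    return n;
}

int main(void)
{
    uintptr_t good = 0x3FE;  /* 1111111110: nine good subvolumes (assumed 10-brick volume) */
    uintptr_t bad  = 0x001;  /* 0000000001: one failed subvolume                            */
    int total = 10;

    printf("Operation failed on %d of %d subvolumes\n", count_bits(bad), total);
    printf("good subvolumes: %d\n", count_bits(good));
    return 0;
}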
*** Bug 1435592 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2774