Bug 1421607 - Getting error messages in glusterd.log when peer detach is done
Summary: Getting error messages in glusterd.log when peer detach is done
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: mainline
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Gaurav Yadav
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 1383979
 
Reported: 2017-02-13 09:05 UTC by Gaurav Yadav
Modified: 2017-05-30 18:42 UTC
CC List: 10 users

Fixed In Version: glusterfs-3.11.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1383979
Environment:
Last Closed: 2017-05-30 18:42:26 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Gaurav Yadav 2017-02-13 09:05:02 UTC
+++ This bug was initially created as a clone of Bug #1383979 +++

Description of problem:
=======================
The following error messages appear in the glusterd log when a cluster node is detached (deprobed).

[2016-10-12 07:31:07.381464] E [MSGID: 106029] [glusterd-utils.c:7767:glusterd_check_files_identical] 0-management: stat on file: /var/lib/glusterd/nfs/nfs-server.vol failed (No such file or directory) [No such file or directory]
[2016-10-12 07:31:07.381736] E [MSGID: 106570] [glusterd-utils.c:7196:glusterd_friend_remove_cleanup_vols] 0-management: Failed to reconfigure all daemon services.



Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.8.4-2


How reproducible:
=================
Always


Steps to Reproduce:
===================
1. Create a two-node cluster (n1 and n2) using the 3.8.4-2 build.
2. Detach node n2 from n1:  n1# gluster peer detach n2
3. Check for error messages in glusterd.log on n1.

Actual results:
===============
Getting error messages in glusterd.log when peer detach is done


Expected results:
=================
No error messages should be logged.


Additional info:

--- Additional comment from Red Hat Bugzilla Rules Engine on 2016-10-12 05:49:09 EDT ---

This bug is automatically being proposed for the current release of Red Hat Gluster Storage 3 under active development, by setting the release flag 'rhgs-3.2.0' to '?'.

If this bug should be proposed for a different release, please manually change the proposed release flag.

--- Additional comment from Atin Mukherjee on 2016-10-12 05:55:38 EDT ---

Root cause is at http://post-office.corp.redhat.com/archives/a-team/2016-October/msg00043.html; in short, disabling the gNFS service in rhgs-3.2.0 has caused these additional log entries.

Given there is no functional impact, I'd like to move it beyond 3.2.0. However, I would leave it up to the NFS-Ganesha team to decide.

--- Additional comment from Niels de Vos on 2016-10-12 06:27:06 EDT ---

I assume there is a Glusto or other regression test case for this. Please point to its location or attach it to this BZ. Thanks!

--- Additional comment from Niels de Vos on 2016-10-12 06:31:16 EDT ---

Also, could you mention how urgent this is? Severity is set to "high", but priority is set to "undefined". If this problem causes the log to grow rapidly, we should fix it soon, otherwise we'll move it out to a later update.

--- Additional comment from Byreddy on 2016-10-12 06:49:38 EDT ---

(In reply to Niels de Vos from comment #3)
> I assume there is a Glusto or other regression test case for this. Please
> point to its location or attach it to this BZ. Thanks!

The bug ID of the test case which caught this issue is https://bugzilla.redhat.com/show_bug.cgi?id=1246946

--- Additional comment from Byreddy on 2016-10-12 06:50:38 EDT ---

Regression test case in the polarion.

https://polarion.engineering.redhat.com/polarion/#/project/RHG3-9817

--- Additional comment from Byreddy on 2016-10-12 06:53:04 EDT ---

Sorry, I cleared the needinfo of others; I will set it back.

--- Additional comment from Byreddy on 2016-10-12 07:05:30 EDT ---

(In reply to Niels de Vos from comment #4)
> Also, could you mention how urgent this is? Severity is set to "high", but
> priority is set to "undefined". If this problem causes the log to grow
> rapidly, we should fix it soon, otherwise we'll move it out to a later
> update.

This is not urgent, it is not a blocker, and there is no functionality loss. But we have a regression test case (mentioned above) which will be marked as failed during the regression cycle, and the Regression keyword will be added to this bug.

Also, these error messages are not continuous; every peer detach operation throws those two error messages.

--- Additional comment from Kaleb KEITHLEY on 2016-10-12 07:33:23 EDT ---

I'm not really sure how changing the default for starting gNFS or not would have anything to do with peer probing or related log messages.

Can you elaborate?

--- Additional comment from Atin Mukherjee on 2016-10-12 07:37:08 EDT ---

(In reply to Byreddy from comment #8)
> (In reply to Niels de Vos from comment #4)
> > Also, could you mention how urgent this is? Severity is set to "high", but
> > priority is set to "undefined". If this problem causes the log to grow
> > rapidly, we should fix it soon, otherwise we'll move it out to a later
> > update.
> 
> This is not urgent and it's not blocker and no functionality loss But we
> have regression test case ( mentioned above) which will be marked as failed
> during regression cycle and Regression keyword will be added to this bug.

I disagree! Why would you want to mark a test as failed when the test has actually passed? A test case cannot be failed, and the Regression keyword cannot be used, just on the basis of a couple of error entries in the log, IMO.

Rahul - please chime in with your thoughts.
> 
> and these error messages are not continuous, for every peer detach
> operation, it will throw those two error messages.

--- Additional comment from Atin Mukherjee on 2016-10-12 07:41:02 EDT ---

(In reply to Kaleb KEITHLEY from comment #9)
> I'm not really sure how changing the default for starting gNFS or not would
> have anything to do with peer probing or related log messages.
> 
> Can you elaborate?

glusterd_friend_remove() ==> glusterd_friend_remove_cleanup_vols() ==> glusterd_svcs_reconfigure() ==> glusterd_nfssvc_reconfigure(), where the last function is called unconditionally (it should be called only if gNFS is active).

And this is in the peer detach code path.
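
To make the failing path concrete, below is a minimal, self-contained C model of the behaviour Atin describes (not glusterd code): the function names are shortened stand-ins for glusterd_check_files_identical() and glusterd_nfssvc_reconfigure(), the signatures are simplified, and the nfs-server.vol path is the one named in the fix's commit message further down. When no volume has ever been started, the volfile does not exist, the stat() fails with ENOENT, and the two errors from the bug description are logged.

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Stand-in for glusterd_check_files_identical(): it stat()s the running
 * volfile before comparing it with a freshly generated one. */
static int
check_files_identical (const char *volfile)
{
        struct stat st;

        if (stat (volfile, &st) != 0) {
                fprintf (stderr, "E [check_files_identical] "
                         "stat on file: %s failed (%s)\n",
                         volfile, strerror (errno));
                return -1;
        }
        return 0;
}

/* Stand-in for glusterd_nfssvc_reconfigure(), reached unconditionally on
 * peer detach via glusterd_friend_remove() ->
 * glusterd_friend_remove_cleanup_vols() -> glusterd_svcs_reconfigure(),
 * even when gNFS never ran. */
static int
nfssvc_reconfigure (void)
{
        return check_files_identical ("/var/lib/glusterd/nfs/nfs-server.vol");
}

int
main (void)
{
        if (nfssvc_reconfigure () != 0)
                fprintf (stderr, "E [friend_remove_cleanup_vols] "
                         "Failed to reconfigure all daemon services.\n");
        return 0;
}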

--- Additional comment from Byreddy on 2016-10-12 07:50:10 EDT ---

(In reply to Atin Mukherjee from comment #10)
> (In reply to Byreddy from comment #8)
> > (In reply to Niels de Vos from comment #4)
> > > Also, could you mention how urgent this is? Severity is set to "high", but
> > > priority is set to "undefined". If this problem causes the log to grow
> > > rapidly, we should fix it soon, otherwise we'll move it out to a later
> > > update.
> > 
> > This is not urgent and it's not blocker and no functionality loss But we
> > have regression test case ( mentioned above) which will be marked as failed
> > during regression cycle and Regression keyword will be added to this bug.
> 
> I disagree! Why would you want to mark a test failed given the test has
> actually passed? On the basis of having couple of error entries in the log a
> test case can not be failed and regression keyword can not be used IMO.
> 

As per the test case, peer detach should not produce any error messages, but currently it throws errors. This issue was not there in the last GA release, so from my side it is a regression.


> Rahul - please chime in with your thoughts.
> > 
> > and these error messages are not continuous, for every peer detach
> > operation, it will throw those two error messages.

--- Additional comment from Rahul Hinduja on 2016-10-12 07:51:10 EDT ---

> 
> Rahul - please chime in with your thoughts.


Agreed, the functionality of detach works, but with errors in the log. If the test case were only to check the functionality, it should be marked as Passed. But this particular case, RHG3-9817, is actually to check for errors and not just the functionality. Hence this particular case will be marked failed in the test run.

Regarding the Regression keyword: if, as engineering, we take a call to move this BZ to 3.2.z, then I am OK to ignore the Regression keyword for this special case, as it deals only with logs and has no functional impact.

> > 
> > and these error messages are not continuous, for every peer detach
> > operation, it will throw those two error messages.

--- Additional comment from Niels de Vos on 2016-10-12 08:18:03 EDT ---

(In reply to Byreddy from comment #6)
> Regression test case in the polarion.
> 
> https://polarion.engineering.redhat.com/polarion/#/project/RHG3-9817

This asks for a login, and I do not seem to have one (or at least none of the Kerberos-based ones work).

Can you maybe point to the script itself, in some git repository?

--- Additional comment from Byreddy on 2016-10-13 00:33 EDT ---



--- Additional comment from Byreddy on 2016-10-13 00:36:31 EDT ---

(In reply to Niels de Vos from comment #14)
> (In reply to Byreddy from comment #6)
> > Regression test case in the polarion.
> > 
> > https://polarion.engineering.redhat.com/polarion/#/project/RHG3-9817
> 
> This asks for a login, and I do not seem to have one (or at least non of the
> Kerberos based ones work).
> 
> Can you maybe point to the script itself, in some git repository?

The test case is not scripted; only the test steps are there. They are attached here:
https://bugzilla.redhat.com/attachment.cgi?id=1209905

--- Additional comment from Niels de Vos on 2016-10-13 04:37:20 EDT ---

(In reply to Byreddy from comment #16)
> Test case is not scripted, only test steps are there. it's attached here
> https://bugzilla.redhat.com/attachment.cgi?id=1209905

If this is not automated, I do not think there is an urgent need to get this addressed.

When developers write a fix, they need to detect the problem with an automated test-case. It would be most welcome if QE can provide such a script that makes it easier/faster to develop+test+verify any needed patches.

--- Additional comment from Byreddy on 2016-10-13 06:44:29 EDT ---

(In reply to Niels de Vos from comment #17)
> (In reply to Byreddy from comment #16)
> > Test case is not scripted, only test steps are there. it's attached here
> > https://bugzilla.redhat.com/attachment.cgi?id=1209905
> 
> If this is not automated, I do not think there is an urgent need to get this
> addressed.
> 
> When developers write a fix, they need to detect the problem with an
> automated test-case. It would be most welcome if QE can provide such a
> script that makes it easier/faster to develop+test+verify any needed patches.

This is not urgent to fix.

--- Additional comment from Red Hat Bugzilla Rules Engine on 2017-01-01 12:25:28 EST ---

This BZ having been considered, and subsequently not approved to be fixed at the RHGS 3.2.0 release, is being proposed for the next release of RHGS

--- Additional comment from Atin Mukherjee on 2017-02-09 00:22:10 EST ---

Gaurav - Can you start looking into this issue? I'd like to get this fixed in the next release.

Comment 1 Worker Ant 2017-02-13 10:19:16 UTC
REVIEW: https://review.gluster.org/16607 (    glusterd : Fix for error mesage while detaching peers) posted (#1) for review on master by Gaurav Yadav (gyadav)

Comment 2 Worker Ant 2017-02-13 17:59:35 UTC
REVIEW: https://review.gluster.org/16607 (glusterd : Fix for error mesage while detaching peers) posted (#2) for review on master by Gaurav Yadav (gyadav)

Comment 3 Worker Ant 2017-02-14 01:42:14 UTC
COMMIT: https://review.gluster.org/16607 committed in master by Atin Mukherjee (amukherj) 
------
commit be44a1bd519af69b21acf682b0908d4d695f868e
Author: Gaurav Yadav <gyadav>
Date:   Mon Feb 13 15:46:24 2017 +0530

    glusterd : Fix for error mesage while detaching peers
    
    When peer is detached from a cluster, an error log is being
    generated in glusterd.log -"Failed to reconfigure all daemon
    services". This log is seen in the originator node where the
    detach is issued.
    
    This happens in two cases.
    Case 1: Detach peer with no volume been created in cluster.
    Case 2: Detach peer after deleting all the volumes which were
    created but never started.
    In any one of the above two cases, in glusterd_check_files_identical()
    GlusterD fails to retrieve nfs-server.vol file from /var/lib/glusterd/nfs
    which gets created only when a volume is in place and is started.
    
    With this fix both the above cases have been handled by adding
    validation to skip the reconfigure if there is no volume in the
    started state.
    
    Change-Id: I039c0840e3d61ab54575e1e00c6a6a00874d84c0
    BUG: 1421607
    Signed-off-by: Gaurav Yadav <gyadav>
    Reviewed-on: https://review.gluster.org/16607
    Smoke: Gluster Build System <jenkins.org>
    Tested-by: Atin Mukherjee <amukherj>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Atin Mukherjee <amukherj>
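
As a rough illustration of the validation described in the commit (a self-contained model, not the actual patch): the types and helper names below are stand-ins for glusterd's volinfo list and its "started" status flag, and only the control flow, skipping the reconfigure when no volume is in the started state, mirrors the fix.

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

enum vol_status { VOL_CREATED, VOL_STARTED, VOL_STOPPED };

struct volinfo {
        const char      *name;
        enum vol_status  status;
};

/* Stand-in for glusterd_svcs_reconfigure(), which regenerates and
 * compares volfiles such as /var/lib/glusterd/nfs/nfs-server.vol. */
static int
svcs_reconfigure (void)
{
        puts ("reconfiguring daemon services");
        return 0;
}

static bool
any_volume_started (const struct volinfo *vols, size_t n)
{
        for (size_t i = 0; i < n; i++)
                if (vols[i].status == VOL_STARTED)
                        return true;
        return false;
}

/* Models glusterd_friend_remove_cleanup_vols() after the fix: the
 * reconfigure is skipped unless at least one volume is started, so
 * cases 1 and 2 above no longer stat a volfile that never existed. */
static int
friend_remove_cleanup_vols (const struct volinfo *vols, size_t n)
{
        if (!any_volume_started (vols, n)) {
                puts ("no started volume: skipping daemon reconfigure");
                return 0;
        }
        return svcs_reconfigure ();
}

int
main (void)
{
        struct volinfo vols[] = { { "vol0", VOL_CREATED } };

        friend_remove_cleanup_vols (vols, 1);   /* skipped              */
        vols[0].status = VOL_STARTED;
        friend_remove_cleanup_vols (vols, 1);   /* reconfigure proceeds */
        return 0;
}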

Comment 4 Worker Ant 2017-02-16 13:01:59 UTC
REVIEW: https://review.gluster.org/16645 (glusterd : Fix for error mesage while detaching peers) posted (#1) for review on release-3.10 by Gaurav Yadav (gyadav)

Comment 5 Shyamsundar 2017-05-30 18:42:26 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.11.0, please open a new bug report.

glusterfs-3.11.0 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-May/000073.html
[2] https://www.gluster.org/pipermail/gluster-users/

