Bug 1132839 - Stopping or restarting glusterd on another node when volume start is in progress gives error messages but volume is started
Summary: Stopping or restarting glusterd on another node when volume start is in progress gives error messages but volume is started
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: glusterd
Version: rhgs-3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Atin Mukherjee
QA Contact: storage-qa-internal@redhat.com
URL:
Whiteboard:
Depends On:
Blocks: 1146902
 
Reported: 2014-08-22 07:38 UTC by senaik
Modified: 2023-09-14 02:46 UTC
CC List: 6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Cloned to: 1146902
Environment:
Last Closed: 2015-04-13 04:43:12 UTC
Embargoed:


Attachments:

Description senaik 2014-08-22 07:38:46 UTC
Description of problem:
=======================
Scenario 1: When volume start is in progress and glusterd is immediately brought down on another node, volume start fails with the error message "Commit failed", but the volume is started.

Scenario 2: When volume start is in progress and glusterd is immediately restarted on another node, volume start gives no output (return code 146), but the volume is started.


Version-Release number of selected component (if applicable):
============================================================
glusterfs 3.6.0.27 


How reproducible:
=================
3/3

Steps to Reproduce:
===================
Scenario 1: 
~~~~~~~~~~~
1. Create a 2x2 distributed-replicate volume and start it.
2. Stop the volume.
3. Start the volume; while the start is in progress, immediately bring down glusterd on one or more nodes. Volume start fails with the error below (a command-level sketch follows the volume info output):

gluster v start vol2
volume start: vol2: failed: Commit failed on 00000000-0000-0000-0000-000000000000. Please check log file for details.
Commit failed on 00000000-0000-0000-0000-000000000000. Please check log file for details.

But volume info shows the status as Started:

 gluster v i vol2
 
Volume Name: vol2
Type: Distributed-Replicate
Volume ID: 98eb6a90-fbe5-4512-b560-d299579135d5
Status: Started
Snap Volume: no
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: snapshot13.lab.eng.blr.redhat.com:/var/run/gluster/snaps/0bb578bc09154daab9c14afdc5e7f628/brick1/b3
Brick2: snapshot14.lab.eng.blr.redhat.com:/var/run/gluster/snaps/0bb578bc09154daab9c14afdc5e7f628/brick2/b3
Brick3: snapshot15.lab.eng.blr.redhat.com:/var/run/gluster/snaps/0bb578bc09154daab9c14afdc5e7f628/brick3/b3
Brick4: snapshot16.lab.eng.blr.redhat.com:/var/run/gluster/snaps/0bb578bc09154daab9c14afdc5e7f628/brick4/b3
Options Reconfigured:
performance.readdir-ahead: on
auto-delete: disable
snap-max-soft-limit: 90
snap-max-hard-limit: 256
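
A minimal command-level sketch of the reproduction above (node names and brick paths are placeholders, not the ones from this setup; use service or systemctl depending on the platform):

# Scenario 1 reproduction sketch -- placeholder node names and brick paths
gluster volume create vol2 replica 2 \
    node1:/bricks/b1 node2:/bricks/b2 node3:/bricks/b3 node4:/bricks/b4
gluster volume start vol2
gluster volume stop vol2

# Bring glusterd down on a peer a moment after the start is issued
( sleep 1; ssh node2 'service glusterd stop' ) &    # or: systemctl stop glusterd
gluster volume start vol2         # fails with "Commit failed on <uuid>"
gluster volume info vol2          # Status: Started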


Scenario 2 : 
~~~~~~~~~~~
1) Stop the volume.
2) Start the volume; while the start is in progress, restart glusterd on another node. Volume start gives no output and the return code is 146.
Volume status is shown as 'Started' (a command sketch follows the volume info output below).

[root@snapshot13 /]# gluster v start vol2
[root@snapshot13 /]# echo $?
146


[root@snapshot13 /]# gluster v i vol2
 
Volume Name: vol2
Type: Distributed-Replicate
Volume ID: 98eb6a90-fbe5-4512-b560-d299579135d5
Status: Started
Snap Volume: no
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: snapshot13.lab.eng.blr.redhat.com:/var/run/gluster/snaps/0bb578bc09154daab9c14afdc5e7f628/brick1/b3
Brick2: snapshot14.lab.eng.blr.redhat.com:/var/run/gluster/snaps/0bb578bc09154daab9c14afdc5e7f628/brick2/b3
Brick3: snapshot15.lab.eng.blr.redhat.com:/var/run/gluster/snaps/0bb578bc09154daab9c14afdc5e7f628/brick3/b3
Brick4: snapshot16.lab.eng.blr.redhat.com:/var/run/gluster/snaps/0bb578bc09154daab9c14afdc5e7f628/brick4/b3
Options Reconfigured:
performance.readdir-ahead: on
auto-delete: disable
snap-max-soft-limit: 90
snap-max-hard-limit: 256
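
A similar sketch for this scenario, capturing the CLI exit status (node names are again placeholders; use service or systemctl depending on the platform):

# Scenario 2 reproduction sketch -- placeholder node names
gluster volume stop vol2

# Restart glusterd on a peer a moment after the start is issued
( sleep 1; ssh node2 'service glusterd restart' ) &    # or: systemctl restart glusterd
gluster volume start vol2         # prints nothing in this scenario
echo $?                           # observed here as 146
gluster volume info vol2          # Status: Started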



Actual results:
===============
Volume start either fails with an error message (Scenario 1) or returns no output with exit code 146 (Scenario 2), even though the volume is started.


Expected results:
=================
No error message should be seen when the volume is started successfully.


Additional info:

Comment 3 Atin Mukherjee 2014-08-23 08:18:10 UTC
RCA for Scenario 1
==================

In the glusterd commit op phase, if any of the commit ops fails, op_ret is set to non-zero and a negative response is sent back to the CLI. However, in this case the local commit was successful, which is why the status was changed to "Started". Will investigate further to determine the feasibility of returning a positive response to the CLI if the local commit succeeds even when any of the remote commit ops fails.
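
Since the CLI only says "Please check log file for details", the failed remote commit can be confirmed from the glusterd log on the node where the command was run; a minimal sketch, assuming the default glusterd log location:

grep -i 'commit' /var/log/glusterfs/etc-glusterfs-glusterd.vol.log | tail -n 20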

RCA for Scenario 2 will be shared soon.

Comment 4 Atin Mukherjee 2014-08-25 05:02:07 UTC
Seema, 

Could you please point out on which node (IP) the command was executed for scenario 2 and on which node (IP) glusterd was restarted?

~Atin

Comment 5 senaik 2014-08-25 10:37:02 UTC
The command was executed on 10.70.40.169 and glusterd was restarted on 10.70.40.170 initially.

I have also retried the glusterd restart on other nodes while trying to reproduce the issue.

Comment 6 Atin Mukherjee 2014-08-26 09:31:52 UTC
For scenario 2, it seems the logs for snapshot14 are missing; when I untar snapshot14_sosreport-qaredhat.com-20140822130530-91fd.tar.xz, only snapshot13, 15 & 16 are present.

Comment 8 Atin Mukherjee 2015-04-01 05:41:56 UTC
(In reply to Atin Mukherjee from comment #3)
> RCA for Scenario 1
> ==================
> 
> In the glusterd commit op phase, if any of the commit ops fails, op_ret is
> set to non-zero and a negative response is sent back to the CLI. However, in
> this case the local commit was successful, which is why the status was
> changed to "Started". Will investigate further to determine the feasibility
> of returning a positive response to the CLI if the local commit succeeds
> even when any of the remote commit ops fails.
> 
> RCA for Scenario 2 will be shared soon.

Seema,

For scenario 1, this is expected as per the design. Currently we do not have any rollback mechanism for a failed transaction. In this case, after the local commit was successful, one of the remote commits failed because the glusterd instance on a remote node was brought down. This resulted in the volume status moving to Started on some nodes but not on all nodes. Even moving the local commit to after all the remote commits are completed may end up in a similar situation, as the local commit might also fail. We would not be able to fix this problem given the design limitation.
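
The divergence described above can be confirmed by querying each peer's local view of the volume status; a small sketch, with placeholder node names:

for node in node1 node2 node3 node4; do
    echo "== $node =="
    ssh "$node" "gluster volume info vol2 | grep '^Status'"
done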

For scenario 2, I still don't see snapshot 14 logs in http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/snapshots/1132839/

Could you double check?

Comment 9 Atin Mukherjee 2015-04-01 12:24:45 UTC
I have checked all the logs now and there is no evidence of the volume start getting timed out. Do you remember the timestamp of this issue? I know it's quite difficult to recollect the information as the BZ is quite old, but without that it's pretty difficult to figure out, as I don't see anything abnormal in the logs.

Comment 10 Atin Mukherjee 2015-04-13 04:43:12 UTC
(In reply to Atin Mukherjee from comment #9)
> I have checked all the logs now and there is no evidence of the volume start
> getting timed out. Do you remember the timestamp of this issue? I know it's
> quite difficult to recollect the information as the BZ is quite old, but
> without that it's pretty difficult to figure out, as I don't see anything
> abnormal in the logs.

Closing this bug as I have not received any response from the reporter. Kindly re-open if the issue persists.

Comment 11 Red Hat Bugzilla 2023-09-14 02:46:09 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

