Bug 958758 - offlined brick process on server1 automatically starts when other server3 in cluster is powered off
Summary: offlined brick process on server1 automatically starts when other server3 in cluster is powered off
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: glusterd
Version: 2.1
Hardware: x86_64
OS: Linux
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Pranith Kumar K
QA Contact: Rahul Hinduja
URL:
Whiteboard:
Depends On:
Blocks: 959986
 
Reported: 2013-05-02 11:05 UTC by Rahul Hinduja
Modified: 2013-09-23 22:43 UTC
CC List: 4 users

Fixed In Version: glusterfs-3.4.0.4rhs-1
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 959986
Environment:
Last Closed: 2013-09-23 22:39:36 UTC
Embargoed:



Description Rahul Hinduja 2013-05-02 11:05:47 UTC
Description of problem:
=======================

An offlined brick process on server1 automatically starts when another server in the cluster (server3) is powered off.

Version-Release number of selected component (if applicable):
=============================================================

[root@rhs-client11 ~]# rpm -qa | grep gluster
glusterfs-debuginfo-3.4.0.1rhs-1.el6rhs.x86_64
glusterfs-fuse-3.4.0.1rhs-1.el6rhs.x86_64
gluster-swift-container-1.4.8-4.el6.noarch
gluster-swift-1.4.8-4.el6.noarch
gluster-swift-doc-1.4.8-4.el6.noarch
vdsm-gluster-4.10.2-4.0.qa5.el6rhs.noarch
gluster-swift-plugin-1.0-5.noarch
gluster-swift-proxy-1.4.8-4.el6.noarch
gluster-swift-account-1.4.8-4.el6.noarch
glusterfs-geo-replication-3.4.0.1rhs-1.el6rhs.x86_64
org.apache.hadoop.fs.glusterfs-glusterfs-0.20.2_0.2-1.noarch
glusterfs-3.4.0.1rhs-1.el6rhs.x86_64
glusterfs-server-3.4.0.1rhs-1.el6rhs.x86_64
glusterfs-rdma-3.4.0.1rhs-1.el6rhs.x86_64
gluster-swift-object-1.4.8-4.el6.noarch
[root@rhs-client11 ~]# 



How reproducible:
=================

1/1


Steps to Reproduce:
===================
1. Create a 2x2 distributed-replicate volume spanning 4 storage servers (server1 to server4)
2. On server1, take the brick process offline by killing its PID
3. Power off server3
4. Observe that the brick process on server1 is online again (see the command sketch below)
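
The same steps as a command sketch (server names, volume name, and brick paths are illustrative; the gluster CLI invocations are standard):

# 1. From any server: create and start a 2x2 distributed-replicate volume
gluster volume create dist-rep replica 2 \
    server1:/rhs/brick1/dr1 server2:/rhs/brick1/dr2 \
    server3:/rhs/brick1/dr3 server4:/rhs/brick1/dr4
gluster volume start dist-rep

# 2. On server1: note the Pid column for the local brick, then kill it
gluster volume status dist-rep
kill -9 <pid>

# 3. On server3: power the machine off
poweroff

# 4. On server1: with this bug, the killed brick shows Online=Y again
gluster volume status dist-rep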
  
Actual results:
===============

The brick process on server1 that was killed in step 2 came back online, with a new PID, when server3 was powered off.

Log snippets, step by step:

1. Volume type ("gluster v i" is shorthand for "gluster volume info"):

[root@rhs-client11 ~]# gluster v i dist-rep
 
Volume Name: dist-rep
Type: Distributed-Replicate
Volume ID: 0ff921a8-a01e-4b20-b9aa-c2c612b87a46
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.36.35:/rhs/brick1/dr1
Brick2: 10.70.36.36:/rhs/brick1/dr2
Brick3: 10.70.36.37:/rhs/brick1/dr3
Brick4: 10.70.36.38:/rhs/brick1/dr4


2. Volume status and brick PIDs:

[root@rhs-client11 ~]# gluster volume status dist-rep
Status of volume: dist-rep
Gluster process						Port	Online	Pid
------------------------------------------------------------------------------
Brick 10.70.36.35:/rhs/brick1/dr1			49164	Y	12895
Brick 10.70.36.36:/rhs/brick1/dr2			49170	Y	7203
Brick 10.70.36.37:/rhs/brick1/dr3			49168	Y	3349
Brick 10.70.36.38:/rhs/brick1/dr4			49166	Y	1299
NFS Server on localhost					38467	Y	12688
Self-heal Daemon on localhost				N/A	Y	12701
NFS Server on e815f97d-4b98-4501-a19f-c9f1d0a9cd2e	38467	Y	1309
Self-heal Daemon on e815f97d-4b98-4501-a19f-c9f1d0a9cd2e	N/A	Y	1321
NFS Server on 91d06293-a9d4-467f-bc09-a0d929ebacac	38467	Y	7213
Self-heal Daemon on 91d06293-a9d4-467f-bc09-a0d929ebacac	N/A	Y	7225
NFS Server on 53ded8aa-05eb-4d57-a4d4-3db78bbde921	38467	Y	3601
Self-heal Daemon on 53ded8aa-05eb-4d57-a4d4-3db78bbde921	N/A	Y	3596
 
There are no active volume tasks



3. Killed the brick process on 10.70.36.35 using "kill -9 <pid>"

[root@rhs-client11 ~]# kill -9 12895
[root@rhs-client11 ~]# 
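
For reference, the same kill can be scripted by extracting the PID from the status table (the awk pattern and the use of the last field are assumptions based on the table format shown above):

# $NF is the last column of the brick's row, i.e. the Pid column
BRICK_PID=$(gluster volume status dist-rep | awk '/36.35:\/rhs\/brick1\/dr1/ {print $NF}')
kill -9 "$BRICK_PID"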

4. Volume status confirms that the brick is offline

[root@rhs-client11 ~]# gluster volume status dist-rep
Status of volume: dist-rep
Gluster process						Port	Online	Pid
------------------------------------------------------------------------------
Brick 10.70.36.35:/rhs/brick1/dr1			N/A	N	12895
Brick 10.70.36.36:/rhs/brick1/dr2			49170	Y	7203
Brick 10.70.36.37:/rhs/brick1/dr3			49168	Y	3349
Brick 10.70.36.38:/rhs/brick1/dr4			49166	Y	1299
NFS Server on localhost					38467	Y	12688
Self-heal Daemon on localhost				N/A	Y	12701
NFS Server on 53ded8aa-05eb-4d57-a4d4-3db78bbde921	38467	Y	3601
Self-heal Daemon on 53ded8aa-05eb-4d57-a4d4-3db78bbde921	N/A	Y	3596
NFS Server on 91d06293-a9d4-467f-bc09-a0d929ebacac	38467	Y	7213
Self-heal Daemon on 91d06293-a9d4-467f-bc09-a0d929ebacac	N/A	Y	7225
NFS Server on e815f97d-4b98-4501-a19f-c9f1d0a9cd2e	38467	Y	1309
Self-heal Daemon on e815f97d-4b98-4501-a19f-c9f1d0a9cd2e	N/A	Y	1321
 
There are no active volume tasks
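
One can also double-check on server1 that the killed process is really gone (PID from step 3 above):

# ps -p exits non-zero when the PID no longer exists
ps -p 12895 || echo "brick process 12895 is gone"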

5. Powered down server3 (10.70.36.37) using poweroff

6. Executed gluster volume status, which shows the killed brick on server1 online again with a new PID (13151 instead of 12895):
[root@rhs-client11 ~]# 
[root@rhs-client11 ~]# gluster volume status dist-rep
Status of volume: dist-rep
Gluster process						Port	Online	Pid
------------------------------------------------------------------------------
Brick 10.70.36.35:/rhs/brick1/dr1			49164	Y	13151
Brick 10.70.36.36:/rhs/brick1/dr2			49170	Y	7203
Brick 10.70.36.38:/rhs/brick1/dr4			49166	Y	1299
NFS Server on localhost					38467	Y	12688
Self-heal Daemon on localhost				N/A	Y	12701
NFS Server on e815f97d-4b98-4501-a19f-c9f1d0a9cd2e	38467	Y	1309
Self-heal Daemon on e815f97d-4b98-4501-a19f-c9f1d0a9cd2e	N/A	Y	1321
NFS Server on 91d06293-a9d4-467f-bc09-a0d929ebacac	38467	Y	7213
Self-heal Daemon on 91d06293-a9d4-467f-bc09-a0d929ebacac	N/A	Y	7225
 
There are no active volume tasks
[root@rhs-client11 ~]# 

Expected results:
=================

The brick process on server1 should remain offline; powering off one server must not start an intentionally offlined brick process on a different server.
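
For reference, the supported way to bring an intentionally offlined brick back is an explicit forced start of the volume, rather than glusterd restarting it implicitly:

# "start ... force" restarts any offline bricks of an already-started volume
gluster volume start dist-rep force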

Comment 4 Rahul Hinduja 2013-05-08 07:09:33 UTC
Verified with the build: glusterfs-3.4.0.4rhs-1.el6rhs.x86_64

Powering off one server no longer brings the offlined brick processes on another server in the cluster back online. Works as expected.
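
A quick way to verify is to poll the killed bricks' rows while the peer is powered off (a sketch; the volume name matches the snippet below):

# The Online column for the 10.70.36.35 bricks should stay 'N' throughout
watch -n 5 "gluster volume status vol-dis-rep | grep 10.70.36.35"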

Log snippets (first output: bricks b1, b3, and b5 on 10.70.36.35 killed while all four servers are up; second output: after powering off 10.70.36.37, the killed bricks remain offline):
============

[root@rhs-client11 ~]# gluster volume status
Status of volume: vol-dis-rep
Gluster process						Port	Online	Pid
------------------------------------------------------------------------------
Brick 10.70.36.35:/rhs/brick1/b1			N/A	N	5293
Brick 10.70.36.36:/rhs/brick1/b2			49152	Y	5269
Brick 10.70.36.35:/rhs/brick1/b3			N/A	N	5302
Brick 10.70.36.36:/rhs/brick1/b4			49153	Y	5278
Brick 10.70.36.35:/rhs/brick1/b5			N/A	N	5311
Brick 10.70.36.36:/rhs/brick1/b6			49154	Y	5287
Brick 10.70.36.37:/rhs/brick1/b7			49152	Y	5271
Brick 10.70.36.38:/rhs/brick1/b8			49152	Y	5269
Brick 10.70.36.37:/rhs/brick1/b9			49153	Y	5280
Brick 10.70.36.38:/rhs/brick1/b10			49153	Y	5278
Brick 10.70.36.37:/rhs/brick1/b11			49154	Y	5289
Brick 10.70.36.38:/rhs/brick1/b12			49154	Y	5287
NFS Server on localhost					2049	Y	5323
Self-heal Daemon on localhost				N/A	Y	5327
NFS Server on c6b5d4e9-3782-457c-8542-f32b0941ed05	2049	Y	5299
Self-heal Daemon on c6b5d4e9-3782-457c-8542-f32b0941ed05	N/A	Y	5303
NFS Server on f9cc4b9c-97e1-4f65-9657-3b050d45296e	2049	Y	5299
Self-heal Daemon on f9cc4b9c-97e1-4f65-9657-3b050d45296e	N/A	Y	5303
NFS Server on 6962d204-37c8-436b-8ea6-a9698be40ec6	2049	Y	5301
Self-heal Daemon on 6962d204-37c8-436b-8ea6-a9698be40ec6	N/A	Y	5305
 
There are no active volume tasks
[root@rhs-client11 ~]# 
[root@rhs-client11 ~]# gluster volume status
Status of volume: vol-dis-rep
Gluster process						Port	Online	Pid
------------------------------------------------------------------------------
Brick 10.70.36.35:/rhs/brick1/b1			N/A	N	5293
Brick 10.70.36.36:/rhs/brick1/b2			49152	Y	5269
Brick 10.70.36.35:/rhs/brick1/b3			N/A	N	5302
Brick 10.70.36.36:/rhs/brick1/b4			49153	Y	5278
Brick 10.70.36.35:/rhs/brick1/b5			N/A	N	5311
Brick 10.70.36.36:/rhs/brick1/b6			49154	Y	5287
Brick 10.70.36.38:/rhs/brick1/b8			49152	Y	5269
Brick 10.70.36.38:/rhs/brick1/b10			49153	Y	5278
Brick 10.70.36.38:/rhs/brick1/b12			49154	Y	5287
NFS Server on localhost					2049	Y	5323
Self-heal Daemon on localhost				N/A	Y	5327
NFS Server on c6b5d4e9-3782-457c-8542-f32b0941ed05	2049	Y	5299
Self-heal Daemon on c6b5d4e9-3782-457c-8542-f32b0941ed05	N/A	Y	5303
NFS Server on f9cc4b9c-97e1-4f65-9657-3b050d45296e	2049	Y	5299
Self-heal Daemon on f9cc4b9c-97e1-4f65-9657-3b050d45296e	N/A	Y	5303
 
There are no active volume tasks
[root@rhs-client11 ~]#

Comment 5 Scott Haines 2013-09-23 22:39:36 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. 

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1262.html

