Bug 1441939 - Brick Multiplexing: Ended up with two brick processes on multiplexing setup
Summary: Brick Multiplexing: Ended up with two brick processes on multiplexing setup
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: core
Version: rhgs-3.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Samikshan Bairagya
QA Contact: Rahul Hinduja
URL:
Whiteboard: brick-multiplexing
Depends On:
Blocks:
 
Reported: 2017-04-13 06:56 UTC by Nag Pavan Chilakam
Modified: 2017-08-30 12:42 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-04-19 14:35:41 UTC
Embargoed:



Description Nag Pavan Chilakam 2017-04-13 06:56:33 UTC
Description of problem:
=====================
Created about 60 1x3 volumes on a 6-node setup as below:
All volumes were created using bricks from n1, n2 and n3.

Started pumping IOs from 3 different clients, writing one file into one volume at a time.
Took one snapshot for each volume.
While IOs were going on, brought down v1.
Then killed the brick process on n3 (so all volumes have brick3 down).

Then, after leaving the setup overnight, restarted glusterd on n3 to bring b3 up.
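
A rough command-level sketch of the flow above (hostnames, volume names, brick paths and snapshot names here are illustrative placeholders, not the exact ones from this setup):

# run from n1; bricks need to be on thinly provisioned LVs for snapshot create to succeed
for i in $(seq 1 60); do
    gluster volume create cross3_$i replica 3 \
        n1:/rhs/brick1/cross3_$i n2:/rhs/brick1/cross3_$i n3:/rhs/brick1/cross3_$i
    gluster volume start cross3_$i
    gluster snapshot create snap_cross3_$i cross3_$i
done

# while client IO is running: stop one volume, then kill the brick process(es) on n3
gluster volume stop cross3_1
pkill -f glusterfsd              # run on n3

# next day, restart glusterd on n3 to bring the n3 bricks back up
systemctl restart glusterd       # run on n3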

I ended up seeing 2 glusterfsd processes on n3, as below:


[root@dhcp35-122 ~]# ps -ef|grep glusterfsd
root     21022     1  8 12:08 ?        00:01:27 /usr/sbin/glusterfsd -s dhcp35-122.lab.eng.blr.redhat.com --volfile-id cross3_10.dhcp35-122.lab.eng.blr.redhat.com.rhs-brick1-cross3_10 -p /var/lib/glusterd/vols/cross3_10/run/dhcp35-122.lab.eng.blr.redhat.com-rhs-brick1-cross3_10.pid -S /var/lib/glusterd/vols/cross3_10/run/daemon-dhcp35-122.lab.eng.blr.redhat.com.socket --brick-name /rhs/brick1/cross3_10 -l /var/log/glusterfs/bricks/rhs-brick1-cross3_10.log --xlator-option *-posix.glusterd-uuid=0d8eaf5c-e629-451b-b6d2-b0a32df473a0 --brick-port 49152 --xlator-option cross3_10-server.listen-port=49152
root     21088     1  0 12:08 ?        00:00:02 /usr/sbin/glusterfsd -s dhcp35-122.lab.eng.blr.redhat.com --volfile-id cross3_30.dhcp35-122.lab.eng.blr.redhat.com.rhs-brick2-cross3_30 -p /var/lib/glusterd/vols/cross3_30/run/dhcp35-122.lab.eng.blr.redhat.com-rhs-brick2-cross3_30.pid -S /var/lib/glusterd/vols/cross3_30/run/daemon-dhcp35-122.lab.eng.blr.redhat.com.socket --brick-name /rhs/brick2/cross3_30 -l /var/log/glusterfs/bricks/rhs-brick2-cross3_30.log --xlator-option *-posix.glusterd-uuid=0d8eaf5c-e629-451b-b6d2-b0a32df473a0 --brick-port 49153 --xlator-option cross3_30-server.listen-port=49153
root     22220 20760  0 12:25 pts/0    00:00:00 grep --color=auto glusterfsd
[root@dhcp35-122 ~]# 




Version-Release number of selected component (if applicable):
======
3.8.4-22

Comment 2 Nag Pavan Chilakam 2017-04-13 06:58:29 UTC
Note: the above setup had brick multiplexing enabled at the start.
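
For reference, brick multiplexing is enabled cluster-wide through the following volume option (a sketch of the usual invocation; the exact command run on this setup isn't captured in the bug):

gluster volume set all cluster.brick-multiplex on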

Comment 3 Atin Mukherjee 2017-04-13 08:41:24 UTC
Could you share the gluster volume info/status output along with the log files, please?

Comment 4 Nag Pavan Chilakam 2017-04-13 09:10:31 UTC
Logs of n3, where I saw the problem, are available at http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/bug.1441939/
They include the vol info log too.
Note that I was unable to collect vol status due to bz#1441946 - Brick Multiplexing: volume status showing "Another transaction is in progress".

Also note that the setup is currently available (I don't know for how long I can guarantee that): 10.70.35.122

Comment 5 Samikshan Bairagya 2017-04-19 14:35:41 UTC
The following is my analysis and explains why this isn't a bug:

From the "volume info" data it seems like due to some reason the snapshot create operation failed for volume cross3_30. This can be seen from the snapshot count for cross3_30 which is 0, unlike all the other volumes for which the snapshot count is 1. Also, for all volumes, option "features.barrier" has been reconfigured and set to "disabled", whereas, for cross3_30, this volume isn't configured. This can again be observed from the "volume info" data.

The brick multiplexing feature checks for brick compatibility before deciding to attach a brick to a brick process that already has one or more bricks. This compatibility check happens in two steps. First, the volume options of the two volumes that the bricks belong to are compared. If the volume options do not match, the bricks are deemed incompatible and a new brick process is spawned instead. The second step involves other brick-specific compatibility checks.

Here the volume options for cross3_30 do not match those of the other volumes, because of the option "features.barrier". A new brick process being spawned for brick "dhcp35-122.lab.eng.blr.redhat.com:/rhs/brick2/cross3_30" is thus expected brick-multiplexing behaviour, resulting from the failure of the first compatibility check, and can't be considered a bug.
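
A quick way to cross-check this from the CLI (a sketch; the expected results are inferred from the analysis above, actual output wasn't captured here):

gluster volume info cross3_10 | grep features.barrier   # expected: listed under "Options Reconfigured"
gluster volume info cross3_30 | grep features.barrier   # expected: no output, option never reconfigured
gluster snapshot list cross3_10                         # expected: one snapshot
gluster snapshot list cross3_30                         # expected: none, since snapshot create failed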

Comment 6 Nag Pavan Chilakam 2017-04-20 06:55:54 UTC
(In reply to Samikshan Bairagya from comment #5)
> The following is my analysis and explains why this isn't a bug:
> 
> From the "volume info" data it seems like due to some reason the snapshot
> create operation failed for volume cross3_30. This can be seen from the
> snapshot count for cross3_30 which is 0, unlike all the other volumes for
> which the snapshot count is 1. Also, for all volumes, option
> "features.barrier" has been reconfigured and set to "disabled", whereas, for
> cross3_30, this volume isn't configured. This can again be observed from the
> "volume info" data.
> 
> The brick multiplexing feature works in a way that checks for brick
> compatibility before deciding to attach a brick to a particular brick
> process that already has one or more bricks. This compatibility check
> happens in 2 steps. First, volume options are compared between the two
> volumes that the bricks being checked for compatibility belong to. If the
> volume options do not match, then the bricks are deemed incompatible and a
> new brick process is spawned instead. The second step involves other brick
> specific compatibility checks.

So are we saying that if the volume options of two volumes are different, then they won't be served by the same glusterfsd (meaning brick mux won't take effect for them)?

> 
> Here the volume options for cross3_30 do not match those of the other
> volumes, due to the option "features.barrier". A new brick being spawned for
> brick "dhcp35-122.lab.eng.blr.redhat.com:/rhs/brick2/cross3_30" is thus an
> expected brick-multiplexing behaviour, resulting from the first
> compatibility check's failure, and can't be considered as a bug.

Comment 7 Atin Mukherjee 2017-04-20 07:15:30 UTC
(In reply to nchilaka from comment #6)
> (In reply to Samikshan Bairagya from comment #5)
> > The following is my analysis and explains why this isn't a bug:
> > 
> > From the "volume info" data it seems like due to some reason the snapshot
> > create operation failed for volume cross3_30. This can be seen from the
> > snapshot count for cross3_30 which is 0, unlike all the other volumes for
> > which the snapshot count is 1. Also, for all volumes, option
> > "features.barrier" has been reconfigured and set to "disabled", whereas, for
> > cross3_30, this volume isn't configured. This can again be observed from the
> > "volume info" data.
> > 
> > The brick multiplexing feature works in a way that checks for brick
> > compatibility before deciding to attach a brick to a particular brick
> > process that already has one or more bricks. This compatibility check
> > happens in 2 steps. First, volume options are compared between the two
> > volumes that the bricks being checked for compatibility belong to. If the
> > volume options do not match, then the bricks are deemed incompatible and a
> > new brick process is spawned instead. The second step involves other brick
> > specific compatibility checks.
> 
> So are we saying that if the volume options of two volumes are different,
> then they won't be served by same glusterfsd(meaning brick mux won't take
> into effect?)

That's right!
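
For completeness, once "gluster volume status" works again (see bz#1441946), this is visible from the Pid column: bricks attached to the same glusterfsd report the same PID, while the cross3_30 brick on n3 reports the PID of the second process. A rough check (volume names from this setup, output not captured):

gluster volume status cross3_10
gluster volume status cross3_30
ps -ef | grep glusterfsd    # on n3: one glusterfsd per compatibility group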

> 
> > 
> > Here the volume options for cross3_30 do not match those of the other
> > volumes, due to the option "features.barrier". A new brick being spawned for
> > brick "dhcp35-122.lab.eng.blr.redhat.com:/rhs/brick2/cross3_30" is thus an
> > expected brick-multiplexing behaviour, resulting from the first
> > compatibility check's failure, and can't be considered as a bug.

