Bug 1620544 - Brick process NOT ONLINE for heketidb and block-hosting volume
Summary: Brick process NOT ONLINE for heketidb and block-hosting volume
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: mainline
Hardware: Unspecified
OS: Unspecified
unspecified
high
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 1620469
TreeView+ depends on / blocked
 
Reported: 2018-08-23 07:01 UTC by Atin Mukherjee
Modified: 2018-10-23 15:17 UTC (History)
13 users (show)

Fixed In Version: glusterfs-5.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1620469
Environment:
Last Closed: 2018-10-23 15:17:29 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:


Attachments (Terms of Use)

Comment 1 Atin Mukherjee 2018-08-23 07:03:42 UTC
Description of problem:
++++++++++++++++++++++++
We had an OCP 3.10 + OCS 3.10 setup with gluster-bits=3.12.2-15. Gluser-block version = gluster-block-0.2.1-24.el7rhgs.x86_64. The setup has logging pods configured and the metrics pods couldn't come up.

Steps Performed
==================

1. We updated the docker and gluster client packages on each oc node which resulted in the gluster pod restarts as well.

2. Created around 50 block pvcs in two loops and then attached them to app pods 

For 50 pvcs of 2 gb each, a total of 2 block hosting vols were created:

1. First block-host = vol_9f93ae4c845f3910f5d1558cc5ae9f0a
2. 2nd block-hosting = vol_1fda560284e932cae1e384fe779b430f

Details :
=============
A) For pvc bk101 -bk148, space was allocated from vol_9f93ae4c845f3910f5d1558cc5ae9f0a. Each block-volume creation succeeded(as seen from heketi-logs). 

B) For pvc bk149, a new block-hosting vol was created = vol_1fda560284e932cae1e384fe779b430f and was used for pvcs bk149 and bk150 (Time = 2018/08/20 10:01:10  UTC)


Following issues was seen on gluster pod 10.70.46.150
========================================================

1. for gluster node 10.70.46.150, the brick process for heketidbstorage was DOWN (even though ps -ef |grep glusterfsd reported it as running)
2. for gluster node 10.70.46.150, the brick process for vol_9f93ae4c845f3910f5d1558cc5ae9f0a is DOWN.
3. In a node, With brick-mux enabled, all block-hosting volumes should have same brick PID. But, in 10.70.46.150, 2 block-hosting volumes had 2 different PIDs.

Thus, on creation of 2nd block-hosting volume(vol_1fda560284e932cae1e384fe779b430f) , it should have used PID#540 of 1st block-hosting volume(vol_9f93ae4c845f3910f5d1558cc5ae9f0a). Instead a new PID was used -  12654. This resulted in brick process for vol_9f93ae4c845f3910f5d1558cc5ae9f0a going into NOT ONLINE status.



Some outputs from gluster and heketi end
==========================================

Brick PIDS from PODS
----------------------------
-----------------------------

[root@dhcp46-137 nitin]# for i in `oc get pods -o wide| grep glusterfs|cut -d " " -f1` ; do echo $i; echo +++++++++++++++++++++++; oc exec $i -- ps -ef|grep glusterfsd; echo ""; done

glusterfs-storage-9j9nk
+++++++++++++++++++++++
root       540     1 95 06:03 ?        04:18:55 /usr/sbin/glusterfsd -s 10.70.46.150 --volfile-id vol_9f93ae4c845f3910f5d1558cc5ae9f0a.10.70.46.150.var-lib-heketi-mounts-vg_29d26d418f4ec01cbd8805704313e5e0-brick_0cb321e2f4b4290bda1c2f9ae5085544-brick -p /var/run/gluster/vols/vol_9f93ae4c845f3910f5d1558cc5ae9f0a/10.70.46.150-var-lib-heketi-mounts-vg_29d26d418f4ec01cbd8805704313e5e0-brick_0cb321e2f4b4290bda1c2f9ae5085544-brick.pid -S /var/run/gluster/6a8f7b2cb8da0e6a3ae398160998af29.socket --brick-name /var/lib/heketi/mounts/vg_29d26d418f4ec01cbd8805704313e5e0/brick_0cb321e2f4b4290bda1c2f9ae5085544/brick -l /var/log/glusterfs/bricks/var-lib-heketi-mounts-vg_29d26d418f4ec01cbd8805704313e5e0-brick_0cb321e2f4b4290bda1c2f9ae5085544-brick.log --xlator-option *-posix.glusterd-uuid=cd776ee9-6a31-496d-a8af-072f4c23aee4 --brick-port 49153 --xlator-option vol_9f93ae4c845f3910f5d1558cc5ae9f0a-server.listen-port=49153


root       558     1  0 06:03 ?        00:00:04 /usr/sbin/glusterfsd -s 10.70.46.150 --volfile-id heketidbstorage.10.70.46.150.var-lib-heketi-mounts-vg_6064162e01514ddd000da6dafdc79216-brick_c8dd81dd3761dd8212327131c4009716-brick -p /var/run/gluster/vols/heketidbstorage/10.70.46.150-var-lib-heketi-mounts-vg_6064162e01514ddd000da6dafdc79216-brick_c8dd81dd3761dd8212327131c4009716-brick.pid -S /var/run/gluster/6fe959498daa7ffa24cb0ec026f845e3.socket --brick-name /var/lib/heketi/mounts/vg_6064162e01514ddd000da6dafdc79216/brick_c8dd81dd3761dd8212327131c4009716/brick -l /var/log/glusterfs/bricks/var-lib-heketi-mounts-vg_6064162e01514ddd000da6dafdc79216-brick_c8dd81dd3761dd8212327131c4009716-brick.log --xlator-option *-posix.glusterd-uuid=cd776ee9-6a31-496d-a8af-072f4c23aee4 --brick-port 49152 --xlator-option heketidbstorage-server.listen-port=49152


root     12654     1 36 10:01 ?        00:11:54 /usr/sbin/glusterfsd -s 10.70.46.150 --volfile-id vol_1fda560284e932cae1e384fe779b430f.10.70.46.150.var-lib-heketi-mounts-vg_29d26d418f4ec01cbd8805704313e5e0-brick_013ad53a5ed578fd8f1275525e2c5916-brick -p /var/run/gluster/vols/vol_1fda560284e932cae1e384fe779b430f/10.70.46.150-var-lib-heketi-mounts-vg_29d26d418f4ec01cbd8805704313e5e0-brick_013ad53a5ed578fd8f1275525e2c5916-brick.pid -S /var/run/gluster/d16b6a7370a888823fc43060eeed1b2e.socket --brick-name /var/lib/heketi/mounts/vg_29d26d418f4ec01cbd8805704313e5e0/brick_013ad53a5ed578fd8f1275525e2c5916/brick -l /var/log/glusterfs/bricks/var-lib-heketi-mounts-vg_29d26d418f4ec01cbd8805704313e5e0-brick_013ad53a5ed578fd8f1275525e2c5916-brick.log --xlator-option *-posix.glusterd-uuid=cd776ee9-6a31-496d-a8af-072f4c23aee4 --brick-port 49154 --xlator-option vol_1fda560284e932cae1e384fe779b430f-server.listen-port=49154



glusterfs-storage-hr8ht
+++++++++++++++++++++++

root       527     1  0 07:49 ?        00:00:05 /usr/sbin/glusterfsd -s 10.70.46.219 --volfile-id heketidbstorage.10.70.46.219.var-lib-heketi-mounts-vg_a1297c7a138dcac578308e8afada5161-brick_8484360626b40cc47f707c683391f8b8-brick -p /var/run/gluster/vols/heketidbstorage/10.70.46.219-var-lib-heketi-mounts-vg_a1297c7a138dcac578308e8afada5161-brick_8484360626b40cc47f707c683391f8b8-brick.pid -S /var/run/gluster/49760ea523e2d92425cd374e21f7c6a6.socket --brick-name /var/lib/heketi/mounts/vg_a1297c7a138dcac578308e8afada5161/brick_8484360626b40cc47f707c683391f8b8/brick -l /var/log/glusterfs/bricks/var-lib-heketi-mounts-vg_a1297c7a138dcac578308e8afada5161-brick_8484360626b40cc47f707c683391f8b8-brick.log --xlator-option *-posix.glusterd-uuid=09aed8ae-858b-4d53-b7fa-745aa9443f18 --brick-port 49152 --xlator-option heketidbstorage-server.listen-port=49152


root       535     1 99 07:49 ?        04:15:43 /usr/sbin/glusterfsd -s 10.70.46.219 --volfile-id vol_9f93ae4c845f3910f5d1558cc5ae9f0a.10.70.46.219.var-lib-heketi-mounts-vg_a32d29646c91834eeac64529870c71cd-brick_3319cdb494d7c201d2991173d18f2575-brick -p /var/run/gluster/vols/vol_9f93ae4c845f3910f5d1558cc5ae9f0a/10.70.46.219-var-lib-heketi-mounts-vg_a32d29646c91834eeac64529870c71cd-brick_3319cdb494d7c201d2991173d18f2575-brick.pid -S /var/run/gluster/c3f1138994a30170b2d5759b8bdbc313.socket --brick-name /var/lib/heketi/mounts/vg_a32d29646c91834eeac64529870c71cd/brick_3319cdb494d7c201d2991173d18f2575/brick -l /var/log/glusterfs/bricks/var-lib-heketi-mounts-vg_a32d29646c91834eeac64529870c71cd-brick_3319cdb494d7c201d2991173d18f2575-brick.log --xlator-option *-posix.glusterd-uuid=09aed8ae-858b-4d53-b7fa-745aa9443f18 --brick-port 49153 --xlator-option vol_9f93ae4c845f3910f5d1558cc5ae9f0a-server.listen-port=49153



glusterfs-storage-q22cl
+++++++++++++++++++++++

root       549     1  0 07:10 ?        00:00:05 /usr/sbin/glusterfsd -s 10.70.46.231 --volfile-id heketidbstorage.10.70.46.231.var-lib-heketi-mounts-vg_357556739aad4d3b81d3e935a27339dc-brick_6ee29707729f1c338abc1604473d6059-brick -p /var/run/gluster/vols/heketidbstorage/10.70.46.231-var-lib-heketi-mounts-vg_357556739aad4d3b81d3e935a27339dc-brick_6ee29707729f1c338abc1604473d6059-brick.pid -S /var/run/gluster/26c7477db0159fa810a1574951aa87cc.socket --brick-name /var/lib/heketi/mounts/vg_357556739aad4d3b81d3e935a27339dc/brick_6ee29707729f1c338abc1604473d6059/brick -l /var/log/glusterfs/bricks/var-lib-heketi-mounts-vg_357556739aad4d3b81d3e935a27339dc-brick_6ee29707729f1c338abc1604473d6059-brick.log --xlator-option *-posix.glusterd-uuid=2018bea2-c934-4cb1-b19f-df8fb79752cc --brick-port 49152 --xlator-option heketidbstorage-server.listen-port=49152


root       557     1 99 07:10 ?        04:33:11 /usr/sbin/glusterfsd -s 10.70.46.231 --volfile-id vol_9f93ae4c845f3910f5d1558cc5ae9f0a.10.70.46.231.var-lib-heketi-mounts-vg_4525875aacd97d4337fb6a9c5f13eba6-brick_8c0881f317eb11607eff74a16027663d-brick -p /var/run/gluster/vols/vol_9f93ae4c845f3910f5d1558cc5ae9f0a/10.70.46.231-var-lib-heketi-mounts-vg_4525875aacd97d4337fb6a9c5f13eba6-brick_8c0881f317eb11607eff74a16027663d-brick.pid -S /var/run/gluster/e88f8f4b09da12a41b04762db8e06ada.socket --brick-name /var/lib/heketi/mounts/vg_4525875aacd97d4337fb6a9c5f13eba6/brick_8c0881f317eb11607eff74a16027663d/brick -l /var/log/glusterfs/bricks/var-lib-heketi-mounts-vg_4525875aacd97d4337fb6a9c5f13eba6-brick_8c0881f317eb11607eff74a16027663d-brick.log --xlator-option *-posix.glusterd-uuid=2018bea2-c934-4cb1-b19f-df8fb79752cc --brick-port 49153 --xlator-option vol_9f93ae4c845f3910f5d1558cc5ae9f0a-server.listen-port=49153


_________________________________________________________________________________


Gluster v status
-------------------
--------------------


glusterfs-storage-9j9nk
+++++++++++++++++++++++
#gluster v status


Status of volume: heketidbstorage
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.46.231:/var/lib/heketi/mounts/v
g_357556739aad4d3b81d3e935a27339dc/brick_6e
e29707729f1c338abc1604473d6059/brick        49152     0          Y       549  
Brick 10.70.46.219:/var/lib/heketi/mounts/v
g_a1297c7a138dcac578308e8afada5161/brick_84
84360626b40cc47f707c683391f8b8/brick        49152     0          Y       527  
Brick 10.70.46.150:/var/lib/heketi/mounts/v
g_6064162e01514ddd000da6dafdc79216/brick_c8
dd81dd3761dd8212327131c4009716/brick        N/A       N/A        N       N/A  
Self-heal Daemon on localhost               N/A       N/A        Y       12681
Self-heal Daemon on 10.70.46.231            N/A       N/A        Y       10348
Self-heal Daemon on 10.70.46.219            N/A       N/A        Y       9154 
 
Task Status of Volume heketidbstorage
------------------------------------------------------------------------------
There are no active volume tasks


 
Status of volume: vol_1fda560284e932cae1e384fe779b430f
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.46.219:/var/lib/heketi/mounts/v
g_a32d29646c91834eeac64529870c71cd/brick_a4
2d82892274b1e498d54eb791bec7e5/brick        49153     0          Y       535  
Brick 10.70.46.150:/var/lib/heketi/mounts/v
g_29d26d418f4ec01cbd8805704313e5e0/brick_01
3ad53a5ed578fd8f1275525e2c5916/brick        49154     0          Y       12654
Brick 10.70.46.231:/var/lib/heketi/mounts/v
g_357556739aad4d3b81d3e935a27339dc/brick_e1
fa6fec4ca03e73f937ed35bcfd51a3/brick        49153     0          Y       557  
Self-heal Daemon on localhost               N/A       N/A        Y       12681
Self-heal Daemon on 10.70.46.219            N/A       N/A        Y       9154 
Self-heal Daemon on 10.70.46.231            N/A       N/A        Y       10348
 
Task Status of Volume vol_1fda560284e932cae1e384fe779b430f
------------------------------------------------------------------------------
There are no active volume tasks
 


Status of volume: vol_9f93ae4c845f3910f5d1558cc5ae9f0a
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.46.231:/var/lib/heketi/mounts/v
g_4525875aacd97d4337fb6a9c5f13eba6/brick_8c
0881f317eb11607eff74a16027663d/brick        49153     0          Y       557  
Brick 10.70.46.219:/var/lib/heketi/mounts/v
g_a32d29646c91834eeac64529870c71cd/brick_33
19cdb494d7c201d2991173d18f2575/brick        49153     0          Y       535  
Brick 10.70.46.150:/var/lib/heketi/mounts/v
g_29d26d418f4ec01cbd8805704313e5e0/brick_0c
b321e2f4b4290bda1c2f9ae5085544/brick        N/A       N/A        N       N/A  
Self-heal Daemon on localhost               N/A       N/A        Y       12681
Self-heal Daemon on 10.70.46.219            N/A       N/A        Y       9154 
Self-heal Daemon on 10.70.46.231            N/A       N/A        Y       10348
 
Task Status of Volume vol_9f93ae4c845f3910f5d1558cc5ae9f0a
------------------------------------------------------------------------------
There are no active volume tasks



Brick log for heketidb volume
===============================

[2018-08-21 09:22:44.795543] I [addr.c:55:compare_addr_and_update] 0-/var/lib/heketi/mounts/vg_c2d019ac0e1e2a1f14d93fa86469ad99/brick_9132f7aaa37343987bab759a3c165645/brick: allowed = "*", received addr = "10.70.46.17"
[2018-08-21 09:22:44.795622] I [login.c:111:gf_auth] 0-auth/login: allowed user names: fa8aecfc-cee7-473a-9dcd-ac190b15b3e4
[2018-08-21 09:22:44.795649] I [MSGID: 115029] [server-handshake.c:817:server_setvolume] 0-heketidbstorage-server: accepted client from dhcp46-17.lab.eng.blr.redhat.com-25683-2018/08/21-06:05:30:237593-heketidbstorage-client-0-0-1 (version: 3.12.2)
[2018-08-21 09:23:02.108401] I [addr.c:55:compare_addr_and_update] 0-/var/lib/heketi/mounts/vg_c2d019ac0e1e2a1f14d93fa86469ad99/brick_9132f7aaa37343987bab759a3c165645/brick: allowed = "*", received addr = "10.70.46.177"
[2018-08-21 09:23:02.108426] I [login.c:111:gf_auth] 0-auth/login: allowed user names: fa8aecfc-cee7-473a-9dcd-ac190b15b3e4
[2018-08-21 09:23:02.108437] I [MSGID: 115029] [server-handshake.c:817:server_setvolume] 0-heketidbstorage-server: accepted client from dhcp46-177.lab.eng.blr.redhat.com-24679-2018/08/21-06:05:30:293619-heketidbstorage-client-0-0-1 (version: 3.12.2)
[2018-08-21 09:23:06.493428] I [addr.c:55:compare_addr_and_update] 0-/var/lib/heketi/mounts/vg_c2d019ac0e1e2a1f14d93fa86469ad99/brick_9132f7aaa37343987bab759a3c165645/brick: allowed = "*", received addr = "10.70.46.172"
[2018-08-21 09:23:06.493480] I [MSGID: 115029] [server-handshake.c:817:server_setvolume] 0-heketidbstorage-server: accepted client from dhcp46-172.lab.eng.blr.redhat.com-24267-2018/08/16-14:09:05:424515-heketidbstorage-client-0-0-1 (version: 3.8.4)
[2018-08-21 09:23:07.802951] I [addr.c:55:compare_addr_and_update] 0-/var/lib/heketi/mounts/vg_c2d019ac0e1e2a1f14d93fa86469ad99/brick_9132f7aaa37343987bab759a3c165645/brick: allowed = "*", received addr = "10.70.46.172"
[2018-08-21 09:23:07.802992] I [login.c:111:gf_auth] 0-auth/login: allowed user names: fa8aecfc-cee7-473a-9dcd-ac190b15b3e4
[2018-08-21 09:23:07.803004] I [MSGID: 115029] [server-handshake.c:817:server_setvolume] 0-heketidbstorage-server: accepted client from dhcp46-172.lab.eng.blr.redhat.com-496-2018/08/21-09:22:39:669378-heketidbstorage-client-0-0-0 (version: 3.12.2)
[2018-08-21 09:26:41.833795] I [addr.c:55:compare_addr_and_update] 0-/var/lib/heketi/mounts/vg_c2d019ac0e1e2a1f14d93fa86469ad99/brick_9132f7aaa37343987bab759a3c165645/brick: allowed = "*", received addr = "10.70.46.172"
[2018-08-21 09:26:41.833856] I [login.c:111:gf_auth] 0-auth/login: allowed user names: fa8aecfc-cee7-473a-9dcd-ac190b15b3e4
[2018-08-21 09:26:41.833881] I [MSGID: 115029] [server-handshake.c:817:server_setvolume] 0-heketidbstorage-server: accepted client from dhcp46-172.lab.eng.blr.redhat.com-998-2018/08/21-09:26:41:760999-heketidbstorage-client-0-0-0 (version: 3.12.2)
[2018-08-21 09:26:41.857008] I [MSGID: 115036] [server.c:549:server_rpc_notify] 0-heketidbstorage-server: disconnecting connection from dhcp46-172.lab.eng.blr.redhat.com-998-2018/08/21-09:26:41:760999-heketidbstorage-client-0-0-0
[2018-08-21 09:26:41.857339] I [MSGID: 101055] [client_t.c:443:gf_client_unref] 0-heketidbstorage-server: Shutting down connection dhcp46-172.lab.eng.blr.redhat.com-998-2018/08/21-09:26:41:760999-heketidbstorage-client-0-0-0
[2018-08-21 09:36:35.275709] I [MSGID: 115036] [server.c:549:server_rpc_notify] 0-heketidbstorage-server: disconnecting connection from dhcp46-17.lab.eng.blr.redhat.com-25683-2018/08/21-06:05:30:237593-heketidbstorage-client-0-0-1
[2018-08-21 09:36:35.277107] I [MSGID: 101055] [client_t.c:443:gf_client_unref] 0-heketidbstorage-server: Shutting down connection dhcp46-17.lab.eng.blr.redhat.com-25683-2018/08/21-06:05:30:237593-heketidbstorage-client-0-0-1
[2018-08-21 09:39:42.629219] I [addr.c:55:compare_addr_and_update] 0-/var/lib/heketi/mounts/vg_c2d019ac0e1e2a1f14d93fa86469ad99/brick_9132f7aaa37343987bab759a3c165645/brick: allowed = "*", received addr = "10.70.46.17"
[2018-08-21 09:39:42.629289] I [login.c:111:gf_auth] 0-auth/login: allowed user names: fa8aecfc-cee7-473a-9dcd-ac190b15b3e4
[2018-08-21 09:39:42.629313] I [MSGID: 115029] [server-handshake.c:817:server_setvolume] 0-heketidbstorage-server: accepted client from dhcp46-17.lab.eng.blr.redhat.com-499-2018/08/21-09:39:38:268882-heketidbstorage-client-0-0-0 (version: 3.12.2)
[2018-08-21 09:39:49.676875] I [addr.c:55:compare_addr_and_update] 0-/var/lib/heketi/mounts/vg_c2d019ac0e1e2a1f14d93fa86469ad99/brick_9132f7aaa37343987bab759a3c165645/brick: allowed = "*", received addr = "10.70.46.172"
[2018-08-21 09:39:49.676922] I [login.c:111:gf_auth] 0-auth/login: allowed user names: fa8aecfc-cee7-473a-9dcd-ac190b15b3e4
[2018-08-21 09:39:49.676935] I [MSGID: 115029] [server-handshake.c:817:server_setvolume] 0-heketidbstorage-server: accepted client from dhcp46-172.lab.eng.blr.redhat.com-1615-2018/08/21-09:39:49:652520-heketidbstorage-client-0-0-0 (version: 3.12.2)
[2018-08-21 09:39:49.700266] I [MSGID: 115036] [server.c:549:server_rpc_notify] 0-heketidbstorage-server: disconnecting connection from dhcp46-172.lab.eng.blr.redhat.com-1615-2018/08/21-09:39:49:652520-heketidbstorage-client-0-0-0
[2018-08-21 09:39:49.700391] I [MSGID: 101055] [client_t.c:443:gf_client_unref] 0-heketidbstorage-server: Shutting down connection dhcp46-172.lab.eng.blr.redhat.com-1615-2018/08/21-09:39:49:652520-heketidbstorage-client-0-0-0
[2018-08-21 09:41:28.103946] I [addr.c:55:compare_addr_and_update] 0-/var/lib/heketi/mounts/vg_c2d019ac0e1e2a1f14d93fa86469ad99/brick_9132f7aaa37343987bab759a3c165645/brick: allowed = "*", received addr = "10.70.46.17"
[2018-08-21 09:41:28.103994] I [login.c:111:gf_auth] 0-auth/login: allowed user names: fa8aecfc-cee7-473a-9dcd-ac190b15b3e4
[2018-08-21 09:41:28.104007] I [MSGID: 115029] [server-handshake.c:817:server_setvolume] 0-heketidbstorage-server: accepted client from dhcp46-17.lab.eng.blr.redhat.com-883-2018/08/21-09:41:27:982577-heketidbstorage-client-0-0-0 (version: 3.12.2)
[2018-08-21 09:41:28.150568] I [MSGID: 115036] [server.c:549:server_rpc_notify] 0-heketidbstorage-server: disconnecting connection from dhcp46-17.lab.eng.blr.redhat.com-883-2018/08/21-09:41:27:982577-heketidbstorage-client-0-0-0
[2018-08-21 09:41:28.150858] I [MSGID: 101055] [client_t.c:443:gf_client_unref] 0-heketidbstorage-server: Shutting down connection dhcp46-17.lab.eng.blr.redhat.com-883-2018/08/21-09:41:27:982577-heketidbstorage-client-0-0-0
[2018-08-21 10:03:50.329491] I [MSGID: 115036] [server.c:549:server_rpc_notify] 0-heketidbstorage-server: disconnecting connection from dhcp46-177.lab.eng.blr.redhat.com-24679-2018/08/21-06:05:30:293619-heketidbstorage-client-0-0-1
[2018-08-21 10:03:50.329797] I [MSGID: 101055] [client_t.c:443:gf_client_unref] 0-heketidbstorage-server: Shutting down connection dhcp46-177.lab.eng.blr.redhat.com-24679-2018/08/21-06:05:30:293619-heketidbstorage-client-0-0-1
[2018-08-21 10:05:30.011423] I [addr.c:55:compare_addr_and_update] 0-/var/lib/heketi/mounts/vg_c2d019ac0e1e2a1f14d93fa86469ad99/brick_9132f7aaa37343987bab759a3c165645/brick: allowed = "*", received addr = "10.70.46.177"
[2018-08-21 10:05:30.011483] I [login.c:111:gf_auth] 0-auth/login: allowed user names: fa8aecfc-cee7-473a-9dcd-ac190b15b3e4
[2018-08-21 10:05:30.011530] I [MSGID: 115029] [server-handshake.c:817:server_setvolume] 0-heketidbstorage-server: accepted client from dhcp46-177.lab.eng.blr.redhat.com-483-2018/08/21-10:05:27:105998-heketidbstorage-client-0-0-0 (version: 3.12.2)
[2018-08-21 10:29:07.921950] I [addr.c:55:compare_addr_and_update] 0-/var/lib/heketi/mounts/vg_c2d019ac0e1e2a1f14d93fa86469ad99/brick_9132f7aaa37343987bab759a3c165645/brick: allowed = "*", received addr = "10.70.46.177"
[2018-08-21 10:29:07.922022] I [login.c:111:gf_auth] 0-auth/login: allowed user names: fa8aecfc-cee7-473a-9dcd-ac190b15b3e4
[2018-08-21 10:29:07.922047] I [MSGID: 115029] [server-handshake.c:817:server_setvolume] 0-heketidbstorage-server: accepted client from dhcp46-177.lab.eng.blr.redhat.com-1583-2018/08/21-10:29:07:678540-heketidbstorage-client-0-0-0 (version: 3.12.2)
[2018-08-21 10:29:08.029658] I [MSGID: 115036] [server.c:549:server_rpc_notify] 0-heketidbstorage-server: disconnecting connection from dhcp46-177.lab.eng.blr.redhat.com-1583-2018/08/21-10:29:07:678540-heketidbstorage-client-0-0-0
[2018-08-21 10:29:08.029839] I [MSGID: 101055] [client_t.c:443:gf_client_unref] 0-heketidbstorage-server: Shutting down connection dhcp46-177.lab.eng.blr.redhat.com-1583-2018/08/21-10:29:07:678540-heketidbstorage-client-0-0-0
[2018-08-21 10:54:46.397618] I [MSGID: 100030] [glusterfsd.c:2504:main] 0-/usr/sbin/glusterfsd: Started running /usr/sbin/glusterfsd version 3.12.2 (args: /usr/sbin/glusterfsd -s 10.70.46.172 --volfile-id heketidbstorage.10.70.46.172.var-lib-heketi-mounts-vg_c2d019ac0e1e2a1f14d93fa86469ad99-brick_9132f7aaa37343987bab759a3c165645-brick -p /var/run/gluster/vols/heketidbstorage/10.70.46.172-var-lib-heketi-mounts-vg_c2d019ac0e1e2a1f14d93fa86469ad99-brick_9132f7aaa37343987bab759a3c165645-brick.pid -S /var/run/gluster/81fedf6447aee0c69785f94f7880d634.socket --brick-name /var/lib/heketi/mounts/vg_c2d019ac0e1e2a1f14d93fa86469ad99/brick_9132f7aaa37343987bab759a3c165645/brick -l /var/log/glusterfs/bricks/var-lib-heketi-mounts-vg_c2d019ac0e1e2a1f14d93fa86469ad99-brick_9132f7aaa37343987bab759a3c165645-brick.log --xlator-option *-posix.glusterd-uuid=ce5ee1b5-b1a5-4a89-bb60-47e01f7d54ae --brick-port 49156 --xlator-option heketidbstorage-server.listen-port=49156)
[2018-08-21 10:54:46.404525] W [MSGID: 101002] [options.c:995:xl_opt_validate] 0-glusterfs: option 'address-family' is deprecated, preferred is 'transport.address-family', continuing with correction
[2018-08-21 10:54:46.408045] I [MSGID: 101190] [event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2018-08-21 10:55:14.934991] W [socket.c:593:__socket_rwv] 0-glusterfs: readv on 10.70.46.172:24007 failed (Connection reset by peer)
[2018-08-21 10:55:14.935125] I [glusterfsd-mgmt.c:2349:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: 10.70.46.172
[2018-08-21 10:55:14.935162] I [glusterfsd-mgmt.c:2370:mgmt_rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile servers
[2018-08-21 10:55:14.935484] W [glusterfsd.c:1367:cleanup_and_exit] (-->/lib64/libgfrpc.so.0(rpc_clnt_notify+0xab) [0x7f0bdc309b0b] -->/usr/sbin/glusterfsd(+0x117fd) [0x55b5bb2d27fd] -->/usr/sbin/glusterfsd(cleanup_and_exit+0x6b) [0x55b5bb2cb3eb] ) 0-: received signum (1), shutting down
[2018-08-21 10:55:14.935539] I [socket.c:3599:socket_submit_request] 0-glusterfs: not connected (priv->connected = -1)
[2018-08-21 10:55:14.935548] W [rpc-clnt.c:1679:rpc_clnt_submit] 0-glusterfs: failed to submit rpc-request (XID: 0x3 Program: Gluster Portmap, ProgVers: 1, Proc: 5) to rpc-transport (glusterfs)
sh-4.2#

Comment 2 Atin Mukherjee 2018-08-23 07:08:08 UTC
Root cause:

Because of a refactoring in glusterd_brick_start () code path, in case of friend handshake if brick is just supposed to connect to the brick when quorum is not yet met, we were ending up restarting bricks. Given brick restart is attempted from multiple threads, such situations can lead to a deadlock for bringing up the very first brick in case of brick multiplexing.

Comment 3 Worker Ant 2018-08-23 12:01:34 UTC
COMMIT: https://review.gluster.org/20935 committed in master by "Atin Mukherjee" <amukherj@redhat.com> with a commit message- glusterd: glusterd_brick_start shouldn't try to bring up brick if only_connect is true

With the latest refactoring in glusterd_brick_start () function in case
we run into a situation where is_gf_service_running () return a valid
pid which is running but doesn't belong to a gluster process, even in
case of only_connect flag passed as gf_true we'd end up trying to start
a brick which would cause a deadlock in brick multiplexing as both
glusterd_restart_bricks () and glusterd_do_volume_quorum_action () would
cause context switching with each other for the same brick. The
following bt shows the same:

(gdb) t a a bt

Thread 8 (Thread 0x7fcced48a700 (LWP 11959)):
    srch_vol=srch_vol@entry=0xbe0410, comp_vol=comp_vol@entry=0xc03680,
    brickinfo=brickinfo@entry=0xc14ef0) at glusterd-utils.c:5834
    brickinfo=0xc14ef0, volinfo=0xc03680, conf=<optimized out>)
    at glusterd-utils.c:5902
    brickinfo=brickinfo@entry=0xc14ef0, wait=wait@entry=_gf_false,
    only_connect=only_connect@entry=_gf_true) at glusterd-utils.c:6251
    volinfo=0xc03680, meets_quorum=_gf_true) at glusterd-server-quorum.c:402
    at glusterd-server-quorum.c:443
    iov=iov@entry=0x7fcce0004040, count=count@entry=1,
    myframe=myframe@entry=0x7fcce00023a0) at glusterd-rpc-ops.c:542
    iov=0x7fcce0004040, count=1, myframe=0x7fcce00023a0,
    fn=0x7fccf12403d0 <__glusterd_friend_add_cbk>) at glusterd-rpc-ops.c:223
---Type <return> to continue, or q <return> to quit---
    at rpc-transport.c:538

Thread 7 (Thread 0x7fccedc8b700 (LWP 11958)):

Thread 6 (Thread 0x7fccf1d67700 (LWP 11877)):
    brickinfo=brickinfo@entry=0xc14ef0) at glusterd-utils.c:5834
    at glusterd-utils.c:6251

Thread 5 (Thread 0x7fccf2568700 (LWP 11876)):

Thread 4 (Thread 0x7fccf2d69700 (LWP 11875)):

Thread 3 (Thread 0x7fccf356a700 (LWP 11874)):

Thread 2 (Thread 0x7fccf3d6b700 (LWP 11873)):
---Type <return> to continue, or q <return> to quit---

Thread 1 (Thread 0x7fccf68a8780 (LWP 11872)):

Fix:

The solution is to ensure we don't restart bricks if only_connect is
true and just ensure that the brick is attempted to be connected.

Test:

Simulated a code change to ensure gf_is_service_running () always return
to true to hit the scenario.

Change-Id: Iec184e6c9e8aabef931d310f931f4d7a580f0f48
Fixes: bz#1620544
Signed-off-by: Atin Mukherjee <amukherj@redhat.com>

Comment 4 Shyamsundar 2018-10-23 15:17:29 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-5.0, please open a new bug report.

glusterfs-5.0 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2018-October/000115.html
[2] https://www.gluster.org/pipermail/gluster-users/


Note You need to log in before you can comment on or make changes to this bug.