Description of problem:
++++++++++++++++++++++++
We had an OCP 3.10 + OCS 3.10 setup with gluster bits 3.12.2-15 and gluster-block version gluster-block-0.2.1-24.el7rhgs.x86_64. The setup had logging pods configured; the metrics pods could not come up.

Steps Performed
==================
1. We updated the docker and gluster client packages on each OCP node, which also resulted in gluster pod restarts.
2. Created around 50 block PVCs in two loops and then attached them to app pods.

For 50 PVCs of 2 GB each, a total of 2 block-hosting volumes were created:
1. First block-hosting volume  = vol_9f93ae4c845f3910f5d1558cc5ae9f0a
2. Second block-hosting volume = vol_1fda560284e932cae1e384fe779b430f

Details:
=============
A) For PVCs bk101-bk148, space was allocated from vol_9f93ae4c845f3910f5d1558cc5ae9f0a. Each block-volume creation succeeded (as seen in the heketi logs).
B) For PVC bk149, a new block-hosting volume (vol_1fda560284e932cae1e384fe779b430f) was created and was used for PVCs bk149 and bk150 (time = 2018/08/20 10:01:10 UTC).

The following issues were seen on gluster pod 10.70.46.150:
========================================================
1. On gluster node 10.70.46.150, the brick process for heketidbstorage was DOWN (even though `ps -ef | grep glusterfsd` reported it as running).
2. On gluster node 10.70.46.150, the brick process for vol_9f93ae4c845f3910f5d1558cc5ae9f0a was DOWN.
3. With brick multiplexing enabled, all block-hosting volumes on a node should share the same brick PID. On 10.70.46.150, however, the two block-hosting volumes had two different PIDs. On creation of the second block-hosting volume (vol_1fda560284e932cae1e384fe779b430f), it should have been attached to PID 540 of the first block-hosting volume (vol_9f93ae4c845f3910f5d1558cc5ae9f0a); instead a new PID (12654) was used. This resulted in the brick process for vol_9f93ae4c845f3910f5d1558cc5ae9f0a going into NOT ONLINE status.
Some outputs from gluster and heketi end
==========================================

Brick PIDs from pods
-----------------------------
[root@dhcp46-137 nitin]# for i in `oc get pods -o wide | grep glusterfs | cut -d " " -f1` ; do echo $i; echo +++++++++++++++++++++++; oc exec $i -- ps -ef | grep glusterfsd; echo ""; done

glusterfs-storage-9j9nk
+++++++++++++++++++++++
root 540 1 95 06:03 ? 04:18:55 /usr/sbin/glusterfsd -s 10.70.46.150 --volfile-id vol_9f93ae4c845f3910f5d1558cc5ae9f0a.10.70.46.150.var-lib-heketi-mounts-vg_29d26d418f4ec01cbd8805704313e5e0-brick_0cb321e2f4b4290bda1c2f9ae5085544-brick -p /var/run/gluster/vols/vol_9f93ae4c845f3910f5d1558cc5ae9f0a/10.70.46.150-var-lib-heketi-mounts-vg_29d26d418f4ec01cbd8805704313e5e0-brick_0cb321e2f4b4290bda1c2f9ae5085544-brick.pid -S /var/run/gluster/6a8f7b2cb8da0e6a3ae398160998af29.socket --brick-name /var/lib/heketi/mounts/vg_29d26d418f4ec01cbd8805704313e5e0/brick_0cb321e2f4b4290bda1c2f9ae5085544/brick -l /var/log/glusterfs/bricks/var-lib-heketi-mounts-vg_29d26d418f4ec01cbd8805704313e5e0-brick_0cb321e2f4b4290bda1c2f9ae5085544-brick.log --xlator-option *-posix.glusterd-uuid=cd776ee9-6a31-496d-a8af-072f4c23aee4 --brick-port 49153 --xlator-option vol_9f93ae4c845f3910f5d1558cc5ae9f0a-server.listen-port=49153
root 558 1 0 06:03 ? 00:00:04 /usr/sbin/glusterfsd -s 10.70.46.150 --volfile-id heketidbstorage.10.70.46.150.var-lib-heketi-mounts-vg_6064162e01514ddd000da6dafdc79216-brick_c8dd81dd3761dd8212327131c4009716-brick -p /var/run/gluster/vols/heketidbstorage/10.70.46.150-var-lib-heketi-mounts-vg_6064162e01514ddd000da6dafdc79216-brick_c8dd81dd3761dd8212327131c4009716-brick.pid -S /var/run/gluster/6fe959498daa7ffa24cb0ec026f845e3.socket --brick-name /var/lib/heketi/mounts/vg_6064162e01514ddd000da6dafdc79216/brick_c8dd81dd3761dd8212327131c4009716/brick -l /var/log/glusterfs/bricks/var-lib-heketi-mounts-vg_6064162e01514ddd000da6dafdc79216-brick_c8dd81dd3761dd8212327131c4009716-brick.log --xlator-option *-posix.glusterd-uuid=cd776ee9-6a31-496d-a8af-072f4c23aee4 --brick-port 49152 --xlator-option heketidbstorage-server.listen-port=49152
root 12654 1 36 10:01 ? 00:11:54 /usr/sbin/glusterfsd -s 10.70.46.150 --volfile-id vol_1fda560284e932cae1e384fe779b430f.10.70.46.150.var-lib-heketi-mounts-vg_29d26d418f4ec01cbd8805704313e5e0-brick_013ad53a5ed578fd8f1275525e2c5916-brick -p /var/run/gluster/vols/vol_1fda560284e932cae1e384fe779b430f/10.70.46.150-var-lib-heketi-mounts-vg_29d26d418f4ec01cbd8805704313e5e0-brick_013ad53a5ed578fd8f1275525e2c5916-brick.pid -S /var/run/gluster/d16b6a7370a888823fc43060eeed1b2e.socket --brick-name /var/lib/heketi/mounts/vg_29d26d418f4ec01cbd8805704313e5e0/brick_013ad53a5ed578fd8f1275525e2c5916/brick -l /var/log/glusterfs/bricks/var-lib-heketi-mounts-vg_29d26d418f4ec01cbd8805704313e5e0-brick_013ad53a5ed578fd8f1275525e2c5916-brick.log --xlator-option *-posix.glusterd-uuid=cd776ee9-6a31-496d-a8af-072f4c23aee4 --brick-port 49154 --xlator-option vol_1fda560284e932cae1e384fe779b430f-server.listen-port=49154

glusterfs-storage-hr8ht
+++++++++++++++++++++++
root 527 1 0 07:49 ? 00:00:05 /usr/sbin/glusterfsd -s 10.70.46.219 --volfile-id heketidbstorage.10.70.46.219.var-lib-heketi-mounts-vg_a1297c7a138dcac578308e8afada5161-brick_8484360626b40cc47f707c683391f8b8-brick -p /var/run/gluster/vols/heketidbstorage/10.70.46.219-var-lib-heketi-mounts-vg_a1297c7a138dcac578308e8afada5161-brick_8484360626b40cc47f707c683391f8b8-brick.pid -S /var/run/gluster/49760ea523e2d92425cd374e21f7c6a6.socket --brick-name /var/lib/heketi/mounts/vg_a1297c7a138dcac578308e8afada5161/brick_8484360626b40cc47f707c683391f8b8/brick -l /var/log/glusterfs/bricks/var-lib-heketi-mounts-vg_a1297c7a138dcac578308e8afada5161-brick_8484360626b40cc47f707c683391f8b8-brick.log --xlator-option *-posix.glusterd-uuid=09aed8ae-858b-4d53-b7fa-745aa9443f18 --brick-port 49152 --xlator-option heketidbstorage-server.listen-port=49152
root 535 1 99 07:49 ? 04:15:43 /usr/sbin/glusterfsd -s 10.70.46.219 --volfile-id vol_9f93ae4c845f3910f5d1558cc5ae9f0a.10.70.46.219.var-lib-heketi-mounts-vg_a32d29646c91834eeac64529870c71cd-brick_3319cdb494d7c201d2991173d18f2575-brick -p /var/run/gluster/vols/vol_9f93ae4c845f3910f5d1558cc5ae9f0a/10.70.46.219-var-lib-heketi-mounts-vg_a32d29646c91834eeac64529870c71cd-brick_3319cdb494d7c201d2991173d18f2575-brick.pid -S /var/run/gluster/c3f1138994a30170b2d5759b8bdbc313.socket --brick-name /var/lib/heketi/mounts/vg_a32d29646c91834eeac64529870c71cd/brick_3319cdb494d7c201d2991173d18f2575/brick -l /var/log/glusterfs/bricks/var-lib-heketi-mounts-vg_a32d29646c91834eeac64529870c71cd-brick_3319cdb494d7c201d2991173d18f2575-brick.log --xlator-option *-posix.glusterd-uuid=09aed8ae-858b-4d53-b7fa-745aa9443f18 --brick-port 49153 --xlator-option vol_9f93ae4c845f3910f5d1558cc5ae9f0a-server.listen-port=49153

glusterfs-storage-q22cl
+++++++++++++++++++++++
root 549 1 0 07:10 ? 00:00:05 /usr/sbin/glusterfsd -s 10.70.46.231 --volfile-id heketidbstorage.10.70.46.231.var-lib-heketi-mounts-vg_357556739aad4d3b81d3e935a27339dc-brick_6ee29707729f1c338abc1604473d6059-brick -p /var/run/gluster/vols/heketidbstorage/10.70.46.231-var-lib-heketi-mounts-vg_357556739aad4d3b81d3e935a27339dc-brick_6ee29707729f1c338abc1604473d6059-brick.pid -S /var/run/gluster/26c7477db0159fa810a1574951aa87cc.socket --brick-name /var/lib/heketi/mounts/vg_357556739aad4d3b81d3e935a27339dc/brick_6ee29707729f1c338abc1604473d6059/brick -l /var/log/glusterfs/bricks/var-lib-heketi-mounts-vg_357556739aad4d3b81d3e935a27339dc-brick_6ee29707729f1c338abc1604473d6059-brick.log --xlator-option *-posix.glusterd-uuid=2018bea2-c934-4cb1-b19f-df8fb79752cc --brick-port 49152 --xlator-option heketidbstorage-server.listen-port=49152
root 557 1 99 07:10 ? 04:33:11 /usr/sbin/glusterfsd -s 10.70.46.231 --volfile-id vol_9f93ae4c845f3910f5d1558cc5ae9f0a.10.70.46.231.var-lib-heketi-mounts-vg_4525875aacd97d4337fb6a9c5f13eba6-brick_8c0881f317eb11607eff74a16027663d-brick -p /var/run/gluster/vols/vol_9f93ae4c845f3910f5d1558cc5ae9f0a/10.70.46.231-var-lib-heketi-mounts-vg_4525875aacd97d4337fb6a9c5f13eba6-brick_8c0881f317eb11607eff74a16027663d-brick.pid -S /var/run/gluster/e88f8f4b09da12a41b04762db8e06ada.socket --brick-name /var/lib/heketi/mounts/vg_4525875aacd97d4337fb6a9c5f13eba6/brick_8c0881f317eb11607eff74a16027663d/brick -l /var/log/glusterfs/bricks/var-lib-heketi-mounts-vg_4525875aacd97d4337fb6a9c5f13eba6-brick_8c0881f317eb11607eff74a16027663d-brick.log --xlator-option *-posix.glusterd-uuid=2018bea2-c934-4cb1-b19f-df8fb79752cc --brick-port 49153 --xlator-option vol_9f93ae4c845f3910f5d1558cc5ae9f0a-server.listen-port=49153
_________________________________________________________________________________

Gluster v status
--------------------
glusterfs-storage-9j9nk
+++++++++++++++++++++++
# gluster v status
Status of volume: heketidbstorage
Gluster process                                                            TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.46.231:/var/lib/heketi/mounts/vg_357556739aad4d3b81d3e935a27339dc/brick_6ee29707729f1c338abc1604473d6059/brick   49152  0    Y  549
Brick 10.70.46.219:/var/lib/heketi/mounts/vg_a1297c7a138dcac578308e8afada5161/brick_8484360626b40cc47f707c683391f8b8/brick   49152  0    Y  527
Brick 10.70.46.150:/var/lib/heketi/mounts/vg_6064162e01514ddd000da6dafdc79216/brick_c8dd81dd3761dd8212327131c4009716/brick   N/A    N/A  N  N/A
Self-heal Daemon on localhost       N/A  N/A  Y  12681
Self-heal Daemon on 10.70.46.231    N/A  N/A  Y  10348
Self-heal Daemon on 10.70.46.219    N/A  N/A  Y  9154

Task Status of Volume heketidbstorage
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: vol_1fda560284e932cae1e384fe779b430f
Gluster process                                                            TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.46.219:/var/lib/heketi/mounts/vg_a32d29646c91834eeac64529870c71cd/brick_a42d82892274b1e498d54eb791bec7e5/brick   49153  0  Y  535
Brick 10.70.46.150:/var/lib/heketi/mounts/vg_29d26d418f4ec01cbd8805704313e5e0/brick_013ad53a5ed578fd8f1275525e2c5916/brick   49154  0  Y  12654
Brick 10.70.46.231:/var/lib/heketi/mounts/vg_357556739aad4d3b81d3e935a27339dc/brick_e1fa6fec4ca03e73f937ed35bcfd51a3/brick   49153  0  Y  557
Self-heal Daemon on localhost       N/A  N/A  Y  12681
Self-heal Daemon on 10.70.46.219    N/A  N/A  Y  9154
Self-heal Daemon on 10.70.46.231    N/A  N/A  Y  10348

Task Status of Volume vol_1fda560284e932cae1e384fe779b430f
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: vol_9f93ae4c845f3910f5d1558cc5ae9f0a
Gluster process                                                            TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.46.231:/var/lib/heketi/mounts/vg_4525875aacd97d4337fb6a9c5f13eba6/brick_8c0881f317eb11607eff74a16027663d/brick   49153  0    Y  557
Brick 10.70.46.219:/var/lib/heketi/mounts/vg_a32d29646c91834eeac64529870c71cd/brick_3319cdb494d7c201d2991173d18f2575/brick   49153  0    Y  535
Brick 10.70.46.150:/var/lib/heketi/mounts/vg_29d26d418f4ec01cbd8805704313e5e0/brick_0cb321e2f4b4290bda1c2f9ae5085544/brick   N/A    N/A  N  N/A
Self-heal Daemon on localhost       N/A  N/A  Y  12681
Self-heal Daemon on 10.70.46.219    N/A  N/A  Y  9154
Self-heal Daemon on 10.70.46.231    N/A  N/A  Y  10348

Task Status of Volume vol_9f93ae4c845f3910f5d1558cc5ae9f0a
------------------------------------------------------------------------------
There are no active volume tasks

Brick log for heketidb volume
===============================
[2018-08-21 09:22:44.795543] I [addr.c:55:compare_addr_and_update] 0-/var/lib/heketi/mounts/vg_c2d019ac0e1e2a1f14d93fa86469ad99/brick_9132f7aaa37343987bab759a3c165645/brick: allowed = "*", received addr = "10.70.46.17"
[2018-08-21 09:22:44.795622] I [login.c:111:gf_auth] 0-auth/login: allowed user names: fa8aecfc-cee7-473a-9dcd-ac190b15b3e4
[2018-08-21 09:22:44.795649] I [MSGID: 115029] [server-handshake.c:817:server_setvolume] 0-heketidbstorage-server: accepted client from dhcp46-17.lab.eng.blr.redhat.com-25683-2018/08/21-06:05:30:237593-heketidbstorage-client-0-0-1 (version: 3.12.2)
[2018-08-21 09:23:02.108401] I [addr.c:55:compare_addr_and_update] 0-/var/lib/heketi/mounts/vg_c2d019ac0e1e2a1f14d93fa86469ad99/brick_9132f7aaa37343987bab759a3c165645/brick: allowed = "*", received addr = "10.70.46.177"
[2018-08-21 09:23:02.108426] I [login.c:111:gf_auth] 0-auth/login: allowed user names: fa8aecfc-cee7-473a-9dcd-ac190b15b3e4
[2018-08-21 09:23:02.108437] I [MSGID: 115029] [server-handshake.c:817:server_setvolume] 0-heketidbstorage-server: accepted client from dhcp46-177.lab.eng.blr.redhat.com-24679-2018/08/21-06:05:30:293619-heketidbstorage-client-0-0-1 (version: 3.12.2)
[2018-08-21 09:23:06.493428] I [addr.c:55:compare_addr_and_update] 0-/var/lib/heketi/mounts/vg_c2d019ac0e1e2a1f14d93fa86469ad99/brick_9132f7aaa37343987bab759a3c165645/brick: allowed = "*", received addr = "10.70.46.172"
[2018-08-21 09:23:06.493480] I [MSGID: 115029] [server-handshake.c:817:server_setvolume] 0-heketidbstorage-server: accepted client from dhcp46-172.lab.eng.blr.redhat.com-24267-2018/08/16-14:09:05:424515-heketidbstorage-client-0-0-1 (version: 3.8.4)
[2018-08-21 09:23:07.802951] I [addr.c:55:compare_addr_and_update] 0-/var/lib/heketi/mounts/vg_c2d019ac0e1e2a1f14d93fa86469ad99/brick_9132f7aaa37343987bab759a3c165645/brick: allowed = "*", received addr = "10.70.46.172"
[2018-08-21 09:23:07.802992] I [login.c:111:gf_auth] 0-auth/login: allowed user names: fa8aecfc-cee7-473a-9dcd-ac190b15b3e4
[2018-08-21 09:23:07.803004] I [MSGID: 115029] [server-handshake.c:817:server_setvolume] 0-heketidbstorage-server: accepted client from dhcp46-172.lab.eng.blr.redhat.com-496-2018/08/21-09:22:39:669378-heketidbstorage-client-0-0-0 (version: 3.12.2)
[2018-08-21 09:26:41.833795] I [addr.c:55:compare_addr_and_update] 0-/var/lib/heketi/mounts/vg_c2d019ac0e1e2a1f14d93fa86469ad99/brick_9132f7aaa37343987bab759a3c165645/brick: allowed = "*", received addr = "10.70.46.172"
[2018-08-21 09:26:41.833856] I [login.c:111:gf_auth] 0-auth/login: allowed user names: fa8aecfc-cee7-473a-9dcd-ac190b15b3e4
[2018-08-21 09:26:41.833881] I [MSGID: 115029] [server-handshake.c:817:server_setvolume] 0-heketidbstorage-server: accepted client from dhcp46-172.lab.eng.blr.redhat.com-998-2018/08/21-09:26:41:760999-heketidbstorage-client-0-0-0 (version: 3.12.2)
[2018-08-21 09:26:41.857008] I [MSGID: 115036] [server.c:549:server_rpc_notify] 0-heketidbstorage-server: disconnecting connection from dhcp46-172.lab.eng.blr.redhat.com-998-2018/08/21-09:26:41:760999-heketidbstorage-client-0-0-0
[2018-08-21 09:26:41.857339] I [MSGID: 101055] [client_t.c:443:gf_client_unref] 0-heketidbstorage-server: Shutting down connection dhcp46-172.lab.eng.blr.redhat.com-998-2018/08/21-09:26:41:760999-heketidbstorage-client-0-0-0
[2018-08-21 09:36:35.275709] I [MSGID: 115036] [server.c:549:server_rpc_notify] 0-heketidbstorage-server: disconnecting connection from dhcp46-17.lab.eng.blr.redhat.com-25683-2018/08/21-06:05:30:237593-heketidbstorage-client-0-0-1
[2018-08-21 09:36:35.277107] I [MSGID: 101055] [client_t.c:443:gf_client_unref] 0-heketidbstorage-server: Shutting down connection dhcp46-17.lab.eng.blr.redhat.com-25683-2018/08/21-06:05:30:237593-heketidbstorage-client-0-0-1
[2018-08-21 09:39:42.629219] I [addr.c:55:compare_addr_and_update] 0-/var/lib/heketi/mounts/vg_c2d019ac0e1e2a1f14d93fa86469ad99/brick_9132f7aaa37343987bab759a3c165645/brick: allowed = "*", received addr = "10.70.46.17"
[2018-08-21 09:39:42.629289] I [login.c:111:gf_auth] 0-auth/login: allowed user names: fa8aecfc-cee7-473a-9dcd-ac190b15b3e4
[2018-08-21 09:39:42.629313] I [MSGID: 115029] [server-handshake.c:817:server_setvolume] 0-heketidbstorage-server: accepted client from dhcp46-17.lab.eng.blr.redhat.com-499-2018/08/21-09:39:38:268882-heketidbstorage-client-0-0-0 (version: 3.12.2)
[2018-08-21 09:39:49.676875] I [addr.c:55:compare_addr_and_update] 0-/var/lib/heketi/mounts/vg_c2d019ac0e1e2a1f14d93fa86469ad99/brick_9132f7aaa37343987bab759a3c165645/brick: allowed = "*", received addr = "10.70.46.172"
[2018-08-21 09:39:49.676922] I [login.c:111:gf_auth] 0-auth/login: allowed user names: fa8aecfc-cee7-473a-9dcd-ac190b15b3e4
[2018-08-21 09:39:49.676935] I [MSGID: 115029] [server-handshake.c:817:server_setvolume] 0-heketidbstorage-server: accepted client from dhcp46-172.lab.eng.blr.redhat.com-1615-2018/08/21-09:39:49:652520-heketidbstorage-client-0-0-0 (version: 3.12.2)
[2018-08-21 09:39:49.700266] I [MSGID: 115036] [server.c:549:server_rpc_notify] 0-heketidbstorage-server: disconnecting connection from dhcp46-172.lab.eng.blr.redhat.com-1615-2018/08/21-09:39:49:652520-heketidbstorage-client-0-0-0
[2018-08-21 09:39:49.700391] I [MSGID: 101055] [client_t.c:443:gf_client_unref] 0-heketidbstorage-server: Shutting down connection dhcp46-172.lab.eng.blr.redhat.com-1615-2018/08/21-09:39:49:652520-heketidbstorage-client-0-0-0
[2018-08-21 09:41:28.103946] I [addr.c:55:compare_addr_and_update] 0-/var/lib/heketi/mounts/vg_c2d019ac0e1e2a1f14d93fa86469ad99/brick_9132f7aaa37343987bab759a3c165645/brick: allowed = "*", received addr = "10.70.46.17"
[2018-08-21 09:41:28.103994] I [login.c:111:gf_auth] 0-auth/login: allowed user names: fa8aecfc-cee7-473a-9dcd-ac190b15b3e4
[2018-08-21 09:41:28.104007] I [MSGID: 115029] [server-handshake.c:817:server_setvolume] 0-heketidbstorage-server: accepted client from dhcp46-17.lab.eng.blr.redhat.com-883-2018/08/21-09:41:27:982577-heketidbstorage-client-0-0-0 (version: 3.12.2)
[2018-08-21 09:41:28.150568] I [MSGID: 115036] [server.c:549:server_rpc_notify] 0-heketidbstorage-server: disconnecting connection from dhcp46-17.lab.eng.blr.redhat.com-883-2018/08/21-09:41:27:982577-heketidbstorage-client-0-0-0
[2018-08-21 09:41:28.150858] I [MSGID: 101055] [client_t.c:443:gf_client_unref] 0-heketidbstorage-server: Shutting down connection dhcp46-17.lab.eng.blr.redhat.com-883-2018/08/21-09:41:27:982577-heketidbstorage-client-0-0-0
[2018-08-21 10:03:50.329491] I [MSGID: 115036] [server.c:549:server_rpc_notify] 0-heketidbstorage-server: disconnecting connection from dhcp46-177.lab.eng.blr.redhat.com-24679-2018/08/21-06:05:30:293619-heketidbstorage-client-0-0-1
[2018-08-21 10:03:50.329797] I [MSGID: 101055] [client_t.c:443:gf_client_unref] 0-heketidbstorage-server: Shutting down connection dhcp46-177.lab.eng.blr.redhat.com-24679-2018/08/21-06:05:30:293619-heketidbstorage-client-0-0-1
[2018-08-21 10:05:30.011423] I [addr.c:55:compare_addr_and_update] 0-/var/lib/heketi/mounts/vg_c2d019ac0e1e2a1f14d93fa86469ad99/brick_9132f7aaa37343987bab759a3c165645/brick: allowed = "*", received addr = "10.70.46.177"
[2018-08-21 10:05:30.011483] I [login.c:111:gf_auth] 0-auth/login: allowed user names: fa8aecfc-cee7-473a-9dcd-ac190b15b3e4
[2018-08-21 10:05:30.011530] I [MSGID: 115029] [server-handshake.c:817:server_setvolume] 0-heketidbstorage-server: accepted client from dhcp46-177.lab.eng.blr.redhat.com-483-2018/08/21-10:05:27:105998-heketidbstorage-client-0-0-0 (version: 3.12.2)
[2018-08-21 10:29:07.921950] I [addr.c:55:compare_addr_and_update] 0-/var/lib/heketi/mounts/vg_c2d019ac0e1e2a1f14d93fa86469ad99/brick_9132f7aaa37343987bab759a3c165645/brick: allowed = "*", received addr = "10.70.46.177"
[2018-08-21 10:29:07.922022] I [login.c:111:gf_auth] 0-auth/login: allowed user names: fa8aecfc-cee7-473a-9dcd-ac190b15b3e4
[2018-08-21 10:29:07.922047] I [MSGID: 115029] [server-handshake.c:817:server_setvolume] 0-heketidbstorage-server: accepted client from dhcp46-177.lab.eng.blr.redhat.com-1583-2018/08/21-10:29:07:678540-heketidbstorage-client-0-0-0 (version: 3.12.2)
[2018-08-21 10:29:08.029658] I [MSGID: 115036] [server.c:549:server_rpc_notify] 0-heketidbstorage-server: disconnecting connection from dhcp46-177.lab.eng.blr.redhat.com-1583-2018/08/21-10:29:07:678540-heketidbstorage-client-0-0-0
[2018-08-21 10:29:08.029839] I [MSGID: 101055] [client_t.c:443:gf_client_unref] 0-heketidbstorage-server: Shutting down connection dhcp46-177.lab.eng.blr.redhat.com-1583-2018/08/21-10:29:07:678540-heketidbstorage-client-0-0-0
[2018-08-21 10:54:46.397618] I [MSGID: 100030] [glusterfsd.c:2504:main] 0-/usr/sbin/glusterfsd: Started running /usr/sbin/glusterfsd version 3.12.2 (args: /usr/sbin/glusterfsd -s 10.70.46.172 --volfile-id heketidbstorage.10.70.46.172.var-lib-heketi-mounts-vg_c2d019ac0e1e2a1f14d93fa86469ad99-brick_9132f7aaa37343987bab759a3c165645-brick -p /var/run/gluster/vols/heketidbstorage/10.70.46.172-var-lib-heketi-mounts-vg_c2d019ac0e1e2a1f14d93fa86469ad99-brick_9132f7aaa37343987bab759a3c165645-brick.pid -S /var/run/gluster/81fedf6447aee0c69785f94f7880d634.socket --brick-name /var/lib/heketi/mounts/vg_c2d019ac0e1e2a1f14d93fa86469ad99/brick_9132f7aaa37343987bab759a3c165645/brick -l /var/log/glusterfs/bricks/var-lib-heketi-mounts-vg_c2d019ac0e1e2a1f14d93fa86469ad99-brick_9132f7aaa37343987bab759a3c165645-brick.log --xlator-option *-posix.glusterd-uuid=ce5ee1b5-b1a5-4a89-bb60-47e01f7d54ae --brick-port 49156 --xlator-option heketidbstorage-server.listen-port=49156)
[2018-08-21 10:54:46.404525] W [MSGID: 101002] [options.c:995:xl_opt_validate] 0-glusterfs: option 'address-family' is deprecated, preferred is 'transport.address-family', continuing with correction
[2018-08-21 10:54:46.408045] I [MSGID: 101190] [event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2018-08-21 10:55:14.934991] W [socket.c:593:__socket_rwv] 0-glusterfs: readv on 10.70.46.172:24007 failed (Connection reset by peer)
[2018-08-21 10:55:14.935125] I [glusterfsd-mgmt.c:2349:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: 10.70.46.172
[2018-08-21 10:55:14.935162] I [glusterfsd-mgmt.c:2370:mgmt_rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile servers
[2018-08-21 10:55:14.935484] W [glusterfsd.c:1367:cleanup_and_exit] (-->/lib64/libgfrpc.so.0(rpc_clnt_notify+0xab) [0x7f0bdc309b0b] -->/usr/sbin/glusterfsd(+0x117fd) [0x55b5bb2d27fd] -->/usr/sbin/glusterfsd(cleanup_and_exit+0x6b) [0x55b5bb2cb3eb] ) 0-: received signum (1), shutting down
[2018-08-21 10:55:14.935539] I [socket.c:3599:socket_submit_request] 0-glusterfs: not connected (priv->connected = -1)
[2018-08-21 10:55:14.935548] W [rpc-clnt.c:1679:rpc_clnt_submit] 0-glusterfs: failed to submit rpc-request (XID: 0x3 Program: Gluster Portmap, ProgVers: 1, Proc: 5) to rpc-transport (glusterfs)
sh-4.2#
Root cause: Because of a refactoring in the glusterd_brick_start () code path, during a friend handshake, in cases where glusterd was only supposed to connect to the brick (because quorum was not yet met), we ended up restarting bricks instead. Since brick restarts are attempted from multiple threads, such situations can lead to a deadlock while bringing up the very first brick when brick multiplexing is enabled.
COMMIT: https://review.gluster.org/20935 committed in master by "Atin Mukherjee" <amukherj> with a commit message:

glusterd: glusterd_brick_start shouldn't try to bring up brick if only_connect is true

With the latest refactoring in the glusterd_brick_start () function, in case we run into a situation where gf_is_service_running () returns a valid pid which is running but doesn't belong to a gluster process, then even with the only_connect flag passed as gf_true we'd end up trying to start a brick. This causes a deadlock in brick multiplexing, as both glusterd_restart_bricks () and glusterd_do_volume_quorum_action () end up context-switching with each other for the same brick. The following bt shows the same:

(gdb) t a a bt
Thread 8 (Thread 0x7fcced48a700 (LWP 11959)):
srch_vol=srch_vol@entry=0xbe0410, comp_vol=comp_vol@entry=0xc03680, brickinfo=brickinfo@entry=0xc14ef0) at glusterd-utils.c:5834
brickinfo=0xc14ef0, volinfo=0xc03680, conf=<optimized out>) at glusterd-utils.c:5902
brickinfo=brickinfo@entry=0xc14ef0, wait=wait@entry=_gf_false, only_connect=only_connect@entry=_gf_true) at glusterd-utils.c:6251
volinfo=0xc03680, meets_quorum=_gf_true) at glusterd-server-quorum.c:402
at glusterd-server-quorum.c:443
iov=iov@entry=0x7fcce0004040, count=count@entry=1, myframe=myframe@entry=0x7fcce00023a0) at glusterd-rpc-ops.c:542
iov=0x7fcce0004040, count=1, myframe=0x7fcce00023a0, fn=0x7fccf12403d0 <__glusterd_friend_add_cbk>) at glusterd-rpc-ops.c:223
at rpc-transport.c:538
Thread 7 (Thread 0x7fccedc8b700 (LWP 11958)):
Thread 6 (Thread 0x7fccf1d67700 (LWP 11877)):
brickinfo=brickinfo@entry=0xc14ef0) at glusterd-utils.c:5834
at glusterd-utils.c:6251
Thread 5 (Thread 0x7fccf2568700 (LWP 11876)):
Thread 4 (Thread 0x7fccf2d69700 (LWP 11875)):
Thread 3 (Thread 0x7fccf356a700 (LWP 11874)):
Thread 2 (Thread 0x7fccf3d6b700 (LWP 11873)):
Thread 1 (Thread 0x7fccf68a8780 (LWP 11872)):

Fix: Ensure we don't restart bricks if only_connect is true; only attempt to connect to the brick.

Test: Simulated a code change so that gf_is_service_running () always returns true, to hit the scenario.

Change-Id: Iec184e6c9e8aabef931d310f931f4d7a580f0f48
Fixes: bz#1620544
Signed-off-by: Atin Mukherjee <amukherj>
This bug is being closed because a release that should address the reported issue has been made available. If the problem is still not fixed with glusterfs-5.0, please open a new bug report.

glusterfs-5.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2018-October/000115.html
[2] https://www.gluster.org/pipermail/gluster-users/