Bug 1595320 - gluster wrongly reports bricks online, even when brick path is not available
Summary: gluster wrongly reports bricks online, even when brick path is not available
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: mainline
Hardware: x86_64
OS: Linux
Importance: unspecified urgent
Target Milestone: ---
Assignee: Mohit Agrawal
QA Contact:
URL:
Whiteboard:
Depends On: 1589279
Blocks:
 
Reported: 2018-06-26 15:33 UTC by Mohit Agrawal
Modified: 2021-09-09 14:45 UTC
CC List: 10 users

Fixed In Version: glusterfs-5.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1589279
Environment:
Last Closed: 2018-10-23 15:12:07 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Comment 1 Atin Mukherjee 2018-07-17 14:44:10 UTC
Description of problem:

gluster wrongly reports bricks online, even when the brick path is not available


Version-Release number of selected component (if applicable):

mainline

When we restart a node in a CNS cluster, not all paths (brick paths) listed in /var/lib/heketi/fstab get mounted:

$ cat fstab |wc -l
127


$ cat df_output |wc -l
86

But gluster volume status shows the bricks as online:

$ cat vol_status |grep -i vol_117382c88c4337df0b0ee35a3cb7ca51 -A15
Status of volume: vol_117382c88c4337df0b0ee35a3cb7ca51
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.16.77.21:/var/lib/heketi/mounts/vg
_809af91663966a9fd655d5955bc1ad31/brick_e9a
b265ff2a9f7608e66e17a1e90cf3d/brick         49155     0          Y       8944
Brick 10.16.77.20:/var/lib/heketi/mounts/vg
_0a7e1052758ea35c3a27b5842e14e8b4/brick_a28
118e271db63e880e2ac5f06609617/brick         49153     0          Y       27905
Brick 10.16.77.23:/var/lib/heketi/mounts/vg
_a7f22615f3be390d5f8648cbe32ed001/brick_c15
1ec7fb3a34b1a3daa361e127f5c76/brick         49152     0          Y       1075
Self-heal Daemon on localhost               N/A       N/A        Y       31715
Self-heal Daemon on 10.16.77.25             N/A       N/A        Y       30533
Self-heal Daemon on crp-prod-glusterfs02.sr
v.allianz                                   N/A       N/A        Y       17016
--
Task Status of Volume vol_117382c88c4337df0b0ee35a3cb7ca51
------------------------------------------------------------------------------
There are no active volume tasks


However, when we run gluster volume heal info on the volumes, it shows "Transport endpoint is not connected", which is correct, since the brick path is not available at all.

Note: For the volume taken as the example above, I have manually mounted the brick corresponding to node 10.16.77.23.
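
For reference, one simple way to confirm that a brick directory like the ones above is not backed by its own mount is to compare the device ID of the path with that of its parent. The helper below is a minimal sketch for illustration only; the names are made up and this is not taken from the gluster sources:

~~~
#include <stdio.h>
#include <sys/stat.h>

/* Hypothetical helper, not glusterd code: a directory that is backed by
 * its own mount has a different device ID (st_dev) than its parent. */
static int
is_own_mount (const char *path)
{
        struct stat st = {0}, parent_st = {0};
        char parent[4096];

        if (stat (path, &st) != 0)
                return 0;               /* path missing or unreachable */

        snprintf (parent, sizeof (parent), "%s/..", path);
        if (stat (parent, &parent_st) != 0)
                return 0;

        return st.st_dev != parent_st.st_dev;
}

int
main (int argc, char **argv)
{
        if (argc < 2) {
                fprintf (stderr, "usage: %s <brick-path>\n", argv[0]);
                return 2;
        }

        printf ("%s: %s\n", argv[1],
                is_own_mount (argv[1]) ? "separate mount" : "NOT a separate mount");
        return 0;
}
~~~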


Brick log snippet :


~~~
[2018-06-01 08:54:28.356370] E [index.c:2342:init] 8-vol_117382c88c4337df0b0ee35a3cb7ca51-index: Failed to find index basepath /var/lib/heketi/mounts/vg_a7f22615f3be390d5f8648cbe32ed001/brick_c151ec7fb3a34b1a3daa361e127f5c76/brick/.glusterfs/indices.
[2018-06-01 08:54:28.356403] W [graph.c:1192:glusterfs_graph_attach] 0-glusterfs: failed to initialize graph for xlator /var/lib/heketi/mounts/vg_a7f22615f3be390d5f8648cbe32ed001/brick_c151ec7fb3a34b1a3daa361e127f5c76/brick
[2018-06-01 09:07:51.362685] I [glusterfsd-mgmt.c:864:glusterfs_handle_attach] 0-glusterfs: got attach for /var/lib/glusterd/vols/vol_117382c88c4337df0b0ee35a3cb7ca51/vol_117382c88c4337df0b0ee35a3cb7ca51.10.16.77.23.var-lib-heketi-mounts-vg_a7f22615f3be390d5f8648cbe32ed001-brick_c151ec7fb3a34b1a3daa361e127f5c76-brick.vol
[2018-06-01 09:07:51.370311] E [index.c:2342:init] 7-vol_117382c88c4337df0b0ee35a3cb7ca51-index: Failed to find index basepath /var/lib/heketi/mounts/vg_a7f22615f3be390d5f8648cbe32ed001/brick_c151ec7fb3a34b1a3daa361e127f5c76/brick/.glusterfs/indices.
[2018-06-01 09:07:51.370363] W [graph.c:1192:glusterfs_graph_attach] 0-glusterfs: failed to initialize graph for xlator /var/lib/heketi/mounts/vg_a7f22615f3be390d5f8648cbe32ed001/brick_c151ec7fb3a34b1a3daa361e127f5c76/brick
~~~

Comment 2 Worker Ant 2018-07-27 01:24:38 UTC
COMMIT: https://review.gluster.org/20202 committed in master by "Atin Mukherjee" <amukherj> with a commit message- glusterd: Add multiple checks before attach/start a brick

Problem: In a brick-mux scenario, glusterd is sometimes not able
         to start/attach a brick, yet gluster v status shows the
         brick as already running

Solution:
          1) To make sure the brick is running, check for the brick_path
             in /proc/<pid>/fd; if the brick path is held open by the
             brick process, the brick stack has come up, otherwise it
             has not
          2) Before starting/attaching a brick, check whether the brick
             path is mounted
          3) At the time of printing volume status, check whether the
             brick is consumed by any brick process

Test:  To test the same, the following procedure was used
       1) Set up a brick-mux environment on a VM
       2) Put a breakpoint in gdb in the function posix_health_check_thread_proc
          at the point where the GF_EVENT_CHILD_DOWN event is notified
       3) Forcefully unmount any one brick path
       4) Check gluster v status; it will show N/A for the brick
       5) Try to start the volume with the force option; glusterd throws
          the message "No device available for mount brick"
       6) Mount the brick_root path
       7) Try to start the volume with the force option
       8) The down brick is started successfully

Change-Id: I91898dad21d082ebddd12aa0d1f7f0ed012bdf69
fixes: bz#1595320
Signed-off-by: Mohit Agrawal <moagrawa>
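
The core of check (1) in the solution above is verifying that the brick process really holds the brick path, which amounts to walking /proc/<pid>/fd and resolving each descriptor. The sketch below illustrates that idea; the helper name and structure are hypothetical and simplified, not the actual glusterd implementation:

~~~
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not glusterd code: returns 1 if the given process
 * holds any file descriptor resolving under brick_path. */
static int
brick_path_open_in_proc (pid_t pid, const char *brick_path)
{
        char fd_dir[64], link_path[320], target[PATH_MAX + 1];
        struct dirent *entry = NULL;
        DIR *dir = NULL;
        ssize_t len = 0;
        int found = 0;

        snprintf (fd_dir, sizeof (fd_dir), "/proc/%ld/fd", (long) pid);
        dir = opendir (fd_dir);
        if (!dir)
                return 0;               /* process gone or fd dir unreadable */

        while ((entry = readdir (dir)) != NULL) {
                snprintf (link_path, sizeof (link_path), "%s/%s",
                          fd_dir, entry->d_name);
                len = readlink (link_path, target, PATH_MAX);
                if (len <= 0)
                        continue;       /* "." and ".." land here too */
                target[len] = '\0';

                /* an open fd under the brick path means the brick stack
                 * actually came up inside this process */
                if (strncmp (target, brick_path, strlen (brick_path)) == 0) {
                        found = 1;
                        break;
                }
        }

        closedir (dir);
        return found;
}

int
main (int argc, char **argv)
{
        if (argc < 3) {
                fprintf (stderr, "usage: %s <brick-pid> <brick-path>\n", argv[0]);
                return 2;
        }

        printf ("%s\n", brick_path_open_in_proc ((pid_t) atol (argv[1]), argv[2])
                        ? "brick is attached to this process"
                        : "brick is NOT attached to this process");
        return 0;
}
~~~

A process that exists but never got the brick attached has no descriptor resolving under the brick path, which is exactly the state this bug describes being wrongly reported as online.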

Comment 3 Worker Ant 2018-08-07 05:18:06 UTC
REVIEW: https://review.gluster.org/20651 (glusterd: more stricter checks of if brick is running in multiplex mode) posted (#1) for review on master by Atin Mukherjee

Comment 4 Worker Ant 2018-08-09 02:53:24 UTC
COMMIT: https://review.gluster.org/20651 committed in master by "Atin Mukherjee" <amukherj> with a commit message- glusterd: more stricter checks of if brick is running in multiplex mode

While the gf_attach () utility, which the kill_brick () function in
tests/volume.rc uses, can help in detaching a brick instance from the
brick process, it has the following caveats:
1. It doesn't ensure the respective brick is marked as stopped, which
glusterd does from glusterd_brick_stop.
2. If kill_brick () is executed just after a brick stack comes up,
mgmt_rpc_notify () can take some time to set priv->connected to 1; if
kill_brick () runs before that, the brick will fail to initiate the
pmap signout, which in turn cleans up the pidfile.

To avoid such possibilities, a stricter check on whether a brick is
running in brick multiplexing has been brought in: it not only checks
for the pid's existence but also checks whether the respective process
has the brick instance associated with it before reporting the brick's
status.

Change-Id: I98b92df949076663b9686add7aab4ec2f24ad5ab
Fixes: bz#1595320
Signed-off-by: Atin Mukherjee <amukherj>
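
The stricter check described here is effectively two staged: the pid recorded in the brick's pidfile must be alive, and that process must additionally have the brick instance attached (for example via a /proc/<pid>/fd scan like the one sketched after Comment 2). Below is a minimal, hypothetical sketch of the pid-liveness half, assuming a plain pidfile that contains a single pid; it is illustrative only, not the glusterd code path:

~~~
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical helper: read a single pid from a pidfile and probe it with
 * signal 0 (no signal is delivered; only existence/permission is checked). */
static int
pid_alive_from_pidfile (const char *pidfile, pid_t *out_pid)
{
        FILE *fp = fopen (pidfile, "r");
        long pid = 0;

        if (!fp)
                return 0;
        if (fscanf (fp, "%ld", &pid) != 1)
                pid = 0;
        fclose (fp);

        if (pid <= 0)
                return 0;

        *out_pid = (pid_t) pid;
        return kill ((pid_t) pid, 0) == 0;
}

int
main (int argc, char **argv)
{
        pid_t pid = 0;

        if (argc < 2) {
                fprintf (stderr, "usage: %s <pidfile>\n", argv[0]);
                return 2;
        }

        if (pid_alive_from_pidfile (argv[1], &pid))
                printf ("pid %ld is alive; still need to verify the brick is attached\n",
                        (long) pid);
        else
                printf ("no live process behind %s\n", argv[1]);
        return 0;
}
~~~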

Comment 5 Shyamsundar 2018-10-23 15:12:07 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-5.0, please open a new bug report.

glusterfs-5.0 has been announced on the Gluster mailing lists [1], and packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2018-October/000115.html
[2] https://www.gluster.org/pipermail/gluster-users/

