Bug 1812849 - Setting volume option when one of the glusterd is stopped in the cluster, post glusterd restart seeing couldn't find vol info in glusterd logs and shd, brick process offline
Summary: Setting volume option when one of the glusterd is stopped in the cluster, pos...
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: 7
Hardware: x86_64
OS: Linux
high
high
Target Milestone: ---
Assignee: Sanju
QA Contact:
URL:
Whiteboard:
Depends On: 1733425
Blocks: 1727269
TreeView+ depends on / blocked
 
Reported: 2020-03-12 10:44 UTC by Sanju
Modified: 2020-03-16 08:26 UTC (History)
10 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1733425
Environment:
Last Closed: 2020-03-16 08:26:13 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Gluster.org Gerrit 24220 0 None Merged glusterd: stop stale bricks during handshaking in brick mux mode 2020-03-16 08:26:11 UTC

Description Sanju 2020-03-12 10:44:32 UTC
Description of problem:
When glusterd is stopped on one node and performed volume set operation on only two volumes(one is upgraded and another volume is newly created.). Restarted glusterd on the node where it stopped on that node seeing "Couldn't find volinfo" errors and shd and brick process on the two volume where volume option is set are offline.


Version-Release number of selected component (if applicable):
glusterfs-6.0-7

How reproducible:
1/1

Steps to Reproduce:
1. Create 1000 volumes and start them
2. Performed in-service upgrade on the cluster. 
3. Create 1000 more volumes (1X3 replicate) and start them
4. Stop glusterd on node N1.
5. Perform volume set operation on two volumes from n2 'performance.readdir-ahead on'. One which is being used for upgrade and another newly created volume. 
6. Start glusterd on node N1.

Actual results:
glusterfsd and shd on these two volumes are offline.
seeing "couldn't find vol info" in glusterd logs. 

Expected results:
glusterfsd and shd should be online and should not see any errors with "Couldn't find volinfo" in glusterd log.

Additional info:

glusterd log snippet
#########################################
[2019-07-05 06:49:29.078858] E [glusterd-handshake.c:343:build_volfile_path] 0-management: Couldn't find volinfo
[2019-07-05 06:49:29.078892] E [MSGID: 106619] [glusterd-svc-helper.c:791:__glusterd_send_svc_configure_req] 0-management: Unable to stat  (No such file or directory)
[2019-07-05 06:49:30.079325] E [glusterd-handshake.c:343:build_volfile_path] 0-management: Couldn't find volinfo
[2019-07-05 06:49:31.079862] E [glusterd-handshake.c:343:build_volfile_path] 0-management: Couldn't find volinfo
[2019-07-05 06:49:32.080355] E [glusterd-handshake.c:343:build_volfile_path] 0-management: Couldn't find volinfo
[2019-07-05 06:49:33.080899] E [glusterd-handshake.c:343:build_volfile_path] 0-management: Couldn't find volinfo
[2019-07-05 06:49:34.081374] E [glusterd-handshake.c:343:build_volfile_path] 0-management: Couldn't find volinfo
The message "E [MSGID: 106619] [glusterd-svc-helper.c:791:__glusterd_send_svc_configure_req] 0-management: Unable to stat  (No such file or directory)" repeated 5 times between [2019-07-05 06:49:29.078892] and [2019-07-05 06:49:34.081406]
[2019-07-05 06:49:35.081508] W [MSGID: 106619] [glusterd-svc-helper.c:931:glusterd_attach_svc] 0-glusterd: attach failed for glustershd(volume=ronnie_-999)
[2019-07-05 06:49:35.081583] E [MSGID: 106048] [glusterd-shd-svc.c:477:glusterd_shdsvc_start] 0-glusterd: Failed to attach shd svc(volume=ronnie_-999) to pid=-1
[2019-07-05 06:49:44.928310] E [MSGID: 106558] [glusterd-volgen.c:3289:volgen_graph_build_clients] 0-glusterd: volume inconsistency: brick count is 0
[2019-07-05 06:49:44.928395] E [MSGID: 106068] [glusterd-shd-svc.c:191:glusterd_shdsvc_create_volfile] 0-management: Failed to create volfile
[2019-07-05 06:49:44.941696] I [glusterd-utils.c:6560:glusterd_brick_start] 0-management: discovered already-running brick /bricks/brick9/testvol_-999
[2019-07-05 06:49:44.941768] I [MSGID: 106142] [glusterd-pmap.c:290:pmap_registry_bind] 0-pmap: adding brick /bricks/brick9/testvol_-999 on port 49155
[2019-07-05 06:49:44.941824] I [rpc-clnt.c:1014:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2019-07-05 06:49:44.945117] I [MSGID: 106620] [glusterd-svc-helper.c:900:glusterd_attach_svc] 0-glusterd: adding svc glustershd (volume=testvol_-999) to existing process with pid 590
[2019-07-05 06:49:44.945285] E [glusterd-handshake.c:343:build_volfile_path] 0-management: Couldn't find volinfo
[2019-07-05 06:49:44.945316] E [MSGID: 106619] [glusterd-svc-helper.c:791:__glusterd_send_svc_configure_req] 0-management: Unable to stat  (No such file or directory)
[2019-07-05 06:49:45.018727] I [socket.c:3837:socket_submit_outgoing_msg] 0-socket.management: not connected (priv->connected = -1)
[2019-07-05 06:49:45.018748] E [rpcsvc.c:1583:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x2, Program: GlusterD svc cli, ProgVers: 2, Proc: 27) to rpc-transport (socket.management)
[2019-07-05 06:49:45.018761] E [MSGID: 106430] [glusterd-utils.c:600:glusterd_submit_reply] 0-glusterd: Reply submission failed
[2019-07-05 06:49:45.023858] W [MSGID: 106056] [glusterd-snapshot-utils.c:376:glusterd_snap_volinfo_find_by_volume_id] 0-management: Snap volume not found
[2019-07-05 06:49:45.945716] E [glusterd-handshake.c:343:build_volfile_path] 0-management: Couldn't find volinfo
[2019-07-05 06:49:45.945750] E [MSGID: 106619] [glusterd-svc-helper.c:791:__glusterd_send_svc_configure_req] 0-management: Unable to stat  (No such file or directory)
[2019-07-05 06:49:46.447404] E [glusterd-handshake.c:343:build_volfile_path] 0-management: Couldn't find volinfo
[2019-07-05 06:49:46.946292] E [glusterd-handshake.c:343:build_volfile_path] 0-management: Couldn't find volinfo
[2019-07-05 06:49:47.946837] E [glusterd-handshake.c:343:build_volfile_path] 0-management: Couldn't find volinfo
[2019-07-05 06:49:48.947321] E [glusterd-handshake.c:343:build_volfile_path] 0-management: Couldn't find volinfo
[2019-07-05 06:49:49.947840] E [glusterd-handshake.c:343:build_volfile_path] 0-management: Couldn't find volinfo
[2019-07-05 06:49:50.948339] E [glusterd-handshake.c:343:build_volfile_path] 0-management: Couldn't find volinfo
[2019-07-05 06:49:51.948895] E [glusterd-handshake.c:343:build_volfile_path] 0-management: Couldn't find volinfo
[2019-07-05 06:49:52.949406] E [glusterd-handshake.c:343:build_volfile_path] 0-management: Couldn't find volinfo
[2019-07-05 06:49:53.950006] E [glusterd-handshake.c:343:build_volfile_path] 0-management: Couldn't find volinfo
[2019-07-05 06:49:54.950527] E [glusterd-handshake.c:343:build_volfile_path] 0-management: Couldn't find volinfo
[2019-07-05 06:49:55.951051] E [glusterd-handshake.c:343:build_volfile_path] 0-management: Couldn't find volinfo
[2019-07-05 06:49:56.951573] E [glusterd-handshake.c:343:build_volfile_path] 0-management: Couldn't find volinfo
[2019-07-05 06:49:57.952095] E [glusterd-handshake.c:343:build_volfile_path] 0-management: Couldn't find volinfo
[2019-07-05 06:49:58.952597] E [glusterd-handshake.c:343:build_volfile_path] 0-management: Couldn't find volinfo
The message "E [MSGID: 106619] [glusterd-svc-helper.c:791:__glusterd_send_svc_configure_req] 0-management: Unable to stat  (No such file or directory)" repeated 13 times between [2019-07-05 06:49:45.945750] and [2019-07-05 06:49:58.952676]
[2019-07-05 06:49:59.952756] W [MSGID: 106619] [glusterd-svc-helper.c:931:glusterd_attach_svc] 0-glusterd: attach failed for glustershd(volume=testvol_-999)
[2019-07-05 06:49:59.953137] E [MSGID: 106048] [glusterd-shd-svc.c:477:glusterd_shdsvc_start] 0-glusterd: Failed to attach shd svc(volume=testvol_-999) to pid=590

--- Additional comment from RHEL Product and Program Management on 2019-07-05 10:08:50 UTC ---

This bug is automatically being proposed for the next minor release of Red Hat Gluster Storage by setting the release flag 'rhgs‑3.5.0' to '?'. 

If this bug should be proposed for a different release, please manually change the proposed release flag.

--- Additional comment from Bala Konda Reddy M on 2019-07-05 12:26:34 UTC ---

Sosreports are available here: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/bmekala/bug.1727269/

--- Additional comment from Atin Mukherjee on 2019-07-08 06:09:51 UTC ---

Proposing this as blocker.

--- Additional comment from Atin Mukherjee on 2019-07-09 04:06:01 UTC ---

This is reproducible. Based on my testing following is the observation:

1. "Couldn't find volinfo" message comes for the volumes where they have been modified (i.e. volinfo->version is bumped up in other nodes) and when the glusterd instance comes up online and do a handshake during shd volfile building, following error is seen:

[2019-07-09 03:56:24.715943] E [MSGID: 106558] [glusterd-volgen.c:3289:volgen_graph_build_clients] 0-glusterd: volume inconsistency: brick count is 0
[2019-07-09 03:56:24.716003] E [MSGID: 106068] [glusterd-shd-svc.c:191:glusterd_shdsvc_create_volfile] 0-management: Failed to create volfile
[2019-07-09 03:56:24.802515] I [glusterd-utils.c:6562:glusterd_brick_start] 0-management: discovered already-running brick /tmp/b5
[2019-07-09 03:56:24.802604] I [MSGID: 106142] [glusterd-pmap.c:290:pmap_registry_bind] 0-pmap: adding brick /tmp/b5 on port 49152
[2019-07-09 03:56:24.802687] I [rpc-clnt.c:1014:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2019-07-09 03:56:24.806137] I [MSGID: 106131] [glusterd-proc-mgmt.c:86:glusterd_proc_stop] 0-management: glustershd already stopped
[2019-07-09 03:56:24.820172] I [MSGID: 106620] [glusterd-svc-helper.c:922:glusterd_attach_svc] 0-glusterd: adding svc glustershd (volume=test-vol5) to existing process with pid 18713
[2019-07-09 03:56:24.820288] E [MSGID: 106048] [glusterd-handshake.c:344:build_volfile_path] 0-management: Couldn't find volinfo for volid=test-vol5
[2019-07-09 03:56:24.820340] E [MSGID: 106619] [glusterd-svc-helper.c:796:__glusterd_send_svc_configure_req] 0-management: Unable to stat  (No such file or directory)
 
And then we see [2019-07-09 03:56:24.820288] E [MSGID: 106048] [glusterd-handshake.c:344:build_volfile_path] 0-management: Couldn't find volinfo for volid=test-vol5 (not to get confused, we modified the log to ensure volume name is also dumped)

2. This happens only when brick mux is on as per my testing, so probably the race is more exposed with brick mux.

This is definitely a release blocker and need to be fixed.

--- Additional comment from RHEL Product and Program Management on 2019-07-09 16:15:59 UTC ---

This BZ is being approved for the RHGS 3.5.0 release, upon receipt of the 3 ACKs (PM,Devel,QA) for the release flag 'rhgs-3.5.0', and on being attached to an official RHGS 3.5.0 BZ Tracker

--- Additional comment from Atin Mukherjee on 2019-07-12 06:37:08 UTC ---

This bug needs to be dropped from 3.5.0 as it's originated due to shd multiplexing and it was agreed upon that this feature will be dropped from 3.5.0 given it still needs some more soak time and stabilization effort.

--- Additional comment from Sunil Kumar Acharya on 2019-07-12 17:46:52 UTC ---

Moving this bug out of RHGS-3.5.0 as SHD-MUX has been dropped from RHGS-3.5.0 based on the discussion carried out during the program call held on 10-07-2019.

--- Additional comment from INVALID USER on 2019-07-26 10:39:14 IST ---

Description of problem:
When glusterd is stopped on one node and performed volume set operation on only two volumes(one is upgraded and another volume is newly created.). Restarted glusterd on the node where it stopped on that node seeing "Couldn't find volinfo" errors and shd and brick process on the two volume where volume option is set are offline.


Version-Release number of selected component (if applicable):
mainline

How reproducible:
1/1

Steps to Reproduce:
1. Create 1000 volumes and start them
2. Performed in-service upgrade on the cluster. 
3. Create 1000 more volumes (1X3 replicate) and start them
4. Stop glusterd on node N1.
5. Perform volume set operation on two volumes from n2 'performance.readdir-ahead on'. One which is being used for upgrade and another newly created volume. 
6. Start glusterd on node N1.

Actual results:
glusterfsd and shd on these two volumes are offline.
seeing "couldn't find vol info" in glusterd logs. 

Expected results:
glusterfsd and shd should be online and should not see any errors with "Couldn't find volinfo" in glusterd log.

Additional info:

glusterd log snippet
#########################################
[2019-07-05 06:49:29.078858] E [glusterd-handshake.c:343:build_volfile_path] 0-management: Couldn't find volinfo
[2019-07-05 06:49:29.078892] E [MSGID: 106619] [glusterd-svc-helper.c:791:__glusterd_send_svc_configure_req] 0-management: Unable to stat  (No such file or directory)
[2019-07-05 06:49:30.079325] E [glusterd-handshake.c:343:build_volfile_path] 0-management: Couldn't find volinfo
[2019-07-05 06:49:31.079862] E [glusterd-handshake.c:343:build_volfile_path] 0-management: Couldn't find volinfo
[2019-07-05 06:49:32.080355] E [glusterd-handshake.c:343:build_volfile_path] 0-management: Couldn't find volinfo
[2019-07-05 06:49:33.080899] E [glusterd-handshake.c:343:build_volfile_path] 0-management: Couldn't find volinfo
[2019-07-05 06:49:34.081374] E [glusterd-handshake.c:343:build_volfile_path] 0-management: Couldn't find volinfo
The message "E [MSGID: 106619] [glusterd-svc-helper.c:791:__glusterd_send_svc_configure_req] 0-management: Unable to stat  (No such file or directory)" repeated 5 times between [2019-07-05 06:49:29.078892] and [2019-07-05 06:49:34.081406]
[2019-07-05 06:49:35.081508] W [MSGID: 106619] [glusterd-svc-helper.c:931:glusterd_attach_svc] 0-glusterd: attach failed for glustershd(volume=ronnie_-999)
[2019-07-05 06:49:35.081583] E [MSGID: 106048] [glusterd-shd-svc.c:477:glusterd_shdsvc_start] 0-glusterd: Failed to attach shd svc(volume=ronnie_-999) to pid=-1
[2019-07-05 06:49:44.928310] E [MSGID: 106558] [glusterd-volgen.c:3289:volgen_graph_build_clients] 0-glusterd: volume inconsistency: brick count is 0
[2019-07-05 06:49:44.928395] E [MSGID: 106068] [glusterd-shd-svc.c:191:glusterd_shdsvc_create_volfile] 0-management: Failed to create volfile
[2019-07-05 06:49:44.941696] I [glusterd-utils.c:6560:glusterd_brick_start] 0-management: discovered already-running brick /bricks/brick9/testvol_-999
[2019-07-05 06:49:44.941768] I [MSGID: 106142] [glusterd-pmap.c:290:pmap_registry_bind] 0-pmap: adding brick /bricks/brick9/testvol_-999 on port 49155
[2019-07-05 06:49:44.941824] I [rpc-clnt.c:1014:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2019-07-05 06:49:44.945117] I [MSGID: 106620] [glusterd-svc-helper.c:900:glusterd_attach_svc] 0-glusterd: adding svc glustershd (volume=testvol_-999) to existing process with pid 590
[2019-07-05 06:49:44.945285] E [glusterd-handshake.c:343:build_volfile_path] 0-management: Couldn't find volinfo
[2019-07-05 06:49:44.945316] E [MSGID: 106619] [glusterd-svc-helper.c:791:__glusterd_send_svc_configure_req] 0-management: Unable to stat  (No such file or directory)
[2019-07-05 06:49:45.018727] I [socket.c:3837:socket_submit_outgoing_msg] 0-socket.management: not connected (priv->connected = -1)
[2019-07-05 06:49:45.018748] E [rpcsvc.c:1583:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x2, Program: GlusterD svc cli, ProgVers: 2, Proc: 27) to rpc-transport (socket.management)
[2019-07-05 06:49:45.018761] E [MSGID: 106430] [glusterd-utils.c:600:glusterd_submit_reply] 0-glusterd: Reply submission failed
[2019-07-05 06:49:45.023858] W [MSGID: 106056] [glusterd-snapshot-utils.c:376:glusterd_snap_volinfo_find_by_volume_id] 0-management: Snap volume not found
[2019-07-05 06:49:45.945716] E [glusterd-handshake.c:343:build_volfile_path] 0-management: Couldn't find volinfo
[2019-07-05 06:49:45.945750] E [MSGID: 106619] [glusterd-svc-helper.c:791:__glusterd_send_svc_configure_req] 0-management: Unable to stat  (No such file or directory)
[2019-07-05 06:49:46.447404] E [glusterd-handshake.c:343:build_volfile_path] 0-management: Couldn't find volinfo
[2019-07-05 06:49:46.946292] E [glusterd-handshake.c:343:build_volfile_path] 0-management: Couldn't find volinfo
[2019-07-05 06:49:47.946837] E [glusterd-handshake.c:343:build_volfile_path] 0-management: Couldn't find volinfo
[2019-07-05 06:49:48.947321] E [glusterd-handshake.c:343:build_volfile_path] 0-management: Couldn't find volinfo
[2019-07-05 06:49:49.947840] E [glusterd-handshake.c:343:build_volfile_path] 0-management: Couldn't find volinfo
[2019-07-05 06:49:50.948339] E [glusterd-handshake.c:343:build_volfile_path] 0-management: Couldn't find volinfo
[2019-07-05 06:49:51.948895] E [glusterd-handshake.c:343:build_volfile_path] 0-management: Couldn't find volinfo
[2019-07-05 06:49:52.949406] E [glusterd-handshake.c:343:build_volfile_path] 0-management: Couldn't find volinfo
[2019-07-05 06:49:53.950006] E [glusterd-handshake.c:343:build_volfile_path] 0-management: Couldn't find volinfo
[2019-07-05 06:49:54.950527] E [glusterd-handshake.c:343:build_volfile_path] 0-management: Couldn't find volinfo
[2019-07-05 06:49:55.951051] E [glusterd-handshake.c:343:build_volfile_path] 0-management: Couldn't find volinfo
[2019-07-05 06:49:56.951573] E [glusterd-handshake.c:343:build_volfile_path] 0-management: Couldn't find volinfo
[2019-07-05 06:49:57.952095] E [glusterd-handshake.c:343:build_volfile_path] 0-management: Couldn't find volinfo
[2019-07-05 06:49:58.952597] E [glusterd-handshake.c:343:build_volfile_path] 0-management: Couldn't find volinfo
The message "E [MSGID: 106619] [glusterd-svc-helper.c:791:__glusterd_send_svc_configure_req] 0-management: Unable to stat  (No such file or directory)" repeated 13 times between [2019-07-05 06:49:45.945750] and [2019-07-05 06:49:58.952676]
[2019-07-05 06:49:59.952756] W [MSGID: 106619] [glusterd-svc-helper.c:931:glusterd_attach_svc] 0-glusterd: attach failed for glustershd(volume=testvol_-999)
[2019-07-05 06:49:59.953137] E [MSGID: 106048] [glusterd-shd-svc.c:477:glusterd_shdsvc_start] 0-glusterd: Failed to attach shd svc(volume=testvol_-999) to pid=590

--- Additional comment from Worker Ant on 2019-07-26 10:44:31 IST ---

REVIEW: https://review.gluster.org/23042 (glusterd: stop stale bricks during handshaking in brick mux mode) posted (#3) for review on master by Atin Mukherjee

--- Additional comment from Worker Ant on 2019-08-26 11:57:50 IST ---

REVIEW: https://review.gluster.org/23042 (glusterd: stop stale bricks during handshaking in brick mux mode) merged (#6) on master by Atin Mukherjee

Comment 1 Worker Ant 2020-03-12 10:50:41 UTC
REVIEW: https://review.gluster.org/24220 (glusterd: stop stale bricks during handshaking in brick mux mode) posted (#1) for review on release-7 by Sanju Rakonde

Comment 2 Worker Ant 2020-03-16 08:26:13 UTC
REVIEW: https://review.gluster.org/24220 (glusterd: stop stale bricks during handshaking in brick mux mode) merged (#2) on release-7 by Rinku Kothiya


Note You need to log in before you can comment on or make changes to this bug.