Created attachment 1582578 [details]
Samba client log

Description of problem:
The Gluster VFS client logs repeatedly print:
'E [MSGID: 108006] [afr-common.c:5413:__afr_handle_child_down_event] 0-mcv01-replicate-8: All subvolumes are down. Going offline until at least one of them comes back up.'

According to 'gluster vol status', all bricks are online, and there are no obvious errors in the glusterd logs. The errors go back months, to when a storage array failed and all of its bricks had to be removed and replaced. The volume is a distributed-replicated (x2) type. A few of the brick logs contain occasional RPC errors. Other than the logs being flooded with this message, the cluster seems to be operating normally; iperf tests between the nodes do not flag any issues, with 0 TCP retries on all links.

Version-Release number of selected component (if applicable):
Gluster 4.1.8
Samba 4.9.4
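Additional info:
A rough way to cross-check the client side (sketch only; the volume and node names are the ones from this report, and output will differ per setup) is to see whether the smbd/gfapi process actually holds TCP connections to the brick ports, and which clients the bricks themselves report:

# On a node running Samba: list TCP connections held by smbd (brick ports are 49152 and up)
ss -tnp | grep smbd

# From any node in the trusted pool: which clients each brick of mcv01 reports as connected
gluster volume status mcv01 clients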
Created attachment 1582579 [details] Brick log
Could you paste the gluster v status output? Are all bricks running fine? The error logs indicate that the client is unable to connect to the bricks. Have you checked whether any firewall settings are blocking your client from connecting to the bricks?
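For example, something along these lines run from the client node would confirm basic reachability (a sketch only; the hostname and port are placeholders taken from a brick entry in the status output, adjust for your setup):

# Probe one of the brick ports reported by 'gluster volume status'
nc -zv mcn01 49174

# Check for local firewall rules that might drop traffic to the brick port range (49152 and up)
iptables -L -n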
Hi Atin,

Here's the output you requested. I believe the issue dates back to when we had a lot of bricks (50%) which needed replacing and were then synced with 'gluster volume sync'; it seems something from that is still being remembered.

Status of volume: ctdbv01
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick mcn01:/mnt/ctdb/data                  49157     0          Y       6118
Brick mcn02:/mnt/ctdb/data                  49157     0          Y       3294
Brick mcn03:/mnt/ctdb/data                  49181     0          Y       49199
Brick mcn04:/mnt/ctdb/data                  49181     0          Y       194533
Self-heal Daemon on localhost               N/A       N/A        Y       70714
Self-heal Daemon on mcn04                   N/A       N/A        Y       146680
Self-heal Daemon on mcn03                   N/A       N/A        Y       141658
Self-heal Daemon on mcn01                   N/A       N/A        Y       147639

Task Status of Volume ctdbv01
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: ctdbv02
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick mcn01:/mnt/ctdb/dmz_data              49158     0          Y       6126
Brick mcn02:/mnt/ctdb/dmz_data              49158     0          Y       3302
Brick mcn03:/mnt/ctdb/dmz_data              49182     0          Y       49570
Brick mcn04:/mnt/ctdb/dmz_data              49182     0          Y       194825
Self-heal Daemon on localhost               N/A       N/A        Y       70714
Self-heal Daemon on mcn04                   N/A       N/A        Y       146680
Self-heal Daemon on mcn03                   N/A       N/A        Y       141658
Self-heal Daemon on mcn01                   N/A       N/A        Y       147639

Task Status of Volume ctdbv02
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: dmzv01
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick mcn01:/mnt/h1a/dmz_data               49179     0          Y       7495
Brick mcn03:/mnt/h1b/dmz_data               49183     0          Y       54973
Brick mcn01:/mnt/h2a/dmz_data               49180     0          Y       7614
Brick mcn03:/mnt/h2b/dmz_data               49184     0          Y       54996
Brick mcn01:/mnt/h3a/dmz_data               49181     0          Y       7803
Brick mcn03:/mnt/h3b/dmz_data               49185     0          Y       55019
Brick mcn01:/mnt/h4a/dmz_data               49182     0          Y       7936
Brick mcn03:/mnt/h4b/dmz_data               49186     0          Y       55042
Brick mcn01:/mnt/h5a/dmz_data               49183     0          Y       8169
Brick mcn03:/mnt/h5b/dmz_data               49187     0          Y       55065
Brick mcn02:/mnt/h6a/dmz_data               49159     0          Y       3312
Brick mcn04:/mnt/h6b/dmz_data               49179     0          Y       2924
Brick mcn02:/mnt/h7a/dmz_data               49160     0          Y       3321
Brick mcn04:/mnt/h7b/dmz_data               49180     0          Y       2947
Brick mcn02:/mnt/h8a/dmz_data               49161     0          Y       3329
Brick mcn04:/mnt/h8b/dmz_data               49183     0          Y       2970
Brick mcn02:/mnt/h9a/dmz_data               49162     0          Y       3338
Brick mcn04:/mnt/h9b/dmz_data               49184     0          Y       2998
Brick mcn02:/mnt/h10a/dmz_data              49163     0          Y       3346
Brick mcn04:/mnt/h10b/dmz_data              49185     0          Y       3024
Self-heal Daemon on localhost               N/A       N/A        Y       70714
Quota Daemon on localhost                   N/A       N/A        Y       70730
Self-heal Daemon on mcn03                   N/A       N/A        Y       141658
Quota Daemon on mcn03                       N/A       N/A        Y       141670
Self-heal Daemon on mcn04                   N/A       N/A        Y       146680
Quota Daemon on mcn04                       N/A       N/A        Y       146691
Self-heal Daemon on mcn01                   N/A       N/A        Y       147639
Quota Daemon on mcn01                       N/A       N/A        Y       147654

Task Status of Volume dmzv01
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: mcv01
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick mcn01:/mnt/h1a/mcv01_data             49174     0          Y       6690
Brick mcn03:/mnt/h1b/mcv01_data             49157     0          Y       54637
Brick mcn01:/mnt/h2a/mcv01_data             49175     0          Y       6842
Brick mcn03:/mnt/h2b/mcv01_data             49158     0          Y       54663
Brick mcn01:/mnt/h3a/mcv01_data             49176     0          Y       6936
Brick mcn03:/mnt/h3b/mcv01_data             49159     0          Y       54689
Brick mcn01:/mnt/h4a/mcv01_data             49177     0          Y       7153
Brick mcn03:/mnt/h4b/mcv01_data             49160     0          Y       54715
Brick mcn01:/mnt/h5a/mcv01_data             49178     0          Y       7303
Brick mcn03:/mnt/h5b/mcv01_data             49161     0          Y       54741
Brick mcn02:/mnt/h6a/mcv01_data             49164     0          Y       3358
Brick mcn04:/mnt/h6b/mcv01_data             49154     0          Y       119700
Brick mcn02:/mnt/h7a/mcv01_data             49165     0          Y       3366
Brick mcn04:/mnt/h7b/mcv01_data             49152     0          Y       165703
Brick mcn02:/mnt/h8a/mcv01_data             49169     0          Y       3608
Brick mcn04:/mnt/h8b/mcv01_data             49153     0          Y       37527
Brick mcn02:/mnt/h9a/mcv01_data             49170     0          Y       3616
Brick mcn04:/mnt/h9b/mcv01_data             49160     0          Y       2746
Brick mcn02:/mnt/h10a/mcv01_data            49171     0          Y       3624
Brick mcn04:/mnt/h10b/mcv01_data            49161     0          Y       2773
Self-heal Daemon on localhost               N/A       N/A        Y       70714
Quota Daemon on localhost                   N/A       N/A        Y       70730
Self-heal Daemon on mcn03                   N/A       N/A        Y       141658
Quota Daemon on mcn03                       N/A       N/A        Y       141670
Self-heal Daemon on mcn04                   N/A       N/A        Y       146680
Quota Daemon on mcn04                       N/A       N/A        Y       146691
Self-heal Daemon on mcn01                   N/A       N/A        Y       147639
Quota Daemon on mcn01                       N/A       N/A        Y       147654

Task Status of Volume mcv01
------------------------------------------------------------------------------
Task                 : Rebalance
ID                   : a4cabb46-efb1-4909-b059-c115d811b519
Status               : completed

Best,
Ryan
@Karthik, can you please look into this?
Hi all,

Is there anything I can do to progress this?

Best,
Ryan
Hi Karthik,

I'm moving this bug to afr as we don't have any issue from glusterd here. If anything is needed from the glusterd side, feel free to reach out.

Thanks,
Sanju
Hi,

Sorry for the delay. Can you please answer/provide the following?

- Is this problem seen only with volume "mcv01"?
- The logs show that bricks on nodes "mcn01" & "mcn03" are down. There are other volumes with bricks on these nodes; are they not flooding the client logs with similar messages?
- Are you able to do IO from this client without any errors?
- Were the bricks which needed replacing on these nodes, and are they the first 10 bricks of volume mcv01?
- Provide the output of "gluster volume info mcv01".
- Provide the statedump of the client processes which are showing these messages. (https://docs.gluster.org/en/latest/Troubleshooting/statedump/#generate-a-statedump — see the note after this list for a VFS/gfapi example.)
- Give the client vol file which will be present inside "/var/lib/glusterd/vols/<volume-name>/".

Regards,
Karthik
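P.S. For the VFS/gfapi case, a statedump of the client can be triggered from one of the server nodes roughly like this (a sketch only; the PID is a hypothetical placeholder for the smbd process serving the share):

# Find the smbd process that serves the gluster-backed share (PID shown is a placeholder)
pgrep -a smbd

# Trigger a statedump of that gfapi client from any node in the trusted storage pool
gluster volume statedump mcv01 client mcn01:<smbd-pid>

# The dump should land on the node running smbd, typically under /var/run/gluster/
ls /var/run/gluster/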
Since there has been no update on this bug for almost 3 months and the 4.1 branch is EOL, I am closing this bug for now. If this issue is still seen on any of the maintained branches, please feel free to re-open this or file a new bug with the gluster logs and all the information requested in comment #7.
Hi @ksubrahm,

Sorry, I didn't see a notification from this bug regarding the update. We're still seeing this issue with Gluster 8.5. Please find the requested information below:

- Is this problem seen only with volume "mcv01"?

mcv01 is the only volume that we have exported via Samba; the other volume is just for CTDB. We've seen this on other clusters that have 2x2 replicated volumes though.

- Logs are showing that bricks on nodes "mcn01" & "mcn03" are down. There are other volumes with bricks on these nodes. Are they not flooding the client logs with similar messages?

The logs say this, but as far as I can tell, all bricks in the volume are healthy and online. I'm not seeing these messages in the FUSE client logs. We have mcv01 mounted on all nodes via FUSE, however clients never access the volume that way, as we use the VFS module.

- Are you able to do IO from this client without any errors?

Yes, the system is fully functional; we can read from and write to the volume as expected. Replication also looks to be working fine, as there are no pending heals etc.

- Whether the bricks which needed replacing were on these nodes and are they the first 10 bricks of volume mcv01?

It seems that all replicated subvolumes on all nodes are affected:

[2022-02-07 10:28:15.890242] E [MSGID: 108006] [afr-common.c:6071:__afr_handle_child_down_event] 0-mcv01-replicate-0: All subvolumes are down. Going offline until at least one of them comes back up.
[2022-02-07 10:28:15.890549] E [MSGID: 108006] [afr-common.c:6071:__afr_handle_child_down_event] 0-mcv01-replicate-1: All subvolumes are down. Going offline until at least one of them comes back up.
[2022-02-07 10:28:15.890771] E [MSGID: 108006] [afr-common.c:6071:__afr_handle_child_down_event] 0-mcv01-replicate-3: All subvolumes are down. Going offline until at least one of them comes back up.
[2022-02-07 10:28:15.890933] E [MSGID: 108006] [afr-common.c:6071:__afr_handle_child_down_event] 0-mcv01-replicate-4: All subvolumes are down. Going offline until at least one of them comes back up.
[2022-02-07 10:28:15.891049] E [MSGID: 108006] [afr-common.c:6071:__afr_handle_child_down_event] 0-mcv01-replicate-2: All subvolumes are down. Going offline until at least one of them comes back up.
[2022-02-07 10:28:15.891225] E [MSGID: 108006] [afr-common.c:6071:__afr_handle_child_down_event] 0-mcv01-replicate-5: All subvolumes are down. Going offline until at least one of them comes back up.
[2022-02-07 10:28:15.891355] E [MSGID: 108006] [afr-common.c:6071:__afr_handle_child_down_event] 0-mcv01-replicate-6: All subvolumes are down. Going offline until at least one of them comes back up.
[2022-02-07 10:28:15.891528] E [MSGID: 108006] [afr-common.c:6071:__afr_handle_child_down_event] 0-mcv01-replicate-7: All subvolumes are down. Going offline until at least one of them comes back up.
[2022-02-07 10:28:15.891655] E [MSGID: 108006] [afr-common.c:6071:__afr_handle_child_down_event] 0-mcv01-replicate-8: All subvolumes are down. Going offline until at least one of them comes back up.
[2022-02-07 10:28:15.891788] E [MSGID: 108006] [afr-common.c:6071:__afr_handle_child_down_event] 0-mcv01-replicate-9: All subvolumes are down. Going offline until at least one of them comes back up.

- Provide the output of "gluster volume info mcv01".

Please find this uploaded to the ticket with the filename mcv01_info_07022022.txt.

- Provide the statedump of the client processes which are showing these messages. (https://docs.gluster.org/en/latest/Troubleshooting/statedump/#generate-a-statedump)

Could you advise how to get a statedump from a VFS client? I tried the usual way, but the dump was not being generated.

- Give the client vol file which will be present inside "/var/lib/glusterd/vols/<volume-name>/"

There were quite a few client volfiles. I've compressed them and attached them to this ticket with the filename mcv01_client_volfiles_07022022.zip.

I did have one thought: could the 'auth.allow: 172.30.30.*' setting be at fault here? We use it to prevent non-cluster nodes from mounting the volume, but I'm wondering if we need to include 127.0.0.1 and other local loopback addresses, as the Samba server is installed on all Gluster nodes, so the connection may be coming over a local loopback IP rather than the backend network IP.
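If that loopback theory turns out to be correct, widening auth.allow would look roughly like this (a sketch only; the exact address list to allow is an assumption based on the setup described above):

# Check the current setting
gluster volume get mcv01 auth.allow

# Hypothetical change: allow loopback alongside the backend subnet
gluster volume set mcv01 auth.allow "172.30.30.*,127.0.0.1"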
Created attachment 1859553 [details] MCV01 volume info
Created attachment 1859554 [details] MCV01 client volfiles
Hi Ryan,

Thank you for reaching out on this bug. We moved from Bugzilla to GitHub issues some time back. Glusterfs-8 is EOL and the maintained branches are Glusterfs-9 & 10 [1]. Please upgrade to a maintained version and, if you still face the same issue, file a new issue on GitHub [2] with all the necessary details. (Reverting the bug status back to closed.)

[1] https://www.gluster.org/release-schedule/
[2] https://github.com/gluster/glusterfs/issues/new

Regards,
Karthik