Bug 1633669
Summary: | Gluster bricks fails frequently | |
---|---|---|---
Product: | [Community] GlusterFS | Reporter: | Jaime Dulzura <jaime.dulzura>
Component: | glusterd | Assignee: | bugs <bugs>
Status: | CLOSED DEFERRED | QA Contact: |
Severity: | high | Docs Contact: |
Priority: | unspecified | |
Version: | 4.1 | CC: | amukherj, bugs, jaime.dulzura, pasik
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2018-11-22 00:22:12 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Attachments:
- Brick Logs (attachment 1490765)
- Brick Logs (attachment 1490767)
- glusterd.log (attachment 1490768)
- Latest brick process down logs (attachment 1490778)
Description
Jaime Dulzura
2018-09-27 13:58:33 UTC
I forgot to mention: we are aiming to use the Gluster Native Client, but glusterfs seems to consume too much memory on the client side. We added NFS-Ganesha to avoid overloading the Gluster client, but the memory used by NFS-Ganesha keeps accumulating over time.

Hi Jaime, when you mention that "bricks fail frequently", do you mean that brick processes go down, or something else? From the bug description it is not yet clear to me what exact issue is being highlighted here, so we need a bit more detail along with the glusterd log, the brick log files, and the volume status output. Thanks, Atin

Hi Atin, I was referring to brick processes going down, and they never come back up automatically. To make a brick available again, I invoke the "gluster v start <volume> force" command or restart glusterd (see the command sketch below).

Installed packages for Gluster storage:

# rpm -qa | grep -E "gluster|ganesha"
glusterfs-libs-4.1.5-1.el7.x86_64
glusterfs-events-4.1.5-1.el7.x86_64
nfs-ganesha-2.6.3-1.el7.x86_64
glusterfs-cli-4.1.5-1.el7.x86_64
centos-release-gluster41-1.0-1.el7.centos.x86_64
tendrl-gluster-integration-1.6.3-10.el7.noarch
glusterfs-client-xlators-4.1.5-1.el7.x86_64
glusterfs-server-4.1.5-1.el7.x86_64
nfs-ganesha-xfs-2.6.3-1.el7.x86_64
glusterfs-coreutils-0.2.0-1.el7.x86_64
glusterfs-fuse-4.1.5-1.el7.x86_64
python2-gluster-4.1.5-1.el7.x86_64
glusterfs-api-4.1.5-1.el7.x86_64
glusterfs-4.1.5-1.el7.x86_64
glusterfs-extra-xlators-4.1.5-1.el7.x86_64
nfs-ganesha-gluster-2.6.3-1.el7.x86_64

# gluster v info

Volume Name: CL_Shared
Type: Replicate
Volume ID: ac1f0338-2af8-41b3-af61-7eb7f1c3696e
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: iahdvlgfsa001:/local/bricks/volume02/CL_Shared
Brick2: iahdvlgfsb001:/local/bricks/volume02/CL_Shared
Brick3: iahdvlgfsc001:/local/bricks/volume02/CL_Shared
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
auth.allow: iahdvlgfsc001,iahdvlgfsb001,iahdvlgfsa001,localhost

Volume Name: tibco
Type: Replicate
Volume ID: abc14a06-852d-46c2-8e70-a1f09136bc08
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: iahdvlgfsa001:/local/bricks/volume01/tibco
Brick2: iahdvlgfsb001:/local/bricks/volume01/tibco
Brick3: iahdvlgfsc001:/local/bricks/volume01/tibco
Options Reconfigured:
performance.stat-prefetch: on
performance.md-cache-timeout: 600
performance.cache-invalidation: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
auth.allow: 127.0.0.1,10.1.25.*,10.1.26.*,10.1.34.*
nfs.disable: on
diagnostics.latency-measurement: on
diagnostics.count-fop-hits: on
performance.strict-o-direct: on
performance.strict-write-ordering: on

How can I upload the brick logs?

Created attachment 1490765 [details]
Brick Logs
required brick logs for this bug report
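For reference, a minimal sketch of the manual recovery steps the reporter describes above, using the volume name from this report (adjust to your own volume):

# gluster v status CL_Shared        # confirm which brick shows Online = N
# gluster v start CL_Shared force   # respawn only the brick processes that are not running; data and volume config are untouched
# systemctl restart glusterd        # alternative: restarting glusterd on the affected node also respawns its local bricks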
Created attachment 1490767 [details]
Brick Logs
Required brick logs for this bug report
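For anyone reproducing this, a sketch of gathering the diagnostics requested above, assuming the default log locations of a standard GlusterFS install (paths may differ on a customized setup):

# gluster v status > volume-status.txt     # volume status output
# gluster v info > volume-info.txt         # volume configuration
# tar czf gluster-logs.tar.gz /var/log/glusterfs/glusterd.log /var/log/glusterfs/bricks/   # glusterd log plus all brick logs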
[root@iahdvlgfsa001 ~]# gluster v status
Status of volume: CL_Shared
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Brick iahdvlgfsa001:/local/bricks/volume02/
CL_Shared 49152 0 Y 4890
Brick iahdvlgfsb001:/local/bricks/volume02/
CL_Shared 49152 0 Y 1021
Brick iahdvlgfsc001:/local/bricks/volume02/
CL_Shared 49152 0 Y 20186
Self-heal Daemon on localhost N/A N/A Y 32017
Self-heal Daemon on iahdvlgfsc001 N/A N/A Y 20211
Self-heal Daemon on iahdvlgfsb001 N/A N/A Y 1068

Task Status of Volume CL_Shared
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: tibco
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Brick iahdvlgfsa001:/local/bricks/volume01/
tibco 49153 0 Y 4990
Brick iahdvlgfsb001:/local/bricks/volume01/
tibco 49153 0 Y 1750
Brick iahdvlgfsc001:/local/bricks/volume01/
tibco 49153 0 Y 6873
Self-heal Daemon on localhost N/A N/A Y 32017
Self-heal Daemon on iahdvlgfsc001 N/A N/A Y 20211
Self-heal Daemon on iahdvlgfsb001 N/A N/A Y 1068

Task Status of Volume tibco
------------------------------------------------------------------------------
There are no active volume tasks

Created attachment 1490768 [details]
glusterd.log
glusterd.log
Created attachment 1490778 [details]
Latest brick process down logs.
Status of failing brick:
[root@iahdvlgfsc001 cevaroot]# gluster v status CL_Shared
Status of volume: CL_Shared
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Brick iahdvlgfsa001:/local/bricks/volume02/
CL_Shared 49152 0 Y 4890
Brick iahdvlgfsb001:/local/bricks/volume02/
CL_Shared 49152 0 Y 1021
Brick iahdvlgfsc001:/local/bricks/volume02/
CL_Shared N/A N/A N N/A
Self-heal Daemon on localhost N/A N/A Y 20211
Self-heal Daemon on iahdvlgfsa001.logistics
.corp N/A N/A Y 32017
Self-heal Daemon on iahdvlgfsb001 N/A N/A Y 1068
Task Status of Volume CL_Shared
------------------------------------------------------------------------------
There are no active volume tasks
glusterd status:
[root@iahdvlgfsc001 cevaroot]# systemctl status glusterd -l
● glusterd.service - GlusterFS, a clustered file-system server
Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
Active: active (running) since Fri 2018-10-05 00:43:34 CDT; 2h 53min ago
Process: 1324 ExecStart=/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level $LOG_LEVEL $GLUSTERD_OPTIONS (code=exited, status=0/SUCCESS)
Main PID: 1332 (glusterd)
CGroup: /system.slice/glusterd.service
├─ 1332 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
├─ 6873 /usr/sbin/glusterfsd -s iahdvlgfsc001 --volfile-id tibco.iahdvlgfsc001.local-bricks-volume01-tibco -p /var/run/gluster/vols/tibco/iahdvlgfsc001-local-bricks-volume01-tibco.pid -S /var/run/gluster/843d10f6ac486e3e.socket --brick-name /local/bricks/volume01/tibco -l /var/log/glusterfs/bricks/local-bricks-volume01-tibco.log --xlator-option *-posix.glusterd-uuid=6af863cd-43f6-448e-936d-889766c1a655 --process-name brick --brick-port 49153 --xlator-option tibco-server.listen-port=49153
└─20211 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/8abfe66e3fb78dec.socket --xlator-option *replicate*.node-uuid=6af863cd-43f6-448e-936d-889766c1a655 --process-name glustershd
Oct 05 03:15:11 iahdvlgfsc001.logistics.corp local-bricks-volume02-CL_Shared[20186]: dlfcn 1
Oct 05 03:15:11 iahdvlgfsc001.logistics.corp local-bricks-volume02-CL_Shared[20186]: libpthread 1
Oct 05 03:15:11 iahdvlgfsc001.logistics.corp local-bricks-volume02-CL_Shared[20186]: llistxattr 1
Oct 05 03:15:11 iahdvlgfsc001.logistics.corp local-bricks-volume02-CL_Shared[20186]: setfsid 1
Oct 05 03:15:11 iahdvlgfsc001.logistics.corp local-bricks-volume02-CL_Shared[20186]: spinlock 1
Oct 05 03:15:11 iahdvlgfsc001.logistics.corp local-bricks-volume02-CL_Shared[20186]: epoll.h 1
Oct 05 03:15:11 iahdvlgfsc001.logistics.corp local-bricks-volume02-CL_Shared[20186]: xattr.h 1
Oct 05 03:15:11 iahdvlgfsc001.logistics.corp local-bricks-volume02-CL_Shared[20186]: st_atim.tv_nsec 1
Oct 05 03:15:11 iahdvlgfsc001.logistics.corp local-bricks-volume02-CL_Shared[20186]: package-string: glusterfs 4.1.5
Oct 05 03:15:11 iahdvlgfsc001.logistics.corp local-bricks-volume02-CL_Shared[20186]: ---------
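The journal lines above look like the tail of a glusterfsd crash dump (configuration summary and package-string footer). A hedged sketch for pulling the full crash signature from the failed brick's log, assuming the default brick log path derived from the brick path in this report:

# If the brick died on a signal, the log usually contains a "signal received:" line followed by a backtrace.
# grep -B5 -A40 "signal received" /var/log/glusterfs/bricks/local-bricks-volume02-CL_Shared.log | tail -60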
I forgot to mention that our initial setup included the "TENDRL" monitoring agent, installed with its default configuration per the installation instructions. We were happy with the monitoring, since it gives us a lot of the information we need. In our observation, there was a core dump indicating that the glusterfsd process had been killed for accessing restricted memory, and /var/log/messages showed the tendrl agent making aggressive, brute-force-like accesses to the shared volume. Based on that, we decided to reinstall everything and take TENDRL monitoring out of the equation. And, voilà, no bricks have failed for more than a month now. It may or may not be a bug, but after removing the tendrl agent the bricks never failed again. However, we are now facing the well-known issue from previous NFS-Ganesha releases with GlusterFS volume exports, where the OOM killer kills the ganesha daemon. I will raise a separate bug report describing how we determined that the ganesha process was being killed by the OOM killer.
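A sketch of how such an OOM kill can be confirmed from the kernel log, assuming a standard syslog/journald setup (any one of these should show which process the kernel killed and why):

# dmesg -T | grep -i -E "out of memory|oom"     # kernel OOM messages with human-readable timestamps
# grep -i -E "oom|out of memory" /var/log/messages   # syslog record of the kill
# journalctl -k | grep -i oom                   # same information via the systemd journal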