Bug 1475340
| Summary: | Glusterfs mount inside POD gets terminated when scaled with brick multiplexing enabled | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Shekhar Berry <shberry> |
| Component: | kubernetes | Assignee: | Humble Chirammal <hchiramm> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | krishnaram Karthick <kramdoss> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | rhgs-3.0 | CC: | amukherj, aos-bugs, aos-storage-staff, csaba, ekuric, hchiramm, jeder, jsafrane, madam, mpillai, nchilaka, pprakash, psuriset, rhs-bugs, rsussman, rtalur, shberry, storage-qa-internal |
| Target Milestone: | --- | | |
| Target Release: | CNS 3.6 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | aos-scalability-36 | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-01-03 10:22:19 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1477024 | | |
| Bug Blocks: | 1445448 | | |
Description (Shekhar Berry, 2017-07-26 13:13:11 UTC):
Comment 2 (Humble Chirammal):

Shekhar, is this issue only visible when you enable 'brick multiplex on' in a cluster?

Comment 3 (Jan Safranek):

OpenShift puts glusterfs mount logs in /var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/glusterfs; inspecting these on the faulty node (and publishing them) might be useful too.

Comment 4 (Shekhar Berry):

(In reply to Humble Chirammal from comment #2)
> Is this issue only visible when you enable 'brick multiplex on' in a cluster?

I have not seen this with brick multiplexing disabled as of now.

Comment 6 (Humble Chirammal):

(In reply to Shekhar Berry from comment #4)
> I have not seen this with brick multiplexing disabled as of now.

Thanks. Just to isolate: you are not able to mount and use this share manually on any of the nodes, is that right? Also, can you please capture the iptables rules that are active on the nodes?

Shekhar Berry:

(In reply to Jan Safranek from comment #3)
> OpenShift puts glusterfs mount logs to /var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/glusterfs

Here is the link to the mount logs of all PVCs from one of the faulty nodes:
http://perf1.perf.lab.eng.bos.redhat.com/pub/shberry/tranport_end_point/worker_mount_logs/

Shekhar Berry:

(In reply to Humble Chirammal from comment #6)
> Just to isolate, you are not able to mount and use this share manually in any of the nodes. Isnt it?

Yes, it is true that I am unable to mount the share locally on the glusterfs pod itself.
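Jan's suggestion of publishing the mount logs can be scripted. This is a small sketch, not a supported tool: the `list_gluster_mount_logs` helper name is made up here, and it assumes the per-volume logs under the plugin directory carry a `.log` suffix.

```shell
# List the per-volume glusterfs mount logs that OpenShift keeps under its
# glusterfs volume plugin directory (path from comment #3), so they can be
# collected and attached to the bug. Run on the faulty node.
list_gluster_mount_logs() {
    # $1: plugin directory to scan; missing/empty dirs yield no output
    find "$1" -type f -name '*.log' 2>/dev/null
}

# On a live node; "|| true" keeps the sketch harmless where the path is absent.
list_gluster_mount_logs /var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/glusterfs || true
```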
Shekhar Berry:

Here is the link to the iptables listing from one of the nodes:
http://perf1.perf.lab.eng.bos.redhat.com/pub/shberry/tranport_end_point/iptables_L

Here is the link to the brick log files mentioned in comment 13:
http://perf1.perf.lab.eng.bos.redhat.com/pub/shberry/tranport_end_point/cns_var_log/glusterfs/bricks/

Humble Chirammal:

I did a quick look at this setup and, if I understand correctly, there is no mount process running on the node: the fuse mount processes are gone somehow. Shekhar, was any node service restart or anything similar executed in this setup?

Comment 23 (Humble Chirammal):

On further checking, I noticed rpc errors in this setup, which is already a known bug in RHGS:

    var-lib-heketi-mounts-vg_84a07855b88ead2326fb1f557beac8fd-brick_c6b4373a2612f5edd579dacf926e4343-brick.log-20170723:[2017-07-20 12:18:43.335874] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully

    var-lib-heketi-mounts-vg_3b8548f910b60688eef1582f69a3fee6-brick_d79122071ae433b2b2ab336f4a287bf5-brick.log-20170723:[2017-07-20 12:18:40.233833] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully

As these errors were noticed when the issue happened, I am attaching this to the same. We are also recreating the setup once again.
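The rpc failures quoted above can be pulled out of the brick logs with a simple scan. The sketch below embeds one sample line reproduced from this comment so it runs stand-alone; on a real node you would point LOGDIR at the bricks log directory instead.

```shell
# Scan brick logs for the known "rpc actor failed" error and print the
# timestamp of each occurrence. LOGDIR is a throwaway directory seeded
# with a sample line from this bug; use /var/log/glusterfs/bricks on a node.
LOGDIR=$(mktemp -d)
cat > "$LOGDIR/brick.log" <<'EOF'
[2017-07-20 12:18:43.335874] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
EOF

# Split each matching line on '[' and ']' and print the bracketed timestamp.
grep -h 'rpcsvc_check_and_reply_error' "$LOGDIR"/*.log \
  | awk -F'[][]' '{print $2}'
```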
(In reply to Humble Chirammal from comment #23)
> As these errors are noticed when the issue happened I am attaching this to the same.
> Mentioned bug # BZ#1477024
> We are also recreating the setup once again.

Regardless of the root cause, there is one thing I would like to clarify here. The 'auto_unmount' option has been set on the volume mounts, and for some reason the fuse processes vanished in this setup; the mounts (as listed by the mount command), however, were intact on these servers. I was under the impression that 'auto_unmount' would take down the affected mounts if there is a failure in the original mount process. It could also be that the monitoring process was taken down or failed when the issue happened. Either way, I would like to make sure the unmount happens when there is a failure in the original mount process.

csaba, can you please take a look at this scenario and share your thoughts? The logs are available up to comment #21.

Comment 40 (Atin Mukherjee):

(In reply to Humble Chirammal from comment #39)
> I was in an impression that, 'auto_unmount' would take down the subjected mounts if there is a failure in original mount process.

AFAIK, Shekhar was running 3.2 client bits, where the auto_unmount feature is not in place. How is this relevant then?

Comment 41 (Humble Chirammal):

(In reply to Atin Mukherjee from comment #40)
> AFAIK, Shekhar was running 3.2 client bits where auto_unmount feature is not in place.

How did you verify that auto_unmount is not in place? I can clearly see that auto_unmount is in place, and the glusterfs version in this setup is a package which has that support.

Comment 43 (Atin Mukherjee):

(In reply to Humble Chirammal from comment #41)
> How did you verify the auto_unmount is not in place ?

Shekhar mentioned the version details to me earlier, and I'm not sure that what you are looking at now is the same setup. Shekhar might have upgraded it?

Comment 44 (Humble Chirammal):

(In reply to Atin Mukherjee from comment #43)
> Shekhar might have upgraded it?

As far as I can tell, there has been no upgrade yet; it has been running the same version from the start. I was looking at the logs he mentioned earlier. Shekhar can confirm, though.

Shekhar Berry:

(In reply to Humble Chirammal from comment #44)
> Afaict, no upgrade yet and from start its running with same version.

No upgrade has been done yet. The environment is exactly the same as it was when the issue occurred. I just checked my setup again with Atin: my client version is RHGS 3.2 Async, which has the auto_unmount patch. Setting needinfo back on Csaba to check.

(In reply to Humble Chirammal from comment #39)
> However I would like to clarify/make sure the "unmount" happens when there is a failure in the original mount process.

I have seen in my local testing (as shared in comment 28, point 4) that if we SIGKILL a mount process, the mount point entries remain but the process is gone. Setting needinfo on Csaba to comment further here.

Shekhar Berry:

I hit the issue again even with SELinux disabled while trying to scale with brick multiplex enabled. As a next step, I upgraded my OCP to the latest bits and also upgraded the RHGS client from 3.2 to 3.3.

    oc version
    oc v3.6.173.0.5
    kubernetes v1.6.1+5115d708d7
    features: Basic-Auth GSSAPI Kerberos SPNEGO

    Server https://gprfs013.sbu.lab.eng.bos.redhat.com:8443
    openshift v3.6.173.0.5
    kubernetes v1.6.1+5115d708d7

    rpm -qa | grep gluster
    glusterfs-client-xlators-3.8.4-34.el7rhgs.x86_64
    glusterfs-3.8.4-34.el7rhgs.x86_64
    glusterfs-fuse-3.8.4-34.el7rhgs.x86_64
    glusterfs-libs-3.8.4-34.el7rhgs.x86_64

After upgrading I have not hit the issue again.
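The symptom discussed in this thread, a glusterfs entry still in the mount table while the fuse client process has vanished, can be checked for with a small script. This is a sketch of the check, not a supported tool; it assumes the mount point appears on the glusterfs client's command line, which is the case for mounts created via mount.glusterfs.

```shell
# Report glusterfs fuse mounts whose client process has disappeared,
# i.e. the state seen on the faulty nodes in this bug.
find_stale_gluster_mounts() {
    # $1: a file in /proc/mounts format (pass /proc/mounts on a live node)
    awk '$3 == "fuse.glusterfs" {print $2}' "$1" | while read -r mnt; do
        # pgrep -f looks for a glusterfs client whose command line
        # carries this mount point; no match means a stale mount.
        if ! pgrep -f "glusterfs.*$mnt" >/dev/null 2>&1; then
            echo "stale mount (no fuse process): $mnt"
        fi
    done
}

# On a live node (prints nothing when every fuse mount has its process):
find_stale_gluster_mounts /proc/mounts
```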
I scaled up to 1000 volumes with brick multiplex enabled and ran concurrent I/O on all 1000 volumes, but no gluster mount failure was seen. The latest setup has been running for 72 hours without any gluster mount failing.

Humble Chirammal:

Thanks for the update, Shekhar! Comment #47 still needs to be addressed, so we will have to open another bug on "auto_unmount". However, as the issue reported here is different and has not been seen with the latest build, I am moving this bug to ON_QA for now.

QA verification:

Verified this bug with cns-deploy-5.0.0-38.el7rhgs.x86_64. The issue reported in this bug is not seen. Moving the bug to VERIFIED.

    [root@dhcp46-207 ~]# oc rsh mongodb-92-1-vrz6l
    sh-4.2# df -h
    Filesystem                                         Size  Used Avail Use% Mounted on
    /dev/mapper/docker-8:17-341-1d5a28c57b66145fe303ab86e321673136b7bc0e6d062f3109c6284258f6db98   10G  598M  9.4G   6% /
    tmpfs                                               24G     0   24G   0% /dev
    tmpfs                                               24G     0   24G   0% /sys/fs/cgroup
    /dev/sdb1                                           40G  1.7G   39G   5% /etc/hosts
    shm                                                 64M     0   64M   0% /dev/shm
    10.70.46.193:vol_69eaf705b69b91d1aa5ba816e14b2c14  1016M  236M  780M  24% /var/lib/mongodb/data
    tmpfs                                               24G   16K   24G   1% /run/secrets/kubernetes.io/serviceaccount
    sh-4.2# uptime
     16:14:01 up 1 day, 7:46, 0 users, load average: 36.13, 16.40, 9.06
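The verification transcript above boils down to checking that the pod still sees its gluster-backed PV in df. That check can be scripted as below; the sample df line is taken from the verification output, and in a live check the echo would be replaced by `oc rsh <pod> df -h`.

```shell
# Confirm a pod's gluster volume is still mounted by looking for a
# fuse mount of the form <host>:<volname> in its df output.
df_line='10.70.46.193:vol_69eaf705b69b91d1aa5ba816e14b2c14 1016M 236M 780M 24% /var/lib/mongodb/data'

# Match a "<ip>:vol_<id>" filesystem in column 1 and report its mount point.
echo "$df_line" | awk '$1 ~ /^[0-9.]+:vol_/ {print "gluster PV mounted at " $6}'
# -> gluster PV mounted at /var/lib/mongodb/data
```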