Description of problem:
In a CNS 3.5 setup with volumes scaled up to 200, the gluster pod does not come up after a node reboot on one of the 3 nodes.

-------------
# oc get nodes
NAME                                STATUS                     AGE
dhcp46-143.lab.eng.blr.redhat.com   Ready                      8h
dhcp46-145.lab.eng.blr.redhat.com   Ready,SchedulingDisabled   8h
dhcp46-188.lab.eng.blr.redhat.com   Ready                      8h
dhcp46-52.lab.eng.blr.redhat.com    Ready                      8h

# oc get pods -o wide
NAME                             READY   STATUS             RESTARTS   AGE   IP             NODE
glusterfs-2gq69                  1/1     Running            0          5h    10.70.46.52    dhcp46-52.lab.eng.blr.redhat.com
glusterfs-lsxj8                  0/1     CrashLoopBackOff   15         5h    10.70.46.143   dhcp46-143.lab.eng.blr.redhat.com
glusterfs-lznfs                  1/1     Running            0          5h    10.70.46.188   dhcp46-188.lab.eng.blr.redhat.com
heketi-1-rb99d                   1/1     Running            0          5h    10.129.0.13    dhcp46-52.lab.eng.blr.redhat.com
storage-project-router-1-r70ph   1/1     Running            0          5h    10.70.46.52    dhcp46-52.lab.eng.blr.redhat.com
-------------

Version-Release number of selected component (if applicable):
heketi-client-4.0.0-6.el7rhgs.x86_64
cns-deploy-4.0.0-13.el7rhgs.x86_64
openshift v3.5.5.5
kubernetes v1.5.2+43a9be4
rhgs3/rhgs-server-rhel7:3.2.0-4
rhgs3/rhgs-volmanager-rhel7:3.2.0-6

How reproducible:
Seen in 2 different scale setups

Steps to Reproduce:
1. Create a 3-node CNS 3.5 cluster (memory allocated for each of the worker nodes: 94 GB).
2. Create around 200 volumes using dynamic provisioning (a sketch of the objects involved follows this report).
3. Ensure all 200 volumes are created successfully, the 3 RHGS nodes are in Ready status, and the corresponding gluster pods are in Running status.
4. Reboot one of the three nodes and wait for the node to come back to Ready status.

Actual results:
The gluster pod does not come back properly after the node reboot. In this case, pod "glusterfs-lsxj8" on 10.70.46.143.

Expected results:
A node reboot should spawn all the containers hosted by the node successfully.
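For step 2, the volumes were requested via dynamic provisioning. The following is only a minimal sketch of the objects involved; the StorageClass name, the heketi resturl, and the claim name are illustrative placeholders, not values from this setup (on OpenShift 3.5 / Kubernetes 1.5 the beta StorageClass API and annotation apply):

apiVersion: storage.k8s.io/v1beta1
kind: StorageClass
metadata:
  name: glusterfs-storage                                    # placeholder name
provisioner: kubernetes.io/glusterfs
parameters:
  resturl: "http://heketi-storage-project.example.com"       # placeholder heketi route
  restauthenabled: "false"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: claim-001                                            # one of ~200 claims
  annotations:
    volume.beta.kubernetes.io/storage-class: glusterfs-storage
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi

The ~200 claims can then be generated in a loop, for example:
for i in $(seq -w 1 200); do sed "s/claim-001/claim-$i/" pvc.yaml | oc create -f -; done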
Additional info:
##############
# oc describe pod glusterfs-lsxj8
Name:            glusterfs-lsxj8
Namespace:       storage-project
Security Policy: privileged
Node:            dhcp46-143.lab.eng.blr.redhat.com/10.70.46.143
Start Time:      Tue, 11 Apr 2017 17:31:41 +0530
Labels:          glusterfs-node=pod
Status:          Running
IP:              10.70.46.143
Controllers:     DaemonSet/glusterfs
Containers:
  glusterfs:
    Container ID:   docker://0de9132e658a264f3e113b231f0b52231642e4a529c0855ece5e8434b1c1ee64
    Image:          rhgs3/rhgs-server-rhel7:3.2.0-4
    Image ID:       docker-pullable://brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhgs3/rhgs-server-rhel7@sha256:8f1a0acb03061b829c36d764bc7fc1a66b24993f6da738d6d36df9a91fb69667
    Port:
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Tue, 11 Apr 2017 23:04:00 +0530
      Finished:     Tue, 11 Apr 2017 23:06:47 +0530
    Ready:          False
    Restart Count:  17
    Liveness:       exec [/bin/bash -c systemctl status glusterd.service] delay=100s timeout=3s period=10s #success=1 #failure=3
    Readiness:      exec [/bin/bash -c systemctl status glusterd.service] delay=100s timeout=3s period=10s #success=1 #failure=3
    Volume Mounts:
      /dev from glusterfs-dev (rw)
      /etc/glusterfs from glusterfs-etc (rw)
      /etc/ssl from glusterfs-ssl (ro)
      /run from glusterfs-run (rw)
      /run/lvm from glusterfs-lvm (rw)
      /sys/fs/cgroup from glusterfs-cgroup (ro)
      /var/lib/glusterd from glusterfs-config (rw)
      /var/lib/heketi from glusterfs-heketi (rw)
      /var/lib/misc/glusterfsd from glusterfs-misc (rw)
      /var/log/glusterfs from glusterfs-logs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-p5nb5 (ro)
    Environment Variables:  <none>
Conditions:
  Type           Status
  Initialized    True
  Ready          False
  PodScheduled   True
Volumes:
  glusterfs-heketi:
    Type:  HostPath (bare host directory volume)
    Path:  /var/lib/heketi
  glusterfs-run:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  glusterfs-lvm:
    Type:  HostPath (bare host directory volume)
    Path:  /run/lvm
  glusterfs-etc:
    Type:  HostPath (bare host directory volume)
    Path:  /etc/glusterfs
  glusterfs-logs:
    Type:  HostPath (bare host directory volume)
    Path:  /var/log/glusterfs
  glusterfs-config:
    Type:  HostPath (bare host directory volume)
    Path:  /var/lib/glusterd
  glusterfs-dev:
    Type:  HostPath (bare host directory volume)
    Path:  /dev
  glusterfs-misc:
    Type:  HostPath (bare host directory volume)
    Path:  /var/lib/misc/glusterfsd
  glusterfs-cgroup:
    Type:  HostPath (bare host directory volume)
    Path:  /sys/fs/cgroup
  glusterfs-ssl:
    Type:  HostPath (bare host directory volume)
    Path:  /etc/ssl
  default-token-p5nb5:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-p5nb5
QoS Class:    BestEffort
Tolerations:  <none>
Events:
  FirstSeen  LastSeen  Count  From                                          SubObjectPath               Type     Reason     Message
  ---------  --------  -----  ----                                          -------------               -------- ------     -------
  1h         1h        1      {kubelet dhcp46-143.lab.eng.blr.redhat.com}   spec.containers{glusterfs}  Normal   Created    Created container with docker id 41ea5bcd2a0c; Security:[seccomp=unconfined]
  1h         1h        1      {kubelet dhcp46-143.lab.eng.blr.redhat.com}   spec.containers{glusterfs}  Normal   Started    Started container with docker id 41ea5bcd2a0c
  1h         1h        1      {kubelet dhcp46-143.lab.eng.blr.redhat.com}   spec.containers{glusterfs}  Warning  Unhealthy  Liveness probe failed: ● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
   Active: activating (start) since Tue 2017-04-11 16:24:58 UTC; 37ms ago
  Control: 1789 (glusterd)
   CGroup: /system.slice/docker-41ea5bcd2a0cbe08057dc72cef01d12d93683b92da321521d540706fa8caca7f.scope/system.slice/glusterd.service
           └─1789 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO

Apr 11 16:24:58 dhcp46-143.lab.eng.blr.redhat.com systemd[1]: Starting GlusterFS, a clustered file-system server...
  1h         1h        1      {kubelet dhcp46-143.lab.eng.blr.redhat.com}   spec.containers{glusterfs}  Warning  Unhealthy  Readiness probe failed: ● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
   Active: activating (start) since Tue 2017-04-11 16:24:58 UTC; 9s ago
  Control: 1789 (glusterd)
   CGroup: /system.slice/docker-41ea5bcd2a0cbe08057dc72cef01d12d93683b92da321521d540706fa8caca7f.scope/system.slice/glusterd.service
           ├─1789 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
           └─1790 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO

Apr 11 16:24:58 dhcp46-143.lab.eng.blr.redhat.com systemd[1]: Starting GlusterFS, a clustered file-system server...
  1h         1h        1      {kubelet dhcp46-143.lab.eng.blr.redhat.com}   spec.containers{glusterfs}  Warning  Unhealthy  Liveness probe failed: ● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
   Active: activating (start) since Tue 2017-04-11 16:24:58 UTC; 10s ago
  Control: 1789 (glusterd)
   CGroup: /system.slice/docker-41ea5bcd2a0cbe08057dc72cef01d12d93683b92da321521d540706fa8caca7f.scope/system.slice/glusterd.service
           ├─1789 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
           └─1790 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO

Apr 11 16:24:58 dhcp46-143.lab.eng.blr.redhat.com systemd[1]: Starting GlusterFS, a clustered file-system server...
  1h         1h        1      {kubelet dhcp46-143.lab.eng.blr.redhat.com}   spec.containers{glusterfs}  Normal   Killing    Killing container with docker id 41ea5bcd2a0c: pod "glusterfs-lsxj8_storage-project(a02adb3b-1eae-11e7-a794-005056b38171)" container "glusterfs" is unhealthy, it will be killed and re-created.
  1h         1h        1      {kubelet dhcp46-143.lab.eng.blr.redhat.com}   spec.containers{glusterfs}  Normal   Created    Created container with docker id 1945a250a3bd; Security:[seccomp=unconfined]
  1h         1h        1      {kubelet dhcp46-143.lab.eng.blr.redhat.com}   spec.containers{glusterfs}  Normal   Started    Started container with docker id 1945a250a3bd
  1h         1h        1      {kubelet dhcp46-143.lab.eng.blr.redhat.com}   spec.containers{glusterfs}  Warning  Unhealthy  Readiness probe failed: ● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
   Active: activating (start) since Tue 2017-04-11 16:28:00 UTC; 7s ago
  Control: 1794 (glusterd)
   CGroup: /system.slice/docker-1945a250a3bdf05f247483b67b0f3c7f9a6afc78ef72983bdb86acb333f5084c.scope/system.slice/glusterd.service
           ├─1794 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
           └─1795 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO

Apr 11 16:28:00 dhcp46-143.lab.eng.blr.redhat.com systemd[1]: Starting GlusterFS, a clustered file-system server...
  1h         1h        1      {kubelet dhcp46-143.lab.eng.blr.redhat.com}   spec.containers{glusterfs}  Normal   Killing    Killing container with docker id 1945a250a3bd: pod "glusterfs-lsxj8_storage-project(a02adb3b-1eae-11e7-a794-005056b38171)" container "glusterfs" is unhealthy, it will be killed and re-created.
  1h         1h        1      {kubelet dhcp46-143.lab.eng.blr.redhat.com}   spec.containers{glusterfs}  Normal   Created    Created container with docker id 058064fa673e; Security:[seccomp=unconfined]
  1h         1h        1      {kubelet dhcp46-143.lab.eng.blr.redhat.com}   spec.containers{glusterfs}  Normal   Started    Started container with docker id 058064fa673e
  1h         1h        1      {kubelet dhcp46-143.lab.eng.blr.redhat.com}   spec.containers{glusterfs}  Warning  Unhealthy  Readiness probe failed: ● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
   Active: activating (start) since Tue 2017-04-11 16:30:41 UTC; 6s ago
  Control: 1787 (glusterd)
   CGroup: /system.slice/docker-058064fa673e5c2dc6401051bf2f6469a4417b87fc667f0987862e5f5003432d.scope/system.slice/glusterd.service
           ├─1787 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
           └─1788 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO

Apr 11 16:30:41 dhcp46-143.lab.eng.blr.redhat.com systemd[1]: Starting GlusterFS, a clustered file-system server...
  1h         1h        1      {kubelet dhcp46-143.lab.eng.blr.redhat.com}   spec.containers{glusterfs}  Normal   Killing    Killing container with docker id 058064fa673e: pod "glusterfs-lsxj8_storage-project(a02adb3b-1eae-11e7-a794-005056b38171)" container "glusterfs" is unhealthy, it will be killed and re-created.
  1h         1h        1      {kubelet dhcp46-143.lab.eng.blr.redhat.com}   spec.containers{glusterfs}  Normal   Created    Created container with docker id be30a0bce92e; Security:[seccomp=unconfined]
  1h         1h        1      {kubelet dhcp46-143.lab.eng.blr.redhat.com}   spec.containers{glusterfs}  Normal   Started    Started container with docker id be30a0bce92e
  1h         1h        1      {kubelet dhcp46-143.lab.eng.blr.redhat.com}   spec.containers{glusterfs}  Warning  Unhealthy  Readiness probe failed: ● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
   Active: activating (start) since Tue 2017-04-11 16:33:34 UTC; 3s ago
  Control: 1787 (glusterd)
   CGroup: /system.slice/docker-be30a0bce92e3e2efc5db49e89698351c2d4cf55fed81f745617c8ac66ff0252.scope/system.slice/glusterd.service
           ├─1787 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
           └─1788 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO

Apr 11 16:33:34 dhcp46-143.lab.eng.blr.redhat.com systemd[1]: Starting GlusterFS, a clustered file-system server...
  1h         1h        1      {kubelet dhcp46-143.lab.eng.blr.redhat.com}   spec.containers{glusterfs}  Normal   Killing    Killing container with docker id be30a0bce92e: pod "glusterfs-lsxj8_storage-project(a02adb3b-1eae-11e7-a794-005056b38171)" container "glusterfs" is unhealthy, it will be killed and re-created.
  1h         1h        1      {kubelet dhcp46-143.lab.eng.blr.redhat.com}   spec.containers{glusterfs}  Normal   Created    Created container with docker id 6439bf201663; Security:[seccomp=unconfined]
  1h         1h        1      {kubelet dhcp46-143.lab.eng.blr.redhat.com}   spec.containers{glusterfs}  Normal   Started    Started container with docker id 6439bf201663
  1h         1h        1      {kubelet dhcp46-143.lab.eng.blr.redhat.com}   spec.containers{glusterfs}  Normal   Killing    Killing container with docker id 6439bf201663: pod "glusterfs-lsxj8_storage-project(a02adb3b-1eae-11e7-a794-005056b38171)" container "glusterfs" is unhealthy, it will be killed and re-created.
  1h         1h        1      {kubelet dhcp46-143.lab.eng.blr.redhat.com}   spec.containers{glusterfs}  Normal   Created    Created container with docker id f2a9da90f0d1; Security:[seccomp=unconfined]
  1h         1h        1      {kubelet dhcp46-143.lab.eng.blr.redhat.com}   spec.containers{glusterfs}  Normal   Started    Started container with docker id f2a9da90f0d1
  59m        59m       1      {kubelet dhcp46-143.lab.eng.blr.redhat.com}   spec.containers{glusterfs}  Normal   Killing    Killing container with docker id f2a9da90f0d1: pod "glusterfs-lsxj8_storage-project(a02adb3b-1eae-11e7-a794-005056b38171)" container "glusterfs" is unhealthy, it will be killed and re-created.
  59m        59m       1      {kubelet dhcp46-143.lab.eng.blr.redhat.com}   spec.containers{glusterfs}  Normal   Created    Created container with docker id 34bbd3e340c6; Security:[seccomp=unconfined]
  59m        59m       1      {kubelet dhcp46-143.lab.eng.blr.redhat.com}   spec.containers{glusterfs}  Normal   Started    Started container with docker id 34bbd3e340c6
  56m        56m       1      {kubelet dhcp46-143.lab.eng.blr.redhat.com}   spec.containers{glusterfs}  Normal   Killing    Killing container with docker id 34bbd3e340c6: pod "glusterfs-lsxj8_storage-project(a02adb3b-1eae-11e7-a794-005056b38171)" container "glusterfs" is unhealthy, it will be killed and re-created.
  50m        50m       1      {kubelet dhcp46-143.lab.eng.blr.redhat.com}   spec.containers{glusterfs}  Normal   Created    Created container with docker id 21930e5b77d1; Security:[seccomp=unconfined]
  50m        50m       1      {kubelet dhcp46-143.lab.eng.blr.redhat.com}   spec.containers{glusterfs}  Normal   Started    Started container with docker id 21930e5b77d1
  48m        48m       1      {kubelet dhcp46-143.lab.eng.blr.redhat.com}   spec.containers{glusterfs}  Normal   Killing    Killing container with docker id 21930e5b77d1: pod "glusterfs-lsxj8_storage-project(a02adb3b-1eae-11e7-a794-005056b38171)" container "glusterfs" is unhealthy, it will be killed and re-created.
  48m        48m       1      {kubelet dhcp46-143.lab.eng.blr.redhat.com}   spec.containers{glusterfs}  Normal   Created    Created container with docker id 7213bc989772; Security:[seccomp=unconfined]
  48m        48m       1      {kubelet dhcp46-143.lab.eng.blr.redhat.com}   spec.containers{glusterfs}  Normal   Started    Started container with docker id 7213bc989772
  45m        45m       1      {kubelet dhcp46-143.lab.eng.blr.redhat.com}   spec.containers{glusterfs}  Normal   Killing    Killing container with docker id 7213bc989772: pod "glusterfs-lsxj8_storage-project(a02adb3b-1eae-11e7-a794-005056b38171)" container "glusterfs" is unhealthy, it will be killed and re-created.
  1h         6m        16     {kubelet dhcp46-143.lab.eng.blr.redhat.com}   spec.containers{glusterfs}  Warning  Unhealthy  Liveness probe failed: ● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
   Active: inactive (dead)
  1h         5m        37     {kubelet dhcp46-143.lab.eng.blr.redhat.com}   spec.containers{glusterfs}  Warning  Unhealthy  Readiness probe failed: rpc error: code = 13 desc = invalid header field value "oci runtime error: exec failed: container_linux.go:247: starting container process caused \"process_linux.go:83: executing setns process caused \\\"exit status 16\\\"\"\n"
  1h         5m        17     {kubelet dhcp46-143.lab.eng.blr.redhat.com}   spec.containers{glusterfs}  Normal   Pulled     Container image "rhgs3/rhgs-server-rhel7:3.2.0-4" already present on machine
  40m        5m        8      {kubelet dhcp46-143.lab.eng.blr.redhat.com}   spec.containers{glusterfs}  Normal   Created    (events with common reason combined)
  40m        5m        8      {kubelet dhcp46-143.lab.eng.blr.redhat.com}   spec.containers{glusterfs}  Normal   Started    (events with common reason combined)
  1h         3m        20     {kubelet dhcp46-143.lab.eng.blr.redhat.com}   spec.containers{glusterfs}  Warning  Unhealthy  Readiness probe failed: ● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
   Active: inactive (dead)
  1h         2m        20     {kubelet dhcp46-143.lab.eng.blr.redhat.com}   spec.containers{glusterfs}  Warning  Unhealthy  (events with common reason combined)
  37m        2m        8      {kubelet dhcp46-143.lab.eng.blr.redhat.com}   spec.containers{glusterfs}  Normal   Killing    (events with common reason combined)
  56m        13s       134    {kubelet dhcp46-143.lab.eng.blr.redhat.com}   spec.containers{glusterfs}  Warning  BackOff    Back-off restarting failed docker container
  56m        13s       134    {kubelet dhcp46-143.lab.eng.blr.redhat.com}                               Warning  FailedSync Error syncing pod, skipping: failed to "StartContainer" for "glusterfs" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=glusterfs pod=glusterfs-lsxj8_storage-project(a02adb3b-1eae-11e7-a794-005056b38171)"
####################
docker version:
# rpm -qa |grep docker
atomic-openshift-docker-excluder-3.5.5.5-1.git.0.f2e87ab.el7.noarch
cockpit-docker-135-4.el7.x86_64
docker-client-1.12.6-16.el7.x86_64
docker-common-1.12.6-16.el7.x86_64
docker-rhel-push-plugin-1.12.6-16.el7.x86_64
docker-1.12.6-16.el7.x86_64
Atin and I debugged the issue a little more with the help of Prasanth's setup. To isolate glusterd, I increased the timeouts at which the glusterd process health check happens. Currently the gluster pods are given 100 seconds (this value was decided for 100 volumes in the last release, and covers the time for devicemapper to load on the host, the LVs to come up, and the mounts of the LVs to complete successfully in the pod) before the glusterd process state is checked. When I increased the timeout, ran docker exec into the pod directly, and checked the glusterd process after some time, glusterd was running successfully.

Prasanth and I noted the time glusterd took to come up for the number of volumes in the scale testing:

for 100 volumes, 100 seconds.
for 200 volumes, 160 seconds.
for 300 volumes, 220 seconds.
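For reference, this is roughly the kind of manual check described above. The commands are standard docker/oc usage; the polling loop is only illustrative, and the pod name is the one from this setup:

# On the rebooted node: find the gluster container and check glusterd inside it.
docker ps | grep rhgs-server
docker exec -ti <container-id> systemctl status glusterd.service

# Or, from the master: poll until glusterd reports active inside the pod.
until oc exec glusterfs-lsxj8 -n storage-project -- systemctl is-active glusterd.service; do
    sleep 10
done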
(In reply to Mohamed Ashiq from comment #5)
> Atin and I debugged the issue a little more with the help of Prasanth's
> setup. To isolate glusterd, I increased the timeouts at which the glusterd
> process health check happens. Currently the gluster pods are given 100
> seconds (this value was decided for 100 volumes in the last release, and
> covers the time for devicemapper to load on the host, the LVs to come up,
> and the mounts of the LVs to complete successfully in the pod) before the
> glusterd process state is checked. When I increased the timeout, ran docker
> exec into the pod directly, and checked the glusterd process after some
> time, glusterd was running successfully.
>
> Prasanth and I noted the time glusterd took to come up for the number of
> volumes in the scale testing:
>
> for 100 volumes, 100 seconds.
> for 200 volumes, 160 seconds.
> for 300 volumes, 220 seconds.

The above numbers are exact. We can document them in the guide like:

100 volumes, X RAM and 100 sec timeout required.
200 volumes, Y RAM and 200 sec timeout required.
300 volumes, Z RAM and 250 sec timeout required.

Or the gluster template change is required:

readinessProbe:
  timeoutSeconds: 3
  initialDelaySeconds: 100
  exec:
    command:
    - "/bin/bash"
    - "-c"
    - systemctl status glusterd.service
  ####
  This should be increased from 10 to 20. It is the delay from one check to the next.
  ####
  periodSeconds: 10
  successThreshold: 1
  ####
  Increase from 3 to 10. This is the number of retries in case of failure.
  ####
  failureThreshold: 10

This change requires a rebuild of cns-deploy, and it removes the need to change the timeout per setup, as the probe will keep checking for a longer time.
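For context on what the proposed values buy (the arithmetic here is ours, based on how the kubelet evaluates probes, and is only approximate): the container would get roughly

  initialDelaySeconds + (periodSeconds x failureThreshold) = 100 + (20 x 10) = 300 seconds

before the liveness probe marks it unhealthy and restarts it, which comfortably covers the 160-220 second glusterd startup times measured in comment #5.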
(In reply to Mohamed Ashiq from comment #6)
> (In reply to Mohamed Ashiq from comment #5)
> > Atin and I debugged the issue a little more with the help of Prasanth's
> > setup. To isolate glusterd, I increased the timeouts at which the glusterd
> > process health check happens. Currently the gluster pods are given 100
> > seconds (this value was decided for 100 volumes in the last release, and
> > covers the time for devicemapper to load on the host, the LVs to come up,
> > and the mounts of the LVs to complete successfully in the pod) before the
> > glusterd process state is checked. When I increased the timeout, ran
> > docker exec into the pod directly, and checked the glusterd process after
> > some time, glusterd was running successfully.
> >
> > Prasanth and I noted the time glusterd took to come up for the number of
> > volumes in the scale testing:
> >
> > for 100 volumes, 100 seconds.
> > for 200 volumes, 160 seconds.
> > for 300 volumes, 220 seconds.
>
> The above numbers are exact. We can document them in the guide like:
>
> 100 volumes, X RAM and 100 sec timeout required.
> 200 volumes, Y RAM and 200 sec timeout required.
> 300 volumes, Z RAM and 250 sec timeout required.
>
> Or the gluster template change is required:
>
> readinessProbe:
>   timeoutSeconds: 3
>   initialDelaySeconds: 100
>   exec:
>     command:
>     - "/bin/bash"
>     - "-c"
>     - systemctl status glusterd.service
>   ####
>   This should be increased from 10 to 20. It is the delay from one check to
>   the next.
>   ####
>   periodSeconds: 10
>   successThreshold: 1
>   ####
>   Increase from 3 to 10. This is the number of retries in case of failure.
>   ####
>   failureThreshold: 10
>
> This change requires a rebuild of cns-deploy, and it removes the need to
> change the timeout per setup, as the probe will keep checking for a longer
> time.

Ideally, adjusting failureThreshold and periodSeconds should not hurt the user experience for deployments with only a small number of volumes, since the extra checks only come into play after the first attempt fails. We could also think about making these values configurable in the template, so that based on the scale testing we perform on different CNS releases an admin could adjust them; a sketch of that idea follows below.
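Purely as an illustration of the configurability idea, such knobs could look roughly like this in the OpenShift template. The parameter names are hypothetical (not from the actual cns-deploy template), and the ${{ }} form assumes the template API accepts non-string substitution for these integer fields:

parameters:
- name: PROBE_PERIOD_SECONDS                 # hypothetical template parameter
  displayName: Seconds between glusterd health checks
  value: "20"
- name: PROBE_FAILURE_THRESHOLD              # hypothetical template parameter
  displayName: Failed checks tolerated before restart
  value: "10"
...
        readinessProbe:
          timeoutSeconds: 3
          initialDelaySeconds: 100
          exec:
            command:
            - "/bin/bash"
            - "-c"
            - systemctl status glusterd.service
          periodSeconds: ${{PROBE_PERIOD_SECONDS}}
          successThreshold: 1
          failureThreshold: ${{PROBE_FAILURE_THRESHOLD}}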
Prasanth and I verified the initial time to start glusterd in a pod with no volumes. It was 26 seconds, to be exact (nothing for devicemapper to load, as there is no custom fstab). So we are sticking to:

40 seconds initial delay, since a heketidbstorage volume will be present by default
25 seconds period
15 failure threshold

This gives (25 x 15) + 40 = 415 seconds (~7 minutes) before the pod is failed in a failure case, which will be good in case our scaling increases to more volumes (values sketched below).
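In template form, the values described above would look roughly like the following. This is only a sketch of the settings, not the literal diff that was merged, and the liveness probe is assumed to use the same values as the readiness probe:

livenessProbe:
  timeoutSeconds: 3
  initialDelaySeconds: 40      # glusterd alone needs ~26s; 40s leaves room for heketidbstorage
  exec:
    command:
    - "/bin/bash"
    - "-c"
    - systemctl status glusterd.service
  periodSeconds: 25            # delay between consecutive checks
  successThreshold: 1
  failureThreshold: 15         # (25 x 15) + 40 = 415s before the container is restarted
readinessProbe:
  timeoutSeconds: 3
  initialDelaySeconds: 40
  exec:
    command:
    - "/bin/bash"
    - "-c"
    - systemctl status glusterd.service
  periodSeconds: 25
  successThreshold: 1
  failureThreshold: 15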
Patch has been merged upstream :: https://github.com/gluster/gluster-kubernetes/pull/252
Verified as fixed in cns-deploy-4.0.0-15.el7rhgs. The gluster pod now comes up properly after a node reboot at 200+ volumes, and even at 300 volumes.

################
[root@dhcp46-53 ~]# docker ps |grep rhgs
a1946ae9c809        rhgs3/rhgs-server-rhel7:3.2.0-4   "/usr/sbin/init"   26 minutes ago   Up 25 minutes   k8s_glusterfs.d9b1a406_glusterfs-q2r8s_storage-project_7d5fa3fb-2340-11e7-941a-005056b35fb4_00c4a624

[root@dhcp46-53 ~]# docker exec -ti a1946ae9c809 /bin/bash

[root@dhcp46-53 /]# cat /etc/redhat-storage-release
Red Hat Gluster Storage Server 3.2 (Container)

[root@dhcp46-53 /]# free -g
              total        used        free      shared  buff/cache   available
Mem:             47          20          18           0           8          25
Swap:            26           0          26

[root@dhcp46-53 /]# systemctl status glusterd
● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2017-04-17 13:33:35 UTC; 19min ago
  Process: 3032 ExecStart=/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level $LOG_LEVEL $GLUSTERD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 3033 (glusterd)
   CGroup: /system.slice/docker-a1946ae9c8092e72ccaa100c5c8379734ef7f34084763667904a1091f756c12e.scope/system.slice/glusterd.service
           ├─3033 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
           ├─3438 /usr/sbin/glusterfsd -s 10.70.46.53 --volfile-id heketidbstorage.10.70.46.53.var-lib-heketi-mounts-vg_614ad25373b8566fc359a0252c565b9c-brick_158dc6f90077f232fd7b57b0655...
           ├─3450 /usr/sbin/glusterfsd -s 10.70.46.53 --volfile-id vol_002d9693aefe67960c60bd6aee01bd2a.10.70.46.53.var-lib-heketi-mounts-
           <snip>
           ...
           .....
           </snip>
           └─5888 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/glu...

Apr 17 13:32:59 dhcp46-53.lab.eng.blr.redhat.com systemd[1]: Starting GlusterFS, a clustered file-system server...
Apr 17 13:33:35 dhcp46-53.lab.eng.blr.redhat.com systemd[1]: Started GlusterFS, a clustered file-system server.
################
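One quick way to confirm on a running cluster that the rebuilt template actually carries the relaxed probe settings (the DaemonSet name "glusterfs" and the glusterfs-node=pod label match the pod description earlier in this bug; the grep pattern is just illustrative):

# Dump the probe settings from the deployed DaemonSet and check pod restart counts
oc get ds glusterfs -n storage-project -o yaml | grep -A3 -E 'livenessProbe|readinessProbe|periodSeconds|failureThreshold'
oc get pods -n storage-project -o wide --selector=glusterfs-node=pod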
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:1112