Bug 1441351 - [Scale Testing] With volumes scaled to 200, gluster pod not coming up post reboot of a node
Summary: [Scale Testing] With volumes scaled to 200, gluster pod not coming up post reboot of a node
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: cns-deploy-tool
Version: cns-3.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: CNS 3.5
Assignee: Mohamed Ashiq
QA Contact: Prasanth
URL:
Whiteboard:
Depends On:
Blocks: 1415600
 
Reported: 2017-04-11 18:06 UTC by Prasanth
Modified: 2019-02-13 08:59 UTC
CC: 12 users

Fixed In Version: cns-deploy-4.0.0-15.el7rhgs
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-04-20 18:29:28 UTC
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Red Hat Bugzilla 1395656 unspecified CLOSED One of the node in 3 node CNS system fails to respin gluster container after reboot 2020-10-14 00:28:05 UTC
Red Hat Product Errata RHEA-2017:1112 normal SHIPPED_LIVE cns-deploy-tool bug fix and enhancement update 2017-04-20 22:25:47 UTC

Internal Links: 1395656

Description Prasanth 2017-04-11 18:06:38 UTC
Description of problem:

In a CNS 3.5 setup with volumes scaled up to 200, the gluster pod does not come back up after a reboot of one of the 3 nodes.

-------------
# oc get nodes
NAME                                STATUS                     AGE
dhcp46-143.lab.eng.blr.redhat.com   Ready                      8h
dhcp46-145.lab.eng.blr.redhat.com   Ready,SchedulingDisabled   8h
dhcp46-188.lab.eng.blr.redhat.com   Ready                      8h
dhcp46-52.lab.eng.blr.redhat.com    Ready                      8h


# oc get pods -o wide
NAME                             READY     STATUS             RESTARTS   AGE       IP             NODE
glusterfs-2gq69                  1/1       Running            0          5h        10.70.46.52    dhcp46-52.lab.eng.blr.redhat.com
glusterfs-lsxj8                  0/1       CrashLoopBackOff   15         5h        10.70.46.143   dhcp46-143.lab.eng.blr.redhat.com
glusterfs-lznfs                  1/1       Running            0          5h        10.70.46.188   dhcp46-188.lab.eng.blr.redhat.com
heketi-1-rb99d                   1/1       Running            0          5h        10.129.0.13    dhcp46-52.lab.eng.blr.redhat.com
storage-project-router-1-r70ph   1/1       Running            0          5h        10.70.46.52    dhcp46-52.lab.eng.blr.redhat.com
-------------

Version-Release number of selected component (if applicable):
heketi-client-4.0.0-6.el7rhgs.x86_64
cns-deploy-4.0.0-13.el7rhgs.x86_64

openshift v3.5.5.5
kubernetes v1.5.2+43a9be4

rhgs3/rhgs-server-rhel7:3.2.0-4
rhgs3/rhgs-volmanager-rhel7:3.2.0-6


How reproducible: Seen in 2 different scale setups


Steps to Reproduce:
1. Create a 3-node CNS 3.5 cluster (memory allocated to each of the worker nodes: 94 GB)
2. Create around 200 volumes using dynamic provisioning (see the sketch after this list)
3. Ensure that all 200 volumes are created successfully, the 3 RHGS nodes are in Ready status, and the corresponding gluster pods are in Running status
4. Now reboot one of the three nodes and wait for it to come back to Ready status
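For illustration, step 2 is typically driven by a glusterfs StorageClass pointing at the heketi REST endpoint, plus one PVC per volume. This is a minimal sketch for OpenShift 3.5 / Kubernetes 1.5; the resturl, secret names, and sizes are placeholders, not the values from this setup:

---
kind: StorageClass
apiVersion: storage.k8s.io/v1beta1
metadata:
  name: glusterfs-storage
provisioner: kubernetes.io/glusterfs
parameters:
  # heketi endpoint and credentials below are placeholders for illustration
  resturl: "http://heketi-storage-project.cloudapps.example.com"
  restuser: "admin"
  secretNamespace: "storage-project"
  secretName: "heketi-secret"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: scale-claim-001
  annotations:
    # on Kubernetes 1.5 the storage class is selected via this annotation
    volume.beta.kubernetes.io/storage-class: glusterfs-storage
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 1Gi

Repeating the PVC with different names (scale-claim-001 through scale-claim-200) yields the 200 dynamically provisioned volumes used in the test.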

Actual results: The gluster pod does not come back up properly after the node reboot; in this case, pod "glusterfs-lsxj8" on 10.70.46.143.


Expected results: A node reboot should successfully respawn all the containers hosted on that node.


Additional info:

##############
# oc describe pod glusterfs-lsxj8
Name:                   glusterfs-lsxj8
Namespace:              storage-project
Security Policy:        privileged
Node:                   dhcp46-143.lab.eng.blr.redhat.com/10.70.46.143
Start Time:             Tue, 11 Apr 2017 17:31:41 +0530
Labels:                 glusterfs-node=pod
Status:                 Running
IP:                     10.70.46.143
Controllers:            DaemonSet/glusterfs
Containers:
  glusterfs:
    Container ID:       docker://0de9132e658a264f3e113b231f0b52231642e4a529c0855ece5e8434b1c1ee64
    Image:              rhgs3/rhgs-server-rhel7:3.2.0-4
    Image ID:           docker-pullable://brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhgs3/rhgs-server-rhel7@sha256:8f1a0acb03061b829c36d764bc7fc1a66b24993f6da738d6d36df9a91fb69667
    Port:
    State:              Waiting
      Reason:           CrashLoopBackOff
    Last State:         Terminated
      Reason:           Error
      Exit Code:        137
      Started:          Tue, 11 Apr 2017 23:04:00 +0530
      Finished:         Tue, 11 Apr 2017 23:06:47 +0530
    Ready:              False
    Restart Count:      17
    Liveness:           exec [/bin/bash -c systemctl status glusterd.service] delay=100s timeout=3s period=10s #success=1 #failure=3
    Readiness:          exec [/bin/bash -c systemctl status glusterd.service] delay=100s timeout=3s period=10s #success=1 #failure=3
    Volume Mounts:
      /dev from glusterfs-dev (rw)
      /etc/glusterfs from glusterfs-etc (rw)
      /etc/ssl from glusterfs-ssl (ro)
      /run from glusterfs-run (rw)
      /run/lvm from glusterfs-lvm (rw)
      /sys/fs/cgroup from glusterfs-cgroup (ro)
      /var/lib/glusterd from glusterfs-config (rw)
      /var/lib/heketi from glusterfs-heketi (rw)
      /var/lib/misc/glusterfsd from glusterfs-misc (rw)
      /var/log/glusterfs from glusterfs-logs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-p5nb5 (ro)
    Environment Variables:      <none>
Conditions:
  Type          Status
  Initialized   True 
  Ready         False 
  PodScheduled  True 
Volumes:
  glusterfs-heketi:
    Type:       HostPath (bare host directory volume)
    Path:       /var/lib/heketi
  glusterfs-run:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  glusterfs-lvm:
    Type:       HostPath (bare host directory volume)
    Path:       /run/lvm
  glusterfs-etc:
    Type:       HostPath (bare host directory volume)
    Path:       /etc/glusterfs
  glusterfs-logs:
    Type:       HostPath (bare host directory volume)
    Path:       /var/log/glusterfs
  glusterfs-config:
    Type:       HostPath (bare host directory volume)
    Path:       /var/lib/glusterd
  glusterfs-dev:
    Type:       HostPath (bare host directory volume)
    Path:       /dev
  glusterfs-misc:
    Type:       HostPath (bare host directory volume)
    Path:       /var/lib/misc/glusterfsd
  glusterfs-cgroup:
    Type:       HostPath (bare host directory volume)
    Path:       /sys/fs/cgroup
  glusterfs-ssl:
    Type:       HostPath (bare host directory volume)
    Path:       /etc/ssl
  default-token-p5nb5:
    Type:       Secret (a volume populated by a Secret)
    SecretName: default-token-p5nb5
QoS Class:      BestEffort
Tolerations:    <none>
Events:
  FirstSeen     LastSeen        Count   From                                            SubObjectPath                   Type            Reason          Message
  ---------     --------        -----   ----                                            -------------                   --------        ------          -------
  1h            1h              1       {kubelet dhcp46-143.lab.eng.blr.redhat.com}     spec.containers{glusterfs}      Normal          Created         Created container with docker id 41ea5bcd2a0c; Security:[seccomp=unconfined]
  1h            1h              1       {kubelet dhcp46-143.lab.eng.blr.redhat.com}     spec.containers{glusterfs}      Normal          Started         Started container with docker id 41ea5bcd2a0c
  1h            1h              1       {kubelet dhcp46-143.lab.eng.blr.redhat.com}     spec.containers{glusterfs}      Warning         Unhealthy       Liveness probe failed: ● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
   Active: activating (start) since Tue 2017-04-11 16:24:58 UTC; 37ms ago
  Control: 1789 (glusterd)
   CGroup: /system.slice/docker-41ea5bcd2a0cbe08057dc72cef01d12d93683b92da321521d540706fa8caca7f.scope/system.slice/glusterd.service
           └─1789 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO

Apr 11 16:24:58 dhcp46-143.lab.eng.blr.redhat.com systemd[1]: Starting GlusterFS, a clustered file-system server...

  1h    1h      1       {kubelet dhcp46-143.lab.eng.blr.redhat.com}     spec.containers{glusterfs}      Warning Unhealthy       Readiness probe failed: ● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
   Active: activating (start) since Tue 2017-04-11 16:24:58 UTC; 9s ago
  Control: 1789 (glusterd)
   CGroup: /system.slice/docker-41ea5bcd2a0cbe08057dc72cef01d12d93683b92da321521d540706fa8caca7f.scope/system.slice/glusterd.service
           ├─1789 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
           └─1790 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO

Apr 11 16:24:58 dhcp46-143.lab.eng.blr.redhat.com systemd[1]: Starting GlusterFS, a clustered file-system server...

  1h    1h      1       {kubelet dhcp46-143.lab.eng.blr.redhat.com}     spec.containers{glusterfs}      Warning Unhealthy       Liveness probe failed: ● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
   Active: activating (start) since Tue 2017-04-11 16:24:58 UTC; 10s ago
  Control: 1789 (glusterd)
   CGroup: /system.slice/docker-41ea5bcd2a0cbe08057dc72cef01d12d93683b92da321521d540706fa8caca7f.scope/system.slice/glusterd.service
           ├─1789 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
           └─1790 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO

Apr 11 16:24:58 dhcp46-143.lab.eng.blr.redhat.com systemd[1]: Starting GlusterFS, a clustered file-system server...

  1h    1h      1       {kubelet dhcp46-143.lab.eng.blr.redhat.com}     spec.containers{glusterfs}      Normal  Killing         Killing container with docker id 41ea5bcd2a0c: pod "glusterfs-lsxj8_storage-project(a02adb3b-1eae-11e7-a794-005056b38171)" container "glusterfs" is unhealthy, it will be killed and re-created.
  1h    1h      1       {kubelet dhcp46-143.lab.eng.blr.redhat.com}     spec.containers{glusterfs}      Normal  Created         Created container with docker id 1945a250a3bd; Security:[seccomp=unconfined]
  1h    1h      1       {kubelet dhcp46-143.lab.eng.blr.redhat.com}     spec.containers{glusterfs}      Normal  Started         Started container with docker id 1945a250a3bd
  1h    1h      1       {kubelet dhcp46-143.lab.eng.blr.redhat.com}     spec.containers{glusterfs}      Warning Unhealthy       Readiness probe failed: ● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
   Active: activating (start) since Tue 2017-04-11 16:28:00 UTC; 7s ago
  Control: 1794 (glusterd)
   CGroup: /system.slice/docker-1945a250a3bdf05f247483b67b0f3c7f9a6afc78ef72983bdb86acb333f5084c.scope/system.slice/glusterd.service
           ├─1794 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
           └─1795 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO

Apr 11 16:28:00 dhcp46-143.lab.eng.blr.redhat.com systemd[1]: Starting GlusterFS, a clustered file-system server...

  1h    1h      1       {kubelet dhcp46-143.lab.eng.blr.redhat.com}     spec.containers{glusterfs}      Normal  Killing         Killing container with docker id 1945a250a3bd: pod "glusterfs-lsxj8_storage-project(a02adb3b-1eae-11e7-a794-005056b38171)" container "glusterfs" is unhealthy, it will be killed and re-created.
  1h    1h      1       {kubelet dhcp46-143.lab.eng.blr.redhat.com}     spec.containers{glusterfs}      Normal  Created         Created container with docker id 058064fa673e; Security:[seccomp=unconfined]
  1h    1h      1       {kubelet dhcp46-143.lab.eng.blr.redhat.com}     spec.containers{glusterfs}      Normal  Started         Started container with docker id 058064fa673e
  1h    1h      1       {kubelet dhcp46-143.lab.eng.blr.redhat.com}     spec.containers{glusterfs}      Warning Unhealthy       Readiness probe failed: ● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
   Active: activating (start) since Tue 2017-04-11 16:30:41 UTC; 6s ago
  Control: 1787 (glusterd)
   CGroup: /system.slice/docker-058064fa673e5c2dc6401051bf2f6469a4417b87fc667f0987862e5f5003432d.scope/system.slice/glusterd.service
           ├─1787 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
           └─1788 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO

Apr 11 16:30:41 dhcp46-143.lab.eng.blr.redhat.com systemd[1]: Starting GlusterFS, a clustered file-system server...

  1h    1h      1       {kubelet dhcp46-143.lab.eng.blr.redhat.com}     spec.containers{glusterfs}      Normal  Killing         Killing container with docker id 058064fa673e: pod "glusterfs-lsxj8_storage-project(a02adb3b-1eae-11e7-a794-005056b38171)" container "glusterfs" is unhealthy, it will be killed and re-created.
  1h    1h      1       {kubelet dhcp46-143.lab.eng.blr.redhat.com}     spec.containers{glusterfs}      Normal  Created         Created container with docker id be30a0bce92e; Security:[seccomp=unconfined]
  1h    1h      1       {kubelet dhcp46-143.lab.eng.blr.redhat.com}     spec.containers{glusterfs}      Normal  Started         Started container with docker id be30a0bce92e
  1h    1h      1       {kubelet dhcp46-143.lab.eng.blr.redhat.com}     spec.containers{glusterfs}      Warning Unhealthy       Readiness probe failed: ● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
   Active: activating (start) since Tue 2017-04-11 16:33:34 UTC; 3s ago
  Control: 1787 (glusterd)
   CGroup: /system.slice/docker-be30a0bce92e3e2efc5db49e89698351c2d4cf55fed81f745617c8ac66ff0252.scope/system.slice/glusterd.service
           ├─1787 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
           └─1788 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO

Apr 11 16:33:34 dhcp46-143.lab.eng.blr.redhat.com systemd[1]: Starting GlusterFS, a clustered file-system server...

  1h    1h      1       {kubelet dhcp46-143.lab.eng.blr.redhat.com}     spec.containers{glusterfs}      Normal  Killing         Killing container with docker id be30a0bce92e: pod "glusterfs-lsxj8_storage-project(a02adb3b-1eae-11e7-a794-005056b38171)" container "glusterfs" is unhealthy, it will be killed and re-created.
  1h    1h      1       {kubelet dhcp46-143.lab.eng.blr.redhat.com}     spec.containers{glusterfs}      Normal  Created         Created container with docker id 6439bf201663; Security:[seccomp=unconfined]
  1h    1h      1       {kubelet dhcp46-143.lab.eng.blr.redhat.com}     spec.containers{glusterfs}      Normal  Started         Started container with docker id 6439bf201663
  1h    1h      1       {kubelet dhcp46-143.lab.eng.blr.redhat.com}     spec.containers{glusterfs}      Normal  Killing         Killing container with docker id 6439bf201663: pod "glusterfs-lsxj8_storage-project(a02adb3b-1eae-11e7-a794-005056b38171)" container "glusterfs" is unhealthy, it will be killed and re-created.
  1h    1h      1       {kubelet dhcp46-143.lab.eng.blr.redhat.com}     spec.containers{glusterfs}      Normal  Created         Created container with docker id f2a9da90f0d1; Security:[seccomp=unconfined]
  1h    1h      1       {kubelet dhcp46-143.lab.eng.blr.redhat.com}     spec.containers{glusterfs}      Normal  Started         Started container with docker id f2a9da90f0d1
  59m   59m     1       {kubelet dhcp46-143.lab.eng.blr.redhat.com}     spec.containers{glusterfs}      Normal  Killing         Killing container with docker id f2a9da90f0d1: pod "glusterfs-lsxj8_storage-project(a02adb3b-1eae-11e7-a794-005056b38171)" container "glusterfs" is unhealthy, it will be killed and re-created.
  59m   59m     1       {kubelet dhcp46-143.lab.eng.blr.redhat.com}     spec.containers{glusterfs}      Normal  Created         Created container with docker id 34bbd3e340c6; Security:[seccomp=unconfined]
  59m   59m     1       {kubelet dhcp46-143.lab.eng.blr.redhat.com}     spec.containers{glusterfs}      Normal  Started         Started container with docker id 34bbd3e340c6
  56m   56m     1       {kubelet dhcp46-143.lab.eng.blr.redhat.com}     spec.containers{glusterfs}      Normal  Killing         Killing container with docker id 34bbd3e340c6: pod "glusterfs-lsxj8_storage-project(a02adb3b-1eae-11e7-a794-005056b38171)" container "glusterfs" is unhealthy, it will be killed and re-created.
  50m   50m     1       {kubelet dhcp46-143.lab.eng.blr.redhat.com}     spec.containers{glusterfs}      Normal  Created         Created container with docker id 21930e5b77d1; Security:[seccomp=unconfined]
  50m   50m     1       {kubelet dhcp46-143.lab.eng.blr.redhat.com}     spec.containers{glusterfs}      Normal  Started         Started container with docker id 21930e5b77d1
  48m   48m     1       {kubelet dhcp46-143.lab.eng.blr.redhat.com}     spec.containers{glusterfs}      Normal  Killing         Killing container with docker id 21930e5b77d1: pod "glusterfs-lsxj8_storage-project(a02adb3b-1eae-11e7-a794-005056b38171)" container "glusterfs" is unhealthy, it will be killed and re-created.
  48m   48m     1       {kubelet dhcp46-143.lab.eng.blr.redhat.com}     spec.containers{glusterfs}      Normal  Created         Created container with docker id 7213bc989772; Security:[seccomp=unconfined]
  48m   48m     1       {kubelet dhcp46-143.lab.eng.blr.redhat.com}     spec.containers{glusterfs}      Normal  Started         Started container with docker id 7213bc989772
  45m   45m     1       {kubelet dhcp46-143.lab.eng.blr.redhat.com}     spec.containers{glusterfs}      Normal  Killing         Killing container with docker id 7213bc989772: pod "glusterfs-lsxj8_storage-project(a02adb3b-1eae-11e7-a794-005056b38171)" container "glusterfs" is unhealthy, it will be killed and re-created.
  1h    6m      16      {kubelet dhcp46-143.lab.eng.blr.redhat.com}     spec.containers{glusterfs}      Warning Unhealthy       Liveness probe failed: ● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
   Active: inactive (dead)

  1h    5m      37      {kubelet dhcp46-143.lab.eng.blr.redhat.com}     spec.containers{glusterfs}      Warning Unhealthy       Readiness probe failed: rpc error: code = 13 desc = invalid header field value "oci runtime error: exec failed: container_linux.go:247: starting container process caused \"process_linux.go:83: executing setns process caused \\\"exit status 16\\\"\"\n"

  1h    5m      17      {kubelet dhcp46-143.lab.eng.blr.redhat.com}     spec.containers{glusterfs}      Normal  Pulled          Container image "rhgs3/rhgs-server-rhel7:3.2.0-4" already present on machine
  40m   5m      8       {kubelet dhcp46-143.lab.eng.blr.redhat.com}     spec.containers{glusterfs}      Normal  Created         (events with common reason combined)
  40m   5m      8       {kubelet dhcp46-143.lab.eng.blr.redhat.com}     spec.containers{glusterfs}      Normal  Started         (events with common reason combined)
  1h    3m      20      {kubelet dhcp46-143.lab.eng.blr.redhat.com}     spec.containers{glusterfs}      Warning Unhealthy       Readiness probe failed: ● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
   Active: inactive (dead)

  1h    2m      20      {kubelet dhcp46-143.lab.eng.blr.redhat.com}     spec.containers{glusterfs}      Warning Unhealthy       (events with common reason combined)
  37m   2m      8       {kubelet dhcp46-143.lab.eng.blr.redhat.com}     spec.containers{glusterfs}      Normal  Killing         (events with common reason combined)
  56m   13s     134     {kubelet dhcp46-143.lab.eng.blr.redhat.com}     spec.containers{glusterfs}      Warning BackOff         Back-off restarting failed docker container
  56m   13s     134     {kubelet dhcp46-143.lab.eng.blr.redhat.com}                                     Warning FailedSync      Error syncing pod, skipping: failed to "StartContainer" for "glusterfs" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=glusterfs pod=glusterfs-lsxj8_storage-project(a02adb3b-1eae-11e7-a794-005056b38171)"
####################

Comment 2 Prasanth 2017-04-11 18:10:52 UTC
docker version:

# rpm -qa |grep docker
atomic-openshift-docker-excluder-3.5.5.5-1.git.0.f2e87ab.el7.noarch
cockpit-docker-135-4.el7.x86_64
docker-client-1.12.6-16.el7.x86_64
docker-common-1.12.6-16.el7.x86_64
docker-rhel-push-plugin-1.12.6-16.el7.x86_64
docker-1.12.6-16.el7.x86_64

Comment 5 Mohamed Ashiq 2017-04-12 12:28:41 UTC
Atin and I debugged the issue a little more on Prasanth's setup. To isolate glusterd, I increased the timeouts that control the glusterd health checks. Currently the gluster pods are given 100 seconds (a value chosen for 100 volumes in the last release, which covers the time for devicemapper to load on the host, the LVs to come up, and the LVs to be mounted successfully inside the pod) before the glusterd process state is checked. When I increased the timeout, ran docker exec into the pod directly, and checked the glusterd process after some time, glusterd was running successfully.

Prasanth and I also noted the time it took for glusterd to come up for different volume counts in the scale testing:

for 100 volumes, 100 seconds.
for 200 volumes, 160 seconds.
for 300 volumes, 220 seconds.

Comment 6 Mohamed Ashiq 2017-04-12 13:23:12 UTC
(In reply to Mohamed Ashiq from comment #5)
> Atin and I debugged the issue a little more on Prasanth's setup. To isolate
> glusterd, I increased the timeouts that control the glusterd health checks.
> Currently the gluster pods are given 100 seconds (a value chosen for 100
> volumes in the last release, which covers the time for devicemapper to load
> on the host, the LVs to come up, and the LVs to be mounted successfully
> inside the pod) before the glusterd process state is checked. When I
> increased the timeout, ran docker exec into the pod directly, and checked
> the glusterd process after some time, glusterd was running successfully.
> 
> Prasanth and I also noted the time it took for glusterd to come up for
> different volume counts in the scale testing:
> 
> for 100 volumes, 100 seconds.
> for 200 volumes, 160 seconds.
> for 300 volumes, 220 seconds.

The above numbers are exact. We can either document this in the guide, for example:

100 volumes: X RAM and a 100 sec timeout required.
200 volumes: Y RAM and a 200 sec timeout required.
300 volumes: Z RAM and a 250 sec timeout required.

Or the gluster template needs to be changed:

          readinessProbe:
            timeoutSeconds: 3
            initialDelaySeconds: 100
            exec:
              command:
              - "/bin/bash"
              - "-c"
              - systemctl status glusterd.service
            # periodSeconds is the delay from one check to the next;
            # it should be increased from 10 to 20.
            periodSeconds: 10
            successThreshold: 1
            # failureThreshold is the number of retries in case of failure;
            # it should be increased from 3 to 10.
            failureThreshold: 10

This change requires a rebuild of cns-deploy, and it removes the need to change the timeout manually, since the probe will keep checking for a longer time.

Comment 7 Humble Chirammal 2017-04-13 08:39:19 UTC
(In reply to Mohamed Ashiq from comment #6)
> (In reply to Mohamed Ashiq from comment #5)
> > [...]
> > for 100 volumes, 100 seconds.
> > for 200 volumes, 160 seconds.
> > for 300 volumes, 220 seconds.
> 
> The above numbers are exact. We can either document this in the guide, for
> example:
> 
> 100 volumes: X RAM and a 100 sec timeout required.
> 200 volumes: Y RAM and a 200 sec timeout required.
> 300 volumes: Z RAM and a 250 sec timeout required.
> 
> Or the gluster template needs to be changed:
> 
>           readinessProbe:
>             timeoutSeconds: 3
>             initialDelaySeconds: 100
>             exec:
>               command:
>               - "/bin/bash"
>               - "-c"
>               - systemctl status glusterd.service
>             # periodSeconds is the delay from one check to the next;
>             # it should be increased from 10 to 20.
>             periodSeconds: 10
>             successThreshold: 1
>             # failureThreshold is the number of retries in case of failure;
>             # it should be increased from 3 to 10.
>             failureThreshold: 10
> 
> This change requires a rebuild of cns-deploy, and it removes the need to
> change the timeout manually, since the probe will keep checking for a
> longer time.

Ideally, adjusting 'failureThreshold' and 'periodSeconds' should not hurt the user experience for deployments with only a small number of volumes, since the extra checks only happen once the first attempt fails. We could also think about making these values configurable in the template, so that admins can adjust them based on the scale testing we perform for each CNS release.
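For illustration only, a rough sketch of what parameterizing these probe values in the gluster template could look like; the parameter names and defaults below are hypothetical and not taken from the actual cns-deploy template (${{...}} is used so OpenShift substitutes the values as integers rather than strings):

          readinessProbe:
            timeoutSeconds: 3
            initialDelaySeconds: ${{GLUSTERFS_PROBE_INITIAL_DELAY}}
            exec:
              command:
              - "/bin/bash"
              - "-c"
              - systemctl status glusterd.service
            periodSeconds: ${{GLUSTERFS_PROBE_PERIOD}}
            successThreshold: 1
            failureThreshold: ${{GLUSTERFS_PROBE_FAILURE_THRESHOLD}}

and, in the template's parameters section (defaults shown are the values proposed above, purely as an example):

parameters:
- name: GLUSTERFS_PROBE_INITIAL_DELAY
  description: Seconds to wait before the first glusterd health check
  value: "100"
- name: GLUSTERFS_PROBE_PERIOD
  description: Seconds between consecutive glusterd health checks
  value: "20"
- name: GLUSTERFS_PROBE_FAILURE_THRESHOLD
  description: Number of failed checks tolerated before the container is restarted
  value: "10"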

Comment 10 Mohamed Ashiq 2017-04-13 14:49:25 UTC
Prasanth and I verified the initial time it takes to start glusterd in a pod with no volumes.

It was 26 seconds, to be exact (nothing for devicemapper to load, since there is no custom fstab).

So we are sticking with an initial delay of 40 seconds, since there will always be the default heketidbstorage volume, plus:

Period: 25 seconds
Failure threshold: 15

This gives (25 x 15) + 40 = 415 seconds (~7 minutes) before the pod is declared failed in a genuine failure case.

This leaves headroom in case we scale to even more volumes.
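Putting those numbers together, a sketch of how the liveness and readiness probes in the glusterfs template would look with these values (an approximation for illustration, not the exact patch that was merged):

          livenessProbe:
            timeoutSeconds: 3
            initialDelaySeconds: 40
            exec:
              command:
              - "/bin/bash"
              - "-c"
              - systemctl status glusterd.service
            periodSeconds: 25
            successThreshold: 1
            failureThreshold: 15
          readinessProbe:
            timeoutSeconds: 3
            initialDelaySeconds: 40
            exec:
              command:
              - "/bin/bash"
              - "-c"
              - systemctl status glusterd.service
            periodSeconds: 25
            successThreshold: 1
            failureThreshold: 15

With these settings a genuinely failed glusterd is only declared dead after 40 + (15 x 25) = 415 seconds, matching the calculation above.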

Comment 11 Ramakrishna Reddy Yekulla 2017-04-13 18:31:28 UTC
The patch has been merged upstream: https://github.com/gluster/gluster-kubernetes/pull/252

Comment 12 Prasanth 2017-04-17 13:56:35 UTC
Verified as fixed in cns-deploy-4.0.0-15.el7rhgs

The gluster pod is now coming up properly after a node reboot with 200+ volumes and even with 300 volumes.

################
[root@dhcp46-53 ~]# docker ps |grep rhgs
a1946ae9c809        rhgs3/rhgs-server-rhel7:3.2.0-4                                                                                                   "/usr/sbin/init"         26 minutes ago      Up 25 minutes                           k8s_glusterfs.d9b1a406_glusterfs-q2r8s_storage-project_7d5fa3fb-2340-11e7-941a-005056b35fb4_00c4a624


[root@dhcp46-53 ~]# docker exec -ti a1946ae9c809 /bin/bash

[root@dhcp46-53 /]# cat /etc/redhat-storage-release 
Red Hat Gluster Storage Server 3.2 (Container)

[root@dhcp46-53 /]# free -g
              total        used        free      shared  buff/cache   available
Mem:             47          20          18           0           8          25
Swap:            26           0          26


[root@dhcp46-53 /]# systemctl status glusterd
● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2017-04-17 13:33:35 UTC; 19min ago
  Process: 3032 ExecStart=/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level $LOG_LEVEL $GLUSTERD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 3033 (glusterd)
   CGroup: /system.slice/docker-a1946ae9c8092e72ccaa100c5c8379734ef7f34084763667904a1091f756c12e.scope/system.slice/glusterd.service
           ├─3033 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
           ├─3438 /usr/sbin/glusterfsd -s 10.70.46.53 --volfile-id heketidbstorage.10.70.46.53.var-lib-heketi-mounts-vg_614ad25373b8566fc359a0252c565b9c-brick_158dc6f90077f232fd7b57b0655...
           ├─3450 /usr/sbin/glusterfsd -s 10.70.46.53 --volfile-id vol_002d9693aefe67960c60bd6aee01bd2a.10.70.46.53.var-lib-heketi-mounts-
<snip>
...
.....
</snip>
           └─5888 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/glu...

Apr 17 13:32:59 dhcp46-53.lab.eng.blr.redhat.com systemd[1]: Starting GlusterFS, a clustered file-system server...
Apr 17 13:33:35 dhcp46-53.lab.eng.blr.redhat.com systemd[1]: Started GlusterFS, a clustered file-system server.
################

Comment 13 errata-xmlrpc 2017-04-20 18:29:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1112

