Bug 1395656

Summary: One of the nodes in a 3-node CNS system fails to respin the gluster container after reboot
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: krishnaram Karthick <kramdoss>
Component: CNS-deployment
Assignee: Humble Chirammal <hchiramm>
Status: CLOSED ERRATA
QA Contact: krishnaram Karthick <kramdoss>
Severity: high
Docs Contact:
Priority: unspecified
Version: cns-3.4
CC: akhakhar, annair, hchiramm, jarrpa, madam, mliyazud, mzywusko, pprakash, rreddy, rtalur, vinug
Target Milestone: ---
Flags: hchiramm: needinfo-
Target Release: CNS 3.4
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-01-18 21:57:05 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1385247
Attachments (flags):
  oc_describe_glusterfs-dc-dhcp46-123.output (none)
  heketi_volumelist.output (none)
  oc_get_pvc.output (none)
  oc_get_events.output (none)

Description krishnaram Karthick 2016-11-16 11:08:41 UTC
Description of problem:
Rebooting a node of a three-node CNS cluster failed to respawn the gluster container hosted on that node. The Gluster trusted storage pool had 91 volumes, including the heketidbstorage volume.

Time of reboot: Wed Nov 16 15:33:25 IST 2016
Memory of each of the worker nodes: 48 GB

oc get pods
NAME                                                     READY     STATUS             RESTARTS   AGE
glusterfs-dc-dhcp46-118.lab.eng.blr.redhat.com-1-2mv70   1/1       Running            2          20h
glusterfs-dc-dhcp46-119.lab.eng.blr.redhat.com-1-kpjch   1/1       Running            5          6d
glusterfs-dc-dhcp46-123.lab.eng.blr.redhat.com-1-ylm2b   0/1       CrashLoopBackOff   12         6d
heketi-1-cvuq4                                           1/1       Running            4          2h
storage-project-router-1-4f2sv                           1/1       Running            5          7d


oc get nodes
NAME                                STATUS                     AGE
dhcp46-118.lab.eng.blr.redhat.com   Ready                      7d
dhcp46-119.lab.eng.blr.redhat.com   Ready                      7d
dhcp46-123.lab.eng.blr.redhat.com   Ready                      7d
dhcp46-146.lab.eng.blr.redhat.com   Ready,SchedulingDisabled   7d


[root@dhcp46-146 ~]# oc describe pods/glusterfs-dc-dhcp46-123.lab.eng.blr.redhat.com-1-ylm2b
Name:			glusterfs-dc-dhcp46-123.lab.eng.blr.redhat.com-1-ylm2b
Namespace:		storage-project
Security Policy:	privileged
Node:			dhcp46-123.lab.eng.blr.redhat.com/10.70.46.123
Start Time:		Wed, 09 Nov 2016 16:13:25 +0530
Labels:			deployment=glusterfs-dc-dhcp46-123.lab.eng.blr.redhat.com-1
			deploymentconfig=glusterfs-dc-dhcp46-123.lab.eng.blr.redhat.com
			glusterfs=pod
			glusterfs-node=dhcp46-123.lab.eng.blr.redhat.com
			name=glusterfs
Status:			Running
IP:			10.70.46.123
Controllers:		ReplicationController/glusterfs-dc-dhcp46-123.lab.eng.blr.redhat.com-1
Containers:
  glusterfs:
    Container ID:	docker://62e10140d90f356c5e6754cef2182e4197a219ab808f494b61ce206115b1d124
    Image:		rhgs3/rhgs-server-rhel7
    Image ID:		docker://sha256:d440f833317de8c3cb96c36d49c72ff390b355df021af6eeb31a9079d5ee9d4d
    Port:		
    State:		Waiting
      Reason:		CrashLoopBackOff
    Last State:		Terminated
      Reason:		Error
      Exit Code:	137
      Started:		Wed, 16 Nov 2016 15:57:57 +0530
      Finished:		Wed, 16 Nov 2016 15:59:37 +0530
    Ready:		False
    Restart Count:	12
    Liveness:		exec [/bin/bash -c systemctl status glusterd.service] delay=60s timeout=3s period=10s #success=1 #failure=3
    Readiness:		exec [/bin/bash -c systemctl status glusterd.service] delay=60s timeout=3s period=10s #success=1 #failure=3
    Volume Mounts:
      /dev from glusterfs-dev (rw)
      /etc/glusterfs from glusterfs-etc (rw)
      /run from glusterfs-run (rw)
      /run/lvm from glusterfs-lvm (rw)
      /sys/fs/cgroup from glusterfs-cgroup (rw)
      /var/lib/glusterd from glusterfs-config (rw)
      /var/lib/heketi from glusterfs-heketi (rw)
      /var/log/glusterfs from glusterfs-logs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-7xdw5 (ro)
    Environment Variables:	<none>
Conditions:
  Type		Status
  Initialized 	True 
  Ready 	False 
  PodScheduled 	True 
Volumes:
  glusterfs-heketi:
    Type:	HostPath (bare host directory volume)
    Path:	/var/lib/heketi
  glusterfs-run:
    Type:	EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:	
  glusterfs-lvm:
    Type:	HostPath (bare host directory volume)
    Path:	/run/lvm
  glusterfs-etc:
    Type:	HostPath (bare host directory volume)
    Path:	/etc/glusterfs
  glusterfs-logs:
    Type:	HostPath (bare host directory volume)
    Path:	/var/log/glusterfs
  glusterfs-config:
    Type:	HostPath (bare host directory volume)
    Path:	/var/lib/glusterd
  glusterfs-dev:
    Type:	HostPath (bare host directory volume)
    Path:	/dev
  glusterfs-cgroup:
    Type:	HostPath (bare host directory volume)
    Path:	/sys/fs/cgroup
  default-token-7xdw5:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	default-token-7xdw5
QoS Class:	BestEffort
Tolerations:	<none>
Events:
  FirstSeen	LastSeen	Count	From						SubobjectPath			Type		Reason		Message
  ---------	--------	-----	----						-------------			--------	------		-------
  27m		27m		1	{kubelet dhcp46-123.lab.eng.blr.redhat.com}					Warning		FailedSync	Error syncing pod, skipping: Error response from daemon: devmapper: Unknown device ea5d28b75dfea4e8172b026c7e040e4cc55e210236b163fb52023f697adfbbf1
  27m		27m		1	{kubelet dhcp46-123.lab.eng.blr.redhat.com}	spec.containers{glusterfs}	Normal		Created		Created container with docker id 53d5c073dd4a; Security:[seccomp=unconfined]
  27m		27m		1	{kubelet dhcp46-123.lab.eng.blr.redhat.com}	spec.containers{glusterfs}	Normal		Started		Started container with docker id 53d5c073dd4a
  25m		25m		1	{kubelet dhcp46-123.lab.eng.blr.redhat.com}	spec.containers{glusterfs}	Warning		Unhealthy	Liveness probe failed: ● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
   Active: activating (start) since Wed 2016-11-16 05:06:16 EST; 2s ago
  Control: 972 (glusterd)
   CGroup: /system.slice/docker-53d5c073dd4ad1194b9a8b8051fe1a499a28c06ca39dcc18af06239ba1a2107d.scope/system.slice/glusterd.service
           ├─ 972 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
           ├─ 973 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
           └─1115 /usr/bin/python /usr/libexec/glusterfs/python/syncdaemon/gsyncd.py -c /var/lib/glusterd/geo-replication/gsyncd_template.conf --config-set-rx gluster-params aux-gfid-mount acl .

Nov 16 05:06:16 dhcp46-123.lab.eng.blr.redhat.com systemd[1]: Starting GlusterFS, a clustered file-system server...

  25m	25m	1	{kubelet dhcp46-123.lab.eng.blr.redhat.com}	spec.containers{glusterfs}	Warning	Unhealthy	Readiness probe failed: ● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
   Active: activating (start) since Wed 2016-11-16 05:06:16 EST; 2s ago
  Control: 972 (glusterd)
   CGroup: /system.slice/docker-53d5c073dd4ad1194b9a8b8051fe1a499a28c06ca39dcc18af06239ba1a2107d.scope/system.slice/glusterd.service
           ├─ 972 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
           ├─ 973 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
           └─1091 /usr/bin/python /usr/libexec/glusterfs/python/syncdaemon/gsyncd.py -c /var/lib/glusterd/geo-replication/gsyncd_template.conf --config-set-rx gluster-command-dir /usr/sbin/ .

Nov 16 05:06:16 dhcp46-123.lab.eng.blr.redhat.com systemd[1]: Starting GlusterFS, a clustered file-system server...

  25m	25m	1	{kubelet dhcp46-123.lab.eng.blr.redhat.com}	spec.containers{glusterfs}	Normal	Killing		Killing container with docker id 53d5c073dd4a: pod "glusterfs-dc-dhcp46-123.lab.eng.blr.redhat.com-1-ylm2b_storage-project(57ac768f-a669-11e6-a52d-005056b380ec)" container "glusterfs" is unhealthy, it will be killed and re-created.
  25m	25m	1	{kubelet dhcp46-123.lab.eng.blr.redhat.com}	spec.containers{glusterfs}	Normal	Started		Started container with docker id 2ad9588b980f
  25m	25m	1	{kubelet dhcp46-123.lab.eng.blr.redhat.com}	spec.containers{glusterfs}	Normal	Created		Created container with docker id 2ad9588b980f; Security:[seccomp=unconfined]
  23m	23m	1	{kubelet dhcp46-123.lab.eng.blr.redhat.com}	spec.containers{glusterfs}	Normal	Killing		Killing container with docker id 2ad9588b980f: pod "glusterfs-dc-dhcp46-123.lab.eng.blr.redhat.com-1-ylm2b_storage-project(57ac768f-a669-11e6-a52d-005056b380ec)" container "glusterfs" is unhealthy, it will be killed and re-created.
  23m	23m	1	{kubelet dhcp46-123.lab.eng.blr.redhat.com}	spec.containers{glusterfs}	Normal	Created		Created container with docker id 4afd8abda915; Security:[seccomp=unconfined]
  23m	23m	1	{kubelet dhcp46-123.lab.eng.blr.redhat.com}	spec.containers{glusterfs}	Normal	Started		Started container with docker id 4afd8abda915
  21m	21m	1	{kubelet dhcp46-123.lab.eng.blr.redhat.com}	spec.containers{glusterfs}	Normal	Killing		Killing container with docker id 4afd8abda915: pod "glusterfs-dc-dhcp46-123.lab.eng.blr.redhat.com-1-ylm2b_storage-project(57ac768f-a669-11e6-a52d-005056b380ec)" container "glusterfs" is unhealthy, it will be killed and re-created.
  21m	21m	1	{kubelet dhcp46-123.lab.eng.blr.redhat.com}	spec.containers{glusterfs}	Normal	Started		Started container with docker id e023f5f4a1fd
  21m	21m	1	{kubelet dhcp46-123.lab.eng.blr.redhat.com}	spec.containers{glusterfs}	Normal	Created		Created container with docker id e023f5f4a1fd; Security:[seccomp=unconfined]
  19m	19m	1	{kubelet dhcp46-123.lab.eng.blr.redhat.com}	spec.containers{glusterfs}	Normal	Killing		Killing container with docker id e023f5f4a1fd: pod "glusterfs-dc-dhcp46-123.lab.eng.blr.redhat.com-1-ylm2b_storage-project(57ac768f-a669-11e6-a52d-005056b380ec)" container "glusterfs" is unhealthy, it will be killed and re-created.
  19m	19m	1	{kubelet dhcp46-123.lab.eng.blr.redhat.com}	spec.containers{glusterfs}	Normal	Created		Created container with docker id 63da263bdf30; Security:[seccomp=unconfined]
  19m	19m	1	{kubelet dhcp46-123.lab.eng.blr.redhat.com}	spec.containers{glusterfs}	Normal	Started		Started container with docker id 63da263bdf30
  17m	17m	1	{kubelet dhcp46-123.lab.eng.blr.redhat.com}	spec.containers{glusterfs}	Normal	Killing		Killing container with docker id 63da263bdf30: pod "glusterfs-dc-dhcp46-123.lab.eng.blr.redhat.com-1-ylm2b_storage-project(57ac768f-a669-11e6-a52d-005056b380ec)" container "glusterfs" is unhealthy, it will be killed and re-created.
  17m	17m	1	{kubelet dhcp46-123.lab.eng.blr.redhat.com}	spec.containers{glusterfs}	Normal	Started		Started container with docker id 02aa8e2074de
  17m	17m	1	{kubelet dhcp46-123.lab.eng.blr.redhat.com}	spec.containers{glusterfs}	Normal	Created		Created container with docker id 02aa8e2074de; Security:[seccomp=unconfined]
  16m	16m	1	{kubelet dhcp46-123.lab.eng.blr.redhat.com}	spec.containers{glusterfs}	Warning	Unhealthy	Readiness probe failed: ● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
   Active: activating (start) since Wed 2016-11-16 05:15:36 EST; 3s ago
  Control: 951 (glusterd)
   CGroup: /system.slice/docker-02aa8e2074deee45556b88579326c42b6f0a82cef0309607ef8ccafb74f02c3f.scope/system.slice/glusterd.service
           ├─951 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
           └─952 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO

Nov 16 05:15:36 dhcp46-123.lab.eng.blr.redhat.com systemd[1]: Starting GlusterFS, a clustered file-system server...

  16m	16m	1	{kubelet dhcp46-123.lab.eng.blr.redhat.com}	spec.containers{glusterfs}	Normal	Killing		Killing container with docker id 02aa8e2074de: pod "glusterfs-dc-dhcp46-123.lab.eng.blr.redhat.com-1-ylm2b_storage-project(57ac768f-a669-11e6-a52d-005056b380ec)" container "glusterfs" is unhealthy, it will be killed and re-created.
  16m	13m	14	{kubelet dhcp46-123.lab.eng.blr.redhat.com}					Warning	FailedSync	Error syncing pod, skipping: failed to "StartContainer" for "glusterfs" with CrashLoopBackOff: "Back-off 2m40s restarting failed container=glusterfs pod=glusterfs-dc-dhcp46-123.lab.eng.blr.redhat.com-1-ylm2b_storage-project(57ac768f-a669-11e6-a52d-005056b380ec)"

  13m	13m	1	{kubelet dhcp46-123.lab.eng.blr.redhat.com}	spec.containers{glusterfs}	Normal	Started		Started container with docker id 6664b7da06d5
  13m	13m	1	{kubelet dhcp46-123.lab.eng.blr.redhat.com}	spec.containers{glusterfs}	Normal	Created		Created container with docker id 6664b7da06d5; Security:[seccomp=unconfined]
  12m	12m	1	{kubelet dhcp46-123.lab.eng.blr.redhat.com}	spec.containers{glusterfs}	Warning	Unhealthy	Readiness probe failed: ● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
   Active: activating (start) since Wed 2016-11-16 05:20:06 EST; 3s ago
  Control: 912 (glusterd)
   CGroup: /system.slice/docker-6664b7da06d5e70a5b17e3988fef0518c955aefe29c2b72a89b81dd9e6009312.scope/system.slice/glusterd.service
           ├─912 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
           └─913 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO

Nov 16 05:20:06 dhcp46-123.lab.eng.blr.redhat.com systemd[1]: Starting GlusterFS, a clustered file-system server...

  12m	12m	1	{kubelet dhcp46-123.lab.eng.blr.redhat.com}	spec.containers{glusterfs}	Warning	Unhealthy	Liveness probe failed: ● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
   Active: activating (start) since Wed 2016-11-16 05:20:06 EST; 3s ago
  Control: 912 (glusterd)
   CGroup: /system.slice/docker-6664b7da06d5e70a5b17e3988fef0518c955aefe29c2b72a89b81dd9e6009312.scope/system.slice/glusterd.service
           ├─912 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
           └─913 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO

Nov 16 05:20:06 dhcp46-123.lab.eng.blr.redhat.com systemd[1]: Starting GlusterFS, a clustered file-system server...

  11m	11m	1	{kubelet dhcp46-123.lab.eng.blr.redhat.com}	spec.containers{glusterfs}	Normal	Killing		Killing container with docker id 6664b7da06d5: pod "glusterfs-dc-dhcp46-123.lab.eng.blr.redhat.com-1-ylm2b_storage-project(57ac768f-a669-11e6-a52d-005056b380ec)" container "glusterfs" is unhealthy, it will be killed and re-created.
  6m	6m	1	{kubelet dhcp46-123.lab.eng.blr.redhat.com}	spec.containers{glusterfs}	Normal	Started		Started container with docker id 77ce5a592c24
  6m	6m	1	{kubelet dhcp46-123.lab.eng.blr.redhat.com}	spec.containers{glusterfs}	Normal	Created		Created container with docker id 77ce5a592c24; Security:[seccomp=unconfined]
  27m	4m	9	{kubelet dhcp46-123.lab.eng.blr.redhat.com}	spec.containers{glusterfs}	Normal	Pulling		pulling image "rhgs3/rhgs-server-rhel7"
  4m	4m	1	{kubelet dhcp46-123.lab.eng.blr.redhat.com}	spec.containers{glusterfs}	Normal	Killing		Killing container with docker id 77ce5a592c24: pod "glusterfs-dc-dhcp46-123.lab.eng.blr.redhat.com-1-ylm2b_storage-project(57ac768f-a669-11e6-a52d-005056b380ec)" container "glusterfs" is unhealthy, it will be killed and re-created.
  27m	4m	9	{kubelet dhcp46-123.lab.eng.blr.redhat.com}	spec.containers{glusterfs}	Normal	Pulled		Successfully pulled image "rhgs3/rhgs-server-rhel7"
  4m	4m	1	{kubelet dhcp46-123.lab.eng.blr.redhat.com}	spec.containers{glusterfs}	Normal	Created		Created container with docker id 62e10140d90f; Security:[seccomp=unconfined]
  4m	4m	1	{kubelet dhcp46-123.lab.eng.blr.redhat.com}	spec.containers{glusterfs}	Normal	Started		Started container with docker id 62e10140d90f
  26m	3m	9	{kubelet dhcp46-123.lab.eng.blr.redhat.com}	spec.containers{glusterfs}	Warning	Unhealthy	Liveness probe failed: ● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
   Active: inactive (dead)

  26m	3m	9	{kubelet dhcp46-123.lab.eng.blr.redhat.com}	spec.containers{glusterfs}	Warning	Unhealthy	Readiness probe failed: ● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
   Active: inactive (dead)

  2m	2m	1	{kubelet dhcp46-123.lab.eng.blr.redhat.com}	spec.containers{glusterfs}	Normal	Killing		Killing container with docker id 62e10140d90f: pod "glusterfs-dc-dhcp46-123.lab.eng.blr.redhat.com-1-ylm2b_storage-project(57ac768f-a669-11e6-a52d-005056b380ec)" container "glusterfs" is unhealthy, it will be killed and re-created.
  11m	11s	37	{kubelet dhcp46-123.lab.eng.blr.redhat.com}					Warning	FailedSync	Error syncing pod, skipping: failed to "StartContainer" for "glusterfs" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=glusterfs pod=glusterfs-dc-dhcp46-123.lab.eng.blr.redhat.com-1-ylm2b_storage-project(57ac768f-a669-11e6-a52d-005056b380ec)"

  16m	11s	51	{kubelet dhcp46-123.lab.eng.blr.redhat.com}	spec.containers{glusterfs}	Warning	BackOff	Back-off restarting failed docker container

Version-Release number of selected component (if applicable):
openshift version
openshift v3.4.0.23+24b1a58
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0

How reproducible:
Reproducibility depends on the number of volumes in the system: the more volumes there are, the higher the chance of hitting the issue.

Steps to Reproduce:
1. Create a 3-node CNS cluster.
2. Create around 90 volumes by issuing 90 PVC requests (a scripted sketch of this step follows below).
3. Reboot the three nodes one by one, i.e., reboot a node, wait for it to come back up, check whether the gluster and other containers are respawned successfully, and then proceed with the next node's reboot.
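
A minimal sketch of how step 2 could be scripted is shown below; the storage class name ("glusterfs"), the PVC names, and the 1Gi size are illustrative assumptions, not values taken from this setup:

# Create ~90 1Gi PVCs against an assumed GlusterFS-backed storage class.
for i in $(seq 1 90); do
cat <<EOF | oc create -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: scale-pvc-$i                                        # hypothetical PVC name
  annotations:
    volume.beta.kubernetes.io/storage-class: glusterfs      # assumed storage class name
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
EOF
done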

When this test was run, node 'dhcp46-119.lab.eng.blr.redhat.com' came back up without any issues. However, the gluster container on node 'dhcp46-123.lab.eng.blr.redhat.com' failed to spawn successfully.

Actual results:
The gluster container on node 'dhcp46-123.lab.eng.blr.redhat.com' failed to spawn successfully.

Expected results:
A node reboot should respawn all the containers hosted on that node successfully.

Additional info:
I suspect the 60-second liveness probe delay to be the reason the gluster pod fails to be respun. With a larger number of volumes, the time taken to start the brick processes increases, and eventually the glusterd process does not come up within 60 seconds, so the container is killed and the whole startup begins again. I'll leave it to dev to either confirm this theory or find some other reason.
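
For reference while investigating, the probe settings currently configured on the deployment config can be dumped with something like the following (a sketch; the dc name is taken from the pod output above):

# Show the liveness/readiness probe settings of the affected deployment config.
oc get dc/glusterfs-dc-dhcp46-123.lab.eng.blr.redhat.com -o yaml \
  | grep -A 10 -E 'livenessProbe|readinessProbe'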

Logs shall be attached shortly.

Comment 2 krishnaram Karthick 2016-11-16 11:14:07 UTC
Created attachment 1221104 [details]
oc_describe_glusterfs-dc-dhcp46-123.output

Comment 3 krishnaram Karthick 2016-11-16 11:14:45 UTC
Created attachment 1221105 [details]
heketi_volumelist.output

Comment 4 krishnaram Karthick 2016-11-16 11:15:16 UTC
Created attachment 1221106 [details]
oc_get_pvc.output

Comment 5 krishnaram Karthick 2016-11-16 11:16:52 UTC
Created attachment 1221107 [details]
oc_get_events.output

Comment 7 Humble Chirammal 2016-11-16 11:53:10 UTC
Karthick, thanks for the detailed bug report. :)  

I am yet to check the logs; however, I have a couple of questions.

If I understand correctly, all three nodes have the same number of brick processes and the same hardware configuration, and only one of the nodes is failing. Is that correct?

Secondly, can you confirm that this issue does not pop up when you reduce the number of bricks in your setup, say 70-80 volumes?

The pointer about the probe timeout is valid; it was 'kind of' working for the last release's volume scaling test, but we can definitely rethink it.

Comment 8 krishnaram Karthick 2016-11-16 13:44:32 UTC
(In reply to Humble Chirammal from comment #7)
> Karthick, thanks for the detailed bug report. :)  
> 
> I am yet to check the logs; however, I have a couple of questions. 
> 
> If I understand correctly, all three nodes have the same number of brick
> processes and the same hardware configuration, and only one of the nodes is
> failing. Is that correct?

That's right.

> 
> Secondly, can you confirm that this issue does not pop up when you reduce
> the number of bricks in your setup, say 70-80 volumes?

Unfortunately, trying that won't be possible; we don't have the bandwidth. :) It takes a good amount of time to set up and run this test, and when the container crashes it again takes time to get the setup back into a working state.

But I've seen this issue with 137 volumes and with 175 volumes, and it was hit both times with a 100% hit rate.

Hope this information helps.

> The pointer about the probe timeout is valid; it was 'kind of' working for
> the last release's volume scaling test, but we can definitely rethink it.

Comment 11 Humble Chirammal 2016-11-17 04:45:38 UTC
(In reply to krishnaram Karthick from comment #10)
> (In reply to krishnaram Karthick from comment #9)
> > Given that there is a possibility of this issue being seen in previous
> > releases of CNS, it would be nice to come up with a workaround for this
> > issue to help any existing customers with scaled volumes.
> 
> Found a workaround:
> 
> 1) Delete the deployment config and the pod that is in the failed state.
> 
> # oc delete dc/glusterfs-dc-dhcp46-123.lab.eng.blr.redhat.com
> # oc delete pods/glusterfs-dc-dhcp46-123.lab.eng.blr.redhat.com
> 
> 2) Delete the glusterfs template
> 
> # oc get templates
> NAME        DESCRIPTION                               PARAMETERS    OBJECTS
> glusterfs   GlusterFS container deployment template   1 (1 blank)   1
> heketi      Heketi service deployment template        8 (7 blank)   3
> 
> # oc delete templates/glusterfs
> 
> 3) Edit the /usr/share/heketi/templates/glusterfs-template.json file to
> increase the liveness probe delay from 60 seconds to a higher value. (I used
> 90 seconds.)
> 
> 4) Create the glusterfs template again
> 
> # oc create -f /usr/share/heketi/templates/glusterfs-template.json
> 
> 5) Deploy RHGS container 
> 
> # oc process glusterfs -v GLUSTERFS_NODE=dhcp46-123.lab.eng.blr.redhat.com | oc create -f -
> 
> 6) Wait for glusterfs pod to be up and running
> 
> oc get pods
> NAME                                                     READY     STATUS    RESTARTS   AGE
> glusterfs-dc-dhcp46-118.lab.eng.blr.redhat.com-1-2mv70   1/1       Running   2          1d
> glusterfs-dc-dhcp46-119.lab.eng.blr.redhat.com-1-kpjch   1/1       Running   5          7d
> glusterfs-dc-dhcp46-123.lab.eng.blr.redhat.com-1-t5swh   1/1       Running   1          12m
> heketi-1-cvuq4                                           1/1       Running   4          19h
> storage-project-router-1-4f2sv                           1/1       Running   5          7d
> 
> I'd once again wait for dev to approve these steps before sharing them.

Karthick, when you said it's difficult to set up, I was planning to ask you to edit the template to ~100s and deploy it again. We are increasing the template timeout to "100" in the next heketi build; that's what we planned yesterday (https://github.com/heketi/heketi/pull/576). We know a scale test with ~70 volumes went through without issues at "60s"; as there are more volumes now, it is good to increase the value somewhat. The only issue with increasing this value is that, if there are not many volumes, the probe may be delayed unnecessarily. However, considering we are a storage container, that looks fine to me.
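
For anyone applying the equivalent change by hand before the new build is available, a minimal sketch of steps 2-4 of the workaround quoted above, assuming the template ships at the path below and that the delay appears in the JSON exactly as "initialDelaySeconds": 60 (the exact formatting is an assumption):

# Delete the shipped template, bump both probe delays from 60s to 100s, then re-create it.
oc delete templates/glusterfs
sed -i 's/"initialDelaySeconds": 60/"initialDelaySeconds": 100/g' \
    /usr/share/heketi/templates/glusterfs-template.json
oc create -f /usr/share/heketi/templates/glusterfs-template.json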

Comment 15 Humble Chirammal 2016-12-13 12:25:38 UTC
@Karthick, the timeout is now 100s as discussed here. The new value is available in the latest heketi build mentioned below:

New build with heketi-3.1.0.5, rhgs-volmanager-docker:3.1.3-19 is available.

Brew task link: https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=528359

Comment 16 krishnaram Karthick 2016-12-28 03:56:17 UTC
Timeout has been increased to 100 seconds.

      containers:
      - image: rhgs3/rhgs-server-rhel7:3.1.3-16
        imagePullPolicy: Always
        livenessProbe:
          exec:
            command:
            - /bin/bash
            - -c
            - systemctl status glusterd.service
          failureThreshold: 3
          initialDelaySeconds: 100
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 3
        name: glusterfs
        readinessProbe:
          exec:
            command:
            - /bin/bash
            - -c
            - systemctl status glusterd.service
          failureThreshold: 3
          initialDelaySeconds: 100
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 3
        resources: {}
        securityContext:
          capabilities: {}
          privileged: true


The issue reported in this bug is no longer seen in the scale tests.
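
A quick way to confirm that a running gluster pod has picked up the new delay (a sketch; replace the placeholder with an actual pod name from 'oc get pods'):

# The Liveness/Readiness lines should now report delay=100s instead of delay=60s.
oc describe pods/<glusterfs-pod-name> | grep -E 'Liveness|Readiness'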

Comment 17 krishnaram Karthick 2016-12-28 03:58:22 UTC
@Humble, how are we handling this change for upgrades from 3.3? Are we documenting the steps to change the value from 60 to 100 if this change is not handled automatically?

Comment 18 Humble Chirammal 2016-12-28 06:42:40 UTC
(In reply to krishnaram Karthick from comment #17)
> @Humble, how are we handling this change for upgrades from 3.3? Are we
> documenting the steps to change the value from 60 to 100 if this change is
> not handled automatically?

The new settings will be available in the new template we ship with 3.4, so it's automatically taken care of with the upgrade.

Comment 19 krishnaram Karthick 2016-12-28 06:46:53 UTC
Thanks for the update, Humble. I'll move the bug to verified, as no doc text is required based on C#18.

Comment 20 errata-xmlrpc 2017-01-18 21:57:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2017-0148.html