Bug 1468386

Summary: Glusterfs pod crash
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: mdunn
Component: rhgs-server-container
Assignee: Saravanakumar <sarumuga>
Status: CLOSED CURRENTRELEASE
QA Contact: Anoop <annair>
Severity: urgent
Docs Contact:
Priority: urgent
Version: cns-3.5
CC: akhakhar, amukherj, annair, aos-bugs, bkunal, btejado, hchiramm, jokerman, madam, mdunn, mmccomas, rcyriac, rhs-bugs, rtalur, sankarshan, sarumuga
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-12-07 08:29:07 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1724792, 1549776
Attachments: oc_get_events (flags: none)

Description mdunn 2017-07-06 22:39:00 UTC
Created attachment 1295081 [details]
oc_get_events

Description of problem:

While running approximately 160 pods on OCP 3.5, one of the six glusterfs pods crashed, entered a restart loop, and would not return to a running state.
The pod is:
[root@dhcp19-231-229 ~]# oc describe pod glusterfs-5s8z6
Name:			glusterfs-5s8z6
Namespace:		storage-project
Security Policy:	privileged
Node:			dhcp19-231-235.css.lab.eng.bos.redhat.com/10.19.231.235
Start Time:		Thu, 29 Jun 2017 14:12:32 -0400
Labels:			glusterfs-node=pod
Status:			Running
IP:			10.19.231.235
Controllers:		DaemonSet/glusterfs
Containers:
  glusterfs:
    Container ID:	docker://7351f03b6c28cd7d49feb77641de7f5573126f85503e0d7cea8f2a574155687f
    Image:		rhgs3/rhgs-server-rhel7:3.2.0-7
    Image ID:		docker-pullable://registry.access.redhat.com/rhgs3/rhgs-server-rhel7@sha256:745adac32afa649eab352d4780e2d4429d0f8aa75369fe623e2b353124282bf1
    Port:		
    State:		Running
      Started:		Thu, 06 Jul 2017 12:04:25 -0400
    Last State:		Terminated
      Reason:		Error
      Exit Code:	137
      Started:		Thu, 06 Jul 2017 11:52:24 -0400
      Finished:		Thu, 06 Jul 2017 11:54:12 -0400
    Ready:		True
    Restart Count:	154
    Liveness:		exec [/bin/bash -c systemctl status glusterd.service] delay=40s timeout=3s period=25s #success=1 #failure=15
    Readiness:		exec [/bin/bash -c systemctl status glusterd.service] delay=40s timeout=3s period=25s #success=1 #failure=15
    Volume Mounts:
      /dev from glusterfs-dev (rw)
      /etc/glusterfs from glusterfs-etc (rw)
      /etc/ssl from glusterfs-ssl (ro)
      /run from glusterfs-run (rw)
      /run/lvm from glusterfs-lvm (rw)
      /sys/fs/cgroup from glusterfs-cgroup (ro)
      /var/lib/glusterd from glusterfs-config (rw)
      /var/lib/heketi from glusterfs-heketi (rw)
      /var/lib/misc/glusterfsd from glusterfs-misc (rw)
      /var/log/glusterfs from glusterfs-logs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-104ao (ro)
    Environment Variables:	<none>
Conditions:
  Type		Status
  Initialized 	True 
  Ready 	True 
  PodScheduled 	True 
Volumes:
  glusterfs-heketi:
    Type:	HostPath (bare host directory volume)
    Path:	/var/lib/heketi
  glusterfs-run:
    Type:	EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:	
  glusterfs-lvm:
    Type:	HostPath (bare host directory volume)
    Path:	/run/lvm
  glusterfs-etc:
    Type:	HostPath (bare host directory volume)
    Path:	/etc/glusterfs
  glusterfs-logs:
    Type:	HostPath (bare host directory volume)
    Path:	/var/log/glusterfs
  glusterfs-config:
    Type:	HostPath (bare host directory volume)
    Path:	/var/lib/glusterd
  glusterfs-dev:
    Type:	HostPath (bare host directory volume)
    Path:	/dev
  glusterfs-misc:
    Type:	HostPath (bare host directory volume)
    Path:	/var/lib/misc/glusterfsd
  glusterfs-cgroup:
    Type:	HostPath (bare host directory volume)
    Path:	/sys/fs/cgroup
  glusterfs-ssl:
    Type:	HostPath (bare host directory volume)
    Path:	/etc/ssl
  default-token-104ao:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	default-token-104ao
QoS Class:	BestEffort
Tolerations:	<none>
 

While collecting sosreports for the nodes in the cluster, one of the hosts (10.19.231.231) entered a hung state and necessitated a forced power off. A crash dump was collected and some of the information contained within is available here:
https://drive.google.com/drive/folders/0B6zCSURQcngfYVdfTlBTZ3JMdUU?usp=sharing
(e.g. the output from log, mount, ps, task, and vm). The node displayed a number of messages stating that it was out of memory and was killing processes.
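
For reference, the OOM kills can be confirmed from the node's kernel logs; a minimal check (standard journalctl/dmesg invocations, shown only as an illustration, not output captured from this node) would be:

# kernel OOM-killer messages from the systemd journal (survive the reboot if the journal is persistent)
journalctl -k --no-pager | grep -iE 'out of memory|oom-killer|killed process'
# or, on the live node, from the kernel ring buffer
dmesg | grep -iE 'out of memory|oom'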

Once the node was rebooted, the aforementioned glusterfs pod entered the running state successfully. At this time it was noted that a different glusterfs pod was now experiencing the crash loop. That pod is as follows:
[root@dhcp19-231-229 ~]# oc describe pod glusterfs-qfnlx
Name:			glusterfs-qfnlx
Namespace:		storage-project
Security Policy:	privileged
Node:			dhcp19-231-233.css.lab.eng.bos.redhat.com/10.19.231.233
Start Time:		Thu, 29 Jun 2017 14:12:32 -0400
Labels:			glusterfs-node=pod
Status:			Running
IP:			10.19.231.233
Controllers:		DaemonSet/glusterfs
Containers:
  glusterfs:
    Container ID:	docker://8b8c085a867c1b48b9ca8ceddf7352669f8b848d71af8a590f8a19e97e2bae25
    Image:		rhgs3/rhgs-server-rhel7:3.2.0-7
    Image ID:		docker-pullable://registry.access.redhat.com/rhgs3/rhgs-server-rhel7@sha256:745adac32afa649eab352d4780e2d4429d0f8aa75369fe623e2b353124282bf1
    Port:		
    State:		Waiting
      Reason:		CrashLoopBackOff
    Last State:		Terminated
      Reason:		Error
      Exit Code:	137
      Started:		Thu, 06 Jul 2017 17:47:12 -0400
      Finished:		Thu, 06 Jul 2017 17:48:38 -0400
    Ready:		False
    Restart Count:	76
    Liveness:		exec [/bin/bash -c systemctl status glusterd.service] delay=40s timeout=3s period=25s #success=1 #failure=15
    Readiness:		exec [/bin/bash -c systemctl status glusterd.service] delay=40s timeout=3s period=25s #success=1 #failure=15
    Volume Mounts:
      /dev from glusterfs-dev (rw)
      /etc/glusterfs from glusterfs-etc (rw)
      /etc/ssl from glusterfs-ssl (ro)
      /run from glusterfs-run (rw)
      /run/lvm from glusterfs-lvm (rw)
      /sys/fs/cgroup from glusterfs-cgroup (ro)
      /var/lib/glusterd from glusterfs-config (rw)
      /var/lib/heketi from glusterfs-heketi (rw)
      /var/lib/misc/glusterfsd from glusterfs-misc (rw)
      /var/log/glusterfs from glusterfs-logs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-104ao (ro)
    Environment Variables:	<none>
Conditions:
  Type		Status
  Initialized 	True 
  Ready 	False 
  PodScheduled 	True 
Volumes:
  glusterfs-heketi:
    Type:	HostPath (bare host directory volume)
    Path:	/var/lib/heketi
  glusterfs-run:
    Type:	EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:	
  glusterfs-lvm:
    Type:	HostPath (bare host directory volume)
    Path:	/run/lvm
  glusterfs-etc:
    Type:	HostPath (bare host directory volume)
    Path:	/etc/glusterfs
  glusterfs-logs:
    Type:	HostPath (bare host directory volume)
    Path:	/var/log/glusterfs
  glusterfs-config:
    Type:	HostPath (bare host directory volume)
    Path:	/var/lib/glusterd
  glusterfs-dev:
    Type:	HostPath (bare host directory volume)
    Path:	/dev
  glusterfs-misc:
    Type:	HostPath (bare host directory volume)
    Path:	/var/lib/misc/glusterfsd
  glusterfs-cgroup:
    Type:	HostPath (bare host directory volume)
    Path:	/sys/fs/cgroup
  glusterfs-ssl:
    Type:	HostPath (bare host directory volume)
    Path:	/etc/ssl
  default-token-104ao:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	default-token-104ao
QoS Class:	BestEffort
Tolerations:	<none>
Events:
  FirstSeen	LastSeen	Count	From							SubObjectPath			Type		Reason		Message
  ---------	--------	-----	----							-------------			--------	------		-------
  7d		2m		77	{kubelet dhcp19-231-233.css.lab.eng.bos.redhat.com}	spec.containers{glusterfs}	Normal		Pulled		Container image "rhgs3/rhgs-server-rhel7:3.2.0-7" already present on machine
  4h		2m		67	{kubelet dhcp19-231-233.css.lab.eng.bos.redhat.com}	spec.containers{glusterfs}	Normal		Created		(events with common reason combined)
  3h		2m		47	{kubelet dhcp19-231-233.css.lab.eng.bos.redhat.com}	spec.containers{glusterfs}	Normal		Started		(events with common reason combined)
  3h		1m		55	{kubelet dhcp19-231-233.css.lab.eng.bos.redhat.com}	spec.containers{glusterfs}	Warning		Unhealthy	Readiness probe failed: ● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
   Active: inactive (dead)

  3h	1m	51	{kubelet dhcp19-231-233.css.lab.eng.bos.redhat.com}	spec.containers{glusterfs}	Warning	Unhealthy	Liveness probe failed: ● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
   Active: inactive (dead)

  2h	1m	26	{kubelet dhcp19-231-233.css.lab.eng.bos.redhat.com}	spec.containers{glusterfs}	Warning	Unhealthy	(events with common reason combined)
  3h	1m	47	{kubelet dhcp19-231-233.css.lab.eng.bos.redhat.com}	spec.containers{glusterfs}	Normal	Killing		(events with common reason combined)
  5h	9s	752	{kubelet dhcp19-231-233.css.lab.eng.bos.redhat.com}	spec.containers{glusterfs}	Warning	BackOff		Back-off restarting failed docker container
  5h	9s	751	{kubelet dhcp19-231-233.css.lab.eng.bos.redhat.com}					Warning	FailedSync	Error syncing pod, skipping: failed to "StartContainer" for "glusterfs" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=glusterfs pod=glusterfs-qfnlx_storage-project(85232959-5cf6-11e7-9813-5254009e1595)"


Neither the initial pod that crashed nor the subsequent one was residing on the node that hung and was then rebooted.

The cluster consists of 1 master, 6 CNS nodes, and 3 additional (non-CNS) nodes.
The sosreports are available here:
https://drive.google.com/drive/folders/0B6zCSURQcngfYVdfTlBTZ3JMdUU?usp=sharing

The output of oc get events -n storage-project is attached.

Version-Release number of selected component (if applicable):
oc v3.5.5.24
kubernetes v1.5.2+43a9be4

How reproducible:
At this time it appears to be easily reproducible

Steps to Reproduce:
1. Install OCP 3.5
2. Install CNS
3. Deploy approximately 160 pods (with approximately 100 total PVCs)
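
A minimal sketch of step 3, assuming a CNS StorageClass named glusterfs-storage (the StorageClass name and the 1Gi claim size are assumptions, not taken from this setup); OCP 3.5 uses the beta storage-class annotation for dynamic provisioning:

# create ~100 PVCs; the application pods (~160 total) then mount these claims
for i in $(seq 1 100); do
  oc create -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: claim${i}
  annotations:
    volume.beta.kubernetes.io/storage-class: glusterfs-storage
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
EOF
done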

Actual results:
After the pods run for some time, one of the glusterfs pods will crash and be unable to recover.

Expected results:
No glusterfs pods should crash.

Additional info:

Comment 2 Mohamed Ashiq 2017-07-17 06:37:54 UTC
Hi,

We have not yet found the root cause. One possibility is that something went wrong during the op-version bump-up.

There are 6 nodes:

3 old:
dhcp19-231-235.css.lab.eng.bos.redhat.com
dhcp19-231-231.css.lab.eng.bos.redhat.com
dhcp19-231-233.css.lab.eng.bos.redhat.com

3 new:
dhcp19-231-239.css.lab.eng.bos.redhat.com
dhcp19-231-200.css.lab.eng.bos.redhat.com
dhcp19-231-237.css.lab.eng.bos.redhat.com


# oc get pods -o wide
NAME                             READY     STATUS             RESTARTS   AGE       IP              NODE
glusterfs-5s8z6                  0/1       CrashLoopBackOff   2827       14d       10.19.231.235   dhcp19-231-235.css.lab.eng.bos.redhat.com
glusterfs-7519d                  1/1       Running            1          14d       10.19.231.231   dhcp19-231-231.css.lab.eng.bos.redhat.com
glusterfs-br2pp                  1/1       Running            1          9d        10.19.231.239   dhcp19-231-239.css.lab.eng.bos.redhat.com
glusterfs-lk223                  1/1       Running            0          8d        10.19.231.200   dhcp19-231-200.css.lab.eng.bos.redhat.com
glusterfs-nlbsf                  1/1       Running            0          9d        10.19.231.237   dhcp19-231-237.css.lab.eng.bos.redhat.com
glusterfs-qfnlx                  1/1       Running            1207       14d       10.19.231.233   dhcp19-231-233.css.lab.eng.bos.redhat.com
heketi-1-xp40f                   1/1       Running            0          14d       10.130.0.66     dhcp19-231-223.css.lab.eng.bos.redhat.com
storage-project-router-3-4lxtf   1/1       Running            0          4d        10.19.231.239   dhcp19-231-239.css.lab.eng.bos.redhat.com

On 29th of June, the CNS upgrade from 3.4 to 3.5 took place. We see the op-version bump-up in the logs.

[2017-06-29 18:12:40.557051] I [MSGID: 100030] [glusterfsd.c:2412:main] 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 3.8.4 (args: /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO)
[2017-06-29 18:12:40.582559] I [MSGID: 106478] [glusterd.c:1382:init] 0-management: Maximum allowed open file descriptors set to 65536
[2017-06-29 18:12:40.582687] I [MSGID: 106479] [glusterd.c:1431:init] 0-management: Using /var/lib/glusterd as working directory
[2017-06-29 18:12:40.599088] E [rpc-transport.c:283:rpc_transport_load] 0-rpc-transport: /usr/lib64/glusterfs/3.8.4/rpc-transport/rdma.so: cannot open shared object file: No such file or directory
[2017-06-29 18:12:40.599158] W [rpc-transport.c:287:rpc_transport_load] 0-rpc-transport: volume 'rdma.management': transport-type 'rdma' is not valid or not found on this machine
[2017-06-29 18:12:40.599184] W [rpcsvc.c:1646:rpcsvc_create_listener] 0-rpc-service: cannot create listener, initing the transport failed
[2017-06-29 18:12:40.599225] E [MSGID: 106243] [glusterd.c:1655:init] 0-management: creation of 1 listeners failed, continuing with succeeded transport
[2017-06-29 18:12:45.368518] E [MSGID: 101032] [store.c:433:gf_store_handle_retrieve] 0-: Path corresponding to /var/lib/glusterd/glusterd.info. [No such file or directory]
[2017-06-29 18:12:45.368601] E [MSGID: 101032] [store.c:433:gf_store_handle_retrieve] 0-: Path corresponding to /var/lib/glusterd/glusterd.info. [No such file or directory]
[2017-06-29 18:12:45.368609] I [MSGID: 106514] [glusterd-store.c:2123:glusterd_restore_op_version] 0-management: Detected new install. Setting op-version to maximum : 31001
[2017-06-29 18:12:45.368880] I [MSGID: 106194] [glusterd-store.c:3636:glusterd_store_retrieve_missed_snaps_list] 0-management: No missed snaps list.
Final graph:
+------------------------------------------------------------------------------+
  1: volume management
  2:     type mgmt/glusterd
  3:     option rpc-auth.auth-glusterfs on
  4:     option rpc-auth.auth-unix on
  5:     option rpc-auth.auth-null on
  6:     option rpc-auth-allow-insecure on
  7:     option transport.socket.listen-backlog 128
  8:     option event-threads 1
  9:     option ping-timeout 0
 10:     option transport.socket.read-fail-log off
 11:     option transport.socket.keepalive-interval 2
 12:     option transport.socket.keepalive-time 10
 13:     option transport-type rdma
 14:     option working-directory /var/lib/glusterd
 15: end-volume
 16:  
+------------------------------------------------------------------------------+
[2017-06-29 18:12:45.369733] I [MSGID: 101190] [event-epoll.c:628:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2017-06-29 18:13:53.121184] I [MSGID: 106163] [glusterd-handshake.c:1274:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 31001
[2017-06-29 18:13:53.121418] E [MSGID: 101032] [store.c:433:gf_store_handle_retrieve] 0-: Path corresponding to /var/lib/glusterd/glusterd.info. [No such file or directory]
[2017-06-29 18:13:53.121717] I [MSGID: 106477] [glusterd.c:188:glusterd_uuid_generate_save] 0-management: generated UUID: f947321c-b02e-4bd0-8c2e-12e477b920d8
[2017-06-29 18:13:53.388681] I [MSGID: 106490] [glusterd-handler.c:2961:__glusterd_handle_probe_query] 0-glusterd: Received probe from uuid: acd4dae7-1e8b-4179-a3f3-8bb587f4d373
[2017-06-29 18:13:53.414595] I [MSGID: 106129] [glusterd-handler.c:2996:__glusterd_handle_probe_query] 0-glusterd: Unable to find peerinfo for host: dhcp19-231-233.css.lab.eng.bos.redhat.com (24007)
[2017-06-29 18:13:53.467206] I [rpc-clnt.c:1060:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2017-06-29 18:13:53.476788] I [MSGID: 106498] [glusterd-handler.c:3609:glusterd_friend_add] 0-management: connect returned 0
[2017-06-29 18:13:53.476957] I [MSGID: 106493] [glusterd-handler.c:3024:__glusterd_handle_probe_query] 0-glusterd: Responded to dhcp19-231-233.css.lab.eng.bos.redhat.com, op_ret: 0, op_errno: 0, ret: 0
[2017-06-29 18:13:53.477912] I [MSGID: 106490] [glusterd-handler.c:2610:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: acd4dae7-1e8b-4179-a3f3-8bb587f4d373
[2017-06-29 18:13:53.621016] I [MSGID: 106493] [glusterd-handler.c:3865:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to dhcp19-231-233.css.lab.eng.bos.redhat.com (0), ret: 0, op_ret: 0
[2017-06-29 18:13:53.877264] I [MSGID: 106511] [glusterd-rpc-ops.c:254:__glusterd_probe_cbk] 0-management: Received probe resp from uuid: acd4dae7-1e8b-4179-a3f3-8bb587f4d373, host: dhcp19-231-233.css.lab.eng.bos.redhat.com
[2017-06-29 18:13:53.877499] I [MSGID: 106511] [glusterd-rpc-ops.c:414:__glusterd_probe_cbk] 0-glusterd: Received resp to probe req
[2017-06-29 18:13:53.917137] I [MSGID: 106493] [glusterd-rpc-ops.c:478:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: acd4dae7-1e8b-4179-a3f3-8bb587f4d373, host: dhcp19-231-233.css.lab.eng.bos.redhat.com, port: 0


We still have to check whether this log represents a clean, error-free op-version bump-up.
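
A hedged way to check this from the pods (standard oc/gluster commands; <pod> is a placeholder for each glusterfs pod name):

# persisted UUID and operating-version written by glusterd on each node
oc exec -n storage-project <pod> -- cat /var/lib/glusterd/glusterd.info
# peer membership as glusterd currently sees it
oc exec -n storage-project <pod> -- gluster peer status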


# oc logs  heketi-1-xp40f  | grep dhcp19 | grep " peer"
[kubeexec] DEBUG 2017/07/05 15:00:10 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:250: Host: dhcp19-231-231.css.lab.eng.bos.redhat.com Pod: glusterfs-7519d Command: gluster peer probe 10.19.231.239
[kubeexec] DEBUG 2017/07/05 15:00:57 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:250: Host: dhcp19-231-231.css.lab.eng.bos.redhat.com Pod: glusterfs-7519d Command: gluster peer probe 10.19.231.237
[kubeexec] DEBUG 2017/07/05 15:51:27 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:250: Host: dhcp19-231-231.css.lab.eng.bos.redhat.com Pod: glusterfs-7519d Command: gluster peer probe 10.19.231.200



@mdunn Thanks for providing the setup for debugging.

We are continuing to debug.

--
Ashiq, Talur.

Comment 4 Mohamed Ashiq 2017-07-17 06:44:04 UTC
Moving to the gluster container component, as this could be an issue related to gluster.

Comment 5 mdunn 2017-07-17 13:36:50 UTC
After discussion with Anoop, we have concluded that this scenario was not covered by the 3.5 release testing. The release testing does not include adding nodes after an upgrade, which is why this was not caught.

Comment 6 Mohamed Ashiq 2017-07-24 13:36:47 UTC
Hi,

Thanks for the setup.

I was checking the setup.

# oc get pods -o wide
NAME                             READY     STATUS    RESTARTS   AGE       IP              NODE
glusterfs-7519d                  1/1       Running   1          24d       10.19.231.231   dhcp19-231-231.css.lab.eng.bos.redhat.com
glusterfs-br2pp                  1/1       Running   1          18d       10.19.231.239   dhcp19-231-239.css.lab.eng.bos.redhat.com
glusterfs-lk223                  1/1       Running   0          18d       10.19.231.200   dhcp19-231-200.css.lab.eng.bos.redhat.com
glusterfs-nlbsf                  1/1       Running   0          18d       10.19.231.237   dhcp19-231-237.css.lab.eng.bos.redhat.com
glusterfs-qfnlx                  1/1       Running   1207       24d       10.19.231.233   dhcp19-231-233.css.lab.eng.bos.redhat.com
heketi-1-xp40f                   1/1       Running   0          24d       10.130.0.66     dhcp19-231-223.css.lab.eng.bos.redhat.com
storage-project-router-3-4lxtf   1/1       Running   0          14d       10.19.231.239   dhcp19-231-239.css.lab.eng.bos.redhat.com


# oc get node --show-labels
NAME                                        STATUS                     AGE       LABELS
dhcp19-231-200.css.lab.eng.bos.redhat.com   Ready                      33d       beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=dhcp19-231-200.css.lab.eng.bos.redhat.com,storagenode=glusterfs
dhcp19-231-220.css.lab.eng.bos.redhat.com   Ready                      104d      beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=dhcp19-231-220.css.lab.eng.bos.redhat.com
dhcp19-231-223.css.lab.eng.bos.redhat.com   Ready                      104d      beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=dhcp19-231-223.css.lab.eng.bos.redhat.com
dhcp19-231-227.css.lab.eng.bos.redhat.com   Ready                      104d      beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=dhcp19-231-227.css.lab.eng.bos.redhat.com
dhcp19-231-229.css.lab.eng.bos.redhat.com   Ready,SchedulingDisabled   104d      beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=dhcp19-231-229.css.lab.eng.bos.redhat.com
dhcp19-231-231.css.lab.eng.bos.redhat.com   Ready                      104d      beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=dhcp19-231-231.css.lab.eng.bos.redhat.com,storagenode=glusterfs
dhcp19-231-233.css.lab.eng.bos.redhat.com   Ready                      104d      beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=dhcp19-231-233.css.lab.eng.bos.redhat.com,storagenode=glusterfs
dhcp19-231-235.css.lab.eng.bos.redhat.com   Ready                      104d      beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=dhcp19-231-235.css.lab.eng.bos.redhat.com,storagenode=glusterfs
dhcp19-231-237.css.lab.eng.bos.redhat.com   Ready                      33d       beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=dhcp19-231-237.css.lab.eng.bos.redhat.com,storagenode=glusterfs
dhcp19-231-239.css.lab.eng.bos.redhat.com   Ready                      33d       beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=dhcp19-231-239.css.lab.eng.bos.redhat.com,storagenode=glusterfs


# oc get ds
NAME        DESIRED   CURRENT   READY     NODE-SELECTOR           AGE
glusterfs   5         5         5         storagenode=glusterfs   24d


It looks like "dhcp19-231-235.css.lab.eng.bos.redhat.com" is the only node that does not have a glusterfs pod running.

[root@dhcp19-231-229 ~]# ssh root@dhcp19-231-235.css.lab.eng.bos.redhat.com
Last login: Mon Jul 24 09:11:35 2017 from 10.13.57.206
[root@dhcp19-231-235 ~]# df -h
Filesystem                              Size  Used Avail Use% Mounted on
/dev/mapper/rhel_dhcp19--231--235-root   50G   50G   20K 100% /
devtmpfs                                 16G     0   16G   0% /dev
tmpfs                                    16G     0   16G   0% /dev/shm
tmpfs                                    16G  2.8M   16G   1% /run
tmpfs                                    16G     0   16G   0% /sys/fs/cgroup
/dev/vda1                              1014M  184M  831M  19% /boot
tmpfs                                    16G  8.0K   16G   1% /var/lib/origin/openshift.local.volumes/pods/612b5e7c-6264-11e7-9813-5254009e1595/volumes/kubernetes.io~secret/server-certificate
tmpfs                                    16G   16K   16G   1% /var/lib/origin/openshift.local.volumes/pods/612b5e7c-6264-11e7-9813-5254009e1595/volumes/kubernetes.io~secret/router-token-2szq9
tmpfs                                    16G   16K   16G   1% /var/lib/origin/openshift.local.volumes/pods/1a41a597-552d-11e7-9813-5254009e1595/volumes/kubernetes.io~secret/default-token-hur2n
tmpfs                                   3.2G     0  3.2G   0% /run/user/0



This is completely expected; gluster cannot proceed without free space in /var/lib.

[root@dhcp19-231-235 ~]# du -h /var/lib/glusterd
.
.
.
.
7.8M	/var/lib/glusterd


[root@dhcp19-231-235 ~]# du -h /var/log/glusterfs
.
.
.
975M	/var/log/glusterfs


[root@dhcp19-231-235 ~]# du -h /var | grep /var/log
.
.
.
33G	/var/log/journal
35G	/var/log



The journal seems to be filling up all the space on the root filesystem.
The gluster logs from the container look OK.

This is the reason why the container does not start.

Can you try expanding the root filesystem?
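
Aside from expanding the filesystem, a minimal sketch of reining in the journal itself (journalctl --vacuum-size and the SystemMaxUse setting are standard systemd facilities; the 1G figure is an arbitrary example, not a recommendation from this setup):

# reclaim space from /var/log/journal immediately
journalctl --vacuum-size=1G
# cap future growth of the persistent journal
echo 'SystemMaxUse=1G' >> /etc/systemd/journald.conf
systemctl restart systemd-journald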

Comment 7 mdunn 2017-07-24 21:08:27 UTC
I will work on expanding the root filesystem, but note that the absence of the glusterfs pod that should be running on dhcp19-231-235 is not the original issue.

This bz was filed on 7/6, and the information below for that node shows it did not transition to the out-of-disk condition until 7/16 (approximately 10 days later).

oc describe node dhcp19-231-235.css.lab.eng.bos.redhat.com 
Name:			dhcp19-231-235.css.lab.eng.bos.redhat.com
Role:			
Labels:			beta.kubernetes.io/arch=amd64
			beta.kubernetes.io/os=linux
			kubernetes.io/hostname=dhcp19-231-235.css.lab.eng.bos.redhat.com
			storagenode=glusterfs
Taints:			<none>
CreationTimestamp:	Mon, 10 Apr 2017 13:32:12 -0400
Phase:			
Conditions:
  Type			Status	LastHeartbeatTime			LastTransitionTime			Reason				Message
  ----			------	-----------------			------------------			------				-------
  OutOfDisk 		True 	Mon, 24 Jul 2017 14:08:15 -0400 	Sun, 16 Jul 2017 22:23:42 -0400 	KubeletOutOfDisk 		out of disk space
  MemoryPressure 	False 	Mon, 24 Jul 2017 14:08:15 -0400 	Mon, 10 Apr 2017 13:32:12 -0400 	KubeletHasSufficientMemory 	kubelet has sufficient memory available
  DiskPressure 		False 	Mon, 24 Jul 2017 14:08:15 -0400 	Mon, 10 Apr 2017 13:32:12 -0400 	KubeletHasNoDiskPressure 	kubelet has no disk pressure
  Ready 		True 	Mon, 24 Jul 2017 14:08:15 -0400 	Tue, 11 Jul 2017 15:43:51 -0400 	KubeletReady 			kubelet is posting ready status


Pain Points
1) I have been unable to find any documentation stating that a node transitioning to the out-of-disk condition will subsequently kill the glusterfs pod. I stumbled upon the fact that the cluster was missing one of the glusterfs pods, and further investigation led me to the out-of-disk condition as well. It was only after inquiring with a few people that I could confirm the two items were, in fact, related (an example of such a check is shown after this list).
If a customer were to get into such a situation, there should be an easier way to confirm that these two items are related.

2) The OCP 3.5 installation documentation states that nodes require a minimum of 15GB of space for the file system that contains /var/. This node has over 3x that and I've still reached a point where the recommendation given is to expand the file system. This leads me to the following points:
a) How was this 15GB number reached?
b) Is there any data showing how regularly we can expect for /var/log/journal (or anything else on /var/) to grow large enough to far outstrip these recommendations?
c) Depending on the frequency with which we can expect to hit such an issue should we modify the minimum requirement value?
d) If such a scenario as this is reached then we should document the suggested steps for successful resolution (assuming that we have not already done so, but I have yet to find that documentation myself).
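
As mentioned in point 1, a hedged example of confirming the correlation from the CLI (standard oc commands, shown here only as an illustration, not output captured from this cluster):

# node conditions, including the OutOfDisk transition time
oc describe node dhcp19-231-235.css.lab.eng.bos.redhat.com | grep -A 8 'Conditions:'
# events that tie the disk condition to pod restarts/evictions
oc get events --all-namespaces | grep -iE 'outofdisk|evict|glusterfs'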

Comment 8 mdunn 2017-07-26 15:21:49 UTC
The file system has been expanded to a size of 100G. There is 72G of available space on the file system. There is now a glusterfs pod on the dhcp19-231-235 node, but it has so far been unable to enter the "Ready" state.
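
For reference, a hedged sketch of how such an expansion is typically done on a RHEL 7 node with an LVM/XFS root like this one (the device name comes from the earlier df output; the +50G increment and the availability of free extents in the volume group are assumptions):

# grow the logical volume, assuming the volume group has free extents
lvextend -L +50G /dev/mapper/rhel_dhcp19--231--235-root
# grow the XFS filesystem to fill the enlarged LV
xfs_growfs /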