1576564 – CRIO - Etcd keeps restarting in CrashLoopBackOff after container is deleted

Summary: CRIO - Etcd keeps restarting in CrashLoopBackOff after container is deleted

Keywords:
Status:	CLOSED WORKSFORME
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Containers
Sub Component:
Version:	3.10.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	3.10.0
Assignee:	Mrunal Patel
QA Contact:	DeShuai Ma
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2018-05-09 19:02 UTC by Vikas Laad
Modified:	2018-05-21 20:20 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2018-05-21 20:20:02 UTC
Target Upstream Version:
Embargoed:

Description Vikas Laad 2018-05-09 19:02:05 UTC

Description of problem:
I deleted the etcd container from one of the masters in my HA cluster using "crictl rm". I expected it to start back up since it's a static container, but it is stuck in the CrashLoopBackOff state.

I see the following error in the container log:
2018-05-09 18:57:20.326665 W | etcdmain: found invalid file/dir test under data dir /var/lib/etcd/ (Ignore this if you are upgrading etcd)
2018-05-09 18:57:20.326680 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2018-05-09 18:57:20.326699 I | embed: peerTLS: cert = /etc/etcd/peer.crt, key = /etc/etcd/peer.key, ca = , trusted-ca = /etc/etcd/ca.crt, client-cert-auth = true
2018-05-09 18:57:20.327403 I | embed: listening for peers on https://172.31.59.66:2380
2018-05-09 18:57:20.327491 I | embed: listening for client requests on 172.31.59.66:2379
2018-05-09 18:57:20.327697 C | etcdmain: cannot access data directory: open /var/lib/etcd/.touch: permission denied
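
The last log line suggests that etcd probes its data directory by opening a ".touch" file inside it. The check can be approximated outside the container with a minimal Python sketch. The function name probe_dir_writable is hypothetical, and the sketch only imitates the behavior implied by the log; it is not etcd's actual implementation.

```python
import os
import tempfile


def probe_dir_writable(data_dir):
    """Try to create and then remove a '.touch' file in data_dir.

    Returns (True, None) on success, or (False, message) when the
    file cannot be created or removed.
    """
    probe = os.path.join(data_dir, ".touch")
    try:
        # Creating the file fails with EACCES when the caller may not
        # write to the directory (mode bits, ownership, or SELinux).
        with open(probe, "w"):
            pass
        os.remove(probe)
    except OSError as exc:
        return False, f"cannot access data directory: {exc}"
    return True, None
```

Calling probe_dir_writable("/var/lib/etcd") as the user that runs the etcd container would show whether the same "permission denied" error appears on the host.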

Version-Release number of selected component (if applicable):
openshift v3.10.0-0.32.0
kubernetes v1.10.0+b81c8f8
etcd 3.2.16

Steps to Reproduce:
1. Create an HA cluster with 3 etcd members co-located with the masters.
2. SSH to one of the masters and run "crictl rm <etcd-container>".
3. The etcd container never starts back up.

Actual results:
The etcd container never starts back up.

Expected results:
Etcd container should start again

Additional info:

Comment 1 Vikas Laad 2018-05-09 19:04:42 UTC

Note: I was able to make it start after rebooting the instance.

Comment 2 Michal Fojtik 2018-05-10 09:25:16 UTC

It seems like the volume that etcd used for /var/lib/etcd is getting re-used but the permissions are wrong for the user that runs the container?

This might be a storage or CRI-O bug (not related to etcd), so I am assigning it to the containers team for triage.

Comment 3 Mrunal Patel 2018-05-10 18:20:59 UTC

Can you share the k8s configuration for etcd?
Also, what are the permissions, including the SELinux label, on the /var/lib/etcd directory?
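
One way to collect the requested information is a short Python sketch like the one below. The function name describe_path is hypothetical. The sketch assumes a Linux host, where the SELinux label is stored in the "security.selinux" extended attribute; on hosts without that attribute the label is reported as None.

```python
import os
import stat
import tempfile


def describe_path(path):
    """Return mode string, owner IDs, and SELinux label (if any) for path."""
    st = os.stat(path)
    try:
        # The label is stored as a NUL-terminated byte string.
        raw = os.getxattr(path, "security.selinux")
        label = raw.rstrip(b"\x00").decode()
    except (OSError, AttributeError):
        # OSError: attribute missing or unsupported by the filesystem.
        # AttributeError: os.getxattr is not available on this platform.
        label = None
    return {
        "mode": stat.filemode(st.st_mode),
        "uid": st.st_uid,
        "gid": st.st_gid,
        "selinux_label": label,
    }
```

Running describe_path("/var/lib/etcd") on the affected master would report the mode bits, numeric owner and group, and the label in one dictionary.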

Comment 4 Antonio Murdaca 2018-05-21 09:37:56 UTC

Can you do a smoke test with SELinux enabled and then disabled, just to make sure everything "works" with and without SELinux?
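
To record which SELinux mode each smoke-test run used, a minimal Python sketch could read the selinuxfs "enforce" file. The function name selinux_mode is hypothetical, and treating a missing file as "disabled" is an assumption that selinuxfs is mounted at /sys/fs/selinux whenever SELinux is enabled.

```python
import os
import tempfile


def selinux_mode(enforce_path="/sys/fs/selinux/enforce"):
    """Return 'enforcing', 'permissive', or 'disabled'.

    Assumption: when the enforce file cannot be read, selinuxfs is not
    mounted and SELinux is treated as disabled.
    """
    try:
        with open(enforce_path) as f:
            value = f.read().strip()
    except OSError:
        return "disabled"
    # The kernel exposes "1" for enforcing and "0" for permissive.
    return "enforcing" if value == "1" else "permissive"
```

Logging the result of selinux_mode() next to each attempt makes it clear whether a failure happened in enforcing or permissive mode.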

Comment 5 Vikas Laad 2018-05-21 20:20:02 UTC

I am not able to reproduce this issue in the following build, so I am closing it.

openshift v3.10.0-0.47.0
kubernetes v1.10.0+b81c8f8
etcd 3.2.16