Bug 1576564

Summary:	CRIO - Etcd keeps restarting in CrashLoopBackOff after container is deleted
Product:	OpenShift Container Platform
Reporter:	Vikas Laad <vlaad>
Component:	Containers
Assignee:	Mrunal Patel <mpatel>
Status:	CLOSED WORKSFORME
QA Contact:	DeShuai Ma <dma>
Severity:	medium
Docs Contact:
Priority:	unspecified
Version:	3.10.0
CC:	amurdaca, aos-bugs, dwalsh, jokerman, mmccomas, mpatel, vlaad, wmeng
Target Milestone:	---
Target Release:	3.10.0
Hardware:	Unspecified
OS:	Unspecified
Last Closed:	2018-05-21 20:20:02 UTC	Type:	Bug

Description Vikas Laad 2018-05-09 19:02:05 UTC

Description of problem:
I deleted the etcd container from one of the masters in my HA cluster using "crictl rm". I expected it to start back up since it is a static container, but it is stuck in the CrashLoopBackOff state.

I see the following error in the container log:
2018-05-09 18:57:20.326665 W | etcdmain: found invalid file/dir test under data dir /var/lib/etcd/ (Ignore this if you are upgrading etcd)
2018-05-09 18:57:20.326680 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2018-05-09 18:57:20.326699 I | embed: peerTLS: cert = /etc/etcd/peer.crt, key = /etc/etcd/peer.key, ca = , trusted-ca = /etc/etcd/ca.crt, client-cert-auth = true
2018-05-09 18:57:20.327403 I | embed: listening for peers on https://172.31.59.66:2380
2018-05-09 18:57:20.327491 I | embed: listening for client requests on 172.31.59.66:2379
2018-05-09 18:57:20.327697 C | etcdmain: cannot access data directory: open /var/lib/etcd/.touch: permission denied
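
For triage: the fatal line above suggests that the data directory is probed by creating a ".touch" file in it. Below is a minimal Python sketch of that kind of probe. It is not etcd's actual code, and check_dir_writable and its parameters are hypothetical names chosen for illustration. To be meaningful it would have to run as the same user, and under the same SELinux context, as the etcd container.

```python
import os
from typing import Optional


def check_dir_writable(data_dir: str, probe_name: str = ".touch") -> Optional[str]:
    """Return None if a probe file can be created in data_dir, else an error message.

    Hypothetical helper: it only imitates the failing open() seen in the log.
    """
    probe = os.path.join(data_dir, probe_name)
    try:
        # Create (or truncate) the probe file, like the open() that failed in the log.
        fd = os.open(probe, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    except OSError as err:
        # A denial from file mode bits and a denial from SELinux both
        # surface here as "Permission denied".
        return f"cannot access data directory: {err.strerror} ({probe})"
    os.close(fd)
    os.remove(probe)  # clean up so the probe leaves nothing behind
    return None
```

Calling check_dir_writable("/var/lib/etcd") on the affected master would return None when the directory is writable and an error string otherwise.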

Version-Release number of selected component (if applicable):
openshift v3.10.0-0.32.0
kubernetes v1.10.0+b81c8f8
etcd 3.2.16

Steps to Reproduce:
1. Create an HA cluster with 3 co-located etcd and master nodes
2. ssh to one of the masters and run "crictl rm <etcd-container>"
3. The etcd container never starts back up

Actual results:
The etcd container never starts back up

Expected results:
The etcd container should start again

Additional info:

Comment 1 Vikas Laad 2018-05-09 19:04:42 UTC

Note: I was able to make it start after rebooting the instance.

Comment 2 Michal Fojtik 2018-05-10 09:25:16 UTC

It seems like the volume that etcd used for /var/lib/etcd is being reused, but with permissions that are wrong for the user that runs the container?

This might be a storage or CRI-O bug (not related to etcd), so I am assigning it to the containers team for triage.

Comment 3 Mrunal Patel 2018-05-10 18:20:59 UTC

Can you share the k8s configuration for etcd?
Also, what are the permissions, including the SELinux label, on the /var/lib/etcd directory?
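
One way to collect those details in a single pass, as a rough sketch: the Python snippet below reads the mode bits, the owner, and the SELinux label of a path. describe_path is a hypothetical helper name. The sketch assumes a Linux host where the label is stored in the security.selinux extended attribute, and it reports no label if that attribute cannot be read.

```python
import os
import stat
from typing import Dict, Optional


def describe_path(path: str) -> Dict[str, Optional[str]]:
    """Collect mode bits, numeric owner/group, and SELinux label for path."""
    st = os.stat(path)
    try:
        # SELinux keeps the label in the security.selinux extended attribute.
        raw = os.getxattr(path, "security.selinux")
        label = raw.rstrip(b"\x00").decode()
    except (OSError, AttributeError):
        # No label set, xattrs unsupported, or os.getxattr missing (non-Linux).
        label = None
    return {
        "mode": stat.filemode(st.st_mode),  # e.g. a string of the form "drwx------"
        "uid": str(st.st_uid),
        "gid": str(st.st_gid),
        "selinux_label": label,
    }
```

Running describe_path("/var/lib/etcd") before and after the "crictl rm" would show whether the ownership or the label changed between container instances.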

Comment 4 Antonio Murdaca 2018-05-21 09:37:56 UTC

Can you do a smoke test with SELinux enabled and then disabled, just to make sure everything "works" both with and without SELinux?
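
When running such a smoke test, it may help to record the SELinux mode next to each result. The Python sketch below reads it from /sys/fs/selinux/enforce, assuming the usual selinuxfs mount point. selinux_mode is a hypothetical helper name, and a missing file is reported as "disabled" even though it could also mean that selinuxfs is simply not mounted.

```python
def selinux_mode(enforce_path: str = "/sys/fs/selinux/enforce") -> str:
    """Return "enforcing", "permissive", or "disabled" based on the enforce file."""
    try:
        with open(enforce_path) as f:
            value = f.read().strip()
    except OSError:
        # Assumption: no readable enforce file means SELinux is disabled
        # (or selinuxfs is not mounted at the default location).
        return "disabled"
    # The kernel exposes "1" for enforcing and "0" for permissive.
    return "enforcing" if value == "1" else "permissive"
```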

Comment 5 Vikas Laad 2018-05-21 20:20:02 UTC

I am not able to reproduce this issue in the following build, so I am closing it.

openshift v3.10.0-0.47.0
kubernetes v1.10.0+b81c8f8
etcd 3.2.16