Bug 1576564

Summary: CRIO - Etcd keeps restarting in CrashLoopBackOff after container is deleted
Product: OpenShift Container Platform Reporter: Vikas Laad <vlaad>
Component: Containers Assignee: Mrunal Patel <mpatel>
Status: CLOSED WORKSFORME QA Contact: DeShuai Ma <dma>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 3.10.0 CC: amurdaca, aos-bugs, dwalsh, jokerman, mmccomas, mpatel, vlaad, wmeng
Target Milestone: ---   
Target Release: 3.10.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-05-21 20:20:02 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Vikas Laad 2018-05-09 19:02:05 UTC
Description of problem:
I deleted the etcd container from one of the masters in my HA cluster using "crictl rm". I expected it to start back up, since it's a static container, but it is stuck in the CrashLoopBackOff state.

I see the following error in the container log:
2018-05-09 18:57:20.326665 W | etcdmain: found invalid file/dir test under data dir /var/lib/etcd/ (Ignore this if you are upgrading etcd)
2018-05-09 18:57:20.326680 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2018-05-09 18:57:20.326699 I | embed: peerTLS: cert = /etc/etcd/peer.crt, key = /etc/etcd/peer.key, ca = , trusted-ca = /etc/etcd/ca.crt, client-cert-auth = true
2018-05-09 18:57:20.327403 I | embed: listening for peers on https://172.31.59.66:2380
2018-05-09 18:57:20.327491 I | embed: listening for client requests on 172.31.59.66:2379
2018-05-09 18:57:20.327697 C | etcdmain: cannot access data directory: open /var/lib/etcd/.touch: permission denied
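
The final critical line suggests that etcd probes the data directory by creating a `.touch` file in it and that the create call is what fails. A minimal Python sketch of an equivalent probe, handy for checking the directory from the same user or security context, might look like this. The function name `probe_dir_writable` is hypothetical and this is not etcd's own code.

```python
import os
import tempfile


def probe_dir_writable(data_dir):
    """Return None if a file can be created in data_dir, else the OSError.

    Hypothetical helper: mimics the ".touch" probe seen in the log above.
    """
    probe = os.path.join(data_dir, ".touch")
    try:
        # Create (or truncate) the probe file, then remove it again.
        with open(probe, "w"):
            pass
        os.remove(probe)
    except OSError as exc:
        return exc
    return None


if __name__ == "__main__":
    # Demo against a scratch directory; on an affected master one would
    # pass "/var/lib/etcd" instead.
    with tempfile.TemporaryDirectory() as scratch:
        print(probe_dir_writable(scratch))
```

A `PermissionError` returned for `/var/lib/etcd` would match the "permission denied" message in the log; it does not by itself say whether file mode, ownership, or SELinux is responsible.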

Version-Release number of selected component (if applicable):
openshift v3.10.0-0.32.0
kubernetes v1.10.0+b81c8f8
etcd 3.2.16

Steps to Reproduce:
1. Create an HA cluster with 3 etcd members co-located with the masters
2. ssh to one of the masters and run "crictl rm <etcd-container>"
3. The etcd container never starts back up

Actual results:
The etcd container never starts back up

Expected results:
Etcd container should start again

Additional info:

Comment 1 Vikas Laad 2018-05-09 19:04:42 UTC
Note: I was able to make it start after rebooting the instance.

Comment 2 Michal Fojtik 2018-05-10 09:25:16 UTC
It seems like the volume that etcd used for /var/lib/etcd is getting re-used but the permissions are wrong for the user that runs the container?

This might be a storage or CRI-O bug (not related to etcd), so I am assigning it to the containers team for triage.

Comment 3 Mrunal Patel 2018-05-10 18:20:59 UTC
Can you share the k8s configuration for etcd?
Also, what are the permissions, including the SELinux label, on the /var/lib/etcd directory?
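
One way to collect that information in a single step is sketched below in Python. It assumes a Linux host where the SELinux label, if any, is stored in the `security.selinux` extended attribute; the helper name `describe_dir` is hypothetical.

```python
import os
import stat


def describe_dir(path):
    """Return mode, owner and SELinux label (or None) for a directory."""
    st = os.stat(path)
    try:
        # Assumption: the label lives in the security.selinux xattr and
        # may carry a trailing NUL byte.
        raw = os.getxattr(path, "security.selinux")
        label = raw.rstrip(b"\x00").decode()
    except (OSError, AttributeError):
        # No label set, xattrs unsupported, or not running on Linux.
        label = None
    return {
        "mode": stat.filemode(st.st_mode),
        "uid": st.st_uid,
        "gid": st.st_gid,
        "selinux": label,
    }


if __name__ == "__main__":
    # "." is used so the demo runs anywhere; on a master the path of
    # interest would be "/var/lib/etcd".
    print(describe_dir("."))
```

Comparing the output for /var/lib/etcd before and after the "crictl rm" would show whether the mode, ownership, or label changed.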

Comment 4 Antonio Murdaca 2018-05-21 09:37:56 UTC
Can you do a smoke test with SELinux enabled and then disabled, just to make sure everything "works" with and without SELinux?
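
To record which SELinux mode each smoke-test run used, a small Python sketch like the following could help. It assumes the mode is exposed at /sys/fs/selinux/enforce ("1" for enforcing, "0" for permissive) and treats a missing file as SELinux being disabled; the function name `selinux_mode` is hypothetical.

```python
import os


def selinux_mode(enforce_path="/sys/fs/selinux/enforce"):
    """Return "enforcing", "permissive" or "disabled" (best effort)."""
    try:
        with open(enforce_path) as fh:
            value = fh.read().strip()
    except OSError:
        # Assumption: selinuxfs is not mounted, so SELinux is disabled.
        return "disabled"
    return "enforcing" if value == "1" else "permissive"


if __name__ == "__main__":
    print(selinux_mode())
```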

Comment 5 Vikas Laad 2018-05-21 20:20:02 UTC
I am not able to reproduce this issue in the following build, so I am closing it.

openshift v3.10.0-0.47.0
kubernetes v1.10.0+b81c8f8
etcd 3.2.16