Bug 1591805 - etcd pod stuck in CrashLoopBackOff after upgrade - port 2380 already in use
Summary: etcd pod stuck in CrashLoopBackOff after upgrade - port 2380 already in use
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 3.10.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.10.z
Assignee: Scott Dodson
QA Contact: liujia
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-06-15 15:28 UTC by Mike Fiedler
Modified: 2018-07-30 20:22 UTC
CC List: 8 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-07-30 20:22:32 UTC
Target Upstream Version:
Embargoed:
jiajliu: needinfo-


Attachments
inventory, ansible -vvv log and etcd pod log (400.26 KB, application/x-gzip)
2018-06-15 15:28 UTC, Mike Fiedler


Links:
Red Hat Product Errata RHBA-2018:2263 (Last Updated: 2018-07-30 20:22:54 UTC)

Description Mike Fiedler 2018-06-15 15:28:38 UTC
Created attachment 1451955 [details]
inventory, ansible -vvv log and etcd pod log

Description of problem:

Failed upgrade of a healthy CRI-O cluster with 1 co-located master/etcd, 1 infra node, and 2 compute nodes.

The upgrade failed on the etcd health check:

    "stderr": "W0615 14:57:30.080453   15940 util_unix.go:75] Using \"/var/run/crio/crio.sock\" as endpoint is deprecated, please consider using full url format \"unix:///var/run/crio/crio.sock\".\ntime=\"2018-06-15T14:57:30Z\" level=fatal msg=\"execing command in container failed: Internal error occurred: error executing command in container: container is not created or running\" ", 
    "stderr_lines": [
        "W0615 14:57:30.080453   15940 util_unix.go:75] Using \"/var/run/crio/crio.sock\" as endpoint is deprecated, please consider using full url format \"unix:///var/run/crio/crio.sock\".", 
        "time=\"2018-06-15T14:57:30Z\" level=fatal msg=\"execing command in container failed: Internal error occurred: error executing command in container: container is not created or running\" "
    ], 


The etcd pod is in CrashLoopBackOff with the following error:

2018-06-15 15:09:54.927617 C | etcdmain: listen tcp 172.31.52.209:2380: bind: address already in use

root@ip-172-31-52-209: ~ # oc get pods -n kube-system
NAME                                                      READY     STATUS             RESTARTS   AGE
master-etcd-ip-172-31-52-209.us-west-2.compute.internal   0/1       CrashLoopBackOff   12         29m


netstat and ps output:

root@ip-172-31-52-209: ~ # netstat -tunapl | grep 2380
tcp        0      0 172.31.52.209:2380      0.0.0.0:*               LISTEN      10847/etcd
root@ip-172-31-52-209: ~ # ps -ef | grep etcd
root      10836      1  0 14:53 ?        00:00:00 /usr/libexec/crio/conmon -s -c fa9198972f569e9a7e602290677ce1632c1f2ca14c6a33953987c2e756db9fd0 -u fa9198972f569e9a7e602290677ce1632c1f2ca14c6a33953987c2e756db9fd0 -r /usr/bin/runc -b /var/run/containers/storage/overlay-containers/fa9198972f569e9a7e602290677ce1632c1f2ca14c6a33953987c2e756db9fd0/userdata -p /var/run/containers/storage/overlay-containers/fa9198972f569e9a7e602290677ce1632c1f2ca14c6a33953987c2e756db9fd0/userdata/pidfile -l /var/log/pods/9a3a498538cdade3ffc6a2379f08a141/etcd/0.log --exit-dir /var/run/crio/exits --socket-dir-path /var/run/crio --log-size-max 52428800
root      10847  10836  3 14:53 ?        00:00:58 etcd
root      12669  12658  0 14:54 ?        00:00:03 /usr/bin/service-catalog apiserver --storage-type etcd --secure-port 6443 --etcd-servers https://ip-172-31-52-209.us-west-2.compute.internal:2379 --etcd-cafile /etc/origin/master/master.etcd-ca.crt --etcd-certfile /etc/origin/master/master.etcd-client.crt --etcd-keyfile /etc/origin/master/master.etcd-client.key -v 3 --cors-allowed-origins localhost --admission-control KubernetesNamespaceLifecycle,DefaultServicePlan,ServiceBindingsLifecycle,ServicePlanChangeValidator,BrokerAuthSarCheck --feature-gates OriginatingIdentity=true 
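A quick way to tell which manager owns the etcd process holding the port is to inspect its cgroup. This is a hedged diagnostic sketch, not part of the original report (the PID 10847 comes from the netstat output above; adjust for your host):

    # Is the etcd holding 2380 a host systemd service or a container?
    systemctl status etcd       # "active (running)" suggests a host-level service
    cat /proc/10847/cgroup      # crio-/kubepods-scoped cgroups suggest container-managed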


Version-Release number of selected component (if applicable): openshift-ansible master as of 2018-06-15

commit 95c7f6035d5ae53196d7a6f762383969d72d3e88 


How reproducible: Unknown.  Seen once so far.


Actual results:

Inventory, ansible -vvv and etcd pod log attached.

Comment 1 Scott Dodson 2018-06-15 19:48:00 UTC
I've seen this once before; it was because etcd was still running on the host as a systemd service. Not sure how frequently this happens.
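If a host-level etcd service does turn out to be the culprit, a minimal recovery sketch (assuming nothing else on the host depends on that service) would be:

    # Free port 2380 and keep the host service from coming back
    systemctl stop etcd
    systemctl mask etcd
    # The static etcd pod is in CrashLoopBackOff, so the kubelet will
    # retry it on its own once the port is free.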

Comment 2 Mike Fiedler 2018-06-20 18:09:42 UTC
Re-running the upgrade in the same configuration did not reproduce this. It did hit https://bugzilla.redhat.com/show_bug.cgi?id=1591752, which is already provisionally targeted for 3.10.0. Agree with leaving this in 3.10.z for now. We'll be running upgrade tests through code freeze.

Comment 3 Scott Dodson 2018-07-09 20:37:25 UTC
I've seen this happen, and Justin Pierce has run into it when doing a 3.10.x to 3.10.x+1 upgrade in starter environments. Something is starting etcd on the host. When tracing through our code, I noticed that we delete /etc/systemd/system/etcd.service, which effectively unmasks the service. I think we should stop doing that.
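For context, masking a unit replaces its file under /etc/systemd/system with a symlink to /dev/null, so deleting that path silently undoes the mask. A minimal illustration of the mechanism on a scratch host (not a fix; "etcd" stands in for any masked unit):

    systemctl mask etcd
    ls -l /etc/systemd/system/etcd.service    # symlink -> /dev/null
    rm /etc/systemd/system/etcd.service       # what the playbook was doing
    systemctl daemon-reload
    systemctl is-enabled etcd                 # no longer reports "masked"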


https://github.com/openshift/openshift-ansible/pull/9115

Comment 5 Weihua Meng 2018-07-16 08:15:12 UTC
Met the same issue as comment 4.
openshift-ansible-3.10.18-1.git.314.cfe4f91.el7.noarch.rpm

Upgrade succeeded for an RPM install (container runtime docker-1.13.1).

Comment 6 Scott Dodson 2018-07-18 20:00:19 UTC
https://github.com/openshift/openshift-ansible/pull/9246 follow-up fix from Mike

Comment 9 Scott Dodson 2018-07-20 12:15:26 UTC
Fix is in openshift-ansible-3.10.21-1

Comment 10 liujia 2018-07-23 09:02:26 UTC
Verified on openshift-ansible-3.10.21-1.git.0.6446011.el7.noarch

Comment 12 errata-xmlrpc 2018-07-30 20:22:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2263

