Created attachment 1451955 [details]
inventory, ansible -vvv log and etcd pod log

Description of problem:
Failed upgrade of a healthy CRI-O cluster with 1 co-located master/etcd, 1 infra and 2 computes. The upgrade failed on the etcd health check:

    "stderr": "W0615 14:57:30.080453   15940 util_unix.go:75] Using \"/var/run/crio/crio.sock\" as endpoint is deprecated, please consider using full url format \"unix:///var/run/crio/crio.sock\".\ntime=\"2018-06-15T14:57:30Z\" level=fatal msg=\"execing command in container failed: Internal error occurred: error executing command in container: container is not created or running\" ",
    "stderr_lines": [
        "W0615 14:57:30.080453   15940 util_unix.go:75] Using \"/var/run/crio/crio.sock\" as endpoint is deprecated, please consider using full url format \"unix:///var/run/crio/crio.sock\".",
        "time=\"2018-06-15T14:57:30Z\" level=fatal msg=\"execing command in container failed: Internal error occurred: error executing command in container: container is not created or running\" "
    ],

The etcd pod is in CrashLoopBackOff with the following error:

    2018-06-15 15:09:54.927617 C | etcdmain: listen tcp 172.31.52.209:2380: bind: address already in use

    root@ip-172-31-52-209: ~ # oc get pods -n kube-system
    NAME                                                      READY     STATUS             RESTARTS   AGE
    master-etcd-ip-172-31-52-209.us-west-2.compute.internal   0/1       CrashLoopBackOff   12         29m

netstat and ps:

    root@ip-172-31-52-209: ~ # netstat -tunapl | grep 2380
    tcp        0      0 172.31.52.209:2380      0.0.0.0:*               LISTEN      10847/etcd

    root@ip-172-31-52-209: ~ # ps -ef | grep etcd
    root     10836     1  0 14:53 ?        00:00:00 /usr/libexec/crio/conmon -s -c fa9198972f569e9a7e602290677ce1632c1f2ca14c6a33953987c2e756db9fd0 -u fa9198972f569e9a7e602290677ce1632c1f2ca14c6a33953987c2e756db9fd0 -r /usr/bin/runc -b /var/run/containers/storage/overlay-containers/fa9198972f569e9a7e602290677ce1632c1f2ca14c6a33953987c2e756db9fd0/userdata -p /var/run/containers/storage/overlay-containers/fa9198972f569e9a7e602290677ce1632c1f2ca14c6a33953987c2e756db9fd0/userdata/pidfile -l /var/log/pods/9a3a498538cdade3ffc6a2379f08a141/etcd/0.log --exit-dir /var/run/crio/exits --socket-dir-path /var/run/crio --log-size-max 52428800
    root     10847 10836  3 14:53 ?        00:00:58 etcd
    root     12669 12658  0 14:54 ?        00:00:03 /usr/bin/service-catalog apiserver --storage-type etcd --secure-port 6443 --etcd-servers https://ip-172-31-52-209.us-west-2.compute.internal:2379 --etcd-cafile /etc/origin/master/master.etcd-ca.crt --etcd-certfile /etc/origin/master/master.etcd-client.crt --etcd-keyfile /etc/origin/master/master.etcd-client.key -v 3 --cors-allowed-origins localhost --admission-control KubernetesNamespaceLifecycle,DefaultServicePlan,ServiceBindingsLifecycle,ServicePlanChangeValidator,BrokerAuthSarCheck --feature-gates OriginatingIdentity=true

Version-Release number of selected component (if applicable):
openshift-ansible master as of 2018-06-15, commit 95c7f6035d5ae53196d7a6f762383969d72d3e88

How reproducible:
Unknown. Seen once so far.

Actual results:
Inventory, ansible -vvv and etcd pod logs attached.
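For triage, the key question is which process already holds port 2380 when the etcd pod tries to bind it. A minimal sketch of that check, using a hypothetical `port_owner` helper (not part of the playbooks) that extracts the PID/program column from `netstat -tunapl`-style output:

```shell
# Hypothetical helper: given a port and netstat -tunapl output on stdin,
# print the PID/program that is LISTENing on that port.
port_owner() {
  awk -v p=":$1" '$4 ~ (p "$") && $6 == "LISTEN" { print $NF }'
}

# Sample line from the report above; on a live host you would pipe in
# `netstat -tunapl` instead.
sample='tcp        0      0 172.31.52.209:2380      0.0.0.0:*               LISTEN      10847/etcd'
echo "$sample" | port_owner 2380    # -> 10847/etcd
```

Cross-referencing that PID against `ps -ef` (or `systemctl status etcd`) shows whether the listener is the pod's etcd or a stray host-level one.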
I've seen this once before, and it was because etcd was still running on the host as a systemd service. I'm not sure how frequently this happens; I've only seen it that one time.
Re-running the upgrade in the same configuration did not reproduce this. It did hit https://bugzilla.redhat.com/show_bug.cgi?id=1591752, which is already provisionally targeted for 3.10.0. Agreed with leaving this in 3.10.z for now; we'll be running upgrade tests through code freeze.
I've seen this happen, and Justin Pierce has run into it when doing a 3.10.x to 3.10.x+1 upgrade in the starter environments. Something is starting etcd on the host. Tracing through our code, I noticed that we delete /etc/systemd/system/etcd.service, which effectively unmasks the service. I think we should stop doing that. https://github.com/openshift/openshift-ansible/pull/9115
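For context on why deleting the unit file unmasks the service: systemd implements masking as a symlink from the unit file to /dev/null, so removing that file re-enables starting the unit. A rough sketch in a scratch directory (a stand-in for /etc/systemd/system, not touching the real one):

```shell
# Masking a unit creates a symlink to /dev/null; deleting the unit file
# (as the playbook did) removes that symlink and thereby unmasks the unit.
unitdir=$(mktemp -d)                       # stand-in for /etc/systemd/system
ln -s /dev/null "$unitdir/etcd.service"    # what `systemctl mask etcd` creates
masked_target=$(readlink "$unitdir/etcd.service")
rm "$unitdir/etcd.service"                 # the deletion that unmasks etcd
echo "$masked_target"                      # -> /dev/null (it was masked)
```

So once the playbooks delete the file, anything that later issues `systemctl start etcd` can bring up a host-level etcd that grabs port 2380 out from under the pod.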
Met the same issue as comment 4 with openshift-ansible-3.10.18-1.git.314.cfe4f91.el7.noarch.rpm. The upgrade succeeds for an RPM install (container runtime docker-1.13.1).
https://github.com/openshift/openshift-ansible/pull/9246 is a follow-up fix from Mike.
Fix is in openshift-ansible-3.10.21-1
Verified on openshift-ansible-3.10.21-1.git.0.6446011.el7.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:2263