While trying to update OpenShift 4.5.14 to 4.6.0-rc.0, all operators except machine-config updated successfully. At the end, after one of the masters updated and restarted, etcd on that node became disconnected from the other etcd instances and the other nodes didn't proceed with the update.

I checked the Cincinnati graph and my current version (4.5.14) was eligible for the 4.6.0-rc.0 update. I patched my cluster version to add the candidate-4.6 channel:

```sh
oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/channel", "value": "candidate-4.6"}]'
```

My nodes at the time of the error:

```sh
❯ oc get nodes
NAME                     STATUS                     ROLES           AGE    VERSION
ocp-9s46k-infra-mvg7f    Ready                      infra,worker    32h    v1.19.0+db1fc96
ocp-9s46k-infra-swl4z    Ready                      infra,worker    32h    v1.19.0+db1fc96
ocp-9s46k-master-0       Ready,SchedulingDisabled   master,worker   9d     v1.18.3+970c1b3
ocp-9s46k-master-1       Ready                      master,worker   9d     v1.18.3+970c1b3
ocp-9s46k-master-2       Ready                      master,worker   9d     v1.19.0+db1fc96
ocp-9s46k-worker-lhts2   Ready                      worker          9d     v1.19.0+db1fc96
ocp-9s46k-worker-r9bc2   Ready                      worker          9d     v1.19.0+db1fc96
ocp-9s46k-worker-wrc9l   Ready                      worker          173m   v1.19.0+db1fc96
```

Pods pending:

```sh
kubectl get pods -o wide --all-namespaces |grep -v Running |grep -v Completed    Naboo.internal.carlosedp.com: Wed Oct 7 19:18:31 2020

NAMESPACE              NAME                         READY   STATUS              RESTARTS   AGE   IP       NODE                     NOMINATED NODE   READINESS GATES
openshift-apiserver    apiserver-6648cbddbd-25kvw   0/2     Pending             0          20m   <none>   <none>                   <none>           <none>
openshift-apiserver    apiserver-74d96d87-9n827     0/1     Pending             0          20m   <none>   <none>                   <none>           <none>
openshift-workspaces   postgres-7bc69f8bf5-p6n2v    0/1     ContainerCreating   0          17m   <none>   ocp-9s46k-worker-r9bc2   <none>           <none>
```

ETCD Pods

```sh
❯ kp
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
etcd-ocp-9s46k-master-0 3/3 Running 0 115m 192.168.1.179 ocp-9s46k-master-0 <none> <none>
etcd-ocp-9s46k-master-1 3/3 Running 0 115m 192.168.1.136 ocp-9s46k-master-1 <none> <none>
etcd-ocp-9s46k-master-2 3/3 Running 0 3m2s 192.168.1.134 
ocp-9s46k-master-2 <none> <none> etcd-quorum-guard-7986975d98-bdxsg 1/1 Running 0 139m 192.168.1.179 ocp-9s46k-master-0 <none> <none> etcd-quorum-guard-7986975d98-blxlv 0/1 Running 0 53m 192.168.1.134 ocp-9s46k-master-2 <none> <none> etcd-quorum-guard-7986975d98-mf2z6 1/1 Running 0 130m 192.168.1.136 ocp-9s46k-master-1 <none> <none> installer-7-ocp-9s46k-master-1 0/1 Completed 0 120m 10.130.0.39 ocp-9s46k-master-1 <none> <none> installer-7-ocp-9s46k-master-2 0/1 Completed 0 123m 10.129.0.9 ocp-9s46k-master-2 <none> <none> revision-pruner-6-ocp-9s46k-master-1 0/1 Completed 0 129m 10.130.0.4 ocp-9s46k-master-1 <none> <none> revision-pruner-6-ocp-9s46k-master-2 0/1 Completed 0 126m 10.129.0.31 ocp-9s46k-master-2 <none> <none> revision-pruner-7-ocp-9s46k-master-1 0/1 Completed 0 115m 10.130.0.41 ocp-9s46k-master-1 <none> <none> revision-pruner-7-ocp-9s46k-master-2 0/1 Completed 0 120m 10.129.0.21 ocp-9s46k-master-2 <none> <none> ``` ETCD-Probe Guard ```sh ❯ kdesc etcd-quorum-guard-7986975d98-blxlv Pod: etcd-quorum-guard-7986975d98-blxlv Name: etcd-quorum-guard-7986975d98-blxlv Namespace: openshift-etcd Priority: 2000000000 Priority Class Name: system-cluster-critical Node: ocp-9s46k-master-2/192.168.1.134 Start Time: Wed, 07 Oct 2020 21:17:47 -0300 Labels: k8s-app=etcd-quorum-guard name=etcd-quorum-guard pod-template-hash=7986975d98 Annotations: <none> Status: Running IP: 192.168.1.134 IPs: IP: 192.168.1.134 Controlled By: ReplicaSet/etcd-quorum-guard-7986975d98 Containers: guard: Container ID: cri-o://bb448e4d6e1ee47c752fc29a452c4c801c57c25f0bd47f62314fb8ed6ee1974c Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8f5420d0c2a29acbc91319cb1e1e0f20f6cf859706d3cf20b56fd2d6922011d9 Image ID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8f5420d0c2a29acbc91319cb1e1e0f20f6cf859706d3cf20b56fd2d6922011d9 Port: <none> Host Port: <none> Command: /bin/bash Args: -c # properly handle TERM and exit as soon as it is signaled set -euo pipefail trap 'jobs -p | 
xargs -r kill; exit 0' TERM sleep infinity & wait State: Running Started: Wed, 07 Oct 2020 21:26:39 -0300 Ready: False Restart Count: 0 Requests: cpu: 10m memory: 5Mi Readiness: exec [/bin/sh -c declare -r health_endpoint="https://localhost:2379/health" declare -r cert="/var/run/secrets/etcd-client/tls.crt" declare -r key="/var/run/secrets/etcd-client/tls.key" declare -r cacert="/var/run/configmaps/etcd-ca/ca-bundle.crt" export NSS_SDB_USE_CACHE=no [[ -z $cert || -z $key ]] && exit 1 curl --max-time 2 --silent --cert "${cert//:/\:}" --key "$key" --cacert "$cacert" "$health_endpoint" |grep '{ *"health" *: *"true" *}' ] delay=5s timeout=3s period=5s #success=1 #failure=3 Environment: <none> Mounts: /var/run/configmaps/etcd-ca from etcd-ca (rw) /var/run/secrets/etcd-client from etcd-client (rw) /var/run/secrets/kubernetes.io/serviceaccount from default-token-qfwcl (ro) Conditions: Type Status Initialized True Ready False ContainersReady False PodScheduled True Volumes: etcd-client: Type: Secret (a volume populated by a Secret) SecretName: etcd-client Optional: false etcd-ca: Type: ConfigMap (a volume populated by a ConfigMap) Name: etcd-ca-bundle Optional: false default-token-qfwcl: Type: Secret (a volume populated by a Secret) SecretName: default-token-qfwcl Optional: false QoS Class: Burstable Node-Selectors: node-role.kubernetes.io/master= Tolerations: node-role.kubernetes.io/etcd:NoSchedule op=Exists node-role.kubernetes.io/master:NoSchedule op=Exists node.kubernetes.io/memory-pressure:NoSchedule op=Exists node.kubernetes.io/not-ready:NoExecute op=Exists node.kubernetes.io/unreachable:NoExecute op=Exists Events: Type Reason Age From Message ---- ------ ---- ---- ------- Warning FailedScheduling 46m (x2 over 46m) default-scheduler 0/8 nodes are available: 1 node(s) were unschedulable, 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't match pod anti-affinity rules, 5 node(s) didn't match node selector. 
Normal Scheduled 46m default-scheduler Successfully assigned openshift-etcd/etcd-quorum-guard-7986975d98-blxlv to ocp-9s46k-master-2 Normal Pulled 46m kubelet Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8f5420d0c2a29acbc91319cb1e1e0f20f6cf859706d3cf20b56fd2d6922011d9" already present on machine Normal Created 46m kubelet Created container guard Normal Started 46m kubelet Started container guard Warning Unhealthy 41m (x59 over 45m) kubelet Readiness probe failed: Normal Pulled 37m kubelet Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8f5420d0c2a29acbc91319cb1e1e0f20f6cf859706d3cf20b56fd2d6922011d9" already present on machine Normal Created 37m kubelet Created container guard Normal Started 37m kubelet Started container guard Warning Unhealthy 2m19s (x419 over 37m) kubelet Readiness probe failed: ``` ETCD ```sh ❯ kdesc etcd-ocp-9s46k-master-2 Pod: etcd-ocp-9s46k-master-2 Name: etcd-ocp-9s46k-master-2 Namespace: openshift-etcd Priority: 2000001000 Priority Class Name: system-node-critical Node: ocp-9s46k-master-2/192.168.1.134 Start Time: Wed, 07 Oct 2020 19:22:29 -0300 Labels: app=etcd etcd=true k8s-app=etcd revision=7 Annotations: kubernetes.io/config.hash: 56249732f6da17545c2ce7cba3ba5f13 kubernetes.io/config.mirror: 56249732f6da17545c2ce7cba3ba5f13 kubernetes.io/config.seen: 2020-10-08T00:26:26.662289151Z kubernetes.io/config.source: file Status: Running IP: 192.168.1.134 IPs: IP: 192.168.1.134 Controlled By: Node/ocp-9s46k-master-2 Init Containers: etcd-ensure-env-vars: Container ID: cri-o://13748fe8352d0c17d2e0db635a0682f5414feef9351bbd88b12905e48c00ccf3 Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c2b849e2337c2b12cb3feced6735fd6278c80a7ec08c8292f0f58cf290c566df Image ID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c2b849e2337c2b12cb3feced6735fd6278c80a7ec08c8292f0f58cf290c566df Port: <none> Host Port: <none> Command: /bin/sh -c #!/bin/sh set -euo pipefail : 
"${NODE_ocp_9s46k_master_2_ETCD_URL_HOST?not set}" : "${NODE_ocp_9s46k_master_2_ETCD_NAME?not set}" : "${NODE_ocp_9s46k_master_2_IP?not set}" # check for ipv4 addresses as well as ipv6 addresses with extra square brackets if [[ "${NODE_ocp_9s46k_master_2_IP}" != "${NODE_IP}" && "${NODE_ocp_9s46k_master_2_IP}" != "[${NODE_IP}]" ]]; then # echo the error message to stderr echo "Expected node IP to be ${NODE_IP} got ${NODE_ocp_9s46k_master_2_IP}" >&2 exit 1 fi # check for ipv4 addresses as well as ipv6 addresses with extra square brackets if [[ "${NODE_ocp_9s46k_master_2_ETCD_URL_HOST}" != "${NODE_IP}" && "${NODE_ocp_9s46k_master_2_ETCD_URL_HOST}" != "[${NODE_IP}]" ]]; then # echo the error message to stderr echo "Expected etcd url host to be ${NODE_IP} got ${NODE_ocp_9s46k_master_2_ETCD_URL_HOST}" >&2 exit 1 fi State: Terminated Reason: Completed Exit Code: 0 Started: Wed, 07 Oct 2020 21:26:38 -0300 Finished: Wed, 07 Oct 2020 21:26:38 -0300 Ready: True Restart Count: 0 Requests: cpu: 30m memory: 60Mi Environment: ALL_ETCD_ENDPOINTS: https://192.168.1.134:2379,https://192.168.1.136:2379,https://192.168.1.179:2379 ETCDCTL_API: 3 ETCDCTL_CACERT: /etc/kubernetes/static-pod-certs/configmaps/etcd-serving-ca/ca-bundle.crt ETCDCTL_CERT: /etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-ocp-9s46k-master-2.crt ETCDCTL_ENDPOINTS: https://192.168.1.134:2379,https://192.168.1.136:2379,https://192.168.1.179:2379 ETCDCTL_KEY: /etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-ocp-9s46k-master-2.key ETCD_DATA_DIR: /var/lib/etcd ETCD_ELECTION_TIMEOUT: 1000 ETCD_ENABLE_PPROF: true ETCD_HEARTBEAT_INTERVAL: 100 ETCD_IMAGE: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c2b849e2337c2b12cb3feced6735fd6278c80a7ec08c8292f0f58cf290c566df ETCD_INITIAL_CLUSTER_STATE: existing ETCD_QUOTA_BACKEND_BYTES: 7516192768 NODE_ocp_9s46k_master_0_ETCD_NAME: ocp-9s46k-master-0 NODE_ocp_9s46k_master_0_ETCD_URL_HOST: 192.168.1.179 NODE_ocp_9s46k_master_0_IP: 192.168.1.179 
NODE_ocp_9s46k_master_1_ETCD_NAME: ocp-9s46k-master-1 NODE_ocp_9s46k_master_1_ETCD_URL_HOST: 192.168.1.136 NODE_ocp_9s46k_master_1_IP: 192.168.1.136 NODE_ocp_9s46k_master_2_ETCD_NAME: ocp-9s46k-master-2 NODE_ocp_9s46k_master_2_ETCD_URL_HOST: 192.168.1.134 NODE_ocp_9s46k_master_2_IP: 192.168.1.134 NODE_IP: (v1:status.podIP) Mounts: <none> etcd-resources-copy: Container ID: cri-o://6ab451193ffbb8352a88cd799c51df0e7c6cc0c669c66775b6783b014fa41a1f Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c2b849e2337c2b12cb3feced6735fd6278c80a7ec08c8292f0f58cf290c566df Image ID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c2b849e2337c2b12cb3feced6735fd6278c80a7ec08c8292f0f58cf290c566df Port: <none> Host Port: <none> Command: /bin/sh -c #!/bin/sh set -euo pipefail cp /etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-ocp-9s46k-master-2.crt /etc/kubernetes/etcd-backup-dir/system:etcd-peer-ocp-9s46k-master-2.crt cp /etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-ocp-9s46k-master-2.key /etc/kubernetes/etcd-backup-dir/system:etcd-peer-ocp-9s46k-master-2.key rm -f $(grep -l '^### Created by cluster-etcd-operator' /usr/local/bin/*) cp -p /etc/kubernetes/static-pod-certs/configmaps/etcd-scripts/*.sh /usr/local/bin State: Terminated Reason: Completed Exit Code: 0 Started: Wed, 07 Oct 2020 21:26:41 -0300 Finished: Wed, 07 Oct 2020 21:26:41 -0300 Ready: True Restart Count: 0 Requests: cpu: 30m memory: 60Mi Environment: <none> Mounts: /etc/kubernetes/etcd-backup-dir from etcd-backup-dir (rw) /etc/kubernetes/static-pod-certs from cert-dir (rw) /etc/kubernetes/static-pod-resources from resource-dir (rw) /usr/local/bin from usr-local-bin (rw) Containers: etcdctl: Container ID: cri-o://44e9fefdd13320549f0cc33ede351885cddfcf6e8a95967b4346512a5188c8ed Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c2b849e2337c2b12cb3feced6735fd6278c80a7ec08c8292f0f58cf290c566df Image ID: 
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c2b849e2337c2b12cb3feced6735fd6278c80a7ec08c8292f0f58cf290c566df Port: <none> Host Port: <none> Command: /bin/bash -c trap TERM INT; sleep infinity & wait State: Running Started: Wed, 07 Oct 2020 21:26:44 -0300 Ready: True Restart Count: 0 Requests: cpu: 30m memory: 60Mi Environment: ALL_ETCD_ENDPOINTS: https://192.168.1.134:2379,https://192.168.1.136:2379,https://192.168.1.179:2379 ETCDCTL_API: 3 ETCDCTL_CACERT: /etc/kubernetes/static-pod-certs/configmaps/etcd-serving-ca/ca-bundle.crt ETCDCTL_CERT: /etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-ocp-9s46k-master-2.crt ETCDCTL_ENDPOINTS: https://192.168.1.134:2379,https://192.168.1.136:2379,https://192.168.1.179:2379 ETCDCTL_KEY: /etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-ocp-9s46k-master-2.key ETCD_DATA_DIR: /var/lib/etcd ETCD_ELECTION_TIMEOUT: 1000 ETCD_ENABLE_PPROF: true ETCD_HEARTBEAT_INTERVAL: 100 ETCD_IMAGE: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c2b849e2337c2b12cb3feced6735fd6278c80a7ec08c8292f0f58cf290c566df ETCD_INITIAL_CLUSTER_STATE: existing ETCD_QUOTA_BACKEND_BYTES: 7516192768 NODE_ocp_9s46k_master_0_ETCD_NAME: ocp-9s46k-master-0 NODE_ocp_9s46k_master_0_ETCD_URL_HOST: 192.168.1.179 NODE_ocp_9s46k_master_0_IP: 192.168.1.179 NODE_ocp_9s46k_master_1_ETCD_NAME: ocp-9s46k-master-1 NODE_ocp_9s46k_master_1_ETCD_URL_HOST: 192.168.1.136 NODE_ocp_9s46k_master_1_IP: 192.168.1.136 NODE_ocp_9s46k_master_2_ETCD_NAME: ocp-9s46k-master-2 NODE_ocp_9s46k_master_2_ETCD_URL_HOST: 192.168.1.134 NODE_ocp_9s46k_master_2_IP: 192.168.1.134 Mounts: /etc/kubernetes/etcd-backup-dir from etcd-backup-dir (rw) /etc/kubernetes/manifests from static-pod-dir (rw) /etc/kubernetes/static-pod-certs from cert-dir (rw) /etc/kubernetes/static-pod-resources from resource-dir (rw) /var/lib/etcd/ from data-dir (rw) etcd: Container ID: cri-o://9f41d4b16b9f3f74924f504cffc225ac943e664bf87c5c7056e6fbe41abf47ce Image: 
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c2b849e2337c2b12cb3feced6735fd6278c80a7ec08c8292f0f58cf290c566df Image ID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c2b849e2337c2b12cb3feced6735fd6278c80a7ec08c8292f0f58cf290c566df Port: <none> Host Port: <none> Command: /bin/sh -c #!/bin/sh set -euo pipefail etcdctl member list || true # this has a non-zero return code if the command is non-zero. If you use an export first, it doesn't and you # will succeed when you should fail. ETCD_INITIAL_CLUSTER=$(discover-etcd-initial-cluster \ --cacert=/etc/kubernetes/static-pod-certs/configmaps/etcd-serving-ca/ca-bundle.crt \ --cert=/etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-ocp-9s46k-master-2.crt \ --key=/etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-ocp-9s46k-master-2.key \ --endpoints=${ALL_ETCD_ENDPOINTS} \ --data-dir=/var/lib/etcd \ --target-peer-url-host=${NODE_ocp_9s46k_master_2_ETCD_URL_HOST} \ --target-name=ocp-9s46k-master-2) export ETCD_INITIAL_CLUSTER # at this point we know this member is added. To support a transition, we must remove the old etcd pod. # move it somewhere safe so we can retrieve it again later if something goes badly. mv /etc/kubernetes/manifests/etcd-member.yaml /etc/kubernetes/etcd-backup-dir || true # we cannot use the "normal" port conflict initcontainer because when we upgrade, the existing static pod will never yield, # so we do the detection in etcd container itsefl. echo -n "Waiting for ports 2379, 2380 and 9978 to be released." while [ -n "$(ss -Htan '( sport = 2379 or sport = 2380 or sport = 9978 )')" ]; do echo -n "." 
sleep 1 done export ETCD_NAME=${NODE_ocp_9s46k_master_2_ETCD_NAME} env | grep ETCD | grep -v NODE set -x # See https://etcd.io/docs/v3.4.0/tuning/ for why we use ionice exec ionice -c2 -n0 etcd \ --initial-advertise-peer-urls=https://${NODE_ocp_9s46k_master_2_IP}:2380 \ --cert-file=/etc/kubernetes/static-pod-certs/secrets/etcd-all-serving/etcd-serving-ocp-9s46k-master-2.crt \ --key-file=/etc/kubernetes/static-pod-certs/secrets/etcd-all-serving/etcd-serving-ocp-9s46k-master-2.key \ --trusted-ca-file=/etc/kubernetes/static-pod-certs/configmaps/etcd-serving-ca/ca-bundle.crt \ --client-cert-auth=true \ --peer-cert-file=/etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-ocp-9s46k-master-2.crt \ --peer-key-file=/etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-ocp-9s46k-master-2.key \ --peer-trusted-ca-file=/etc/kubernetes/static-pod-certs/configmaps/etcd-peer-client-ca/ca-bundle.crt \ --peer-client-cert-auth=true \ --advertise-client-urls=https://${NODE_ocp_9s46k_master_2_IP}:2379 \ --listen-client-urls=https://0.0.0.0:2379 \ --listen-peer-urls=https://0.0.0.0:2380 \ --listen-metrics-urls=https://0.0.0.0:9978 || mv /etc/kubernetes/etcd-backup-dir/etcd-member.yaml /etc/kubernetes/manifests State: Running Started: Wed, 07 Oct 2020 21:26:46 -0300 Ready: True Restart Count: 0 Requests: cpu: 300m memory: 600Mi Readiness: exec [/bin/sh -ec lsof -n -i :2380 | grep LISTEN] delay=3s timeout=5s period=5s #success=1 #failure=3 Environment: ALL_ETCD_ENDPOINTS: https://192.168.1.134:2379,https://192.168.1.136:2379,https://192.168.1.179:2379 ETCDCTL_API: 3 ETCDCTL_CACERT: /etc/kubernetes/static-pod-certs/configmaps/etcd-serving-ca/ca-bundle.crt ETCDCTL_CERT: /etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-ocp-9s46k-master-2.crt ETCDCTL_ENDPOINTS: https://192.168.1.134:2379,https://192.168.1.136:2379,https://192.168.1.179:2379 ETCDCTL_KEY: /etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-ocp-9s46k-master-2.key 
ETCD_DATA_DIR: /var/lib/etcd ETCD_ELECTION_TIMEOUT: 1000 ETCD_ENABLE_PPROF: true ETCD_HEARTBEAT_INTERVAL: 100 ETCD_IMAGE: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c2b849e2337c2b12cb3feced6735fd6278c80a7ec08c8292f0f58cf290c566df ETCD_INITIAL_CLUSTER_STATE: existing ETCD_QUOTA_BACKEND_BYTES: 7516192768 NODE_ocp_9s46k_master_0_ETCD_NAME: ocp-9s46k-master-0 NODE_ocp_9s46k_master_0_ETCD_URL_HOST: 192.168.1.179 NODE_ocp_9s46k_master_0_IP: 192.168.1.179 NODE_ocp_9s46k_master_1_ETCD_NAME: ocp-9s46k-master-1 NODE_ocp_9s46k_master_1_ETCD_URL_HOST: 192.168.1.136 NODE_ocp_9s46k_master_1_IP: 192.168.1.136 NODE_ocp_9s46k_master_2_ETCD_NAME: ocp-9s46k-master-2 NODE_ocp_9s46k_master_2_ETCD_URL_HOST: 192.168.1.134 NODE_ocp_9s46k_master_2_IP: 192.168.1.134 Mounts: /etc/kubernetes/etcd-backup-dir from etcd-backup-dir (rw) /etc/kubernetes/manifests from static-pod-dir (rw) /etc/kubernetes/static-pod-certs from cert-dir (rw) /etc/kubernetes/static-pod-resources from resource-dir (rw) /var/lib/etcd/ from data-dir (rw) etcd-metrics: Container ID: cri-o://d457a867c69c04144596ca936da1ccfe76d38ea29feb9ae065b9c5d7394990ea Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c2b849e2337c2b12cb3feced6735fd6278c80a7ec08c8292f0f58cf290c566df Image ID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c2b849e2337c2b12cb3feced6735fd6278c80a7ec08c8292f0f58cf290c566df Port: <none> Host Port: <none> Command: /bin/sh -c #!/bin/sh set -euo pipefail export ETCD_NAME=${NODE_ocp_9s46k_master_2_ETCD_NAME} exec etcd grpc-proxy start \ --endpoints https://${NODE_ocp_9s46k_master_2_ETCD_URL_HOST}:9978 \ --metrics-addr https://0.0.0.0:9979 \ --listen-addr 127.0.0.1:9977 \ --key /etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-ocp-9s46k-master-2.key \ --key-file /etc/kubernetes/static-pod-certs/secrets/etcd-all-serving-metrics/etcd-serving-metrics-ocp-9s46k-master-2.key \ --cert /etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-ocp-9s46k-master-2.crt \ 
--cert-file /etc/kubernetes/static-pod-certs/secrets/etcd-all-serving-metrics/etcd-serving-metrics-ocp-9s46k-master-2.crt \ --cacert /etc/kubernetes/static-pod-certs/configmaps/etcd-peer-client-ca/ca-bundle.crt \ --trusted-ca-file /etc/kubernetes/static-pod-certs/configmaps/etcd-metrics-proxy-serving-ca/ca-bundle.crt State: Running Started: Wed, 07 Oct 2020 21:26:47 -0300 Ready: True Restart Count: 0 Requests: cpu: 100m memory: 200Mi Environment: ALL_ETCD_ENDPOINTS: https://192.168.1.134:2379,https://192.168.1.136:2379,https://192.168.1.179:2379 ETCDCTL_API: 3 ETCDCTL_CACERT: /etc/kubernetes/static-pod-certs/configmaps/etcd-serving-ca/ca-bundle.crt ETCDCTL_CERT: /etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-ocp-9s46k-master-2.crt ETCDCTL_ENDPOINTS: https://192.168.1.134:2379,https://192.168.1.136:2379,https://192.168.1.179:2379 ETCDCTL_KEY: /etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-ocp-9s46k-master-2.key ETCD_DATA_DIR: /var/lib/etcd ETCD_ELECTION_TIMEOUT: 1000 ETCD_ENABLE_PPROF: true ETCD_HEARTBEAT_INTERVAL: 100 ETCD_IMAGE: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c2b849e2337c2b12cb3feced6735fd6278c80a7ec08c8292f0f58cf290c566df ETCD_INITIAL_CLUSTER_STATE: existing ETCD_QUOTA_BACKEND_BYTES: 7516192768 NODE_ocp_9s46k_master_0_ETCD_NAME: ocp-9s46k-master-0 NODE_ocp_9s46k_master_0_ETCD_URL_HOST: 192.168.1.179 NODE_ocp_9s46k_master_0_IP: 192.168.1.179 NODE_ocp_9s46k_master_1_ETCD_NAME: ocp-9s46k-master-1 NODE_ocp_9s46k_master_1_ETCD_URL_HOST: 192.168.1.136 NODE_ocp_9s46k_master_1_IP: 192.168.1.136 NODE_ocp_9s46k_master_2_ETCD_NAME: ocp-9s46k-master-2 NODE_ocp_9s46k_master_2_ETCD_URL_HOST: 192.168.1.134 NODE_ocp_9s46k_master_2_IP: 192.168.1.134 Mounts: /etc/kubernetes/static-pod-certs from cert-dir (rw) /etc/kubernetes/static-pod-resources from resource-dir (rw) /var/lib/etcd/ from data-dir (rw) Conditions: Type Status Initialized True Ready True ContainersReady True PodScheduled True Volumes: static-pod-dir: Type: 
HostPath (bare host directory volume) Path: /etc/kubernetes/manifests HostPathType: etcd-backup-dir: Type: HostPath (bare host directory volume) Path: /etc/kubernetes/static-pod-resources/etcd-member HostPathType: resource-dir: Type: HostPath (bare host directory volume) Path: /etc/kubernetes/static-pod-resources/etcd-pod-7 HostPathType: cert-dir: Type: HostPath (bare host directory volume) Path: /etc/kubernetes/static-pod-resources/etcd-certs HostPathType: data-dir: Type: HostPath (bare host directory volume) Path: /var/lib/etcd HostPathType: usr-local-bin: Type: HostPath (bare host directory volume) Path: /usr/local/bin HostPathType: QoS Class: Burstable Node-Selectors: <none> Tolerations: op=Exists Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal Created 121m kubelet Created container etcd-ensure-env-vars Normal Started 121m kubelet Started container etcd-ensure-env-vars Normal Pulled 121m kubelet Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c2b849e2337c2b12cb3feced6735fd6278c80a7ec08c8292f0f58cf290c566df" already present on machine Normal Pulled 121m kubelet Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c2b849e2337c2b12cb3feced6735fd6278c80a7ec08c8292f0f58cf290c566df" already present on machine Normal Created 121m kubelet Created container etcd-resources-copy Normal Started 121m kubelet Started container etcd-resources-copy Normal Pulled 120m kubelet Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c2b849e2337c2b12cb3feced6735fd6278c80a7ec08c8292f0f58cf290c566df" already present on machine Normal Created 120m kubelet Created container etcdctl Normal Created 120m kubelet Created container etcd Normal Pulled 120m kubelet Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c2b849e2337c2b12cb3feced6735fd6278c80a7ec08c8292f0f58cf290c566df" already present on machine Normal Started 120m kubelet Started container etcdctl Normal Started 120m kubelet 
Started container etcd Normal Pulled 120m kubelet Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c2b849e2337c2b12cb3feced6735fd6278c80a7ec08c8292f0f58cf290c566df" already present on machine Normal Created 120m kubelet Created container etcd-metrics Normal Started 120m kubelet Started container etcd-metrics Warning Unhealthy 120m kubelet Readiness probe failed: Normal Pulled 44m kubelet Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c2b849e2337c2b12cb3feced6735fd6278c80a7ec08c8292f0f58cf290c566df" already present on machine Normal Started 44m kubelet Started container etcd-ensure-env-vars Normal Pulled 44m kubelet Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c2b849e2337c2b12cb3feced6735fd6278c80a7ec08c8292f0f58cf290c566df" already present on machine Normal Created 44m kubelet Created container etcd-ensure-env-vars Normal Created 44m kubelet Created container etcd-resources-copy Normal Started 44m kubelet Started container etcd-resources-copy Normal Pulled 44m kubelet Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c2b849e2337c2b12cb3feced6735fd6278c80a7ec08c8292f0f58cf290c566df" already present on machine Normal Created 44m kubelet Created container etcdctl Normal Pulled 44m kubelet Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c2b849e2337c2b12cb3feced6735fd6278c80a7ec08c8292f0f58cf290c566df" already present on machine Normal Started 44m kubelet Started container etcdctl Normal Created 44m kubelet Created container etcd Normal Started 44m kubelet Started container etcd Normal Pulled 44m kubelet Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c2b849e2337c2b12cb3feced6735fd6278c80a7ec08c8292f0f58cf290c566df" already present on machine Normal Created 44m kubelet Created container etcd-metrics Normal Started 44m kubelet Started container etcd-metrics Warning Unhealthy 44m kubelet Readiness probe failed: ``` I've collected a 
must-gather on the cluster in this state. The file is at: https://drive.google.com/file/d/1xQVlG3UgkWF_85s8kWrmHlBZadRhmkcu/view?usp=sharing
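
For reference, the quorum guard's readiness probe in the pod description above pipes the etcd health response through `grep` and passes only when the body reports `"health": "true"`. Below is a minimal sketch of just that match, runnable outside the cluster. The `probe_matches` helper name and the sample response bodies are my own illustration, not something taken from the guard image or from this cluster.

```sh
#!/bin/sh
# probe_matches BODY: succeeds only if BODY passes the same grep pattern
# that the etcd-quorum-guard readiness probe uses (copied from the pod spec).
# The function name is hypothetical; only the pattern comes from the report.
probe_matches() {
  printf '%s\n' "$1" | grep -q '{ *"health" *: *"true" *}'
}

# Illustrative bodies (assumed, not captured from the cluster):
probe_matches '{"health":"true"}'  && echo "healthy body: probe passes"
probe_matches '{"health":"false"}' || echo "unhealthy body: probe fails"
probe_matches ''                   || echo "empty body: probe fails"
```

Because the probe runs `curl --silent`, a failed connection would leave the body empty, which might be why the events show `Readiness probe failed:` with no message after it.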
Several pods are unschedulable because machine-config-daemon can't evict etcd-quorum-guard:

```
2020-10-08T22:00:10.541608848Z I1008 22:00:10.541534 5493 daemon.go:344] evicting pod openshift-etcd/etcd-quorum-guard-7986975d98-bdxsg
2020-10-08T22:00:10.556118478Z E1008 22:00:10.556035 5493 daemon.go:344] error when evicting pod "etcd-quorum-guard-7986975d98-bdxsg" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
```

This pod logs failures:

```
2020-10-07T23:58:58.057356627Z kill:
2020-10-07T23:58:58.05770516Z sending signal to 7 failed
2020-10-07T23:58:58.057782408Z :
2020-10-07T23:58:58.057867723Z No such process
2020-10-08T01:15:10.456048553Z kill: sending signal to 6 failed: No such process
```

Moving to MCO. I'm not quite sure why the pod can't be stopped; it might be a container runtime issue.
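
The eviction error above is what the API server returns when a PodDisruptionBudget has no disruptions left. Here is a rough sketch of the arithmetic, assuming the quorum guard's budget is `maxUnavailable: 1` over 3 replicas (I have not confirmed that from this cluster) and that only 2 of the 3 guard pods are Ready, as in the pod listing in the description. The `disruptions_allowed` helper is hypothetical and only mirrors the calculation; it does not query the cluster.

```sh
#!/bin/sh
# disruptions_allowed HEALTHY EXPECTED MAX_UNAVAILABLE
# Mirrors how a PodDisruptionBudget with maxUnavailable is evaluated:
#   desiredHealthy     = expected - maxUnavailable
#   disruptionsAllowed = max(0, currentHealthy - desiredHealthy)
disruptions_allowed() {
  healthy="$1"
  expected="$2"
  max_unavailable="$3"
  desired=$((expected - max_unavailable))
  allowed=$((healthy - desired))
  if [ "$allowed" -lt 0 ]; then
    allowed=0
  fi
  echo "$allowed"
}

# Assumed state: 3 guard pods expected, 2 Ready, maxUnavailable=1.
disruptions_allowed 2 3 1   # prints 0: any further eviction is refused
# With all three guards Ready, one eviction would be allowed.
disruptions_allowed 3 3 1   # prints 1
```

Under those assumptions, the drain of master-0 cannot proceed until the guard on master-2 becomes Ready again, which would match the retry loop in the machine-config-daemon log.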
# namespaces/openshift-etcd/pods/etcd-ocp-9s46k-master-0/etcd/etcd/logs/current.log

> 2020-10-08T00:51:54.703793616Z 2020-10-08 00:51:54.703621 I | embed: rejected connection from "192.168.1.134:39132" (error "tls: \"192.168.1.134\" does not match any of DNSNames [\"localhost\" \"ocp.internal.carlosedp.com\" \"192.168.1.135\"] (lookup ocp.internal.carlosedp.com on 192.168.1.179:53: no such host)", ServerName "", IPAddresses ["192.168.1.135"], DNSNames ["localhost" "ocp.internal.carlosedp.com" "192.168.1.135"])

I'm fairly sure this is telling us that the caller's IP address changed when the node rebooted: the connection comes from 192.168.1.134, but the certificate only lists 192.168.1.135. Can we verify this? Per the documentation, a node's IP address must remain persistent across reboots.
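
To make the mismatch concrete, here is a small sketch that checks a caller IP against a certificate's SAN entries, using only the values printed in the log line above. The `san_matches` helper is my own name for illustration; on a real node the SAN list would have to be read from the peer certificate itself (for example with `openssl x509 -noout -text`), which this sketch does not do.

```sh
#!/bin/sh
# san_matches IP SAN...: succeeds if IP is exactly one of the SAN entries.
# Hypothetical helper; it does plain string comparison only and does not
# attempt the DNS lookup that etcd also tried in the log.
san_matches() {
  ip="$1"
  shift
  for san in "$@"; do
    if [ "$san" = "$ip" ]; then
      return 0
    fi
  done
  return 1
}

# Values copied from the "rejected connection" log entry.
caller_ip="192.168.1.134"

if san_matches "$caller_ip" localhost ocp.internal.carlosedp.com 192.168.1.135; then
  echo "caller IP is covered by the certificate"
else
  echo "mismatch: $caller_ip is not in the certificate SANs"
fi
```

With the logged values this takes the mismatch branch, which is consistent with the node having come back on a different address than the one its certificate was issued for.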
Hi Sam, I just confirmed that my DHCP server did this. The etcd-quorum-guard pod on master-0 was stuck; when I killed it, the node rebooted with the correct IP and the update finished successfully. I don't know why my DHCP server allocated a different IP to the node. It's fixed now. Thanks!
If the IP address does not match, the etcd operator should go degraded with a clear message and regenerate the certificates for the new IP address. This is being worked on in Bugzilla bug https://bugzilla.redhat.com/show_bug.cgi?id=1882176.

*** This bug has been marked as a duplicate of bug 1882176 ***