Bug 1886771 - Updating Openshift 4.5.14 to 4.6.0-rc.0 made etcd unavailable on the first updated master
Keywords:
Status: CLOSED DUPLICATE of bug 1882176
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.6
Hardware: All
OS: All
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Antonio Murdaca
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-10-09 11:36 UTC by Carlos de Paula
Modified: 2020-10-09 14:29 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-09 14:07:53 UTC
Target Upstream Version:
Embargoed:



Description Carlos de Paula 2020-10-09 11:36:07 UTC
While updating OpenShift 4.5.14 to 4.6.0-rc.0, all operators except machine-config updated successfully. However, after the first master was updated and restarted, etcd on that node became disconnected from the other etcd instances, and the remaining nodes did not proceed with the update.

I checked the Cincinnati graph, and my current version (4.5.14) was eligible for the 4.6.0-rc.0 update.

I patched the cluster version to add the candidate-4.6 channel:
```sh
oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/channel", "value": "candidate-4.6"}]'
```

My nodes at the time of the error:

```sh
❯ oc get nodes
NAME                     STATUS                     ROLES           AGE    VERSION
ocp-9s46k-infra-mvg7f    Ready                      infra,worker    32h    v1.19.0+db1fc96
ocp-9s46k-infra-swl4z    Ready                      infra,worker    32h    v1.19.0+db1fc96
ocp-9s46k-master-0       Ready,SchedulingDisabled   master,worker   9d     v1.18.3+970c1b3
ocp-9s46k-master-1       Ready                      master,worker   9d     v1.18.3+970c1b3
ocp-9s46k-master-2       Ready                      master,worker   9d     v1.19.0+db1fc96
ocp-9s46k-worker-lhts2   Ready                      worker          9d     v1.19.0+db1fc96
ocp-9s46k-worker-r9bc2   Ready                      worker          9d     v1.19.0+db1fc96
ocp-9s46k-worker-wrc9l   Ready                      worker          173m   v1.19.0+db1fc96
```

Pending pods:

```sh
kubectl get pods -o wide --all-namespaces |grep -v Running |grep -v Completed               Naboo.internal.carlosedp.com: Wed Oct  7 19:18:31 2020

NAMESPACE              NAME                         READY   STATUS              RESTARTS   AGE   IP       NODE                     NOMINATED NODE   READINESS GATES
openshift-apiserver    apiserver-6648cbddbd-25kvw   0/2     Pending             0          20m   <none>   <none>                   <none>           <none>
openshift-apiserver    apiserver-74d96d87-9n827     0/1     Pending             0          20m   <none>   <none>                   <none>           <none>
openshift-workspaces   postgres-7bc69f8bf5-p6n2v    0/1     ContainerCreating   0          17m   <none>   ocp-9s46k-worker-r9bc2   <none>           <none>
```

ETCD Pods

```sh
❯ kp
NAME                                   READY   STATUS      RESTARTS   AGE    IP              NODE                 NOMINATED NODE   READINESS GATES
etcd-ocp-9s46k-master-0                3/3     Running     0          115m   192.168.1.179   ocp-9s46k-master-0   <none>           <none>
etcd-ocp-9s46k-master-1                3/3     Running     0          115m   192.168.1.136   ocp-9s46k-master-1   <none>           <none>
etcd-ocp-9s46k-master-2                3/3     Running     0          3m2s   192.168.1.134   ocp-9s46k-master-2   <none>           <none>
etcd-quorum-guard-7986975d98-bdxsg     1/1     Running     0          139m   192.168.1.179   ocp-9s46k-master-0   <none>           <none>
etcd-quorum-guard-7986975d98-blxlv     0/1     Running     0          53m    192.168.1.134   ocp-9s46k-master-2   <none>           <none>
etcd-quorum-guard-7986975d98-mf2z6     1/1     Running     0          130m   192.168.1.136   ocp-9s46k-master-1   <none>           <none>
installer-7-ocp-9s46k-master-1         0/1     Completed   0          120m   10.130.0.39     ocp-9s46k-master-1   <none>           <none>
installer-7-ocp-9s46k-master-2         0/1     Completed   0          123m   10.129.0.9      ocp-9s46k-master-2   <none>           <none>
revision-pruner-6-ocp-9s46k-master-1   0/1     Completed   0          129m   10.130.0.4      ocp-9s46k-master-1   <none>           <none>
revision-pruner-6-ocp-9s46k-master-2   0/1     Completed   0          126m   10.129.0.31     ocp-9s46k-master-2   <none>           <none>
revision-pruner-7-ocp-9s46k-master-1   0/1     Completed   0          115m   10.130.0.41     ocp-9s46k-master-1   <none>           <none>
revision-pruner-7-ocp-9s46k-master-2   0/1     Completed   0          120m   10.129.0.21     ocp-9s46k-master-2   <none>           <none>
```
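
In the listing above, the only sign of trouble is a single `0/1` in the READY column of a pod whose STATUS is still `Running`. A minimal sketch for picking such rows out of `oc get pods` style output follows. The function name `not_ready_pods` is hypothetical, and the filter assumes the default column layout shown above (NAME first, READY second, STATUS third).

```sh
# Hypothetical helper: print the names of pods whose READY column (x/y)
# has x != y, ignoring pods that have already Completed.
# Reads `oc get pods`-style text on stdin; lines whose second column is
# not of the form N/M (such as the header) are skipped.
not_ready_pods() {
  awk '$2 ~ /^[0-9]+\/[0-9]+$/ && $3 != "Completed" {
         split($2, r, "/")
         if (r[1] != r[2]) print $1
       }'
}
```

Assuming cluster access, it could be used as `oc get pods -n openshift-etcd | not_ready_pods`.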

etcd-quorum-guard pod on master-2

```sh
❯ kdesc etcd-quorum-guard-7986975d98-blxlv
Pod: etcd-quorum-guard-7986975d98-blxlv

Name:                 etcd-quorum-guard-7986975d98-blxlv
Namespace:            openshift-etcd
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 ocp-9s46k-master-2/192.168.1.134
Start Time:           Wed, 07 Oct 2020 21:17:47 -0300
Labels:               k8s-app=etcd-quorum-guard
                      name=etcd-quorum-guard
                      pod-template-hash=7986975d98
Annotations:          <none>
Status:               Running
IP:                   192.168.1.134
IPs:
  IP:           192.168.1.134
Controlled By:  ReplicaSet/etcd-quorum-guard-7986975d98
Containers:
  guard:
    Container ID:  cri-o://bb448e4d6e1ee47c752fc29a452c4c801c57c25f0bd47f62314fb8ed6ee1974c
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8f5420d0c2a29acbc91319cb1e1e0f20f6cf859706d3cf20b56fd2d6922011d9
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8f5420d0c2a29acbc91319cb1e1e0f20f6cf859706d3cf20b56fd2d6922011d9
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
    Args:
      -c
      # properly handle TERM and exit as soon as it is signaled
      set -euo pipefail
      trap 'jobs -p | xargs -r kill; exit 0' TERM
      sleep infinity & wait

    State:          Running
      Started:      Wed, 07 Oct 2020 21:26:39 -0300
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:      10m
      memory:   5Mi
    Readiness:  exec [/bin/sh -c declare -r health_endpoint="https://localhost:2379/health"
declare -r cert="/var/run/secrets/etcd-client/tls.crt"
declare -r key="/var/run/secrets/etcd-client/tls.key"
declare -r cacert="/var/run/configmaps/etcd-ca/ca-bundle.crt"
export NSS_SDB_USE_CACHE=no
[[ -z $cert || -z $key ]] && exit 1
curl --max-time 2 --silent --cert "${cert//:/\:}" --key "$key" --cacert "$cacert" "$health_endpoint" |grep '{ *"health" *: *"true" *}'
] delay=5s timeout=3s period=5s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /var/run/configmaps/etcd-ca from etcd-ca (rw)
      /var/run/secrets/etcd-client from etcd-client (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-qfwcl (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  etcd-client:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  etcd-client
    Optional:    false
  etcd-ca:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      etcd-ca-bundle
    Optional:  false
  default-token-qfwcl:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-qfwcl
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  node-role.kubernetes.io/master=
Tolerations:     node-role.kubernetes.io/etcd:NoSchedule op=Exists
                 node-role.kubernetes.io/master:NoSchedule op=Exists
                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists
                 node.kubernetes.io/unreachable:NoExecute op=Exists
Events:
  Type     Reason            Age                    From               Message
  ----     ------            ----                   ----               -------
  Warning  FailedScheduling  46m (x2 over 46m)      default-scheduler  0/8 nodes are available: 1 node(s) were unschedulable, 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't match pod anti-affinity rules, 5 node(s) didn't match node selector.
  Normal   Scheduled         46m                    default-scheduler  Successfully assigned openshift-etcd/etcd-quorum-guard-7986975d98-blxlv to ocp-9s46k-master-2
  Normal   Pulled            46m                    kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8f5420d0c2a29acbc91319cb1e1e0f20f6cf859706d3cf20b56fd2d6922011d9" already present on machine
  Normal   Created           46m                    kubelet            Created container guard
  Normal   Started           46m                    kubelet            Started container guard
  Warning  Unhealthy         41m (x59 over 45m)     kubelet            Readiness probe failed:
  Normal   Pulled            37m                    kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8f5420d0c2a29acbc91319cb1e1e0f20f6cf859706d3cf20b56fd2d6922011d9" already present on machine
  Normal   Created           37m                    kubelet            Created container guard
  Normal   Started           37m                    kubelet            Started container guard
  Warning  Unhealthy         2m19s (x419 over 37m)  kubelet            Readiness probe failed:
```
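
The readiness probe in the description above passes only when the body returned by the etcd health endpoint matches a fixed pattern. Because the probe runs `curl` with `--silent`, a failed connection or TLS handshake produces no output at all, which would be consistent with the empty `Readiness probe failed:` messages in the events. A minimal sketch of the pattern check on its own, with the hypothetical function name `is_healthy`:

```sh
# Hypothetical helper: succeeds only if the text on stdin matches the
# same pattern the etcd-quorum-guard readiness probe greps for.
is_healthy() {
  grep -q '{ *"health" *: *"true" *}'
}
```

For example, `printf '%s\n' '{"health":"true"}' | is_healthy` succeeds, while empty input (as when `curl` cannot connect) fails.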


etcd pod on master-2

```sh
❯ kdesc etcd-ocp-9s46k-master-2
Pod: etcd-ocp-9s46k-master-2

Name:                 etcd-ocp-9s46k-master-2
Namespace:            openshift-etcd
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 ocp-9s46k-master-2/192.168.1.134
Start Time:           Wed, 07 Oct 2020 19:22:29 -0300
Labels:               app=etcd
                      etcd=true
                      k8s-app=etcd
                      revision=7
Annotations:          kubernetes.io/config.hash: 56249732f6da17545c2ce7cba3ba5f13
                      kubernetes.io/config.mirror: 56249732f6da17545c2ce7cba3ba5f13
                      kubernetes.io/config.seen: 2020-10-08T00:26:26.662289151Z
                      kubernetes.io/config.source: file
Status:               Running
IP:                   192.168.1.134
IPs:
  IP:           192.168.1.134
Controlled By:  Node/ocp-9s46k-master-2
Init Containers:
  etcd-ensure-env-vars:
    Container ID:  cri-o://13748fe8352d0c17d2e0db635a0682f5414feef9351bbd88b12905e48c00ccf3
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c2b849e2337c2b12cb3feced6735fd6278c80a7ec08c8292f0f58cf290c566df
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c2b849e2337c2b12cb3feced6735fd6278c80a7ec08c8292f0f58cf290c566df
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -c
      #!/bin/sh
      set -euo pipefail

      : "${NODE_ocp_9s46k_master_2_ETCD_URL_HOST?not set}"
      : "${NODE_ocp_9s46k_master_2_ETCD_NAME?not set}"
      : "${NODE_ocp_9s46k_master_2_IP?not set}"

      # check for ipv4 addresses as well as ipv6 addresses with extra square brackets
      if [[ "${NODE_ocp_9s46k_master_2_IP}" != "${NODE_IP}" && "${NODE_ocp_9s46k_master_2_IP}" != "[${NODE_IP}]" ]]; then
        # echo the error message to stderr
        echo "Expected node IP to be ${NODE_IP} got ${NODE_ocp_9s46k_master_2_IP}" >&2
        exit 1
      fi

      # check for ipv4 addresses as well as ipv6 addresses with extra square brackets
      if [[ "${NODE_ocp_9s46k_master_2_ETCD_URL_HOST}" != "${NODE_IP}" && "${NODE_ocp_9s46k_master_2_ETCD_URL_HOST}" != "[${NODE_IP}]" ]]; then
        # echo the error message to stderr
        echo "Expected etcd url host to be ${NODE_IP} got ${NODE_ocp_9s46k_master_2_ETCD_URL_HOST}" >&2
        exit 1
      fi

    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 07 Oct 2020 21:26:38 -0300
      Finished:     Wed, 07 Oct 2020 21:26:38 -0300
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:     30m
      memory:  60Mi
    Environment:
      ALL_ETCD_ENDPOINTS:                     https://192.168.1.134:2379,https://192.168.1.136:2379,https://192.168.1.179:2379
      ETCDCTL_API:                            3
      ETCDCTL_CACERT:                         /etc/kubernetes/static-pod-certs/configmaps/etcd-serving-ca/ca-bundle.crt
      ETCDCTL_CERT:                           /etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-ocp-9s46k-master-2.crt
      ETCDCTL_ENDPOINTS:                      https://192.168.1.134:2379,https://192.168.1.136:2379,https://192.168.1.179:2379
      ETCDCTL_KEY:                            /etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-ocp-9s46k-master-2.key
      ETCD_DATA_DIR:                          /var/lib/etcd
      ETCD_ELECTION_TIMEOUT:                  1000
      ETCD_ENABLE_PPROF:                      true
      ETCD_HEARTBEAT_INTERVAL:                100
      ETCD_IMAGE:                             quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c2b849e2337c2b12cb3feced6735fd6278c80a7ec08c8292f0f58cf290c566df
      ETCD_INITIAL_CLUSTER_STATE:             existing
      ETCD_QUOTA_BACKEND_BYTES:               7516192768
      NODE_ocp_9s46k_master_0_ETCD_NAME:      ocp-9s46k-master-0
      NODE_ocp_9s46k_master_0_ETCD_URL_HOST:  192.168.1.179
      NODE_ocp_9s46k_master_0_IP:             192.168.1.179
      NODE_ocp_9s46k_master_1_ETCD_NAME:      ocp-9s46k-master-1
      NODE_ocp_9s46k_master_1_ETCD_URL_HOST:  192.168.1.136
      NODE_ocp_9s46k_master_1_IP:             192.168.1.136
      NODE_ocp_9s46k_master_2_ETCD_NAME:      ocp-9s46k-master-2
      NODE_ocp_9s46k_master_2_ETCD_URL_HOST:  192.168.1.134
      NODE_ocp_9s46k_master_2_IP:             192.168.1.134
      NODE_IP:                                 (v1:status.podIP)
    Mounts:                                   <none>
  etcd-resources-copy:
    Container ID:  cri-o://6ab451193ffbb8352a88cd799c51df0e7c6cc0c669c66775b6783b014fa41a1f
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c2b849e2337c2b12cb3feced6735fd6278c80a7ec08c8292f0f58cf290c566df
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c2b849e2337c2b12cb3feced6735fd6278c80a7ec08c8292f0f58cf290c566df
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -c
      #!/bin/sh
      set -euo pipefail

      cp /etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-ocp-9s46k-master-2.crt /etc/kubernetes/etcd-backup-dir/system:etcd-peer-ocp-9s46k-master-2.crt
      cp /etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-ocp-9s46k-master-2.key /etc/kubernetes/etcd-backup-dir/system:etcd-peer-ocp-9s46k-master-2.key
      rm -f $(grep -l '^### Created by cluster-etcd-operator' /usr/local/bin/*)
      cp -p /etc/kubernetes/static-pod-certs/configmaps/etcd-scripts/*.sh /usr/local/bin

    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 07 Oct 2020 21:26:41 -0300
      Finished:     Wed, 07 Oct 2020 21:26:41 -0300
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:        30m
      memory:     60Mi
    Environment:  <none>
    Mounts:
      /etc/kubernetes/etcd-backup-dir from etcd-backup-dir (rw)
      /etc/kubernetes/static-pod-certs from cert-dir (rw)
      /etc/kubernetes/static-pod-resources from resource-dir (rw)
      /usr/local/bin from usr-local-bin (rw)
Containers:
  etcdctl:
    Container ID:  cri-o://44e9fefdd13320549f0cc33ede351885cddfcf6e8a95967b4346512a5188c8ed
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c2b849e2337c2b12cb3feced6735fd6278c80a7ec08c8292f0f58cf290c566df
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c2b849e2337c2b12cb3feced6735fd6278c80a7ec08c8292f0f58cf290c566df
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -c
      trap TERM INT; sleep infinity & wait
    State:          Running
      Started:      Wed, 07 Oct 2020 21:26:44 -0300
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:     30m
      memory:  60Mi
    Environment:
      ALL_ETCD_ENDPOINTS:                     https://192.168.1.134:2379,https://192.168.1.136:2379,https://192.168.1.179:2379
      ETCDCTL_API:                            3
      ETCDCTL_CACERT:                         /etc/kubernetes/static-pod-certs/configmaps/etcd-serving-ca/ca-bundle.crt
      ETCDCTL_CERT:                           /etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-ocp-9s46k-master-2.crt
      ETCDCTL_ENDPOINTS:                      https://192.168.1.134:2379,https://192.168.1.136:2379,https://192.168.1.179:2379
      ETCDCTL_KEY:                            /etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-ocp-9s46k-master-2.key
      ETCD_DATA_DIR:                          /var/lib/etcd
      ETCD_ELECTION_TIMEOUT:                  1000
      ETCD_ENABLE_PPROF:                      true
      ETCD_HEARTBEAT_INTERVAL:                100
      ETCD_IMAGE:                             quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c2b849e2337c2b12cb3feced6735fd6278c80a7ec08c8292f0f58cf290c566df
      ETCD_INITIAL_CLUSTER_STATE:             existing
      ETCD_QUOTA_BACKEND_BYTES:               7516192768
      NODE_ocp_9s46k_master_0_ETCD_NAME:      ocp-9s46k-master-0
      NODE_ocp_9s46k_master_0_ETCD_URL_HOST:  192.168.1.179
      NODE_ocp_9s46k_master_0_IP:             192.168.1.179
      NODE_ocp_9s46k_master_1_ETCD_NAME:      ocp-9s46k-master-1
      NODE_ocp_9s46k_master_1_ETCD_URL_HOST:  192.168.1.136
      NODE_ocp_9s46k_master_1_IP:             192.168.1.136
      NODE_ocp_9s46k_master_2_ETCD_NAME:      ocp-9s46k-master-2
      NODE_ocp_9s46k_master_2_ETCD_URL_HOST:  192.168.1.134
      NODE_ocp_9s46k_master_2_IP:             192.168.1.134
    Mounts:
      /etc/kubernetes/etcd-backup-dir from etcd-backup-dir (rw)
      /etc/kubernetes/manifests from static-pod-dir (rw)
      /etc/kubernetes/static-pod-certs from cert-dir (rw)
      /etc/kubernetes/static-pod-resources from resource-dir (rw)
      /var/lib/etcd/ from data-dir (rw)
  etcd:
    Container ID:  cri-o://9f41d4b16b9f3f74924f504cffc225ac943e664bf87c5c7056e6fbe41abf47ce
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c2b849e2337c2b12cb3feced6735fd6278c80a7ec08c8292f0f58cf290c566df
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c2b849e2337c2b12cb3feced6735fd6278c80a7ec08c8292f0f58cf290c566df
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -c
      #!/bin/sh
      set -euo pipefail

      etcdctl member list || true

      # this has a non-zero return code if the command is non-zero.  If you use an export first, it doesn't and you
      # will succeed when you should fail.
      ETCD_INITIAL_CLUSTER=$(discover-etcd-initial-cluster \
        --cacert=/etc/kubernetes/static-pod-certs/configmaps/etcd-serving-ca/ca-bundle.crt \
        --cert=/etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-ocp-9s46k-master-2.crt \
        --key=/etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-ocp-9s46k-master-2.key \
        --endpoints=${ALL_ETCD_ENDPOINTS} \
        --data-dir=/var/lib/etcd \
        --target-peer-url-host=${NODE_ocp_9s46k_master_2_ETCD_URL_HOST} \
        --target-name=ocp-9s46k-master-2)
       export ETCD_INITIAL_CLUSTER

      # at this point we know this member is added.  To support a transition, we must remove the old etcd pod.
      # move it somewhere safe so we can retrieve it again later if something goes badly.
      mv /etc/kubernetes/manifests/etcd-member.yaml /etc/kubernetes/etcd-backup-dir || true

      # we cannot use the "normal" port conflict initcontainer because when we upgrade, the existing static pod will never yield,
      # so we do the detection in etcd container itsefl.
      echo -n "Waiting for ports 2379, 2380 and 9978 to be released."
      while [ -n "$(ss -Htan '( sport = 2379 or sport = 2380 or sport = 9978 )')" ]; do
        echo -n "."
        sleep 1
      done

      export ETCD_NAME=${NODE_ocp_9s46k_master_2_ETCD_NAME}
      env | grep ETCD | grep -v NODE

      set -x
      # See https://etcd.io/docs/v3.4.0/tuning/ for why we use ionice
      exec ionice -c2 -n0 etcd \
        --initial-advertise-peer-urls=https://${NODE_ocp_9s46k_master_2_IP}:2380 \
        --cert-file=/etc/kubernetes/static-pod-certs/secrets/etcd-all-serving/etcd-serving-ocp-9s46k-master-2.crt \
        --key-file=/etc/kubernetes/static-pod-certs/secrets/etcd-all-serving/etcd-serving-ocp-9s46k-master-2.key \
        --trusted-ca-file=/etc/kubernetes/static-pod-certs/configmaps/etcd-serving-ca/ca-bundle.crt \
        --client-cert-auth=true \
        --peer-cert-file=/etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-ocp-9s46k-master-2.crt \
        --peer-key-file=/etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-ocp-9s46k-master-2.key \
        --peer-trusted-ca-file=/etc/kubernetes/static-pod-certs/configmaps/etcd-peer-client-ca/ca-bundle.crt \
        --peer-client-cert-auth=true \
        --advertise-client-urls=https://${NODE_ocp_9s46k_master_2_IP}:2379 \
        --listen-client-urls=https://0.0.0.0:2379 \
        --listen-peer-urls=https://0.0.0.0:2380 \
        --listen-metrics-urls=https://0.0.0.0:9978 ||  mv /etc/kubernetes/etcd-backup-dir/etcd-member.yaml /etc/kubernetes/manifests

    State:          Running
      Started:      Wed, 07 Oct 2020 21:26:46 -0300
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:      300m
      memory:   600Mi
    Readiness:  exec [/bin/sh -ec lsof -n -i :2380 | grep LISTEN] delay=3s timeout=5s period=5s #success=1 #failure=3
    Environment:
      ALL_ETCD_ENDPOINTS:                     https://192.168.1.134:2379,https://192.168.1.136:2379,https://192.168.1.179:2379
      ETCDCTL_API:                            3
      ETCDCTL_CACERT:                         /etc/kubernetes/static-pod-certs/configmaps/etcd-serving-ca/ca-bundle.crt
      ETCDCTL_CERT:                           /etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-ocp-9s46k-master-2.crt
      ETCDCTL_ENDPOINTS:                      https://192.168.1.134:2379,https://192.168.1.136:2379,https://192.168.1.179:2379
      ETCDCTL_KEY:                            /etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-ocp-9s46k-master-2.key
      ETCD_DATA_DIR:                          /var/lib/etcd
      ETCD_ELECTION_TIMEOUT:                  1000
      ETCD_ENABLE_PPROF:                      true
      ETCD_HEARTBEAT_INTERVAL:                100
      ETCD_IMAGE:                             quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c2b849e2337c2b12cb3feced6735fd6278c80a7ec08c8292f0f58cf290c566df
      ETCD_INITIAL_CLUSTER_STATE:             existing
      ETCD_QUOTA_BACKEND_BYTES:               7516192768
      NODE_ocp_9s46k_master_0_ETCD_NAME:      ocp-9s46k-master-0
      NODE_ocp_9s46k_master_0_ETCD_URL_HOST:  192.168.1.179
      NODE_ocp_9s46k_master_0_IP:             192.168.1.179
      NODE_ocp_9s46k_master_1_ETCD_NAME:      ocp-9s46k-master-1
      NODE_ocp_9s46k_master_1_ETCD_URL_HOST:  192.168.1.136
      NODE_ocp_9s46k_master_1_IP:             192.168.1.136
      NODE_ocp_9s46k_master_2_ETCD_NAME:      ocp-9s46k-master-2
      NODE_ocp_9s46k_master_2_ETCD_URL_HOST:  192.168.1.134
      NODE_ocp_9s46k_master_2_IP:             192.168.1.134
    Mounts:
      /etc/kubernetes/etcd-backup-dir from etcd-backup-dir (rw)
      /etc/kubernetes/manifests from static-pod-dir (rw)
      /etc/kubernetes/static-pod-certs from cert-dir (rw)
      /etc/kubernetes/static-pod-resources from resource-dir (rw)
      /var/lib/etcd/ from data-dir (rw)
  etcd-metrics:
    Container ID:  cri-o://d457a867c69c04144596ca936da1ccfe76d38ea29feb9ae065b9c5d7394990ea
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c2b849e2337c2b12cb3feced6735fd6278c80a7ec08c8292f0f58cf290c566df
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c2b849e2337c2b12cb3feced6735fd6278c80a7ec08c8292f0f58cf290c566df
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -c
      #!/bin/sh
      set -euo pipefail

      export ETCD_NAME=${NODE_ocp_9s46k_master_2_ETCD_NAME}

      exec etcd grpc-proxy start \
        --endpoints https://${NODE_ocp_9s46k_master_2_ETCD_URL_HOST}:9978 \
        --metrics-addr https://0.0.0.0:9979 \
        --listen-addr 127.0.0.1:9977 \
        --key /etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-ocp-9s46k-master-2.key \
        --key-file /etc/kubernetes/static-pod-certs/secrets/etcd-all-serving-metrics/etcd-serving-metrics-ocp-9s46k-master-2.key \
        --cert /etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-ocp-9s46k-master-2.crt \
        --cert-file /etc/kubernetes/static-pod-certs/secrets/etcd-all-serving-metrics/etcd-serving-metrics-ocp-9s46k-master-2.crt \
        --cacert /etc/kubernetes/static-pod-certs/configmaps/etcd-peer-client-ca/ca-bundle.crt \
        --trusted-ca-file /etc/kubernetes/static-pod-certs/configmaps/etcd-metrics-proxy-serving-ca/ca-bundle.crt

    State:          Running
      Started:      Wed, 07 Oct 2020 21:26:47 -0300
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:     100m
      memory:  200Mi
    Environment:
      ALL_ETCD_ENDPOINTS:                     https://192.168.1.134:2379,https://192.168.1.136:2379,https://192.168.1.179:2379
      ETCDCTL_API:                            3
      ETCDCTL_CACERT:                         /etc/kubernetes/static-pod-certs/configmaps/etcd-serving-ca/ca-bundle.crt
      ETCDCTL_CERT:                           /etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-ocp-9s46k-master-2.crt
      ETCDCTL_ENDPOINTS:                      https://192.168.1.134:2379,https://192.168.1.136:2379,https://192.168.1.179:2379
      ETCDCTL_KEY:                            /etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-ocp-9s46k-master-2.key
      ETCD_DATA_DIR:                          /var/lib/etcd
      ETCD_ELECTION_TIMEOUT:                  1000
      ETCD_ENABLE_PPROF:                      true
      ETCD_HEARTBEAT_INTERVAL:                100
      ETCD_IMAGE:                             quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c2b849e2337c2b12cb3feced6735fd6278c80a7ec08c8292f0f58cf290c566df
      ETCD_INITIAL_CLUSTER_STATE:             existing
      ETCD_QUOTA_BACKEND_BYTES:               7516192768
      NODE_ocp_9s46k_master_0_ETCD_NAME:      ocp-9s46k-master-0
      NODE_ocp_9s46k_master_0_ETCD_URL_HOST:  192.168.1.179
      NODE_ocp_9s46k_master_0_IP:             192.168.1.179
      NODE_ocp_9s46k_master_1_ETCD_NAME:      ocp-9s46k-master-1
      NODE_ocp_9s46k_master_1_ETCD_URL_HOST:  192.168.1.136
      NODE_ocp_9s46k_master_1_IP:             192.168.1.136
      NODE_ocp_9s46k_master_2_ETCD_NAME:      ocp-9s46k-master-2
      NODE_ocp_9s46k_master_2_ETCD_URL_HOST:  192.168.1.134
      NODE_ocp_9s46k_master_2_IP:             192.168.1.134
    Mounts:
      /etc/kubernetes/static-pod-certs from cert-dir (rw)
      /etc/kubernetes/static-pod-resources from resource-dir (rw)
      /var/lib/etcd/ from data-dir (rw)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  static-pod-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/kubernetes/manifests
    HostPathType:
  etcd-backup-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/kubernetes/static-pod-resources/etcd-member
    HostPathType:
  resource-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/kubernetes/static-pod-resources/etcd-pod-7
    HostPathType:
  cert-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/kubernetes/static-pod-resources/etcd-certs
    HostPathType:
  data-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/etcd
    HostPathType:
  usr-local-bin:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/local/bin
    HostPathType:
QoS Class:         Burstable
Node-Selectors:    <none>
Tolerations:       op=Exists
Events:
  Type     Reason     Age   From     Message
  ----     ------     ----  ----     -------
  Normal   Created    121m  kubelet  Created container etcd-ensure-env-vars
  Normal   Started    121m  kubelet  Started container etcd-ensure-env-vars
  Normal   Pulled     121m  kubelet  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c2b849e2337c2b12cb3feced6735fd6278c80a7ec08c8292f0f58cf290c566df" already present on machine
  Normal   Pulled     121m  kubelet  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c2b849e2337c2b12cb3feced6735fd6278c80a7ec08c8292f0f58cf290c566df" already present on machine
  Normal   Created    121m  kubelet  Created container etcd-resources-copy
  Normal   Started    121m  kubelet  Started container etcd-resources-copy
  Normal   Pulled     120m  kubelet  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c2b849e2337c2b12cb3feced6735fd6278c80a7ec08c8292f0f58cf290c566df" already present on machine
  Normal   Created    120m  kubelet  Created container etcdctl
  Normal   Created    120m  kubelet  Created container etcd
  Normal   Pulled     120m  kubelet  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c2b849e2337c2b12cb3feced6735fd6278c80a7ec08c8292f0f58cf290c566df" already present on machine
  Normal   Started    120m  kubelet  Started container etcdctl
  Normal   Started    120m  kubelet  Started container etcd
  Normal   Pulled     120m  kubelet  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c2b849e2337c2b12cb3feced6735fd6278c80a7ec08c8292f0f58cf290c566df" already present on machine
  Normal   Created    120m  kubelet  Created container etcd-metrics
  Normal   Started    120m  kubelet  Started container etcd-metrics
  Warning  Unhealthy  120m  kubelet  Readiness probe failed:
  Normal   Pulled     44m   kubelet  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c2b849e2337c2b12cb3feced6735fd6278c80a7ec08c8292f0f58cf290c566df" already present on machine
  Normal   Started    44m   kubelet  Started container etcd-ensure-env-vars
  Normal   Pulled     44m   kubelet  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c2b849e2337c2b12cb3feced6735fd6278c80a7ec08c8292f0f58cf290c566df" already present on machine
  Normal   Created    44m   kubelet  Created container etcd-ensure-env-vars
  Normal   Created    44m   kubelet  Created container etcd-resources-copy
  Normal   Started    44m   kubelet  Started container etcd-resources-copy
  Normal   Pulled     44m   kubelet  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c2b849e2337c2b12cb3feced6735fd6278c80a7ec08c8292f0f58cf290c566df" already present on machine
  Normal   Created    44m   kubelet  Created container etcdctl
  Normal   Pulled     44m   kubelet  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c2b849e2337c2b12cb3feced6735fd6278c80a7ec08c8292f0f58cf290c566df" already present on machine
  Normal   Started    44m   kubelet  Started container etcdctl
  Normal   Created    44m   kubelet  Created container etcd
  Normal   Started    44m   kubelet  Started container etcd
  Normal   Pulled     44m   kubelet  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c2b849e2337c2b12cb3feced6735fd6278c80a7ec08c8292f0f58cf290c566df" already present on machine
  Normal   Created    44m   kubelet  Created container etcd-metrics
  Normal   Started    44m   kubelet  Started container etcd-metrics
  Warning  Unhealthy  44m   kubelet  Readiness probe failed:
```

I've collected a must-gather on the cluster in this state. The file is at: https://drive.google.com/file/d/1xQVlG3UgkWF_85s8kWrmHlBZadRhmkcu/view?usp=sharing

Comment 1 Vadim Rutkovsky 2020-10-09 11:47:33 UTC
Several pods are unschedulable as machine-config-daemon can't evict etcd-quorum-guard:
```
2020-10-08T22:00:10.541608848Z I1008 22:00:10.541534    5493 daemon.go:344] evicting pod openshift-etcd/etcd-quorum-guard-7986975d98-bdxsg
2020-10-08T22:00:10.556118478Z E1008 22:00:10.556035    5493 daemon.go:344] error when evicting pod "etcd-quorum-guard-7986975d98-bdxsg" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
```

This pod logs failures:
```
2020-10-07T23:58:58.057356627Z kill: 2020-10-07T23:58:58.05770516Z sending signal to 7 failed2020-10-07T23:58:58.057782408Z : 2020-10-07T23:58:58.057867723Z No such process
2020-10-08T01:15:10.456048553Z kill: sending signal to 6 failed: No such process
```

Moving to MCO. I'm not sure why the pod can't be stopped; it might be a container runtime issue.

Comment 2 Sam Batschelet 2020-10-09 12:57:07 UTC
# namespaces/openshift-etcd/pods/etcd-ocp-9s46k-master-0/etcd/etcd/logs/current.log

>  2020-10-08T00:51:54.703793616Z 2020-10-08 00:51:54.703621 I | embed: rejected connection from "192.168.1.134:39132" (error "tls: \"192.168.1.134\" does not match any of DNSNames [\"localhost\"   \"ocp.internal.carlosedp.com\" \"192.168.1.135\"] (lookup ocp.internal.carlosedp.com on 192.168.1.179:53: no such host)", ServerName "", IPAddresses ["192.168.1.135"], DNSNames ["localhost" "ocp.internal.carlosedp.com" "192.168.1.135"])

I'm fairly sure this means the caller's IP address changed when the node rebooted. Can we verify this? Per the documentation, a node's IP address must persist across reboots.
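
The mismatch can be read directly off the log line quoted above: the caller connects from one address, while the certificate it presents lists a different one under `IPAddresses`. A minimal sketch that extracts both values from such a line and compares them is shown below. The function name `check_peer_ip` is hypothetical, and the parsing assumes the exact message layout quoted above with an IPv4 caller address.

```sh
# Hypothetical helper: reads one etcd "rejected connection" log line on
# stdin and reports whether the caller's IP is among the certificate's
# IP addresses. Returns 0 on match, 1 on mismatch, 2 if the line does
# not contain a caller address in the expected form.
check_peer_ip() {
  line=$(cat)
  # Caller address is the host part of: rejected connection from "IP:PORT"
  caller=$(printf '%s\n' "$line" |
    sed -n 's/.*rejected connection from "\([^":]*\):[0-9]*".*/\1/p')
  # Certificate IPs are the quoted entries inside: IPAddresses [...]
  cert_ips=$(printf '%s\n' "$line" |
    sed -n 's/.*IPAddresses \[\([^]]*\)].*/\1/p' | tr -d '"')
  [ -n "$caller" ] || return 2
  for ip in $cert_ips; do
    if [ "$ip" = "$caller" ]; then
      echo "match: $caller is in certificate IPs"
      return 0
    fi
  done
  echo "mismatch: caller $caller, certificate IPs: $cert_ips"
  return 1
}
```

Fed the log line from this comment, the sketch reports caller `192.168.1.134` against certificate IP `192.168.1.135`, which is the discrepancy that points to a changed node address.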

Comment 3 Carlos de Paula 2020-10-09 14:07:53 UTC
Hi Sam, I just confirmed that my DHCP server did this. The etcd-quorum-guard pod on master-0 was stuck; when I killed it, the node rebooted with the correct IP and the update finished successfully.

I don't know why my DHCP server allocated a different IP to the node. It's fixed now.

Thanks!

Comment 4 Suresh Kolichala 2020-10-09 14:29:54 UTC
If the IP address does not match, the etcd operator should go degraded with a clear message and regenerate the certificates for the new IP address.

This is being worked on in the Bugzilla bug https://bugzilla.redhat.com/show_bug.cgi?id=1882176.

*** This bug has been marked as a duplicate of bug 1882176 ***

