Description of problem:

Unable to properly upgrade a cluster from 4.10.0-fc.4 to 4.10.0-rc.0 because the etcd operator is degraded.

Version-Release number of selected component (if applicable): 4.10.0-rc.0

How reproducible: Unknown

Steps to Reproduce:
1. Create an aarch64 IPI-on-AWS cluster with OVN
2. Upgrade the cluster:
   oc image info quay.io/openshift-release-dev/ocp-release:4.10.0-rc.0-aarch64
   oc adm upgrade --allow-explicit-upgrade --to-image=quay.io/openshift-release-dev/ocp-release@sha256:8c767585e07a0b5626a20eb0a4078b2a0f042658e21813cd75349906fb4b1173 --force

Actual results:

Unable to apply 4.10.0-rc.0: wait has exceeded 40 minutes for these operators: etcd

Expected results:

The upgrade succeeds and no cluster operators are degraded.

Additional info:

02-02 10:11:54.990 ClusterID: 77483415-2c45-464a-be5c-647ca1ba3696
02-02 10:11:54.990 ClusterVersion: Updating to "4.10.0-rc.0" from "4.10.0-fc.4" for 51 minutes: Unable to apply 4.10.0-rc.0: wait has exceeded 40 minutes for these operators: etcd
02-02 10:11:55.246 ClusterOperators:
02-02 10:11:55.246 clusteroperator/etcd is degraded because EtcdCertSignerControllerDegraded: [x509: certificate is valid for 10.0.135.187, not 10.0.136.189, x509: certificate is valid for ::1, 10.0.135.187, 127.0.0.1, ::1, not 10.0.136.189]
02-02 10:11:55.246 clusteroperator/kube-scheduler is degraded because TargetConfigControllerDegraded: "configmap": scheduler Policy config has been removed upstream, this field remains for CRD compatibility but does nothing now. Please use a Profile instead (defaulting to LowNodeUtilization).
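As an aside for anyone triaging a similar failure, one way to compare what the etcd serving certificate actually covers against the addresses the node currently reports is sketched below. The etcd-serving-<node-name> secret naming is an assumption on my part, so adjust to whatever your cluster uses:

$ NODE=ip-10-0-135-187.us-east-2.compute.internal
# SANs in the node's etcd serving cert (assumed secret name pattern)
$ oc -n openshift-etcd get secret "etcd-serving-${NODE}" -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'
# InternalIP addresses the node object currently reports
$ oc get node "${NODE}" -o jsonpath='{range .status.addresses[?(@.type=="InternalIP")]}{.address}{"\n"}{end}'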
Poking around in the must-gather:

$ tar -xz --strip-components=2 <must-gather.tar.gz
$ yaml2json cluster-scoped-resources/config.openshift.io/clusterversions/version.yaml | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2022-02-02T12:28:50Z Available=True : Done applying 4.10.0-fc.4
2022-02-02T15:01:51Z Failing=True ClusterOperatorDegraded: Cluster operator etcd is degraded
2022-02-02T14:20:33Z Progressing=True ClusterOperatorDegraded: Unable to apply 4.10.0-rc.0: wait has exceeded 40 minutes for these operators: etcd
2022-02-02T12:04:17Z RetrievedUpdates=False VersionNotFound: Unable to retrieve available updates: currently reconciling cluster version 4.10.0-rc.0 not found in the "stable-4.10" channel
$ yaml2json cluster-scoped-resources/config.openshift.io/clusteroperators/etcd.yaml | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2022-02-02T13:20:15Z Degraded=True EtcdCertSignerController_Error: EtcdCertSignerControllerDegraded: [x509: certificate is valid for 10.0.135.187, not 10.0.136.189, x509: certificate is valid for ::1, 10.0.135.187, 127.0.0.1, ::1, not 10.0.136.189]
2022-02-02T14:25:24Z Progressing=False AsExpected: NodeInstallerProgressing: 3 nodes are at revision 8
EtcdMembersProgressing: No unstarted etcd members found
2022-02-02T12:10:45Z Available=True AsExpected: StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 8
EtcdMembersAvailable: 3 members are available
2022-02-02T12:09:08Z Upgradeable=True AsExpected: All is well
2022-02-02T12:09:09Z RecentBackup=Unknown ControllerStarted:

Looking at the nodes:

$ for NODE in cluster-scoped-resources/core/nodes/*; do yaml2json "${NODE}" | jq -r 'select(.metadata.labels["node-role.kubernetes.io/master"]).metadata.name'; done
ip-10-0-135-187.us-east-2.compute.internal
ip-10-0-172-14.us-east-2.compute.internal
ip-10-0-211-1.us-east-2.compute.internal

So yeah, a cert for 10.0.135.187 makes sense. Why is that cert being used for 10.0.136.189?

$ yaml2json cluster-scoped-resources/core/nodes/ip-10-0-135-187.us-east-2.compute.internal.yaml | jq -r '.status.addresses[] | .type + " " + .address'
InternalIP 10.0.135.187
InternalIP 10.0.136.189
Hostname ip-10-0-135-187.us-east-2.compute.internal
InternalDNS ip-10-0-135-187.us-east-2.compute.internal

Two InternalIPs. There was bug 1954129 in this space back in 4.8, which mentions that manual work may be needed if a node's InternalIP changes on update.

$ yaml2json cluster-scoped-resources/cloud.network.openshift.io/cloudprivateipconfigs/10.0.136.189.yaml | jq -r .metadata.creationTimestamp
2022-02-02T13:17:59Z

That's suspicious. Let's see if we can figure out where it came from in the audit logs:

$ zgrep -h '"verb":"create"' audit_logs/*/*.gz 2>/dev/null | grep 'cloudprivateipconfigs.*10.0.136.189' | jq -r .user.username
system:serviceaccount:openshift-ovn-kubernetes:ovn-kubernetes-controller

Can we find logs?
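The CloudPrivateIPConfig itself should also name the node the address was assigned to, which would tie the egress IP directly to the master; a sketch, assuming the cloud.network.openshift.io/v1 field names (spec.node / status.node):

$ yaml2json cluster-scoped-resources/cloud.network.openshift.io/cloudprivateipconfigs/10.0.136.189.yaml | jq -r '"spec.node=" + (.spec.node // "unset") + " status.node=" + (.status.node // "unset")'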
$ grep -r 'serviceAccount: ovn-kubernetes-controller'
namespaces/openshift-ovn-kubernetes/apps/daemonsets/ovnkube-master.yaml: serviceAccount: ovn-kubernetes-controller
namespaces/openshift-ovn-kubernetes/apps/daemonsets.yaml: serviceAccount: ovn-kubernetes-controller
namespaces/openshift-ovn-kubernetes/core/pods.yaml: serviceAccount: ovn-kubernetes-controller
namespaces/openshift-ovn-kubernetes/core/pods.yaml: serviceAccount: ovn-kubernetes-controller
namespaces/openshift-ovn-kubernetes/core/pods.yaml: serviceAccount: ovn-kubernetes-controller
namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-chmcc/ovnkube-master-chmcc.yaml: serviceAccount: ovn-kubernetes-controller
namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-g56fg/ovnkube-master-g56fg.yaml: serviceAccount: ovn-kubernetes-controller
namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-rg2tq/ovnkube-master-rg2tq.yaml: serviceAccount: ovn-kubernetes-controller
$ grep nodeName: namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-*/*.yaml
namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-chmcc/ovnkube-master-chmcc.yaml: nodeName: ip-10-0-211-1.us-east-2.compute.internal
namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-g56fg/ovnkube-master-g56fg.yaml: nodeName: ip-10-0-135-187.us-east-2.compute.internal
namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-rg2tq/ovnkube-master-rg2tq.yaml: nodeName: ip-10-0-172-14.us-east-2.compute.internal

Poking in logs:

$ head -n1 namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-g56fg/ovnkube-master/ovnkube-master/logs/*
==> namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-g56fg/ovnkube-master/ovnkube-master/logs/current.log <==
2022-02-02T12:15:57.986804393Z + [[ -f /env/_master ]]

==> namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-g56fg/ovnkube-master/ovnkube-master/logs/previous.log <==
2022-02-02T12:07:30.931083310Z + [[ -f /env/_master ]]

12:15 is before the 13:17 CloudPrivateIPConfig creation, so let's look at current.log:

$ grep -o '10[.]0[.][0-9]*[.][0-9]*' namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-g56fg/ovnkube-master/ovnkube-master/logs/current.log | sort -n | uniq -c
...
1427 10.0.135.187
   3 10.0.136.189
...
$ grep 10.0.136.189 namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-g56fg/ovnkube-master/ovnkube-master/logs/current.log
2022-02-02T13:17:59.857899233Z I0202 13:17:59.857836 1 transact.go:41] Configuring OVN: [{Op:insert Table:NAT Row:map[external_ids:{GoMap:map[name:egressip]} external_ip:10.0.136.189 ...
2022-02-02T13:54:35.377453412Z I0202 13:54:35.377393 1 transact.go:41]...
...

So it looks like OVN did indeed add that CloudPrivateIPConfig at 13:17, that this is what is leading to the etcd cert confusion, and that it is blocking the update. I don't see anything in the OVN logs that makes the source of that IP clear to me, so sending over to the OVN folks.
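If this must-gather happens to include the OVN-Kubernetes EgressIP objects, those would be the obvious place to look for where 10.0.136.189 came from; a sketch, with the collection path and schema assumed rather than confirmed:

$ ls cluster-scoped-resources/k8s.ovn.org/egressips/ 2>/dev/null
$ for EIP in cluster-scoped-resources/k8s.ovn.org/egressips/*; do yaml2json "${EIP}" | jq -r '.metadata.name + " egressIPs=" + (.spec.egressIPs | join(",")) + " assigned=" + ([.status.items[]? | .egressIP + "@" + .node] | join(","))'; done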
Was there an egress IP configured on one of the master nodes? If so, this is a duplicate of: https://bugzilla.redhat.com/show_bug.cgi?id=2039656
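A minimal way to check that from the live cluster, assuming the standard OVN-Kubernetes EgressIP CRD and egress-assignable node label:

# Which node each egress IP is currently assigned to
$ oc get egressip -o json | jq -r '.items[].status.items[]? | .egressIP + " -> " + .node'
# Any master showing up here is a candidate for this failure mode
$ oc get nodes -l k8s.ovn.org/egress-assignable -o name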
Hi, we have hit the same issue as well, upgrading a bare-metal OVN-Kubernetes cluster from 4.10.0-fc.4 to 4.10.0-rc.0, in our case with IPv6 addresses:

etcd 4.10.0-rc.1 True False True 23h EtcdCertSignerControllerDegraded: [x509: certificate is valid for 192.168.216.11, not fd01:1101::5743:2180:facf:2c3d, x509: certificate is valid for ::1, 127.0.0.1, 192.168.216.11, ::1, not fd01:1101::5743:2180:facf:2c3d, x509: certificate is valid for ::1, 127.0.0.1, 192.168.216.10, ::1, not fd01:1101::c098:ca86:2951:9c63, x509: certificate is valid for 192.168.216.10, not fd01:1101::c098:ca86:2951:9c63, x509: certificate is valid for 192.168.216.12, not fd01:1101::e25f:dcd3:1bec:bd22, x509: certificate is valid for ::1, 127.0.0.1, 192.168.216.12, ::1, not fd01:1101::e25f:dcd3:1bec:bd22]
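A quick way to confirm the same pattern here is to compare each master's reported InternalIP addresses with the addresses the certificate errors complain about; a sketch using oc and jq:

$ oc get nodes -l node-role.kubernetes.io/master -o json | jq -r '.items[] | .metadata.name + ": " + ([.status.addresses[] | select(.type == "InternalIP") | .address] | join(", "))'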
*** This bug has been marked as a duplicate of bug 2039656 ***
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days