Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2050403

Summary: Unable to upgrade from 4.10.0-fc.4 to 4.10.0-rc.0 because of etcd EtcdCertSignerControllerDegraded, from OVN-created CloudPrivateIPConfig
Product: OpenShift Container Platform Reporter: Paige Rubendall <prubenda>
Component: Networking    Assignee: Patryk Diak <pdiak>
Networking sub component: ovn-kubernetes QA Contact: Anurag saxena <anusaxen>
Status: CLOSED DUPLICATE Docs Contact:
Severity: unspecified    
Priority: unspecified CC: jlema, pdiak, william.caban, wking
Version: 4.10   
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-02-21 08:59:43 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Paige Rubendall 2022-02-03 21:22:27 UTC
Description of problem:
Unable to properly upgrade a cluster from 4.10.0-fc.4 to 4.10.0-rc.0 because the etcd operator is degraded with EtcdCertSignerControllerDegraded (see Additional info below).

Version-Release number of selected component (if applicable): 4.10.0-rc.0


How reproducible: Unknown


Steps to Reproduce:
1. Create an aarch64 IPI cluster on AWS with OVN-Kubernetes networking
2. Upgrade the cluster (look up the release image digest, then force the upgrade by image):
 oc image info quay.io/openshift-release-dev/ocp-release:4.10.0-rc.0-aarch64

 oc adm upgrade --allow-explicit-upgrade --to-image=quay.io/openshift-release-dev/ocp-release@sha256:8c767585e07a0b5626a20eb0a4078b2a0f042658e21813cd75349906fb4b1173 --force
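
A minimal way to watch the upgrade and the etcd operator while it runs (standard oc commands):

 # Watch the ClusterVersion and the etcd cluster operator during the upgrade
 oc get clusterversion
 oc get clusteroperator etcd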

Actual results:
Unable to apply 4.10.0-rc.0: wait has exceeded 40 minutes for these operators: etcd

Expected results:
Upgrade succeeds and no cluster operators are degraded

Additional info:
02-02 10:11:54.990  ClusterID: 77483415-2c45-464a-be5c-647ca1ba3696
02-02 10:11:54.990  ClusterVersion: Updating to "4.10.0-rc.0" from "4.10.0-fc.4" for 51 minutes: Unable to apply 4.10.0-rc.0: wait has exceeded 40 minutes for these operators: etcd
02-02 10:11:55.246  ClusterOperators:
02-02 10:11:55.246  	clusteroperator/etcd is degraded because EtcdCertSignerControllerDegraded: [x509: certificate is valid for 10.0.135.187, not 10.0.136.189, x509: certificate is valid for ::1, 10.0.135.187, 127.0.0.1, ::1, not 10.0.136.189]
02-02 10:11:55.246  	clusteroperator/kube-scheduler is degraded because TargetConfigControllerDegraded: "configmap": scheduler Policy config has been removed upstream, this field remains for CRD compatibility but does nothing now. Please use a Profile instead (defaulting to LowNodeUtilization).
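
One way to confirm which SANs the current etcd serving certificate actually carries, assuming the usual etcd-serving-<node-name> secret naming in openshift-etcd (the secret name below is an assumption, not taken from this report):

 # Assumed secret name pattern: etcd-serving-<node-name> in openshift-etcd.
 # Print the serving certificate's Subject Alternative Names.
 oc get secret -n openshift-etcd etcd-serving-ip-10-0-135-187.us-east-2.compute.internal \
   -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'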

Comment 2 W. Trevor King 2022-02-04 19:58:39 UTC
Poking around in the must-gather:

$ tar -xz --strip-components=2 <must-gather.tar.gz
$ yaml2json cluster-scoped-resources/config.openshift.io/clusterversions/version.yaml | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2022-02-02T12:28:50Z Available=True : Done applying 4.10.0-fc.4
2022-02-02T15:01:51Z Failing=True ClusterOperatorDegraded: Cluster operator etcd is degraded
2022-02-02T14:20:33Z Progressing=True ClusterOperatorDegraded: Unable to apply 4.10.0-rc.0: wait has exceeded 40 minutes for these operators: etcd
2022-02-02T12:04:17Z RetrievedUpdates=False VersionNotFound: Unable to retrieve available updates: currently reconciling cluster version 4.10.0-rc.0 not found in the "stable-4.10" channel
$ yaml2json cluster-scoped-resources/config.openshift.io/clusteroperators/etcd.yaml | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2022-02-02T13:20:15Z Degraded=True EtcdCertSignerController_Error: EtcdCertSignerControllerDegraded: [x509: certificate is valid for 10.0.135.187, not 10.0.136.189, x509: certificate is valid for ::1, 10.0.135.187, 127.0.0.1, ::1, not 10.0.136.189]
2022-02-02T14:25:24Z Progressing=False AsExpected: NodeInstallerProgressing: 3 nodes are at revision 8
EtcdMembersProgressing: No unstarted etcd members found
2022-02-02T12:10:45Z Available=True AsExpected: StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 8
EtcdMembersAvailable: 3 members are available
2022-02-02T12:09:08Z Upgradeable=True AsExpected: All is well
2022-02-02T12:09:09Z RecentBackup=Unknown ControllerStarted: 

Looking at the nodes:

$ for NODE in cluster-scoped-resources/core/nodes/*; do yaml2json "${NODE}" | jq -r 'select(.metadata.labels["node-role.kubernetes.io/master"]).metadata.name'; done
ip-10-0-135-187.us-east-2.compute.internal
ip-10-0-172-14.us-east-2.compute.internal
ip-10-0-211-1.us-east-2.compute.internal

So yeah, a cert for 10.0.135.187 makes sense.  Why is that cert being used for 10.0.136.189?

$ yaml2json cluster-scoped-resources/core/nodes/ip-10-0-135-187.us-east-2.compute.internal.yaml | jq -r '.status.addresses[] | .type + " " + .address'
InternalIP 10.0.135.187
InternalIP 10.0.136.189
Hostname ip-10-0-135-187.us-east-2.compute.internal
InternalDNS ip-10-0-135-187.us-east-2.compute.internal

Two InternalIPs.  There was bug 1954129 in this space back in 4.8, which mentions that manual work may be needed if a node's InternalIP changes on update.
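
As a quick sanity check with the same yaml2json/jq approach, every node's InternalIPs can be listed in one go:

$ # Same must-gather tooling as above: list each node's InternalIP addresses
$ for NODE in cluster-scoped-resources/core/nodes/*; do yaml2json "${NODE}" | jq -r '.metadata.name + ": " + ([.status.addresses[] | select(.type=="InternalIP") | .address] | join(", "))'; done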

$ yaml2json cluster-scoped-resources/cloud.network.openshift.io/cloudprivateipconfigs/10.0.136.189.yaml | jq -r .metadata.creationTimestamp
2022-02-02T13:17:59Z

That's suspicious.  Let's see if we can figure out where it came from in the audit logs:

$ zgrep -h '"verb":"create"' audit_logs/*/*.gz 2>/dev/null | grep 'cloudprivateipconfigs.*10.0.136.189' | jq -r .user.username
system:serviceaccount:openshift-ovn-kubernetes:ovn-kubernetes-controller

Can we find logs?

$ grep -r 'serviceAccount: ovn-kubernetes-controller'
namespaces/openshift-ovn-kubernetes/apps/daemonsets/ovnkube-master.yaml:      serviceAccount: ovn-kubernetes-controller
namespaces/openshift-ovn-kubernetes/apps/daemonsets.yaml:        serviceAccount: ovn-kubernetes-controller
namespaces/openshift-ovn-kubernetes/core/pods.yaml:    serviceAccount: ovn-kubernetes-controller
namespaces/openshift-ovn-kubernetes/core/pods.yaml:    serviceAccount: ovn-kubernetes-controller
namespaces/openshift-ovn-kubernetes/core/pods.yaml:    serviceAccount: ovn-kubernetes-controller
namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-chmcc/ovnkube-master-chmcc.yaml:  serviceAccount: ovn-kubernetes-controller
namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-g56fg/ovnkube-master-g56fg.yaml:  serviceAccount: ovn-kubernetes-controller
namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-rg2tq/ovnkube-master-rg2tq.yaml:  serviceAccount: ovn-kubernetes-controller
$ grep nodeName: namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-*/*.yaml
namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-chmcc/ovnkube-master-chmcc.yaml:  nodeName: ip-10-0-211-1.us-east-2.compute.internal
namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-g56fg/ovnkube-master-g56fg.yaml:  nodeName: ip-10-0-135-187.us-east-2.compute.internal
namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-rg2tq/ovnkube-master-rg2tq.yaml:  nodeName: ip-10-0-172-14.us-east-2.compute.internal

Poking in logs:

$ head -n1 namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-g56fg/ovnkube-master/ovnkube-master/logs/*
==> namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-g56fg/ovnkube-master/ovnkube-master/logs/current.log <==
2022-02-02T12:15:57.986804393Z + [[ -f /env/_master ]]

==> namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-g56fg/ovnkube-master/ovnkube-master/logs/previous.log <==
2022-02-02T12:07:30.931083310Z + [[ -f /env/_master ]]

12:15 is before the 13:17 CloudPrivateIPConfig creation, so let's look at current.log:

$ grep -o '10[.]0[.][0-9]*[.][0-9]*' namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-g56fg/ovnkube-master/ovnkube-master/logs/current.log | sort -n | uniq -c
...
   1427 10.0.135.187
      3 10.0.136.189
...
$ grep 10.0.136.189 namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-g56fg/ovnkube-master/ovnkube-master/logs/current.log
2022-02-02T13:17:59.857899233Z I0202 13:17:59.857836       1 transact.go:41] Configuring OVN: [{Op:insert Table:NAT Row:map[external_ids:{GoMap:map[name:egressip]} external_ip:10.0.136.189 ...
2022-02-02T13:54:35.377453412Z I0202 13:54:35.377393       1 transact.go:41]...
...

So it looks like OVN did indeed add that CloudPrivateIPConfig at 13:17, and that this is what is causing the etcd cert confusion and blocking the update.  I don't see anything in the OVN logs that makes the source of that IP clear to me, so sending this over to the OVN folks.

Comment 4 Patryk Diak 2022-02-07 09:48:15 UTC
Was there an egress IP configured on one of the master nodes?
If so, this is a duplicate of: https://bugzilla.redhat.com/show_bug.cgi?id=2039656
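
A minimal sketch of how to check, assuming the standard OVN-Kubernetes egress IP resources and the k8s.ovn.org/egress-assignable node label:

$ # EgressIP objects and their current node assignments (assumed standard resource names)
$ oc get egressip
$ # CloudPrivateIPConfigs created for egress IPs
$ oc get cloudprivateipconfig
$ # Nodes labeled as egress-assignable
$ oc get nodes -l k8s.ovn.org/egress-assignable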

Comment 5 Jose Castillo Lema 2022-02-07 20:43:16 UTC
Hi,

we have hit the same issue as well when upgrading a bare-metal OVN-Kubernetes cluster from 4.10.0-fc.4 to 4.10.0-rc.0, in our case with IPv6 addresses:
etcd                                       4.10.0-rc.1   True        False         True       23h     EtcdCertSignerControllerDegraded: [x509: certificate is valid for 192.168.216.11, not fd01:1101::5743:2180:facf:2c3d, x509: certificate is valid for ::1, 127.0.0.1, 192.168.216.11, ::1, not fd01:1101::5743:2180:facf:2c3d, x509: certificate is valid for ::1, 127.0.0.1, 192.168.216.10, ::1, not fd01:1101::c098:ca86:2951:9c63, x509: certificate is valid for 192.168.216.10, not fd01:1101::c098:ca86:2951:9c63, x509: certificate is valid for 192.168.216.12, not fd01:1101::e25f:dcd3:1bec:bd22, x509: certificate is valid for ::1, 127.0.0.1, 192.168.216.12, ::1, not fd01:1101::e25f:dcd3:1bec:bd22]

Comment 6 Patryk Diak 2022-02-21 08:59:43 UTC

*** This bug has been marked as a duplicate of bug 2039656 ***

Comment 7 Red Hat Bugzilla 2023-09-15 01:19:23 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days