Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2050403

Summary: Unable to upgrade from 4.10.0-fc.4 to 4.10.0-rc.0 because of etcd EtcdCertSignerControllerDegraded, from OVN-created CloudPrivateIPConfig
Product: OpenShift Container Platform Reporter: Paige Rubendall <prubenda>
Component: Networking    Assignee: Patryk Diak <pdiak>
Networking sub component: ovn-kubernetes QA Contact: Anurag saxena <anusaxen>
Status: CLOSED DUPLICATE Docs Contact:
Severity: unspecified    
Priority: unspecified CC: jlema, pdiak, william.caban, wking
Version: 4.10   
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-02-21 08:59:43 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Paige Rubendall 2022-02-03 21:22:27 UTC
Description of problem:
Unable to properly upgrade a cluster from 4.10.0-fc.4 to 4.10.0-rc.0 because the etcd operator is degraded with EtcdCertSignerControllerDegraded (see Additional info below).

Version-Release number of selected component (if applicable): 4.10.0-rc.0


How reproducible: Unknown


Steps to Reproduce:
1. Create an aarch64 IPI cluster on AWS with OVN-Kubernetes networking
2. Upgrade the cluster (look up the release image digest, then force the upgrade by image):
 oc image info quay.io/openshift-release-dev/ocp-release:4.10.0-rc.0-aarch64

 oc adm upgrade --allow-explicit-upgrade --to-image=quay.io/openshift-release-dev/ocp-release@sha256:8c767585e07a0b5626a20eb0a4078b2a0f042658e21813cd75349906fb4b1173 --force
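
A minimal way to watch the upgrade and the etcd operator while it runs (standard oc commands):

 # Watch the ClusterVersion and the etcd cluster operator during the upgrade
 oc get clusterversion
 oc get clusteroperator etcd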

Actual results:
Unable to apply 4.10.0-rc.0: wait has exceeded 40 minutes for these operators: etcd

Expected results:
Upgrade succeeds and no cluster operators are degraded

Additional info:
02-02 10:11:54.990  ClusterID: 77483415-2c45-464a-be5c-647ca1ba3696
02-02 10:11:54.990  ClusterVersion: Updating to "4.10.0-rc.0" from "4.10.0-fc.4" for 51 minutes: Unable to apply 4.10.0-rc.0: wait has exceeded 40 minutes for these operators: etcd
02-02 10:11:55.246  ClusterOperators:
02-02 10:11:55.246  	clusteroperator/etcd is degraded because EtcdCertSignerControllerDegraded: [x509: certificate is valid for 10.0.135.187, not 10.0.136.189, x509: certificate is valid for ::1, 10.0.135.187, 127.0.0.1, ::1, not 10.0.136.189]
02-02 10:11:55.246  	clusteroperator/kube-scheduler is degraded because TargetConfigControllerDegraded: "configmap": scheduler Policy config has been removed upstream, this field remains for CRD compatibility but does nothing now. Please use a Profile instead (defaulting to LowNodeUtilization).
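
One way to confirm which SANs the current etcd serving certificate actually carries, assuming the usual etcd-serving-<node-name> secret naming in openshift-etcd (the secret name below is an assumption, not taken from this report):

 # Assumed secret name pattern: etcd-serving-<node-name> in openshift-etcd.
 # Print the serving certificate's Subject Alternative Names.
 oc get secret -n openshift-etcd etcd-serving-ip-10-0-135-187.us-east-2.compute.internal \
   -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'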

Comment 2 W. Trevor King 2022-02-04 19:58:39 UTC
Poking around in the must-gather:

$ tar -xz --strip-components=2 <must-gather.tar.gz
$ yaml2json cluster-scoped-resources/config.openshift.io/clusterversions/version.yaml | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2022-02-02T12:28:50Z Available=True : Done applying 4.10.0-fc.4
2022-02-02T15:01:51Z Failing=True ClusterOperatorDegraded: Cluster operator etcd is degraded
2022-02-02T14:20:33Z Progressing=True ClusterOperatorDegraded: Unable to apply 4.10.0-rc.0: wait has exceeded 40 minutes for these operators: etcd
2022-02-02T12:04:17Z RetrievedUpdates=False VersionNotFound: Unable to retrieve available updates: currently reconciling cluster version 4.10.0-rc.0 not found in the "stable-4.10" channel
$ yaml2json cluster-scoped-resources/config.openshift.io/clusteroperators/etcd.yaml | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2022-02-02T13:20:15Z Degraded=True EtcdCertSignerController_Error: EtcdCertSignerControllerDegraded: [x509: certificate is valid for 10.0.135.187, not 10.0.136.189, x509: certificate is valid for ::1, 10.0.135.187, 127.0.0.1, ::1, not 10.0.136.189]
2022-02-02T14:25:24Z Progressing=False AsExpected: NodeInstallerProgressing: 3 nodes are at revision 8
EtcdMembersProgressing: No unstarted etcd members found
2022-02-02T12:10:45Z Available=True AsExpected: StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 8
EtcdMembersAvailable: 3 members are available
2022-02-02T12:09:08Z Upgradeable=True AsExpected: All is well
2022-02-02T12:09:09Z RecentBackup=Unknown ControllerStarted: 

Looking at the nodes:

$ for NODE in cluster-scoped-resources/core/nodes/*; do yaml2json "${NODE}" | jq -r 'select(.metadata.labels["node-role.kubernetes.io/master"]).metadata.name'; done
ip-10-0-135-187.us-east-2.compute.internal
ip-10-0-172-14.us-east-2.compute.internal
ip-10-0-211-1.us-east-2.compute.internal

So yeah, a cert for 10.0.135.187 makes sense.  Why is that cert being used for 10.0.136.189?

$ yaml2json cluster-scoped-resources/core/nodes/ip-10-0-135-187.us-east-2.compute.internal.yaml | jq -r '.status.addresses[] | .type + " " + .address'
InternalIP 10.0.135.187
InternalIP 10.0.136.189
Hostname ip-10-0-135-187.us-east-2.compute.internal
InternalDNS ip-10-0-135-187.us-east-2.compute.internal

Two InternalIPs.  There was bug 1954129 in this space back in 4.8, which mentions that manual work may be needed if a node's InternalIP changes on update.
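
As a quick sanity check with the same yaml2json/jq approach, every node's InternalIPs can be listed in one go:

$ # Same must-gather tooling as above: list each node's InternalIP addresses
$ for NODE in cluster-scoped-resources/core/nodes/*; do yaml2json "${NODE}" | jq -r '.metadata.name + ": " + ([.status.addresses[] | select(.type=="InternalIP") | .address] | join(", "))'; done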

$ yaml2json cluster-scoped-resources/cloud.network.openshift.io/cloudprivateipconfigs/10.0.136.189.yaml | jq -r .metadata.creationTimestamp
2022-02-02T13:17:59Z

That's suspicious.  Let's see if we can figure out where it came from in the audit logs:

$ zgrep -h '"verb":"create"' audit_logs/*/*.gz 2>/dev/null | grep 'cloudprivateipconfigs.*10.0.136.189' | jq -r .user.username
system:serviceaccount:openshift-ovn-kubernetes:ovn-kubernetes-controller

Can we find logs?

$ grep -r 'serviceAccount: ovn-kubernetes-controller'
namespaces/openshift-ovn-kubernetes/apps/daemonsets/ovnkube-master.yaml:      serviceAccount: ovn-kubernetes-controller
namespaces/openshift-ovn-kubernetes/apps/daemonsets.yaml:        serviceAccount: ovn-kubernetes-controller
namespaces/openshift-ovn-kubernetes/core/pods.yaml:    serviceAccount: ovn-kubernetes-controller
namespaces/openshift-ovn-kubernetes/core/pods.yaml:    serviceAccount: ovn-kubernetes-controller
namespaces/openshift-ovn-kubernetes/core/pods.yaml:    serviceAccount: ovn-kubernetes-controller
namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-chmcc/ovnkube-master-chmcc.yaml:  serviceAccount: ovn-kubernetes-controller
namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-g56fg/ovnkube-master-g56fg.yaml:  serviceAccount: ovn-kubernetes-controller
namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-rg2tq/ovnkube-master-rg2tq.yaml:  serviceAccount: ovn-kubernetes-controller
$ grep nodeName: namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-*/*.yaml
namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-chmcc/ovnkube-master-chmcc.yaml:  nodeName: ip-10-0-211-1.us-east-2.compute.internal
namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-g56fg/ovnkube-master-g56fg.yaml:  nodeName: ip-10-0-135-187.us-east-2.compute.internal
namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-rg2tq/ovnkube-master-rg2tq.yaml:  nodeName: ip-10-0-172-14.us-east-2.compute.internal

Poking in logs:

$ head -n1 namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-g56fg/ovnkube-master/ovnkube-master/logs/*
==> namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-g56fg/ovnkube-master/ovnkube-master/logs/current.log <==
2022-02-02T12:15:57.986804393Z + [[ -f /env/_master ]]

==> namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-g56fg/ovnkube-master/ovnkube-master/logs/previous.log <==
2022-02-02T12:07:30.931083310Z + [[ -f /env/_master ]]

12:15 is before the 13:17 CloudPrivateIPConfig creation, so let's look at current.log:

$ grep -o '10[.]0[.][0-9]*[.][0-9]*' namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-g56fg/ovnkube-master/ovnkube-master/logs/current.log | sort -n | uniq -c
...
   1427 10.0.135.187
      3 10.0.136.189
...
$ grep 10.0.136.189 namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-g56fg/ovnkube-master/ovnkube-master/logs/current.log
2022-02-02T13:17:59.857899233Z I0202 13:17:59.857836       1 transact.go:41] Configuring OVN: [{Op:insert Table:NAT Row:map[external_ids:{GoMap:map[name:egressip]} external_ip:10.0.136.189 ...
2022-02-02T13:54:35.377453412Z I0202 13:54:35.377393       1 transact.go:41]...
...

So it looks like OVN did indeed add that CloudPrivateIPConfig at 13:17, and that this is what is causing the etcd cert confusion and blocking the update.  I don't see anything in the OVN logs that makes the source of that IP clear to me, so sending this over to the OVN folks.

Comment 4 Patryk Diak 2022-02-07 09:48:15 UTC
Was there an egress IP configured on one of the master nodes?
If so, this is a duplicate of: https://bugzilla.redhat.com/show_bug.cgi?id=2039656
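
A minimal sketch of how to check, assuming the standard OVN-Kubernetes egress IP resources and the k8s.ovn.org/egress-assignable node label:

$ # EgressIP objects and their current node assignments (assumed standard resource names)
$ oc get egressip
$ # CloudPrivateIPConfigs created for egress IPs
$ oc get cloudprivateipconfig
$ # Nodes labeled as egress-assignable
$ oc get nodes -l k8s.ovn.org/egress-assignable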

Comment 5 Jose Castillo Lema 2022-02-07 20:43:16 UTC
Hi,

we have hit the same issue as well when upgrading a bare-metal OVN-Kubernetes cluster from 4.10.0-fc.4 to 4.10.0-rc.0, in our case with IPv6 addresses:
etcd                                       4.10.0-rc.1   True        False         True       23h     EtcdCertSignerControllerDegraded: [x509: certificate is valid for 192.168.216.11, not fd01:1101::5743:2180:facf:2c3d, x509: certificate is valid for ::1, 127.0.0.1, 192.168.216.11, ::1, not fd01:1101::5743:2180:facf:2c3d, x509: certificate is valid for ::1, 127.0.0.1, 192.168.216.10, ::1, not fd01:1101::c098:ca86:2951:9c63, x509: certificate is valid for 192.168.216.10, not fd01:1101::c098:ca86:2951:9c63, x509: certificate is valid for 192.168.216.12, not fd01:1101::e25f:dcd3:1bec:bd22, x509: certificate is valid for ::1, 127.0.0.1, 192.168.216.12, ::1, not fd01:1101::e25f:dcd3:1bec:bd22]

Comment 6 Patryk Diak 2022-02-21 08:59:43 UTC

*** This bug has been marked as a duplicate of bug 2039656 ***

Comment 7 Red Hat Bugzilla 2023-09-15 01:19:23 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days