Bug 2050403
| Summary: | Unable to upgrade from 4.10.0-fc.4 to 4.10.0-rc.0 because of etcd EtcdCertSignerControllerDegraded, from OVN-created CloudPrivateIPConfig | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Paige Rubendall <prubenda> |
| Component: | Networking | Assignee: | Patryk Diak <pdiak> |
| Networking sub component: | ovn-kubernetes | QA Contact: | Anurag saxena <anusaxen> |
| Status: | CLOSED DUPLICATE | Docs Contact: | |
| Severity: | unspecified | | |
| Priority: | unspecified | CC: | jlema, pdiak, william.caban, wking |
| Version: | 4.10 | | |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-02-21 08:59:43 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Paige Rubendall
2022-02-03 21:22:27 UTC
Poking around in the must-gather:
$ tar -xz --strip-components=2 <must-gather.tar.gz
$ yaml2json cluster-scoped-resources/config.openshift.io/clusterversions/version.yaml | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2022-02-02T12:28:50Z Available=True : Done applying 4.10.0-fc.4
2022-02-02T15:01:51Z Failing=True ClusterOperatorDegraded: Cluster operator etcd is degraded
2022-02-02T14:20:33Z Progressing=True ClusterOperatorDegraded: Unable to apply 4.10.0-rc.0: wait has exceeded 40 minutes for these operators: etcd
2022-02-02T12:04:17Z RetrievedUpdates=False VersionNotFound: Unable to retrieve available updates: currently reconciling cluster version 4.10.0-rc.0 not found in the "stable-4.10" channel
$ yaml2json cluster-scoped-resources/config.openshift.io/clusteroperators/etcd.yaml | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2022-02-02T13:20:15Z Degraded=True EtcdCertSignerController_Error: EtcdCertSignerControllerDegraded: [x509: certificate is valid for 10.0.135.187, not 10.0.136.189, x509: certificate is valid for ::1, 10.0.135.187, 127.0.0.1, ::1, not 10.0.136.189]
2022-02-02T14:25:24Z Progressing=False AsExpected: NodeInstallerProgressing: 3 nodes are at revision 8
EtcdMembersProgressing: No unstarted etcd members found
2022-02-02T12:10:45Z Available=True AsExpected: StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 8
EtcdMembersAvailable: 3 members are available
2022-02-02T12:09:08Z Upgradeable=True AsExpected: All is well
2022-02-02T12:09:09Z RecentBackup=Unknown ControllerStarted:
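The x509 complaint above is a plain SAN mismatch: the serving cert was minted for the node's old InternalIP, and 10.0.136.189 is not in its SAN list. As a hedged aside, the same class of failure can be reproduced locally with a throwaway cert (names and paths below are illustrative, not from the cluster):

```shell
# Mint a throwaway cert whose SAN covers only the old InternalIP
# (requires OpenSSL 1.1.1+ for -addext).
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/etcd-demo.key \
  -out /tmp/etcd-demo.crt -days 1 -subj '/CN=etcd-demo' \
  -addext 'subjectAltName=IP:10.0.135.187' 2>/dev/null

# The cert matches the old IP but not the new one, mirroring the
# "certificate is valid for 10.0.135.187, not 10.0.136.189" message.
openssl x509 -in /tmp/etcd-demo.crt -noout -checkip 10.0.135.187
openssl x509 -in /tmp/etcd-demo.crt -noout -checkip 10.0.136.189
```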
Looking at the nodes:
$ for NODE in cluster-scoped-resources/core/nodes/*; do yaml2json "${NODE}" | jq -r 'select(.metadata.labels["node-role.kubernetes.io/master"]).metadata.name'; done
ip-10-0-135-187.us-east-2.compute.internal
ip-10-0-172-14.us-east-2.compute.internal
ip-10-0-211-1.us-east-2.compute.internal
So yeah, a cert for 10.0.135.187 makes sense. Why is that cert being used for 10.0.136.189?
$ yaml2json cluster-scoped-resources/core/nodes/ip-10-0-135-187.us-east-2.compute.internal.yaml | jq -r '.status.addresses[] | .type + " " + .address'
InternalIP 10.0.135.187
InternalIP 10.0.136.189
Hostname ip-10-0-135-187.us-east-2.compute.internal
InternalDNS ip-10-0-135-187.us-east-2.compute.internal
Two InternalIP entries. There was bug 1954129 in this space back in 4.8, which mentions that manual work may be needed if a node's InternalIP changes on update.
$ yaml2json cluster-scoped-resources/cloud.network.openshift.io/cloudprivateipconfigs/10.0.136.189.yaml | jq -r .metadata.creationTimestamp
2022-02-02T13:17:59Z
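As a sketch, the two-InternalIP check above can be turned into a scan over all nodes; the snippet below demonstrates the jq filter on an inline sample of the address list we just saw, since yaml2json output is just JSON (against a must-gather you would pipe `yaml2json "${NODE}"` for each node file instead):

```shell
# Count InternalIP entries in a node's .status.addresses; any node with
# more than one is showing the symptom above. Sample data inlined here.
sample='{"status":{"addresses":[
  {"type":"InternalIP","address":"10.0.135.187"},
  {"type":"InternalIP","address":"10.0.136.189"},
  {"type":"Hostname","address":"ip-10-0-135-187.us-east-2.compute.internal"}]}}'
count="$(echo "${sample}" | jq '[.status.addresses[] | select(.type == "InternalIP")] | length')"
echo "InternalIP count: ${count}"
```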
That's suspicious. Let's see if we can figure out where it came from in the audit logs:
$ zgrep -h '"verb":"create"' audit_logs/*/*.gz 2>/dev/null | grep 'cloudprivateipconfigs.*10.0.136.189' | jq -r .user.username
system:serviceaccount:openshift-ovn-kubernetes:ovn-kubernetes-controller
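For reference, the audit event that zgrep is matching looks roughly like the abbreviated sample below (field layout per the Kubernetes audit Event API; the values are taken from the output above, everything else is trimmed). The jq filter pulls the same username the pipeline reported:

```shell
# A single abbreviated audit event for the create we found; in the real
# logs each line is one such JSON object.
event='{"verb":"create",
  "user":{"username":"system:serviceaccount:openshift-ovn-kubernetes:ovn-kubernetes-controller"},
  "objectRef":{"resource":"cloudprivateipconfigs","name":"10.0.136.189"}}'
who="$(echo "${event}" | jq -r 'select(.verb == "create" and .objectRef.resource == "cloudprivateipconfigs") | .user.username')"
echo "${who}"
```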
Can we find logs?
$ grep -r 'serviceAccount: ovn-kubernetes-controller'
namespaces/openshift-ovn-kubernetes/apps/daemonsets/ovnkube-master.yaml: serviceAccount: ovn-kubernetes-controller
namespaces/openshift-ovn-kubernetes/apps/daemonsets.yaml: serviceAccount: ovn-kubernetes-controller
namespaces/openshift-ovn-kubernetes/core/pods.yaml: serviceAccount: ovn-kubernetes-controller
namespaces/openshift-ovn-kubernetes/core/pods.yaml: serviceAccount: ovn-kubernetes-controller
namespaces/openshift-ovn-kubernetes/core/pods.yaml: serviceAccount: ovn-kubernetes-controller
namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-chmcc/ovnkube-master-chmcc.yaml: serviceAccount: ovn-kubernetes-controller
namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-g56fg/ovnkube-master-g56fg.yaml: serviceAccount: ovn-kubernetes-controller
namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-rg2tq/ovnkube-master-rg2tq.yaml: serviceAccount: ovn-kubernetes-controller
$ grep nodeName: namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-*/*.yaml
namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-chmcc/ovnkube-master-chmcc.yaml: nodeName: ip-10-0-211-1.us-east-2.compute.internal
namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-g56fg/ovnkube-master-g56fg.yaml: nodeName: ip-10-0-135-187.us-east-2.compute.internal
namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-rg2tq/ovnkube-master-rg2tq.yaml: nodeName: ip-10-0-172-14.us-east-2.compute.internal
Poking in logs:
$ head -n1 namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-g56fg/ovnkube-master/ovnkube-master/logs/*
==> namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-g56fg/ovnkube-master/ovnkube-master/logs/current.log <==
2022-02-02T12:15:57.986804393Z + [[ -f /env/_master ]]
==> namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-g56fg/ovnkube-master/ovnkube-master/logs/previous.log <==
2022-02-02T12:07:30.931083310Z + [[ -f /env/_master ]]
12:15 is before the 13:17 CloudPrivateIPConfig creation, so let's look at current.log:
$ grep -o '10[.]0[.][0-9]*[.][0-9]*' namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-g56fg/ovnkube-master/ovnkube-master/logs/current.log | sort -n | uniq -c
...
1427 10.0.135.187
3 10.0.136.189
...
$ grep 10.0.136.189 namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-g56fg/ovnkube-master/ovnkube-master/logs/current.log
2022-02-02T13:17:59.857899233Z I0202 13:17:59.857836 1 transact.go:41] Configuring OVN: [{Op:insert Table:NAT Row:map[external_ids:{GoMap:map[name:egressip]} external_ip:10.0.136.189 ...
2022-02-02T13:54:35.377453412Z I0202 13:54:35.377393 1 transact.go:41]...
...
So it looks like OVN did indeed add that CloudPrivateIPConfig at 13:17, and that this is what is leading to the etcd cert confusion and blocking the update. I don't see anything in the OVN logs that makes the source of that IP clear to me, so sending over to the OVN folks.
Was there an egress IP configured on one of the master nodes? If so this is a duplicate of: https://bugzilla.redhat.com/show_bug.cgi?id=2039656

Hi, we have hit the same issue as well upgrading a baremetal ovnkubernetes cluster from 4.10.0-fc.4 to 4.10.0-rc.0, in our case with IPv6 addresses:

etcd 4.10.0-rc.1 True False True 23h EtcdCertSignerControllerDegraded: [x509: certificate is valid for 192.168.216.11, not fd01:1101::5743:2180:facf:2c3d, x509: certificate is valid for ::1, 127.0.0.1, 192.168.216.11, ::1, not fd01:1101::5743:2180:facf:2c3d, x509: certificate is valid for ::1, 127.0.0.1, 192.168.216.10, ::1, not fd01:1101::c098:ca86:2951:9c63, x509: certificate is valid for 192.168.216.10, not fd01:1101::c098:ca86:2951:9c63, x509: certificate is valid for 192.168.216.12, not fd01:1101::e25f:dcd3:1bec:bd22, x509: certificate is valid for ::1, 127.0.0.1, 192.168.216.12, ::1, not fd01:1101::e25f:dcd3:1bec:bd22]

*** This bug has been marked as a duplicate of bug 2039656 ***

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days
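To answer the egress-IP question from a must-gather, one hedged approach is to check which node each CloudPrivateIPConfig is assigned to via its spec.node field and compare against the master node names; the snippet below demonstrates the jq step on an inline sample object (against a must-gather you would feed yaml2json output of each cloudprivateipconfigs/*.yaml):

```shell
# Map a CloudPrivateIPConfig to the node it is assigned to. If that node
# is a master, the cloud attaches the egress IP to the master's NIC, it
# surfaces as an extra InternalIP in node .status.addresses, and the etcd
# cert SAN check trips -- the mechanism suggested above. Sample object is
# illustrative, built from the IP and node seen earlier in this report.
cpic='{"metadata":{"name":"10.0.136.189"},"spec":{"node":"ip-10-0-135-187.us-east-2.compute.internal"}}'
assignment="$(echo "${cpic}" | jq -r '.metadata.name + " -> " + .spec.node')"
echo "${assignment}"
```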