Created attachment 1594812 [details]
must-gather logs

Description of problem:
While upgrading from 4.1.4 to 4.1.7, the master nodes were marked Degraded. The nodes were marked SchedulingDisabled during the upgrade and never changed back. There have been a few different scenarios I have been debugging as I dug into this issue.

Version-Release number of selected component (if applicable):
4.1.4 and 4.1.7

How reproducible:
I encountered this by upgrading from 4.1.4 to 4.1.7

Actual results:
MCO is hung and cannot progress with the 4.1.7 upgrade

Expected results:
MCO finishes running its upgrade steps and the upgrade completes

Additional info:
I don't really know what caused this, and I have done a lot of debugging and mucking around trying to unstick the process and figure out how MCO works. There doesn't appear to be a good way to get a node out of Degraded status. I would like to know the steps required to reset MCO into a good state.
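For anyone else triaging a stuck pool: a crude but version-independent way to see each node's MCO state is to dump the machineconfiguration annotations the daemon writes on the Node objects. A minimal sketch, assuming cluster-admin access:

```
# Per-node MCO state lives in annotations the MCD writes on each Node;
# grepping the YAML is crude but avoids jsonpath key-escaping headaches.
for n in $(oc get nodes -o name); do
  echo "== $n"
  oc get "$n" -o yaml | grep 'machineconfiguration.openshift.io/'
done
```

The currentConfig/desiredConfig/state annotations this prints are what the MCD and the pool status are comparing.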
One of the MCDs is stuck in drain:

2019-07-30T13:17:04.895320414Z I0730 13:17:04.895296    4213 update.go:836] Update prepared; beginning drain
2019-07-30T15:27:27.941121869Z I0730 15:27:27.941073    4213 update.go:134] Draining failed with: the server was unable to return a response in the time allotted, but may still be processing the request (get pods), retrying
2019-07-30T15:28:07.942531937Z I0730 15:28:07.942495    4213 update.go:134] Draining failed with: Timeout: request did not complete within requested timeout 30s, retrying
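A minimal way to see what is blocking a drain like this; the node name and daemon pod name below are illustrative, substitute the real ones:

```
# Pods still scheduled on the node the MCD is trying to drain
oc get pods --all-namespaces --field-selector spec.nodeName=master0 -o wide

# Follow the stuck daemon's retry loop directly
oc -n openshift-machine-config-operator logs -f machine-config-daemon-xxxxx
```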
We'll need to backport the kubernetes-drain fixes anyway, and logs will help us figure out why it's stuck there.
Created attachment 1600022 [details]
must-gather logs for fresh cluster install with error

I did a fresh install of a cluster at 4.1.4, upgraded to 4.1.7 and then to 4.1.8, and saw the same errors. Attached are the must-gather logs from right after it happened, before I took any steps of my own to correct the issue (none of which worked).
Did an upgrade of the 4.1.8 cluster to 4.1.9 and am still seeing the error while trying to progress to the newest version.

From the machine-config ClusterOperator:

Unable to apply 4.1.9: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: controller version mismatch for rendered-master-77fedea5c08df3344292ab91a9ce1298 expected 83392b13a5c17e56656acf3f7b0031e3303fb5c0 has 02c07496ba0417b3e12b78fb32baf6293d314f79, retrying
(In reply to Jason Kincl from comment #4)
> Did an upgrade of the 4.1.8 cluster to 4.1.9 and am still seeing the error
> while trying to progress to the newest version.
>
> From the machine-config ClusterOperator
>
> Unable to apply 4.1.9: timed out waiting for the condition during
> syncRequiredMachineConfigPools: pool master has not progressed to latest
> configuration: controller version mismatch for
> rendered-master-77fedea5c08df3344292ab91a9ce1298 expected
> 83392b13a5c17e56656acf3f7b0031e3303fb5c0 has
> 02c07496ba0417b3e12b78fb32baf6293d314f79, retrying

Does this reconcile eventually? That message says that the MCC hasn't generated the newest rendered machineconfigs for the new version, which is OK, as the MCC may not have run yet.
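For reference, the "expected"/"has" hashes in that error compare the MCO's expected controller version against an annotation stamped on the rendered config, which can be read directly. A sketch using the rendered-config name from the error above (the annotation key is the one the MCO uses):

```
# The "has ..." hash in the error comes from this annotation
oc get machineconfig rendered-master-77fedea5c08df3344292ab91a9ce1298 \
  -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/generated-by-controller-version}'
```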
(In reply to Antonio Murdaca from comment #5)
> Does this reconcile eventually? That message is saying that the MCC hasn't
> generated the newest rendered machineconfigs for the new version which is ok
> as the MCC may have not run yet.

No, it has been trying to reconcile since I filed the bug and it has not completed.
Looks like my cluster is stuck upgrading on the masters.

For background: the customer has been upgraded since 4.1.1 through all versions in between. I once ran into https://bugzilla.redhat.com/show_bug.cgi?id=1723327, which was solved by rolling back ostree on the affected nodes, but that does not do the trick this time.

# oc get clusterversions.config.openshift.io
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.8     True        True          23h     Working towards 4.1.9: 89% complete

# oc get clusteroperators.config.openshift.io
NAME                                 VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                       4.1.9     True        False         False      65d
cloud-credential                     4.1.9     True        False         False      65d
cluster-autoscaler                   4.1.9     True        False         False      65d
console                              4.1.9     True        False         False      16d
dns                                  4.1.9     True        False         False      65d
image-registry                       4.1.9     True        False         False      2d4h
ingress                              4.1.9     True        False         False      65d
kube-apiserver                       4.1.9     True        False         False      65d
kube-controller-manager              4.1.9     True        False         False      65d
kube-scheduler                       4.1.9     True        False         False      65d
machine-api                          4.1.9     True        False         False      65d
machine-config                       4.1.8     False       True          True       5d
marketplace                          4.1.9     True        False         False      79m
monitoring                           4.1.9     True        False         False      2d4h
network                              4.1.9     True        False         False      65d
node-tuning                          4.1.9     True        False         False      2d5h
openshift-apiserver                  4.1.9     True        False         False      58m
openshift-controller-manager         4.1.9     True        False         False      65d
openshift-samples                    4.1.9     True        False         False      2d5h
operator-lifecycle-manager           4.1.9     True        False         False      65d
operator-lifecycle-manager-catalog   4.1.9     True        False         False      65d
service-ca                           4.1.9     True        False         False      65d
service-catalog-apiserver            4.1.9     True        False         False      65d
service-catalog-controller-manager   4.1.9     True        False         False      65d
storage                              4.1.9     True        False         False      2d5h

# oc describe machineconfigpools.machineconfiguration.openshift.io master
Name:     master
[...]
Paused:   false
Status:
  Conditions:
    Last Transition Time:  2019-06-06T16:15:54Z
    Message:
    Reason:
    Status:                False
    Type:                  RenderDegraded
    Last Transition Time:  2019-08-06T08:51:22Z
    Message:
    Reason:
    Status:                False
    Type:                  Updated
    Last Transition Time:  2019-08-08T06:58:27Z
    Message:
    Reason:
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2019-08-08T06:58:27Z
    Message:               Node master0.coreos.local is reporting: "during bootstrap: unexpected on-disk state validating against rendered-master-8bc942b7163ff02019d5f1c7b76103f1", Node master1.coreos.local is reporting: "during bootstrap: unexpected on-disk state validating against rendered-master-8bc942b7163ff02019d5f1c7b76103f1", Node master2.coreos.local is reporting: "during bootstrap: unexpected on-disk state validating against rendered-master-8bc942b7163ff02019d5f1c7b76103f1"
    Reason:                3 nodes are reporting degraded status on sync
    Status:                True
    Type:                  NodeDegraded
    Last Transition Time:  2019-08-06T08:51:22Z
    Message:
    Reason:                All nodes are updating to rendered-master-011012ac07673f2d2decbf9da8292e33
    Status:                True
    Type:                  Updating
[..]

# oc logs -f machine-config-daemon-nbshs
....
I0811 09:41:14.662272    5286 daemon.go:695] Current+desired config: rendered-master-8bc942b7163ff02019d5f1c7b76103f1
I0811 09:41:14.662293    5286 update.go:836] using pending config same as desired
I0811 09:41:14.670193    5286 daemon.go:435] Detected a login session before the daemon took over on first boot
I0811 09:41:14.670213    5286 daemon.go:436] Applying annotation: machineconfiguration.openshift.io/ssh
I0811 09:41:14.709376    5286 update.go:836] Validating against pending config rendered-master-8bc942b7163ff02019d5f1c7b76103f1
I0811 09:41:14.712159    5286 daemon.go:894] Validating against pending config rendered-master-8bc942b7163ff02019d5f1c7b76103f1
E0811 09:41:14.712257    5286 daemon.go:1113] parsing booted osImageURL: parsing reference: "<not pivoted>": invalid reference format
E0811 09:41:14.712288    5286 writer.go:132] Marking Degraded due to: during bootstrap: unexpected on-disk state validating against rendered-master-8b
I0811 09:41:14.717262    5286 update.go:723] logger doesn't support --jounald, grepping the journal
I0811 09:41:29.896396    5286 daemon.go:695] Current+desired config: rendered-master-8bc942b7163ff02019d5f1c7b76103f1
I0811 09:41:29.896416    5286 update.go:836] using pending config same as desired
I0811 09:41:29.905312    5286 daemon.go:435] Detected a login session before the daemon took over on first boot
I0811 09:41:29.905330    5286 daemon.go:436] Applying annotation: machineconfiguration.openshift.io/ssh
I0811 09:41:29.950745    5286 update.go:836] Validating against pending config rendered-master-8bc942b7163ff02019d5f1c7b76103f1
I0811 09:41:29.953548    5286 daemon.go:894] Validating against pending config rendered-master-8bc942b7163ff02019d5f1c7b76103f1
E0811 09:41:29.953624    5286 daemon.go:1113] parsing booted osImageURL: parsing reference: "<not pivoted>": invalid reference format
E0811 09:41:29.953639    5286 writer.go:132] Marking Degraded due to: during bootstrap: unexpected on-disk state validating against rendered-master-8b
I0811 09:41:29.958447    5286 update.go:723] logger doesn't support --jounald, grepping the journal
I0811 09:41:45.769615    5286 daemon.go:695] Current+desired config: rendered-master-8bc942b7163ff02019d5f1c7b76103f1
I0811 09:41:45.771509    5286 update.go:836] using pending config same as desired
I0811 09:41:45.781050    5286 daemon.go:435] Detected a login session before the daemon took over on first boot
I0811 09:41:45.781416    5286 daemon.go:436] Applying annotation: machineconfiguration.openshift.io/ssh
I0811 09:41:45.802674    5286 update.go:836] Validating against pending config rendered-master-8bc942b7163ff02019d5f1c7b76103f1
I0811 09:41:45.805374    5286 daemon.go:894] Validating against pending config rendered-master-8bc942b7163ff02019d5f1c7b76103f1
E0811 09:41:45.805437    5286 daemon.go:1113] parsing booted osImageURL: parsing reference: "<not pivoted>": invalid reference format
E0811 09:41:45.805452    5286 writer.go:132] Marking Degraded due to: during bootstrap: unexpected on-disk state validating against rendered-master-8b

# for host in master{0..2}; do echo $host; ssh core@$host rpm-ostree status; done
master0
State: idle
AutomaticUpdates: disabled
Deployments:
* ostree://4761842e05f09b91e50d0f1174e53de83c97d5455c13606ed5d4e8024da6221c
    Version: 410.8.20190724.0 (2019-07-24T20:02:52Z)

  pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ec1c67e595fbee3b151422643c9b0ed4ec4ca52e4d77dfc45553e4a741cee797
    CustomOrigin: Managed by pivot tool
    Version: 410.8.20190718.1 (2019-07-18T15:15:00Z)
master1
State: idle
AutomaticUpdates: disabled
Deployments:
* ostree://4761842e05f09b91e50d0f1174e53de83c97d5455c13606ed5d4e8024da6221c
    Version: 410.8.20190724.0 (2019-07-24T20:02:52Z)

  pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ec1c67e595fbee3b151422643c9b0ed4ec4ca52e4d77dfc45553e4a741cee797
    CustomOrigin: Managed by pivot tool
    Version: 410.8.20190718.1 (2019-07-18T15:15:00Z)
master2
State: idle
AutomaticUpdates: disabled
Deployments:
* ostree://4761842e05f09b91e50d0f1174e53de83c97d5455c13606ed5d4e8024da6221c
    Version: 410.8.20190724.0 (2019-07-24T20:02:52Z)

  pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ec1c67e595fbee3b151422643c9b0ed4ec4ca52e4d77dfc45553e4a741cee797
    CustomOrigin: Managed by pivot tool
    Version: 410.8.20190718.1 (2019-07-18T15:15:00Z)

- I already ran rpm-ostree rollback on the masters; it didn't help.
- I also killed all the pods in openshift-machine-config-operator, hoping it would trigger a reconcile.
- As the cluster has been in this state now for > 4 days, I wonder how to proceed...

Is there a way to force progressing/reconciling, as the state seems not to be updated:

~~~~
Last Transition Time:  2019-08-08T06:58:27Z
Message:               Node master0.coreos.local is reporting: "during bootstrap: unexpected on-disk state validating against rendered-master-8bc942b7163ff02019d5f1c7b76103f1", Node master1.coreos.local is reporting: "during bootstrap: unexpected on-disk state validating against rendered-master-8bc942b7163ff02019d5f1c7b76103f1", Node master2.coreos.local is reporting: "during bootstrap: unexpected on-disk state validating against rendered-master-8bc942b7163ff02019d5f1c7b76103f1"
Reason:                3 nodes are reporting degraded status on sync
Status:                True
Type:                  NodeDegraded
Last Transition Time:  2019-08-06T08:51:22Z
Message:
Reason:                All nodes are updating to rendered-master-011012ac07673f2d2decbf9da8292e33
Status:                True
Type:                  Updating
~~~~

In between there was an issue with updating the network as well, which I solved by killing the pods; it is still pending at 89%...
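There is no supported force-reconcile knob. One last-resort workaround that has been used to clear a wedged Degraded node is to reset the MCD's node annotations so the daemon revalidates; this is a sketch only, using the rendered config from the output above, and not something to run without understanding the consequences:

```
# Last-resort sketch: align current/desired config and clear the state so
# the MCD re-runs validation. Unsupported; take a backup of the node first.
oc patch node master0.coreos.local --type merge -p '{"metadata":{"annotations":{
  "machineconfiguration.openshift.io/currentConfig": "rendered-master-8bc942b7163ff02019d5f1c7b76103f1",
  "machineconfiguration.openshift.io/desiredConfig": "rendered-master-8bc942b7163ff02019d5f1c7b76103f1",
  "machineconfiguration.openshift.io/state": "Done"}}}'
```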
In the MCD logs (mn27w-logs.out) I'm seeing:

```
I0811 08:16:27.204094    4279 daemon.go:200] Booted osImageURL: <not pivoted> (410.8.20190724.0)
...
E0811 08:16:55.890117    4279 daemon.go:1113] parsing booted osImageURL: parsing reference: "<not pivoted>": invalid reference format
E0811 08:16:55.890140    4279 writer.go:132] Marking Degraded due to: during bootstrap: unexpected on-disk state validating against rendered-master-8bc942b7163ff02019d5f1c7b76103f1
```

Is this a UPI installation, by chance?
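That "<not pivoted>" value suggests the MCD could not read a pivot:// custom origin from the booted deployment. This can be checked on the node itself; the jq filter assumes rpm-ostree's JSON field names, so verify on your version:

```
# Show the booted deployment's origin; the MCD expects the pivot://... image
# reference here, and logs "<not pivoted>" when it is missing.
rpm-ostree status --json \
  | jq '.deployments[] | select(.booted) | {origin: .origin, customOrigin: .["custom-origin"]}'
```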
What platform is this? AWS? VMware? Bare metal? Where did you get the "bootimages"?
Some other notes: in machine-config.yaml:

```
conditions:
- lastTransitionTime: 2019-08-06T09:01:33Z
  message: Cluster not available for 4.1.9
  status: "False"
  type: Available
- lastTransitionTime: 2019-08-09T04:37:50Z
  message: Working towards 4.1.9
  status: "True"
  type: Progressing
- lastTransitionTime: 2019-08-06T09:01:33Z
  message: 'Unable to apply 4.1.9: timed out waiting for the condition during
    syncRequiredMachineConfigPools: pool master has not progressed to latest
    configuration: controller version mismatch for
    rendered-master-8bc942b7163ff02019d5f1c7b76103f1 expected
    83392b13a5c17e56656acf3f7b0031e3303fb5c0 has
    365c1cfd14de5b0e3b85e0fc815b0060f36ab955, retrying'
  reason: FailedToSync
  status: "True"
  type: Degraded
```

and in version.yaml:

```
conditions:
- lastTransitionTime: 2019-06-07T03:43:42Z
  message: Done applying 4.1.8
  status: "True"
  type: Available
- lastTransitionTime: 2019-08-11T09:22:15Z
  message: Cluster operator machine-config is still updating
  reason: ClusterOperatorNotAvailable
  status: "True"
  type: Failing
- lastTransitionTime: 2019-08-10T10:36:05Z
  message: 'Unable to apply 4.1.9: the cluster operator machine-config has not
    yet successfully rolled out'
  reason: ClusterOperatorNotAvailable
  status: "True"
  type: Progressing
- lastTransitionTime: 2019-08-07T11:32:40Z
  status: "True"
```
Looking at the nodes in the first must-gather, there seem to be only 2 masters (and 2 workers)? If this is the case, the etcd quorum guard will not support a rolling update with fewer than 3 masters.
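One way to confirm the quorum guard is what blocks the drain is to look at its PodDisruptionBudget; in 4.1 it lives in the MCO namespace (a sketch; adjust names if your layout differs):

```
# disruptionsAllowed: 0 here would explain a drain that never completes
oc -n openshift-machine-config-operator get pdb etcd-quorum-guard -o yaml
oc -n openshift-machine-config-operator get pods -o wide | grep etcd-quorum-guard
```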
I tested upgrading from 4.1.4 > 4.1.7 and then from 4.1.7 > 4.1.8 today on an AWS cluster with no issues.
(In reply to Kirsten Garrison from comment #12)
> looking at the nodes in the first must gather there seem to be only 2
> masters (and 2 workers)? If this is the case, etcd quorum guard will not
> support a rolling update for fewer than 3 masters.

Yes, in the first one I only had 2 masters, but the second must-gather log has 3.
Yes, this is a UPI install on bare metal. If by bootimages you mean the kernel/initrd/metal-bios, they came from https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/4.1/latest/ per the docs at https://cloud.redhat.com/openshift/install/metal/user-provisioned
Your comment about the quorum guard made me check the current state of the cluster: the quorum guard pod for master3 is failing, and it turns out that etcd on master3 is not healthy, complaining about TLS bad certificate issues. This issue only manifests with the upgrade: I can run a 4.1.4 cluster for two weeks without any etcd issues, and when I do the upgrade this comes up. Attaching another must-gather log including MCO and etcd.
Created attachment 1607021 [details]
must-gather-etcd-tls-errors.tar
Both the peer and server certs for etcd-2 are expired:

sh-4.4# openssl x509 -noout -text -in /etc/kubernetes/static-pod-resources/etcd-member/system\:etcd-peer\:ETCD-2.BEDROCK4.CCS.ORNL.GOV.crt
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            58:aa:3b:30:98:25:ff:f6:e3:c3:1d:4b:6f:35:71:74:00:ef:45:c5
        Signature Algorithm: sha256WithRSAEncryption
        Issuer: CN = kube-csr-signer_@1564678823
        Validity
            Not Before: Aug  1 18:57:00 2019 GMT
            Not After : Aug  2 16:23:48 2019 GMT

sh-4.4# openssl x509 -noout -text -in /etc/kubernetes/static-pod-resources/etcd-member/system\:etcd-server\:ETCD-2.BEDROCK4.CCS.ORNL.GOV.crt
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            6f:d1:4a:76:11:fc:e9:b1:86:e4:d9:ed:44:f8:2e:ce:d5:25:62:e0
        Signature Algorithm: sha256WithRSAEncryption
        Issuer: CN = kube-csr-signer_@1564678823
        Validity
            Not Before: Aug  1 18:56:00 2019 GMT
            Not After : Aug  2 16:23:08 2019 GMT
`Issuer: CN = kube-csr-signer_@1564678823`

This is not valid: etcd certs should be signed by etcd-cert-signer[1], not kube-csr-signer. etcd requires two separate chains of trust, one for the KVS (server and peer) and another for metrics. These certs are valid for 3 years:

Issuer: OU = openshift, CN = etcd-signer
Validity
    Not Before: Aug 22 12:53:00 2019 GMT
    Not After : Aug 21 12:53:00 2022 GMT

[1] https://github.com/openshift/installer/blob/release-4.1/data/data/bootstrap/files/usr/local/bin/bootkube.sh.template#L235
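A quick audit loop over all the etcd-member certs on a master, using the paths from the outputs above, makes mis-signed or expired ones easy to spot:

```
# Print issuer and validity window for every etcd-member certificate;
# anything not issued by etcd-signer (or already expired) is suspect.
for f in /etc/kubernetes/static-pod-resources/etcd-member/*.crt; do
  echo "== $f"
  openssl x509 -noout -issuer -dates -in "$f"
done
```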
Closing as this is not a bug, but a certificate configuration error.
I am afraid I have to reopen this, as I get the same error as in the bug title, but with valid certs:

[root@bastion ~]# for host in master{0..2}; do echo $host; ssh core@$host openssl x509 -noout -text -in /etc/kubernetes/static-pod-resources/etcd-member/system\:etcd-peer\:etcd-*.test.coreos.local.crt | egrep -A2 'Valid|Issue'; done
master0
        Issuer: OU = openshift, CN = etcd-signer
        Validity
            Not Before: Jun  6 15:59:00 2019 GMT
            Not After : Jun  5 15:59:00 2022 GMT
master1
        Issuer: OU = openshift, CN = etcd-signer
        Validity
            Not Before: Jun  6 15:59:00 2019 GMT
            Not After : Jun  5 15:59:00 2022 GMT
master2
        Issuer: OU = openshift, CN = etcd-signer
        Validity
            Not Before: Jun  6 15:59:00 2019 GMT
            Not After : Jun  5 15:59:00 2022 GMT
[root@bastion ~]#

Still, I do get:

[root@bastion ~]# oc describe machineconfigpool master | grep -A5 Degraded
    Type:                  RenderDegraded
    Last Transition Time:  2019-08-06T08:51:22Z
    Message:
    Reason:
    Status:                False
    Type:                  Updated
--
    Type:                  Degraded
    Last Transition Time:  2019-08-08T06:58:27Z
    Message:               Node master0.coreos.local is reporting: "during bootstrap: unexpected on-disk state validating against rendered-master-8bc942b7163ff02019d5f1c7b76103f1", Node master1.coreos.local is reporting: "during bootstrap: unexpected on-disk state validating against rendered-master-8bc942b7163ff02019d5f1c7b76103f1", Node master2.coreos.local is reporting: "during bootstrap: unexpected on-disk state validating against rendered-master-8bc942b7163ff02019d5f1c7b76103f1"
    Reason:                3 nodes are reporting degraded status on sync
    Status:                True
    Type:                  NodeDegraded
    Last Transition Time:  2019-08-06T08:51:22Z
    Message:
    Reason:                All nodes are updating to rendered-master-3444cb84703a72295e73662a1cd4d91f
    Status:                True
    Type:                  Updating
--
  Degraded Machine Count:     3
  Machine Count:              3
  Observed Generation:        12
  Ready Machine Count:        0
  Unavailable Machine Count:  3
  Updated Machine Count:      0
[root@bastion ~]#

And as this state has been the same now for at least 2 weeks, I do not reckon it will solve itself.
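When the certs check out but the MCD still reports unexpected on-disk state, it can help to list exactly which files the rendered config expects and then look for the daemon's mismatch message. The jq path below assumes the Ignition v2 schema used by 4.1 machineconfigs, the daemon pod name is illustrative, and the grep pattern is a guess at the daemon's wording:

```
# Files the degraded nodes are being validated against
oc get machineconfig rendered-master-8bc942b7163ff02019d5f1c7b76103f1 -o json \
  | jq -r '.spec.config.storage.files[].path'

# The MCD log usually names the first file or unit that mismatched
oc -n openshift-machine-config-operator logs machine-config-daemon-xxxxx \
  | grep -i mismatch
```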
(In reply to daniel from comment #22)
> I am afraid, i have to repopen this as I get the same erroro as in bug
> title, but with valid certs:

Can you open a new bug for your issue, Daniel, and include a must-gather from your cluster? That error may come from quite different situations. Thanks a lot for understanding; this way we'll have a sane way to track issues in BZ :)