Bug 1734531 - master pool stuck in degraded "during bootstrap: unexpected on-disk state validating against rendered-master-..."
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.1.z
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Kirsten Garrison
QA Contact: Micah Abbott
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-07-30 19:21 UTC by Jason Kincl
Modified: 2019-09-16 13:27 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-08-23 17:12:05 UTC
Target Upstream Version:
Embargoed:


Attachments
must-gather logs (392.30 KB, application/gzip)
2019-07-30 19:21 UTC, Jason Kincl
no flags Details
must-gather logs for fresh cluster install with error (201.58 KB, application/gzip)
2019-08-02 13:36 UTC, Jason Kincl
no flags Details
must-gather-etcd-tls-errors.tar (17.71 MB, application/x-tar)
2019-08-22 14:44 UTC, Jason Kincl
no flags Details

Description Jason Kincl 2019-07-30 19:21:56 UTC
Created attachment 1594812 [details]
must-gather logs

Description of problem:

While upgrading from 4.1.4 to 4.1.7, the master nodes were marked Degraded. The nodes were marked SchedulingDisabled during the upgrade and were never returned to a schedulable state.

There have been a few different scenarios that I have been debugging as I have dug into this issue.

Version-Release number of selected component (if applicable):

4.1.4 and 4.1.7

How reproducible:

I encountered this by upgrading from 4.1.4 to 4.1.7

Actual results:

MCO is hung and cannot progress with the 4.1.7 upgrade

Expected results:

The MCO finishes running its upgrade steps and the upgrade completes.

Additional info:

I don't really know what caused this. I have done a lot of debugging and poking around trying to unstick the process and figure out how the MCO works. There doesn't appear to be a good way to get a node out of Degraded status; I would like to know the steps required to reset the MCO into a good state.
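For reference, a minimal sketch of how to see why the MCD marked a node Degraded, assuming the standard machineconfiguration.openshift.io node annotations (currentConfig, desiredConfig, state, reason) are in use on this release:

```
# Dump the MCD-managed annotations on each master to see current vs. desired
# config and the recorded degraded reason.
for node in $(oc get nodes -l node-role.kubernetes.io/master -o name); do
  echo "== $node"
  oc get "$node" -o yaml | grep 'machineconfiguration.openshift.io/'
done
```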

Comment 1 Colin Walters 2019-07-30 19:36:56 UTC
One of the MCDs is stuck in drain:

2019-07-30T13:17:04.895320414Z I0730 13:17:04.895296    4213 update.go:836] Update prepared; beginning drain
2019-07-30T15:27:27.941121869Z I0730 15:27:27.941073    4213 update.go:134] Draining failed with: the server was unable to return a response in the time allotted, but may still be processing the request (get pods), retrying
2019-07-30T15:28:07.942531937Z I0730 15:28:07.942495    4213 update.go:134] Draining failed with: Timeout: request did not complete within requested timeout 30s, retrying
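For anyone debugging a stuck drain like this, a rough, hedged sketch of checks: the timeouts above point first at the API server itself, and otherwise a drain usually blocks on a PodDisruptionBudget with no evictions to spare. The node name below is only illustrative.

```
# Is the API server itself healthy? The "request did not complete" errors above
# suggest checking here first.
oc get clusteroperator kube-apiserver
oc -n openshift-kube-apiserver get pods

# Re-run the same kind of drain the MCD attempts, without actually evicting anything.
oc adm drain master0.coreos.local --ignore-daemonsets --delete-local-data --dry-run

# Look for PodDisruptionBudgets that currently allow zero disruptions.
oc get pdb --all-namespaces
```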

Comment 2 Antonio Murdaca 2019-07-30 19:38:45 UTC
We'll need to backport the kubernetes-drain fixes anyway, and the logs will help us figure out why it's stuck there.

Comment 3 Jason Kincl 2019-08-02 13:36:42 UTC
Created attachment 1600022 [details]
must-gather logs for fresh cluster install with error

I did a fresh install of a cluster at 4.1.4, then upgraded to 4.1.7 and then to 4.1.8, and saw the same errors. Attached are the must-gather logs from right after it happened, before I took any steps of my own to correct the issue (none of which worked).

Comment 4 Jason Kincl 2019-08-07 19:06:07 UTC
Did an upgrade of the 4.1.8 cluster to 4.1.9 and am still seeing the error while trying to progress to the newest version.

From the machine-config ClusterOperator

Unable to apply 4.1.9: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: controller version mismatch for rendered-master-77fedea5c08df3344292ab91a9ce1298 expected 83392b13a5c17e56656acf3f7b0031e3303fb5c0 has 02c07496ba0417b3e12b78fb32baf6293d314f79, retrying

Comment 5 Antonio Murdaca 2019-08-08 07:29:00 UTC
(In reply to Jason Kincl from comment #4)
> Did an upgrade of the 4.1.8 cluster to 4.1.9 and am still seeing the error
> while trying to progress to the newest version.
> 
> From the machine-config ClusterOperator
> 
> Unable to apply 4.1.9: timed out waiting for the condition during
> syncRequiredMachineConfigPools: pool master has not progressed to latest
> configuration: controller version mismatch for
> rendered-master-77fedea5c08df3344292ab91a9ce1298 expected
> 83392b13a5c17e56656acf3f7b0031e3303fb5c0 has
> 02c07496ba0417b3e12b78fb32baf6293d314f79, retrying

Does this reconcile eventually? That message is saying that the MCC hasn't generated the newest rendered machineconfigs for the new version, which is OK as the MCC may not have run yet.
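A hedged way to check whether that has happened yet, using the rendered-master name from comment 4 (the generated-by-controller-version annotation is what the mismatch message compares):

```
# Which controller version generated this rendered config?
oc get machineconfig rendered-master-77fedea5c08df3344292ab91a9ce1298 -o yaml \
  | grep generated-by-controller-version

# Has a new rendered-master config appeared since the upgrade started?
oc get machineconfigs | grep rendered-master
```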

Comment 6 Jason Kincl 2019-08-09 12:01:19 UTC
(In reply to Antonio Murdaca from comment #5)
> Does this reconcile eventually? That message is saying that the MCC hasn't
> generated the newest rendered machineconfigs for the new version which is ok
> as the MCC may have not run yet.

No, it has been trying to reconcile since I filed the bug and it has not completed.

Comment 7 daniel 2019-08-11 09:52:54 UTC
Looks like my cluster is stuck upgrading on the masters.
For background, the customer has upgraded since 4.1.1 through all versions in between.
I once ran into https://bugzilla.redhat.com/show_bug.cgi?id=1723327, which was solved by rolling back ostree on the affected nodes,
but that does not do the trick this time.


# oc get clusterversions.config.openshift.io 
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.8     True        True          23h     Working towards 4.1.9: 89% complete
# oc get clusteroperators.config.openshift.io 
NAME                                 VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                       4.1.9     True        False         False      65d
cloud-credential                     4.1.9     True        False         False      65d
cluster-autoscaler                   4.1.9     True        False         False      65d
console                              4.1.9     True        False         False      16d
dns                                  4.1.9     True        False         False      65d
image-registry                       4.1.9     True        False         False      2d4h
ingress                              4.1.9     True        False         False      65d
kube-apiserver                       4.1.9     True        False         False      65d
kube-controller-manager              4.1.9     True        False         False      65d
kube-scheduler                       4.1.9     True        False         False      65d
machine-api                          4.1.9     True        False         False      65d
machine-config                       4.1.8     False       True          True       5d
marketplace                          4.1.9     True        False         False      79m
monitoring                           4.1.9     True        False         False      2d4h
network                              4.1.9     True        False         False      65d
node-tuning                          4.1.9     True        False         False      2d5h
openshift-apiserver                  4.1.9     True        False         False      58m
openshift-controller-manager         4.1.9     True        False         False      65d
openshift-samples                    4.1.9     True        False         False      2d5h
operator-lifecycle-manager           4.1.9     True        False         False      65d
operator-lifecycle-manager-catalog   4.1.9     True        False         False      65d
service-ca                           4.1.9     True        False         False      65d
service-catalog-apiserver            4.1.9     True        False         False      65d
service-catalog-controller-manager   4.1.9     True        False         False      65d
storage                              4.1.9     True        False         False      2d5h
# 
# oc describe machineconfigpools.machineconfiguration.openshift.io master
Name:         master
[...]
  Paused:                              false
Status:
  Conditions:
    Last Transition Time:  2019-06-06T16:15:54Z
    Message:
    Reason:
    Status:                False
    Type:                  RenderDegraded
    Last Transition Time:  2019-08-06T08:51:22Z
    Message:
    Reason:
    Status:                False
    Type:                  Updated
    Last Transition Time:  2019-08-08T06:58:27Z
    Message:
    Reason:
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2019-08-08T06:58:27Z
    Message:               Node master0.coreos.local is reporting: "during bootstrap: unexpected on-disk state validating against rendered-master-8bc942b7163ff02019d5f1c7b76103f1", Node master1.coreos.local is reporting: "during bootstrap: unexpected on-disk state validating against rendered-master-8bc942b7163ff02019d5f1c7b76103f1", Node master2.coreos.local is reporting: "during bootstrap: unexpected on-disk state validating against rendered-master-8bc942b7163ff02019d5f1c7b76103f1"
    Reason:                3 nodes are reporting degraded status on sync
    Status:                True
    Type:                  NodeDegraded
    Last Transition Time:  2019-08-06T08:51:22Z
    Message:
    Reason:                All nodes are updating to rendered-master-011012ac07673f2d2decbf9da8292e33
    Status:                True
    Type:                  Updating
[..]
#
# oc logs -f machine-config-daemon-nbshs
....
I0811 09:41:14.662272    5286 daemon.go:695] Current+desired config: rendered-master-8bc942b7163ff02019d5f1c7b76103f1
I0811 09:41:14.662293    5286 update.go:836] using pending config same as desired
I0811 09:41:14.670193    5286 daemon.go:435] Detected a login session before the daemon took over on first boot
I0811 09:41:14.670213    5286 daemon.go:436] Applying annotation: machineconfiguration.openshift.io/ssh
I0811 09:41:14.709376    5286 update.go:836] Validating against pending config rendered-master-8bc942b7163ff02019d5f1c7b76103f1
I0811 09:41:14.712159    5286 daemon.go:894] Validating against pending config rendered-master-8bc942b7163ff02019d5f1c7b76103f1
E0811 09:41:14.712257    5286 daemon.go:1113] parsing booted osImageURL: parsing reference: "<not pivoted>": invalid reference format
E0811 09:41:14.712288    5286 writer.go:132] Marking Degraded due to: during bootstrap: unexpected on-disk state validating against rendered-master-8b
I0811 09:41:14.717262    5286 update.go:723] logger doesn't support --jounald, grepping the journal
I0811 09:41:29.896396    5286 daemon.go:695] Current+desired config: rendered-master-8bc942b7163ff02019d5f1c7b76103f1
I0811 09:41:29.896416    5286 update.go:836] using pending config same as desired
I0811 09:41:29.905312    5286 daemon.go:435] Detected a login session before the daemon took over on first boot
I0811 09:41:29.905330    5286 daemon.go:436] Applying annotation: machineconfiguration.openshift.io/ssh
I0811 09:41:29.950745    5286 update.go:836] Validating against pending config rendered-master-8bc942b7163ff02019d5f1c7b76103f1
I0811 09:41:29.953548    5286 daemon.go:894] Validating against pending config rendered-master-8bc942b7163ff02019d5f1c7b76103f1
E0811 09:41:29.953624    5286 daemon.go:1113] parsing booted osImageURL: parsing reference: "<not pivoted>": invalid reference format
E0811 09:41:29.953639    5286 writer.go:132] Marking Degraded due to: during bootstrap: unexpected on-disk state validating against rendered-master-8b
I0811 09:41:29.958447    5286 update.go:723] logger doesn't support --jounald, grepping the journal
I0811 09:41:45.769615    5286 daemon.go:695] Current+desired config: rendered-master-8bc942b7163ff02019d5f1c7b76103f1
I0811 09:41:45.771509    5286 update.go:836] using pending config same as desired
I0811 09:41:45.781050    5286 daemon.go:435] Detected a login session before the daemon took over on first boot
I0811 09:41:45.781416    5286 daemon.go:436] Applying annotation: machineconfiguration.openshift.io/ssh
I0811 09:41:45.802674    5286 update.go:836] Validating against pending config rendered-master-8bc942b7163ff02019d5f1c7b76103f1
I0811 09:41:45.805374    5286 daemon.go:894] Validating against pending config rendered-master-8bc942b7163ff02019d5f1c7b76103f1
E0811 09:41:45.805437    5286 daemon.go:1113] parsing booted osImageURL: parsing reference: "<not pivoted>": invalid reference format
E0811 09:41:45.805452    5286 writer.go:132] Marking Degraded due to: during bootstrap: unexpected on-disk state validating against rendered-master-8b
#

# for host in master{0..2}; do echo $host;ssh core@$host rpm-ostree status;done
master0
State: idle
AutomaticUpdates: disabled
Deployments:
* ostree://4761842e05f09b91e50d0f1174e53de83c97d5455c13606ed5d4e8024da6221c
                   Version: 410.8.20190724.0 (2019-07-24T20:02:52Z)

  pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ec1c67e595fbee3b151422643c9b0ed4ec4ca52e4d77dfc45553e4a741cee797
              CustomOrigin: Managed by pivot tool
                   Version: 410.8.20190718.1 (2019-07-18T15:15:00Z)
master1
State: idle
AutomaticUpdates: disabled
Deployments:
* ostree://4761842e05f09b91e50d0f1174e53de83c97d5455c13606ed5d4e8024da6221c
                   Version: 410.8.20190724.0 (2019-07-24T20:02:52Z)

  pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ec1c67e595fbee3b151422643c9b0ed4ec4ca52e4d77dfc45553e4a741cee797
              CustomOrigin: Managed by pivot tool
                   Version: 410.8.20190718.1 (2019-07-18T15:15:00Z)
master2
State: idle
AutomaticUpdates: disabled
Deployments:
* ostree://4761842e05f09b91e50d0f1174e53de83c97d5455c13606ed5d4e8024da6221c
                   Version: 410.8.20190724.0 (2019-07-24T20:02:52Z)

  pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ec1c67e595fbee3b151422643c9b0ed4ec4ca52e4d77dfc45553e4a741cee797
              CustomOrigin: Managed by pivot tool
                   Version: 410.8.20190718.1 (2019-07-18T15:15:00Z)

#

- I already ran rpm-ostree rollback on the masters, which didn't help
- I also killed all pods in openshift-machine-config-operator hoping it would trigger a reconcile
- as the cluster has been in that state now for > 4 days, I wonder how to proceed ... Is there a way to force progressing/reconciling, as the state seems not to be updated:
~~~~
    Last Transition Time:  2019-08-08T06:58:27Z
    Message:               Node master0.coreos.local is reporting: "during bootstrap: unexpected on-disk state validating against rendered-master-8bc942b7163ff02019d5f1c7b76103f1", Node master1.coreos.local is reporting: "during bootstrap: unexpected on-disk state validating against rendered-master-8bc942b7163ff02019d5f1c7b76103f1", Node master2.coreos.local is reporting: "during bootstrap: unexpected on-disk state validating against rendered-master-8bc942b7163ff02019d5f1c7b76103f1"
    Reason:                3 nodes are reporting degraded status on sync
    Status:                True
    Type:                  NodeDegraded
    Last Transition Time:  2019-08-06T08:51:22Z
    Message:
    Reason:                All nodes are updating to rendered-master-011012ac07673f2d2decbf9da8292e33
    Status:                True
    Type:                  Updating
~~~~

In between there was also an issue with updating the network operator, which I solved by killing its pods; the upgrade is still pending at 89% ...

Comment 9 Kirsten Garrison 2019-08-19 19:38:31 UTC
In the MCD logs (mn27w-logs.out) I'm seeing:
```
I0811 08:16:27.204094    4279 daemon.go:200] Booted osImageURL: <not pivoted> (410.8.20190724.0)
...
...
E0811 08:16:55.890117    4279 daemon.go:1113] parsing booted osImageURL: parsing reference: "<not pivoted>": invalid reference format
E0811 08:16:55.890140    4279 writer.go:132] Marking Degraded due to: during bootstrap: unexpected on-disk state validating against rendered-master-8bc942b7163ff02019d5f1c7b76103f1
```

Is this a UPI installation, by chance?
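For context, as far as I understand it the "<not pivoted>" value means the booted deployment has no pivot:// custom origin recorded, so the MCD cannot parse it as an image reference. A quick, hedged check is whether the booted (starred) deployment on each master carries a pivot:// origin, as in the rpm-ostree output from comment 7:

```
# Show only the booted (starred) deployment on each master; if it lacks a
# pivot:// CustomOrigin line, the MCD will report "<not pivoted>".
for host in master{0..2}; do
  echo "== $host"
  ssh core@$host "rpm-ostree status | grep -A3 '^\*'"
done
```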

Comment 10 Colin Walters 2019-08-19 19:53:30 UTC
What platform is this?  AWS?  VMWare?  Bare metal?  Where did you get the "bootimages"?

Comment 11 Kirsten Garrison 2019-08-19 19:59:26 UTC
Some other notes:

In machine-config.yaml:
```
  conditions:
  - lastTransitionTime: 2019-08-06T09:01:33Z
    message: Cluster not available for 4.1.9
    status: "False"
    type: Available
  - lastTransitionTime: 2019-08-09T04:37:50Z
    message: Working towards 4.1.9
    status: "True"
    type: Progressing
  - lastTransitionTime: 2019-08-06T09:01:33Z
    message: 'Unable to apply 4.1.9: timed out waiting for the condition during syncRequiredMachineConfigPools:
      pool master has not progressed to latest configuration: controller version mismatch
      for rendered-master-8bc942b7163ff02019d5f1c7b76103f1 expected 83392b13a5c17e56656acf3f7b0031e3303fb5c0
      has 365c1cfd14de5b0e3b85e0fc815b0060f36ab955, retrying'
    reason: FailedToSync
    status: "True"
    type: Degraded
```

and in version.yaml:
```
conditions:
  - lastTransitionTime: 2019-06-07T03:43:42Z
    message: Done applying 4.1.8
    status: "True"
    type: Available
  - lastTransitionTime: 2019-08-11T09:22:15Z
    message: Cluster operator machine-config is still updating
    reason: ClusterOperatorNotAvailable
    status: "True"
    type: Failing
  - lastTransitionTime: 2019-08-10T10:36:05Z
    message: 'Unable to apply 4.1.9: the cluster operator machine-config has not yet
      successfully rolled out'
    reason: ClusterOperatorNotAvailable
    status: "True"
    type: Progressing
  - lastTransitionTime: 2019-08-07T11:32:40Z
    status: "True"
```

Comment 12 Kirsten Garrison 2019-08-21 19:19:53 UTC
Looking at the nodes in the first must-gather, there seem to be only 2 masters (and 2 workers)? If that is the case, the etcd quorum guard will not allow a rolling update with fewer than 3 masters.
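For reference, the quorum guard enforces this via a PodDisruptionBudget, so with only 2 masters it never allows a disruption and the drain of a master blocks. A hedged check (in 4.1 the quorum guard objects should live in the MCO namespace; adjust if your cluster differs):

```
# Quorum guard deployment and PodDisruptionBudget (namespace assumed for 4.1).
oc -n openshift-machine-config-operator get deploy,pdb | grep -i quorum

# How many disruptions are currently allowed?
oc -n openshift-machine-config-operator get pdb etcd-quorum-guard -o yaml \
  | grep -iE 'minAvailable|disruptionsAllowed'
```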

Comment 13 Stephen Greene 2019-08-21 20:46:17 UTC
I tested upgrading from 4.1.4 > 4.1.7 and then from 4.1.7 > 4.1.8 today on an AWS cluster with no issues.

Comment 14 Jason Kincl 2019-08-22 14:09:21 UTC
(In reply to Kirsten Garrison from comment #12)
> looking at the nodes in the first must gather there seem to be only 2
> masters (and 2 workers)?  If this is the case, etcd quorum guard will not
> support a rolling update for fewer than 3 masters.

Yes, in the first one I only had 2 masters, but the second must-gather log has 3.

Comment 15 Jason Kincl 2019-08-22 14:11:47 UTC
Yes this is a UPI install on bare metal.

If by bootimages you mean the kernel/initrd/metal-bios they came from https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/4.1/latest/ per the docs at https://cloud.redhat.com/openshift/install/metal/user-provisioned

Comment 16 Jason Kincl 2019-08-22 14:33:14 UTC
Your comment about the quorum guard made me check the current state of the cluster: the quorum guard for master3 is failing, and it turns out that etcd on master3 is not healthy, complaining about TLS bad certificate issues. This issue only manifests with the upgrade; I can run a 4.1.4 cluster for two weeks without any etcd issues, and when I do the upgrade this comes up. Attaching another must-gather, including MCO and etcd logs.

Comment 17 Jason Kincl 2019-08-22 14:44:03 UTC
Created attachment 1607021 [details]
must-gather-etcd-tls-errors.tar

Comment 18 Jason Kincl 2019-08-22 15:35:24 UTC
Both the peer and server certs for etcd-2 are expired:
sh-4.4# openssl x509 -noout -text -in /etc/kubernetes/static-pod-resources/etcd-member/system\:etcd-peer\:ETCD-2.BEDROCK4.CCS.ORNL.GOV.crt
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            58:aa:3b:30:98:25:ff:f6:e3:c3:1d:4b:6f:35:71:74:00:ef:45:c5
        Signature Algorithm: sha256WithRSAEncryption
        Issuer: CN = kube-csr-signer_@1564678823
        Validity
            Not Before: Aug  1 18:57:00 2019 GMT
            Not After : Aug  2 16:23:48 2019 GMT

sh-4.4# openssl x509 -noout -text -in /etc/kubernetes/static-pod-resources/etcd-member/system\:etcd-server\:ETCD-2.BEDROCK4.CCS.ORNL.GOV.crt
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            6f:d1:4a:76:11:fc:e9:b1:86:e4:d9:ed:44:f8:2e:ce:d5:25:62:e0
        Signature Algorithm: sha256WithRSAEncryption
        Issuer: CN = kube-csr-signer_@1564678823
        Validity
            Not Before: Aug  1 18:56:00 2019 GMT
            Not After : Aug  2 16:23:08 2019 GMT
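A hedged follow-up check to flag any other already-expired certs under the same directory (openssl's -checkend 0 exits non-zero if the cert has already expired):

```
# Run on the affected master; lists any expired certs in the etcd-member
# static pod resources.
for crt in /etc/kubernetes/static-pod-resources/etcd-member/*.crt; do
  openssl x509 -noout -checkend 0 -in "$crt" >/dev/null || echo "EXPIRED: $crt"
done
```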

Comment 19 Sam Batschelet 2019-08-22 17:01:30 UTC
`Issuer: CN = kube-csr-signer_@1564678823 `

This is not valid. etcd certs should be signed by etcd-cert-signer [1], not kube-csr-signer. etcd requires 2 separate chains of trust: one for KVS (server and peer) and another for metrics. These certs are valid for 3 years.

Issuer: OU = openshift, CN = etcd-signer
  Validity
      Not Before: Aug 22 12:53:00 2019 GMT
      Not After : Aug 21 12:53:00 2022 GMT

[1] https://github.com/openshift/installer/blob/release-4.1/data/data/bootstrap/files/usr/local/bin/bootkube.sh.template#L235

Comment 20 Kirsten Garrison 2019-08-22 17:38:35 UTC
Closing as this is not a bug, but a certificate configuration error.

Comment 22 daniel 2019-08-23 05:15:43 UTC
I am afraid I have to reopen this, as I get the same error as in the bug title, but with valid certs:

[root@bastion ~]# for host in master{0..2}; do echo $host;ssh core@$host openssl x509 -noout -text -in /etc/kubernetes/static-pod-resources/etcd-member/system\:etcd-peer\:etcd-*.test.coreos.local.crt |egrep -A2 'Valid|Issue';done
master0
        Issuer: OU = openshift, CN = etcd-signer
        Validity
            Not Before: Jun  6 15:59:00 2019 GMT
            Not After : Jun  5 15:59:00 2022 GMT
master1
        Issuer: OU = openshift, CN = etcd-signer
        Validity
            Not Before: Jun  6 15:59:00 2019 GMT
            Not After : Jun  5 15:59:00 2022 GMT
master2
        Issuer: OU = openshift, CN = etcd-signer
        Validity
            Not Before: Jun  6 15:59:00 2019 GMT
            Not After : Jun  5 15:59:00 2022 GMT
[root@bastion ~]# 


still I do get 

[root@bastion ~]# oc describe machineconfigpool master|grep -A5 Degraded
    Type:                  RenderDegraded
    Last Transition Time:  2019-08-06T08:51:22Z
    Message:               
    Reason:                
    Status:                False
    Type:                  Updated
--
    Type:                  Degraded
    Last Transition Time:  2019-08-08T06:58:27Z
    Message:               Node master0.coreos.local is reporting: "during bootstrap: unexpected on-disk state validating against rendered-master-8bc942b7163ff02019d5f1c7b76103f1", Node master1.coreos.local is reporting: "during bootstrap: unexpected on-disk state validating against rendered-master-8bc942b7163ff02019d5f1c7b76103f1", Node master2.coreos.local is reporting: "during bootstrap: unexpected on-disk state validating against rendered-master-8bc942b7163ff02019d5f1c7b76103f1"
    Reason:                3 nodes are reporting degraded status on sync
    Status:                True
    Type:                  NodeDegraded
    Last Transition Time:  2019-08-06T08:51:22Z
    Message:               
    Reason:                All nodes are updating to rendered-master-3444cb84703a72295e73662a1cd4d91f
    Status:                True
    Type:                  Updating
--
  Degraded Machine Count:     3
  Machine Count:              3
  Observed Generation:        12
  Ready Machine Count:        0
  Unavailable Machine Count:  3
  Updated Machine Count:      0
[root@bastion ~]# 


and as this state has stayed the same now for at least 2 weeks, I do not reckon it will solve itself.

Comment 23 Antonio Murdaca 2019-08-23 17:12:05 UTC
(In reply to daniel from comment #22)
> I am afraid, i have to repopen this as I get the same erroro as in bug
> title, but with valid certs:
> 

Can you open a new bug for your issue, Daniel, and include a must-gather from your cluster? That error may come from quite different situations. Thanks a lot for understanding; this way we'll have a sane way to track issues in BZ :)

