Bug 1741817
| Summary: | After performing certificate recovery, the machine-config clusteroperator will be degraded | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | zhou ying <yinzhou> |
| Component: | Machine Config Operator | Assignee: | Kirsten Garrison <kgarriso> |
| Status: | CLOSED ERRATA | QA Contact: | Micah Abbott <miabbott> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.2.0 | CC: | ahoffer, akaris, amurdaca, kgarriso, mfuruta, rphillips, walters |
| Target Milestone: | --- | | |
| Target Release: | 4.2.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-10-16 06:36:19 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
zhou ying
2019-08-16 07:37:36 UTC
[root@dhcp-140-138 ~]# oc describe clusteroperator/machine-config
Name: machine-config
Namespace:
Labels: <none>
Annotations: <none>
API Version: config.openshift.io/v1
Kind: ClusterOperator
Metadata:
Creation Timestamp: 2019-08-16T01:37:48Z
Generation: 1
Resource Version: 1225508
Self Link: /apis/config.openshift.io/v1/clusteroperators/machine-config
UID: 73e29081-bfc6-11e9-9c78-06fda8b33d36
Spec:
Status:
Conditions:
Last Transition Time: 2019-08-16T02:19:17Z
Message: Cluster not available for 4.2.0-0.nightly-2019-08-15-205330
Status: False
Type: Available
Last Transition Time: 2019-08-16T01:38:51Z
Message: Cluster version is 4.2.0-0.nightly-2019-08-15-205330
Status: False
Type: Progressing
Last Transition Time: 2019-08-16T02:19:17Z
Message: Failed to resync 4.2.0-0.nightly-2019-08-15-205330 because: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 1, updated: 1, unavailable: 2)
Reason: FailedToSync
Status: True
Type: Degraded
Last Transition Time: 2019-08-16T01:38:50Z
Reason: AsExpected
Status: True
Type: Upgradeable
Extension:
Last Sync Error: error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 1, updated: 1, unavailable: 2)
Master: pool is degraded because nodes fail with "2 nodes are reporting degraded status on sync": "Node ip-10-0-157-99.ap-southeast-2.compute.internal is reporting: \"unexpected on-disk state validating against rendered-master-9a9714cbe1d227749d88902f9456b912\", Node ip-10-0-135-169.ap-southeast-2.compute.internal is reporting: \"unexpected on-disk state validating against rendered-master-96bbedbc04dde9dbd3de9089016ed76f\""
Worker: all 2 nodes are at latest configuration rendered-worker-15e1a7e6f084ad5335ca16971d1e6b2b
Related Objects:
Group:
Name: openshift-machine-config-operator
Resource: namespaces
Group: machineconfiguration.openshift.io
Name: master
Resource: machineconfigpools
Group: machineconfiguration.openshift.io
Name: worker
Resource: machineconfigpools
Group: machineconfiguration.openshift.io
Name: cluster
Resource: controllerconfigs
Versions:
Name: operator
Version: 4.2.0-0.nightly-2019-08-15-205330
Events: <none>
[root@dhcp-140-138 ~]# oc get co machine-config -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
creationTimestamp: "2019-08-16T01:37:48Z"
generation: 1
name: machine-config
resourceVersion: "1225508"
selfLink: /apis/config.openshift.io/v1/clusteroperators/machine-config
uid: 73e29081-bfc6-11e9-9c78-06fda8b33d36
spec: {}
status:
conditions:
- lastTransitionTime: "2019-08-16T02:19:17Z"
message: Cluster not available for 4.2.0-0.nightly-2019-08-15-205330
status: "False"
type: Available
- lastTransitionTime: "2019-08-16T01:38:51Z"
message: Cluster version is 4.2.0-0.nightly-2019-08-15-205330
status: "False"
type: Progressing
- lastTransitionTime: "2019-08-16T02:19:17Z"
message: 'Failed to resync 4.2.0-0.nightly-2019-08-15-205330 because: timed out
waiting for the condition during syncRequiredMachineConfigPools: error pool
master is not ready, retrying. Status: (pool degraded: true total: 3, ready
1, updated: 1, unavailable: 2)'
reason: FailedToSync
status: "True"
type: Degraded
- lastTransitionTime: "2019-08-16T01:38:50Z"
reason: AsExpected
status: "True"
type: Upgradeable
extension:
lastSyncError: 'error pool master is not ready, retrying. Status: (pool degraded:
true total: 3, ready 1, updated: 1, unavailable: 2)'
master: 'pool is degraded because nodes fail with "2 nodes are reporting degraded
status on sync": "Node ip-10-0-157-99.ap-southeast-2.compute.internal is reporting:
\"unexpected on-disk state validating against rendered-master-9a9714cbe1d227749d88902f9456b912\",
Node ip-10-0-135-169.ap-southeast-2.compute.internal is reporting: \"unexpected
on-disk state validating against rendered-master-96bbedbc04dde9dbd3de9089016ed76f\""'
worker: all 2 nodes are at latest configuration rendered-worker-15e1a7e6f084ad5335ca16971d1e6b2b
relatedObjects:
- group: ""
name: openshift-machine-config-operator
resource: namespaces
- group: machineconfiguration.openshift.io
name: master
resource: machineconfigpools
- group: machineconfiguration.openshift.io
name: worker
resource: machineconfigpools
- group: machineconfiguration.openshift.io
name: cluster
resource: controllerconfigs
versions:
- name: operator
version: 4.2.0-0.nightly-2019-08-15-205330
I wonder if this is an issue coming from https://github.com/openshift/machine-config-operator/pull/965. Looking into reproducing (those recovery steps are long!)

Can you please provide a must-gather from your cluster? I am seeing an etcd-quorum-guard pod in a pending state in the logs you pasted above.

So this bug is likely occurring because the DR instructions require the user to "Copy the /etc/kubernetes/kubelet-ca.crt file to all other master hosts and nodes." As expected, we will hit an unexpected on-disk state, as the MCO does not expect users to manually update nodes. To fix this we will have to find a way to make the necessary update of the now-expired certs and also update the Ignition config so that the content mismatch does not occur.

Some extra background: this DR scenario deals with a cluster that has been suspended for some indeterminate amount of time. As a result, the certs are expired and must be rotated so that the kubelet can restart everything.

> So this bug is likely occurring because the DR instructions require the user to "Copy the /etc/kubernetes/kubelet-ca.crt file to all other master hosts and nodes."

Right. And I guess what we're running into here is fallout from https://github.com/openshift/machine-config-operator/pull/245; specifically see https://github.com/openshift/machine-config-operator/issues/662#issuecomment-506472687. Reboot coordination is a prerequisite for automatic reconciliation.

I can't think of an easy workaround here because what we're looking for is the *old* certificate, which won't work... I guess without going full reboot coordination, we could maybe add a `force: true` field to MachineConfigPool, which would be rolled out as an annotation by the node controller only when it's trying to change config, and that would tell the MCD to skip validating.
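Purely as an illustration of that first idea (nothing below exists in the MCO API; the `force` field, annotation name, and helper functions are invented for this sketch), the pool-level flag could translate into a "skip validation" signal on the node roughly like this:

```go
package main

import "fmt"

// forceAnnotation is a hypothetical annotation name, invented for this sketch.
const forceAnnotation = "machineconfiguration.openshift.io/force"

// pool mimics only the piece of MachineConfigPool spec being discussed:
// the proposed (never implemented) force field.
type pool struct {
	Force bool
}

// annotateNode shows how a node controller could propagate the pool-level
// force flag to a node annotation, but only while a config change is rolling out.
func annotateNode(p pool, changingConfig bool, nodeAnnotations map[string]string) {
	if p.Force && changingConfig {
		nodeAnnotations[forceAnnotation] = "true"
	}
}

// shouldValidateOnDisk is what the MCD would consult instead of always
// validating: skip the on-disk check only when the annotation is present.
func shouldValidateOnDisk(nodeAnnotations map[string]string) bool {
	_, forced := nodeAnnotations[forceAnnotation]
	return !forced
}

func main() {
	annotations := map[string]string{}
	annotateNode(pool{Force: true}, true, annotations)
	fmt.Println("validate on-disk state:", shouldValidateOnDisk(annotations))
}
```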
Or...maybe this is the simplest hack:
```
diff --git a/pkg/daemon/daemon.go b/pkg/daemon/daemon.go
index be4af619..934e6993 100644
--- a/pkg/daemon/daemon.go
+++ b/pkg/daemon/daemon.go
@@ -892,8 +892,10 @@ func (dn *Daemon) checkStateOnFirstRun() error {
glog.Infof("Validating against current config %s", state.currentConfig.GetName())
expectedConfig = state.currentConfig
}
- if !dn.validateOnDiskState(expectedConfig) {
- return fmt.Errorf("unexpected on-disk state validating against %s", expectedConfig.GetName())
+ if _, err := os.Stat("/run/machine-config-daemon-force"); err != nil {
+ if !dn.validateOnDiskState(expectedConfig) {
+ return fmt.Errorf("unexpected on-disk state validating against %s", expectedConfig.GetName())
+ }
}
glog.Info("Validated on-disk state")
```
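As a standalone illustration of that gate (not the MCO source; only the marker path `/run/machine-config-daemon-force` comes from the diff above, and the validation stand-in always "fails" so the effect of the force file is visible), the control flow looks roughly like this:

```go
package main

import (
	"fmt"
	"os"
)

// forceFile is the marker path from the diff above; its presence tells the
// daemon to skip on-disk validation on first run.
const forceFile = "/run/machine-config-daemon-force"

// validateOnDiskState stands in for the real check, which compares files on
// disk against the rendered MachineConfig; it always fails in this sketch.
func validateOnDiskState() bool { return false }

// checkStateOnFirstRun mirrors the gating logic: validate only when the force
// file does not exist (os.Stat returns an error for a missing file).
func checkStateOnFirstRun() error {
	if _, err := os.Stat(forceFile); err != nil {
		if !validateOnDiskState() {
			return fmt.Errorf("unexpected on-disk state")
		}
	}
	return nil
}

func main() {
	if err := checkStateOnFirstRun(); err != nil {
		fmt.Println("daemon would mark the node degraded:", err)
		return
	}
	fmt.Println("on-disk state validated (or validation forced past)")
}
```

Since /run is a tmpfs on RHCOS, the marker disappears on the next reboot, so the skip acts as a one-shot escape hatch rather than a persistent setting.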
Update: discussed with Colin and he's working on a fix (probably using the last solution above), PR forthcoming =)

Not exactly sure, but our MCO PR alone can't close this bug, so we're also waiting to find out who will be updating the docs with an extra command as step 9.e. in the DR directions linked above:
> $ touch /run/machine-config-daemon-force
I've talked to Andrea; I am coordinating a doc PR with her.

Confirmed with payload 4.2.0-0.nightly-2019-08-28-004049: following the doc PR, the issue has been fixed.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922