Description of problem:
On a 4.2 environment, after performing the certificate recovery, the machine-config operator is always in a Degraded state.

Version-Release number of selected component (if applicable):
Payload: 4.2.0-0.nightly-2019-08-15-205330

How reproducible:
Always

Steps to Reproduce:
1. Follow the doc https://bz-1740869--ocpdocs.netlify.com/openshift-enterprise/latest/disaster_recovery/scenario-3-expired-certs.html to perform certificate recovery on the 4.2 environment.
2. Check the recovered environment's status.

Actual results:
2. The machine-config operator's status is always Degraded.

[root@dhcp-140-138 ~]# oc get co machine-config
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
machine-config   4.2.0-0.nightly-2019-08-15-205330   False       False         True       3h37m

[root@dhcp-140-138 ~]# oc get po -n openshift-machine-config-operator
NAME                                         READY   STATUS    RESTARTS   AGE
etcd-quorum-guard-6fc65496bb-5qd82           1/1     Running   0          5h13m
etcd-quorum-guard-6fc65496bb-bgjmw           1/1     Running   1          5h25m
etcd-quorum-guard-6fc65496bb-tjdkc           0/1     Pending   0          4h32m
machine-config-controller-6967446484-jl2dj   1/1     Running   0          4h32m
machine-config-daemon-5c2bb                  1/1     Running   2          5h30m
machine-config-daemon-6jtct                  1/1     Running   1          5h30m
machine-config-daemon-gp6cb                  1/1     Running   1          5h30m
machine-config-daemon-kp57x                  1/1     Running   1          5h30m
machine-config-daemon-rk56n                  1/1     Running   1          5h30m
machine-config-operator-5977f5d69f-dp6nk     1/1     Running   0          4h32m
machine-config-server-fb7vf                  1/1     Running   2          5h30m
machine-config-server-r8dzx                  1/1     Running   1          5h30m
machine-config-server-xlbcc                  1/1     Running   1          5h30m

[root@dhcp-140-138 ~]# oc exec machine-config-daemon-kp57x -n openshift-machine-config-operator date
Fri Aug 16 07:33:02 UTC 2019

[root@dhcp-140-138 ~]# oc logs -f po/machine-config-daemon-kp57x -n openshift-machine-config-operator | grep E0816
E0816 07:31:48.409923    4496 daemon.go:1268] content mismatch for file /etc/kubernetes/kubelet-ca.crt: -----BEGIN CERTIFICATE-----
E0816 07:31:48.409958    4496 writer.go:127] Marking Degraded due to: unexpected on-disk state validating against rendered-master-9a9714cbe1d227749d88902f9456b912

[root@dhcp-140-138 ~]# oc get po -n openshift-machine-api
NAME                                           READY   STATUS    RESTARTS   AGE
cluster-autoscaler-operator-5dd845686c-ssqtt   1/1     Running   1          168m
machine-api-controllers-7bf6574644-jqn96       3/3     Running   16         171m

Some error logs from the related containers:

oc logs -f po/machine-api-controllers-7bf6574644-jqn96 -c nodelink-controller -n openshift-machine-api | grep E0816
E0816 03:49:48.936328       1 reflector.go:270] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:126: Failed to watch *v1.Node: Get https://172.30.0.1:443/api/v1/nodes?resourceVersion=315990&timeoutSeconds=471&watch=true: dial tcp 172.30.0.1:443: connect: connection refused
E0816 04:11:44.107284       1 reflector.go:270] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:126: Failed to watch *v1.Node: Get https://172.30.0.1:443/api/v1/nodes?resourceVersion=456065&timeoutSeconds=478&watch=true: dial tcp 172.30.0.1:443: connect: connection refused
E0816 05:24:36.926309       1 reflector.go:270] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:126: Failed to watch *v1beta1.Machine: Get https://172.30.0.1:443/apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machines?resourceVersion=467756&timeoutSeconds=540&watch=true: dial tcp 172.30.0.1:443: connect: connection refused

[root@dhcp-140-138 ~]# oc logs -f po/machine-api-controllers-7bf6574644-jqn96 -c controller-manager -n openshift-machine-api | grep E0816
E0816 03:49:48.838418       1 reflector.go:270] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:126: Failed to watch *v1beta1.MachineSet: Get https://172.30.0.1:443/apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machinesets?resourceVersion=103744&timeoutSeconds=503&watch=true: dial tcp 172.30.0.1:443: connect: connection refused
E0816 05:24:36.997932       1 reflector.go:270] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:126: Failed to watch *v1beta1.MachineSet: Get https://172.30.0.1:443/apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machinesets?resourceVersion=467753&timeoutSeconds=593&watch=true: dial tcp 172.30.0.1:443: connect: connection refused
E0816 05:44:20.239823       1 reflector.go:270] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:126: Failed to watch *v1beta1.Machine: Get https://172.30.0.1:443/apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machines?resourceVersion=929824&timeoutSeconds=396&watch=true: dial tcp 172.30.0.1:443: connect: connection refused

Expected results:
2. Should not be degraded.

Additional info:
[root@dhcp-140-138 ~]# oc describe clusteroperator/machine-config
Name:         machine-config
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2019-08-16T01:37:48Z
  Generation:          1
  Resource Version:    1225508
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/machine-config
  UID:                 73e29081-bfc6-11e9-9c78-06fda8b33d36
Spec:
Status:
  Conditions:
    Last Transition Time:  2019-08-16T02:19:17Z
    Message:               Cluster not available for 4.2.0-0.nightly-2019-08-15-205330
    Status:                False
    Type:                  Available
    Last Transition Time:  2019-08-16T01:38:51Z
    Message:               Cluster version is 4.2.0-0.nightly-2019-08-15-205330
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2019-08-16T02:19:17Z
    Message:               Failed to resync 4.2.0-0.nightly-2019-08-15-205330 because: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 1, updated: 1, unavailable: 2)
    Reason:                FailedToSync
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2019-08-16T01:38:50Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:
    Last Sync Error:  error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 1, updated: 1, unavailable: 2)
    Master:           pool is degraded because nodes fail with "2 nodes are reporting degraded status on sync": "Node ip-10-0-157-99.ap-southeast-2.compute.internal is reporting: \"unexpected on-disk state validating against rendered-master-9a9714cbe1d227749d88902f9456b912\", Node ip-10-0-135-169.ap-southeast-2.compute.internal is reporting: \"unexpected on-disk state validating against rendered-master-96bbedbc04dde9dbd3de9089016ed76f\""
    Worker:           all 2 nodes are at latest configuration rendered-worker-15e1a7e6f084ad5335ca16971d1e6b2b
  Related Objects:
    Group:
    Name:      openshift-machine-config-operator
    Resource:  namespaces
    Group:     machineconfiguration.openshift.io
    Name:      master
    Resource:  machineconfigpools
    Group:     machineconfiguration.openshift.io
    Name:      worker
    Resource:  machineconfigpools
    Group:     machineconfiguration.openshift.io
    Name:      cluster
    Resource:  controllerconfigs
  Versions:
    Name:     operator
    Version:  4.2.0-0.nightly-2019-08-15-205330
Events:  <none>

[root@dhcp-140-138 ~]# oc get co machine-config -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: "2019-08-16T01:37:48Z"
  generation: 1
  name: machine-config
  resourceVersion: "1225508"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/machine-config
  uid: 73e29081-bfc6-11e9-9c78-06fda8b33d36
spec: {}
status:
  conditions:
  - lastTransitionTime: "2019-08-16T02:19:17Z"
    message: Cluster not available for 4.2.0-0.nightly-2019-08-15-205330
    status: "False"
    type: Available
  - lastTransitionTime: "2019-08-16T01:38:51Z"
    message: Cluster version is 4.2.0-0.nightly-2019-08-15-205330
    status: "False"
    type: Progressing
  - lastTransitionTime: "2019-08-16T02:19:17Z"
    message: 'Failed to resync 4.2.0-0.nightly-2019-08-15-205330 because: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 1, updated: 1, unavailable: 2)'
    reason: FailedToSync
    status: "True"
    type: Degraded
  - lastTransitionTime: "2019-08-16T01:38:50Z"
    reason: AsExpected
    status: "True"
    type: Upgradeable
  extension:
    lastSyncError: 'error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 1, updated: 1, unavailable: 2)'
    master: 'pool is degraded because nodes fail with "2 nodes are reporting degraded status on sync": "Node ip-10-0-157-99.ap-southeast-2.compute.internal is reporting: \"unexpected on-disk state validating against rendered-master-9a9714cbe1d227749d88902f9456b912\", Node ip-10-0-135-169.ap-southeast-2.compute.internal is reporting: \"unexpected on-disk state validating against rendered-master-96bbedbc04dde9dbd3de9089016ed76f\""'
    worker: all 2 nodes are at latest configuration rendered-worker-15e1a7e6f084ad5335ca16971d1e6b2b
  relatedObjects:
  - group: ""
    name: openshift-machine-config-operator
    resource: namespaces
  - group: machineconfiguration.openshift.io
    name: master
    resource: machineconfigpools
  - group: machineconfiguration.openshift.io
    name: worker
    resource: machineconfigpools
  - group: machineconfiguration.openshift.io
    name: cluster
    resource: controllerconfigs
  versions:
  - name: operator
    version: 4.2.0-0.nightly-2019-08-15-205330
I wonder if this is an issue coming from https://github.com/openshift/machine-config-operator/pull/965. Looking into reproducing (those recovery steps are long!)
Can you please provide a must-gather from your cluster? I am seeing an etcd-quorum-guard pod in a Pending state in the logs you pasted above.
So this bug is likely occurring because the DR instructions require the user to "Copy the /etc/kubernetes/kubelet-ca.crt file to all other master hosts and nodes." As expected, we hit an unexpected on-disk state, since the MCO does not expect users to manually update files on nodes. To fix this we will have to find a way to make the necessary update of the now-expired certs and also update the Ignition config so that the content mismatch does not occur.
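For illustration only, here is a minimal sketch of the kind of comparison that produces the "content mismatch for file /etc/kubernetes/kubelet-ca.crt" error above. This is not the MCO's actual code; managedFile and validateOnDisk are made-up names standing in for the files the rendered MachineConfig is supposed to manage:

```
package main

import (
	"bytes"
	"fmt"
	"io/ioutil"
)

// managedFile is a hypothetical stand-in for a file entry decoded from the
// rendered MachineConfig's Ignition section.
type managedFile struct {
	Path     string
	Contents []byte
}

// validateOnDisk returns an error for the first managed file whose on-disk
// contents differ from what the rendered config expects. A kubelet-ca.crt
// that was copied onto the node by hand fails this comparison.
func validateOnDisk(files []managedFile) error {
	for _, f := range files {
		onDisk, err := ioutil.ReadFile(f.Path)
		if err != nil {
			return fmt.Errorf("reading %s: %v", f.Path, err)
		}
		if !bytes.Equal(onDisk, f.Contents) {
			return fmt.Errorf("content mismatch for file %s", f.Path)
		}
	}
	return nil
}

func main() {
	// Example: one managed file whose expected contents (from the rendered
	// config) no longer match the certificate placed on disk during DR.
	files := []managedFile{
		{Path: "/etc/kubernetes/kubelet-ca.crt", Contents: []byte("-----BEGIN CERTIFICATE-----\n...expected CA from the rendered config...")},
	}
	if err := validateOnDisk(files); err != nil {
		fmt.Println("would mark the node Degraded:", err)
	}
}
```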
Some extra background: this DR scenario deals with a cluster that has been suspended for some indeterminate amount of time. As a result, the certs have expired and must be rotated so that the kubelet can restart everything.
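For context, confirming that the on-disk kubelet client CA has actually expired can be done along these lines. This is an illustrative snippet, not part of the documented DR procedure:

```
package main

import (
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"io/ioutil"
	"time"
)

func main() {
	// Parse the first certificate in the kubelet CA bundle and report
	// whether it has expired, which is the condition that triggers this
	// DR scenario.
	raw, err := ioutil.ReadFile("/etc/kubernetes/kubelet-ca.crt")
	if err != nil {
		fmt.Println("read error:", err)
		return
	}
	block, _ := pem.Decode(raw)
	if block == nil {
		fmt.Println("no PEM data found")
		return
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("CA not valid after %s (expired now: %v)\n", cert.NotAfter, time.Now().After(cert.NotAfter))
}
```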
> So this bug is likely occurring because the DR instructions require the user to "Copy the /etc/kubernetes/kubelet-ca.crt file to all other master hosts and nodes."

Right. And I guess what we're running into here is fallout from https://github.com/openshift/machine-config-operator/pull/245; specifically, see https://github.com/openshift/machine-config-operator/issues/662#issuecomment-506472687. Reboot coordination is a prerequisite for automatic reconciliation. I can't think of an easy workaround here, because what we're looking for is the *old* certificate, which won't work...
I guess without going full reboot coordination, we could maybe add a `force: true` field to MachineConfigPool, which would be rolled out as an annotation to the node by the node controller only when it's trying to change config, and that would tell the MCD to skip validating.

Or...maybe this is the simplest hack:

```
diff --git a/pkg/daemon/daemon.go b/pkg/daemon/daemon.go
index be4af619..934e6993 100644
--- a/pkg/daemon/daemon.go
+++ b/pkg/daemon/daemon.go
@@ -892,8 +892,10 @@ func (dn *Daemon) checkStateOnFirstRun() error {
 		glog.Infof("Validating against current config %s", state.currentConfig.GetName())
 		expectedConfig = state.currentConfig
 	}
-	if !dn.validateOnDiskState(expectedConfig) {
-		return fmt.Errorf("unexpected on-disk state validating against %s", expectedConfig.GetName())
+	if _, err := os.Stat("/run/machine-config-daemon-force"); err != nil {
+		if !dn.validateOnDiskState(expectedConfig) {
+			return fmt.Errorf("unexpected on-disk state validating against %s", expectedConfig.GetName())
+		}
 	}
 	glog.Info("Validated on-disk state")
```
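Pulled out of context, the gate in that diff amounts to the following self-contained sketch (names here are illustrative, not the actual daemon code; it assumes the mere presence of the force file is the signal to skip validation):

```
package main

import (
	"fmt"
	"os"
)

const forceFile = "/run/machine-config-daemon-force"

// shouldValidate reports whether the on-disk validation should still run.
// os.Stat returns an error when the file is absent (the normal case), so
// validation is only skipped after an admin has created the force file.
func shouldValidate() bool {
	_, err := os.Stat(forceFile)
	return err != nil
}

func main() {
	if shouldValidate() {
		fmt.Println("validating on-disk state against the rendered config")
	} else {
		fmt.Println("force file present; skipping on-disk validation")
	}
}
```

An admin would arm it by touching /run/machine-config-daemon-force on the node before the MCD starts, which is the extra DR step discussed below.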
Update: discussed with Colin and he's working on a fix (probably using the last solution above); PR forthcoming =)
Not exactly sure, but our MCO PR alone can't close this bug, so we're also waiting to find out who will be updating the docs with an extra command as step 9.e in the DR directions linked above:

> $ touch /run/machine-config-daemon-force
I've talked to Andrea; I am coordinating a doc PR with her.
Doc PR: https://github.com/openshift/openshift-docs/pull/16399
Confirmed with payload 4.2.0-0.nightly-2019-08-28-004049: following the doc PR, the issue has been fixed.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922