Bug 1741817
| Summary: | After performing certificate recovery, the machine-config clusteroperator will be degraded | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | zhou ying <yinzhou> |
| Component: | Machine Config Operator | Assignee: | Kirsten Garrison <kgarriso> |
| Status: | CLOSED ERRATA | QA Contact: | Micah Abbott <miabbott> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.2.0 | CC: | ahoffer, akaris, amurdaca, kgarriso, mfuruta, rphillips, walters |
| Target Milestone: | --- | | |
| Target Release: | 4.2.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-10-16 06:36:19 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
zhou ying
2019-08-16 07:37:36 UTC
[root@dhcp-140-138 ~]# oc describe clusteroperator/machine-config
Name: machine-config
Namespace:
Labels: <none>
Annotations: <none>
API Version: config.openshift.io/v1
Kind: ClusterOperator
Metadata:
Creation Timestamp: 2019-08-16T01:37:48Z
Generation: 1
Resource Version: 1225508
Self Link: /apis/config.openshift.io/v1/clusteroperators/machine-config
UID: 73e29081-bfc6-11e9-9c78-06fda8b33d36
Spec:
Status:
Conditions:
Last Transition Time: 2019-08-16T02:19:17Z
Message: Cluster not available for 4.2.0-0.nightly-2019-08-15-205330
Status: False
Type: Available
Last Transition Time: 2019-08-16T01:38:51Z
Message: Cluster version is 4.2.0-0.nightly-2019-08-15-205330
Status: False
Type: Progressing
Last Transition Time: 2019-08-16T02:19:17Z
Message: Failed to resync 4.2.0-0.nightly-2019-08-15-205330 because: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 1, updated: 1, unavailable: 2)
Reason: FailedToSync
Status: True
Type: Degraded
Last Transition Time: 2019-08-16T01:38:50Z
Reason: AsExpected
Status: True
Type: Upgradeable
Extension:
Last Sync Error: error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 1, updated: 1, unavailable: 2)
Master: pool is degraded because nodes fail with "2 nodes are reporting degraded status on sync": "Node ip-10-0-157-99.ap-southeast-2.compute.internal is reporting: \"unexpected on-disk state validating against rendered-master-9a9714cbe1d227749d88902f9456b912\", Node ip-10-0-135-169.ap-southeast-2.compute.internal is reporting: \"unexpected on-disk state validating against rendered-master-96bbedbc04dde9dbd3de9089016ed76f\""
Worker: all 2 nodes are at latest configuration rendered-worker-15e1a7e6f084ad5335ca16971d1e6b2b
Related Objects:
Group:
Name: openshift-machine-config-operator
Resource: namespaces
Group: machineconfiguration.openshift.io
Name: master
Resource: machineconfigpools
Group: machineconfiguration.openshift.io
Name: worker
Resource: machineconfigpools
Group: machineconfiguration.openshift.io
Name: cluster
Resource: controllerconfigs
Versions:
Name: operator
Version: 4.2.0-0.nightly-2019-08-15-205330
Events: <none>
[root@dhcp-140-138 ~]# oc get co machine-config -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
creationTimestamp: "2019-08-16T01:37:48Z"
generation: 1
name: machine-config
resourceVersion: "1225508"
selfLink: /apis/config.openshift.io/v1/clusteroperators/machine-config
uid: 73e29081-bfc6-11e9-9c78-06fda8b33d36
spec: {}
status:
conditions:
- lastTransitionTime: "2019-08-16T02:19:17Z"
message: Cluster not available for 4.2.0-0.nightly-2019-08-15-205330
status: "False"
type: Available
- lastTransitionTime: "2019-08-16T01:38:51Z"
message: Cluster version is 4.2.0-0.nightly-2019-08-15-205330
status: "False"
type: Progressing
- lastTransitionTime: "2019-08-16T02:19:17Z"
message: 'Failed to resync 4.2.0-0.nightly-2019-08-15-205330 because: timed out
waiting for the condition during syncRequiredMachineConfigPools: error pool
master is not ready, retrying. Status: (pool degraded: true total: 3, ready
1, updated: 1, unavailable: 2)'
reason: FailedToSync
status: "True"
type: Degraded
- lastTransitionTime: "2019-08-16T01:38:50Z"
reason: AsExpected
status: "True"
type: Upgradeable
extension:
lastSyncError: 'error pool master is not ready, retrying. Status: (pool degraded:
true total: 3, ready 1, updated: 1, unavailable: 2)'
master: 'pool is degraded because nodes fail with "2 nodes are reporting degraded
status on sync": "Node ip-10-0-157-99.ap-southeast-2.compute.internal is reporting:
\"unexpected on-disk state validating against rendered-master-9a9714cbe1d227749d88902f9456b912\",
Node ip-10-0-135-169.ap-southeast-2.compute.internal is reporting: \"unexpected
on-disk state validating against rendered-master-96bbedbc04dde9dbd3de9089016ed76f\""'
worker: all 2 nodes are at latest configuration rendered-worker-15e1a7e6f084ad5335ca16971d1e6b2b
relatedObjects:
- group: ""
name: openshift-machine-config-operator
resource: namespaces
- group: machineconfiguration.openshift.io
name: master
resource: machineconfigpools
- group: machineconfiguration.openshift.io
name: worker
resource: machineconfigpools
- group: machineconfiguration.openshift.io
name: cluster
resource: controllerconfigs
versions:
- name: operator
version: 4.2.0-0.nightly-2019-08-15-205330
I wonder if this is an issue coming from https://github.com/openshift/machine-config-operator/pull/965. Looking into reproducing (those recovery steps are long!)

Can you please provide a must-gather from your cluster? I am seeing an etcd-quorum-guard pod in a pending state in the logs you pasted above.

So this bug is likely occurring because the DR instructions require the user to "Copy the /etc/kubernetes/kubelet-ca.crt file to all other master hosts and nodes." As expected, we will hit an unexpected on-disk state, as the MCO does not expect users to manually update nodes. To fix this we will have to find a way to make the necessary update of the now-expired certs and also update the Ignition config so that the content mismatch does not occur.

Some extra background: this DR scenario deals with a cluster that has been suspended for some indeterminate amount of time. As a result, the certs are expired and must be rotated so that the kubelet can restart everything.

> So this bug is likely occurring because the DR instructions require the user to "Copy the /etc/kubernetes/kubelet-ca.crt file to all other master hosts and nodes."

Right. And I guess what we're running into here is fallout from https://github.com/openshift/machine-config-operator/pull/245; specifically see https://github.com/openshift/machine-config-operator/issues/662#issuecomment-506472687. Reboot coordination is a prerequisite for automatic reconciliation.

I can't think of an easy workaround here because what we're looking for is the *old* certificate, which won't work... I guess without going full reboot coordination, we could maybe add a `force: true` field to MachineConfigPool, which would be rolled out as an annotation by the node controller only when it's trying to change config, and that would tell the MCD to skip validating.
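Purely as an illustration of that first idea (nothing below exists in the MCO API; the `force` field, annotation name, and helper functions are invented for this sketch), the pool-level flag could translate into a "skip validation" signal on the node roughly like this:

```go
package main

import "fmt"

// forceAnnotation is a hypothetical annotation name, invented for this sketch.
const forceAnnotation = "machineconfiguration.openshift.io/force"

// pool mimics only the piece of MachineConfigPool spec being discussed:
// the proposed (never implemented) force field.
type pool struct {
	Force bool
}

// annotateNode shows how a node controller could propagate the pool-level
// force flag to a node annotation, but only while a config change is rolling out.
func annotateNode(p pool, changingConfig bool, nodeAnnotations map[string]string) {
	if p.Force && changingConfig {
		nodeAnnotations[forceAnnotation] = "true"
	}
}

// shouldValidateOnDisk is what the MCD would consult instead of always
// validating: skip the on-disk check only when the annotation is present.
func shouldValidateOnDisk(nodeAnnotations map[string]string) bool {
	_, forced := nodeAnnotations[forceAnnotation]
	return !forced
}

func main() {
	annotations := map[string]string{}
	annotateNode(pool{Force: true}, true, annotations)
	fmt.Println("validate on-disk state:", shouldValidateOnDisk(annotations))
}
```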
Or...maybe this is the simplest hack:
```
diff --git a/pkg/daemon/daemon.go b/pkg/daemon/daemon.go
index be4af619..934e6993 100644
--- a/pkg/daemon/daemon.go
+++ b/pkg/daemon/daemon.go
@@ -892,8 +892,10 @@ func (dn *Daemon) checkStateOnFirstRun() error {
glog.Infof("Validating against current config %s", state.currentConfig.GetName())
expectedConfig = state.currentConfig
}
- if !dn.validateOnDiskState(expectedConfig) {
- return fmt.Errorf("unexpected on-disk state validating against %s", expectedConfig.GetName())
+ if _, err := os.Stat("/run/machine-config-daemon-force"); err != nil {
+ if !dn.validateOnDiskState(expectedConfig) {
+ return fmt.Errorf("unexpected on-disk state validating against %s", expectedConfig.GetName())
+ }
}
glog.Info("Validated on-disk state")
```
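As a standalone illustration of that gate (not the MCO source; only the marker path `/run/machine-config-daemon-force` comes from the diff above, and the validation stand-in always "fails" so the effect of the force file is visible), the control flow looks roughly like this:

```go
package main

import (
	"fmt"
	"os"
)

// forceFile is the marker path from the diff above; its presence tells the
// daemon to skip on-disk validation on first run.
const forceFile = "/run/machine-config-daemon-force"

// validateOnDiskState stands in for the real check, which compares files on
// disk against the rendered MachineConfig; it always fails in this sketch.
func validateOnDiskState() bool { return false }

// checkStateOnFirstRun mirrors the gating logic: validate only when the force
// file does not exist (os.Stat returns an error for a missing file).
func checkStateOnFirstRun() error {
	if _, err := os.Stat(forceFile); err != nil {
		if !validateOnDiskState() {
			return fmt.Errorf("unexpected on-disk state")
		}
	}
	return nil
}

func main() {
	if err := checkStateOnFirstRun(); err != nil {
		fmt.Println("daemon would mark the node degraded:", err)
		return
	}
	fmt.Println("on-disk state validated (or validation forced past)")
}
```

Since /run is a tmpfs on RHCOS, the marker disappears on the next reboot, so the skip acts as a one-shot escape hatch rather than a persistent setting.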
Update: discussed with Colin and he's working on a fix (probably using the last solution above), PR forthcoming =)

Not exactly sure, but our MCO PR alone can't close this bug, so we're also waiting to find out who will be updating the docs with an extra command as step 9.e. in the DR directions linked above:
> $ touch /run/machine-config-daemon-force
I've talked to Andrea; I am coordinating a doc PR with her.

Confirmed with payload 4.2.0-0.nightly-2019-08-28-004049: following the doc PR, the issue has been fixed.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922