Description of problem:
Used an OCP 4.7 build. Observed that the machine-config cluster operator is in a Degraded=True state.

Version-Release number of selected component (if applicable):

# oc version
Client Version: 4.7.0-0.nightly-ppc64le-2020-12-08-141649
Server Version: 4.7.0-0.nightly-ppc64le-2020-12-08-141649
Kubernetes Version: v1.19.2+ad738ba

# oc get co
NAME                                       VERSION                                     AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.7.0-0.nightly-ppc64le-2020-12-08-141649   True        False         True       4h17m
baremetal                                  4.7.0-0.nightly-ppc64le-2020-12-08-141649   True        False         False      17h
cloud-credential                           4.7.0-0.nightly-ppc64le-2020-12-08-141649   True        False         False      18h
cluster-autoscaler                         4.7.0-0.nightly-ppc64le-2020-12-08-141649   True        False         False      17h
config-operator                            4.7.0-0.nightly-ppc64le-2020-12-08-141649   True        False         False      17h
console                                    4.7.0-0.nightly-ppc64le-2020-12-08-141649   True        False         False      4h25m
csi-snapshot-controller                    4.7.0-0.nightly-ppc64le-2020-12-08-141649   True        False         False      16h
dns                                        4.7.0-0.nightly-ppc64le-2020-12-08-141649   True        False         False      17h
etcd                                       4.7.0-0.nightly-ppc64le-2020-12-08-141649   True        False         False      17h
image-registry                             4.7.0-0.nightly-ppc64le-2020-12-08-141649   True        False         False      17h
ingress                                    4.7.0-0.nightly-ppc64le-2020-12-08-141649   True        False         False      4h30m
insights                                   4.7.0-0.nightly-ppc64le-2020-12-08-141649   True        False         False      17h
kube-apiserver                             4.7.0-0.nightly-ppc64le-2020-12-08-141649   True        False         False      17h
kube-controller-manager                    4.7.0-0.nightly-ppc64le-2020-12-08-141649   True        False         False      17h
kube-scheduler                             4.7.0-0.nightly-ppc64le-2020-12-08-141649   True        False         False      17h
kube-storage-version-migrator              4.7.0-0.nightly-ppc64le-2020-12-08-141649   True        False         False      4h30m
machine-api                                4.7.0-0.nightly-ppc64le-2020-12-08-141649   True        False         False      17h
machine-approver                           4.7.0-0.nightly-ppc64le-2020-12-08-141649   True        False         False      17h
machine-config                                                                         False       True          True       17h
marketplace                                4.7.0-0.nightly-ppc64le-2020-12-08-141649   True        False         False      16h
monitoring                                 4.7.0-0.nightly-ppc64le-2020-12-08-141649   True        False         False      4h27m
network                                    4.7.0-0.nightly-ppc64le-2020-12-08-141649   True        False         False      4h29m
node-tuning                                4.7.0-0.nightly-ppc64le-2020-12-08-141649   True        False         False      17h
openshift-apiserver                        4.7.0-0.nightly-ppc64le-2020-12-08-141649   True        False         False      16h
openshift-controller-manager               4.7.0-0.nightly-ppc64le-2020-12-08-141649   True        True          False      17h
openshift-samples                          4.7.0-0.nightly-ppc64le-2020-12-08-141649   True        False         False      17h
operator-lifecycle-manager                 4.7.0-0.nightly-ppc64le-2020-12-08-141649   True        False         False      17h
operator-lifecycle-manager-catalog         4.7.0-0.nightly-ppc64le-2020-12-08-141649   True        False         False      17h
operator-lifecycle-manager-packageserver   4.7.0-0.nightly-ppc64le-2020-12-08-141649   True        False         False      4h23m
service-ca                                 4.7.0-0.nightly-ppc64le-2020-12-08-141649   True        False         False      17h
storage                                    4.7.0-0.nightly-ppc64le-2020-12-08-141649   True        False         False      17h

# oc get pods --all-namespaces | grep -v "Running\|Completed"
NAMESPACE                  NAME                      READY   STATUS                 RESTARTS   AGE
openshift-kube-apiserver   installer-4-master-2      0/1     CreateContainerError   0          17h
openshift-kube-apiserver   kube-apiserver-master-1   0/5     Init:0/1               0          34s

# oc describe co machine-config
Name:         machine-config
Namespace:
Labels:       <none>
Annotations:  exclude.release.openshift.io/internal-openshift-hosted: true
              include.release.openshift.io/self-managed-high-availability: true
              include.release.openshift.io/single-node-developer: true
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2020-12-09T14:31:30Z
  Generation:          1
  Managed Fields:
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:exclude.release.openshift.io/internal-openshift-hosted:
          f:include.release.openshift.io/self-managed-high-availability:
          f:include.release.openshift.io/single-node-developer:
      f:spec:
      f:status:
        .:
        f:versions:
    Manager:      cluster-version-operator
    Operation:    Update
    Time:         2020-12-09T14:31:30Z
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
        f:extension:
          .:
          f:master:
          f:worker:
        f:relatedObjects:
    Manager:         machine-config-operator
    Operation:       Update
    Time:            2020-12-10T09:19:27Z
  Resource Version:  456067
  Self Link:         /apis/config.openshift.io/v1/clusteroperators/machine-config
  UID:               f050562f-9ca7-4047-81aa-334885cbcc0b
Spec:
Status:
  Conditions:
    Last Transition Time:  2020-12-09T15:38:12Z
    Message:               Working towards 4.7.0-0.nightly-ppc64le-2020-12-08-141649
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2020-12-09T15:41:44Z
    Reason:                One or more machine config pool is degraded, please see `oc get mcp` for further details and resolve before upgrading
    Status:                False
    Type:                  Upgradeable
    Last Transition Time:  2020-12-09T15:52:21Z
    Message:               Unable to apply 4.7.0-0.nightly-ppc64le-2020-12-08-141649: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 3)
    Reason:                RequiredPoolsFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2020-12-09T15:52:21Z
    Message:               Cluster not available for 4.7.0-0.nightly-ppc64le-2020-12-08-141649
    Status:                False
    Type:                  Available
  Extension:
    Master:  pool is degraded because nodes fail with "3 nodes are reporting degraded status on sync": "Node master-0 is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-f577f6a0e84fd5116925f21e284d2d3b\\\" not found\", Node master-2 is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-f577f6a0e84fd5116925f21e284d2d3b\\\" not found\", Node master-1 is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-f577f6a0e84fd5116925f21e284d2d3b\\\" not found\""
    Worker:  all 2 nodes are at latest configuration rendered-worker-40b7876d3bc63acdabb74b34a23b4cf2
  Related Objects:
    Group:
    Name:      openshift-machine-config-operator
    Resource:  namespaces
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  machineconfigpools
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  controllerconfigs
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  kubeletconfigs
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  containerruntimeconfigs
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  machineconfigs
    Group:
    Name:
    Resource:  nodes
Events:        <none>
The detail provided above is not enough to analyze what's going on. Can you please provide a must-gather log?
Getting the following error with the must-gather output:

# oc adm must-gather
[must-gather      ] OUT Using must-gather plug-in image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e18b2922002f09bd3a367ec760fa8974625adbf30e2338339a2fb7d2c4030d37
[must-gather      ] OUT namespace/openshift-must-gather-d9dfn created
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-kqbxb created
[must-gather      ] OUT pod for plug-in image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e18b2922002f09bd3a367ec760fa8974625adbf30e2338339a2fb7d2c4030d37 created
[must-gather-n979w] OUT gather did not start: timed out waiting for the condition
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-kqbxb deleted
[must-gather      ] OUT namespace/openshift-must-gather-d9dfn deleted
error: gather did not start for pod must-gather-n979w: timed out waiting for the condition

Let me know if any specific logs are required.
Hi Sinny,

I am getting the same behavior as mentioned above (machine-config cluster operator in a degraded state).

# oc version
Client Version: 4.7.0-0.nightly-ppc64le-2020-12-04-050650
Server Version: 4.7.0-0.nightly-ppc64le-2020-12-04-050650
Kubernetes Version: v1.19.2+ad738ba

# oc get pods --all-namespaces | grep -v "Running\|Completed"
NAMESPACE   NAME   READY   STATUS   RESTARTS   AGE

# oc describe co machine-config
Name:         machine-config
Namespace:
Labels:       <none>
Annotations:  exclude.release.openshift.io/internal-openshift-hosted: true
              include.release.openshift.io/self-managed-high-availability: true
              include.release.openshift.io/single-node-developer: true
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2020-12-10T13:46:14Z
  Generation:          1
  Managed Fields:
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:exclude.release.openshift.io/internal-openshift-hosted:
          f:include.release.openshift.io/self-managed-high-availability:
          f:include.release.openshift.io/single-node-developer:
      f:spec:
      f:status:
        .:
        f:versions:
    Manager:      cluster-version-operator
    Operation:    Update
    Time:         2020-12-10T13:46:15Z
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
        f:extension:
          .:
          f:master:
          f:worker:
        f:relatedObjects:
    Manager:         machine-config-operator
    Operation:       Update
    Time:            2020-12-11T10:17:51Z
  Resource Version:  506473
  Self Link:         /apis/config.openshift.io/v1/clusteroperators/machine-config
  UID:               dcc0400e-e48a-47e6-af2c-5d4bd37e96f9
Spec:
Status:
  Conditions:
    Last Transition Time:  2020-12-10T13:55:00Z
    Message:               Working towards 4.7.0-0.nightly-ppc64le-2020-12-04-050650
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2020-12-10T13:56:59Z
    Reason:                One or more machine config pool is degraded, please see `oc get mcp` for further details and resolve before upgrading
    Status:                False
    Type:                  Upgradeable
    Last Transition Time:  2020-12-10T14:19:03Z
    Message:               Unable to apply 4.7.0-0.nightly-ppc64le-2020-12-04-050650: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 3)
    Reason:                RequiredPoolsFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2020-12-10T14:19:03Z
    Message:               Cluster not available for 4.7.0-0.nightly-ppc64le-2020-12-04-050650
    Status:                False
    Type:                  Available
  Extension:
    Master:  pool is degraded because nodes fail with "3 nodes are reporting degraded status on sync": "Node master-0 is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-4815cfdfa836bd807b316b48bbd134a6\\\" not found\", Node master-1 is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-4815cfdfa836bd807b316b48bbd134a6\\\" not found\", Node master-2 is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-4815cfdfa836bd807b316b48bbd134a6\\\" not found\""
    Worker:  all 2 nodes are at latest configuration rendered-worker-516f195712733d60a1a38afba9946dbe
  Related Objects:
    Group:
    Name:      openshift-machine-config-operator
    Resource:  namespaces
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  machineconfigpools
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  controllerconfigs
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  kubeletconfigs
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  containerruntimeconfigs
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  machineconfigs
    Group:
    Name:
    Resource:  nodes
Events:        <none>

Must-gather logs are shared here: https://drive.google.com/file/d/1JGO1nYWqrdDj87Vk5PXS4XCm1a29-sDi/view?usp=sharing (shared with both sinny and alisha). Let me know if anything else is required to debug this issue.

Regards,
Amit
Hi, Could you give general view access to that must-gather? I've also requested access to help take a look. For rendered-xxx not found issues, we also need the "day 1 config" on the nodes to check for the mismatch. Could you follow the steps in https://github.com/openshift/machine-config-operator/issues/2114#issuecomment-700122866 and gather that for us as well?
Hi, I have added you to the must-gather logs, and access has been given to the day-1 config logs from all my cluster nodes: https://drive.google.com/drive/folders/1W_5kN_NAcbWmlHzFoTE3OFw38TqBCpVo?usp=sharing
Possibly another data point. I installed 4.7.0-0.nightly-ppc64le-2020-12-14-080110, and got `machine-config` operator Degraded, while `authentication` is fine (not Degraded). If you want something from this cluster, let me know.
@Amit Can you please update the must-gather to have general viewer permission?

I just launched a cluster successfully using 4.7.0-0.nightly-2020-12-14-165231, so it's unclear to me whether this is a persistent problem or not.

Looking at the nightly page, the builds reported here aren't available. I'm not super familiar with 4.7.0-0.nightly-ppc64le; they seem to be specific builds, and all of the reports here seem to be using them. Where are those images coming from? How do they differ from what's in https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/#4.7.0-0.nightly ? What type of cluster is this?
Based on the Slack thread https://coreos.slack.com/archives/CH76YSYSC/p1608039447264800 this seems to be something called Power, and the failures seem specific to that. Is there someone who owns it?
To supplement a bit, your cluster looks very odd. For one, there are three rendered-master configs being referenced in various places, when there is only one on the system. This is the first time I've seen this.

So in your operator status (as well as in all 3 daemon pods on the masters), it's looking for something called "rendered-master-4815cfdfa836bd807b316b48bbd134a6", which should have been the rendered config these masters booted with, generated by the bootstrap machine-config-controller. This is what's supposed to be in "/etc/mcs-machine-config-content.json", BUT if you look at the file contents you gave for /etc/machine-config-content.json, you'll see instead it's rendered-master-7aaca65a4c6888af4ad9b13674624cd4, something different. And finally, the one on the system generated by the master MCC is named rendered-master-29d0c4d6ab430c17be92453ecc000935, yet another config. So I don't even know what would have been in rendered-master-4815cfdfa836bd807b316b48bbd134a6.

On top of that, if you compare the contents of the two machineconfigs, you'll see that there is quite a big difference. The one on the system:
1. has the CA (which the bootstrap one doesn't)
2. is missing /etc/kubernetes/apiserver-url.env
3. is missing /etc/systemd/system/kubelet.service.d/20-logging.conf

So depending on what added those, that might be a good place to start checking.

Again, to reiterate:
1. rendered-master-4815cfdfa836bd807b316b48bbd134a6 -> what the master node thinks its desired config is. Did you modify that yourself?
2. rendered-master-7aaca65a4c6888af4ad9b13674624cd4 -> what the master was served (its config at the beginning) -> we have this, but it's not referenced anywhere
3. rendered-master-29d0c4d6ab430c17be92453ecc000935 -> the actual running config, which is missing the above stanzas

I feel like how you're deploying the MCO is probably causing the above drift.
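The kind of content drift described above can be checked mechanically. Below is a minimal sketch, not tied to the actual dumps from this cluster: the file lists are illustrative stand-ins, and with real data you would load each rendered MachineConfig (e.g. from `oc get machineconfig <name> -o yaml`) and compare `spec.config.storage.files[].path` between the bootstrap-rendered and in-cluster-rendered configs:

```python
# Sketch: diff the Ignition-managed file paths of two rendered MachineConfigs.
# The dicts below are hypothetical stand-ins shaped like MachineConfig specs,
# not real dumps from the cluster in this bug.

def ignition_paths(machineconfig: dict) -> set:
    """Collect the Ignition file paths carried by a MachineConfig dict."""
    files = (machineconfig.get("spec", {})
                          .get("config", {})
                          .get("storage", {})
                          .get("files", []))
    return {f["path"] for f in files}

bootstrap_mc = {"spec": {"config": {"storage": {"files": [
    {"path": "/etc/kubernetes/apiserver-url.env"},
    {"path": "/etc/systemd/system/kubelet.service.d/20-logging.conf"},
]}}}}

in_cluster_mc = {"spec": {"config": {"storage": {"files": [
    {"path": "/etc/kubernetes/ca.crt"},  # stand-in for "has the CA"
]}}}}

only_bootstrap = ignition_paths(bootstrap_mc) - ignition_paths(in_cluster_mc)
only_in_cluster = ignition_paths(in_cluster_mc) - ignition_paths(bootstrap_mc)
print("only in bootstrap config: ", sorted(only_bootstrap))
print("only in in-cluster config:", sorted(only_in_cluster))
```

Any non-empty difference would point at which component added (or failed to add) those stanzas.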
Ah, I just noticed something:

in-cluster: "machineconfiguration.openshift.io/generated-by-controller-version": "bb2630cee2ee42a39638e968da9601c726467494"
bootstrap:  "machineconfiguration.openshift.io/generated-by-controller-version": "34d1cf050f528b220e95a32b747c48a5004ce1f0"

Normally these are the same. It sounds like you have a different controller version deployed at bootstrap than in-cluster?
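This class of mismatch is easy to flag by comparing the annotation across rendered configs. A rough sketch (the helper name is mine; the two hashes are the ones quoted in this comment, used here as sample metadata):

```python
# Sketch: detect a bootstrap vs. in-cluster machine-config-controller mismatch
# by comparing the generated-by-controller-version annotation.
ANNOTATION = "machineconfiguration.openshift.io/generated-by-controller-version"

def controller_versions(rendered_configs: dict) -> dict:
    """Map each rendered config's label to the MCC commit that generated it."""
    return {name: meta.get("annotations", {}).get(ANNOTATION)
            for name, meta in rendered_configs.items()}

# Sample metadata using the two hashes observed on this cluster.
configs = {
    "in-cluster": {"annotations": {ANNOTATION: "bb2630cee2ee42a39638e968da9601c726467494"}},
    "bootstrap":  {"annotations": {ANNOTATION: "34d1cf050f528b220e95a32b747c48a5004ce1f0"}},
}

versions = controller_versions(configs)
# Normally every rendered config carries the same controller version;
# more than one distinct value indicates drifted deployments.
if len(set(versions.values())) > 1:
    print("controller version mismatch:", versions)
```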
Noting that this is not a standard deployment:

  uid: 2cccd45e-232a-474c-ad82-5be1c6a2d58a
spec:
  cloudConfig:
    name: ""
  platformSpec:
    type: None
status:
  ...
  platform: None
  platformStatus:
    type: None

Aside from Yu Qi's points above:
- Can you please give us info about these deployments? How does an install for this differ from an OCP install?
- Can you give us steps to reproduce?
- When did you start seeing this failure? How often is it failing?

Poking around, I found some jobs that seem to use the same nightlies (4.7.0-0.nightly-ppc64le...): https://testgrid.k8s.io/redhat-openshift-ocp-release-4.7-informing#release-openshift-origin-installer-e2e-remote-libvirt - but I'm not seeing the MCP failure reflected there (the master pools are not degraded). Could someone explain why?

Moving over to multi-arch to get more info about this.
We don't think this bug will be resolved before the end of the current sprint (Dec 26th). So I'm adding UpcomingSprint for this sprint.
Just adding that, after a conversation with Hiro, this is only seen on a PowerVS/VM deployment; we have not seen it in a libvirt environment or a plain bare-metal env.
My cluster doesn't show `authentication` operator as Degraded, but still shows the same (no version and Degraded) for `machine-config` operator. And given #10, there may be something that Power automation does differently than x86 during install? Yussuf @yshaikh, can you take a look?
(Sorry, new to BZ, reposting my previous comment with better link/cc, etc..) My cluster doesn't show `authentication` operator as Degraded, but still shows the same (no version and Degraded) for `machine-config` operator. And given https://bugzilla.redhat.com/show_bug.cgi?id=1906321#c10, there may be something that Power automation does differently than x86 during install? Yussuf @yshaikh, can you take a look?
I do not think anything changed between these builds in the automation we have. Also, I think new Power VM deploys have been working fine for the past couple of days. Maybe @aprabhu can confirm this?
After running some more tests on PowerVM, we see that this issue occurs only when a cluster proxy is used for the installation. Let me know if you need any additional logs for this.

# oc get co machine-config
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
machine-config             False       True          True       4d20h

# oc get proxy cluster -oyaml
apiVersion: config.openshift.io/v1
kind: Proxy
metadata:
  creationTimestamp: "2020-12-17T07:39:41Z"
  generation: 1
  managedFields:
  - apiVersion: config.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        .: {}
        f:httpProxy: {}
        f:httpsProxy: {}
        f:noProxy: {}
        f:trustedCA:
          .: {}
          f:name: {}
      f:status:
        .: {}
        f:httpProxy: {}
        f:httpsProxy: {}
    manager: cluster-bootstrap
    operation: Update
    time: "2020-12-17T07:39:41Z"
  - apiVersion: config.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:noProxy: {}
    manager: cluster-network-operator
    operation: Update
    time: "2020-12-17T07:48:15Z"
  name: cluster
  resourceVersion: "3248"
  uid: 0eaa8cae-cf1f-4564-8e79-c0015ad9c5dd
spec:
  httpProxy: http://pravind-47test-bastion-0:3128
  httpsProxy: http://pravind-47test-bastion-0:3128
  noProxy: .pravind-47test.redhat.com,192.168.26.0/24
  trustedCA:
    name: ""
status:
  httpProxy: http://pravind-47test-bastion-0:3128
  httpsProxy: http://pravind-47test-bastion-0:3128
  noProxy: .cluster.local,.pravind-47test.redhat.com,.svc,10.0.0.0/16,10.128.0.0/14,127.0.0.1,172.30.0.0/16,192.168.26.0/24,api-int.pravind-47test.redhat.com,etcd-0.,etcd-1.,etcd-2.,localhost
After chatting with the Power team, it was determined that this bug is a "partial" blocker, as it blocks the team's testing when an external proxy is enabled. Therefore, I'm changing the "blocker" flag to "Blocker+".
Relaying a message from Prashanth. Hi all, could someone check if this bug is the same issue as BZ 1901034?
Yes, it does appear to be similar. The details below are from a cluster on PowerVM and are identical to the findings in BZ 1901034.

1. The value for etcdDiscoveryDomain is empty:

# oc get infrastructures.config.openshift.io cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  creationTimestamp: "2020-12-31T06:51:25Z"
  generation: 1
  managedFields:
  - apiVersion: config.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        .: {}
        f:cloudConfig:
          .: {}
          f:name: {}
        f:platformSpec:
          .: {}
          f:type: {}
      f:status:
        .: {}
        f:apiServerInternalURI: {}
        f:apiServerURL: {}
        f:etcdDiscoveryDomain: {}
        f:infrastructureName: {}
        f:platform: {}
        f:platformStatus:
          .: {}
          f:type: {}
    manager: cluster-bootstrap
    operation: Update
    time: "2020-12-31T06:51:25Z"
  name: cluster
  resourceVersion: "541"
  uid: ef112996-3e44-4cae-bcd6-e1ffa941d0b9
spec:
  cloudConfig:
    name: ""
  platformSpec:
    type: None
status:
  apiServerInternalURI: https://api-int.satwsin-latest.169.48.22.245.nip.io:6443
  apiServerURL: https://api.satwsin-latest.169.48.22.245.nip.io:6443
  etcdDiscoveryDomain: ""
  infrastructureName: satwsin-latest-97p77
  platform: None
  platformStatus:
    type: None

2. The etcd FQDN entries for noProxy are missing the domain:

# oc get proxies.config.openshift.io cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Proxy
metadata:
  creationTimestamp: "2020-12-31T06:51:26Z"
  generation: 1
  managedFields:
  - apiVersion: config.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        .: {}
        f:httpProxy: {}
        f:httpsProxy: {}
        f:noProxy: {}
        f:trustedCA:
          .: {}
          f:name: {}
      f:status:
        .: {}
        f:httpProxy: {}
        f:httpsProxy: {}
    manager: cluster-bootstrap
    operation: Update
    time: "2020-12-31T06:51:26Z"
  - apiVersion: config.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:noProxy: {}
    manager: cluster-network-operator
    operation: Update
    time: "2020-12-31T07:01:01Z"
  name: cluster
  resourceVersion: "3409"
  uid: 2cae1f6c-0576-4cf1-87c8-1a76b426a7ac
spec:
  httpProxy: http://satwsin-latest-bastion-0:3128
  httpsProxy: http://satwsin-latest-bastion-0:3128
  noProxy: .satwsin-latest.169.48.22.245.nip.io,192.168.25.0/24
  trustedCA:
    name: ""
status:
  httpProxy: http://satwsin-latest-bastion-0:3128
  httpsProxy: http://satwsin-latest-bastion-0:3128
  noProxy: .cluster.local,.satwsin-latest.169.48.22.245.nip.io,.svc,10.0.0.0/16,10.128.0.0/14,127.0.0.1,172.30.0.0/16,192.168.25.0/24,api-int.satwsin-latest.169.48.22.245.nip.io,etcd-0.,etcd-1.,etcd-2.,localhost
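The malformed entries (etcd-0., etcd-1., etcd-2. are bare host labels with a trailing dot and no domain, consistent with an empty etcdDiscoveryDomain) can be spotted with a small check; a sketch over the status.noProxy string from this cluster:

```python
# Flag noProxy entries that end in a bare dot, i.e. a hostname whose
# domain suffix was never appended (what an empty etcdDiscoveryDomain
# would produce for the etcd-N. entries).
no_proxy = (".cluster.local,.satwsin-latest.169.48.22.245.nip.io,.svc,"
            "10.0.0.0/16,10.128.0.0/14,127.0.0.1,172.30.0.0/16,"
            "192.168.25.0/24,api-int.satwsin-latest.169.48.22.245.nip.io,"
            "etcd-0.,etcd-1.,etcd-2.,localhost")

malformed = [entry for entry in no_proxy.split(",")
             if entry.endswith(".") and entry != "."]
print(malformed)  # -> ['etcd-0.', 'etcd-1.', 'etcd-2.']
```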
Closing this bug: as evident in Comment 22, it shares identical findings with BZ 1901034. The explanation of the regression that caused this bug can be found here: https://bugzilla.redhat.com/show_bug.cgi?id=1909502#c3

Please feel free to re-open if the issue still occurs after the other bug has been fixed, or if a similar bug with variant errors occurs.

*** This bug has been marked as a duplicate of bug 1901034 ***
Verified the bug on 4.7.0-0.nightly-ppc64le-2021-01-24-004926.

For a fresh install with global proxy enabled on 4.7.0-0.nightly-ppc64le-2021-01-24-004926, after installation completed successfully, checked the noProxy list set in proxy/cluster:

CO status:
---
machine-config   4.7.0-0.nightly-ppc64le-2021-01-24-004926   True   False   False   143m

# oc get proxy cluster -o yaml
---
status:
  noProxy: .cluster.local,.satwsin1-proxy.redhat.com,.svc,10.0.0.0/16,10.128.0.0/14,127.0.0.1,172.30.0.0/16,9.114.96.0/22,api-int.satwsin1-proxy.redhat.com,localhost
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days