Description of problem:

We created an AWS 4.2 cluster with 3 master and 2 worker nodes, then created a new machineset to add a GPU-enabled node and successfully deployed NFD (Node Feature Discovery) and SRO (Special Resource Operator). All the nodes were ready.

We then attempted a cluster upgrade along an already approved and successful upgrade path:

From: 4.2.0-0.nightly-2019-08-26-235330
To:   4.2.0-0.nightly-2019-08-27-072819
https://openshift-release.svc.ci.openshift.org/releasetag/4.2.0-0.nightly-2019-08-27-072819

The upgrade got stuck at:

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2019-08-26-235330   True        True          30h     Unable to apply 4.2.0-0.nightly-2019-08-27-072819: the cluster operator machine-config has not yet successfully rolled out

The machine-config-daemon-tdnz2 pod corresponding to the newly added node from the new machineset was stuck in Terminating state.

MCO logs show:

I0828 04:01:41.636086       1 start.go:42] Version: v4.2.0-201908270219-dirty (30f8923eedded54ad22cf6536934202d417b7a26)
E0828 04:03:37.282947       1 event.go:247] Could not construct reference to: '&v1.ConfigMap{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"machine-config", GenerateName:"", Namespace:"openshift-machine-config-operator", SelfLink:"/api/v1/namespaces/openshift-machine-config-operator/configmaps/machine-config", UID:"a1ed9357-c93a-11e9-9782-06551ddd529e", ResourceVersion:"40778", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63702555727, loc:(*time.Location)(0x2364300)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string{"control-plane.alpha.kubernetes.io/leader":"{\"holderIdentity\":\"machine-config-operator-86bf5579fd-mzhjd_8ae1784a-c948-11e9-8c8d-0a580a800039\",\"leaseDurationSeconds\":90,\"acquireTime\":\"2019-08-28T04:03:37Z\",\"renewTime\":\"2019-08-28T04:03:37Z\",\"leaderTransitions\":1}"}, OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Data:map[string]string(nil), BinaryData:map[string][]uint8(nil)}' due to: 'no kind is registered for the type v1.ConfigMap in scheme "github.com/openshift/machine-config-operator/cmd/common/helpers.go:30"'. Will not report event: 'Normal' 'LeaderElection' 'machine-config-operator-86bf5579fd-mzhjd_8ae1784a-c948-11e9-8c8d-0a580a800039 became leader'
I0828 04:03:37.496439       1 operator.go:246] Starting MachineConfigOperator
I0828 04:03:37.501860       1 event.go:209] Event(v1.ObjectReference{Kind:"", Namespace:"", Name:"machine-config", UID:"a2001fbd-c93a-11e9-9782-06551ddd529e", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorVersionChanged' clusteroperator/machine-config-operator started a version change from [{operator 4.2.0-0.nightly-2019-08-26-235330}] to [{operator 4.2.0-0.nightly-2019-08-27-072819}]
E0828 06:43:39.194456       1 operator.go:312] timed out waiting for the condition during waitForDaemonsetRollout: Daemonset machine-config-daemon is not ready. status: (desired: 6, updated: 5, ready: 5, unavailable: 5)

Version-Release number of selected component (if applicable):
4.2.0-0.nightly-2019-08-26-235330

How reproducible:
Happened once

Steps to Reproduce:
1. Install a new 3-master, 2-worker OCP cluster in AWS from nightly build 4.2.0-0.nightly-2019-08-26-235330
2. Deploy the NFD operator:
   - cd $GOPATH/src/github.com/openshift
   - git clone https://github.com/openshift/cluster-nfd-operator
   - cd cluster-nfd-operator
   - make deploy
3. Create a new machineset to add a new g3.8xlarge GPU-enabled node (an illustrative machineset is sketched at the end of this comment):
   - oc create -f <new_gpu_enabled_node_machineset>
   - wait until the new node is Ready (oc get nodes)
4. Deploy SRO:
   - cd $GOPATH/src/github.com/
   - git clone https://github.com/zvonkok/special-resource-operator.git
   - cd special-resource-operator
   - make deploy
5. Patch the upstream registry path:
   - oc patch clusterversion/version --patch '{"spec":{"upstream":"https://openshift-release.svc.ci.openshift.org/graph"}}' --type=merge
6. Check supported upgrade versions:
# oc adm upgrade
Cluster version is 4.2.0-0.nightly-2019-08-26-235330

Updates:
VERSION                             IMAGE
4.2.0-0.nightly-2019-08-27-042756   registry.svc.ci.openshift.org/ocp/release@sha256:ec062707cbdf7b9eb56540e2d1c187c3abc78f174e038217d89d170c737a5ed6
4.2.0-0.nightly-2019-08-27-072819   registry.svc.ci.openshift.org/ocp/release@sha256:2da92dc586e09704bd3677e87eb3a9dc8e6a8be8f2ce68f2b7b223d035692005
4.2.0-0.nightly-2019-08-27-061931   registry.svc.ci.openshift.org/ocp/release@sha256:65ab2c9539ed20bf30ab1b1ea9c221c979a321f51986c903979bf9abc7deec8c
7. Upgrade the cluster:
# oc adm upgrade --force=true --to-image registry.svc.ci.openshift.org/ocp/release:4.2.0-0.nightly-2019-08-27-072819

Actual results:
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2019-08-26-235330   True        True          28m     Unable to apply 4.2.0-0.nightly-2019-08-27-072819: the cluster operator machine-config has not yet successfully rolled out

Expected results:
Upgrade succeeds to the new version.

Additional info:
Links to must-gather logs and other oc cmd logs in next comment.
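For completeness, here is a minimal sketch of what the GPU machineset in step 3 could look like. Everything in angle brackets (infrastructure ID, zone, region, AMI, etc.) is a hypothetical placeholder to be copied from an existing worker machineset in the cluster; only the g3.8xlarge instance type is the value actually used here:

apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: <infra-id>-gpu-<zone>
  namespace: openshift-machine-api
  labels:
    machine.openshift.io/cluster-api-cluster: <infra-id>
spec:
  replicas: 1
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: <infra-id>
      machine.openshift.io/cluster-api-machineset: <infra-id>-gpu-<zone>
  template:
    metadata:
      labels:
        machine.openshift.io/cluster-api-cluster: <infra-id>
        machine.openshift.io/cluster-api-machine-role: worker
        machine.openshift.io/cluster-api-machine-type: worker
        machine.openshift.io/cluster-api-machineset: <infra-id>-gpu-<zone>
    spec:
      providerSpec:
        value:
          apiVersion: awsproviderconfig.openshift.io/v1beta1
          kind: AWSMachineProviderConfig
          instanceType: g3.8xlarge            # the GPU instance type from step 3
          ami:
            id: <rhcos-ami-for-region>        # placeholder: copy from a worker machineset
          placement:
            availabilityZone: <zone>
            region: <region>
          subnet:
            filters:
            - name: tag:Name
              values:
              - <infra-id>-private-<zone>
          securityGroups:
          - filters:
            - name: tag:Name
              values:
              - <infra-id>-worker-sg
          iamInstanceProfile:
            id: <infra-id>-worker-profile
          credentialsSecret:
            name: aws-cloud-credentials
          userDataSecret:
            name: worker-user-data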
Do you have a cluster that is still broken after the upgrade that I can jump on to debug further? Could you provide the kubeconfig?
It looks like the DS for the machine-config-daemon couldn't run on the GPU-enabled node, and that goes beyond the MCO afaict. Looks like the DS pod is scheduled but the container can't run (hence the missing logs):

- lastProbeTime: null
  lastTransitionTime: 2019-08-28T04:04:48Z
  message: 'containers with unready status: [machine-config-daemon]'
  reason: ContainersNotReady
  status: "False"
  type: ContainersReady
- lastProbeTime: null
  lastTransitionTime: 2019-08-28T03:37:32Z
  status: "True"
  type: PodScheduled
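For reference, those conditions can be pulled straight from the pod with a jsonpath query along these lines (the query is just one convenient way to display them, not how they were originally captured):

# oc get pod machine-config-daemon-tdnz2 -n openshift-machine-config-operator \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}'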
More clues from the "Terminating" MCD on the GPU node:

# oc get pods -l k8s-app=machine-config-daemon --field-selector spec.nodeName=ip-10-0-143-101.us-west-2.compute.internal
NAME                          READY   STATUS        RESTARTS   AGE
machine-config-daemon-tdnz2   0/1     Terminating   0          33h

Then in the journal:

Aug 29 12:50:49 ip-10-0-143-101 hyperkube[1566]: I0829 12:50:49.371411    1566 kubelet_pods.go:934] Pod "machine-config-daemon-tdnz2_openshift-machine-config-operator(2aefcab2-c945-11e9-8b16-0a269aae9e58)" is terminated, but some volumes have not been cleaned up

Investigating further on the GPU-enabled node, I can see a volume lying around:

[root@ip-10-0-143-101 /]# mount | grep daemon
tmpfs on /run/nvidia/driver/host/run/nvidia/driver/host/var/lib/kubelet/pods/2aefcab2-c945-11e9-8b16-0a269aae9e58/volumes/kubernetes.io~secret/machine-config-daemon-token-6swft type tmpfs (rw,relatime,seclabel)
[root@ip-10-0-143-101 /]# umount /run/nvidia/driver/host/run/nvidia/driver/host/var/lib/kubelet/pods/2aefcab2-c945-11e9-8b16-0a269aae9e58/volumes/kubernetes.io~secret/machine-config-daemon-token-6swft
[root@ip-10-0-143-101 /]# mount | grep daemon
[root@ip-10-0-143-101 /]#

Then, re-grepping the journal:

Aug 29 12:58:11 ip-10-0-143-101 hyperkube[1566]: E0829 12:58:11.527171    1566 nestedpendingoperations.go:278] Operation for "\"kubernetes.io/secret/2aefcab2-c945-11e9-8b16-0a269aae9e58-machine-config-daemon-token-6swft\" (\"2aefcab2-c945-11e9-8b16-0a269aae9e58\")" failed. No retries permitted until 2019-08-29 13:00:13.5271424 +0000 UTC m=+120204.917524039 (durationBeforeRetry 2m2s). Error: "UnmountVolume.TearDown failed for volume \"machine-config-daemon-token-6swft\" (UniqueName: \"kubernetes.io/secret/2aefcab2-c945-11e9-8b16-0a269aae9e58-machine-config-daemon-token-6swft\") pod \"2aefcab2-c945-11e9-8b16-0a269aae9e58\" (UID: \"2aefcab2-c945-11e9-8b16-0a269aae9e58\") : unlinkat /var/lib/kubelet/pods/2aefcab2-c945-11e9-8b16-0a269aae9e58/volumes/kubernetes.io~secret/machine-config-daemon-token-6swft: device or resource busy"

So what I think is happening is that the NVIDIA pod (whatever it does) mounts the rootfs just like the MCD does, and that turns into a leaked mount, preventing the pod from terminating on the MCD side. I'm not sure how this is supposed to work on the NVIDIA side either.
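(For anyone retracing this: the journal entries above were grepped on the node itself. Assuming the kubelet logs to the system journal as it does on RHCOS, something like `journalctl -b | grep machine-config-daemon-tdnz2` from a root shell on the node should surface the same messages.)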
Inside the nvidia-driver-ctr, we can see the leaked mounts:

bash-4.4# mount | grep daemon
tmpfs on /host/run/nvidia/driver/host/var/lib/kubelet/pods/2aefcab2-c945-11e9-8b16-0a269aae9e58/volumes/kubernetes.io~secret/machine-config-daemon-token-6swft type tmpfs (rw,relatime,seclabel)
tmpfs on /run/nvidia/driver/host/run/nvidia/driver/host/var/lib/kubelet/pods/2aefcab2-c945-11e9-8b16-0a269aae9e58/volumes/kubernetes.io~secret/machine-config-daemon-token-6swft type tmpfs (rw,relatime,seclabel)

Pretty sure all this is because both the MCD and the NVIDIA driver ctr are mounting the rootfs. Probably a propagation issue, I'm not sure.
Unmounting both mounts inside the "nvidia-driver-ctr" starts a correct rollout and unblocks the situation here, at least, but again, I'm not really sure how this is supposed to work. I guess I'll ping the containers team here, and I'm not sure we can control what the nvidia ctr does.
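For the record, the manual unblock amounts to roughly the following. The pod name below is a hypothetical placeholder (take the real one from `oc get pods -n openshift-sro`), and the paths to unmount are whatever the grep prints:

# oc rsh -n openshift-sro -c nvidia-driver-ctr nvidia-driver-daemonset-xxxxx
bash-4.4# mount | grep machine-config-daemon-token   # list the leaked MCD secret mounts
bash-4.4# umount <each path printed above>           # release them so the kubelet can tear the volume down

Once the leaked mounts are gone, the kubelet's UnmountVolume.TearDown retry succeeds and the stuck MCD pod finally terminates.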
As a last step to reconcile the cluster, wait for the MCO update to finish and everything will be upgraded. Of course, this is not a solution, nor IMO a workaround. Anyhow, the easiest way to reproduce this (which we've seen also impacts upgrades) is to just oc delete the MCD pod running on the GPU node once the node is added, as shown below.
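Concretely, something like this (pod and node names are the ones from this bug; any MCD pod on a node running the nvidia driver container should do):

# find the MCD pod on the GPU node
oc get pods -n openshift-machine-config-operator -l k8s-app=machine-config-daemon --field-selector spec.nodeName=ip-10-0-143-101.us-west-2.compute.internal
# delete it; the DaemonSet controller recreates it immediately
oc delete pod -n openshift-machine-config-operator machine-config-daemon-tdnz2

The deleted pod should then wedge in Terminating on the leaked token mount, as described above.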
This is where the nvidia driver ctr mounts the host: https://github.com/zvonkok/special-resource-operator/blob/master/assets/state-driver/0500_daemonset.yaml#L49-L52

Do we really need the full host mount there? What for? This daemonset for nvidia requiring the host mount is also making it leak mounts from other containers that mount the host (like the MCD), causing the Termination issue explained above.
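If the full host mount really is needed, one way to keep the container from pinning other pods' volumes would be HostToContainer (rslave) mount propagation on that volumeMount, so that mount/unmount events on the host propagate into the container instead of being held by it. A minimal sketch of that shape (illustrative, not the actual SRO manifest; the image and most fields are placeholders):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-driver-daemonset
  namespace: openshift-sro
spec:
  selector:
    matchLabels:
      app: nvidia-driver-daemonset
  template:
    metadata:
      labels:
        app: nvidia-driver-daemonset
    spec:
      containers:
      - name: nvidia-driver-ctr
        image: <driver-image>            # placeholder
        securityContext:
          privileged: true
        volumeMounts:
        - name: host-root
          mountPath: /host
          # HostToContainer (rslave): unmounts on the host propagate into the
          # container, so it never keeps a dead reference to another pod's
          # volume. The default (None, i.e. private) copies whatever is mounted
          # at container start and pins it -- the "device or resource busy"
          # failure mode seen above.
          mountPropagation: HostToContainer
      volumes:
      - name: host-root
        hostPath:
          path: /

Narrowing the hostPath to only the directories the driver build actually needs would sidestep the problem entirely.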
Somehow, the SRO also doesn't seem to be able to restart the NVIDIA daemonset... is that a bug? I can see the daemonset pod trying to come up, and then Terminate itself, over and over.
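(The churn is easy to watch with `oc get pods -n openshift-sro -l app=nvidia-driver-daemonset -w`; that label selector is the one the operator itself logs below.)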
This is what that operator does right now (over and over), and the nvidia DS isn't started either:

{"level":"info","ts":1567086418.0162215,"logger":"controller_specialresource","msg":"Reconciling SpecialResource","Request.Namespace":"openshift-sro-operator","Request.Name":"gpu"}
{"level":"info","ts":1567086418.0262527,"logger":"controller_specialresource","msg":"Looking for","ServiceAccount":"nvidia-driver","Namespace":"openshift-sro"}
{"level":"info","ts":1567086418.029905,"logger":"controller_specialresource","msg":"Not found, creating","ServiceAccount":"nvidia-driver","Namespace":"openshift-sro"}
{"level":"info","ts":1567086418.034802,"logger":"controller_specialresource","msg":"Looking for","Role":"nvidia-driver","Namespace":"openshift-sro"}
{"level":"info","ts":1567086418.0377996,"logger":"controller_specialresource","msg":"Not found, creating","Role":"nvidia-driver","Namespace":"openshift-sro"}
{"level":"info","ts":1567086418.043725,"logger":"controller_specialresource","msg":"Looking for","RoleBinding":"nvidia-driver","Namespace":"openshift-sro"}
{"level":"info","ts":1567086418.0531864,"logger":"controller_specialresource","msg":"Not found, creating","RoleBinding":"nvidia-driver","Namespace":"openshift-sro"}
{"level":"info","ts":1567086418.064452,"logger":"controller_specialresource","msg":"Looking for","ConfigMap":"nvidia-driver","Namespace":"openshift-sro"}
{"level":"info","ts":1567086418.071187,"logger":"controller_specialresource","msg":"Not found, creating","ConfigMap":"nvidia-driver","Namespace":"openshift-sro"}
{"level":"info","ts":1567086418.0787044,"logger":"controller_specialresource","msg":"Looking for","SecurityContextConstraints":"nvidia-driver","Namespace":"default"}
{"level":"info","ts":1567086418.086377,"logger":"controller_specialresource","msg":"Found","SecurityContextConstraints":"nvidia-driver","Namespace":"default"}
{"level":"info","ts":1567086418.0947955,"logger":"controller_specialresource","msg":"4.18.0-80.7.2.el8_0.x86_64","Request.Namespace":"default","Request.Name":"Node"}
{"level":"info","ts":1567086418.0948265,"logger":"controller_specialresource","msg":"Looking for","DaemonSet":"nvidia-driver-daemonset","Namespace":"openshift-sro"}
{"level":"info","ts":1567086418.0973034,"logger":"controller_specialresource","msg":"Not found, creating","DaemonSet":"nvidia-driver-daemonset","Namespace":"openshift-sro"}
{"level":"info","ts":1567086418.103533,"logger":"controller_specialresource","msg":"DEBUG: DaemonSet","LabelSelector":"app=nvidia-driver-daemonset"}
{"level":"info","ts":1567086418.1089997,"logger":"controller_specialresource","msg":"DEBUG: DaemonSet","NumberOfDaemonSets":1}
{"level":"info","ts":1567086418.1091046,"logger":"controller_specialresource","msg":"DEBUG: DaemonSet","NumberUnavailable":0}
{"level":"info","ts":1567086418.109186,"logger":"controller_specialresource","msg":"DEBUG: Pod","LabelSelector":"app=nvidia-driver-daemonset"}
{"level":"info","ts":1567086418.1214077,"logger":"controller_specialresource","msg":"DEBUG: Pod","NumberOfPods":0}
{"level":"info","ts":1567086418.1214294,"logger":"controller_specialresource","msg":"SpecialResource","ResourceStatus":"NotReady"}
{"level":"info","ts":1567086418.1214664,"logger":"controller_specialresource","msg":"Reconciling SpecialResource","Request.Namespace":"openshift-sro","Request.Name":"gpu"}
{"level":"info","ts":1567086418.1294944,"logger":"controller_specialresource","msg":"Reconciling SpecialResource","Request.Namespace":"openshift-sro","Request.Name":"gpu"}
TL;DR: this issue isn't strictly related to the MCO; the MCO is just mounting /rootfs because it needs to in order to manage the node. What's happening here is that another pod (coming from https://github.com/zvonkok/special-resource-operator, as explained in comment https://bugzilla.redhat.com/show_bug.cgi?id=1746854#c9) is also mounting the / of the node. That causes mounts from other containers (the MCD) to be leaked into this pod that mounts / as well. When that happens, the kubelet gets stuck at terminating because it can't unmount a volume belonging to the MCD but mounted into the other pod (in this case it's the MCD token that kube injects).

To verify the above TL;DR, look at https://bugzilla.redhat.com/show_bug.cgi?id=1746854#c5 - you can find the kubelet logs there as well.

Anyhow, one way to unstick an upgrade or a general wedge is to jump into the other pod that mounts / and manually unmount the volumes belonging to the MCD (https://bugzilla.redhat.com/show_bug.cgi?id=1746854#c6). After a while, the kubelet reconciles the situation, can unmount the MCD volume, and proceeds to correctly terminate the pod.

I'm not sure whose issue this is; I'm also not sure the operator at https://github.com/zvonkok/special-resource-operator is a fully supported one for 4.2 (I'm saying that just because it's not under the openshift org on GitHub). As for when this happens, the reporter claims "happened once", so I'm leaning toward this being something that happens occasionally at the container runtime (maybe kernel) level. I don't see anything wrong otherwise in multiple pods mounting the rootfs, as long as mounts aren't leaked and held by other pods (causing these sorts of issues).
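As an aside, when chasing this kind of wedge it helps to find out which processes' mount namespaces are still pinning the volume. A rough sketch, run as root on the affected node (the token name is the one from this bug; the loop is just generic /proc spelunking):

# list every process whose mount namespace still references the MCD token mount
for p in /proc/[0-9]*; do
  if grep -q machine-config-daemon-token-6swft "$p/mountinfo" 2>/dev/null; then
    echo "$p $(cat "$p/comm")"
  fi
done

Anything listed there besides the kubelet and the MCD itself is a candidate for the leak.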
Spoke to Clayton about this here: https://coreos.slack.com/archives/CEKNRGF25/p1567095110087200

Since the special-resource-operator isn't in the payload, and this is an issue with that operator mounting /rootfs and ultimately causing issues for the MCO (a core operator), we believe this has to be fixed on the SRO side. There are ways to avoid leaking mounts like this as well. Since I can't find a component for the SRO, I'm going to close this, but please reopen it if there's a component (or move this upstream on GitHub).
Re-opening and assigning to correct component.
Walid, I have created a new version, please test:

# git clone https://github.com/openshift-psap/special-resource-operator.git
# cd special-resource-operator
# git checkout release-4.2
# PULLPOLICY=Always make deploy
Failed to deploy the SRO nvidia pods in namespace openshift-sro; only the special-resource-operator pod was deployed.

# git clone https://github.com/openshift-psap/special-resource-operator.git
# git checkout release-4.2
# PULLPOLICY=Always make deploy
customresourcedefinition.apiextensions.k8s.io/specialresources.sro.openshift.io created
sleep 1
for obj in namespace.yaml service_account.yaml role.yaml role_binding.yaml operator.yaml crds/sro_v1alpha1_specialresource_cr.yaml; do \
	sed 's+REPLACE_IMAGE+quay.io/openshift-psap/special-resource-operator:release-4.2+g; s+REPLACE_NAMESPACE+openshift-sro+g; s+Always+Always+' deploy/$obj | kubectl apply -f - ;\
	sleep 1 ;\
done
namespace/openshift-sro created
serviceaccount/special-resource-operator created
role.rbac.authorization.k8s.io/special-resource-operator created
clusterrole.rbac.authorization.k8s.io/special-resource-operator created
rolebinding.rbac.authorization.k8s.io/special-resource-operator created
clusterrolebinding.rbac.authorization.k8s.io/special-resource-operator created
deployment.apps/special-resource-operator created
specialresource.sro.openshift.io/example-specialresource created
specialresource.sro.openshift.io/example-specialresource unchanged

# oc get pods -n openshift-sro
NAME                                         READY   STATUS    RESTARTS   AGE
special-resource-operator-7cbb8f5d67-mqj26   1/1     Running   0          34s

# oc get events -n openshift-sro
LAST SEEN   TYPE     REASON              OBJECT                                            MESSAGE
45s         Normal   Scheduled           pod/special-resource-operator-7cbb8f5d67-mqj26    Successfully assigned openshift-sro/special-resource-operator-7cbb8f5d67-mqj26 to ip-10-0-138-35.us-west-2.compute.internal
37s         Normal   Pulling             pod/special-resource-operator-7cbb8f5d67-mqj26    Pulling image "quay.io/openshift-psap/special-resource-operator:release-4.2"
34s         Normal   Pulled              pod/special-resource-operator-7cbb8f5d67-mqj26    Successfully pulled image "quay.io/openshift-psap/special-resource-operator:release-4.2"
34s         Normal   Created             pod/special-resource-operator-7cbb8f5d67-mqj26    Created container special-resource-operator
34s         Normal   Started             pod/special-resource-operator-7cbb8f5d67-mqj26    Started container special-resource-operator
45s         Normal   SuccessfulCreate    replicaset/special-resource-operator-7cbb8f5d67   Created pod: special-resource-operator-7cbb8f5d67-mqj26
45s         Normal   ScalingReplicaSet   deployment/special-resource-operator              Scaled up replica set special-resource-operator-7cbb8f5d67 to 1

# oc logs -n openshift-sro special-resource-operator-7cbb8f5d67-mqj26
{"level":"info","ts":1570043587.0136995,"logger":"cmd","msg":"Go Version: go1.11.13"}
{"level":"info","ts":1570043587.0139394,"logger":"cmd","msg":"Go OS/Arch: linux/amd64"}
{"level":"info","ts":1570043587.0139446,"logger":"cmd","msg":"Version of operator-sdk: v0.10.0"}
{"level":"info","ts":1570043587.014225,"logger":"leader","msg":"Trying to become the leader."}
{"level":"info","ts":1570043587.150485,"logger":"leader","msg":"No pre-existing lock was found."}
{"level":"info","ts":1570043587.1559906,"logger":"leader","msg":"Became the leader."}
{"level":"info","ts":1570043587.2638056,"logger":"cmd","msg":"Registering Components."}
{"level":"info","ts":1570043587.2648346,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"specialresource-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1570043587.264936,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"specialresource-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1570043587.2650201,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"specialresource-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1570043587.265101,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"specialresource-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1570043587.2651818,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"specialresource-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1570043587.2652588,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"specialresource-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1570043587.2653337,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"specialresource-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1570043587.2654111,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"specialresource-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1570043587.2655103,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"specialresource-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1570043587.4164715,"logger":"cmd","msg":"Could not create metrics Service","error":"failed to create or get service for metrics: services \"special-resource-operator-metrics\" is forbidden: cannot set blockOwnerDeletion if an ownerReference refers to a resource you can't set finalizers on: , <nil>"}
{"level":"info","ts":1570043587.4311242,"logger":"cmd","msg":"Starting the Cmd."}
{"level":"info","ts":1570043587.5314167,"logger":"kubebuilder.controller","msg":"Starting Controller","controller":"specialresource-controller"}
{"level":"info","ts":1570043587.6317081,"logger":"kubebuilder.controller","msg":"Starting workers","controller":"specialresource-controller","worker count":1}
{"level":"info","ts":1570043587.6317961,"logger":"controller_specialresource","msg":"Reconciling SpecialResource","Request.Namespace":"","Request.Name":"gpu"}

# oc get crd
NAME                                                        CREATED AT
alertmanagers.monitoring.coreos.com                         2019-10-01T14:23:39Z
apiservers.config.openshift.io                              2019-10-01T14:16:56Z
authentications.config.openshift.io                         2019-10-01T14:16:56Z
authentications.operator.openshift.io                       2019-10-01T14:17:17Z
baremetalhosts.metal3.io                                    2019-10-01T14:19:03Z
builds.config.openshift.io                                  2019-10-01T14:16:56Z
catalogsourceconfigs.operators.coreos.com                   2019-10-01T14:17:16Z
catalogsources.operators.coreos.com                         2019-10-01T14:17:23Z
clusterautoscalers.autoscaling.openshift.io                 2019-10-01T14:17:15Z
clusternetworks.network.openshift.io                        2019-10-01T14:17:32Z
clusteroperators.config.openshift.io                        2019-10-01T14:16:54Z
clusterresourcequotas.quota.openshift.io                    2019-10-01T14:16:55Z
clusterserviceversions.operators.coreos.com                 2019-10-01T14:17:22Z
clusterversions.config.openshift.io                         2019-10-01T14:16:54Z
configs.imageregistry.operator.openshift.io                 2019-10-01T14:17:15Z
configs.samples.operator.openshift.io                       2019-10-01T14:17:15Z
consoleclidownloads.console.openshift.io                    2019-10-01T14:17:15Z
consoleexternalloglinks.console.openshift.io                2019-10-01T14:17:18Z
consolelinks.console.openshift.io                           2019-10-01T14:17:16Z
consolenotifications.console.openshift.io                   2019-10-01T14:17:20Z
consoles.config.openshift.io                                2019-10-01T14:16:56Z
consoles.operator.openshift.io                              2019-10-01T14:17:21Z
containerruntimeconfigs.machineconfiguration.openshift.io   2019-10-01T14:18:18Z
controllerconfigs.machineconfiguration.openshift.io         2019-10-01T14:18:15Z
credentialsrequests.cloudcredential.openshift.io            2019-10-01T14:17:16Z
dnses.config.openshift.io                                   2019-10-01T14:16:57Z
dnses.operator.openshift.io                                 2019-10-01T14:17:16Z
dnsrecords.ingress.operator.openshift.io                    2019-10-01T14:17:16Z
egressnetworkpolicies.network.openshift.io                  2019-10-01T14:17:33Z
featuregates.config.openshift.io                            2019-10-01T14:16:57Z
hostsubnets.network.openshift.io                            2019-10-01T14:17:32Z
imagecontentsourcepolicies.operator.openshift.io            2019-10-01T14:16:57Z
images.config.openshift.io                                  2019-10-01T14:16:57Z
infrastructures.config.openshift.io                         2019-10-01T14:16:57Z
ingresscontrollers.operator.openshift.io                    2019-10-01T14:17:18Z
ingresses.config.openshift.io                               2019-10-01T14:16:58Z
installplans.operators.coreos.com                           2019-10-01T14:17:23Z
kubeapiservers.operator.openshift.io                        2019-10-01T14:17:15Z
kubecontrollermanagers.operator.openshift.io                2019-10-01T14:17:15Z
kubeletconfigs.machineconfiguration.openshift.io            2019-10-01T14:18:17Z
kubeschedulers.operator.openshift.io                        2019-10-01T14:17:15Z
machineautoscalers.autoscaling.openshift.io                 2019-10-01T14:17:18Z
machineconfigpools.machineconfiguration.openshift.io        2019-10-01T14:18:16Z
machineconfigs.machineconfiguration.openshift.io            2019-10-01T14:18:13Z
machinedisruptionbudgets.healthchecking.openshift.io        2019-10-01T14:18:12Z
machinehealthchecks.healthchecking.openshift.io             2019-10-01T14:18:09Z
machines.machine.openshift.io                               2019-10-01T14:18:05Z
machinesets.machine.openshift.io                            2019-10-01T14:18:06Z
mcoconfigs.machineconfiguration.openshift.io                2019-10-01T14:17:20Z
netnamespaces.network.openshift.io                          2019-10-01T14:17:33Z
network-attachment-definitions.k8s.cni.cncf.io              2019-10-01T14:17:28Z
networks.config.openshift.io                                2019-10-01T14:16:58Z
networks.operator.openshift.io                              2019-10-01T14:16:59Z
nodefeaturediscoveries.nfd.openshift.io                     2019-10-02T13:20:59Z
oauths.config.openshift.io                                  2019-10-01T14:16:58Z
openshiftapiservers.operator.openshift.io                   2019-10-01T14:17:15Z
openshiftcontrollermanagers.operator.openshift.io           2019-10-01T14:17:18Z
operatorgroups.operators.coreos.com                         2019-10-01T14:17:29Z
operatorhubs.config.openshift.io                            2019-10-01T14:16:55Z
operatorsources.operators.coreos.com                        2019-10-01T14:17:18Z
podmonitors.monitoring.coreos.com                           2019-10-01T14:23:40Z
projects.config.openshift.io                                2019-10-01T14:16:58Z
prometheuses.monitoring.coreos.com                          2019-10-01T14:23:39Z
prometheusrules.monitoring.coreos.com                       2019-10-01T14:23:40Z
proxies.config.openshift.io                                 2019-10-01T14:16:55Z
rolebindingrestrictions.authorization.openshift.io          2019-10-01T14:16:55Z
schedulers.config.openshift.io                              2019-10-01T14:16:58Z
securitycontextconstraints.security.openshift.io            2019-10-01T14:16:56Z
servicecas.operator.openshift.io                            2019-10-01T14:17:18Z
servicecatalogapiservers.operator.openshift.io              2019-10-01T14:17:17Z
servicecatalogcontrollermanagers.operator.openshift.io      2019-10-01T14:17:17Z
servicemonitors.monitoring.coreos.com                       2019-10-01T14:23:39Z
specialresources.sro.openshift.io                           2019-10-02T19:12:48Z
subscriptions.operators.coreos.com                          2019-10-01T14:17:23Z
tuneds.tuned.openshift.io                                   2019-10-01T14:17:16Z

# oc get crd specialresources.sro.openshift.io -o yaml
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"apiextensions.k8s.io/v1beta1","kind":"CustomResourceDefinition","metadata":{"annotations":{},"name":"specialresources.sro.openshift.io"},"spec":{"group":"sro.openshift.io","names":{"kind":"SpecialResource","listKind":"SpecialResourceList","plural":"specialresources","singular":"specialresource"},"scope":"Namespaced","subresources":{"status":{}},"validation":{"openAPIV3Schema":{"properties":{"apiVersion":{"description":"APIVersion defines the versioned schema of this representation of an object. Servers should convert recognized schemas to the latest internal value, and may reject unrecognized values. More info: https://git.k8s.io/community/contributors/devel/api-conventions.md#resources","type":"string"},"kind":{"description":"Kind is a string value representing the REST resource this object represents. Servers may infer this from the endpoint the client submits requests to. Cannot be updated. In CamelCase. More info: https://git.k8s.io/community/contributors/devel/api-conventions.md#types-kinds","type":"string"},"metadata":{"type":"object"},"spec":{"type":"object"},"status":{"type":"object"}}}},"version":"v1alpha1","versions":[{"name":"v1alpha1","served":true,"storage":true}]}}
  creationTimestamp: "2019-10-02T19:12:48Z"
  generation: 1
  name: specialresources.sro.openshift.io
  resourceVersion: "487599"
  selfLink: /apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions/specialresources.sro.openshift.io
  uid: 9f868fd5-e548-11e9-9bbd-0a69f2192e3e
spec:
  conversion:
    strategy: None
  group: sro.openshift.io
  names:
    kind: SpecialResource
    listKind: SpecialResourceList
    plural: specialresources
    singular: specialresource
  scope: Namespaced
  subresources:
    status: {}
  validation:
    openAPIV3Schema:
      properties:
        apiVersion:
          description: 'APIVersion defines the versioned schema of this representation
            of an object. Servers should convert recognized schemas to the latest
            internal value, and may reject unrecognized values. More info: https://git.k8s.io/community/contributors/devel/api-conventions.md#resources'
          type: string
        kind:
          description: 'Kind is a string value representing the REST resource this
            object represents. Servers may infer this from the endpoint the client
            submits requests to. Cannot be updated. In CamelCase. More info: https://git.k8s.io/community/contributors/devel/api-conventions.md#types-kinds'
          type: string
        metadata:
          type: object
        spec:
          type: object
        status:
          type: object
  version: v1alpha1
  versions:
  - name: v1alpha1
    served: true
    storage: true
status:
  acceptedNames:
    kind: SpecialResource
    listKind: SpecialResourceList
    plural: specialresources
    singular: specialresource
  conditions:
  - lastTransitionTime: "2019-10-02T19:12:48Z"
    message: no conflicts found
    reason: NoConflicts
    status: "True"
    type: NamesAccepted
  - lastTransitionTime: null
    message: the initial names have been accepted
    reason: InitialNamesAccepted
    status: "True"
    type: Established
  storedVersions:
  - v1alpha1

# oc get all -n openshift-sro
NAME                                             READY   STATUS    RESTARTS   AGE
pod/special-resource-operator-7cbb8f5d67-mqj26   1/1     Running   0          2m54s

NAME                                        READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/special-resource-operator   1/1     1            1           2m54s

NAME                                                   DESIRED   CURRENT   READY   AGE
replicaset.apps/special-resource-operator-7cbb8f5d67   1         1         1       2m54s
SRO was successfully deployed after the PR update and git cloning the release-4.2 branch from https://github.com/openshift-psap/special-resource-operator.git. This was on OCP 4.2.0-0.nightly-2019-10-01-210901 on AWS.

Upgrade from version 4.2.0-0.nightly-2019-10-01-210901 to version 4.2.0-0.nightly-2019-10-02-001405 was successful:

# oc adm upgrade status
Cluster version is 4.2.0-0.nightly-2019-10-02-001405

Updates:
VERSION                             IMAGE
4.2.0-0.nightly-2019-10-02-122541   registry.svc.ci.openshift.org/ocp/release@sha256:f8f943fc4d321485cfd2559c2f610382891d8b8d60410be58d679aaf1c2cf660

# oc get pods -n openshift-machine-config-operator
NAME                                        READY   STATUS    RESTARTS   AGE
etcd-quorum-guard-85457c8498-5whbz          1/1     Running   0          26h
etcd-quorum-guard-85457c8498-dbjc7          1/1     Running   0          26h
etcd-quorum-guard-85457c8498-frhqk          1/1     Running   0          26h
machine-config-controller-6cf9b7bc4-tsztb   1/1     Running   0          26h
machine-config-daemon-22rjb                 1/1     Running   0          26h
machine-config-daemon-9ggwg                 1/1     Running   0          25h
machine-config-daemon-czj8q                 1/1     Running   0          25h
machine-config-daemon-f2pq6                 1/1     Running   0          26h
machine-config-daemon-kgtqj                 1/1     Running   0          25h
machine-config-daemon-lc5mk                 1/1     Running   0          26h
machine-config-daemon-p8rrj                 1/1     Running   0          22h
machine-config-operator-7777d6568f-xl9nw    1/1     Running   0          8m4s
machine-config-server-7tk28                 1/1     Running   0          26h
machine-config-server-fj9ps                 1/1     Running   0          26h
machine-config-server-n4bt6                 1/1     Running   0          26h

# oc logs -n openshift-machine-config-operator machine-config-operator-7777d6568f-xl9nw
I1004 03:52:42.140321       1 start.go:42] Version: v4.2.0-201910010606-dirty (b8898db9af98e5c3d6a450ae123121677b0dbcb3)
E1004 03:54:37.799449       1 event.go:247] Could not construct reference to: '&v1.ConfigMap{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"machine-config", GenerateName:"", Namespace:"openshift-machine-config-operator", SelfLink:"/api/v1/namespaces/openshift-machine-config-operator/configmaps/machine-config", UID:"fa69a4d0-e580-11e9-b32c-06c96f353c88", ResourceVersion:"499408", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63705664573, loc:(*time.Location)(0x2365480)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string{"control-plane.alpha.kubernetes.io/leader":"{\"holderIdentity\":\"machine-config-operator-7777d6568f-xl9nw_6a99a99b-e65a-11e9-a2cc-0a580a810068\",\"leaseDurationSeconds\":90,\"acquireTime\":\"2019-10-04T03:54:37Z\",\"renewTime\":\"2019-10-04T03:54:37Z\",\"leaderTransitions\":1}"}, OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Data:map[string]string(nil), BinaryData:map[string][]uint8(nil)}' due to: 'no kind is registered for the type v1.ConfigMap in scheme "github.com/openshift/machine-config-operator/cmd/common/helpers.go:30"'. Will not report event: 'Normal' 'LeaderElection' 'machine-config-operator-7777d6568f-xl9nw_6a99a99b-e65a-11e9-a2cc-0a580a810068 became leader'
I1004 03:54:38.117217       1 operator.go:246] Starting MachineConfigOperator
I1004 03:54:38.121766       1 event.go:209] Event(v1.ObjectReference{Kind:"", Namespace:"", Name:"machine-config", UID:"fa7c985d-e580-11e9-b32c-06c96f353c88", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorVersionChanged' clusteroperator/machine-config-operator started a version change from [{operator 4.2.0-0.nightly-2019-10-01-210901}] to [{operator 4.2.0-0.nightly-2019-10-02-001405}]
I1004 03:54:43.312637       1 event.go:209] Event(v1.ObjectReference{Kind:"", Namespace:"", Name:"machine-config", UID:"fa7c985d-e580-11e9-b32c-06c96f353c88", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorVersionChanged' clusteroperator/machine-config-operator version changed from [{operator 4.2.0-0.nightly-2019-10-01-210901}] to [{operator 4.2.0-0.nightly-2019-10-02-001405}]

# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.2.0-0.nightly-2019-10-02-001405   True        False         False      25h
cloud-credential                           4.2.0-0.nightly-2019-10-02-001405   True        False         False      26h
cluster-autoscaler                         4.2.0-0.nightly-2019-10-02-001405   True        False         False      26h
console                                    4.2.0-0.nightly-2019-10-02-001405   True        False         False      25h
dns                                        4.2.0-0.nightly-2019-10-02-001405   True        False         False      26h
image-registry                             4.2.0-0.nightly-2019-10-02-001405   True        False         False      25h
ingress                                    4.2.0-0.nightly-2019-10-02-001405   True        False         False      25h
insights                                   4.2.0-0.nightly-2019-10-02-001405   True        False         False      26h
kube-apiserver                             4.2.0-0.nightly-2019-10-02-001405   True        False         False      26h
kube-controller-manager                    4.2.0-0.nightly-2019-10-02-001405   True        False         False      26h
kube-scheduler                             4.2.0-0.nightly-2019-10-02-001405   True        False         False      26h
machine-api                                4.2.0-0.nightly-2019-10-02-001405   True        False         False      26h
machine-config                             4.2.0-0.nightly-2019-10-02-001405   True        False         False      26h
marketplace                                4.2.0-0.nightly-2019-10-02-001405   True        False         False      10m
monitoring                                 4.2.0-0.nightly-2019-10-02-001405   True        False         False      25h
network                                    4.2.0-0.nightly-2019-10-02-001405   True        False         False      26h
node-tuning                                4.2.0-0.nightly-2019-10-02-001405   True        False         False      11m
openshift-apiserver                        4.2.0-0.nightly-2019-10-02-001405   True        False         False      26h
openshift-controller-manager               4.2.0-0.nightly-2019-10-02-001405   True        False         False      26h
openshift-samples                          4.2.0-0.nightly-2019-10-02-001405   True        False         False      11m
operator-lifecycle-manager                 4.2.0-0.nightly-2019-10-02-001405   True        False         False      26h
operator-lifecycle-manager-catalog         4.2.0-0.nightly-2019-10-02-001405   True        False         False      26h
operator-lifecycle-manager-packageserver   4.2.0-0.nightly-2019-10-02-001405   True        False         False      7h5m
service-ca                                 4.2.0-0.nightly-2019-10-02-001405   True        False         False      26h
service-catalog-apiserver                  4.2.0-0.nightly-2019-10-02-001405   True        False         False      7h5m
service-catalog-controller-manager        4.2.0-0.nightly-2019-10-02-001405   True        False         False      7h5m
storage                                    4.2.0-0.nightly-2019-10-02-001405   True        False         False      11m
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0062