Bug 1746854 - OCP 4.2 - machine-config operator fails to rollout during cluster upgrade after adding new machineset
Summary: OCP 4.2 - machine-config operator fails to rollout during cluster upgrade after adding new machineset
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Special Resource Operator
Version: 4.2.0
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.3.0
Assignee: Zvonko Kosic
QA Contact: Walid A.
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-08-29 11:47 UTC by Walid A.
Modified: 2020-03-11 01:41 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-01-23 11:05:32 UTC
Target Upstream Version:
Embargoed:


Attachments: none


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2020:0062 0 None None None 2020-01-23 11:06:00 UTC

Description Walid A. 2019-08-29 11:47:03 UTC
Description of problem:
We created an AWS 4.2 cluster with 3 master and 2 worker nodes, then created a new machineset to add a GPU-enabled node and successfully deployed NFD (Node Feature Discovery) and SRO (Special Resource Operator). All nodes were Ready. We then attempted a cluster upgrade along an already approved and successful upgrade path:
From: 4.2.0-0.nightly-2019-08-26-235330
To: 4.2.0-0.nightly-2019-08-27-072819
https://openshift-release.svc.ci.openshift.org/releasetag/4.2.0-0.nightly-2019-08-27-072819

Upgrade got stuck at: 
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2019-08-26-235330   True        True          30h     Unable to apply 4.2.0-0.nightly-2019-08-27-072819: the cluster operator machine-config has not yet successfully rolled out


The machine-config-daemon-tdnz2 pod on the newly added node (created by the new machineset) was stuck in the Terminating state.

MCO logs show:
I0828 04:01:41.636086       1 start.go:42] Version: v4.2.0-201908270219-dirty (30f8923eedded54ad22cf6536934202d417b7a26)
E0828 04:03:37.282947       1 event.go:247] Could not construct reference to: '&v1.ConfigMap{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"machine-config", GenerateName:"", Namespace:"openshift-machine-config-operator", SelfLink:"/api/v1/namespaces/openshift-machine-config-operator/configmaps/machine-config", UID:"a1ed9357-c93a-11e9-9782-06551ddd529e", ResourceVersion:"40778", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63702555727, loc:(*time.Location)(0x2364300)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string{"control-plane.alpha.kubernetes.io/leader":"{\"holderIdentity\":\"machine-config-operator-86bf5579fd-mzhjd_8ae1784a-c948-11e9-8c8d-0a580a800039\",\"leaseDurationSeconds\":90,\"acquireTime\":\"2019-08-28T04:03:37Z\",\"renewTime\":\"2019-08-28T04:03:37Z\",\"leaderTransitions\":1}"}, OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Data:map[string]string(nil), BinaryData:map[string][]uint8(nil)}' due to: 'no kind is registered for the type v1.ConfigMap in scheme "github.com/openshift/machine-config-operator/cmd/common/helpers.go:30"'. Will not report event: 'Normal' 'LeaderElection' 'machine-config-operator-86bf5579fd-mzhjd_8ae1784a-c948-11e9-8c8d-0a580a800039 became leader'
I0828 04:03:37.496439       1 operator.go:246] Starting MachineConfigOperator
I0828 04:03:37.501860       1 event.go:209] Event(v1.ObjectReference{Kind:"", Namespace:"", Name:"machine-config", UID:"a2001fbd-c93a-11e9-9782-06551ddd529e", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorVersionChanged' clusteroperator/machine-config-operator started a version change from [{operator 4.2.0-0.nightly-2019-08-26-235330}] to [{operator 4.2.0-0.nightly-2019-08-27-072819}]
E0828 06:43:39.194456       1 operator.go:312] timed out waiting for the condition during waitForDaemonsetRollout: Daemonset machine-config-daemon is not ready. status: (desired: 6, updated: 5, ready: 5, unavailable: 5)


Version-Release number of selected component (if applicable):
4.2.0-0.nightly-2019-08-26-235330

How reproducible:
Happened once

Steps to Reproduce:
1. Install a new OCP cluster on AWS with 3 master and 2 worker nodes from nightly build 4.2.0-0.nightly-2019-08-26-235330
2. Deploy NFD operator:  
   - cd $GOPATH/src/github.com/openshift
   - git clone https://github.com/openshift/cluster-nfd-operator
   - cd cluster-nfd-operator
   - make deploy
3. Create a new machineset to add a new g3.8xlarge GPU-enabled node (an illustrative machineset sketch follows these steps)
   oc create -f <new_gpu_enabled_node_machineset>
   wait until the new node is Ready (oc get nodes)
4. Deploy SRO
   - cd $GOPATH/src/github.com/
   - git clone https://github.com/zvonkok/special-resource-operator.git
   - cd special-resource-operator
   - make deploy
5. Patch the upstream registry path:
   - oc patch clusterversion/version --patch '{"spec":{"upstream":"https://openshift-release.svc.ci.openshift.org/graph"}}' --type=merge
6. Check supported upgrade versions:
   - # oc adm upgrade
Cluster version is 4.2.0-0.nightly-2019-08-26-235330

Updates:

VERSION                           IMAGE
4.2.0-0.nightly-2019-08-27-042756 registry.svc.ci.openshift.org/ocp/release@sha256:ec062707cbdf7b9eb56540e2d1c187c3abc78f174e038217d89d170c737a5ed6
4.2.0-0.nightly-2019-08-27-072819 registry.svc.ci.openshift.org/ocp/release@sha256:2da92dc586e09704bd3677e87eb3a9dc8e6a8be8f2ce68f2b7b223d035692005
4.2.0-0.nightly-2019-08-27-061931 registry.svc.ci.openshift.org/ocp/release@sha256:65ab2c9539ed20bf30ab1b1ea9c221c979a321f51986c903979bf9abc7deec8c

7. Upgrade the cluster:
# oc adm upgrade --force=true --to-image registry.svc.ci.openshift.org/ocp/release:4.2.0-0.nightly-2019-08-27-072819
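
For reference, here is a hedged sketch of how the machineset in step 3 can be built (the actual file used in this bug was not attached; all names below are illustrative). A common approach is to copy an existing worker machineset and change only the instance type:

# pick an existing worker machineset as a starting point
oc -n openshift-machine-api get machinesets
oc -n openshift-machine-api get machineset <existing-worker-machineset> -o yaml > gpu-machineset.yaml
# edit gpu-machineset.yaml:
#   - rename metadata.name and the machine.openshift.io/cluster-api-machineset
#     label under spec.selector.matchLabels and spec.template.metadata.labels
#   - set spec.replicas to 1
#   - set spec.template.spec.providerSpec.value.instanceType to g3.8xlarge
oc create -f gpu-machineset.yaml
# wait for the new machine/node to come up
oc -n openshift-machine-api get machines -w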



Actual results:
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2019-08-26-235330   True        True          28m     Unable to apply 4.2.0-0.nightly-2019-08-27-072819: the cluster operator machine-config has not yet successfully rolled out


Expected results:
Upgrade completes successfully to the new version.

Additional info:
Links to must-gather logs and other oc cmd logs in next comment

Comment 2 Antonio Murdaca 2019-08-29 12:18:51 UTC
Do you have a currently broken cluster after the upgrade that I can jump on to debug further? Could you provide the kubeconfig?

Comment 3 Antonio Murdaca 2019-08-29 12:24:26 UTC
It looks like the DS for the machine-config-daemon couldn't run on the GPU-enabled node, and that goes beyond the MCO AFAICT. It looks like the DS pod is scheduled, but the container can't run (hence the missing logs):

  - lastProbeTime: null
    lastTransitionTime: 2019-08-28T04:04:48Z
    message: 'containers with unready status: [machine-config-daemon]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: 2019-08-28T03:37:32Z
    status: "True"
    type: PodScheduled

Comment 5 Antonio Murdaca 2019-08-29 13:01:04 UTC
More clues from the "Terminating" MCD on the GPU node:


oc get pods -l k8s-app=machine-config-daemon  --field-selector spec.nodeName=ip-10-0-143-101.us-west-2.compute.internal                                                                                                         
NAME                          READY   STATUS        RESTARTS   AGE
machine-config-daemon-tdnz2   0/1     Terminating   0          33h

Then in the journal:

Aug 29 12:50:49 ip-10-0-143-101 hyperkube[1566]: I0829 12:50:49.371411    1566 kubelet_pods.go:934] Pod "machine-config-daemon-tdnz2_openshift-machine-config-operator(2aefcab2-c945-11e9-8b16-0a269aae9e58)" is terminated, but some volumes have not been cleaned up


Investigating further on the GPU-enabled node, I can see a volume lying around:

[root@ip-10-0-143-101 /]# mount | grep daemon
tmpfs on /run/nvidia/driver/host/run/nvidia/driver/host/var/lib/kubelet/pods/2aefcab2-c945-11e9-8b16-0a269aae9e58/volumes/kubernetes.io~secret/machine-config-daemon-token-6swft type tmpfs (rw,relatime,seclabel)
[root@ip-10-0-143-101 /]# umount /run/nvidia/driver/host/run/nvidia/driver/host/var/lib/kubelet/pods/2aefcab2-c945-11e9-8b16-0a269aae9e58/volumes/kubernetes.io~secret/machine-config-daemon-token-6swft
[root@ip-10-0-143-101 /]# mount | grep daemon
[root@ip-10-0-143-101 /]#

Then, re-grepping the journal:

Aug 29 12:58:11 ip-10-0-143-101 hyperkube[1566]: E0829 12:58:11.527171    1566 nestedpendingoperations.go:278] Operation for "\"kubernetes.io/secret/2aefcab2-c945-11e9-8b16-0a269aae9e58-machine-config-daemon-token-6swft\" (\"2aefcab2-c945-11e9-8b16-0a269aae9e58\")" failed. No retries permitted until 2019-08-29 13:00:13.5271424 +0000 UTC m=+120204.917524039 (durationBeforeRetry 2m2s). Error: "UnmountVolume.TearDown failed for volume \"machine-config-daemon-token-6swft\" (UniqueName: \"kubernetes.io/secret/2aefcab2-c945-11e9-8b16-0a269aae9e58-machine-config-daemon-token-6swft\") pod \"2aefcab2-c945-11e9-8b16-0a269aae9e58\" (UID: \"2aefcab2-c945-11e9-8b16-0a269aae9e58\") : unlinkat /var/lib/kubelet/pods/2aefcab2-c945-11e9-8b16-0a269aae9e58/volumes/kubernetes.io~secret/machine-config-daemon-token-6swft: device or resource busy"



So what I think is happening is that the NVIDIA pod (whatever it does) mounts the rootfs the same way the MCD does, and that turns into a leaked mount, preventing the pod from terminating on the MCD side. I'm also not sure how this is supposed to work on the nvidia side.

Comment 6 Antonio Murdaca 2019-08-29 13:05:58 UTC
Inside the nvidia-driver-ctr, we can see the leaked mounts:

bash-4.4# mount | grep daemon
tmpfs on /host/run/nvidia/driver/host/var/lib/kubelet/pods/2aefcab2-c945-11e9-8b16-0a269aae9e58/volumes/kubernetes.io~secret/machine-config-daemon-token-6swft type tmpfs (rw,relatime,seclabel)
tmpfs on /run/nvidia/driver/host/run/nvidia/driver/host/var/lib/kubelet/pods/2aefcab2-c945-11e9-8b16-0a269aae9e58/volumes/kubernetes.io~secret/machine-config-daemon-token-6swft type tmpfs (rw,relatime,seclabel)


I'm pretty sure all of this is because both the MCD and the NVIDIA driver ctr are mounting the rootfs. Probably a mount propagation issue, but I'm not sure.
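
For context, a minimal sketch of where Kubernetes exposes mount propagation on a hostPath consumer; this is an assumption about the kind of fix involved, not the SRO's actual manifest. HostToContainer (rslave) lets host mount/umount events propagate into the container without the container publishing or pinning mounts toward the host, while Bidirectional (rshared, privileged-only) shares them both ways:

  containers:
  - name: driver-ctr                   # illustrative name
    securityContext:
      privileged: true
    volumeMounts:
    - name: host-root
      mountPath: /host
      # rslave: mounts/umounts made on the host (e.g. another pod's secret
      # tmpfs) propagate into this container and can still be torn down by
      # the kubelet; nothing mounted in here propagates back to the host
      mountPropagation: HostToContainer
  volumes:
  - name: host-root
    hostPath:
      path: /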

Comment 7 Antonio Murdaca 2019-08-29 13:07:54 UTC
Unmounting both mounts in the "nvidia-driver-ctr" starts a correct rollout and unblocks the situation here, at least, but again, I'm not really sure how this is supposed to work. I'll ping the containers team; I'm not sure we can control what the nvidia ctr does.

Comment 8 Antonio Murdaca 2019-08-29 13:21:00 UTC
As a last step to reconcile the cluster, wait for the MCO update to finish and everything will be upgraded. Of course, this is not a solution, nor IMO a workaround.

Anyhow, the easiest way to reproduce this (which we've seen also impacts upgrades) is to just oc delete the MCD pod running on the GPU node once the node is added; a sketch follows.
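
A hedged sketch of that reproducer (the node name is the one from this bug; substitute your GPU node):

# delete the MCD pod running on the GPU node; with the leak present, the
# pod hangs in Terminating instead of being replaced cleanly
oc -n openshift-machine-config-operator delete \
  $(oc -n openshift-machine-config-operator get pods -o name \
      -l k8s-app=machine-config-daemon \
      --field-selector spec.nodeName=ip-10-0-143-101.us-west-2.compute.internal)
# watch the daemonset try (and fail) to roll the pod
oc -n openshift-machine-config-operator get pods -l k8s-app=machine-config-daemon -w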

Comment 9 Antonio Murdaca 2019-08-29 13:28:16 UTC
This is where the nvidia driver ctr mounts host:

https://github.com/zvonkok/special-resource-operator/blob/master/assets/state-driver/0500_daemonset.yaml#L49-L52

Do we really need the full host mount there? What for? This nvidia daemonset requiring the host mount also makes it leak mounts from other containers that mount the host (like the MCD), causing the Terminating issue explained above. A narrowed sketch follows.
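
For illustration only, a narrowed variant of that volume; whether the driver container can actually work with less than the full / is exactly the open question here:

      volumeMounts:
      - name: run-nvidia
        mountPath: /run/nvidia
        # Bidirectional only if the driver's mounts must become visible on
        # the host; otherwise HostToContainer (or no propagation) is safer
        mountPropagation: Bidirectional
      volumes:
      - name: run-nvidia
        hostPath:
          path: /run/nvidia
          type: DirectoryOrCreate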

Comment 10 Antonio Murdaca 2019-08-29 13:46:20 UTC
Also, somehow the SRO doesn't seem to be able to restart the NVIDIA daemonset... is that a bug? I can see the daemonset pod trying to come up and then terminating itself, over and over.

Comment 11 Antonio Murdaca 2019-08-29 13:47:25 UTC
This is what that operator does right now (over and over), and the nvidia DS isn't started either:

{"level":"info","ts":1567086418.0162215,"logger":"controller_specialresource","msg":"Reconciling SpecialResource","Request.Namespace":"openshift-sro-operator","Request.Name":"gpu"}
{"level":"info","ts":1567086418.0262527,"logger":"controller_specialresource","msg":"Looking for","ServiceAccount":"nvidia-driver","Namespace":"openshift-sro"}
{"level":"info","ts":1567086418.029905,"logger":"controller_specialresource","msg":"Not found, creating","ServiceAccount":"nvidia-driver","Namespace":"openshift-sro"}
{"level":"info","ts":1567086418.034802,"logger":"controller_specialresource","msg":"Looking for","Role":"nvidia-driver","Namespace":"openshift-sro"}
{"level":"info","ts":1567086418.0377996,"logger":"controller_specialresource","msg":"Not found, creating","Role":"nvidia-driver","Namespace":"openshift-sro"}
{"level":"info","ts":1567086418.043725,"logger":"controller_specialresource","msg":"Looking for","RoleBinding":"nvidia-driver","Namespace":"openshift-sro"}
{"level":"info","ts":1567086418.0531864,"logger":"controller_specialresource","msg":"Not found, creating","RoleBinding":"nvidia-driver","Namespace":"openshift-sro"}
{"level":"info","ts":1567086418.064452,"logger":"controller_specialresource","msg":"Looking for","ConfigMap":"nvidia-driver","Namespace":"openshift-sro"}
{"level":"info","ts":1567086418.071187,"logger":"controller_specialresource","msg":"Not found, creating","ConfigMap":"nvidia-driver","Namespace":"openshift-sro"}
{"level":"info","ts":1567086418.0787044,"logger":"controller_specialresource","msg":"Looking for","SecurityContextConstraints":"nvidia-driver","Namespace":"default"}
{"level":"info","ts":1567086418.086377,"logger":"controller_specialresource","msg":"Found","SecurityContextConstraints":"nvidia-driver","Namespace":"default"}
{"level":"info","ts":1567086418.0947955,"logger":"controller_specialresource","msg":"4.18.0-80.7.2.el8_0.x86_64","Request.Namespace":"default","Request.Name":"Node"}
{"level":"info","ts":1567086418.0948265,"logger":"controller_specialresource","msg":"Looking for","DaemonSet":"nvidia-driver-daemonset","Namespace":"openshift-sro"}
{"level":"info","ts":1567086418.0973034,"logger":"controller_specialresource","msg":"Not found, creating","DaemonSet":"nvidia-driver-daemonset","Namespace":"openshift-sro"}
{"level":"info","ts":1567086418.103533,"logger":"controller_specialresource","msg":"DEBUG: DaemonSet","LabelSelector":"app=nvidia-driver-daemonset"}
{"level":"info","ts":1567086418.1089997,"logger":"controller_specialresource","msg":"DEBUG: DaemonSet","NumberOfDaemonSets":1}
{"level":"info","ts":1567086418.1091046,"logger":"controller_specialresource","msg":"DEBUG: DaemonSet","NumberUnavailable":0}
{"level":"info","ts":1567086418.109186,"logger":"controller_specialresource","msg":"DEBUG: Pod","LabelSelector":"app=nvidia-driver-daemonset"}
{"level":"info","ts":1567086418.1214077,"logger":"controller_specialresource","msg":"DEBUG: Pod","NumberOfPods":0}
{"level":"info","ts":1567086418.1214294,"logger":"controller_specialresource","msg":"SpecialResource","ResourceStatus":"NotReady"}
{"level":"info","ts":1567086418.1214664,"logger":"controller_specialresource","msg":"Reconciling SpecialResource","Request.Namespace":"openshift-sro","Request.Name":"gpu"}
{"level":"info","ts":1567086418.1294944,"logger":"controller_specialresource","msg":"Reconciling SpecialResource","Request.Namespace":"openshift-sro","Request.Name":"gpu"}

Comment 12 Antonio Murdaca 2019-08-29 20:07:02 UTC
TL;DR: this issue isn't strictly related to the MCO; the MCO mounts /rootfs because it needs to in order to manage a node. What's happening here is that another pod (coming from https://github.com/zvonkok/special-resource-operator, as explained in comment https://bugzilla.redhat.com/show_bug.cgi?id=1746854#c9) is also mounting the node's /. That causes mounts from other containers (the MCD) to be leaked into this pod that also mounts /. When that happens, the kubelet gets stuck terminating the pod because it can't unmount a volume that belongs to the MCD but is mounted into the other pod (in this case, the MCD token that kube injects).

To verify the above TL;DR, look at https://bugzilla.redhat.com/show_bug.cgi?id=1746854#c5 - you can find the kubelet logs there as well.

Anyhow, one way to unstick an upgrade (or a general wedge) is to jump into the other pod that mounts / and manually unmount the volumes belonging to the MCD (https://bugzilla.redhat.com/show_bug.cgi?id=1746854#c6); a sketch follows. After a while, the kubelet reconciles the situation, can unmount the MCD volume, and proceeds to correctly terminate the pod.
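
A sketch of that manual recovery, using the names from this bug (the pod name and mountpoints will differ on other clusters):

# exec into the pod holding the leaked mounts (the nvidia driver ctr here)
oc -n openshift-sro rsh <nvidia-driver-daemonset-pod>
# inside the container: find and unmount every leaked MCD secret mount
mount | grep machine-config-daemon-token
umount <each mountpoint printed above>
# back outside: the kubelet retries UnmountVolume.TearDown (see the journal
# in comment 5) and the stuck MCD pod finishes terminating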

I'm also not sure whose issue this is, or whether the operator at https://github.com/zvonkok/special-resource-operator is fully supported for 4.2 (I say that just because it's not under the openshift org on GitHub).

As for when this happens, the reporter says it "happened once", so I'm leaning toward something that occurs occasionally at the container runtime (maybe kernel) level. Otherwise I don't see anything wrong with multiple pods mounting the rootfs, as long as mounts aren't leaked and held by other pods (causing these sorts of issues).

Comment 13 Antonio Murdaca 2019-08-30 08:36:57 UTC
Spoke to Clayton about this here: https://coreos.slack.com/archives/CEKNRGF25/p1567095110087200

Since the special-resource-operator isn't in the payload, and this is an issue with that operator mounting /rootfs and ultimately causing issues for the MCO (a core operator), we believe this has to be fixed on the SRO side. There are also ways to avoid leaking mounts like this.


Since I can't find a component for the SRO, I'm going to close this, but please reopen it if there's a component (or move this upstream on GitHub).

Comment 14 Mike Fiedler 2019-09-04 12:03:30 UTC
Re-opening and assigning to correct component.

Comment 15 Zvonko Kosic 2019-09-27 17:51:04 UTC
Walid, 

I have created a new version; please test.

# git clone https://github.com/openshift-psap/special-resource-operator.git
# git checkout release-4.2
# PULLPOLICY=Always make deploy

Comment 16 Walid A. 2019-10-02 20:00:03 UTC
Failed to deploy the SRO nvidia pods in namespace openshift-sro; only the special-resource-operator pod was deployed.

# git clone https://github.com/openshift-psap/special-resource-operator.git
# git checkout release-4.2
# PULLPOLICY=Always make deploy
customresourcedefinition.apiextensions.k8s.io/specialresources.sro.openshift.io created
sleep 1
for obj in namespace.yaml service_account.yaml role.yaml role_binding.yaml operator.yaml crds/sro_v1alpha1_specialresource_cr.yaml; do \
	sed 's+REPLACE_IMAGE+quay.io/openshift-psap/special-resource-operator:release-4.2+g; s+REPLACE_NAMESPACE+openshift-sro+g; s+Always+Always+' deploy/$obj | kubectl apply -f - ;\
	sleep 1 ;\
done	
namespace/openshift-sro created
serviceaccount/special-resource-operator created
role.rbac.authorization.k8s.io/special-resource-operator created
clusterrole.rbac.authorization.k8s.io/special-resource-operator created
rolebinding.rbac.authorization.k8s.io/special-resource-operator created
clusterrolebinding.rbac.authorization.k8s.io/special-resource-operator created
deployment.apps/special-resource-operator created
specialresource.sro.openshift.io/example-specialresource created
specialresource.sro.openshift.io/example-specialresource unchanged

# oc get pods -n openshift-sro
NAME                                         READY   STATUS    RESTARTS   AGE
special-resource-operator-7cbb8f5d67-mqj26   1/1     Running   0          34s

# oc get events -n openshift-sro
LAST SEEN   TYPE     REASON              OBJECT                                            MESSAGE
45s         Normal   Scheduled           pod/special-resource-operator-7cbb8f5d67-mqj26    Successfully assigned openshift-sro/special-resource-operator-7cbb8f5d67-mqj26 to ip-10-0-138-35.us-west-2.compute.internal
37s         Normal   Pulling             pod/special-resource-operator-7cbb8f5d67-mqj26    Pulling image "quay.io/openshift-psap/special-resource-operator:release-4.2"
34s         Normal   Pulled              pod/special-resource-operator-7cbb8f5d67-mqj26    Successfully pulled image "quay.io/openshift-psap/special-resource-operator:release-4.2"
34s         Normal   Created             pod/special-resource-operator-7cbb8f5d67-mqj26    Created container special-resource-operator
34s         Normal   Started             pod/special-resource-operator-7cbb8f5d67-mqj26    Started container special-resource-operator
45s         Normal   SuccessfulCreate    replicaset/special-resource-operator-7cbb8f5d67   Created pod: special-resource-operator-7cbb8f5d67-mqj26
45s         Normal   ScalingReplicaSet   deployment/special-resource-operator              Scaled up replica set special-resource-operator-7cbb8f5d67 to 1

# oc logs -n openshift-sro special-resource-operator-7cbb8f5d67-mqj26
{"level":"info","ts":1570043587.0136995,"logger":"cmd","msg":"Go Version: go1.11.13"}
{"level":"info","ts":1570043587.0139394,"logger":"cmd","msg":"Go OS/Arch: linux/amd64"}
{"level":"info","ts":1570043587.0139446,"logger":"cmd","msg":"Version of operator-sdk: v0.10.0"}
{"level":"info","ts":1570043587.014225,"logger":"leader","msg":"Trying to become the leader."}
{"level":"info","ts":1570043587.150485,"logger":"leader","msg":"No pre-existing lock was found."}
{"level":"info","ts":1570043587.1559906,"logger":"leader","msg":"Became the leader."}
{"level":"info","ts":1570043587.2638056,"logger":"cmd","msg":"Registering Components."}
{"level":"info","ts":1570043587.2648346,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"specialresource-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1570043587.264936,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"specialresource-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1570043587.2650201,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"specialresource-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1570043587.265101,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"specialresource-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1570043587.2651818,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"specialresource-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1570043587.2652588,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"specialresource-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1570043587.2653337,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"specialresource-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1570043587.2654111,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"specialresource-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1570043587.2655103,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"specialresource-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1570043587.4164715,"logger":"cmd","msg":"Could not create metrics Service","error":"failed to create or get service for metrics: services \"special-resource-operator-metrics\" is forbidden: cannot set blockOwnerDeletion if an ownerReference refers to a resource you can't set finalizers on: , <nil>"}
{"level":"info","ts":1570043587.4311242,"logger":"cmd","msg":"Starting the Cmd."}
{"level":"info","ts":1570043587.5314167,"logger":"kubebuilder.controller","msg":"Starting Controller","controller":"specialresource-controller"}
{"level":"info","ts":1570043587.6317081,"logger":"kubebuilder.controller","msg":"Starting workers","controller":"specialresource-controller","worker count":1}
{"level":"info","ts":1570043587.6317961,"logger":"controller_specialresource","msg":"Reconciling SpecialResource","Request.Namespace":"","Request.Name":"gpu"}


# oc get crd
NAME                                                        CREATED AT
alertmanagers.monitoring.coreos.com                         2019-10-01T14:23:39Z
apiservers.config.openshift.io                              2019-10-01T14:16:56Z
authentications.config.openshift.io                         2019-10-01T14:16:56Z
authentications.operator.openshift.io                       2019-10-01T14:17:17Z
baremetalhosts.metal3.io                                    2019-10-01T14:19:03Z
builds.config.openshift.io                                  2019-10-01T14:16:56Z
catalogsourceconfigs.operators.coreos.com                   2019-10-01T14:17:16Z
catalogsources.operators.coreos.com                         2019-10-01T14:17:23Z
clusterautoscalers.autoscaling.openshift.io                 2019-10-01T14:17:15Z
clusternetworks.network.openshift.io                        2019-10-01T14:17:32Z
clusteroperators.config.openshift.io                        2019-10-01T14:16:54Z
clusterresourcequotas.quota.openshift.io                    2019-10-01T14:16:55Z
clusterserviceversions.operators.coreos.com                 2019-10-01T14:17:22Z
clusterversions.config.openshift.io                         2019-10-01T14:16:54Z
configs.imageregistry.operator.openshift.io                 2019-10-01T14:17:15Z
configs.samples.operator.openshift.io                       2019-10-01T14:17:15Z
consoleclidownloads.console.openshift.io                    2019-10-01T14:17:15Z
consoleexternalloglinks.console.openshift.io                2019-10-01T14:17:18Z
consolelinks.console.openshift.io                           2019-10-01T14:17:16Z
consolenotifications.console.openshift.io                   2019-10-01T14:17:20Z
consoles.config.openshift.io                                2019-10-01T14:16:56Z
consoles.operator.openshift.io                              2019-10-01T14:17:21Z
containerruntimeconfigs.machineconfiguration.openshift.io   2019-10-01T14:18:18Z
controllerconfigs.machineconfiguration.openshift.io         2019-10-01T14:18:15Z
credentialsrequests.cloudcredential.openshift.io            2019-10-01T14:17:16Z
dnses.config.openshift.io                                   2019-10-01T14:16:57Z
dnses.operator.openshift.io                                 2019-10-01T14:17:16Z
dnsrecords.ingress.operator.openshift.io                    2019-10-01T14:17:16Z
egressnetworkpolicies.network.openshift.io                  2019-10-01T14:17:33Z
featuregates.config.openshift.io                            2019-10-01T14:16:57Z
hostsubnets.network.openshift.io                            2019-10-01T14:17:32Z
imagecontentsourcepolicies.operator.openshift.io            2019-10-01T14:16:57Z
images.config.openshift.io                                  2019-10-01T14:16:57Z
infrastructures.config.openshift.io                         2019-10-01T14:16:57Z
ingresscontrollers.operator.openshift.io                    2019-10-01T14:17:18Z
ingresses.config.openshift.io                               2019-10-01T14:16:58Z
installplans.operators.coreos.com                           2019-10-01T14:17:23Z
kubeapiservers.operator.openshift.io                        2019-10-01T14:17:15Z
kubecontrollermanagers.operator.openshift.io                2019-10-01T14:17:15Z
kubeletconfigs.machineconfiguration.openshift.io            2019-10-01T14:18:17Z
kubeschedulers.operator.openshift.io                        2019-10-01T14:17:15Z
machineautoscalers.autoscaling.openshift.io                 2019-10-01T14:17:18Z
machineconfigpools.machineconfiguration.openshift.io        2019-10-01T14:18:16Z
machineconfigs.machineconfiguration.openshift.io            2019-10-01T14:18:13Z
machinedisruptionbudgets.healthchecking.openshift.io        2019-10-01T14:18:12Z
machinehealthchecks.healthchecking.openshift.io             2019-10-01T14:18:09Z
machines.machine.openshift.io                               2019-10-01T14:18:05Z
machinesets.machine.openshift.io                            2019-10-01T14:18:06Z
mcoconfigs.machineconfiguration.openshift.io                2019-10-01T14:17:20Z
netnamespaces.network.openshift.io                          2019-10-01T14:17:33Z
network-attachment-definitions.k8s.cni.cncf.io              2019-10-01T14:17:28Z
networks.config.openshift.io                                2019-10-01T14:16:58Z
networks.operator.openshift.io                              2019-10-01T14:16:59Z
nodefeaturediscoveries.nfd.openshift.io                     2019-10-02T13:20:59Z
oauths.config.openshift.io                                  2019-10-01T14:16:58Z
openshiftapiservers.operator.openshift.io                   2019-10-01T14:17:15Z
openshiftcontrollermanagers.operator.openshift.io           2019-10-01T14:17:18Z
operatorgroups.operators.coreos.com                         2019-10-01T14:17:29Z
operatorhubs.config.openshift.io                            2019-10-01T14:16:55Z
operatorsources.operators.coreos.com                        2019-10-01T14:17:18Z
podmonitors.monitoring.coreos.com                           2019-10-01T14:23:40Z
projects.config.openshift.io                                2019-10-01T14:16:58Z
prometheuses.monitoring.coreos.com                          2019-10-01T14:23:39Z
prometheusrules.monitoring.coreos.com                       2019-10-01T14:23:40Z
proxies.config.openshift.io                                 2019-10-01T14:16:55Z
rolebindingrestrictions.authorization.openshift.io          2019-10-01T14:16:55Z
schedulers.config.openshift.io                              2019-10-01T14:16:58Z
securitycontextconstraints.security.openshift.io            2019-10-01T14:16:56Z
servicecas.operator.openshift.io                            2019-10-01T14:17:18Z
servicecatalogapiservers.operator.openshift.io              2019-10-01T14:17:17Z
servicecatalogcontrollermanagers.operator.openshift.io      2019-10-01T14:17:17Z
servicemonitors.monitoring.coreos.com                       2019-10-01T14:23:39Z
specialresources.sro.openshift.io                           2019-10-02T19:12:48Z
subscriptions.operators.coreos.com                          2019-10-01T14:17:23Z
tuneds.tuned.openshift.io                                   2019-10-01T14:17:16Z


# oc get crd specialresources.sro.openshift.io -o yaml
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"apiextensions.k8s.io/v1beta1","kind":"CustomResourceDefinition","metadata":{"annotations":{},"name":"specialresources.sro.openshift.io"},"spec":{"group":"sro.openshift.io","names":{"kind":"SpecialResource","listKind":"SpecialResourceList","plural":"specialresources","singular":"specialresource"},"scope":"Namespaced","subresources":{"status":{}},"validation":{"openAPIV3Schema":{"properties":{"apiVersion":{"description":"APIVersion defines the versioned schema of this representation of an object. Servers should convert recognized schemas to the latest internal value, and may reject unrecognized values. More info: https://git.k8s.io/community/contributors/devel/api-conventions.md#resources","type":"string"},"kind":{"description":"Kind is a string value representing the REST resource this object represents. Servers may infer this from the endpoint the client submits requests to. Cannot be updated. In CamelCase. More info: https://git.k8s.io/community/contributors/devel/api-conventions.md#types-kinds","type":"string"},"metadata":{"type":"object"},"spec":{"type":"object"},"status":{"type":"object"}}}},"version":"v1alpha1","versions":[{"name":"v1alpha1","served":true,"storage":true}]}}
  creationTimestamp: "2019-10-02T19:12:48Z"
  generation: 1
  name: specialresources.sro.openshift.io
  resourceVersion: "487599"
  selfLink: /apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions/specialresources.sro.openshift.io
  uid: 9f868fd5-e548-11e9-9bbd-0a69f2192e3e
spec:
  conversion:
    strategy: None
  group: sro.openshift.io
  names:
    kind: SpecialResource
    listKind: SpecialResourceList
    plural: specialresources
    singular: specialresource
  scope: Namespaced
  subresources:
    status: {}
  validation:
    openAPIV3Schema:
      properties:
        apiVersion:
          description: 'APIVersion defines the versioned schema of this representation
            of an object. Servers should convert recognized schemas to the latest
            internal value, and may reject unrecognized values. More info: https://git.k8s.io/community/contributors/devel/api-conventions.md#resources'
          type: string
        kind:
          description: 'Kind is a string value representing the REST resource this
            object represents. Servers may infer this from the endpoint the client
            submits requests to. Cannot be updated. In CamelCase. More info: https://git.k8s.io/community/contributors/devel/api-conventions.md#types-kinds'
          type: string
        metadata:
          type: object
        spec:
          type: object
        status:
          type: object
  version: v1alpha1
  versions:
  - name: v1alpha1
    served: true
    storage: true
status:
  acceptedNames:
    kind: SpecialResource
    listKind: SpecialResourceList
    plural: specialresources
    singular: specialresource
  conditions:
  - lastTransitionTime: "2019-10-02T19:12:48Z"
    message: no conflicts found
    reason: NoConflicts
    status: "True"
    type: NamesAccepted
  - lastTransitionTime: null
    message: the initial names have been accepted
    reason: InitialNamesAccepted
    status: "True"
    type: Established
  storedVersions:
  - v1alpha1

# oc get all -n openshift-sro
NAME                                             READY   STATUS    RESTARTS   AGE
pod/special-resource-operator-7cbb8f5d67-mqj26   1/1     Running   0          2m54s

NAME                                        READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/special-resource-operator   1/1     1            1           2m54s

NAME                                                   DESIRED   CURRENT   READY   AGE
replicaset.apps/special-resource-operator-7cbb8f5d67   1         1         1       2m54s

Comment 17 Walid A. 2019-10-05 00:25:55 UTC
SRO was successfully deployed after the PR update and git cloning the release-4.2 branch from https://github.com/openshift-psap/special-resource-operator.git.

This was on OCP 4.2.0-0.nightly-2019-10-01-210901 on AWS.

Upgrade from version 4.2.0-0.nightly-2019-10-01-210901 to version 4.2.0-0.nightly-2019-10-02-001405 was successful:

# oc adm upgrade
Cluster version is 4.2.0-0.nightly-2019-10-02-001405

Updates:

VERSION                           IMAGE
4.2.0-0.nightly-2019-10-02-122541 registry.svc.ci.openshift.org/ocp/release@sha256:f8f943fc4d321485cfd2559c2f610382891d8b8d60410be58d679aaf1c2cf660

# oc get pods -n openshift-machine-config-operator
NAME                                        READY   STATUS    RESTARTS   AGE
etcd-quorum-guard-85457c8498-5whbz          1/1     Running   0          26h
etcd-quorum-guard-85457c8498-dbjc7          1/1     Running   0          26h
etcd-quorum-guard-85457c8498-frhqk          1/1     Running   0          26h
machine-config-controller-6cf9b7bc4-tsztb   1/1     Running   0          26h
machine-config-daemon-22rjb                 1/1     Running   0          26h
machine-config-daemon-9ggwg                 1/1     Running   0          25h
machine-config-daemon-czj8q                 1/1     Running   0          25h
machine-config-daemon-f2pq6                 1/1     Running   0          26h
machine-config-daemon-kgtqj                 1/1     Running   0          25h
machine-config-daemon-lc5mk                 1/1     Running   0          26h
machine-config-daemon-p8rrj                 1/1     Running   0          22h
machine-config-operator-7777d6568f-xl9nw    1/1     Running   0          8m4s
machine-config-server-7tk28                 1/1     Running   0          26h
machine-config-server-fj9ps                 1/1     Running   0          26h
machine-config-server-n4bt6                 1/1     Running   0          26h


# oc logs -n openshift-machine-config-operator machine-config-operator-7777d6568f-xl9nw
I1004 03:52:42.140321       1 start.go:42] Version: v4.2.0-201910010606-dirty (b8898db9af98e5c3d6a450ae123121677b0dbcb3)
E1004 03:54:37.799449       1 event.go:247] Could not construct reference to: '&v1.ConfigMap{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"machine-config", GenerateName:"", Namespace:"openshift-machine-config-operator", SelfLink:"/api/v1/namespaces/openshift-machine-config-operator/configmaps/machine-config", UID:"fa69a4d0-e580-11e9-b32c-06c96f353c88", ResourceVersion:"499408", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63705664573, loc:(*time.Location)(0x2365480)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string{"control-plane.alpha.kubernetes.io/leader":"{\"holderIdentity\":\"machine-config-operator-7777d6568f-xl9nw_6a99a99b-e65a-11e9-a2cc-0a580a810068\",\"leaseDurationSeconds\":90,\"acquireTime\":\"2019-10-04T03:54:37Z\",\"renewTime\":\"2019-10-04T03:54:37Z\",\"leaderTransitions\":1}"}, OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Data:map[string]string(nil), BinaryData:map[string][]uint8(nil)}' due to: 'no kind is registered for the type v1.ConfigMap in scheme "github.com/openshift/machine-config-operator/cmd/common/helpers.go:30"'. Will not report event: 'Normal' 'LeaderElection' 'machine-config-operator-7777d6568f-xl9nw_6a99a99b-e65a-11e9-a2cc-0a580a810068 became leader'
I1004 03:54:38.117217       1 operator.go:246] Starting MachineConfigOperator
I1004 03:54:38.121766       1 event.go:209] Event(v1.ObjectReference{Kind:"", Namespace:"", Name:"machine-config", UID:"fa7c985d-e580-11e9-b32c-06c96f353c88", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorVersionChanged' clusteroperator/machine-config-operator started a version change from [{operator 4.2.0-0.nightly-2019-10-01-210901}] to [{operator 4.2.0-0.nightly-2019-10-02-001405}]
I1004 03:54:43.312637       1 event.go:209] Event(v1.ObjectReference{Kind:"", Namespace:"", Name:"machine-config", UID:"fa7c985d-e580-11e9-b32c-06c96f353c88", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorVersionChanged' clusteroperator/machine-config-operator version changed from [{operator 4.2.0-0.nightly-2019-10-01-210901}] to [{operator 4.2.0-0.nightly-2019-10-02-001405}]


# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.2.0-0.nightly-2019-10-02-001405   True        False         False      25h
cloud-credential                           4.2.0-0.nightly-2019-10-02-001405   True        False         False      26h
cluster-autoscaler                         4.2.0-0.nightly-2019-10-02-001405   True        False         False      26h
console                                    4.2.0-0.nightly-2019-10-02-001405   True        False         False      25h
dns                                        4.2.0-0.nightly-2019-10-02-001405   True        False         False      26h
image-registry                             4.2.0-0.nightly-2019-10-02-001405   True        False         False      25h
ingress                                    4.2.0-0.nightly-2019-10-02-001405   True        False         False      25h
insights                                   4.2.0-0.nightly-2019-10-02-001405   True        False         False      26h
kube-apiserver                             4.2.0-0.nightly-2019-10-02-001405   True        False         False      26h
kube-controller-manager                    4.2.0-0.nightly-2019-10-02-001405   True        False         False      26h
kube-scheduler                             4.2.0-0.nightly-2019-10-02-001405   True        False         False      26h
machine-api                                4.2.0-0.nightly-2019-10-02-001405   True        False         False      26h
machine-config                             4.2.0-0.nightly-2019-10-02-001405   True        False         False      26h
marketplace                                4.2.0-0.nightly-2019-10-02-001405   True        False         False      10m
monitoring                                 4.2.0-0.nightly-2019-10-02-001405   True        False         False      25h
network                                    4.2.0-0.nightly-2019-10-02-001405   True        False         False      26h
node-tuning                                4.2.0-0.nightly-2019-10-02-001405   True        False         False      11m
openshift-apiserver                        4.2.0-0.nightly-2019-10-02-001405   True        False         False      26h
openshift-controller-manager               4.2.0-0.nightly-2019-10-02-001405   True        False         False      26h
openshift-samples                          4.2.0-0.nightly-2019-10-02-001405   True        False         False      11m
operator-lifecycle-manager                 4.2.0-0.nightly-2019-10-02-001405   True        False         False      26h
operator-lifecycle-manager-catalog         4.2.0-0.nightly-2019-10-02-001405   True        False         False      26h
operator-lifecycle-manager-packageserver   4.2.0-0.nightly-2019-10-02-001405   True        False         False      7h5m
service-ca                                 4.2.0-0.nightly-2019-10-02-001405   True        False         False      26h
service-catalog-apiserver                  4.2.0-0.nightly-2019-10-02-001405   True        False         False      7h5m
service-catalog-controller-manager         4.2.0-0.nightly-2019-10-02-001405   True        False         False      7h5m
storage                                    4.2.0-0.nightly-2019-10-02-001405   True        False         False      11m

Comment 19 errata-xmlrpc 2020-01-23 11:05:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

