Bug 1956607

Summary: Multiple manifests lack cluster profile annotations
Product: OpenShift Container Platform Reporter: W. Trevor King <wking>
Component: Bare Metal Hardware ProvisioningAssignee: sdasu
Bare Metal Hardware Provisioning sub component: cluster-baremetal-operator QA Contact: Shelly Miron <smiron>
Status: VERIFIED --- Docs Contact:
Severity: high    
Priority: high CC: aos-bugs, beth.white, rbartal
Version: 4.8Keywords: Triaged
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description W. Trevor King 2021-05-04 04:18:33 UTC
Many baremetal manifests lack cluster profile annotations [1], e.g. [2].  Checking fc.2:

$ oc adm release extract --to manifests quay.io/openshift-release-dev/ocp-release:4.8.0-fc.2-x86_64
$ for X in manifests/*.yaml; do ENTRIES="$(yaml2json < "${X}" | jq -r '.[].metadata | select((.annotations // {}) | keys | tostring | contains("include.release.openshift.io/") | not) | .namespace + " " + .name + " " + (.annotations | tostring)')"; if test -
n "${ENTRIES}"; then echo "${X}"; echo "${ENTRIES}"; fi done
...
manifests/0000_30_baremetal-operator_01_baremetalhost.crd.yaml
baremetal-operator-system baremetal-operator-leader-election-role null
 baremetal-operator-manager-role null
 baremetal-operator-proxy-role null
 baremetal-operator-metrics-reader null
baremetal-operator-system baremetal-operator-leader-election-rolebinding null
 baremetal-operator-manager-rolebinding null
 baremetal-operator-proxy-rolebinding null
baremetal-operator-system baremetal-operator-ironic null
baremetal-operator-system baremetal-operator-controller-manager-metrics-service null
baremetal-operator-system baremetal-operator-controller-manager null
manifests/0000_31_cluster-baremetal-operator_05_prometheus_rbac.yaml
  null
manifests/0000_31_cluster-baremetal-operator_05_rbac.yaml
  null
...
manifests/0000_90_cluster-baremetal-operator_03_servicemonitor.yaml
openshift-machine-api cluster-baremetal-operator-servicemonitor {"exclude.release.openshift.io/internal-openshift-hosted":"true"}

Without profile annotations, the manifests will not be pushed into clusters, and any functionality which depends on them will be broken.  You probably want to add at least one cluster-profile annotation to each manifest or drop the manifest altogether.

[1]: https://github.com/openshift/enhancements/blob/0e2f2474ee858350c4dec0033439036426a5663d/enhancements/update/cluster-profiles.md
[2]: https://github.com/openshift/baremetal-operator/blob/d22c5f710cdce4a30f3d8a8457d45ad53dd3f3a2/config/render/capm3.yaml#L5

Comment 1 sdasu 2021-05-05 17:42:13 UTC
We actually use [1] to add annotations to manifests/0000_30_baremetal-operator_01_baremetalhost.crd.yaml

[1] - https://github.com/openshift/baremetal-operator/blob/master/config/crd/ocp/ocp_kustomization.yaml

[stack@localhost dev-scripts]$ oc adm release extract --to manifests quay.io/openshift-release-dev/ocp-release:4.8.0-fc.2-x86_64
Extracted release payload from digest sha256:76fcf7d935c9acc8e3bef6b46a94360e61a5f56958316313d5a7d6a8b5c6484e created at 2021-04-30T14:14:56Z

[stack@localhost dev-scripts]$ head manifests/0000_30_baremetal-operator_01_baremetalhost.crd.yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  annotations:
    controller-gen.kubebuilder.io/version: v0.4.1
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
  creationTimestamp: null
  name: baremetalhosts.metal3.io
spec:

Are we missing some annotation? If, so which one?

Comment 2 W. Trevor King 2021-05-05 19:19:10 UTC
Should be the ones from comment 0.  Double checking the one you call out in comment 1:

$ oc adm release extract --to manifests quay.io/openshift-release-dev/ocp-release:4.8.0-fc.2-x86_64
$ yaml2json <manifests/0000_30_baremetal-operator_01_baremetalhost.crd.yaml | jq -r '.[] | .kind + " " + .metadata.name + " " + (.metadata.annotations | tostring)'
CustomResourceDefinition baremetalhosts.metal3.io {"controller-gen.kubebuilder.io/version":"v0.4.1","include.release.openshift.io/self-managed-high-availability":"true","include.release.openshift.io/single-node-developer":"true"}
Role baremetal-operator-leader-election-role null
ClusterRole baremetal-operator-manager-role null
ClusterRole baremetal-operator-proxy-role null
ClusterRole baremetal-operator-metrics-reader null
RoleBinding baremetal-operator-leader-election-rolebinding null
ClusterRoleBinding baremetal-operator-manager-rolebinding null
ClusterRoleBinding baremetal-operator-proxy-rolebinding null
ConfigMap baremetal-operator-ironic null
Service baremetal-operator-controller-manager-metrics-service null
Deployment baremetal-operator-controller-manager null

So you set the annotations on the CRD itself, but not on the other resources Role, ClusterRoles, etcd. that you include in that file.

Comment 3 sdasu 2021-05-11 14:32:30 UTC
While adding the required annotations, ran into the following CI test failure:
INFO[2021-05-10T22:10:26Z] May 10 21:46:03.565: FAIL: Pods in platform namespaces are not following resource request/limit rules or do not have an exception granted: 
INFO[2021-05-10T22:10:26Z]   apps/v1/Deployment/openshift-machine-api/baremetal-operator-controller-manager/container/kube-rbac-proxy does not have a cpu request (rule: "apps/v1/Deployment/openshift-machine-api/baremetal-operator-controller-manager/container/kube-rbac-proxy/request[cpu]") 
INFO[2021-05-10T22:10:26Z]   apps/v1/Deployment/openshift-machine-api/baremetal-operator-controller-manager/container/kube-rbac-proxy does not have a memory request (rule: "apps/v1/Deployment/openshift-machine-api/baremetal-operator-controller-manager/container/kube-rbac-proxy/request[memory]") 
INFO[2021-05-10T22:10:26Z]   apps/v1/Deployment/openshift-machine-api/baremetal-operator-controller-manager/container/manager does not have a cpu request (rule: "apps/v1/Deployment/openshift-machine-api/baremetal-operator-controller-manager/container/manager/request[cpu]") 
INFO[2021-05-10T22:10:26Z]   apps/v1/Deployment/openshift-machine-api/baremetal-operator-controller-manager/container/manager does not have a memory request (rule: "apps/v1/Deployment/openshift-machine-api/baremetal-operator-controller-manager/container/manager/request[memory]") 


While we do some profiling on our end to figure what these values need to be, could we get an exception for openshift-machine-api/baremetal-operator-controller-manager/container/kube-rbac-proxy and openshift-machine-api/baremetal-operator-controller-manager/container/manager?

Comment 4 sdasu 2021-05-11 16:12:39 UTC
Ignore previous comment because we were able to get past it.

Comment 7 W. Trevor King 2021-06-03 19:06:00 UTC
I came back to this issue while revisiting in bug 1956606.  From [1], which repeats the procedure from comment 0 on 4.8.0-0.nightly-2021-06-03-084005:

manifests/0000_31_cluster-baremetal-operator_05_prometheus_rbac.yaml
  null
manifests/0000_31_cluster-baremetal-operator_05_rbac.yaml
  null
manifests/0000_90_cluster-baremetal-operator_03_servicemonitor.yaml
openshift-machine-api cluster-baremetal-operator-servicemonitor {"exclude.release.openshift.io/internal-openshift-hosted":"true"}

I think the first two are just from trailing --- in the YAML.  The final entry is because of [2], so the unused manifest is intentional.  I'm a bit fuzzy on why we removed the annotation instead of removing the whole manifest.  If it's been months since making the manifest a no-op, maybe we should just drop it?  Or is there a ticket tracking work to make the ServiceMonitor functional again?

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1956606#c4
[2]: https://github.com/openshift/cluster-baremetal-operator/pull/111/commits/4b78e012b6fa438f5ca597461d9597462bb15dc7