* Description of problem:
The current implementation of the NFD operator creates a daemonset which only takes nodes with the "node-role.kubernetes.io/worker" label into account. Customers who have separate infra nodes cannot get nfd-worker pods deployed on those infra nodes.

* Version-Release number of selected component (if applicable):
{"level":"info","ts":1597222546.4806738,"logger":"cmd","msg":"Go Version: go1.13.4"}
{"level":"info","ts":1597222546.4808075,"logger":"cmd","msg":"Go OS/Arch: linux/amd64"}
{"level":"info","ts":1597222546.4808278,"logger":"cmd","msg":"Version of operator-sdk: v0.4.0+git"}
registry.redhat.io/openshift4/ose-cluster-nfd-operator@sha256:c34f885bc15d76d45486b1a6fb41990d07896ff5e319905d937d6ccf6affb44c

* Proposed title of this feature request
Enable the NFD operator to deploy NFD workers on nodes other than the ones with a worker label.

* What is the nature and description of the request?
Currently the NFD operator deploys NFD workers only on nodes which have the node-role.kubernetes.io/worker label set. Customers who need separate infra nodes are currently unable to have the NFD workers deployed on those nodes.

* Why does the customer need this? (List the business requirements here)
For cost reasons, dedicated nodes for infrastructure pods have been created. Those nodes have a node-role.kubernetes.io/infra label set and have special hardware installed (e.g. special network cards) which needs to be identified by the scheduler.

* List any affected packages or components
registry.redhat.io/openshift4/ose-node-feature-discovery
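For illustration, the hard-coded selector is visible on the deployed worker daemonset. This is a sketch; the namespace name matches the one used later in this bug, and the exact output format depends on the oc version:

```shell
# Show the node selector the operator pins its worker pods to.
oc get ds nfd-worker -n operator-nfd \
  -o jsonpath='{.spec.template.spec.nodeSelector}{"\n"}'
# Expected (format varies by client version):
# map[node-role.kubernetes.io/worker:]
```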
Is deploying the operator from the Operator-Hub a constraint? If not, you could deploy the operator from source and change the node selector in line https://github.com/openshift/cluster-nfd-operator/blob/release-4.4/assets/worker/0700_worker_daemonset.yaml#L17

```yaml
spec:
  nodeSelector:
    node-role.kubernetes.io/infra: ""
```

as follows:

```shell
git clone https://github.com/openshift/cluster-nfd-operator.git
cd cluster-nfd-operator
git checkout release-4.4
sed -i '17 s/node-role.kubernetes.io\/worker/node-role.kubernetes.io\/infra/' assets/worker/0700_worker_daemonset.yaml
make deploy
```

If this RFE is about getting this change as a dynamic choice via the Operator-Hub, then it must go against release 4.6 and we can then cherry-pick it to the desired version.
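Note that plain nodeSelector entries are ANDed, so a single selector cannot express "worker OR infra". If one daemonset had to cover both node sets, node affinity would be one way to do it, since nodeSelectorTerms are ORed. This is a sketch of what the asset could use instead, not what it currently ships with:

```yaml
# Sketch only: nodeSelectorTerms are ORed, so this matches nodes carrying
# either the worker or the infra role label.
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: node-role.kubernetes.io/worker
            operator: Exists
        - matchExpressions:
          - key: node-role.kubernetes.io/infra
            operator: Exists
```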
Hi Carlos,

while some customers may be able to deploy upstream images, most of mine cannot. They have to fulfill compliance regulations, and some don't even have access to GitHub. So this request is to make the operator from the Operator-Hub able to deploy NFD-worker pods on both "worker" and "infra" nodes.

Infra nodes will be created as per the following KCS: https://access.redhat.com/solutions/4287111

Hope that answers your question.
/Andreas
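For reference, the KCS approach boils down to adding the role label to the chosen nodes, along these lines (the node name is an example):

```shell
# Add the infra role label to an existing worker node; the worker label
# stays in place unless it is removed explicitly.
oc label node worker0 node-role.kubernetes.io/infra=
oc get nodes --show-labels | grep node-role.kubernetes.io
```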
Hi Andreas,

my question is: are customers removing the node-role.kubernetes.io/worker label after labelling a node as infra? That KCS only explains how to add labels to a node; in a vanilla deployment, the node-role.kubernetes.io/worker label should still be there.

Also, the KCS mentions "app" and "infra", which are two cases. If we open the door for infra nodes, we should consider both cases for customers. Or are we sure only worker and infra are going to be the supported use cases?

Another thing: you have filed this BZ against 4.4, but this would be a new feature of NFD, so it will need to go to master (4.7) and be backported. Is that OK, or is this an urgent fix from the field?
Hi Carlos,

when following the KCS mentioned in comment #3, the worker labels remain, but the default node selector is set to nodes with an "app" role:

```shell
% oc get scheduler/cluster -o json | jq ".spec.defaultNodeSelector"
"node-role.kubernetes.io/app="
```

That way one has to either configure the namespace with a proper annotation or override the selector somewhere else. The NFD operator does not allow such a modification, and it has to deploy pods on the infra nodes (nfd-master, nfd-operator) as well as on the worker nodes.

In order to separate the infra nodes, one then has to follow the instructions of the KCS and annotate the NFD namespace with "openshift.io/node-selector": "node-role.kubernetes.io/infra=" to override the default node selector. This in turn makes the deployment of NFD workers impossible: one would either have to change the deployment (which is overwritten by the operator and would require a second daemonset for the worker nodes), or one would have to deploy with the default node selector in place, which leaves the nfd-master pods stuck (they wait for master nodes, which are not available among the "app"-labeled nodes).

So we have the following situation:

```shell
% oc get nodes -o wide
NAME      STATUS   ROLES
master0   Ready    infra,master
master1   Ready    infra,master
master2   Ready    infra,master
worker0   Ready    infra,worker
worker1   Ready    infra,worker
worker2   Ready    app,worker
worker3   Ready    app,worker

% oc get ns/operator-nfd -o json | jq '.metadata.annotations."openshift.io/node-selector"'
"node-role.kubernetes.io/infra="

% oc get pods -o wide -n operator-nfd
NAME                            READY   STATUS    RESTARTS   AGE     IP             NODE
nfd-master-5vn7d                1/1     Running   1          7d20h   10.128.4.98    master0
nfd-master-9xnrp                1/1     Running   0          7d20h   10.129.0.42    master2
nfd-master-mvlwv                1/1     Running   0          7d20h   10.129.2.86    master1
nfd-operator-5cf5c9b74d-hrnxd   1/1     Running   0          7d22h   10.129.0.30    master2
nfd-operator-d5bf59888-f9t8h    0/1     Running   0          3d      10.129.0.132   master2
nfd-worker-4k9f9                1/1     Running   10         7d20h   192.168.4.81   worker1
nfd-worker-76r26                1/1     Running   20         7d20h   192.168.4.80   worker0
```

As you can see, there are no NFD workers running on the "app"-labeled nodes, because the annotation sets the node selector to "infra" nodes; that is required to get the nfd-master and nfd-operator pods started. The operator would now need a separate daemonset for the "app"-labeled nodes, as it seems to combine the namespace node selector and the "worker" node selector and only creates 2 replicas:

```shell
% oc get ds -n operator-nfd
NAME         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                     AGE
nfd-master   3         3         3       3            3           node-role.kubernetes.io/master=   16d
nfd-worker   2         2         2       2            2           node-role.kubernetes.io/worker=   16d   <== there are actually 4 worker nodes in the cluster
```

This lack of functionality should be addressed with a 4.6.z release, as this is the one which will be used by my customer. Currently there is no need for further backports.
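For completeness, one partial mitigation (an assumption on my side, not something the operator supports): an empty openshift.io/node-selector annotation disables the cluster-wide default for that namespace, which unblocks nfd-master, but it does not change the daemonset's hard-coded worker selector:

```shell
# An empty value overrides the cluster defaultNodeSelector for this
# namespace; scheduling then falls back to each pod's own nodeSelector
# (master= for nfd-master, worker= for nfd-worker).
oc annotate namespace operator-nfd openshift.io/node-selector="" --overwrite
```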
https://github.com/kubernetes-sigs/node-feature-discovery-operator/pull/31 adds this functionality. This bug is filed against 4.4; just to double-check: do we need this backported?
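For anyone following along, the shape of the change is roughly a configurable selector on the NodeFeatureDiscovery CR. The apiVersion and field name below are illustrative only; the authoritative schema is in the linked PR:

```yaml
apiVersion: nfd.kubernetes.io/v1
kind: NodeFeatureDiscovery
metadata:
  name: nfd-instance
  namespace: operator-nfd
spec:
  # Illustrative field name; check the PR / CRD for the actual schema.
  nodeSelector:
    node-role.kubernetes.io/infra: ""
```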
Hi, my customer is aiming for OCP 4.6 and is currently on 4.5. I don't think we need a backport to 4.4, but it would be great if we could already test the change in a 4.5 release. /Andreas
Verified on OCP 4.6.0-0.nightly-2020-10-08-182439 with the NFD operator deployed both from the GitHub master repo and from the OperatorHub. Followed the steps in https://access.redhat.com/solutions/4287111 to create infra nodes. The infra nodes were labeled by NFD.
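One way to spot-check the result (assuming NFD's default feature label prefix):

```shell
# List the infra nodes, then confirm NFD feature labels landed on one of them.
oc get nodes -l node-role.kubernetes.io/infra -o name
oc describe node <some-infra-node> | grep feature.node.kubernetes.io/
```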
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6.1 extras update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4198