Bug 1868616
Summary: | [RFE] enable NFD operator to deploy nfd-worker pods on nodes labeled other than "node-role.kubernetes.io/worker" | | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Andreas Bleischwitz <ableisch> |
Component: | Node Feature Discovery Operator | Assignee: | Carlos Eduardo Arango Gutierrez <carangog> |
Status: | CLOSED ERRATA | QA Contact: | Walid A. <wabouham> |
Severity: | high | Docs Contact: | |
Priority: | medium | ||
Version: | 4.4 | CC: | carangog, sejug |
Target Milestone: | --- | ||
Target Release: | 4.6.0 | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2020-10-27 15:09:55 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Andreas Bleischwitz
2020-08-13 10:03:02 UTC
Is deploying the operator from the Operator-Hub a constraint? If not, you could deploy the operator from source and change the node selector in line https://github.com/openshift/cluster-nfd-operator/blob/release-4.4/assets/worker/0700_worker_daemonset.yaml#L17

```yaml
spec:
  nodeSelector:
    node-role.kubernetes.io/infra: ""
```

Following:

```shell
git clone https://github.com/openshift/cluster-nfd-operator.git
cd cluster-nfd-operator
git checkout release-4.4
sed -i '17 s/node-role.kubernetes.io\/worker/node-role.kubernetes.io\/infra/' assets/worker/0700_worker_daemonset.yaml
make deploy
```

If this RFE is about making this change a dynamic choice via Operator-Hub, then it must be filed against release 4.6 and we can then cherry-pick it to the desired version.

Hi Carlos, while some customers may be able to deploy upstream images, most of mine are not. They have to fulfill compliance regulations and won't be able to do so. Some don't even have access to GitHub. So this request is to make the operator from Operator-Hub able to deploy nfd-worker pods on "worker" and "infra" nodes. Infra nodes will be created as per the following KCS: https://access.redhat.com/solutions/4287111

Hope that answers your question. /Andreas

Hi Andreas, my question is: are customers removing the node-role.kubernetes.io/worker label after labelling the node as infra? That KCS only describes how to add labels to a node; in a vanilla deployment the node-role.kubernetes.io/worker label should still be there. The KCS also mentions "app" and "infra", which are two cases. I think if we open the door for infra nodes, we should then consider both cases for customers. Or are we sure that only worker and infra are going to be the supported use cases?

Another thing: you have filed this BZ against 4.4, but this would be a new feature of NFD, so it will need to go to master (4.7) and be backported. Is that OK, or is this an urgent fix from the field?
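The sed substitution in the workaround above can be exercised locally before running `make deploy`. A minimal dry-run sketch, using a scratch stand-in file rather than the real `assets/worker/0700_worker_daemonset.yaml`:

```shell
# Create a stand-in for the daemonset manifest; only the
# nodeSelector stanza matters for this dry run.
cat > /tmp/0700_worker_daemonset.yaml <<'EOF'
spec:
  nodeSelector:
    node-role.kubernetes.io/worker: ""
EOF

# Same substitution as in the workaround above, using '#' as the
# sed delimiter to avoid escaping the slashes in the label key.
sed -i 's#node-role.kubernetes.io/worker#node-role.kubernetes.io/infra#' /tmp/0700_worker_daemonset.yaml

# Confirm the selector now targets infra nodes.
grep 'node-role.kubernetes.io/infra' /tmp/0700_worker_daemonset.yaml
```

Inspecting the file before deploying avoids pushing a daemonset with an unintended selector into the cluster.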
Hi Carlos, when following the KCS mentioned in comment #3, the worker labels remain, but the default node selector is set to nodes with an "app" role:

```shell
% oc get scheduler/cluster -o json | jq ".spec.defaultNodeSelector"
"node-role.kubernetes.io/app="
```

That way one has to either configure the namespace with a proper annotation or override the selector elsewhere. The NFD operator does not allow such a modification, and it has to deploy pods on infra nodes (nfd-master, nfd-operator) as well as on the worker nodes.

To separate out the infra nodes, one then has to follow the instructions in the KCS and annotate the NFD namespace with "openshift.io/node-selector": "node-role.kubernetes.io/infra=" to override the default node selector. This in turn makes deploying the nfd-worker pods impossible: it would require either changing the deployment (which is overwritten by the operator and would require a second deployment for the worker nodes), or deploying with the default node selector in place, which leaves the nfd-master pods stuck (waiting for master nodes, which are not available among the "app"-labeled nodes).

So we have the following situation:

```shell
% oc get nodes -o wide
NAME     STATUS  ROLES
master0  Ready   infra,master
master1  Ready   infra,master
master2  Ready   infra,master
worker0  Ready   infra,worker
worker1  Ready   infra,worker
worker2  Ready   app,worker
worker3  Ready   app,worker

% oc get ns/operator-nfd -o json | jq '.metadata.annotations."openshift.io/node-selector"'
"node-role.kubernetes.io/infra="

% oc get pods -o wide -n operator-nfd
NAME                           READY  STATUS   RESTARTS  AGE    IP            NODE
nfd-master-5vn7d               1/1    Running  1         7d20h  10.128.4.98   master0
nfd-master-9xnrp               1/1    Running  0         7d20h  10.129.0.42   master2
nfd-master-mvlwv               1/1    Running  0         7d20h  10.129.2.86   master1
nfd-operator-5cf5c9b74d-hrnxd  1/1    Running  0         7d22h  10.129.0.30   master2
nfd-operator-d5bf59888-f9t8h   0/1    Running  0         3d     10.129.0.132  master2
```
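The namespace annotation described above can also be set declaratively. A minimal sketch, assuming the namespace is named `operator-nfd` as in the `oc` output above:

```yaml
# Namespace manifest fragment: the openshift.io/node-selector project
# annotation pins every pod created in this namespace to infra nodes,
# overriding the cluster-wide defaultNodeSelector.
apiVersion: v1
kind: Namespace
metadata:
  name: operator-nfd
  annotations:
    openshift.io/node-selector: "node-role.kubernetes.io/infra="
```

The same effect can be achieved imperatively with `oc annotate namespace operator-nfd openshift.io/node-selector="node-role.kubernetes.io/infra="`.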
```shell
nfd-worker-4k9f9               1/1    Running  10        7d20h  192.168.4.81  worker1
nfd-worker-76r26               1/1    Running  20        7d20h  192.168.4.80  worker0
```

As you can see, there are no nfd-worker pods running on the "app"-labeled nodes, because the annotation sets the node selector to "infra" nodes, which is required to get the nfd-master and operator pods started. The operator would now need a separate daemonset for the "app"-labeled nodes, as it currently seems to combine the namespace node-selector annotation with the daemonset's "worker" node selector and only creates 2 replicas:

```shell
% oc get ds -n operator-nfd
NAME        DESIRED  CURRENT  READY  UP-TO-DATE  AVAILABLE  NODE SELECTOR                    AGE
nfd-master  3        3        3      3           3          node-role.kubernetes.io/master=  16d
nfd-worker  2        2        2      2           2          node-role.kubernetes.io/worker=  16d   <== there are actually 4 worker nodes in the cluster
```

This lack of functionality should be addressed in a 4.6.z release, as that is the one my customer will use. Currently there is no need for further backports beyond this.

https://github.com/kubernetes-sigs/node-feature-discovery-operator/pull/31 adds this functionality.

This bug is filed against 4.4; just to double-check, do we need this backported?

Hi, my customer is aiming for OCP 4.6 and is currently on 4.5. I don't think we need a backport to 4.4, but it would be great if we could already test the change in a 4.5 release. /Andreas

Verified on OCP 4.6.0-0.nightly-2020-10-08-182439 with the NFD operator deployed from the GitHub master repo and also from OperatorHub. Followed the steps in https://access.redhat.com/solutions/4287111 to create infra nodes. Infra nodes were labeled by NFD.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6.1 extras update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4198
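The linked PR makes the worker node selector configurable through the operator's custom resource. The exact schema is defined by the PR itself; the idea can be sketched roughly like this (the `spec.operand.nodeSelector` field name and the rest of this CR are assumptions for illustration, not the merged API):

```yaml
# Hypothetical NodeFeatureDiscovery CR sketch: the operand daemonset
# would take its nodeSelector from the CR instead of a hard-coded
# "worker" label, so infra- or app-labeled nodes can be targeted too.
apiVersion: nfd.openshift.io/v1
kind: NodeFeatureDiscovery
metadata:
  name: nfd-instance
  namespace: operator-nfd
spec:
  operand:
    nodeSelector:              # assumed field; see PR #31 for the real schema
      node-role.kubernetes.io/infra: ""
```

With a field like this, targeting infra nodes becomes a CR edit rather than a source-level patch of the daemonset asset.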