Bug 1853534
| Summary: | CAM Operator tries to spin up all nodes including infra nodes causing critical alerts | | |
|---|---|---|---|
| Product: | Migration Toolkit for Containers | Reporter: | Courtney Ruhm <cruhm> |
| Component: | General | Assignee: | Jason Montleon <jmontleo> |
| Status: | CLOSED ERRATA | QA Contact: | Xin jiang <xjiang> |
| Severity: | low | Docs Contact: | Avital Pinnick <apinnick> |
| Priority: | unspecified | CC: | ernelson, rjohnson, sregidor |
| Version: | 1.3.0 | Target Release: | 1.4.0 |
| Hardware: | Unspecified | OS: | Unspecified |
| Doc Type: | If docs needed, set a value | Type: | Bug |
| Last Closed: | 2021-02-11 12:54:46 UTC | | |
It would probably be a good idea to do this and make it customizable. I'll do some testing to see what works from 3.7 up to current and look at implementing something; it shouldn't be hard. My current thinking is that we do something like this (set the nodeSelector to daemonset_node_selector if it is defined in the MigrationController CR), but it's completely untested: https://github.com/konveyor/mig-operator/pull/498/files

I've been running some tests.
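For context, daemonset_node_selector would be set at the top level of the MigrationController CR spec. A minimal sketch is shown below; the CR name and namespace here are the usual MTC defaults and are assumptions, not values taken from this report:

    apiVersion: migration.openshift.io/v1alpha1
    kind: MigrationController
    metadata:
      name: migration-controller
      namespace: openshift-migration
    spec:
      # ... existing MigrationController settings ...
      # daemonset_node_selector is added here, directly under spec,
      # as a map of node label key/value pairs
      daemonset_node_selector:
        <label-key>: <label-value>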
If we configure this:
    daemonset_node_selector:
      topology.kubernetes.io/zone: us-east-2c
We get this error in the migration operator pod:
    igrationcontroller/tasks/main.yml:196
    fatal: [localhost]: FAILED! => {"changed": false, "error": 422, "msg": "Failed to patch object:
      {"kind": "Status", "apiVersion": "v1", "metadata": {}, "status": "Failure",
       "message": "DaemonSet.apps \"restic\" is invalid:
         [spec.template.spec.nodeSelector: Invalid value: \"u'topology.kubernetes.io/zone'\": prefix part a DNS-1123 subdomain
            must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character
            (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*'),
          spec.template.spec.nodeSelector: Invalid value: \"u'topology.kubernetes.io/zone'\": name part must consist of
            alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyName',
            or 'my.name', or '123-abc', regex used for validation is '([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]'),
          spec.template.spec.nodeSelector: Invalid value: \"u'us-east-2c'
This error occurs because the Jinja2 template renders the map with Python's unicode 'u' prefix, and the resulting YAML cannot be used to create the DaemonSet.
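As an illustration of that failure mode, here is a hypothetical Jinja2 fragment; it is not the actual mig-operator template, just a sketch of the difference between interpolating the dict directly and serializing it through a YAML filter:

    # Hypothetical fragment of a restic DaemonSet template (not the real
    # mig-operator source).
    spec:
      template:
        spec:
          # Interpolating the dict directly renders the Python 2 repr, e.g.
          #   nodeSelector: {u'topology.kubernetes.io/zone': u'us-east-2c'}
          # which yields the invalid u'...' keys seen in the error above:
          #
          #   nodeSelector: {{ daemonset_node_selector }}
          #
          # Serializing the map explicitly avoids the repr:
          nodeSelector:
            {{ daemonset_node_selector | to_nice_yaml | indent(12) }}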
If we use this configuration instead:
    daemonset_node_selector: '{ topology.kubernetes.io/zone: us-east-2b }'

The DaemonSet is configured correctly, and the result is this:
    $ oc get ds
    NAME     DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                            AGE
    restic   1         1         1       1            1           topology.kubernetes.io/zone=us-east-2b   17m
I'm not sure whether that's the desired behavior; it seems a bit odd to have to provide the node selector map formatted as a string.
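For what it's worth, the single-line string is just flow-style YAML, so it presumably ends up in the rendered DaemonSet manifest verbatim and is equivalent to the block-style map; a small illustration (not taken from the operator source):

    # Flow style, which is what the string workaround renders into the manifest:
    nodeSelector: { topology.kubernetes.io/zone: us-east-2b }

    # Equivalent block style:
    nodeSelector:
      topology.kubernetes.io/zone: us-east-2b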
Apart from that, we need to take into account that if we change the node selector configured in the MigrationController resource, the new one is added instead of replacing the existing one:
If we configure this, replacing the old "topology" node selector:

    daemonset_node_selector: '{ node-role.kubernetes.io/master: true }'

the result will be this (both the old and the new selector end up in the DaemonSet):
    $ oc get ds
    NAME     DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                                 AGE
    restic   0         0         0       0            0           node-role.kubernetes.io/master=true,topology.kubernetes.io/zone=us-east-2c   14m
If, after configuring a node selector, we remove "daemonset_node_selector", the node selector is not removed from the DaemonSet.
We probably need to delete the DaemonSet when a change in the "daemonset_node_selector" configuration is detected.
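Both behaviors would be consistent with the DaemonSet being updated through a merge-style patch: new keys in the nodeSelector map are merged into the existing map, and keys that disappear from the desired state are not removed from the live object. The restic pod template then ends up carrying both labels, roughly like the illustrative fragment below (not captured from the cluster):

    # Illustrative fragment of the resulting restic DaemonSet
    spec:
      template:
        spec:
          nodeSelector:
            node-role.kubernetes.io/master: "true"
            topology.kubernetes.io/zone: us-east-2c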
Apart from those problems, the configuration is actually applied, and the DaemonSet runs the pods on the right nodes.
@jmontleo, could you please check whether this behavior is what we want/need?
We move the BZ to VERIFIED status and will open a new BZ for the issues explained above. Verified using MTC 1.4.0 on an AWS 4.5 cluster.

Some observations so far:

- When you specify the nodeSelector in the MigrationController CR, Ansible replaces dashes in the key with underscores: node-role becomes node_role, which is not a valid DNS-1123 name.
- When you specify a value of 'true' or 'false' we get hung up, because the value needs to be sent as a string and not a bool. Working on a PR to address these at a minimum.
- node-role.kubernetes.io/master=true or node-role.kubernetes.io/worker=true do not work for me at all. It looks like it should be, as you found, node-role.kubernetes.io/worker='', which will schedule. By supplying it as a string you were overcoming all of these issues.

After dealing with the above I did not encounter a problem with:

    topology.kubernetes.io/zone: us-east-1b

With https://github.com/konveyor/mig-operator/pull/570 I can successfully define the daemonset_node_selector like so:

    daemonset_node_selector:
      node-role.kubernetes.io/worker: ""
      topology.kubernetes.io/zone: us-east-1b

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Migration Toolkit for Containers (MTC) tool image release advisory 1.4.0), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5329
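As a side note on the bool-vs-string observation in the verification notes above, the distinction comes down to YAML quoting; a small illustration with hypothetical values (not taken from this report):

    # Unquoted, YAML parses the value as a boolean, which is not a valid
    # label value for the DaemonSet's nodeSelector:
    daemonset_node_selector:
      node-role.kubernetes.io/worker: true

    # Quoted (here an empty string), the value is sent as a string and the
    # pods schedule on worker nodes as expected:
    daemonset_node_selector:
      node-role.kubernetes.io/worker: ""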
Description of problem:

Hello, we installed and used the CAM tool to migrate workloads from our 3.x to 4.x cluster. Everything worked great, but I have a small bug to report.

After CAM was installed on the cluster using the CAM operator (Cluster Application Migration Operator 1.2.3), we started receiving critical alerts that the restic DaemonSet (DS) could not spin up pods on all nodes. Two of the restic pods spun up, on our worker nodes. Three of the pods were failing; they were attempting to spin up on our infra nodes. I think the problem is that we have "infra" nodes, per the list below:

    [root@au1-ocpbast-p01 ~]# oc get nodes
    NAME                                         STATUS   ROLES    AGE   VERSION
    au1-ocpcomp-t01.ocp4-test.sarc.samsung.com   Ready    worker   58d   v1.17.1
    au1-ocpcomp-t02.ocp4-test.sarc.samsung.com   Ready    worker   58d   v1.17.1
    au1-ocpctrl-t01.ocp4-test.sarc.samsung.com   Ready    master   58d   v1.17.1
    au1-ocpctrl-t02.ocp4-test.sarc.samsung.com   Ready    master   58d   v1.17.1
    au1-ocpctrl-t03.ocp4-test.sarc.samsung.com   Ready    master   58d   v1.17.1
    au1-ocpinf-t01.ocp4-test.sarc.samsung.com    Ready    infra    58d   v1.17.1
    au1-ocpinf-t02.ocp4-test.sarc.samsung.com    Ready    infra    58d   v1.17.1
    au1-ocpinf-t03.ocp4-test.sarc.samsung.com    Ready    infra    58d   v1.17.1

There was no nodeSelector configured on the restic DS, so it was trying to spin up on all non-master nodes. I fixed the issue by adding a nodeSelector (see below), which worked like a charm. You might want to update the operator with this fix.

    nodeSelector:
      node-role.kubernetes.io/worker: ''

Version-Release number of selected component (if applicable): 4.4
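For reference, that selector belongs in the pod template of the restic DaemonSet (spec.template.spec.nodeSelector); a minimal sketch of the relevant fragment, with all surrounding fields omitted:

    # Fragment of the restic DaemonSet after the manual workaround; only the
    # relevant part of the pod template is shown.
    spec:
      template:
        spec:
          nodeSelector:
            node-role.kubernetes.io/worker: ""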