Bug 1853534

Summary: CAM Operator tries to spin up all nodes including infra nodes causing critical alerts
Product: Migration Toolkit for Containers
Reporter: Courtney Ruhm <cruhm>
Component: General
Assignee: Jason Montleon <jmontleo>
Status: CLOSED ERRATA
QA Contact: Xin jiang <xjiang>
Severity: low
Docs Contact: Avital Pinnick <apinnick>
Priority: unspecified
Version: 1.3.0
CC: ernelson, rjohnson, sregidor
Target Milestone: ---
Target Release: 1.4.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-02-11 12:54:46 UTC
Type: Bug

Description Courtney Ruhm 2020-07-03 00:40:29 UTC
Description of problem:
Hello,

We installed and used the CAM tool to migrate workloads from our 3.x to 4.x cluster.  Everything worked great, but I have a small bug to report.

After CAM was installed on the cluster using the CAM operator (Cluster Application Migration Operator 1.2.3), we started receiving critical alerts because the restic DaemonSet (DS) could not spin up pods on all of its target nodes.

Two of the restic pods spun up; these were on our worker nodes.
Three of the pods were failing; these were the ones attempting to spin up on our infra nodes.
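A quick way to see which nodes the failing pods land on (the openshift-migration namespace is an assumption here):

$ oc -n openshift-migration get ds restic
$ oc -n openshift-migration get pods -o wide | grep restic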

I think the problem is that we have "infra" nodes, per the list below: 
[root@au1-ocpbast-p01 ~]# oc get nodes
NAME                                         STATUS   ROLES    AGE   VERSION
au1-ocpcomp-t01.ocp4-test.sarc.samsung.com   Ready    worker   58d   v1.17.1
au1-ocpcomp-t02.ocp4-test.sarc.samsung.com   Ready    worker   58d   v1.17.1
au1-ocpctrl-t01.ocp4-test.sarc.samsung.com   Ready    master   58d   v1.17.1
au1-ocpctrl-t02.ocp4-test.sarc.samsung.com   Ready    master   58d   v1.17.1
au1-ocpctrl-t03.ocp4-test.sarc.samsung.com   Ready    master   58d   v1.17.1
au1-ocpinf-t01.ocp4-test.sarc.samsung.com    Ready    infra    58d   v1.17.1
au1-ocpinf-t02.ocp4-test.sarc.samsung.com    Ready    infra    58d   v1.17.1
au1-ocpinf-t03.ocp4-test.sarc.samsung.com    Ready    infra    58d   v1.17.1

There was no nodeSelector configured on the restic DS, so it was trying to spin up on all non-master nodes.

Fixed the issue by adding a nodeSelector (see below), which worked like a charm.  Might want to update the operator with this fix.

     nodeSelector:
        node-role.kubernetes.io/worker: ''
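For reference, a one-off patch that applies the same selector to a live daemonset might look like this (a sketch; the openshift-migration namespace is an assumption, and the operator may revert a manual edit on its next reconcile):

$ oc -n openshift-migration patch ds/restic --type merge \
    -p '{"spec":{"template":{"spec":{"nodeSelector":{"node-role.kubernetes.io/worker":""}}}}}'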

Version-Release number of selected component (if applicable):

4.4

Comment 1 Jason Montleon 2020-09-23 17:16:39 UTC
It would probably be a good idea to do this and make it customizable. I'll do some testing to see what works from 3.7 on up to current and look at implementing something. It shouldn't be hard.

Comment 3 Jason Montleon 2020-11-18 13:50:33 UTC
My current thinking is we do something like this (set the nodeSelector to daemonset_node_selector if it's defined in the MigrationController CR), but it's completely untested.

https://github.com/konveyor/mig-operator/pull/498/files
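For context, a minimal sketch of what setting daemonset_node_selector in the MigrationController CR might look like (the metadata values below are the usual MTC defaults and are assumptions here, not taken from this report):

apiVersion: migration.openshift.io/v1alpha1
kind: MigrationController
metadata:
  name: migration-controller
  namespace: openshift-migration
spec:
  daemonset_node_selector:
    node-role.kubernetes.io/worker: ""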

Comment 7 Sergio 2021-01-21 15:10:21 UTC
I've been running some tests.

If we configure this:

    daemonset_node_selector:
      topology.kubernetes.io/zone: us-east-2c

We get this error in the migration operator pod:

igrationcontroller/tasks/main.yml:196
fatal: [localhost]: FAILED! => {"changed": false, "error": 422, "msg": "Failed to patch object:
{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"DaemonSet.apps "restic" is invalid:
[spec.template.spec.nodeSelector: Invalid value: "u'topology.kubernetes.io/zone'": prefix part a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*'),
spec.template.spec.nodeSelector: Invalid value: "u'topology.kubernetes.io/zone'": name part must consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyName', or 'my.name', or '123-abc', regex used for validation is '([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]'),
spec.template.spec.nodeSelector: Invalid value: "u'us-east-2c' ...

This error occurs because the Jinja2 template renders the map with the Python 2 unicode 'u' prefix, and the resulting YAML cannot be used to create the daemonset.
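For illustration only (this is not the operator's actual template, just a sketch of the failure mode under a Python 2 Ansible runtime): interpolating the map directly into the template yields the Python repr with u'' prefixes, while serializing it first avoids that:

# hypothetical Jinja2 fragment in the daemonset template
nodeSelector: {{ daemonset_node_selector }}            # renders as {u'topology.kubernetes.io/zone': u'us-east-2c'}
nodeSelector: {{ daemonset_node_selector | to_json }}  # renders as {"topology.kubernetes.io/zone": "us-east-2c"}, a valid YAML flow mapping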


If we use this configuration instead:

 daemonset_node_selector: '{ topology.kubernetes.io/zone: us-east-2b }'


The daemonset is configured correctly, and the result is this:

$ oc get ds
NAME     DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                            AGE
restic   1         1         1       1            1           topology.kubernetes.io/zone=us-east-2b   17m



I'm not sure if that's the desired behavior. It seems a bit odd to have to provide the node selector map formatted as a string.


Apart from that, we need to take into account that if we change the nodeSelector configured in the MigrationController resource, it will be added to the existing one instead of replacing it:

If we configure this, replacing the old "topology" nodeSelector:
 daemonset_node_selector: '{ node-role.kubernetes.io/master: true }'

The result will be this (both the old and the new selector will be in the daemonset):

$ oc get ds
NAME     DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                                AGE
restic   0         0         0       0            0           node-role.kubernetes.io/master=true,topology.kubernetes.io/zone=us-east-2c   14m


If, after configuring a nodeSelector, we remove the "daemonset_node_selector" setting, the nodeSelector will not be removed from the daemonset.

Probably we need to delete the daemonset when we detect a change in the "daemonset_node_selector" configuration.
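Until something like that is in place, a manual workaround (a sketch, assuming the default openshift-migration namespace) is to delete the daemonset and let the operator recreate it with the currently configured selector:

$ oc -n openshift-migration delete ds restic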


Apart from those problems, the configuration is actually applied, and the DS runs the pods on the right nodes.

Could you please check whether this behavior is what we want/need? @jmontleo

Comment 8 Sergio 2021-01-22 15:53:47 UTC
We are moving the BZ to VERIFIED status and will open a new BZ for the issues explained above.

Verified using MTC 1.4.0 in an AWS 4.5 cluster.

Comment 9 Jason Montleon 2021-02-03 20:34:53 UTC
Some observations so far:

When you specify the nodeSelector in the MigrationController CR, Ansible replaces dashes in the key with underscores: node-role becomes node_role, which is not a valid DNS-1123 name.

When you specify a value of 'true' or 'false', we get hung up because the value needs to be sent as a string and not a bool.

Working on a PR to address these at a minimum.

node-role.kubernetes.io/master=true or node-role.kubernetes.io/worker=true do not work for me at all. It looks like it should be, as you found, node-role.kubernetes.io/worker='', which will schedule.

By supplying it as a string you were overcoming all of these issues.

After dealing with the above I did not encounter a problem with:
topology.kubernetes.io/zone: us-east-1b

With: https://github.com/konveyor/mig-operator/pull/570

I can successfully define the daemonset_node_selector like so:
  daemonset_node_selector:
    node-role.kubernetes.io/worker: ""
    topology.kubernetes.io/zone: us-east-1b
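As a quick check (again assuming the default openshift-migration namespace), the rendered selector can be inspected with:

$ oc -n openshift-migration get ds restic -o jsonpath='{.spec.template.spec.nodeSelector}'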

Comment 11 errata-xmlrpc 2021-02-11 12:54:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Migration Toolkit for Containers (MTC) tool image release advisory 1.4.0), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5329