Bug 2002420

Summary: "Stage" pod not created for completed application pod, causing the "mig-controller" to stall
Product: Migration Toolkit for Containers Reporter: Derek Whatley <dwhatley>
Component: Controller Assignee: Jaydip Gabani <jgabani>
Status: CLOSED ERRATA QA Contact: Xin jiang <xjiang>
Severity: urgent Docs Contact: Avital Pinnick <apinnick>
Priority: urgent    
Version: 1.6.0 CC: ernelson, rjohnson
Target Milestone: ---   
Target Release: 1.6.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Last Closed: 2021-09-29 14:36:19 UTC Type: Bug
Attachments:
Description Flags
Logs showing the controller "creating" the stage pod and then waiting none

Description Derek Whatley 2021-09-08 18:38:37 UTC
Created attachment 1821605
Logs showing the controller "creating" the stage pod and then waiting

Description of problem:
When I create the test app from the reproducer steps below and migrate it, the result is:
1. Logs indicate that a stage pod is being launched for the completed 'validator' application pod.
2. The stage pod is not actually launched.
3. mig-controller stalls waiting for the stage pod to enter the Running state, which will never happen.
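
One way to confirm the second symptom is to list the pods in the affected namespace and look for stage pods. This is a minimal sketch, not taken from the report: `list_stage_pods` is a hypothetical helper, and the `stage-` name prefix is an assumption about how mig-controller names stage pods.

```shell
# Hypothetical helper: print the names of pods in a namespace whose names
# start with "stage-". The "stage-" prefix is an ASSUMPTION about stage pod
# naming; adjust the pattern if your stage pods are named differently.
list_stage_pods() {
  ns="$1"
  # "oc get pods -o name" prints entries such as "pod/<name>";
  # strip everything up to the slash, then keep only matching names.
  oc get pods -n "$ns" -o name | sed 's|^.*/||' | grep '^stage-' || true
}
```

Calling `list_stage_pods ocp-31309-quotanoattach` against the cluster where the migration is stalled should print nothing if the behavior described here is occurring, even though the controller logs report the stage pod as created.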

I've traced the introduction of this issue to PR https://github.com/konveyor/mig-controller/pull/1164 (the problem does not occur before that commit and does occur after it), but I'm not sure of the complete intent and reasoning behind the changes made there.

Version-Release number of selected component (if applicable):
MTC 1.6.0. 
Clusters: OCP 4.8 AWS (control) / OCP 3.11 AWS (remote)

How reproducible:
Always

Steps to Reproduce:
1. Create a namespace and a quota

$ oc new-project ocp-31309-quotanoattach

Create this quota

$ cat <<EOF | oc create -f -
apiVersion: v1
kind: ResourceQuota
metadata:
  name: object-quota
  namespace: ocp-31309-quotanoattach
spec:
  hard:
    persistentvolumeclaims: "2"
    services.loadbalancers: "0"
    services.nodeports: "0"
    pods: "1"
    replicationcontrollers: "1"
    secrets: "6"
    configmaps: "4"
    services: "10"
    limits.cpu: "20"
    limits.memory: 20Gi
    requests.cpu: "10"
    requests.memory: 10Gi
EOF

2. Create a PVC

$ cat <<EOF | oc create -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: quoatdev-test
  namespace: ocp-31309-quotanoattach
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Mi
EOF


3. Provision the PVC

$ cat <<EOF | oc create -f -
apiVersion: v1
kind: Pod
metadata:
  name: provisioner-pod
  namespace: ocp-31309-quotanoattach
  labels:
    app: provision
spec:
  restartPolicy: OnFailure
  containers:
  - name: provisioner
    resources:
      limits:
        cpu: "0.01"
        memory: 128Mi
    image: alpine
    command: [ "/bin/sh", "-c", "--" ]
    args: [ "echo 'data inserted' > /data/vol/data.txt ; dd if=/dev/urandom of=/data/vol/binary.rnd  bs=1000000  count=1" ]
    volumeMounts:
    - name: testvolume
      mountPath: /data/vol
  volumes:
  - name: testvolume
    persistentVolumeClaim:
      claimName: quoatdev-test
EOF


4. Remove the provisioner pod once it's completed

$ oc delete pod provisioner-pod -n ocp-31309-quotanoattach
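
Before running the delete in step 4, it can help to confirm that the provisioner pod has actually finished. The following is a sketch under assumptions: `wait_for_pod_succeeded` is a hypothetical helper, and the retry count and 2-second interval are arbitrary choices, not values from this report.

```shell
# Hypothetical helper: poll a pod until its phase is "Succeeded".
# Returns 0 on success, 1 if the pod has not succeeded after the given tries.
wait_for_pod_succeeded() {
  ns="$1"
  pod="$2"
  tries="${3:-30}"   # arbitrary default, not from the bug report
  while [ "$tries" -gt 0 ]; do
    phase="$(oc get pod "$pod" -n "$ns" -o jsonpath='{.status.phase}')"
    if [ "$phase" = "Succeeded" ]; then
      return 0
    fi
    tries=$((tries - 1))
    sleep 2
  done
  return 1
}
```

For example, `wait_for_pod_succeeded ocp-31309-quotanoattach provisioner-pod && oc delete pod provisioner-pod -n ocp-31309-quotanoattach` would delete the pod only after it has completed.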

5. Create a validation Job

$ cat <<EOF | oc create -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: validator-job
  namespace: ocp-31309-quotanoattach
  labels:
    app: validation
spec:
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: validator
        image: alpine
        resources:
          limits:
            cpu: "0.01"
            memory: 128Mi
        command: [ "/bin/sh", "-c", "--" ]
        args:
          - set -e;
            echo 'Validating';
            cd /data/vol;
            ls data.txt;
            ls binary.rnd;
            export CONTENT=\$(cat data.txt);
            [[ "\$CONTENT" == 'data inserted' ]] ||  { echo 'Wrong data content' && exit 1; } ;
            export SIZE=\$( wc -c binary.rnd  | cut -d ' ' -f 1 );
            [[ \$SIZE  == '1000000' ]] || { echo 'Wrong binary file size' && exit 1; };
        volumeMounts:
        - name: testvolume
          mountPath: /data/vol
      volumes:
      - name: testvolume
        persistentVolumeClaim:
          claimName: quoatdev-test
  backoffLimit: 4
EOF


6. Migrate the namespace once the validator pod is completed (do not delete the validator pod)
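
To know when the validator has completed before starting the migration, the Job's `Complete` condition can be awaited. This is a sketch that assumes your `oc` client provides the `wait` subcommand; `wait_for_job_complete` is a hypothetical helper and the 120s default timeout is an arbitrary choice.

```shell
# Hypothetical helper: block until a Job reports the "Complete" condition
# or the timeout expires. The default timeout is an arbitrary value.
wait_for_job_complete() {
  ns="$1"
  job="$2"
  timeout="${3:-120s}"
  oc wait --for=condition=complete "job/$job" -n "$ns" --timeout="$timeout"
}
```

For the reproducer above, this would be invoked as `wait_for_job_complete ocp-31309-quotanoattach validator-job`. The completed validator pod must be left in place afterwards, since its presence is what triggers the problem.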

Actual results:
The migration gets stuck waiting for stage pods to come online, even though the pods were never created.


Expected results:
Migration will either:
1) create stage pods and then wait for them
2) not create stage pods and not wait for them


Additional info:

Comment 2 Jaydip Gabani 2021-09-09 18:08:55 UTC
This PR cherry-picks the change into the release branch: https://github.com/konveyor/mig-controller/pull/1199

Changing the status to MODIFIED

Comment 7 Xin jiang 2021-09-15 09:08:03 UTC
Verified with MTC 1.6.0

registry.redhat.io/rhmtc/openshift-migration-controller-rhel8@sha256:3b5efa9c8197fe0313a2ab7eb184d135ba9749c9a4f0d15a6abb11c0d18b9194
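
Anyone repeating this verification may want to check that their running controller uses the image digest quoted above. The following is a sketch only: `image_has_digest` is a hypothetical helper, and the deployment name, namespace, and container index in the commented usage are assumptions that are not stated in this report.

```shell
# Hypothetical helper: succeed if an image reference is pinned to the
# given digest (that is, it ends with "@<digest>").
image_has_digest() {
  image="$1"
  digest="$2"
  case "$image" in
    *"@$digest") return 0 ;;
    *) return 1 ;;
  esac
}

# Example usage (deployment name, namespace, and container index are ASSUMPTIONS):
#   image="$(oc get deployment migration-controller -n openshift-migration \
#     -o jsonpath='{.spec.template.spec.containers[0].image}')"
#   image_has_digest "$image" \
#     "sha256:3b5efa9c8197fe0313a2ab7eb184d135ba9749c9a4f0d15a6abb11c0d18b9194"
```

Comparing by digest rather than by tag avoids ambiguity when a tag has been moved to a newer build.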

Comment 9 errata-xmlrpc 2021-09-29 14:36:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Migration Toolkit for Containers (MTC) 1.6.0 security & bugfix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3694