Bug 1916554

Summary: Direct Volume Migration pods stuck at ContainerCreating status when PVC is in 'Terminating' state
Product: Migration Toolkit for Containers Reporter: Xin jiang <xjiang>
Component: General Assignee: Jaydip Gabani <jgabani>
Status: CLOSED ERRATA QA Contact: Xin jiang <xjiang>
Severity: medium Docs Contact: Avital Pinnick <apinnick>
Priority: medium    
Version: 1.4.0 CC: chezhang, ernelson, rjohnson, sregidor, whu
Target Milestone: ---   
Target Release: 1.4.2   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-03-15 08:15:36 UTC Type: Bug

Description Xin jiang 2021-01-15 03:27:33 UTC
Description of problem:
A direct migration may get stuck for a long time because the DVM pods are created but stay in Pending: they cannot mount a PVC that is stuck in a Terminating state.

Version-Release number of selected component (if applicable):
MTC 1.4.0
registry.redhat.io/rhmtc/openshift-migration-controller-rhel8@sha256:4a29345d11d4d7b8cc8a6a5395a398a5c5f92bff6e2ad396caf6dd73731a8f4d
registry.redhat.io/rhmtc/openshift-migration-rhel7-operator@sha256:51c38fd418c923992375c9ad18e5db1c14e6d77d3d7a02803df33c64c9bece2f

How reproducible:
Always

Steps to Reproduce:
1) Create a Pod that mounts a PVC
2) Create a MigPlan that references the Pod and PVC with DVM=true
3) Delete the PVC while it is mounted to the Pod; the PVC will stay in the Terminating state
4) Run a migration with the MigPlan
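
Step 1 can be sketched as a pair of manifests. This is only an illustration and not part of the original report: the namespace, the object names, the image, and the 1Gi size are all hypothetical. The script prints a JSON List that could be piped to `oc create -f -`.

```python
import json


def build_manifests(namespace: str, pvc_name: str, pod_name: str) -> dict:
    """Return a v1 List holding a PVC and a Pod that mounts that PVC."""
    pvc = {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": pvc_name, "namespace": namespace},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": "1Gi"}},  # hypothetical size
        },
    }
    pod = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": pod_name, "namespace": namespace},
        "spec": {
            "containers": [{
                "name": "app",
                "image": "registry.example.com/demo/app:latest",  # hypothetical image
                "volumeMounts": [{"name": "data", "mountPath": "/data"}],
            }],
            # The Pod references the claim by name, which keeps the PVC "in use".
            "volumes": [{
                "name": "data",
                "persistentVolumeClaim": {"claimName": pvc_name},
            }],
        },
    }
    return {"apiVersion": "v1", "kind": "List", "items": [pvc, pod]}


if __name__ == "__main__":
    # All three names are placeholders chosen for this sketch.
    print(json.dumps(build_manifests("demo-ns", "demo-pvc", "demo-pod"), indent=2))
```

Deleting the PVC while the Pod is still running (step 3) leaves it in Terminating because the `kubernetes.io/pvc-protection` finalizer blocks removal of a claim that is in use.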

Alternatively, you can use the automation below to reproduce it:
$ ansible-playbook -i inventory.cam.yml ocp-32834-pvc-terminating.yml -e @config/direct_copy_defaults.yml

Actual results:
DVM pods are stuck at ContainerCreating.

$ oc get pod -n ocp-32834-pvc-terminating
NAME                                              READY     STATUS              RESTARTS   AGE
directvolumemigration-rsync-transfer-nginx-html   0/1       ContainerCreating   0          29m
directvolumemigration-rsync-transfer-nginx-logs   0/1       ContainerCreating   0          29m
directvolumemigration-stunnel-transfer            1/1       Running             0          29m
nginx-deployment-6fd5f9ddf8-r6p9j                 1/1       Running             0          32m

Expected results:
The direct migration should fail after waiting for a period of time, or it should check the PVC status before starting the direct migration.
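
A pre-flight PVC status check of the kind suggested above could look roughly like the sketch below. It is not the check that mig-controller implements; it only relies on the fact that a PVC shown as Terminating has `metadata.deletionTimestamp` set. The function names are hypothetical, and the input is assumed to be the parsed output of `oc get pvc -n <namespace> -o json`.

```python
def is_terminating(pvc: dict) -> bool:
    """True when the PVC is marked for deletion but still exists."""
    return bool(pvc.get("metadata", {}).get("deletionTimestamp"))


def terminating_pvcs(pvc_list: dict) -> list:
    """Return namespace/name for every terminating PVC in a PVC List object."""
    names = []
    for pvc in pvc_list.get("items", []):
        if is_terminating(pvc):
            meta = pvc["metadata"]
            names.append(f'{meta.get("namespace", "")}/{meta["name"]}')
    return names


if __name__ == "__main__":
    # Hand-written sample data, not taken from a real cluster.
    sample = {
        "items": [
            {"metadata": {"name": "nginx-html", "namespace": "demo",
                          "deletionTimestamp": "2021-01-15T03:00:00Z"}},
            {"metadata": {"name": "nginx-logs", "namespace": "demo"}},
        ]
    }
    blocked = terminating_pvcs(sample)
    if blocked:
        print("refusing to start direct migration; terminating PVCs:", blocked)
```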


Additional info:
$ oc describe pod directvolumemigration-rsync-transfer-nginx-html -n ocp-32834-pvc-terminating

Events:
  Type     Reason       Age               From                                  Message
  ----     ------       ----              ----                                  -------
  Warning  FailedMount  20s (x3 over 4m)  kubelet, ip-172-18-4-43.ec2.internal  Unable to mount volumes for pod "directvolumemigration-rsync-transfer-nginx-html_ocp-32834-pvc-terminating(1c1ad6d5-56db-11eb-9e3c-0eb55368412b)": timeout expired waiting for volumes to attach or mount for pod "ocp-32834-pvc-terminating"/"directvolumemigration-rsync-transfer-nginx-html". list of unmounted volumes=[nginx-html]. list of unattached volumes=[nginx-html default-token-pm4vm]

Comment 1 Erik Nelson 2021-01-25 04:06:24 UTC
This is something we'd like to fix but not something I'd consider to be critically severe.

Comment 2 Xin jiang 2021-01-28 09:19:57 UTC
In an MTC 3.11 (controller) -> 4.7 migration, the behavior is different. I'm not sure whether it has the same cause as the "Unable to mount volumes for pod" error.

1. I didn't see any DVM pod being created.
$ oc get migmigration 3db53140-6143-11eb-8e38-431dc1d3e8a0 -o yaml
....
status:
  conditions:
  - category: Advisory
    lastTransitionTime: "2021-01-28T08:32:41Z"
    message: 'Step: 36/47'
    reason: WaitForDirectVolumeMigrationToComplete
    status: "True"
    type: Running
  - category: Required
    lastTransitionTime: "2021-01-28T08:31:39Z"
    message: The migration is ready.
    status: "True"
    type: Ready
  itinerary: Final
  observedDigest: f19ac39779c0d0ee1443c3580dad86e38eafa02a1cb4bdff18cac9c14b520005
  phase: WaitForDirectVolumeMigrationToComplete
  pipeline:
  - completed: "2021-01-28T08:32:08Z"
    message: Completed
    name: Prepare
    started: "2021-01-28T08:31:39Z"
  - completed: "2021-01-28T08:32:33Z"
    message: Completed
    name: Backup
    progress:
    - 'Backup openshift-migration/3db53140-6143-11eb-8e38-431dc1d3e8a0-ddxlj: 76 out of estimated total of 76 objects backed up (15s)'
    started: "2021-01-28T08:32:08Z"
  - completed: "2021-01-28T08:32:39Z"
    message: Completed
    name: StageBackup
    started: "2021-01-28T08:32:33Z"
  - message: Skipped
    name: StageRestore
    skipped: true
  - completed: "2021-01-28T08:32:41Z"
    message: Waiting for Direct Image Migration to complete.
    name: DirectImage
    phase: WaitForDirectImageMigrationToComplete
    progress:
    - 1 total ImageStreams; 0 running; 1 successful; 0 failed
    - 'ImageStream ocp-django/django-psql-persistent (dism openshift-migration/3db53140-6143-11eb-8e38-431dc1d3e8a0-6v5nt-k8ngv): Completed '
    started: "2021-01-28T08:32:39Z"
  - name: DirectVolume
    phase: WaitForDirectVolumeMigrationToComplete
    progress:
    - 1 total volumes; 0 successful; 0 running; 0 failed
    started: "2021-01-28T08:32:41Z"
  - message: Not started
    name: Restore
  - message: Not started
    name: Cleanup
  startTimestamp: "2021-01-28T08:31:39Z"
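
To spot where a migration like the one above is waiting, one could scan the `status.pipeline` list for steps that have a `started` time but no `completed` time. The helper below is a minimal sketch with a hypothetical name; it assumes the `status` object has been loaded as a dict, for example from `oc get migmigration <name> -o json`.

```python
def running_steps(status: dict) -> list:
    """Names of pipeline steps that started but are neither completed nor skipped."""
    return [
        step["name"]
        for step in status.get("pipeline", [])
        if step.get("started")
        and not step.get("completed")
        and not step.get("skipped")
    ]


if __name__ == "__main__":
    # Trimmed copy of the pipeline shown in the status output above.
    status = {
        "pipeline": [
            {"name": "Prepare", "started": "2021-01-28T08:31:39Z",
             "completed": "2021-01-28T08:32:08Z"},
            {"name": "StageRestore", "skipped": True, "message": "Skipped"},
            {"name": "DirectVolume", "started": "2021-01-28T08:32:41Z",
             "phase": "WaitForDirectVolumeMigrationToComplete"},
            {"name": "Restore", "message": "Not started"},
        ]
    }
    print(running_steps(status))
```

For the status shown above, only the DirectVolume step matches, which agrees with the reported phase `WaitForDirectVolumeMigrationToComplete`.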

$ oc get event -n ocp-django
......
1h          1h           1       django-psql-persistent-1-deploy.165e562824ce6e67   Pod                     spec.containers{deployment}               Normal   Killing                       kubelet, ip-172-18-7-104.ec2.internal    Killing container with id docker://deployment:Need to kill Pod
45m         45m          1       postgresql.165e584151ad2755                        DeploymentConfig                                                  Normal   ReplicationControllerScaled   deploymentconfig-controller              Scaled replication controller "postgresql-1" from 1 to 0
45m         45m          1       django-psql-persistent-1-vbdtm.165e58417b74be00    Pod                     spec.containers{django-psql-persistent}   Normal   Killing                       kubelet, ip-172-18-13-30.ec2.internal    Killing container with id docker://django-psql-persistent:Need to kill Pod
45m         45m          1       postgresql-1.165e58415a03fdc9                      ReplicationController                                             Normal   SuccessfulDelete              replication-controller                   Deleted pod: postgresql-1-j5kwc
45m         45m          1       django-psql-persistent-1.165e5841538da3c3          ReplicationController                                             Normal   SuccessfulDelete              replication-controller                   Deleted pod: django-psql-persistent-1-vbdtm
45m         45m          1       django-psql-persistent.165e58414fdba442            DeploymentConfig                                                  Normal   ReplicationControllerScaled   deploymentconfig-controller              Scaled replication controller "django-psql-persistent-1" from 1 to 0
45m         45m          1       postgresql-1-j5kwc.165e58416275175a                Pod                     spec.containers{postgresql}               Normal   Killing                       kubelet, ip-172-18-9-34.ec2.internal     Killing container with id docker://postgresql:Need to kill Pod

Comment 3 Jaydip Gabani 2021-02-25 15:46:06 UTC
https://github.com/konveyor/mig-controller/pull/958

The cherry-pick PR that brings the change into the release branch is: https://github.com/konveyor/mig-controller/pull/972

Comment 7 Xin jiang 2021-03-03 14:19:53 UTC
Verified. I talked with Jaydip: the fix only shows a warning in the UI, as below. The migration as a whole still stays stuck at that step, so that once the underlying problem is fixed, the migration can continue with the remaining phases without wasting time.

Warning alert: Paused - waiting for route to be admitted
Pods directvolumemigration-rsync-transfer-mysql/ocp-24769-cakephp are stuck in Pending state for more than 10 mins
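
The condition behind a warning like this can be approximated as "pod phase is Pending and the pod is older than 10 minutes". The sketch below illustrates that idea only; it is not the mig-controller implementation, and the function names are hypothetical. Pods displayed as ContainerCreating also report the Pending phase, so they are covered as well.

```python
from datetime import datetime, timedelta, timezone


def parse_ts(ts: str) -> datetime:
    """Parse a Kubernetes timestamp such as 2021-01-28T08:32:41Z as UTC."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def stuck_pending(pods: list, now: datetime,
                  limit: timedelta = timedelta(minutes=10)) -> list:
    """Names of pods whose phase is Pending and whose age exceeds `limit`."""
    names = []
    for pod in pods:
        if pod.get("status", {}).get("phase") != "Pending":
            continue
        created = parse_ts(pod["metadata"]["creationTimestamp"])
        if now - created > limit:
            names.append(pod["metadata"]["name"])
    return names


if __name__ == "__main__":
    # Hand-written sample pods; timestamps are illustrative only.
    now = datetime(2021, 3, 3, 14, 0, 0, tzinfo=timezone.utc)
    pods = [
        {"metadata": {"name": "rsync-transfer", "creationTimestamp": "2021-03-03T13:45:00Z"},
         "status": {"phase": "Pending"}},
        {"metadata": {"name": "stunnel-transfer", "creationTimestamp": "2021-03-03T13:45:00Z"},
         "status": {"phase": "Running"}},
    ]
    for name in stuck_pending(pods, now):
        print(f"Pod {name} is stuck in Pending state for more than 10 mins")
```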

Comment 11 errata-xmlrpc 2021-03-15 08:15:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Migration Toolkit for Containers (MTC) image release advisory 1.4.2), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0814