+++ This bug was initially created as a clone of Bug #1982604 +++

Description of problem:

When creating a migplan for an application with a PV in an AWS OCP 3.9 cluster, the MigAnalytic sometimes fails to gather the source cluster resources; all values are reported as 0. The probability of occurrence is very high.

Version-Release number of selected component (if applicable):

MTC 1.5.0
image: quay-enterprise-quay-enterprise.apps.cam-tgt-21420.qe.devcluster.openshift.com/admin/openshift-migration-rhel7-operator:v1.5.0-23
Source cluster: AWS OCP 3.9 (controller)
Target cluster: AWS OCP 4.8

How reproducible:

1. Prepare an nginx application in the 3.9 source cluster
   $ ansible-playbook deploy-app.yml -e use_role=ocp-nginxpv -e namespace=ocp-24706-basicvolmig
2. Create an indirect migration plan against nginx
3. Check the analytics message in the migration plan

Actual results:

The migplan reaches Ready status, but there is a warning message "Failed gathering extended PV usage information for PVs [nginx-logs nginx-html]" in the migplan, and the MigAnalytic reports all values as 0.

Expected results:

The migplan reaches Ready status without warning or error messages, and the MigAnalytic reports the correct (non-zero) values.

Additional info:

$ oc get pvc -n ocp-24706-basicvolmig
NAME         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
nginx-html   Bound    pvc-602321ba-e51c-11eb-b736-0e2f00abf38f   1Gi        RWO            gp2            1h
nginx-logs   Bound    pvc-601ee567-e51c-11eb-b736-0e2f00abf38f   1Gi        RWO            gp2            1h

$ oc get migplan ocp-24706-basicvolmig-migplan-1626319591 -o yaml
apiVersion: migration.openshift.io/v1alpha1
kind: MigPlan
metadata:
  ……...
  name: ocp-24706-basicvolmig-migplan-1626319591
  namespace: openshift-migration
spec:
  destMigClusterRef:
    name: target-cluster
    namespace: openshift-migration
  indirectImageMigration: true
  indirectVolumeMigration: true
  migStorageRef:
    name: automatic
    namespace: openshift-migration
  namespaces:
  - ocp-24706-basicvolmig
  persistentVolumes:
  - capacity: 1Gi
    name: pvc-601ee567-e51c-11eb-b736-0e2f00abf38f
    proposedCapacity: "0"
    pvc:
      accessModes:
      - ReadWriteOnce
      hasReference: true
      name: nginx-logs
      namespace: ocp-24706-basicvolmig
    selection:
      action: copy
      copyMethod: filesystem
      storageClass: gp2
    storageClass: gp2
    supported:
      actions:
      - copy
      - move
      copyMethods:
      - filesystem
      - snapshot
  - capacity: 1Gi
    name: pvc-602321ba-e51c-11eb-b736-0e2f00abf38f
    proposedCapacity: "0"
    pvc:
      accessModes:
      - ReadWriteOnce
      hasReference: true
      name: nginx-html
      namespace: ocp-24706-basicvolmig
    selection:
      action: copy
      copyMethod: filesystem
      storageClass: gp2
    storageClass: gp2
    supported:
      actions:
      - copy
      - move
      copyMethods:
      - filesystem
      - snapshot
  srcMigClusterRef:
    name: host
    namespace: openshift-migration
status:
  conditions:
  - category: Required
    lastTransitionTime: 2021-07-15T03:26:36Z
    message: The `persistentVolumes` list has been updated with discovered PVs.
    reason: Done
    status: "True"
    type: PvsDiscovered
  - category: Required
    lastTransitionTime: 2021-07-15T03:26:36Z
    message: The storage resources have been created.
    reason: Done
    status: "True"
    type: StorageEnsured
  - category: Warn
    lastTransitionTime: 2021-07-15T04:11:44Z
    message: Failed gathering extended PV usage information for PVs [nginx-logs nginx-html], please see MigAnalytic openshift-migration/ocp-24706-basicvolmig-migplan-1626319591-szwd6 for details
    reason: FailedRunningDf
    status: "True"
    type: ExtendedPVAnalysisFailed
  - category: Required
    lastTransitionTime: 2021-07-15T03:26:36Z
    message: The migration plan is ready.
    status: "True"
    type: Ready
  destStorageClasses:
  - accessModes:
    - ReadWriteOnce
    default: true
    name: gp2
    provisioner: kubernetes.io/aws-ebs
  - accessModes:
    - ReadWriteOnce
    name: gp2-csi
    provisioner: ebs.csi.aws.com
  excludedResources:
  ………

$ oc get miganalytic ocp-24706-basicvolmig-migplan-1626319591-szwd6 -n openshift-migration -o yaml
apiVersion: migration.openshift.io/v1alpha1
kind: MigAnalytic
metadata:
  ……...
  name: ocp-24706-basicvolmig-migplan-1626319591-szwd6
  namespace: openshift-migration
spec:
  analyzeExtendedPVCapacity: true
  analyzeImageCount: false
  analyzeK8SResources: false
  analyzePVCapacity: false
  migPlanRef:
    name: ocp-24706-basicvolmig-migplan-1626319591
    namespace: openshift-migration
status:
  analytics:
    excludedk8sResourceTotal: 0
    imageCount: 0
    imageSizeTotal: "0"
    incompatiblek8sResourceTotal: 0
    k8sResourceTotal: 0
    namespaces:
    - excludedK8SResourceTotal: 0
      imageCount: 0
      imageSizeTotal: "0"
      incompatibleK8SResourceTotal: 0
      k8sResourceTotal: 0
      namespace: ocp-24706-basicvolmig
      persistentVolumes:
      - actualCapacity: "0"
        comment: No change in PV capacity is needed.
        name: nginx-logs
        proposedCapacity: "0"
        requestedCapacity: 1Gi
      - actualCapacity: "0"
        comment: No change in PV capacity is needed.
        name: nginx-html
        proposedCapacity: "0"
        requestedCapacity: 1Gi
      pvCapacity: "0"
      pvCount: 0
    percentComplete: 100
    plan: ocp-24706-basicvolmig-migplan-1626319591
    pvCapacity: "0"
    pvCount: 0
  conditions:
  - category: Warn
    lastTransitionTime: 2021-07-15T03:26:34Z
    message: Failed gathering extended PV usage information for PVs [nginx-logs nginx-html]
    reason: FailedRunningDf
    status: "True"
    type: ExtendedPVAnalysisFailed
  - category: Required
    lastTransitionTime: 2021-07-15T03:26:34Z
    message: The analytic is ready.
    status: "True"
    type: Ready
  observedGeneration: 1

--- Additional comment on 2021-07-15 09:16:22 UTC ---

One similar bug: https://bugzilla.redhat.com/show_bug.cgi?id=1918504
I've seen this bug in my 3.9 environment (AWS 3.9 -> AWS 4.8). When I clicked "Refresh", the analytic reported the right values. Nevertheless, the warning about the resize functionality remained, even after refreshing the migplan in addition to the miganalytic.
This isn't so much a problem with Analytics as it is with the PV resize feature, which uses the Restic daemonset to determine the actual disk usage of the volumes. IIUC, the failures are limited to using analytics for PV resize when migrating from older OCP releases (3.7, 3.9) and the application pod comes into existence after the Restic daemonset was started. Restic uses a hostPath mount to peer into the volume, and bind remount does not exist on these versions, so if the application comes up after the daemonset, Restic is oblivious to it. Possible solutions might include restarting the daemonset before running the analytic (I think this would be costly performance-wise on large clusters), or creating a pod on the node to run the size check instead of using the Restic daemonset, so that it always exists after the application.
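A rough way to confirm this ordering on the source cluster is to compare the start times of the Restic pods and the application pods per node. This sketch assumes the Restic daemonset is deployed in the openshift-migration namespace with the usual name=restic label and that the test app lives in ocp-24706-basicvolmig:

$ oc get pods -n openshift-migration -l name=restic -o jsonpath='{range .items[*]}{.spec.nodeName}{"\t"}{.status.startTime}{"\n"}{end}'
$ oc get pods -n ocp-24706-basicvolmig -o jsonpath='{range .items[*]}{.spec.nodeName}{"\t"}{.status.startTime}{"\n"}{end}'

If the application pod started after the Restic pod on the same node, the hostPath view described above will not include the application's volume, and the df check fails.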
Pranav/Jason, my suggestion is that we do _not_ address this with a code fix. We document it as a known issue for customers running a 3.7/3.9 source cluster and explain that they can either restart Restic or proceed without some functionality (assume they won't be able to resize PVs, maybe some loss of progress, etc.; we can explain this in the doc note). Does that sound reasonable?
John, that is reasonable. Assuming the feature degrades gracefully and users still have a way to manually mitigate the degradation, I wouldn't consider this a blocker. I will take responsibility for documenting this in our upstream docs.
In my previous comment, I forgot to note an important thing: when PV resizing degrades gracefully (FailedRunningDf condition on the MigAnalytic), it does _not_ block migrations from proceeding. The migrations still work. The only difference is that the migration cannot automatically resize the volumes in the target cluster based on volume usage, because the MigAnalytic failed to collect that information. If users care about PV resizing, they need to bounce the Restic pods once for resizing to happen automatically. If they don't care about PV resizing, they can simply proceed with PV resizing disabled.
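For reference, a minimal sketch of the workaround on the source cluster, assuming the Restic daemonset runs in the openshift-migration namespace with the name=restic label (the daemonset controller recreates the pods automatically):

$ oc delete pods -n openshift-migration -l name=restic

Once the Restic pods are running again, refreshing the migplan/miganalytic (as noted in an earlier comment) should report the actual PV usage, and automatic resizing can proceed.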
Shifting this to a docs BZ so we can document upstream and get that info propagated downstream as you see fit, Avital. Pranav will follow with the details.
I can add this to the release notes > known issues for 1.4.6. Since you do not plan to address this with a code fix, I will add the bug and workaround to 1.5.0 release notes as well.
Update: I will put this in 1.5.0 release notes and not in 1.4.6 because PV resizing was introduced as a 1.5.0 feature and only appears in the documentation for that release.
Changes merged for OCP 4.8/MTC 1.5.0 RN
*** Bug 1982604 has been marked as a duplicate of this bug. ***