Bug 1982729 - MigAnalytic fails to get source cluster resources, all reported as 0 in ocp 3.9 sometimes
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: Migration Toolkit for Containers
Classification: Red Hat
Component: Documentation
Version: 1.4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 1.5.0
Assignee: Avital Pinnick
QA Contact: Xin jiang
Docs Contact: Avital Pinnick
URL:
Whiteboard:
Duplicates: 1982604
Depends On: 1982604
Blocks:
Reported: 2021-07-15 14:48 UTC by Sergio
Modified: 2021-08-26 18:55 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1982604
Environment:
Last Closed: 2021-07-21 13:41:26 UTC
Target Upstream Version:



Description Sergio 2021-07-15 14:48:09 UTC
+++ This bug was initially created as a clone of Bug #1982604 +++

Description of problem:
When creating a migplan for an application with PVs in an AWS OCP 3.9 cluster, the MigAnalytic sometimes fails to get the source cluster resources and reports all of them as 0. The probability of occurrence is very high.

Version-Release number of selected component (if applicable):
MTC 1.5.0
image: quay-enterprise-quay-enterprise.apps.cam-tgt-21420.qe.devcluster.openshift.com/admin/openshift-migration-rhel7-operator:v1.5.0-23
Source cluster : AWS OCP 3.9 (controller)
Target cluster: AWS OCP 4.8

How reproducible:
1.  Prepare an nginx application in 3.9 source cluster
$ ansible-playbook deploy-app.yml -e use_role=ocp-nginxpv -e namespace=ocp-24706-basicvolmig

2. Create indirect migration plan against nginx 

3. Check analytics message in migration plan 

Actual results:
The migplan reaches Ready status, but with the warning message "Failed gathering extended PV usage information for PVs [nginx-logs nginx-html]". The MigAnalytic reports all values as 0.

Expected results:
The migplan reaches Ready status without any warning or error message, and the MigAnalytic reports the actual resource values rather than 0.

Additional info:
$ oc get pvc -n ocp-24706-basicvolmig
NAME         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
nginx-html   Bound    pvc-602321ba-e51c-11eb-b736-0e2f00abf38f   1Gi        RWO            gp2            1h
nginx-logs   Bound    pvc-601ee567-e51c-11eb-b736-0e2f00abf38f   1Gi        RWO            gp2            1h

$ oc get migplan ocp-24706-basicvolmig-migplan-1626319591  -o yaml
apiVersion: migration.openshift.io/v1alpha1
kind: MigPlan
metadata:
  ……...
  name: ocp-24706-basicvolmig-migplan-1626319591
  namespace: openshift-migration
spec:
  destMigClusterRef:
    name: target-cluster
    namespace: openshift-migration
  indirectImageMigration: true
  indirectVolumeMigration: true
  migStorageRef:
    name: automatic
    namespace: openshift-migration
  namespaces:
  - ocp-24706-basicvolmig
  persistentVolumes:
  - capacity: 1Gi
    name: pvc-601ee567-e51c-11eb-b736-0e2f00abf38f
    proposedCapacity: "0"
    pvc:
      accessModes:
      - ReadWriteOnce
      hasReference: true
      name: nginx-logs
      namespace: ocp-24706-basicvolmig
    selection:
      action: copy
      copyMethod: filesystem
      storageClass: gp2
    storageClass: gp2
    supported:
      actions:
      - copy
      - move
      copyMethods:
      - filesystem
      - snapshot
  - capacity: 1Gi
    name: pvc-602321ba-e51c-11eb-b736-0e2f00abf38f
    proposedCapacity: "0"
    pvc:
      accessModes:
      - ReadWriteOnce
      hasReference: true
      name: nginx-html
      namespace: ocp-24706-basicvolmig
    selection:
      action: copy
      copyMethod: filesystem
      storageClass: gp2
    storageClass: gp2
    supported:
      actions:
      - copy
      - move
      copyMethods:
      - filesystem
      - snapshot
  srcMigClusterRef:
    name: host
    namespace: openshift-migration
status:
  conditions:
  - category: Required
    lastTransitionTime: 2021-07-15T03:26:36Z
    message: The `persistentVolumes` list has been updated with discovered PVs.
    reason: Done
    status: "True"
    type: PvsDiscovered
  - category: Required
    lastTransitionTime: 2021-07-15T03:26:36Z
    message: The storage resources have been created.
    reason: Done
    status: "True"
    type: StorageEnsured
  - category: Warn
    lastTransitionTime: 2021-07-15T04:11:44Z
    message: Failed gathering extended PV usage information for PVs [nginx-logs nginx-html],
      please see MigAnalytic openshift-migration/ocp-24706-basicvolmig-migplan-1626319591-szwd6
      for details
    reason: FailedRunningDf
    status: "True"
    type: ExtendedPVAnalysisFailed
  - category: Required
    lastTransitionTime: 2021-07-15T03:26:36Z
    message: The migration plan is ready.
    status: "True"
    type: Ready
  destStorageClasses:
  - accessModes:
    - ReadWriteOnce
    default: true
    name: gp2
    provisioner: kubernetes.io/aws-ebs
  - accessModes:
    - ReadWriteOnce
    name: gp2-csi
    provisioner: ebs.csi.aws.com
  excludedResources:
………

$ oc get miganalytic ocp-24706-basicvolmig-migplan-1626319591-szwd6  -n openshift-migration -o yaml
apiVersion: migration.openshift.io/v1alpha1
kind: MigAnalytic
metadata:
 ……...
  name: ocp-24706-basicvolmig-migplan-1626319591-szwd6
  namespace: openshift-migration
spec:
  analyzeExtendedPVCapacity: true
  analyzeImageCount: false
  analyzeK8SResources: false
  analyzePVCapacity: false
  migPlanRef:
    name: ocp-24706-basicvolmig-migplan-1626319591
    namespace: openshift-migration
status:
  analytics:
    excludedk8sResourceTotal: 0
    imageCount: 0
    imageSizeTotal: "0"
    incompatiblek8sResourceTotal: 0
    k8sResourceTotal: 0
    namespaces:
    - excludedK8SResourceTotal: 0
      imageCount: 0
      imageSizeTotal: "0"
      incompatibleK8SResourceTotal: 0
      k8sResourceTotal: 0
      namespace: ocp-24706-basicvolmig
      persistentVolumes:
      - actualCapacity: "0"
        comment: No change in PV capacity is needed.
        name: nginx-logs
        proposedCapacity: "0"
        requestedCapacity: 1Gi
      - actualCapacity: "0"
        comment: No change in PV capacity is needed.
        name: nginx-html
        proposedCapacity: "0"
        requestedCapacity: 1Gi
      pvCapacity: "0"
      pvCount: 0
    percentComplete: 100
    plan: ocp-24706-basicvolmig-migplan-1626319591
    pvCapacity: "0"
    pvCount: 0
  conditions:
  - category: Warn
    lastTransitionTime: 2021-07-15T03:26:34Z
    message: Failed gathering extended PV usage information for PVs [nginx-logs nginx-html]
    reason: FailedRunningDf
    status: "True"
    type: ExtendedPVAnalysisFailed
  - category: Required
    lastTransitionTime: 2021-07-15T03:26:34Z
    message: The analytic is ready.
    status: "True"
    type: Ready
  observedGeneration: 1

--- Additional comment from  on 2021-07-15 09:16:22 UTC ---

One similar bug: https://bugzilla.redhat.com/show_bug.cgi?id=1918504

Comment 1 Sergio 2021-07-15 14:50:37 UTC
I've seen this bug in my 3.9 environment (AWS 3.9 -> AWS 4.8). When I clicked "refresh", the analytic reported the right values. Nevertheless, the warning about the resize functionality remained, even after refreshing both the migplan and the miganalytic.

Comment 2 Jason Montleon 2021-07-15 17:19:24 UTC
This isn't so much a problem with Analytics as with the PV resize feature, which uses the restic daemonset to determine the actual disk usage of the volumes.

IIUC the failures are limited to using analytics for PV resize when migrating from older OCP releases (3.7, 3.9), and only when the application pod comes into existence after the restic daemonset was started.

restic uses a hostPath mount to peer into the volume, and bind remount does not exist on these versions, so if the application comes up after the daemonset, restic is oblivious to it.

Possible solutions might include restarting the daemonset before running the analytic (I think this would be costly performance-wise on large clusters), or creating a pod on the node to run the size check instead of using the restic daemonset, so that the checking pod always starts after the application.
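
The ordering condition described in this comment can be sketched as a small check. This is only an illustration, not MTC code: the function names, the input shapes (a per-node Restic start time and a list of application pods with `name`, `node`, and `started` fields) are hypothetical, and the timestamps are assumed to be in the same format as the `lastTransitionTime` values above.

```python
from datetime import datetime


def parse_ts(ts: str) -> datetime:
    # Timestamps are assumed to look like 2021-07-15T03:26:34Z.
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")


def pods_started_after_restic(restic_start_by_node, app_pods):
    """Return names of application pods that started after the Restic pod
    on the same node (hypothetical input shapes, see lead-in)."""
    late = []
    for pod in app_pods:
        restic_start = restic_start_by_node.get(pod["node"])
        if restic_start is None:
            continue  # no Restic start time known for this node; cannot judge
        if parse_ts(pod["started"]) > parse_ts(restic_start):
            late.append(pod["name"])
    return late
```

On a 3.7/3.9 source cluster, any pod returned by such a check would be a candidate for the `FailedRunningDf` warning, under the explanation given in this comment.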

Comment 3 John Matthews 2021-07-15 17:35:12 UTC
Pranav/Jason,

My suggestion is that we do _not_ address this with code changes. We document this as a known issue for customers running a 3.7/3.9 source cluster, and explain that they can either restart Restic or proceed without some functionality (assume they won't be able to resize PVs, maybe some loss of progress, etc.; we can explain in a doc note).

Does that sound reasonable?

Comment 5 Pranav Gaikwad 2021-07-15 17:38:36 UTC
John, 

That is reasonable. Assuming that the feature degrades gracefully and users still have a way to manually mitigate the degradation, I wouldn't consider this a blocker. I will take responsibility for documenting this in our upstream docs.

Comment 6 Pranav Gaikwad 2021-07-15 18:01:05 UTC
In my previous comment, I forgot to note an important thing. 

When PV resizing degrades gracefully (FailedRunningDf condition on MigAnalytic), it does _not_ block migrations from proceeding. The migrations still work. The only difference is that the migration cannot automatically resize the volumes in the target cluster based on volume usage, because MigAnalytic failed to collect that information. If users care about PV resizing, they need to bounce the Restic pods once for resizing to happen automatically. If they don't, they can simply proceed with PV resizing disabled.
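
A minimal sketch of how the degraded state could be detected from the `status.conditions` list shown in the description, assuming the conditions have already been loaded into Python dictionaries. The function name is hypothetical, and pulling the PVC names out of the bracketed list in `message` is an assumption based only on the message format seen in this bug.

```python
import re


def failed_pv_analysis(conditions):
    """Return the PVC names from an active ExtendedPVAnalysisFailed
    condition, or None if no such condition is set."""
    for cond in conditions:
        if (cond.get("type") == "ExtendedPVAnalysisFailed"
                and cond.get("status") == "True"):
            # Assumed message format: "... for PVs [nginx-logs nginx-html]"
            match = re.search(r"\[([^\]]*)\]", cond.get("message", ""))
            return match.group(1).split() if match else []
    return None


conditions = [
    {
        "category": "Warn",
        "message": "Failed gathering extended PV usage information "
                   "for PVs [nginx-logs nginx-html]",
        "reason": "FailedRunningDf",
        "status": "True",
        "type": "ExtendedPVAnalysisFailed",
    },
    {"category": "Required", "status": "True", "type": "Ready"},
]
affected = failed_pv_analysis(conditions)
```

A non-`None` result alongside a `Ready` condition matches the situation described here: the plan can still be migrated, but automatic resizing is unavailable for the listed PVs until the Restic pods are bounced and the analytic is refreshed.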

Comment 7 Erik Nelson 2021-07-19 17:31:26 UTC
Shifting this to a docs BZ so we can document upstream and get that info propagated downstream as you see fit, Avital. Pranav will follow with the details.

Comment 8 Avital Pinnick 2021-07-20 07:42:39 UTC
I can add this to the release notes > known issues for 1.4.6. 

Since you do not plan to address this with a code fix, I will add the bug and workaround to 1.5.0 release notes as well.

Comment 9 Avital Pinnick 2021-07-20 08:24:59 UTC
Update: I will put this in 1.5.0 release notes and not in 1.4.6 because PV resizing was introduced as a 1.5.0 feature and only appears in the documentation for that release.

Comment 13 Avital Pinnick 2021-07-21 13:41:26 UTC
Changes merged for OCP 4.8/MTC 1.5.0 RN

Comment 14 Pranav Gaikwad 2021-08-26 18:55:00 UTC
*** Bug 1982604 has been marked as a duplicate of this bug. ***