Bug 1812354
| Field | Value | Field | Value |
|---|---|---|---|
| Summary | NMO should update .status.pendingPods more frequently | | |
| Product | OpenShift Container Platform | Reporter | shahan <hasha> |
| Component | Node Maintenance Operator | Assignee | Marc Sluiter <msluiter> |
| Status | CLOSED CURRENTRELEASE | QA Contact | Omri Hochman <ohochman> |
| Severity | low | Docs Contact | |
| Priority | medium | | |
| Version | 4.3.z | CC | abeekhof, aos-bugs, jokerman, jtomasek |
| Target Milestone | --- | Keywords | Triaged |
| Target Release | 4.6.0 | Flags | jtomasek: needinfo-, abeekhof: needinfo? |
| Hardware | Unspecified | | |
| OS | Unspecified | | |
| Whiteboard | | | |
| Fixed In Version | | Doc Type | If docs needed, set a value |
| Doc Text | | Story Points | --- |
| Clone Of | | Environment | |
| Last Closed | 2020-08-26 12:19:34 UTC | Type | Bug |
| Regression | --- | Mount Type | --- |
| Documentation | --- | CRM | |
| Verified Versions | | Category | --- |
| oVirt Team | --- | RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- | Target Upstream Version | |
| Embargoed | | | |
Description
shahan 2020-03-11 05:56:20 UTC
This is caused by the Node Maintenance Operator not reporting the pod counts frequently enough. The pod counts are updated often in the operator logs, but not in the resource status. Looks like a UI issue, bouncing to Tomas.

The UI calculates the percentage from the pod counts reported in the NodeMaintenance CR [1]. status.pendingPods is not being updated frequently enough (IIRC it is just once a minute); on the other hand, the NMO log reports in detail how pods are being evicted. So it would be good if NMO updated status.pendingPods at the same frequency. That would make the maintenance progress percentage actually useful.

[1] https://github.com/openshift/console/blob/master/frontend/packages/metal3-plugin/src/selectors/node-maintenance.ts#L14

Since the work needed here is in the NMO, we'll take this one back. Sorry for the noise.

Hey, there were already some code changes which should result in more frequent status updates. In detail, this is what happens:

- NMO calls the k8s node drain code, which logs in detail what happens with the pods. There is no chance for NMO to "intercept" these detailed steps for updating the CR status.
- After some timeout, NMO gives up in case not all pods were evicted yet. That timeout was already reduced from 1 minute to 30 seconds. That's when NMO updates the CR status.
- After that, NMO waits 5 seconds before triggering another drain. Repeat until done.

So at the moment we get a fresh status after at most 35 seconds. Is that good enough, or do we need an even shorter period? E.g. set the drain timeout to 10 seconds + wait 5 seconds = fresh status every 15 seconds? Andrew, Jiri: thoughts?
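To make the timing described above concrete, here is a minimal sketch in Go of a drain/status-update loop. It is not the actual NMO source; the names drainNode, updateStatusPendingPods, drainTimeout, and waitBetweenDrains are hypothetical. It only models the behaviour from the list: a bounded drain attempt, one status update per attempt, and a short pause before retrying.

```go
// Minimal sketch (not the actual NMO code) of the drain loop described above.
// All identifiers here are illustrative placeholders.
package drainloop

import (
	"context"
	"log"
	"time"
)

const (
	drainTimeout      = 30 * time.Second // a single drain attempt is abandoned after this
	waitBetweenDrains = 5 * time.Second  // pause before triggering the next drain attempt
)

// drainNode stands in for the Kubernetes drain helper; it returns the pods
// that are still pending eviction when the attempt ends.
func drainNode(ctx context.Context, node string) ([]string, error) {
	// ... call the drain library with ctx carrying the timeout ...
	return nil, nil
}

// updateStatusPendingPods stands in for patching .status.pendingPods on the
// NodeMaintenance CR.
func updateStatusPendingPods(node string, pending []string) error {
	// ... update the CR status subresource ...
	return nil
}

// maintainNode keeps draining until no pods are pending. The CR status is
// refreshed only once per iteration, i.e. at most every
// drainTimeout + waitBetweenDrains (about 35 seconds with the values above).
func maintainNode(node string) error {
	for {
		ctx, cancel := context.WithTimeout(context.Background(), drainTimeout)
		pending, err := drainNode(ctx, node)
		cancel()
		if err != nil {
			log.Printf("drain attempt for node %s did not finish: %v", node, err)
		}

		if updateErr := updateStatusPendingPods(node, pending); updateErr != nil {
			return updateErr
		}
		if err == nil && len(pending) == 0 {
			return nil // all pods evicted, drain complete
		}
		time.Sleep(waitBetweenDrains)
	}
}
```

With the shorter timeout floated at the end of the comment (a 10-second drain timeout plus the 5-second wait), the same loop shape would refresh the status roughly every 15 seconds.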