Description of problem:
The "moving workload" progress always shows 0% when starting maintenance on a host.

Version-Release number of selected component (if applicable):
4.3.0-0.nightly-2020-03-06-003558

How reproducible:
Always

Steps to Reproduce:
1. Install the Container-native virtualization Operator on an IPI BareMetal cluster
2. Go to the /BareMetalHost list page and select a host to put into maintenance
3. Click the "Starting maintenance" popover on the list page

Actual results:
The popover shows the progress of moving workloads, but it stays at 0% until the host reaches "Under maintenance".

Expected results:
The popover should show the actual progress.

Additional info:
This is caused by the node maintenance operator not reporting the pod counts frequently enough. The pod counts are updated often in the operator logs, but not in the resource status.
Looks like a UI issue, bouncing to Tomas
The UI calculates the percentage from the pod counts reported in the NodeMaintenance CR [1]. status.pendingPods is not updated frequently enough (IIRC it is just once a minute). On the other hand, the NMO log reports in detail how pods are being evicted. So it would be good if NMO updated status.pendingPods at the same frequency; that would make the maintenance progress percentage actually useful.

[1] https://github.com/openshift/console/blob/master/frontend/packages/metal3-plugin/src/selectors/node-maintenance.ts#L14
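For reference, this is roughly the arithmetic the selector at [1] does. The actual console code is TypeScript; this is only a sketch in Go, and the total-pod-count field name is an assumption (only status.pendingPods is confirmed above):

package main

import "fmt"

// NodeMaintenanceStatus models just the two fields the progress calculation needs.
type NodeMaintenanceStatus struct {
	TotalPods   int // assumed: number of pods on the node when maintenance started
	PendingPods int // status.pendingPods: pods still waiting to be evicted
}

// progressPercent returns how far the eviction has gotten according to the CR status.
func progressPercent(s NodeMaintenanceStatus) int {
	if s.TotalPods == 0 {
		return 0
	}
	return (s.TotalPods - s.PendingPods) * 100 / s.TotalPods
}

func main() {
	// Example: 10 pods total, 10 still pending -> 0% shown in the popover.
	fmt.Println(progressPercent(NodeMaintenanceStatus{TotalPods: 10, PendingPods: 10}))
}

Since pendingPods is only refreshed once per status update, the computed value sits at 0% until the first update lands, which matches what the popover shows.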
Since the work needed here is in the NMO, we'll take this one back. Sorry for the noise
Hey, there were already some code changes which should result in a more frequent update of the status. In detail, this is what happens:
- NMO calls the k8s node drain code, which logs in detail what happens with the pods. NMO has no chance to "intercept" these detailed steps for updating the CR status.
- After some timeout, NMO gives up in case not all pods were evicted yet. That timeout was already reduced from 1 minute to 30 seconds. That's when NMO updates the CR status.
- After that, NMO waits 5 seconds before triggering another drain. Repeat until done.
So at the moment we get a fresh status after at most 35 seconds. Is that good enough, or do we need an even shorter period? E.g. a drain timeout of 10 seconds + a 5 second wait = a fresh status every 15 seconds? Andrew, Jiri: thoughts?
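To make the timing concrete, here is a minimal sketch of that retry loop (not the actual NMO code; drainNode and updateStatus are hypothetical stand-ins for the k8s drain helper and the CR status write):

package main

import (
	"fmt"
	"time"
)

const (
	drainTimeout      = 30 * time.Second // current value, after the reduction from 1 min
	waitBetweenDrains = 5 * time.Second
)

// drainNode stands in for the k8s drain call; it returns the number of pods
// still pending when the timeout hits (hypothetical signature, stubbed out here).
func drainNode(timeout time.Duration) (pendingPods int) { return 0 }

// updateStatus stands in for writing status.pendingPods on the CR (hypothetical).
func updateStatus(pendingPods int) { fmt.Println("pendingPods:", pendingPods) }

func maintainNode() {
	for {
		pending := drainNode(drainTimeout) // blocks for up to drainTimeout
		updateStatus(pending)              // status refreshed once per iteration
		if pending == 0 {
			return
		}
		time.Sleep(waitBetweenDrains)
	}
}

func main() { maintainNode() }

With drainTimeout = 30s and waitBetweenDrains = 5s, the reported status can be up to 35 seconds stale; dropping the drain timeout to 10s would bound the staleness at 15 seconds per iteration.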