Description of problem:
Percent CPU frequently goes above 100%, often reaching values far beyond 1000000%. This typically happens within a very short time interval and shows up as a spike in the graph.

Steps to Reproduce:
View the STF cloud dashboards after a couple of days of monitoring a cloud; the spikes are visible there.
I consider the dashboard adjustment in GH pull #40 a workaround that mitigates the effects of this bug in the dashboards. A patch to libpod-stats is still needed.
To reproduce this bug locally:
1. Launch collectd with the libpod-stats plugin loaded. Make sure collectd is writing to a data store such as Prometheus so that the metrics can be graphed.
2. Start another container that was not previously running.
3. The CPU percentage calculated for the container from step 2 will spike.
After further evaluation, the above process does not necessarily reproduce the bug. Rather, it demonstrates usage above 100% when multiple cores are working hard at once, in which case >100% is expected behavior. The real bug is suspected to come from the use of unsigned integers to calculate the difference between CPU utilization at two points in time. If a counter resets, the later reading is smaller than the earlier one, the subtraction underflows, and the numerator wraps around to an unsigned int with the high-order bits set, i.e. a huge value. This has been fixed upstream in the attached PR.
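
To illustrate the suspected mechanism, here is a minimal sketch. It is not the libpod-stats code: the function names, the nanosecond CPU-time counter, and the percentage formula (CPU time consumed divided by elapsed wall-clock time) are all assumptions made for illustration. Python integers do not overflow, so the sketch masks results to 64 bits to emulate unsigned arithmetic.

```python
# Hypothetical illustration only; names and formula are assumptions,
# not taken from the libpod-stats plugin.

U64_MASK = (1 << 64) - 1  # emulate 64-bit unsigned wraparound


def cpu_percent_unsigned(prev_cpu_ns, cur_cpu_ns, elapsed_ns):
    """Naive version: the CPU-time delta is computed as a wrapped uint64."""
    if elapsed_ns <= 0:
        return 0.0
    cpu_delta = (cur_cpu_ns - prev_cpu_ns) & U64_MASK
    return cpu_delta / elapsed_ns * 100.0


def cpu_percent_guarded(prev_cpu_ns, cur_cpu_ns, elapsed_ns):
    """Guarded version: returns None when the counter went backwards
    (for example after a reset), so the sample can be skipped."""
    if elapsed_ns <= 0 or cur_cpu_ns < prev_cpu_ns:
        return None
    return (cur_cpu_ns - prev_cpu_ns) / elapsed_ns * 100.0


if __name__ == "__main__":
    one_second = 1_000_000_000

    # Two cores fully busy for one second: above 100% is legitimate here.
    print(cpu_percent_unsigned(0, 2 * one_second, one_second))

    # Counter reset: the current reading is lower than the previous one,
    # so the unsigned delta wraps to a value close to 2**64.
    print(cpu_percent_unsigned(5 * one_second, 1 * one_second, one_second))

    # The guarded version reports no value for the same sample.
    print(cpu_percent_guarded(5 * one_second, 1 * one_second, one_second))
```

Skipping the sample on a backwards counter is only one possible policy; the upstream fix may handle the reset differently.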
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Release of components for Red Hat OpenStack Platform 16.2.2), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:1001