Bug 2126413
| Summary: | Make poll interval metrics better: Take into account the preempted time | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux Fast Datapath | Reporter: | Surya Seetharaman <surya> |
| Component: | OVN | Assignee: | OVN Team <ovnteam> |
| Status: | NEW --- | QA Contact: | Jianlin Shi <jishi> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | FDP 22.L | CC: | ctrautma, jiji, mmichels |
| Target Milestone: | --- | Flags: | mmichels: needinfo? (surya) |
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | Type: | Bug | |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
Surya Seetharaman
2022-09-13 11:56:49 UTC
There are a few things to take in here.

First, a couple of different sources of timing data are being referred to. The "Unreasonably long poll interval" messages come from a core OVS library. However, stopwatches are also mentioned. Stopwatches are implemented within an OVS library, and can be created at will by any component that is interested in timing a certain section of code.

Next, there's the breakdown of time. The system and user times in the "Unreasonably long poll interval" messages are determined by calling the getrusage() library function, which returns general resource usage information. However, getrusage() alone provides no explicit way to determine the preempted time. You could take the wall-clock time and subtract the user and system times to get a hint of the preempted time. If that's good enough, then the log scraper can take on that role, with no change required in OVS.

Stopwatches don't print their results after every sample is taken, but rather accumulate trends over time. We could also use getrusage() within the stopwatch, I suppose, and keep track of the trends of system and user time there, too. I'm not sure that accumulating statistics about these times is as useful, though. The reason is that a running average (or 95th percentile, etc.) does not help correlate a specific long sample with its corresponding system and user time. However, if you don't mind getting statistics about trends rather than being able to see individual samples, then it can be added.

So, with those items under consideration, I want to try to determine what the use case is for this. If it's just the "Unreasonably long poll interval" messages that you want to determine preempted time from, then I think you can do some arithmetic with no need to modify any code.
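As a rough illustration of that arithmetic, here is a minimal log-scraper sketch. The message layout in the regular expression ("Unreasonably long Nms poll interval (Nms user, Nms system)") is an assumption about what the log line looks like, and the function name is hypothetical. The result is only a hint: the remainder also includes time spent sleeping or blocked on I/O, not just time spent preempted.

```python
import re

# Assumed message shape (adjust to the actual log line), e.g.:
#   "Unreasonably long 2500ms poll interval (1200ms user, 300ms system)"
POLL_RE = re.compile(
    r"Unreasonably long (?P<wall>\d+)ms poll interval "
    r"\((?P<user>\d+)ms user, (?P<system>\d+)ms system\)"
)


def estimate_off_cpu_ms(line):
    """Return wall - (user + system) in ms, or None if the line does not match.

    Hypothetical helper: the result is a hint for preempted time, nothing more.
    """
    m = POLL_RE.search(line)
    if m is None:
        return None
    wall = int(m.group("wall"))
    user = int(m.group("user"))
    system = int(m.group("system"))
    # Clamp at zero in case the reported CPU time exceeds the wall-clock time
    # (possible if the CPU figures cover more than one thread).
    return max(0, wall - user - system)


if __name__ == "__main__":
    sample = "Unreasonably long 2500ms poll interval (1200ms user, 300ms system)"
    print(estimate_off_cpu_ms(sample))  # 1000
```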
If you want stopwatches to give a user+system breakdown in addition to their current wall clock times, then that will require a change within OVS. What action should we take?
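To make the second option more concrete, here is a minimal sketch of what a per-sample user+system breakdown could look like. It is written in Python with the standard `resource` module purely for illustration; it is not the OVS stopwatch API, and `timed_section` and the dictionary keys are hypothetical names. Unlike a trend-only stopwatch, it keeps each sample, so a single long interval can be matched with its own user and system times.

```python
import resource
import time
from contextlib import contextmanager


@contextmanager
def timed_section(samples):
    """Time a block of code and append one breakdown (in ms) to `samples`.

    Hypothetical helper, not part of OVS. Unix only (uses getrusage).
    """
    ru0 = resource.getrusage(resource.RUSAGE_SELF)
    t0 = time.monotonic()
    try:
        yield
    finally:
        t1 = time.monotonic()
        ru1 = resource.getrusage(resource.RUSAGE_SELF)
        wall = (t1 - t0) * 1000.0
        user = (ru1.ru_utime - ru0.ru_utime) * 1000.0
        system = (ru1.ru_stime - ru0.ru_stime) * 1000.0
        samples.append({
            "wall_ms": wall,
            "user_ms": user,
            "system_ms": system,
            # Remainder: preempted, sleeping, or blocked time (a hint only).
            "other_ms": max(0.0, wall - user - system),
        })


if __name__ == "__main__":
    samples = []
    with timed_section(samples):
        time.sleep(0.05)  # mostly off-CPU, so "other_ms" should dominate
    print(samples[0])
```

Note that `RUSAGE_SELF` covers the whole process; whether a per-thread figure would be more appropriate depends on the component being timed.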