Bug 2126413

Summary:	Make poll interval metrics better: Take into account the preempted time
Product:	Red Hat Enterprise Linux Fast Datapath	Reporter:	Surya Seetharaman <surya>
Component:	OVN	Assignee:	OVN Team <ovnteam>
Status:	CLOSED WONTFIX	QA Contact:	Jianlin Shi <jishi>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	FDP 22.L	CC:	ctrautma, jiji, mmichels
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2024-02-14 21:14:59 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Surya Seetharaman 2022-09-13 11:56:49 UTC

Description of problem:

On the OCP side, we often use the stopwatch metrics from OVN to measure the load on OVN components. For example, the ovn_controller_flow_generation_maximum metric reports the longest poll on ovn-controller in the past x time interval.

2022-08-01T18:09:50.987Z|00867|timeval|WARN|Unreasonably long 39806ms poll interval (6ms user, 1ms system)

Ilya, dcbw, and Mark mentioned that if user+system != total poll time, then either something else preempted the process or the component didn’t actually consume that amount of time.
Does this mean that when we expose that value as a metric, we aren’t being accurate? Note that we are propagating these metrics to SD clusters and to customers who want to measure load on their clusters, so are we doing the right thing here?

[dcbw] Should we have an "OVN/OVS pre-empted time" metric that we could alert on if system+user < 90% (or whatever) of poll interval time?
Use the user+system time; ovnk can’t scrape logs, so can ovn send this info as well, in a form that we could use?
[numan] log is coming from ovs code I think, we could do something there
[mmichelson] mostly the polls are accurate and it's not really an os issue, so we should still do it the current way
[ilya] we are seeing some escalations around ovs load related to these polls; adrian is trying to make these poll infos more user friendly
ACTION: Surya will open a bug with this info against OVN component
suggestion: expose poll interval times via appctl somehow
[dcbw] the logic is in OVS itself though; and we want it in ovsdb-server too, so we should have an OVS bug that the OVN one depends on

The above is pasted from the meeting minutes where this was discussed. It's possible that, as Dan said, we need a dependent bug in OVS whose result OVN then consumes. I will leave it to the OVN team to decide on that.

Comment 1 Mark Michelson 2022-09-28 18:03:47 UTC

There are a few things to take in here.

First, there are a couple of sources of times being referred to here. The "Unreasonably long poll interval" messages are from a core OVS library. However, stopwatches are also mentioned. Stopwatches are implemented within an OVS library, and can be created at will by any component that is interested in timing a certain section of code.

Next, there's the breakdown of time. The system and user times in the "Unreasonably long poll interval" messages are determined by calling getrusage(), the library function that returns general resource usage information. However, getrusage() offers no explicit way to determine the preempted time. It's possible that you could take the wall-clock time and subtract the user and system times to get a hint of the preempted time. If that's good enough, then that can be a role that the log scraper takes on rather than requiring a change in OVS.
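As a rough illustration of that arithmetic, here is a minimal Python sketch of what a log scraper could do with the message quoted in the description. The regular expression is written against that one sample line only, and the function name off_cpu_estimate is made up for this example. Note that the result is only a hint: time the process spent sleeping or blocked on I/O counts the same as time spent preempted.

```python
import re

# Matches the warning text as it appears in the log line quoted in this bug.
POLL_RE = re.compile(
    r"Unreasonably long (?P<wall>\d+)ms poll interval "
    r"\((?P<user>\d+)ms user, (?P<system>\d+)ms system\)"
)


def off_cpu_estimate(line):
    """Return (wall_ms, cpu_ms, off_cpu_ms, cpu_fraction), or None if no match.

    off_cpu_ms is wall - (user + system). It is a hint only: it cannot
    tell preemption apart from sleeping or blocking.
    """
    m = POLL_RE.search(line)
    if m is None:
        return None
    wall = int(m.group("wall"))
    cpu = int(m.group("user")) + int(m.group("system"))
    # Clamp at zero: CPU time summed over threads can exceed wall time.
    off_cpu = max(wall - cpu, 0)
    cpu_fraction = cpu / wall if wall else 0.0
    return wall, cpu, off_cpu, cpu_fraction


line = ("2022-08-01T18:09:50.987Z|00867|timeval|WARN|"
        "Unreasonably long 39806ms poll interval (6ms user, 1ms system)")
wall, cpu, off_cpu, cpu_fraction = off_cpu_estimate(line)
print(f"wall={wall}ms cpu={cpu}ms off_cpu={off_cpu}ms "
      f"cpu_fraction={cpu_fraction:.4f}")
```

A consumer could compare cpu_fraction against a threshold (the 90% figure floated in the meeting minutes, for instance) to decide whether a long poll reflects real work by the component.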

Stopwatches don't print their results after every sample is taken, but rather accumulate trends over time. We could also use getrusage() within the stopwatch, I suppose, and we could keep track of the trends of system and user time there, too. I'm not sure if accumulating statistics about these times is as useful. The reason is that seeing a running average (or 95th percentile, etc.) is not useful for correlating a specific individual long sample with its corresponding system and user time. However, if you don't mind getting statistics about trends rather than being able to see individual samples, then it can be added.
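To make the idea concrete, here is a small Python sketch of a timer that samples getrusage() alongside the wall clock and keeps only aggregate statistics. It is not the OVS stopwatch API; the class CpuAwareStopwatch and its methods are hypothetical, and it relies on Python's resource module, which is available on Unix-like systems only.

```python
import resource
import time


class CpuAwareStopwatch:
    """Hypothetical illustration only; not the OVS stopwatch interface."""

    def __init__(self):
        self.samples = 0
        self.max_wall = 0.0
        self.total_wall = 0.0
        self.total_user = 0.0
        self.total_system = 0.0

    def start(self):
        # Record the wall clock and the process CPU counters together.
        self._wall0 = time.monotonic()
        self._ru0 = resource.getrusage(resource.RUSAGE_SELF)

    def stop(self):
        wall = time.monotonic() - self._wall0
        ru1 = resource.getrusage(resource.RUSAGE_SELF)
        self.samples += 1
        self.total_wall += wall
        self.total_user += ru1.ru_utime - self._ru0.ru_utime
        self.total_system += ru1.ru_stime - self._ru0.ru_stime
        self.max_wall = max(self.max_wall, wall)

    def averages(self):
        """Return (avg_wall, avg_user, avg_system) in seconds."""
        n = max(self.samples, 1)
        return (self.total_wall / n,
                self.total_user / n,
                self.total_system / n)


sw = CpuAwareStopwatch()
for _ in range(3):
    sw.start()
    time.sleep(0.05)  # sleeping uses wall time but almost no CPU time
    sw.stop()

avg_wall, avg_user, avg_system = sw.averages()
print(f"samples={sw.samples} avg_wall={avg_wall:.3f}s "
      f"avg_user={avg_user:.3f}s avg_system={avg_system:.3f}s")
```

Because only totals and a maximum are kept, the averages show that wall time exceeds CPU time overall, but they cannot say which individual sample was the slow one, which is the limitation described above.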

So, with those items under consideration, I want to try to determine what the use case is for this. If it's just the "Unreasonably long poll interval" messages that you want to try to determine preempted time from, then I think you can do some arithmetic with no need to modify any code. If you want stopwatches to give a user+system breakdown in addition to their current wall clock times, then that will require a change within OVS.

What action should we take?

Comment 2 OVN Bot 2024-02-14 21:14:57 UTC

This issue is being closed as an automatic process due to the issue's age. If you wish to re-open this issue, please do so in Jira (https://issues.redhat.com) in the 'FDP' project. Please be sure to set the component to the latest OVN version where this issue is known to occur. If this is a feature request or improvement, please set the component to 'OVN'.

Comment 3 Red Hat Bugzilla 2024-06-14 04:25:07 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days