Bug 1458571 - [downstream clone - 4.1.3] [RFE] JSON RPC broker should pass correlation id to VDSM
Summary: [downstream clone - 4.1.3] [RFE] JSON RPC broker should pass correlation id t...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
unspecified
high
Target Milestone: ovirt-4.1.3
: ---
Assignee: Martin Perina
QA Contact: Pavel Novotny
URL:
Whiteboard:
Depends On: 1425910
Blocks:
TreeView+ depends on / blocked
 
Reported: 2017-06-04 13:50 UTC by rhev-integ
Modified: 2019-04-28 14:07 UTC (History)
24 users (show)

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Clone Of: 1425910
Environment:
Last Closed: 2017-07-06 07:32:08 UTC
oVirt Team: Infra
Target Upstream Version:
Embargoed:
lsvaty: testing_plan_complete-


Attachments (Terms of Use)
engine & VDSM log excerpts with correlation ID (36.95 KB, text/plain)
2017-06-15 17:29 UTC, Pavel Novotny
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2017:1696 0 normal SHIPPED_LIVE VDSM bug fix and enhancement update 4.1.3 2017-07-06 11:25:09 UTC

Description rhev-integ 2017-06-04 13:50:39 UTC
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1425910 +++
======================================================================

Reopening this bug, since this question comes up more and more and across the whole product. There is definitely a need for this feature to exist.

Even with all the changes and improvements in 4.1 logging system, there is no implementation for correlation id to uniquely identify a flow top down from engine to vdsm. Or even between  the parent and children commands in the engine itself.

Importance of this bug for CEE: to be able troubleshooting customers environment successfully. It is not a clean dev environment with a single flow that is very easy to find and track. Usually, there are multiple flows of the same type but on different entities happening at the same time + multiple retries of the same failed flow on a single entity, which makes it impossible to track, not inside engine.log and for sure not down from engine.log to vdsm.log.

+++ This bug was initially created as a clone of Bug #1112120 +++

--- Additional comment from Saggi Mizrahi on 2014-06-24 12:55:23 EDT ---

FYI using FlowID and correlation-id interchangeably.

Correlation was a mistake. I always said it's going to come a time where we will not support it anymore and that time has come.

Lets first start with why Correlation ID is completely useless.
The basic idea of passing one process's context to anothers' log is never a good idea.

It might seem nice at first. Being able to track calls that relate to a specific flow.

But VDSM doesn't care about engine flows. VDSM, for a lot of verbs (and many more to come) would actually try and batch jobs, Run tasks in parallel, get responses from cache and more. This means that most of the information is lost when looking at the correlation ID.

The correct way to filter is by Task ID, Domain ID, VM ID, etc... Those are context that VDSM understands and make's sure to log them whenever they are processes.

When this was originally suggested I was opposed to it and requested for a real use case for use of this information.

I was told that "it's good" but never received a concrete example.

Further more, in json-rpc *every* request has an ID and that ID can be whatever you want it to be. You could make IDs be "<FlowID>_<GUID>" you could even make it "<Flow_Name>_<Flow_ID>_<GUID>" so you have something even more descriptive.

In any case, I would still recommend ignoring FlowID when examining issues.

I have yet to see anyone use FlowID effectively, I have only seen people mistakenly use it and filter out important information.


As a side note, when event's are going to become more and more prevalent the stuff that are covered under the Task\Flow context are going to shrink. This is a good thing as Tasks are not entities in the system.

To sum up, unless I get a concrete use case with a problem that can't be easily solved without Correlation ID it's going to remain unsupported. If you don't want to argue you can put it as part of the RequestID but I strongly recommend you don't.

--- Additional comment from Andrew Cathrow on 2014-06-29 12:19:03 EDT ---

(In reply to Saggi Mizrahi from comment #1)
> FYI using FlowID and correlation-id interchangeably.
> 
> Correlation was a mistake. I always said it's going to come a time where we
> will not support it anymore and that time has come.
> 
> Lets first start with why Correlation ID is completely useless.
> The basic idea of passing one process's context to anothers' log is never a
> good idea.
> 
> It might seem nice at first. Being able to track calls that relate to a
> specific flow.
> 
> But VDSM doesn't care about engine flows. VDSM, for a lot of verbs (and many
> more to come) would actually try and batch jobs, Run tasks in parallel, get
> responses from cache and more. This means that most of the information is
> lost when looking at the correlation ID.
> 
> The correct way to filter is by Task ID, Domain ID, VM ID, etc... Those are
> context that VDSM understands and make's sure to log them whenever they are
> processes.
> 
> When this was originally suggested I was opposed to it and requested for a
> real use case for use of this information.
> 
> I was told that "it's good" but never received a concrete example.
> 
> Further more, in json-rpc *every* request has an ID and that ID can be
> whatever you want it to be. You could make IDs be "<FlowID>_<GUID>" you
> could even make it "<Flow_Name>_<Flow_ID>_<GUID>" so you have something even
> more descriptive.
> 
> In any case, I would still recommend ignoring FlowID when examining issues.
> 
> I have yet to see anyone use FlowID effectively, I have only seen people
> mistakenly use it and filter out important information.
> 
> 
> As a side note, when event's are going to become more and more prevalent the
> stuff that are covered under the Task\Flow context are going to shrink. This
> is a good thing as Tasks are not entities in the system.
> 
> To sum up, unless I get a concrete use case with a problem that can't be
> easily solved without Correlation ID it's going to remain unsupported. If
> you don't want to argue you can put it as part of the RequestID but I
> strongly recommend you don't.

Then I'd suggest you spend a few days working in support.
VDSM logs are an abomination, overly complex and make it nearly impossible for a seasoned systems administrator to troubleshoot their own problems.

Add into this complexity the lack of context in the log you've got a support nightmare.

--- Additional comment from Tomas Dosek on 2014-06-30 02:41:52 EDT ---

AFAIK so far correlation ID was not assigned to subtasks of certain flow anyway.
So when we start doing analysis the correlation ID might lead us to the "root task" in vdsm, but we have to identify child threads of this task anyway.

If we'd go with flow ID and each of the child tasks would get identify by this it would indeed make out life at support much easier.

--- Additional comment from Tomas Dosek on 2014-06-30 02:45:42 EDT ---

Just to state typical usecase here with regards how we used correlation ID.

A) Go to restored database audit_log
B) Identify failing task - get the correlation ID
C) Go to engine log - identify relevant task (this is all identified with the 
   correlation ID)
D) Go to vdsm logs - Identify relevant root call made by engine, inspect all
   child tasks - This requires knowledge of timeline and some further "manual"
   searching until the right threads are identified
E) Try to make clue in what happened of the above mentioned.

--- Additional comment from Roman Hodain on 2014-06-30 03:31:45 EDT ---

I would also like to highlight the fact that the engine DB is not always time
synced with the engine logs. So the entry in the audit log is shipped for
unknown amount of hours. This makes very complicated to find out the related
logs in environments with hundreds VMs and many RHEV admins. In many cases the
environment may suffer by more than just one issue and in that case the logs are
full of errors. There are also very often the same types of operation running
in parallel which makes complicated to identify the right one.

The correlation ID will at least tell you that this is exactly the place where 
you are supposed to start the investigation.

--- Additional comment from Anil Vettathu on 2014-06-30 04:40:14 EDT ---

1. When there are many similar event messages for multiple VMs, "correlation ID" is the only distinguishing factor to find the event log for a particular VM in the engine.log. If there is no correlation id for the events, then to find a log message is like looking for "Needle in a haystack".

2. Sometimes corelation ID is really useful, when we do remote trouble shooting session on live ennvironment. we just need to grep for the correlation ID in engine.log when an event messages comes in RHEV Manager GUI.

I think we need correlation ID in place, till you implement the new mechanism. Disabling it will make our troubleshooting really time consuming.

--- Additional comment from Udayendu Sekhar Kar on 2014-06-30 06:02:11 EDT ---

I fully agree with Roman and Anil. Till now we are experiencing lots of issues with RHEV and the corelation ID is making the analysis much easier. So we should keep the correlation ID in place, till the implementation of some other mechanism.

--- Additional comment from Tomas Dosek on 2014-07-08 01:50:11 EDT ---

Guys the feedback from GSS is quite clear correlation ID is essential to tracking things down in engine. Can we think of some way of temporary coexistence of correlation ID and flow ID to make the transition smoother?

--- Additional comment from Robert McSwain on 2014-07-08 11:55:24 EDT ---

I'm with Tomas (and Gordon in the sbr-virt email list) that any way to temporarily have a correlation ID and a flow ID at the same time would be better for those of us in support and customers alike, rather than having a period in time that we have neither.

--- Additional comment from Barak on 2014-07-13 10:23:33 EDT ---

Guys,

Do you use the correlation ID in the vdsm logs ?
Looking in the use case above noone mentioned the vdsm log (only engine log).

If you do - please specify the how you use it.

Saggi's comment was about vdsm side not the engine.

--- Additional comment from Saggi Mizrahi on 2014-07-13 11:53:42 EDT ---

In general, I think the thing I have to stress is that unlike the XML-RPC interface in json RPC every request has it's own request ID.
This request ID is selected by the engine. This ID is recorder in both sides.

Currently it's just a GUID. That means that by grepping the flowID in the engine log you should be able to see all of the request IDs and grep for them in VDSM.

Also, since the engine chooses that ID it could be whatever is preferred.
One possible choice is having it be "<flowID>.<incremental integer>" so every call from a flow has an ID that is directly related to the flow in the engine.

That is completely unrelated to the storage TaskIDs which are currently selected by VDSM.

To sum up, if just having a GUID for every request logged in the engine log and VDSM log is enough that is already available. If you require the flowID to be mangled as part of the request ID it is possible and would require some coding.

--- Additional comment from Roman Hodain on 2014-07-14 04:54:03 EDT ---

(In reply to Barak from comment #10)
> Guys,
> 
> Do you use the correlation ID in the vdsm logs ?
> Looking in the use case above noone mentioned the vdsm log (only engine log).
> 
> If you do - please specify the how you use it.
> 
> Saggi's comment was about vdsm side not the engine.

I use it only only in the engine logs so it the correlation Id is going to be removed only from the vdsm logs then I am OK with that.

--- Additional comment from Tomas Dosek on 2014-07-16 13:53:33 EDT ---

Ok Saggi, thanks for making it clear. Yes, currently none of us uses correlation ID in vdsm as it has almost not practical use. Flow ID which would be able to provide direct link in vdsm events makes indeed more sense.

--- Additional comment from Barak on 2015-07-06 08:30:47 EDT ---

we should aim to have a flowid in the vdsm log,
what is the ID to be used is the stuff there is no agreement on.
Let's redefine the problem we are trying to solve and continue from there.

(Originally by Marina Kalinin)

Comment 3 rhev-integ 2017-06-04 13:51:00 UTC
Please see dependent bug which was already fixed.

(Originally by Piotr Kliczewski)

Comment 4 rhev-integ 2017-06-09 09:58:03 UTC
INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[No external trackers attached]

For more info please contact: rhv-devops

Comment 5 Martin Perina 2017-06-09 11:51:28 UTC
(In reply to rhev-integ from comment #4)
> INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following
> reason:
> 
> [No external trackers attached]
> 
> For more info please contact: rhv-devops

Every work has been done on upstream BZ1112120, we only forgot that there's also created separate bug for this on downstream.

Comment 7 Pavel Novotny 2017-06-15 16:09:33 UTC
Verified in 
rhevm-4.1.3.2-0.1.el7.noarch
vdsm-4.19.18-1.el7ev.x86_64

The flow_id in a main VDSM task is now set to the correlation ID returned by engine.

Example: 
VM migration task. The correlation ID here is 352b4c28-b2b7-402a-962f-0718fbe31bb8.

engine.log:
2017-06-13 17:56:20,674+02 INFO  [org.ovirt.engine.core.bll.MigrateVmToServerCommand] (default task-2) [352b4c28-b2b7-402a-962f-0718fbe31bb8] Lock Acquired to object...
2017-06-13 17:56:20,806+02 INFO  [org.ovirt.engine.core.bll.MigrateVmToServerCommand] (org.ovirt.thread.pool-6-thread-27) [352b4c28-b2b7-402a-962f-0718fbe31bb8] Running command: MigrateVmToServerCommand...

vdsm.log:
2017-06-13 17:56:20,887+0200 INFO  (jsonrpc/4) [vdsm.api] START migrate(params={...})  from=::ffff:engine.example.com,45292, flow_id=352b4c28-b2b7-402a-962f-0718fbe31bb8 (api:46)
2017-06-13 17:56:20,894+0200 INFO  (jsonrpc/4) [vdsm.api] FINISH migrate return={...} from=::ffff:engine.example.com,45292, flow_id=352b4c28-b2b7-402a-962f-0718fbe31bb8 (api:52)


See the attachment with full excerpts from engine and VDSM logs.

Comment 8 Pavel Novotny 2017-06-15 17:29:42 UTC
Created attachment 1288141 [details]
engine & VDSM log excerpts with correlation ID

Comment 10 errata-xmlrpc 2017-07-06 07:32:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1696


Note You need to log in before you can comment on or make changes to this bug.