Bug 885099

Summary:	error_is_set() conditionals should not use Error * parameters
Product:	Red Hat Enterprise Linux 7	Reporter:	Markus Armbruster <armbru>
Component:	qemu-kvm	Assignee:	Cole Robinson <crobinso>
Status:	CLOSED WONTFIX	QA Contact:	Virtualization Bugs <virt-bugs>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	7.0	CC:	acathrow, armbru, bsarathy, crobinso, hhuang, juzhang, lcapitulino, mkenneth, virt-maint
Target Milestone:	rc
Target Release:	7.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2014-07-29 19:45:33 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Markus Armbruster 2012-12-07 14:12:46 UTC

Description of problem:
Incorrect usage pattern:

void bad(Error *errp)
{
    do_something(&errp);
    if (error_is_set(errp)) {
	return;
    }
    do_more();
}

where do_something() uses error_set() to put an error object into errp when it fails.  error_set() does nothing when its Error * argument is null. error_is_set() returns false when its Error * argument is null. Works as designed.

bad() breaks when a caller passes

1. a null argument: error_set() does nothing, error_is_set() returns false, and bad() calls do_more() even after do_something() fails.

2. an argument for that error_is_set() returns true: bad() doesn't call do_more() even after do_something() succeeds().

2 is usually a bad idea, but 1 can be perfectly legitimate.

Correct usage pattern:

void good(Error *errp)
{
    Error *local_error = NULL;

    do_something(&local_error);
    if (local_err) {
        error_propagate(errp, local_err);
	return;
    }
    do_more();
}

Coverity flagged an instance of the incorrect pattern in qmp_input_type_number(), as RESOURCE_LEAK.

Version-Release number of selected component (if applicable):
qemu-kvm-0.12.1.2-2.340.el6

How reproducible:
Not tried; bug found by static analysis

Steps to Reproduce:
Non known

Comment 1 Markus Armbruster 2012-12-07 14:14:34 UTC

We should search for the incorrect pattern upstream first.  Perhaps I can trick Coverity into flagging it; will report back if I can.

Comment 3 Luiz Capitulino 2012-12-07 17:23:07 UTC

Markus, I think you meant qmp_input_type_enum().

While the bug exists, it can't be triggered in practice in 6.4 tree as far as I could check it (it was a bit tough to check though, so I wouldn't complain if someone do a second pass).

Moving to 6.5.

Comment 4 Markus Armbruster 2013-01-09 13:46:10 UTC

Yes, I meant qmp_input_type_enum(), sorry for the typo.

Comment 6 Luiz Capitulino 2013-04-04 15:45:50 UTC

This bz is about two things:

1. A bad pattern in how we handle the error object that could cause functions to do unpredictable things

2. Coverity spotted qmp_input_type_enum() as a resource leak, because qmp_input_type_enum() uses the bad pattern from item 1

The good news is: the bad pattern is only triggered when the caller doesn't pass an error object (that is, the error object is NULL). This is valid, but the bad pattern doesn't handle this correctly.

I've checked users of qmp_input_type_enum() in RHEL6.5 and they all seem to pass a non-NULL error object, which means RHEL6.5 won't trigger this bug. I also checked other functions with the same problem and it *seems* that the qapi always pass non-NULL error objects so I think RHEL6 is fine.

For this reason I'm moving this to RHEL7.0 so that we can do right thing there.

Comment 10 Cole Robinson 2014-03-11 15:03:48 UTC

Has anyone considered using thread local storage to track a global 'last reported error' object? Saves having to manually pass around error objects and would avoid pitfalls like the bad pattern mentioned above. Libvirt uses this to great effect.

Of course that would be a larger project...

Comment 11 Laszlo Ersek 2014-03-11 17:46:05 UTC

I don't know much of the history of error propagation in qemu. But we already have a mixture (== unfinished conversion) of printf(), error_report(), and explicit propagation of Error objects, so we probably shouldn't start another parallel infrastructure :)

Comment 12 Cole Robinson 2014-03-11 18:01:45 UTC

(In reply to Laszlo Ersek from comment #11)
> I don't know much of the history of error propagation in qemu. But we
> already have a mixture (== unfinished conversion) of printf(),
> error_report(), and explicit propagation of Error objects, so we probably
> shouldn't start another parallel infrastructure :)

Changing function signatures to pass Error objects all the way down would cause rediculous churn though, and that's what would be presently needed to have qerror_report fully replace error_report. If we switched to using TLS stored error objects, you could convert error_report directly to qerror without touching function signatures. So it would really be an implementation detail that would ease the ongoing conversions... at least how I envision it in my head :)

Comment 13 Markus Armbruster 2014-03-12 11:09:04 UTC

This bug is about a specific misuse of the error_set() API.  We
strayed into more general discussion of possible error APIs in QEMU in
comment#10..#12.  This isn't the best place for such discussions, but
I'm going to add some background information now, to keep the record
straight.

Re comment#12: I think you mean error_set(), not qerror_report().

Conversion to error_set() involves changing function signatures, as
you described.

qerror_report() doesn't.  It sets a monitor-local last error.  This is
similar, but not the same as a thread-local last error.

We adopted qerror_report() so we don't have to pass error objects
around, precisely because passing them around causes "rediculous
churn".

My memory gets hazy here (perhaps Luiz can correct me if I talk
nonsense), but I think we made the error monitor-local, not
thread-local, because all monitors share the same thread, along with
much other code.  It's a state machine.  With a thread-local last
error, you need to take care to save and restore the last error as you
switch between unrelated states.  If I remember correctly (big if by
now), Anthony declared that impractical in QEMU.  I didn't fully
understand his arguments then, and time hasn't exactly improved my
understanding.

In my (limited!) experience, qerror_report() worked well enough for
its intended purpose, namely to help with converting existing HMP
commands to QMP.  However, it could not satisfy the need for a general
error reporting API less primitive than errno.  I'm afraid I can't
further explain that need to you, as I wasn't personally involved in
things with that need, at least not deeply enough to pass judgement on
whether the need is real.  All I can say is this is how I remember us
creating infrastructure for passing around error objects after having
rejected it first.

qerror_report() is now a transitional interface to help with
converting existing HMP commands to QMP.  Its uses elsewhere are
probably all mistakes.  "Transitional" means it should go away
completely in due time.

Regarding passing "Error objects all the way down": don't.  When a
simple error reporting mechanism such as returning -errno is fine,
don't complicate things with an Error object just for the heck of it.
Use common sense, and KISS.

Is error_set() the best possible solution for QEMU's error reporting
needs?  I don't claim to know.  But it's the solution that was picked,
and I don't wish to reevaluate it from first principles now (and
refight all the battles).

Hope this helps :)

Comment 14 Luiz Capitulino 2014-03-12 13:25:49 UTC

(In reply to Markus Armbruster from comment #13)

> My memory gets hazy here (perhaps Luiz can correct me if I talk
> nonsense), but I think we made the error monitor-local, not
> thread-local, because all monitors share the same thread, along with
> much other code.  It's a state machine.  With a thread-local last
> error, you need to take care to save and restore the last error as you
> switch between unrelated states.  If I remember correctly (big if by
> now), Anthony declared that impractical in QEMU.  I didn't fully
> understand his arguments then, and time hasn't exactly improved my
> understanding.

Correct.

The reason number one against any global error is that it prevents us from having parallel command execution support. Today, it's unclear whether we really need this feature, but back then we wanted to have it and the global error would block us.

Another (arguable) problem with qerror_report() is that it does too much... It mixes error setting with error reporting and that entirely depends on the context it's being called.

The new Error API is more generic and solves all those problems. Also, as Markus more or less puts it, I think we can say that there was general agreement that the Error API is a good solution for error reporting in QEMU (while there was wildly discussion against qerror_report()).

I wouldn't propose changing directions here, especially because most of the churn has already been done.

Comment 15 Cole Robinson 2014-03-12 13:42:53 UTC

Thanks guys for all the info. Sounds like qerror_report already does more or less what I was suggesting, but the explicit Error passing is the preferred way upstream.

Comment 17 Markus Armbruster 2014-07-14 08:30:15 UTC

I addressed the problem upstream by eliminating error_is_set()
entirely, in commit d2e064a.

In my elimination work, I examined all the bad uses of error_is_set(),
and deemed them harmless in practice.  Doesn't mean all downstream
uses must be equally harmless.  It's probable, though.

The work was worthwhile upstream, because it made the code simpler,
more robust, more consistent, got rid of a number of Coverity false
positives, and removed the temptation to misuse error_is_set(),
possibly in not so harmless ways.

The case for backporting it is weaker.  It would certainly be an
improvement, but is it worth the cost?  Perhaps if the backport is
trivial.  Some 40 commits to backport, not counting possible
prerequisite commits.

Comment 18 Luiz Capitulino 2014-07-14 13:33:29 UTC

My only pros for the backport is to avoid future conflicts.

Comment 19 Cole Robinson 2014-07-29 19:17:13 UTC

This bug is assigned to me, but given that I have very little experience with RHEL qemu I probably shouldn't be making the decision on whether to backport these patches.

Comment 20 Markus Armbruster 2014-07-29 19:45:33 UTC

Let's not backport anything for now.  We can revisit the decision (and reopen the bug) if the conflicts around error_is_set() bother us too much.