Bug 1253407 - [abrt] [beaker] [master] infinite loop in agent_delete_cb()
[abrt] [beaker] [master] infinite loop in agent_delete_cb()
Status: CLOSED UPSTREAM
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: NetworkManager (Show other bugs)
7.3
x86_64 Unspecified
medium Severity medium
: rc
: ---
Assigned To: Thomas Haller
Desktop QE
abrt_hash:e9b721edf5312daeb1f4bf8251a...
:
: 1255455 (view as bug list)
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2015-08-13 12:16 EDT by Lubomir Rintel
Modified: 2016-03-29 10:25 EDT (History)
7 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-09-03 07:37:32 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
File: backtrace (20.17 KB, text/plain)
2015-08-13 12:16 EDT, Lubomir Rintel
no flags Details
File: cgroup (199 bytes, text/plain)
2015-08-13 12:16 EDT, Lubomir Rintel
no flags Details
File: core_backtrace (7.88 KB, text/plain)
2015-08-13 12:16 EDT, Lubomir Rintel
no flags Details
File: dso_list (5.81 KB, text/plain)
2015-08-13 12:16 EDT, Lubomir Rintel
no flags Details
File: environ (72 bytes, text/plain)
2015-08-13 12:16 EDT, Lubomir Rintel
no flags Details
File: limits (1.29 KB, text/plain)
2015-08-13 12:16 EDT, Lubomir Rintel
no flags Details
File: machineid (135 bytes, text/plain)
2015-08-13 12:16 EDT, Lubomir Rintel
no flags Details
File: maps (26.79 KB, text/plain)
2015-08-13 12:16 EDT, Lubomir Rintel
no flags Details
File: open_fds (22.73 KB, text/plain)
2015-08-13 12:16 EDT, Lubomir Rintel
no flags Details
File: proc_pid_status (1.04 KB, text/plain)
2015-08-13 12:16 EDT, Lubomir Rintel
no flags Details
File: var_log_messages (43.11 KB, text/plain)
2015-08-13 12:16 EDT, Lubomir Rintel
no flags Details
[PATCH] fix for an incorrect hash table handling (2.00 KB, patch)
2015-08-14 04:43 EDT, Jirka Klimes
no flags Details | Diff
[patch] rewrite handling async calls in NMSecretAgent (23.46 KB, patch)
2015-08-14 09:43 EDT, Thomas Haller
no flags Details | Diff

  None (edit)
Description Lubomir Rintel 2015-08-13 12:16:14 EDT
Description of problem:
Beaker run: https://beaker.engineering.redhat.com/jobs/1048794

NetworkManager ends up in an infinite loop in g_hash_table_remove().

Version-Release number of selected component:
NetworkManager-1.1.0-12796.8810036895.el7.centos

Additional info:
reporter:       libreport-2.1.11
backtrace_rating: 4
cmdline:        /usr/sbin/NetworkManager --no-daemon
crash_function: g_hash_table_lookup_node
executable:     /usr/sbin/NetworkManager
kernel:         3.10.0-302.el7.x86_64
runlevel:       N 3
type:           CCpp
uid:            0

Truncated backtrace:
Thread no. 1 (9 frames)
 #0 g_hash_table_lookup_node at ghash.c:405
 #1 g_hash_table_remove_internal at ghash.c:1295
 #2 agent_delete_cb at settings/nm-secret-agent.c:426
 #3 g_simple_async_result_complete at gsimpleasyncresult.c:763
 #4 reply_cb at gdbusproxy.c:2623
 #5 g_simple_async_result_complete at gsimpleasyncresult.c:763
 #6 g_dbus_connection_call_done at gdbusconnection.c:5502
 #7 g_simple_async_result_complete at gsimpleasyncresult.c:763
 #8 complete_in_idle_cb at gsimpleasyncresult.c:775
Comment 1 Lubomir Rintel 2015-08-13 12:16:16 EDT
Created attachment 1062702 [details]
File: backtrace
Comment 2 Lubomir Rintel 2015-08-13 12:16:17 EDT
Created attachment 1062703 [details]
File: cgroup
Comment 3 Lubomir Rintel 2015-08-13 12:16:18 EDT
Created attachment 1062704 [details]
File: core_backtrace
Comment 4 Lubomir Rintel 2015-08-13 12:16:19 EDT
Created attachment 1062705 [details]
File: dso_list
Comment 5 Lubomir Rintel 2015-08-13 12:16:20 EDT
Created attachment 1062706 [details]
File: environ
Comment 6 Lubomir Rintel 2015-08-13 12:16:20 EDT
Created attachment 1062707 [details]
File: limits
Comment 7 Lubomir Rintel 2015-08-13 12:16:21 EDT
Created attachment 1062708 [details]
File: machineid
Comment 8 Lubomir Rintel 2015-08-13 12:16:22 EDT
Created attachment 1062709 [details]
File: maps
Comment 9 Lubomir Rintel 2015-08-13 12:16:23 EDT
Created attachment 1062710 [details]
File: open_fds
Comment 10 Lubomir Rintel 2015-08-13 12:16:24 EDT
Created attachment 1062711 [details]
File: proc_pid_status
Comment 11 Lubomir Rintel 2015-08-13 12:16:25 EDT
Created attachment 1062712 [details]
File: var_log_messages
Comment 13 Lubomir Rintel 2015-08-14 04:26:25 EDT
Can't reproduce this one.

Happened during the big delay between these two:

NetworkManager_Test327_nmtui_ipv4_dns_method_auto_+_dns
...
NetworkManager_Test339_nmtui_ipv4_invalid_gateway
Comment 15 Jirka Klimes 2015-08-14 04:43:51 EDT
Created attachment 1062934 [details]
[PATCH] fix for an incorrect hash table handling
Comment 17 Lubomir Rintel 2015-08-14 09:36:08 EDT
LGTM
Comment 18 Thomas Haller 2015-08-14 09:42:30 EDT
(In reply to Jirka Klimes from comment #15)
> Created attachment 1062934 [details]
> [PATCH] fix for an incorrect hash table handling

I don't think this change fixes the issue.


	if (!g_hash_table_contains (priv->requests, r))
		g_hash_table_insert (priv->requests, r, r);


note that @r was just allocated newly a few lines above.
So, either @r was not in the hash (and the check is unneeded), or the hash contained a dangling pointer. The hash must not contain dangling pointers, and if it did -- that was the bug.
Comment 19 Thomas Haller 2015-08-14 09:43:37 EDT
Created attachment 1063036 [details]
[patch] rewrite handling async calls in NMSecretAgent

How about this larger refactoring?
Comment 20 Thomas Haller 2015-08-14 09:45:23 EDT
(In reply to Thomas Haller from comment #19)
> Created attachment 1063036 [details]
> [patch] rewrite handling async calls in NMSecretAgent
> 
> How about this larger refactoring?

ok, pushed the patch to th/secret-agent-rh1063036 ... seams easier to review
Comment 21 Thomas Haller 2015-08-16 06:08:25 EDT
(In reply to Thomas Haller from comment #20)

> ok, pushed the patch to th/secret-agent-rh1063036

th/secret-agent-rh1253407 of course...
Comment 22 Thomas Haller 2015-08-19 07:09:23 EDT
Got a crash, I think it's the same issue:


Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007ff97c988f55 in g_cancellable_is_cancelled (cancellable=0x22) at gcancellable.c:295
295       return cancellable != NULL && cancellable->priv->cancelled;
(gdb) bt
#0  0x00007ff97c988f55 in g_cancellable_is_cancelled (cancellable=0x22) at gcancellable.c:295
#1  0x000055f3897685e9 in get_callback (proxy=0x55f38b8a6770 [NMDBusSecretAgentProxy], result=0x55f38ba23a00, user_data=0x55f38b9dfb90) at settings/nm-secret-agent.c:269
#2  0x00007ff97c9c3ba7 in g_simple_async_result_complete (simple=0x55f38ba23a00 [GSimpleAsyncResult]) at gsimpleasyncresult.c:763
#3  0x00007ff97ca2b7a1 in reply_cb (connection=<optimized out>, res=<optimized out>, user_data=0x55f38ba23a00) at gdbusproxy.c:2623
#4  0x00007ff97c9c3ba7 in g_simple_async_result_complete (simple=0x55f38ba2c310 [GSimpleAsyncResult]) at gsimpleasyncresult.c:763
#5  0x00007ff97ca208cc in g_dbus_connection_call_done (source=<optimized out>, result=<optimized out>, user_data=0x7ff968006760) at gdbusconnection.c:5502
#6  0x00007ff97c9c3ba7 in g_simple_async_result_complete (simple=0x55f38ba23ae0 [GSimpleAsyncResult]) at gsimpleasyncresult.c:763
#7  0x00007ff97c9c3c09 in complete_in_idle_cb (data=<optimized out>) at gsimpleasyncresult.c:775
#8  0x00007ff97c40ca8a in g_main_context_dispatch (context=0x55f38b896340) at gmain.c:3122
#9  0x00007ff97c40ca8a in g_main_context_dispatch (context=context@entry=0x55f38b896340) at gmain.c:3737
#10 0x00007ff97c40ce20 in g_main_context_iterate (context=0x55f38b896340, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at gmain.c:3808
#11 0x00007ff97c40d142 in g_main_loop_run (loop=0x55f38b896400) at gmain.c:4002
#12 0x000055f38956f45e in main (argc=1, argv=0x7fffd9b57a68) at main.c:449


Note that @cancellable is garbage (pointer value 0x22)
Comment 23 Thomas Haller 2015-08-20 12:30:32 EDT
*** Bug 1255455 has been marked as a duplicate of this bug. ***
Comment 24 Thomas Haller 2015-08-21 17:52:25 EDT
Largely extended the branch.

Also fixed another bug, that we left NMSecretAgent instances hanging if the client didn't properly unregister -- eg. during a crash.

In that case, the hanging secret agent instance is not only a leak, but it also prevents any further connections form the client.

please review.
Comment 25 Thomas Haller 2015-08-24 06:30:01 EDT
Another crash:

(gdb) bt
#0  0x00007ff1e48d79c8 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:55
#1  0x00007ff1e48d965a in __GI_abort () at abort.c:89
#2  0x00007ff1e4eea2a5 in g_assertion_message (domain=domain@entry=0x7ff1e4f3112e "GLib", file=file@entry=0x7ff1e4f359f2 "ghash.c", line=line@entry=373, func=func@entry=0x7ff1e4f35c90 <__func__.10250> "g_hash_table_lookup_node", message=message@entry=0x55c36e230970 "assertion failed: (hash_table->ref_count > 0)") at gtestutils.c:2356
#3  0x00007ff1e4eea33a in g_assertion_message_expr (domain=domain@entry=0x7ff1e4f3112e "GLib", file=file@entry=0x7ff1e4f359f2 "ghash.c", line=line@entry=373, func=func@entry=0x7ff1e4f35c90 <__func__.10250> "g_hash_table_lookup_node", expr=expr@entry=0x7ff1e4f359d8 "hash_table->ref_count > 0") at gtestutils.c:2371
Python Exception <class 'TypeError'> iter() returned non-iterator of type '_iterator':
#4  0x00007ff1e4eb25a6 in g_hash_table_remove_internal (hash_return=<synthetic pointer>, key=0x55c36e421660, hash_table=0x55c36e45b920) at ghash.c:373
Python Exception <class 'TypeError'> iter() returned non-iterator of type '_iterator':
#5  0x00007ff1e4eb25a6 in g_hash_table_remove_internal (hash_table=0x55c36e45b920, key=0x55c36e421660, notify=1) at ghash.c:1352
#6  0x000055c36d7a506e in agent_delete_cb (proxy=<optimized out>, result=0x55c36e1fc2e0, user_data=0x55c36e421660) at settings/nm-secret-agent.c:426
#7  0x00007ff1e547aba7 in g_simple_async_result_complete (simple=0x55c36e1fc2e0 [GSimpleAsyncResult]) at gsimpleasyncresult.c:763
#8  0x00007ff1e54e27a1 in reply_cb (connection=<optimized out>, res=<optimized out>, user_data=0x55c36e1fc2e0) at gdbusproxy.c:2623
#9  0x00007ff1e547aba7 in g_simple_async_result_complete (simple=0x7ff1d0008880 [GSimpleAsyncResult]) at gsimpleasyncresult.c:763
#10 0x00007ff1e54d78cc in g_dbus_connection_call_done (source=<optimized out>, result=<optimized out>, user_data=0x55c36e4890f0) at gdbusconnection.c:5502
#11 0x00007ff1e547aba7 in g_simple_async_result_complete (simple=0x55c36e2aade0 [GSimpleAsyncResult]) at gsimpleasyncresult.c:763
#12 0x00007ff1e547ac09 in complete_in_idle_cb (data=<optimized out>) at gsimpleasyncresult.c:775
#13 0x00007ff1e4ec3a8a in g_main_context_dispatch (context=0x55c36e0943a0) at gmain.c:3122
#14 0x00007ff1e4ec3a8a in g_main_context_dispatch (context=context@entry=0x55c36e0943a0) at gmain.c:3737
#15 0x00007ff1e4ec3e20 in g_main_context_iterate (context=0x55c36e0943a0, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at gmain.c:3808
#16 0x00007ff1e4ec4142 in g_main_loop_run (loop=0x55c36e094460) at gmain.c:4002
#17 0x000055c36d677256 in main (argc=1, argv=0x7ffca3bedd08) at main.c:449
Comment 26 Dan Winship 2015-08-24 11:15:41 EDT
> secret-agent: fix leak of @dbus_owner

>    (cherry picked from commit 833faf3bdbc0af74d09e11c5f081d47a51680d51)

I don't think you meant to use -x here. (That commit doesn't exist on fdo.)

Commit looks right. It looks like I introduced this bug in pre-gdbus.



> secret-agent: rework handling of asynchronous request and cancelling

>    Use an opaque type NMSecretAgentCallId for the call-id.

(would be nice if that was split out separately...)

>      When cancelling, the callback will be invoked synchronously, with an
>      GError indicating the cancelling (G_IO_ERROR_CANCELLED).

Callbacks that are sometimes synchronous and sometimes asynchronous can be difficult to work with, and have been removed with extreme prejudice from most of GNOME...

>    - Don't strip_remote_error from NMSecretManager. NMAgentManager inspects
>      the remote error for the 'GetSecrets' call. Any users of
>      NMSecretManager should strip the error themselves (as they see fit).

Ah, no, I missed a bit of gdbus porting here. Now that we're using gdbus everywhere, D-Bus errors get translated to the correct GError domain+code, so:

- 		if (_nm_dbus_error_has_name (error, NM_DBUS_INTERFACE_SECRET_AGENT ".UserCanceled")) {
+ 		if (g_error_matches (error, NM_SECRET_AGENT_ERROR, NM_SECRET_AGENT_ERROR_USER_CANCELLED)) {

(It looks like this is the only instance of this bug. Possibly _nm_dbus_error_has_name() should warn if the name starts with "org.freedesktop.NetworkManager".)

>    This also fixes a crash where handling of the asynchronous data was
>    messed up.

what exactly was messed up before?

> 	if (error) {
>-		nm_log_dbg (LOGD_AGENTS, "(%s) agent failed secrets request %p/%s/%s: (%d) %s",
>+		if (g_error_matches (error, G_IO_ERROR, G_IO_ERROR_CANCELLED)) {
>+			nm_log_dbg (LOGD_AGENTS, "(%s) get secrets request cancelled: %p/%s/%s",
>+			            nm_secret_agent_get_description (agent),
>+			            req, parent->detail, req->setting_name);
>+			return;
>+		}

Won't returning here cause req to be leaked?



> secret-agent: fix detection of disapearing secret-agent

>+nm_bus_manager_connection_is_private (NMBusManager *self,
>+                                      GDBusConnection *connection)

You can just check if g_dbus_connection_get_unique_name() is NULL or not.

>+	if (!new_owner || !*new_owner) {

new_owner can't be NULL here.
Comment 27 Thomas Haller 2015-08-24 14:14:11 EDT
(In reply to Dan Winship from comment #26)
> > secret-agent: fix leak of @dbus_owner
> 
> >    (cherry picked from commit 833faf3bdbc0af74d09e11c5f081d47a51680d51)
> 
> I don't think you meant to use -x here. (That commit doesn't exist on fdo.)

Correct. Fixed

> Commit looks right. It looks like I introduced this bug in pre-gdbus.
> 
> 
> 
> > secret-agent: rework handling of asynchronous request and cancelling
> 
> >    Use an opaque type NMSecretAgentCallId for the call-id.
> 
> (would be nice if that was split out separately...)

Done. New commit:
  "secret-agent: refactor call-id to be a (opaque pointer) type instead of a void pointer"



> >      When cancelling, the callback will be invoked synchronously, with an
> >      GError indicating the cancelling (G_IO_ERROR_CANCELLED).
> 
> Callbacks that are sometimes synchronous and sometimes asynchronous can be
> difficult to work with, and have been removed with extreme prejudice from
> most of GNOME...

Correct. The callbacks are always invoked synchronously when invoking the action. Only during "cancel()" and "dispose()", the callback will be invoked synchronously with respect to the cancel/dispose(). Especially, for the dispose case, that is unavoidable. 


> >    - Don't strip_remote_error from NMSecretManager. NMAgentManager inspects
> >      the remote error for the 'GetSecrets' call. Any users of
> >      NMSecretManager should strip the error themselves (as they see fit).
> 
> Ah, no, I missed a bit of gdbus porting here. Now that we're using gdbus
> everywhere, D-Bus errors get translated to the correct GError domain+code,
> so:
> 
> - 		if (_nm_dbus_error_has_name (error, NM_DBUS_INTERFACE_SECRET_AGENT
> ".UserCanceled")) {
> + 		if (g_error_matches (error, NM_SECRET_AGENT_ERROR,
> NM_SECRET_AGENT_ERROR_USER_CANCELLED)) {
> 
> (It looks like this is the only instance of this bug. Possibly
> _nm_dbus_error_has_name() should warn if the name starts with
> "org.freedesktop.NetworkManager".)


How is the fixup!


> >    This also fixes a crash where handling of the asynchronous data was
> >    messed up.
> 
> what exactly was messed up before?

I don't really know. There are several backtraces of crashes attached to this BZ. The indicate that we have dangling pointers inside priv->requests. I didn't investigate how that happens exactly, because I refactored the code before I noticed the bug. But the branches fixes the crashes for me.


> > 	if (error) {
> >-		nm_log_dbg (LOGD_AGENTS, "(%s) agent failed secrets request %p/%s/%s: (%d) %s",
> >+		if (g_error_matches (error, G_IO_ERROR, G_IO_ERROR_CANCELLED)) {
> >+			nm_log_dbg (LOGD_AGENTS, "(%s) get secrets request cancelled: %p/%s/%s",
> >+			            nm_secret_agent_get_description (agent),
> >+			            req, parent->detail, req->setting_name);
> >+			return;
> >+		}
> 
> Won't returning here cause req to be leaked?

I think not. We return here only when nm-agent-manager cancels the requests. 
I will think more about this... (TODO)


> > secret-agent: fix detection of disapearing secret-agent
> 
> >+nm_bus_manager_connection_is_private (NMBusManager *self,
> >+                                      GDBusConnection *connection)
> 
> You can just check if g_dbus_connection_get_unique_name() is NULL or not.

fixed.


> >+	if (!new_owner || !*new_owner) {
> 
> new_owner can't be NULL here.

fixed.



Thanks, Repushed.
Comment 28 Dan Winship 2015-08-24 16:28:27 EDT
(In reply to Thomas Haller from comment #27)
> > (It looks like this is the only instance of this bug. Possibly
> > _nm_dbus_error_has_name() should warn if the name starts with
> > "org.freedesktop.NetworkManager".)
> 
> How is the fixup!

Good, but should probably update the docs too:

    This should only be used for "foreign" D-Bus errors (eg, errors
    from BlueZ or wpa_supplicant). All NetworkManager D-Bus errors
    should be properly mapped by gdbus to one of the domains/codes in
    nm-errors.h

Oh, also, the commit message needs to be updated to match the squashed commit.

> > Won't returning here cause req to be leaked?
> 
> I think not. We return here only when nm-agent-manager cancels the requests. 
> I will think more about this... (TODO)

Hm... ok.


I think the rest looks right.
Comment 29 Beniamino Galvani 2015-08-24 16:53:04 EDT
> secret-agent: rework handling of asynchronous request and cancelling

s/NMSecretManager/NMSecretAgent/ in commit message?

> secret-agent: fix detection of disapearing secret-agent

disappearing

"The signal "nofity:g-name-owner" is only emited for well-know"

notify, emitted

I'm not very familiar with agents management, but at first look
the branch looks good to me.
Comment 30 Thomas Haller 2015-08-25 04:23:26 EDT
(In reply to Dan Winship from comment #28)

> > > Won't returning here cause req to be leaked?
> > 
> > I think not. We return here only when nm-agent-manager cancels the requests. 
> > I will think more about this... (TODO)
> 
> Hm... ok.

I think it is correct.

There are only two callers of nm_secret_agent_cancel_secrets():
  get_cancel_cb() and request_next_agent().

get_cancel_cb() is only invoked inside request_free() -- so there is noting to free.
And during request_next_agent() we should already be in process of doing the necessary cleanup.


Not really sure though. IMO NMAgentManager should be significantly simplified. It's quite confusing (to me).
Comment 31 Thomas Haller 2015-08-25 05:16:52 EDT
Squashed all fixup commits, rebased and repushed.
IMO it's ready to be merged.



I only doubt about how the subscription to "NameOwnerChanged" signal works
(nm_secret_agent_new() and _on_disconnected_name_owner_changed()).

AFAIS, it tests OK and it seems to correspond to the implementation in GDBusProxy.

What I don't understand, is that we pass priv->dbus_owner as "arg0" to g_dbus_connection_signal_subscribe. Is that all we need to ensure that we get the correct signal in _on_disconnected_name_owner_changed()? Could it not happen, that we get called in g_dbus_connection_signal_subscribe() for a wrong connection?

Can somebody who understands this confirm that this is correct?
Comment 32 Dan Winship 2015-08-25 09:32:19 EDT
(In reply to Thomas Haller from comment #31)
> What I don't understand, is that we pass priv->dbus_owner as "arg0" to
> g_dbus_connection_signal_subscribe. Is that all we need to ensure that we
> get the correct signal in _on_disconnected_name_owner_changed()?

Yes. It means you will only receive the signal when the signal's first argument matches that arg0.
Comment 34 Thomas Haller 2015-09-03 07:37:32 EDT
Closing BZ as it was tracking an issue with upstream master -- and got fixed there.

No builds in RHEL were affected.

Note You need to log in before you can comment on or make changes to this bug.