RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.
Bug 1253407 - [abrt] [beaker] [master] infinite loop in agent_delete_cb()
Summary: [abrt] [beaker] [master] infinite loop in agent_delete_cb()
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: NetworkManager
Version: 7.3
Hardware: x86_64
OS: Unspecified
medium
medium
Target Milestone: rc
: ---
Assignee: Thomas Haller
QA Contact: Desktop QE
URL:
Whiteboard: abrt_hash:e9b721edf5312daeb1f4bf8251a...
: 1255455 (view as bug list)
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2015-08-13 16:16 UTC by Lubomir Rintel
Modified: 2016-03-29 14:25 UTC (History)
7 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-09-03 11:37:32 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
File: backtrace (20.17 KB, text/plain)
2015-08-13 16:16 UTC, Lubomir Rintel
no flags Details
File: cgroup (199 bytes, text/plain)
2015-08-13 16:16 UTC, Lubomir Rintel
no flags Details
File: core_backtrace (7.88 KB, text/plain)
2015-08-13 16:16 UTC, Lubomir Rintel
no flags Details
File: dso_list (5.81 KB, text/plain)
2015-08-13 16:16 UTC, Lubomir Rintel
no flags Details
File: environ (72 bytes, text/plain)
2015-08-13 16:16 UTC, Lubomir Rintel
no flags Details
File: limits (1.29 KB, text/plain)
2015-08-13 16:16 UTC, Lubomir Rintel
no flags Details
File: machineid (135 bytes, text/plain)
2015-08-13 16:16 UTC, Lubomir Rintel
no flags Details
File: maps (26.79 KB, text/plain)
2015-08-13 16:16 UTC, Lubomir Rintel
no flags Details
File: open_fds (22.73 KB, text/plain)
2015-08-13 16:16 UTC, Lubomir Rintel
no flags Details
File: proc_pid_status (1.04 KB, text/plain)
2015-08-13 16:16 UTC, Lubomir Rintel
no flags Details
File: var_log_messages (43.11 KB, text/plain)
2015-08-13 16:16 UTC, Lubomir Rintel
no flags Details
[PATCH] fix for an incorrect hash table handling (2.00 KB, patch)
2015-08-14 08:43 UTC, Jirka Klimes
no flags Details | Diff
[patch] rewrite handling async calls in NMSecretAgent (23.46 KB, patch)
2015-08-14 13:43 UTC, Thomas Haller
no flags Details | Diff

Description Lubomir Rintel 2015-08-13 16:16:14 UTC
Description of problem:
Beaker run: https://beaker.engineering.redhat.com/jobs/1048794

NetworkManager ends up in an infinite loop in g_hash_table_remove().

Version-Release number of selected component:
NetworkManager-1.1.0-12796.8810036895.el7.centos

Additional info:
reporter:       libreport-2.1.11
backtrace_rating: 4
cmdline:        /usr/sbin/NetworkManager --no-daemon
crash_function: g_hash_table_lookup_node
executable:     /usr/sbin/NetworkManager
kernel:         3.10.0-302.el7.x86_64
runlevel:       N 3
type:           CCpp
uid:            0

Truncated backtrace:
Thread no. 1 (9 frames)
 #0 g_hash_table_lookup_node at ghash.c:405
 #1 g_hash_table_remove_internal at ghash.c:1295
 #2 agent_delete_cb at settings/nm-secret-agent.c:426
 #3 g_simple_async_result_complete at gsimpleasyncresult.c:763
 #4 reply_cb at gdbusproxy.c:2623
 #5 g_simple_async_result_complete at gsimpleasyncresult.c:763
 #6 g_dbus_connection_call_done at gdbusconnection.c:5502
 #7 g_simple_async_result_complete at gsimpleasyncresult.c:763
 #8 complete_in_idle_cb at gsimpleasyncresult.c:775

Comment 1 Lubomir Rintel 2015-08-13 16:16:16 UTC
Created attachment 1062702 [details]
File: backtrace

Comment 2 Lubomir Rintel 2015-08-13 16:16:17 UTC
Created attachment 1062703 [details]
File: cgroup

Comment 3 Lubomir Rintel 2015-08-13 16:16:18 UTC
Created attachment 1062704 [details]
File: core_backtrace

Comment 4 Lubomir Rintel 2015-08-13 16:16:19 UTC
Created attachment 1062705 [details]
File: dso_list

Comment 5 Lubomir Rintel 2015-08-13 16:16:20 UTC
Created attachment 1062706 [details]
File: environ

Comment 6 Lubomir Rintel 2015-08-13 16:16:20 UTC
Created attachment 1062707 [details]
File: limits

Comment 7 Lubomir Rintel 2015-08-13 16:16:21 UTC
Created attachment 1062708 [details]
File: machineid

Comment 8 Lubomir Rintel 2015-08-13 16:16:22 UTC
Created attachment 1062709 [details]
File: maps

Comment 9 Lubomir Rintel 2015-08-13 16:16:23 UTC
Created attachment 1062710 [details]
File: open_fds

Comment 10 Lubomir Rintel 2015-08-13 16:16:24 UTC
Created attachment 1062711 [details]
File: proc_pid_status

Comment 11 Lubomir Rintel 2015-08-13 16:16:25 UTC
Created attachment 1062712 [details]
File: var_log_messages

Comment 13 Lubomir Rintel 2015-08-14 08:26:25 UTC
Can't reproduce this one.

Happened during the big delay between these two:

NetworkManager_Test327_nmtui_ipv4_dns_method_auto_+_dns
...
NetworkManager_Test339_nmtui_ipv4_invalid_gateway

Comment 15 Jirka Klimes 2015-08-14 08:43:51 UTC
Created attachment 1062934 [details]
[PATCH] fix for an incorrect hash table handling

Comment 17 Lubomir Rintel 2015-08-14 13:36:08 UTC
LGTM

Comment 18 Thomas Haller 2015-08-14 13:42:30 UTC
(In reply to Jirka Klimes from comment #15)
> Created attachment 1062934 [details]
> [PATCH] fix for an incorrect hash table handling

I don't think this change fixes the issue.


	if (!g_hash_table_contains (priv->requests, r))
		g_hash_table_insert (priv->requests, r, r);


note that @r was just allocated newly a few lines above.
So, either @r was not in the hash (and the check is unneeded), or the hash contained a dangling pointer. The hash must not contain dangling pointers, and if it did -- that was the bug.

Comment 19 Thomas Haller 2015-08-14 13:43:37 UTC
Created attachment 1063036 [details]
[patch] rewrite handling async calls in NMSecretAgent

How about this larger refactoring?

Comment 20 Thomas Haller 2015-08-14 13:45:23 UTC
(In reply to Thomas Haller from comment #19)
> Created attachment 1063036 [details]
> [patch] rewrite handling async calls in NMSecretAgent
> 
> How about this larger refactoring?

ok, pushed the patch to th/secret-agent-rh1063036 ... seams easier to review

Comment 21 Thomas Haller 2015-08-16 10:08:25 UTC
(In reply to Thomas Haller from comment #20)

> ok, pushed the patch to th/secret-agent-rh1063036

th/secret-agent-rh1253407 of course...

Comment 22 Thomas Haller 2015-08-19 11:09:23 UTC
Got a crash, I think it's the same issue:


Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007ff97c988f55 in g_cancellable_is_cancelled (cancellable=0x22) at gcancellable.c:295
295       return cancellable != NULL && cancellable->priv->cancelled;
(gdb) bt
#0  0x00007ff97c988f55 in g_cancellable_is_cancelled (cancellable=0x22) at gcancellable.c:295
#1  0x000055f3897685e9 in get_callback (proxy=0x55f38b8a6770 [NMDBusSecretAgentProxy], result=0x55f38ba23a00, user_data=0x55f38b9dfb90) at settings/nm-secret-agent.c:269
#2  0x00007ff97c9c3ba7 in g_simple_async_result_complete (simple=0x55f38ba23a00 [GSimpleAsyncResult]) at gsimpleasyncresult.c:763
#3  0x00007ff97ca2b7a1 in reply_cb (connection=<optimized out>, res=<optimized out>, user_data=0x55f38ba23a00) at gdbusproxy.c:2623
#4  0x00007ff97c9c3ba7 in g_simple_async_result_complete (simple=0x55f38ba2c310 [GSimpleAsyncResult]) at gsimpleasyncresult.c:763
#5  0x00007ff97ca208cc in g_dbus_connection_call_done (source=<optimized out>, result=<optimized out>, user_data=0x7ff968006760) at gdbusconnection.c:5502
#6  0x00007ff97c9c3ba7 in g_simple_async_result_complete (simple=0x55f38ba23ae0 [GSimpleAsyncResult]) at gsimpleasyncresult.c:763
#7  0x00007ff97c9c3c09 in complete_in_idle_cb (data=<optimized out>) at gsimpleasyncresult.c:775
#8  0x00007ff97c40ca8a in g_main_context_dispatch (context=0x55f38b896340) at gmain.c:3122
#9  0x00007ff97c40ca8a in g_main_context_dispatch (context=context@entry=0x55f38b896340) at gmain.c:3737
#10 0x00007ff97c40ce20 in g_main_context_iterate (context=0x55f38b896340, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at gmain.c:3808
#11 0x00007ff97c40d142 in g_main_loop_run (loop=0x55f38b896400) at gmain.c:4002
#12 0x000055f38956f45e in main (argc=1, argv=0x7fffd9b57a68) at main.c:449


Note that @cancellable is garbage (pointer value 0x22)

Comment 23 Thomas Haller 2015-08-20 16:30:32 UTC
*** Bug 1255455 has been marked as a duplicate of this bug. ***

Comment 24 Thomas Haller 2015-08-21 21:52:25 UTC
Largely extended the branch.

Also fixed another bug, that we left NMSecretAgent instances hanging if the client didn't properly unregister -- eg. during a crash.

In that case, the hanging secret agent instance is not only a leak, but it also prevents any further connections form the client.

please review.

Comment 25 Thomas Haller 2015-08-24 10:30:01 UTC
Another crash:

(gdb) bt
#0  0x00007ff1e48d79c8 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:55
#1  0x00007ff1e48d965a in __GI_abort () at abort.c:89
#2  0x00007ff1e4eea2a5 in g_assertion_message (domain=domain@entry=0x7ff1e4f3112e "GLib", file=file@entry=0x7ff1e4f359f2 "ghash.c", line=line@entry=373, func=func@entry=0x7ff1e4f35c90 <__func__.10250> "g_hash_table_lookup_node", message=message@entry=0x55c36e230970 "assertion failed: (hash_table->ref_count > 0)") at gtestutils.c:2356
#3  0x00007ff1e4eea33a in g_assertion_message_expr (domain=domain@entry=0x7ff1e4f3112e "GLib", file=file@entry=0x7ff1e4f359f2 "ghash.c", line=line@entry=373, func=func@entry=0x7ff1e4f35c90 <__func__.10250> "g_hash_table_lookup_node", expr=expr@entry=0x7ff1e4f359d8 "hash_table->ref_count > 0") at gtestutils.c:2371
Python Exception <class 'TypeError'> iter() returned non-iterator of type '_iterator':
#4  0x00007ff1e4eb25a6 in g_hash_table_remove_internal (hash_return=<synthetic pointer>, key=0x55c36e421660, hash_table=0x55c36e45b920) at ghash.c:373
Python Exception <class 'TypeError'> iter() returned non-iterator of type '_iterator':
#5  0x00007ff1e4eb25a6 in g_hash_table_remove_internal (hash_table=0x55c36e45b920, key=0x55c36e421660, notify=1) at ghash.c:1352
#6  0x000055c36d7a506e in agent_delete_cb (proxy=<optimized out>, result=0x55c36e1fc2e0, user_data=0x55c36e421660) at settings/nm-secret-agent.c:426
#7  0x00007ff1e547aba7 in g_simple_async_result_complete (simple=0x55c36e1fc2e0 [GSimpleAsyncResult]) at gsimpleasyncresult.c:763
#8  0x00007ff1e54e27a1 in reply_cb (connection=<optimized out>, res=<optimized out>, user_data=0x55c36e1fc2e0) at gdbusproxy.c:2623
#9  0x00007ff1e547aba7 in g_simple_async_result_complete (simple=0x7ff1d0008880 [GSimpleAsyncResult]) at gsimpleasyncresult.c:763
#10 0x00007ff1e54d78cc in g_dbus_connection_call_done (source=<optimized out>, result=<optimized out>, user_data=0x55c36e4890f0) at gdbusconnection.c:5502
#11 0x00007ff1e547aba7 in g_simple_async_result_complete (simple=0x55c36e2aade0 [GSimpleAsyncResult]) at gsimpleasyncresult.c:763
#12 0x00007ff1e547ac09 in complete_in_idle_cb (data=<optimized out>) at gsimpleasyncresult.c:775
#13 0x00007ff1e4ec3a8a in g_main_context_dispatch (context=0x55c36e0943a0) at gmain.c:3122
#14 0x00007ff1e4ec3a8a in g_main_context_dispatch (context=context@entry=0x55c36e0943a0) at gmain.c:3737
#15 0x00007ff1e4ec3e20 in g_main_context_iterate (context=0x55c36e0943a0, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at gmain.c:3808
#16 0x00007ff1e4ec4142 in g_main_loop_run (loop=0x55c36e094460) at gmain.c:4002
#17 0x000055c36d677256 in main (argc=1, argv=0x7ffca3bedd08) at main.c:449

Comment 26 Dan Winship 2015-08-24 15:15:41 UTC
> secret-agent: fix leak of @dbus_owner

>    (cherry picked from commit 833faf3bdbc0af74d09e11c5f081d47a51680d51)

I don't think you meant to use -x here. (That commit doesn't exist on fdo.)

Commit looks right. It looks like I introduced this bug in pre-gdbus.



> secret-agent: rework handling of asynchronous request and cancelling

>    Use an opaque type NMSecretAgentCallId for the call-id.

(would be nice if that was split out separately...)

>      When cancelling, the callback will be invoked synchronously, with an
>      GError indicating the cancelling (G_IO_ERROR_CANCELLED).

Callbacks that are sometimes synchronous and sometimes asynchronous can be difficult to work with, and have been removed with extreme prejudice from most of GNOME...

>    - Don't strip_remote_error from NMSecretManager. NMAgentManager inspects
>      the remote error for the 'GetSecrets' call. Any users of
>      NMSecretManager should strip the error themselves (as they see fit).

Ah, no, I missed a bit of gdbus porting here. Now that we're using gdbus everywhere, D-Bus errors get translated to the correct GError domain+code, so:

- 		if (_nm_dbus_error_has_name (error, NM_DBUS_INTERFACE_SECRET_AGENT ".UserCanceled")) {
+ 		if (g_error_matches (error, NM_SECRET_AGENT_ERROR, NM_SECRET_AGENT_ERROR_USER_CANCELLED)) {

(It looks like this is the only instance of this bug. Possibly _nm_dbus_error_has_name() should warn if the name starts with "org.freedesktop.NetworkManager".)

>    This also fixes a crash where handling of the asynchronous data was
>    messed up.

what exactly was messed up before?

> 	if (error) {
>-		nm_log_dbg (LOGD_AGENTS, "(%s) agent failed secrets request %p/%s/%s: (%d) %s",
>+		if (g_error_matches (error, G_IO_ERROR, G_IO_ERROR_CANCELLED)) {
>+			nm_log_dbg (LOGD_AGENTS, "(%s) get secrets request cancelled: %p/%s/%s",
>+			            nm_secret_agent_get_description (agent),
>+			            req, parent->detail, req->setting_name);
>+			return;
>+		}

Won't returning here cause req to be leaked?



> secret-agent: fix detection of disapearing secret-agent

>+nm_bus_manager_connection_is_private (NMBusManager *self,
>+                                      GDBusConnection *connection)

You can just check if g_dbus_connection_get_unique_name() is NULL or not.

>+	if (!new_owner || !*new_owner) {

new_owner can't be NULL here.

Comment 27 Thomas Haller 2015-08-24 18:14:11 UTC
(In reply to Dan Winship from comment #26)
> > secret-agent: fix leak of @dbus_owner
> 
> >    (cherry picked from commit 833faf3bdbc0af74d09e11c5f081d47a51680d51)
> 
> I don't think you meant to use -x here. (That commit doesn't exist on fdo.)

Correct. Fixed

> Commit looks right. It looks like I introduced this bug in pre-gdbus.
> 
> 
> 
> > secret-agent: rework handling of asynchronous request and cancelling
> 
> >    Use an opaque type NMSecretAgentCallId for the call-id.
> 
> (would be nice if that was split out separately...)

Done. New commit:
  "secret-agent: refactor call-id to be a (opaque pointer) type instead of a void pointer"



> >      When cancelling, the callback will be invoked synchronously, with an
> >      GError indicating the cancelling (G_IO_ERROR_CANCELLED).
> 
> Callbacks that are sometimes synchronous and sometimes asynchronous can be
> difficult to work with, and have been removed with extreme prejudice from
> most of GNOME...

Correct. The callbacks are always invoked synchronously when invoking the action. Only during "cancel()" and "dispose()", the callback will be invoked synchronously with respect to the cancel/dispose(). Especially, for the dispose case, that is unavoidable. 


> >    - Don't strip_remote_error from NMSecretManager. NMAgentManager inspects
> >      the remote error for the 'GetSecrets' call. Any users of
> >      NMSecretManager should strip the error themselves (as they see fit).
> 
> Ah, no, I missed a bit of gdbus porting here. Now that we're using gdbus
> everywhere, D-Bus errors get translated to the correct GError domain+code,
> so:
> 
> - 		if (_nm_dbus_error_has_name (error, NM_DBUS_INTERFACE_SECRET_AGENT
> ".UserCanceled")) {
> + 		if (g_error_matches (error, NM_SECRET_AGENT_ERROR,
> NM_SECRET_AGENT_ERROR_USER_CANCELLED)) {
> 
> (It looks like this is the only instance of this bug. Possibly
> _nm_dbus_error_has_name() should warn if the name starts with
> "org.freedesktop.NetworkManager".)


How is the fixup!


> >    This also fixes a crash where handling of the asynchronous data was
> >    messed up.
> 
> what exactly was messed up before?

I don't really know. There are several backtraces of crashes attached to this BZ. The indicate that we have dangling pointers inside priv->requests. I didn't investigate how that happens exactly, because I refactored the code before I noticed the bug. But the branches fixes the crashes for me.


> > 	if (error) {
> >-		nm_log_dbg (LOGD_AGENTS, "(%s) agent failed secrets request %p/%s/%s: (%d) %s",
> >+		if (g_error_matches (error, G_IO_ERROR, G_IO_ERROR_CANCELLED)) {
> >+			nm_log_dbg (LOGD_AGENTS, "(%s) get secrets request cancelled: %p/%s/%s",
> >+			            nm_secret_agent_get_description (agent),
> >+			            req, parent->detail, req->setting_name);
> >+			return;
> >+		}
> 
> Won't returning here cause req to be leaked?

I think not. We return here only when nm-agent-manager cancels the requests. 
I will think more about this... (TODO)


> > secret-agent: fix detection of disapearing secret-agent
> 
> >+nm_bus_manager_connection_is_private (NMBusManager *self,
> >+                                      GDBusConnection *connection)
> 
> You can just check if g_dbus_connection_get_unique_name() is NULL or not.

fixed.


> >+	if (!new_owner || !*new_owner) {
> 
> new_owner can't be NULL here.

fixed.



Thanks, Repushed.

Comment 28 Dan Winship 2015-08-24 20:28:27 UTC
(In reply to Thomas Haller from comment #27)
> > (It looks like this is the only instance of this bug. Possibly
> > _nm_dbus_error_has_name() should warn if the name starts with
> > "org.freedesktop.NetworkManager".)
> 
> How is the fixup!

Good, but should probably update the docs too:

    This should only be used for "foreign" D-Bus errors (eg, errors
    from BlueZ or wpa_supplicant). All NetworkManager D-Bus errors
    should be properly mapped by gdbus to one of the domains/codes in
    nm-errors.h

Oh, also, the commit message needs to be updated to match the squashed commit.

> > Won't returning here cause req to be leaked?
> 
> I think not. We return here only when nm-agent-manager cancels the requests. 
> I will think more about this... (TODO)

Hm... ok.


I think the rest looks right.

Comment 29 Beniamino Galvani 2015-08-24 20:53:04 UTC
> secret-agent: rework handling of asynchronous request and cancelling

s/NMSecretManager/NMSecretAgent/ in commit message?

> secret-agent: fix detection of disapearing secret-agent

disappearing

"The signal "nofity:g-name-owner" is only emited for well-know"

notify, emitted

I'm not very familiar with agents management, but at first look
the branch looks good to me.

Comment 30 Thomas Haller 2015-08-25 08:23:26 UTC
(In reply to Dan Winship from comment #28)

> > > Won't returning here cause req to be leaked?
> > 
> > I think not. We return here only when nm-agent-manager cancels the requests. 
> > I will think more about this... (TODO)
> 
> Hm... ok.

I think it is correct.

There are only two callers of nm_secret_agent_cancel_secrets():
  get_cancel_cb() and request_next_agent().

get_cancel_cb() is only invoked inside request_free() -- so there is noting to free.
And during request_next_agent() we should already be in process of doing the necessary cleanup.


Not really sure though. IMO NMAgentManager should be significantly simplified. It's quite confusing (to me).

Comment 31 Thomas Haller 2015-08-25 09:16:52 UTC
Squashed all fixup commits, rebased and repushed.
IMO it's ready to be merged.



I only doubt about how the subscription to "NameOwnerChanged" signal works
(nm_secret_agent_new() and _on_disconnected_name_owner_changed()).

AFAIS, it tests OK and it seems to correspond to the implementation in GDBusProxy.

What I don't understand, is that we pass priv->dbus_owner as "arg0" to g_dbus_connection_signal_subscribe. Is that all we need to ensure that we get the correct signal in _on_disconnected_name_owner_changed()? Could it not happen, that we get called in g_dbus_connection_signal_subscribe() for a wrong connection?

Can somebody who understands this confirm that this is correct?

Comment 32 Dan Winship 2015-08-25 13:32:19 UTC
(In reply to Thomas Haller from comment #31)
> What I don't understand, is that we pass priv->dbus_owner as "arg0" to
> g_dbus_connection_signal_subscribe. Is that all we need to ensure that we
> get the correct signal in _on_disconnected_name_owner_changed()?

Yes. It means you will only receive the signal when the signal's first argument matches that arg0.

Comment 34 Thomas Haller 2015-09-03 11:37:32 UTC
Closing BZ as it was tracking an issue with upstream master -- and got fixed there.

No builds in RHEL were affected.


Note You need to log in before you can comment on or make changes to this bug.