Bug 1253407
| Summary: | [abrt] [beaker] [master] infinite loop in agent_delete_cb() | ||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Lubomir Rintel <lrintel> | ||||||||||||||||||||||||||||
| Component: | NetworkManager | Assignee: | Thomas Haller <thaller> | ||||||||||||||||||||||||||||
| Status: | CLOSED UPSTREAM | QA Contact: | Desktop QE <desktop-qa-list> | ||||||||||||||||||||||||||||
| Severity: | medium | Docs Contact: | |||||||||||||||||||||||||||||
| Priority: | medium | ||||||||||||||||||||||||||||||
| Version: | 7.3 | CC: | bgalvani, danw, dcbw, jklimes, lrintel, rkhan, thaller | ||||||||||||||||||||||||||||
| Target Milestone: | rc | ||||||||||||||||||||||||||||||
| Target Release: | --- | ||||||||||||||||||||||||||||||
| Hardware: | x86_64 | ||||||||||||||||||||||||||||||
| OS: | Unspecified | ||||||||||||||||||||||||||||||
| Whiteboard: | abrt_hash:e9b721edf5312daeb1f4bf8251ae28b0c6b41a80 | ||||||||||||||||||||||||||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||||||||||||||||||||||||||
| Doc Text: | Story Points: | --- | |||||||||||||||||||||||||||||
| Clone Of: | Environment: | ||||||||||||||||||||||||||||||
| Last Closed: | 2015-09-03 11:37:32 UTC | Type: | --- | ||||||||||||||||||||||||||||
| Regression: | --- | Mount Type: | --- | ||||||||||||||||||||||||||||
| Documentation: | --- | CRM: | |||||||||||||||||||||||||||||
| Verified Versions: | Category: | --- | |||||||||||||||||||||||||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||||||||||||||||||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||||||||||||||||||||||||
| Embargoed: | |||||||||||||||||||||||||||||||
| Attachments: |
|
||||||||||||||||||||||||||||||
|
Description
Lubomir Rintel
2015-08-13 16:16:14 UTC
Created attachment 1062702 [details]
File: backtrace
Created attachment 1062703 [details]
File: cgroup
Created attachment 1062704 [details]
File: core_backtrace
Created attachment 1062705 [details]
File: dso_list
Created attachment 1062706 [details]
File: environ
Created attachment 1062707 [details]
File: limits
Created attachment 1062708 [details]
File: machineid
Created attachment 1062709 [details]
File: maps
Created attachment 1062710 [details]
File: open_fds
Created attachment 1062711 [details]
File: proc_pid_status
Created attachment 1062712 [details]
File: var_log_messages
Can't reproduce this one. Happened during the big delay between these two: NetworkManager_Test327_nmtui_ipv4_dns_method_auto_+_dns ... NetworkManager_Test339_nmtui_ipv4_invalid_gateway Created attachment 1062934 [details]
[PATCH] fix for an incorrect hash table handling
LGTM (In reply to Jirka Klimes from comment #15) > Created attachment 1062934 [details] > [PATCH] fix for an incorrect hash table handling I don't think this change fixes the issue. if (!g_hash_table_contains (priv->requests, r)) g_hash_table_insert (priv->requests, r, r); note that @r was just allocated newly a few lines above. So, either @r was not in the hash (and the check is unneeded), or the hash contained a dangling pointer. The hash must not contain dangling pointers, and if it did -- that was the bug. Created attachment 1063036 [details]
[patch] rewrite handling async calls in NMSecretAgent
How about this larger refactoring?
(In reply to Thomas Haller from comment #19) > Created attachment 1063036 [details] > [patch] rewrite handling async calls in NMSecretAgent > > How about this larger refactoring? ok, pushed the patch to th/secret-agent-rh1063036 ... seams easier to review (In reply to Thomas Haller from comment #20) > ok, pushed the patch to th/secret-agent-rh1063036 th/secret-agent-rh1253407 of course... Got a crash, I think it's the same issue: Program terminated with signal SIGSEGV, Segmentation fault. #0 0x00007ff97c988f55 in g_cancellable_is_cancelled (cancellable=0x22) at gcancellable.c:295 295 return cancellable != NULL && cancellable->priv->cancelled; (gdb) bt #0 0x00007ff97c988f55 in g_cancellable_is_cancelled (cancellable=0x22) at gcancellable.c:295 #1 0x000055f3897685e9 in get_callback (proxy=0x55f38b8a6770 [NMDBusSecretAgentProxy], result=0x55f38ba23a00, user_data=0x55f38b9dfb90) at settings/nm-secret-agent.c:269 #2 0x00007ff97c9c3ba7 in g_simple_async_result_complete (simple=0x55f38ba23a00 [GSimpleAsyncResult]) at gsimpleasyncresult.c:763 #3 0x00007ff97ca2b7a1 in reply_cb (connection=<optimized out>, res=<optimized out>, user_data=0x55f38ba23a00) at gdbusproxy.c:2623 #4 0x00007ff97c9c3ba7 in g_simple_async_result_complete (simple=0x55f38ba2c310 [GSimpleAsyncResult]) at gsimpleasyncresult.c:763 #5 0x00007ff97ca208cc in g_dbus_connection_call_done (source=<optimized out>, result=<optimized out>, user_data=0x7ff968006760) at gdbusconnection.c:5502 #6 0x00007ff97c9c3ba7 in g_simple_async_result_complete (simple=0x55f38ba23ae0 [GSimpleAsyncResult]) at gsimpleasyncresult.c:763 #7 0x00007ff97c9c3c09 in complete_in_idle_cb (data=<optimized out>) at gsimpleasyncresult.c:775 #8 0x00007ff97c40ca8a in g_main_context_dispatch (context=0x55f38b896340) at gmain.c:3122 #9 0x00007ff97c40ca8a in g_main_context_dispatch (context=context@entry=0x55f38b896340) at gmain.c:3737 #10 0x00007ff97c40ce20 in g_main_context_iterate (context=0x55f38b896340, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at gmain.c:3808 #11 0x00007ff97c40d142 in g_main_loop_run (loop=0x55f38b896400) at gmain.c:4002 #12 0x000055f38956f45e in main (argc=1, argv=0x7fffd9b57a68) at main.c:449 Note that @cancellable is garbage (pointer value 0x22) *** Bug 1255455 has been marked as a duplicate of this bug. *** Largely extended the branch. Also fixed another bug, that we left NMSecretAgent instances hanging if the client didn't properly unregister -- eg. during a crash. In that case, the hanging secret agent instance is not only a leak, but it also prevents any further connections form the client. please review. Another crash: (gdb) bt #0 0x00007ff1e48d79c8 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:55 #1 0x00007ff1e48d965a in __GI_abort () at abort.c:89 #2 0x00007ff1e4eea2a5 in g_assertion_message (domain=domain@entry=0x7ff1e4f3112e "GLib", file=file@entry=0x7ff1e4f359f2 "ghash.c", line=line@entry=373, func=func@entry=0x7ff1e4f35c90 <__func__.10250> "g_hash_table_lookup_node", message=message@entry=0x55c36e230970 "assertion failed: (hash_table->ref_count > 0)") at gtestutils.c:2356 #3 0x00007ff1e4eea33a in g_assertion_message_expr (domain=domain@entry=0x7ff1e4f3112e "GLib", file=file@entry=0x7ff1e4f359f2 "ghash.c", line=line@entry=373, func=func@entry=0x7ff1e4f35c90 <__func__.10250> "g_hash_table_lookup_node", expr=expr@entry=0x7ff1e4f359d8 "hash_table->ref_count > 0") at gtestutils.c:2371 Python Exception <class 'TypeError'> iter() returned non-iterator of type '_iterator': #4 0x00007ff1e4eb25a6 in g_hash_table_remove_internal (hash_return=<synthetic pointer>, key=0x55c36e421660, hash_table=0x55c36e45b920) at ghash.c:373 Python Exception <class 'TypeError'> iter() returned non-iterator of type '_iterator': #5 0x00007ff1e4eb25a6 in g_hash_table_remove_internal (hash_table=0x55c36e45b920, key=0x55c36e421660, notify=1) at ghash.c:1352 #6 0x000055c36d7a506e in agent_delete_cb (proxy=<optimized out>, result=0x55c36e1fc2e0, user_data=0x55c36e421660) at settings/nm-secret-agent.c:426 #7 0x00007ff1e547aba7 in g_simple_async_result_complete (simple=0x55c36e1fc2e0 [GSimpleAsyncResult]) at gsimpleasyncresult.c:763 #8 0x00007ff1e54e27a1 in reply_cb (connection=<optimized out>, res=<optimized out>, user_data=0x55c36e1fc2e0) at gdbusproxy.c:2623 #9 0x00007ff1e547aba7 in g_simple_async_result_complete (simple=0x7ff1d0008880 [GSimpleAsyncResult]) at gsimpleasyncresult.c:763 #10 0x00007ff1e54d78cc in g_dbus_connection_call_done (source=<optimized out>, result=<optimized out>, user_data=0x55c36e4890f0) at gdbusconnection.c:5502 #11 0x00007ff1e547aba7 in g_simple_async_result_complete (simple=0x55c36e2aade0 [GSimpleAsyncResult]) at gsimpleasyncresult.c:763 #12 0x00007ff1e547ac09 in complete_in_idle_cb (data=<optimized out>) at gsimpleasyncresult.c:775 #13 0x00007ff1e4ec3a8a in g_main_context_dispatch (context=0x55c36e0943a0) at gmain.c:3122 #14 0x00007ff1e4ec3a8a in g_main_context_dispatch (context=context@entry=0x55c36e0943a0) at gmain.c:3737 #15 0x00007ff1e4ec3e20 in g_main_context_iterate (context=0x55c36e0943a0, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at gmain.c:3808 #16 0x00007ff1e4ec4142 in g_main_loop_run (loop=0x55c36e094460) at gmain.c:4002 #17 0x000055c36d677256 in main (argc=1, argv=0x7ffca3bedd08) at main.c:449 > secret-agent: fix leak of @dbus_owner > (cherry picked from commit 833faf3bdbc0af74d09e11c5f081d47a51680d51) I don't think you meant to use -x here. (That commit doesn't exist on fdo.) Commit looks right. It looks like I introduced this bug in pre-gdbus. > secret-agent: rework handling of asynchronous request and cancelling > Use an opaque type NMSecretAgentCallId for the call-id. (would be nice if that was split out separately...) > When cancelling, the callback will be invoked synchronously, with an > GError indicating the cancelling (G_IO_ERROR_CANCELLED). Callbacks that are sometimes synchronous and sometimes asynchronous can be difficult to work with, and have been removed with extreme prejudice from most of GNOME... > - Don't strip_remote_error from NMSecretManager. NMAgentManager inspects > the remote error for the 'GetSecrets' call. Any users of > NMSecretManager should strip the error themselves (as they see fit). Ah, no, I missed a bit of gdbus porting here. Now that we're using gdbus everywhere, D-Bus errors get translated to the correct GError domain+code, so: - if (_nm_dbus_error_has_name (error, NM_DBUS_INTERFACE_SECRET_AGENT ".UserCanceled")) { + if (g_error_matches (error, NM_SECRET_AGENT_ERROR, NM_SECRET_AGENT_ERROR_USER_CANCELLED)) { (It looks like this is the only instance of this bug. Possibly _nm_dbus_error_has_name() should warn if the name starts with "org.freedesktop.NetworkManager".) > This also fixes a crash where handling of the asynchronous data was > messed up. what exactly was messed up before? > if (error) { >- nm_log_dbg (LOGD_AGENTS, "(%s) agent failed secrets request %p/%s/%s: (%d) %s", >+ if (g_error_matches (error, G_IO_ERROR, G_IO_ERROR_CANCELLED)) { >+ nm_log_dbg (LOGD_AGENTS, "(%s) get secrets request cancelled: %p/%s/%s", >+ nm_secret_agent_get_description (agent), >+ req, parent->detail, req->setting_name); >+ return; >+ } Won't returning here cause req to be leaked? > secret-agent: fix detection of disapearing secret-agent >+nm_bus_manager_connection_is_private (NMBusManager *self, >+ GDBusConnection *connection) You can just check if g_dbus_connection_get_unique_name() is NULL or not. >+ if (!new_owner || !*new_owner) { new_owner can't be NULL here. (In reply to Dan Winship from comment #26) > > secret-agent: fix leak of @dbus_owner > > > (cherry picked from commit 833faf3bdbc0af74d09e11c5f081d47a51680d51) > > I don't think you meant to use -x here. (That commit doesn't exist on fdo.) Correct. Fixed > Commit looks right. It looks like I introduced this bug in pre-gdbus. > > > > > secret-agent: rework handling of asynchronous request and cancelling > > > Use an opaque type NMSecretAgentCallId for the call-id. > > (would be nice if that was split out separately...) Done. New commit: "secret-agent: refactor call-id to be a (opaque pointer) type instead of a void pointer" > > When cancelling, the callback will be invoked synchronously, with an > > GError indicating the cancelling (G_IO_ERROR_CANCELLED). > > Callbacks that are sometimes synchronous and sometimes asynchronous can be > difficult to work with, and have been removed with extreme prejudice from > most of GNOME... Correct. The callbacks are always invoked synchronously when invoking the action. Only during "cancel()" and "dispose()", the callback will be invoked synchronously with respect to the cancel/dispose(). Especially, for the dispose case, that is unavoidable. > > - Don't strip_remote_error from NMSecretManager. NMAgentManager inspects > > the remote error for the 'GetSecrets' call. Any users of > > NMSecretManager should strip the error themselves (as they see fit). > > Ah, no, I missed a bit of gdbus porting here. Now that we're using gdbus > everywhere, D-Bus errors get translated to the correct GError domain+code, > so: > > - if (_nm_dbus_error_has_name (error, NM_DBUS_INTERFACE_SECRET_AGENT > ".UserCanceled")) { > + if (g_error_matches (error, NM_SECRET_AGENT_ERROR, > NM_SECRET_AGENT_ERROR_USER_CANCELLED)) { > > (It looks like this is the only instance of this bug. Possibly > _nm_dbus_error_has_name() should warn if the name starts with > "org.freedesktop.NetworkManager".) How is the fixup! > > This also fixes a crash where handling of the asynchronous data was > > messed up. > > what exactly was messed up before? I don't really know. There are several backtraces of crashes attached to this BZ. The indicate that we have dangling pointers inside priv->requests. I didn't investigate how that happens exactly, because I refactored the code before I noticed the bug. But the branches fixes the crashes for me. > > if (error) { > >- nm_log_dbg (LOGD_AGENTS, "(%s) agent failed secrets request %p/%s/%s: (%d) %s", > >+ if (g_error_matches (error, G_IO_ERROR, G_IO_ERROR_CANCELLED)) { > >+ nm_log_dbg (LOGD_AGENTS, "(%s) get secrets request cancelled: %p/%s/%s", > >+ nm_secret_agent_get_description (agent), > >+ req, parent->detail, req->setting_name); > >+ return; > >+ } > > Won't returning here cause req to be leaked? I think not. We return here only when nm-agent-manager cancels the requests. I will think more about this... (TODO) > > secret-agent: fix detection of disapearing secret-agent > > >+nm_bus_manager_connection_is_private (NMBusManager *self, > >+ GDBusConnection *connection) > > You can just check if g_dbus_connection_get_unique_name() is NULL or not. fixed. > >+ if (!new_owner || !*new_owner) { > > new_owner can't be NULL here. fixed. Thanks, Repushed. (In reply to Thomas Haller from comment #27) > > (It looks like this is the only instance of this bug. Possibly > > _nm_dbus_error_has_name() should warn if the name starts with > > "org.freedesktop.NetworkManager".) > > How is the fixup! Good, but should probably update the docs too: This should only be used for "foreign" D-Bus errors (eg, errors from BlueZ or wpa_supplicant). All NetworkManager D-Bus errors should be properly mapped by gdbus to one of the domains/codes in nm-errors.h Oh, also, the commit message needs to be updated to match the squashed commit. > > Won't returning here cause req to be leaked? > > I think not. We return here only when nm-agent-manager cancels the requests. > I will think more about this... (TODO) Hm... ok. I think the rest looks right. > secret-agent: rework handling of asynchronous request and cancelling s/NMSecretManager/NMSecretAgent/ in commit message? > secret-agent: fix detection of disapearing secret-agent disappearing "The signal "nofity:g-name-owner" is only emited for well-know" notify, emitted I'm not very familiar with agents management, but at first look the branch looks good to me. (In reply to Dan Winship from comment #28) > > > Won't returning here cause req to be leaked? > > > > I think not. We return here only when nm-agent-manager cancels the requests. > > I will think more about this... (TODO) > > Hm... ok. I think it is correct. There are only two callers of nm_secret_agent_cancel_secrets(): get_cancel_cb() and request_next_agent(). get_cancel_cb() is only invoked inside request_free() -- so there is noting to free. And during request_next_agent() we should already be in process of doing the necessary cleanup. Not really sure though. IMO NMAgentManager should be significantly simplified. It's quite confusing (to me). Squashed all fixup commits, rebased and repushed. IMO it's ready to be merged. I only doubt about how the subscription to "NameOwnerChanged" signal works (nm_secret_agent_new() and _on_disconnected_name_owner_changed()). AFAIS, it tests OK and it seems to correspond to the implementation in GDBusProxy. What I don't understand, is that we pass priv->dbus_owner as "arg0" to g_dbus_connection_signal_subscribe. Is that all we need to ensure that we get the correct signal in _on_disconnected_name_owner_changed()? Could it not happen, that we get called in g_dbus_connection_signal_subscribe() for a wrong connection? Can somebody who understands this confirm that this is correct? (In reply to Thomas Haller from comment #31) > What I don't understand, is that we pass priv->dbus_owner as "arg0" to > g_dbus_connection_signal_subscribe. Is that all we need to ensure that we > get the correct signal in _on_disconnected_name_owner_changed()? Yes. It means you will only receive the signal when the signal's first argument matches that arg0. branch merged to upstream master as: http://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=da1b6c2c3283ae15b9e223c8dc45661447527a22 Closing BZ as it was tracking an issue with upstream master -- and got fixed there. No builds in RHEL were affected. |