Bug 635006
| Summary: | Network does not come back up on resume from hibernate | | |
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Orion Poplawski <orion> |
| Component: | NetworkManager | Assignee: | Dan Williams <dcbw> |
| Status: | CLOSED DUPLICATE | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | medium | Priority: | low |
| Version: | 13 | CC: | dcbw, sgallagh |
| Hardware: | All | OS: | Linux |
| Doc Type: | Bug Fix | Last Closed: | 2010-09-23 21:31:14 UTC |
Description
Orion Poplawski
2010-09-17 15:10:40 UTC
What version of SSSD are you running?

sssd-1.2.2-21.fc13.x86_64

That's very strange. That error message implies that you were able to connect to the LDAP server, but not to initiate TLS. That "(null)" is supposed to be the reason returned by the LDAP server, but we're not getting one. Is it possible that the connection dropped between the bind and the request to start TLS? This sounds like NM to me (sorry!)

Well, I'm pretty sure the network wasn't up at all at that point. No idea why NM didn't bring it back up, since there are no messages from it. I've turned on more debugging, so hopefully I will have more info the next time it happens.

Stephen - just a thought. Could sssd have a socket connection left over to the LDAP server that it is trying to reuse, timing out when the machine comes back up from hibernate? Does sssd need any hooks into the pm system to bump it when a machine resumes? Otherwise I'll see what the NM debug logs show it trying to do on resume.

(In reply to comment #5)
> Stephen - just a thought. Could sssd have a socket connection left over to the
> LDAP server that it is trying to reuse and timing out when the machine comes
> back up from hibernate? Does sssd need any hooks into the pm system to bump it
> when a machine resumes? Otherwise I'll see what the NM debug logs show it
> trying to do on resume.

It's possible, but I'd need to see detailed SSSD debug logs to be certain.

Most recently, on hibernate:

```
Sep 21 16:11:03 kolea NetworkManager[20071]: <info> (eth0): cleaning up...
Sep 21 16:11:03 kolea NetworkManager[20071]: <info> (eth0): taking down device.
Sep 21 16:11:03 kolea kernel: tg3 0000:02:00.0: PME# enabled
Sep 21 16:11:03 kolea kernel: pci0000:00: wake-up capability enabled by ACPI
Sep 21 16:11:03 kolea NetworkManager[20071]: <debug> [1285107063.655884] [nm-netlink-monitor.c:117] link_msg_handler(): netlink link message: iface idx 2 flags 0x1002
Sep 21 16:11:03 kolea NetworkManager[20071]: <info> (eth0): carrier now OFF (device state 1)
```

On wakeup:

```
Sep 22 10:11:39 kolea kernel: Restarting tasks ... done.
Sep 22 10:11:39 kolea kernel: PM: Basic memory bitmaps freed
Sep 22 10:11:42 kolea ntpd[1346]: Deleting interface #11 eth0, fe80::213:d3ff:fef7:f6e0#123, interface stats: received=0, sent=0, dropped=0, active_time=33302 secs
Sep 22 10:11:42 kolea ntpd[1346]: Deleting interface #10 eth0, 192.168.0.44#123, interface stats: received=190, sent=190, dropped=0, active_time=33302 secs
Sep 22 10:11:42 kolea ntpd[1346]: 65.44.101.180 interface 192.168.0.44 -> (null)
Sep 22 10:11:48 kolea NetworkManagerDispatcher: performing autofs restart
Sep 22 10:11:52 kolea automount[30427]: umount_autofs_indirect: ask umount returned busy /data
Sep 22 10:11:58 kolea automount[30427]: umount_autofs_indirect: ask umount returned busy /home
Sep 22 10:12:04 kolea run-parts(/etc/cron.hourly)[17108]: finished zz-hibernate
Sep 22 10:12:04 kolea kernel: SELinux: initialized (dev autofs, type autofs), uses genfs_contexts
Sep 22 10:12:04 kolea automount[17110]: bind_ldap_simple: lookup(ldap): Unable to bind to the LDAP server: (default), error Can't contact LDAP server
```

NM just appears very unresponsive. The system was then shut down and restarted at 10:15.
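One hypothesis raised earlier in the thread is that sssd keeps a cached LDAP socket across hibernate and then blocks on it after resume. As a minimal sketch (this is not sssd's actual code; `connection_alive` is a hypothetical helper), a daemon could cheaply probe a cached connection before reusing it, rather than issuing a request that hangs until a long timeout:

```python
import socket

def connection_alive(sock):
    """Best-effort check of whether a cached TCP connection is still usable.

    Peeks at the socket without consuming data: a cleanly closed peer
    yields b'', a reset raises OSError, and an open-but-idle connection
    raises BlockingIOError (no data pending).
    """
    try:
        sock.setblocking(False)
        data = sock.recv(1, socket.MSG_PEEK)
        return data != b''          # b'' means the peer closed the connection
    except BlockingIOError:
        return True                 # nothing to read, but still connected
    except OSError:
        return False                # reset or other transport error
    finally:
        sock.setblocking(True)

# Demonstration with a local socket pair standing in for an LDAP connection.
client, server = socket.socketpair()
print(connection_alive(client))     # peer open: True
server.close()
print(connection_alive(client))     # peer gone: False; reuse would stall
```

This only catches connections the peer has actually torn down; a connection that silently died during hibernate (no FIN or RST ever received) still looks alive, which is why TCP keepalives or an explicit "network changed" signal from the pm system would be needed in the general case.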
sssd seems pretty quiet:

```
(Wed Sep 22 10:11:40 2010) [sssd[be[default]]] [cleanup_users_send] (9): Cache expiration is set to 0 days
(Wed Sep 22 10:11:42 2010) [sssd[be[default]]] [sysdb_search_users_check_handle] (6): Search users with filter: (&(objectclass=user)(&(!(dataExpireTimestamp=0))(dataExpireTimestamp<=1285171900)(!(lastLogin=*))))
(Wed Sep 22 10:11:42 2010) [sssd[be[default]]] [cleanup_users_process] (4): Found 1 expired user entries!
(Wed Sep 22 10:11:43 2010) [sssd[be[default]]] [cleanup_users_delete] (9): About to delete user orion
(Wed Sep 22 10:11:43 2010) [sssd[be[default]]] [sldb_request_callback] (6): LDB Error: 1 (ltdb modify without transaction)
(Wed Sep 22 10:11:43 2010) [sssd[be[default]]] [sysdb_op_default_done] (6): Error: 5 (Input/output error)
(Wed Sep 22 10:11:43 2010) [sssd[be[default]]] [cleanup_users_delete_done] (2): User delete returned 5 (Input/output error)
(Wed Sep 22 10:11:43 2010) [sssd[be[default]]] [ldap_id_cleanup_users_done] (1): Failed to cleanup users (5 [Input/output error]), retrying later!
(Wed Sep 22 10:11:43 2010) [sssd[be[default]]] [ldap_id_cleanup_set_timer] (6): Scheduling next cleanup at 1285182702.1047662
(Wed Sep 22 10:11:43 2010) [sssd[be[default]]] [sbus_dispatch] (9): dbus conn: 2267AE0
(Wed Sep 22 10:11:43 2010) [sssd[be[default]]] [sbus_dispatch] (9): Dispatching.
(Wed Sep 22 10:11:43 2010) [sssd[be[default]]] [sbus_message_handler] (9): Received SBUS method [ping]
(Wed Sep 22 10:11:44 2010) [sssd[be[default]]] [sbus_dispatch] (9): dbus conn: 2267AE0
(Wed Sep 22 10:11:45 2010) [sssd[be[default]]] [sbus_dispatch] (9): Dispatching.
(Wed Sep 22 10:11:45 2010) [sssd[be[default]]] [sbus_message_handler] (9): Received SBUS method [resInit]
(Wed Sep 22 10:11:45 2010) [sssd[be[default]]] [resolv_reread_configuration] (4): Recreating all c-ares channels
(Wed Sep 22 10:11:45 2010) [sssd[be[default]]] [recreate_ares_channel] (4): Initializing new c-ares channel
(Wed Sep 22 10:11:46 2010) [sssd[be[default]]] [recreate_ares_channel] (4): Destroying the old c-ares channel
(Wed Sep 22 10:11:50 2010) [sssd[be[default]]] [sbus_dispatch] (9): dbus conn: 2267AE0
(Wed Sep 22 10:11:50 2010) [sssd[be[default]]] [sbus_dispatch] (9): Dispatching.
(Wed Sep 22 10:11:50 2010) [sssd[be[default]]] [sbus_message_handler] (9): Received SBUS method [ping]
...
```

Okay, more info. I put some more debugging into NM and determined the following. NM is ignoring the wake message coming from the pm-utils thaw script because dbus_bus_get_unix_user (in nm_auth_get_caller_uid) fails with a NameHasNoOwner DBusError message:

```
Could not get UID of name ':1.54': no such name
```

In dbus_bus_get_unix_user() the reply to dbus_connection_send_with_reply_and_block is NULL. I cannot seem to reproduce this by running (which the thaw script does):

```
dbus-send --system --dest=org.freedesktop.NetworkManager /org/freedesktop/NetworkManager org.freedesktop.NetworkManager.wake
```

as root from a shell - dbus_bus_get_unix_user() appears to work there. The system is being put into hibernate by a script run from cron, which may have something to do with it.

NM 20100831 and later also listen for upower events for suspend/resume, so that we don't have to bother with pm-utils problems like this anymore (pm-utils invocations of dbus-send don't stick around long enough to be able to authenticate the call). I'm sure we all agree that making sure the sleep/wake calls are authenticated is a good thing :) Is upower running on this machine?

upower was not installed; I'm running KDE, which doesn't appear to use it. Also, I need this to work with no one logged in at all.

What if the pm-utils calls to dbus-send used --print-reply? Would that allow them to be authenticated reliably?

Looks like this is essentially the same as bug 610299. Bug 477964 claims this was fixed in dbus 1.2.6, but apparently it was not, or it has been broken again.

*** This bug has been marked as a duplicate of bug 549089 ***
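The --print-reply idea floated above can be sketched as a pm-utils hook. This is an untested illustration, not the actual Fedora hook: the file path and hook name are assumptions, and whether blocking for the reply actually keeps the sender on the bus long enough for NM's UID lookup is exactly the open question in this bug.

```shell
#!/bin/sh
# Hypothetical /etc/pm/sleep.d/55NetworkManager hook (path is an assumption).
# --print-reply makes dbus-send block until NetworkManager answers, so the
# sending connection stays on the system bus while NM resolves its UID,
# instead of disconnecting immediately and triggering NameHasNoOwner.
case "$1" in
    thaw|resume)
        dbus-send --system --print-reply \
            --dest=org.freedesktop.NetworkManager \
            /org/freedesktop/NetworkManager \
            org.freedesktop.NetworkManager.wake
        ;;
esac
```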