Bug 1330694

Summary:	NM does not run pre-down scripts on suspend/sleep/hibernate
Product:	Red Hat Enterprise Linux 7	Reporter:	Orion Poplawski <orion>
Component:	NetworkManager	Assignee:	Beniamino Galvani <bgalvani>
Status:	CLOSED ERRATA	QA Contact:	Desktop QE <desktop-qa-list>
Severity:	high	Docs Contact:
Priority:	high
Version:	7.2	CC:	aloughla, atragler, bgalvani, fgiudici, lrintel, rkhan, thaller, vbenes
Target Milestone:	rc
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:	NetworkManager-1.4.0-0.1.git20160606.b769b4df.el7	Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2016-11-03 19:09:06 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Orion Poplawski 2016-04-26 17:41:25 UTC

Description of problem:

NetworkManager doesn't seem to emitted the StateChanged NM_STATE_DISCONNECTING signal when the system is put to suspend. Instead it just sends the NM_STATE_ASLEEP signal after disconnecting all network devices.

This results in dispatcher scripts in pre-down.d not being run when suspending the computer.  We need these to be able to unmount NFS filesystems before disconnecting from the network.

Version-Release number of selected component (if applicable):
NetworkManager-1.0.6-29.el7_2.x86_64

Comment 2 Beniamino Galvani 2016-05-07 13:27:49 UTC

Pushed branch bg/disconnect-on-suspend-rh1330694.

Comment 3 Thomas Haller 2016-05-08 14:19:58 UTC

>> device: add nm_device_queue_unmanaged_by_flags()

I don't like the name too much

- nm_device_set_unmanaged_flags() sets the flags (only)
- nm_device_set_unmanaged_by_*() set the flag and change the device state as needed.
- nm_device_queue_unmanaged_by_flags(): sets the flags and queues a device state (as needed).

What they all do in the first place is setting the unmanaged flags. Can we name them all with a nm_device_set_unmanaged_* prefix?

Maybe nm_device_set_unmanaged_by_flags_queue()?




>> sleep-monitor: add functions for delaying the suspension
    
You say: "If some action were not completed, callers must avoid using release_inhibitor() after the wake-up.", which is a very good advice, but it's unclear how the caller would know when he should avoid release_inhibitor() :)

I think nm_sleep_monitor_keep_inhibitor() should return an opaque handle that must be released exactly once. An internal sleep_signal() might invalidate the current release handles, but when the caller does release_inhibitory() on a stale handle, it gets released without actually touching the fd or inhibitor state.

How about the fixup commit?



>> manager: disconnect devices before unmanaging them

lgtm

Comment 4 Beniamino Galvani 2016-05-09 09:08:39 UTC

(In reply to Thomas Haller from comment #3)
> >> device: add nm_device_queue_unmanaged_by_flags()
>
> Maybe nm_device_set_unmanaged_by_flags_queue()?

Sounds good.

> >> sleep-monitor: add functions for delaying the suspension
>
> You say: "If some action were not completed, callers must avoid using
> release_inhibitor() after the wake-up.", which is a very good advice, but
> it's unclear how the caller would know when he should avoid
> release_inhibitor() :)

The caller must subscribe to the SLEEPING signal since
keep_inhibitor() can only be called in the handler for the signal. Any
use of the function outside the handler is subject to races. So, since
the caller is subscribed to the signal, it can certainly know when the
wake-up happens :)

Even if release_inhibitor() is called after the wake-up but before the
SIGNAL (sleep=false) is received, that is harmless. What is important
is that the function is not called after the signal, because it would
decrease the counter after it has been reset.

> I think nm_sleep_monitor_keep_inhibitor() should return an opaque handle
> that must be released exactly once. An internal sleep_signal() might
> invalidate the current release handles, but when the caller does
> release_inhibitory() on a stale handle, it gets released without actually
> touching the fd or inhibitor state.
>
> How about the fixup commit?

What's the purpose of the stale_list? Can we have only active_list and
ignore release() for non-active items?

Also, is there any advantage of this over a simple counter? Because
with the fixup callers have to store the handle and this would (only
slightly) complicate the next patch.

Comment 5 Thomas Haller 2016-05-09 14:03:58 UTC

(In reply to Beniamino Galvani from comment #4)
> (In reply to Thomas Haller from comment #3)
> > >> device: add nm_device_queue_unmanaged_by_flags()
> >
> > Maybe nm_device_set_unmanaged_by_flags_queue()?
> 
> Sounds good.
> 
> > >> sleep-monitor: add functions for delaying the suspension
> >
> > You say: "If some action were not completed, callers must avoid using
> > release_inhibitor() after the wake-up.", which is a very good advice, but
> > it's unclear how the caller would know when he should avoid
> > release_inhibitor() :)
> 
> The caller must subscribe to the SLEEPING signal since
> keep_inhibitor() can only be called in the handler for the signal. Any
> use of the function outside the handler is subject to races. So, since
> the caller is subscribed to the signal, it can certainly know when the
> wake-up happens :)
> 
> Even if release_inhibitor() is called after the wake-up but before the
> SIGNAL (sleep=false) is received, that is harmless. What is important
> is that the function is not called after the signal, because it would
> decrease the counter after it has been reset.

I don't know, I suspect not :)

Note that not matching every keep() with a release() call will mess up the counting. For example, take_inhibitor() sets inhibit_count to zero, so, you must no longer call release() for the keep() calls or you hit the assertion
g_return_if_fail (self->inhibit_count);
In other words, after sleep-monitor calls take_inhibitor(), the caller must know not to return the handle anymore.


 - receive sleep_signal(is_about_to_suspend=TRUE)
   - emit SLEEPING
     - NMManager calls keep_inhibitor()

shortly after:

  - receive sleep_signal(is_about_to_suspend=FALSE)
    - take_inhibitor() -- which resets inhibit_count=0
    - emit SLEEPING
      - if (manager_sleeping (self)) 
           (a) might register a second sleep-inihibitor(?)
        else
           (b) clear_sleep_devices(). Ok, in this case it's all good.
               After take_inhibitor() we forget about release_inhibit().

after (a), when device_sleep_cb() gets called and wants to call release_inhibitor(). Which is wrong and triggers an assertion.


So, at least you always must clear_sleep_devices() after sleeping_cb(is_about_to_suspend=FALSE);


-- It gets more complicated, because do_sleep_wake() is also triggered by `nmcli networking on|off`. So, how do they play together?





> > I think nm_sleep_monitor_keep_inhibitor() should return an opaque handle
> > that must be released exactly once. An internal sleep_signal() might
> > invalidate the current release handles, but when the caller does
> > release_inhibitory() on a stale handle, it gets released without actually
> > touching the fd or inhibitor state.
> >
> > How about the fixup commit?
> 
> What's the purpose of the stale_list? Can we have only active_list and
> ignore release() for non-active items?

> Also, is there any advantage of this over a simple counter? Because
> with the fixup callers have to store the handle and this would (only
> slightly) complicate the next patch.


I think, the pattern should be that you must exactly return the handle once. Everything else, is hard to get right. And it's even harder to verify/review.

The handles and the stale-list enforces this.

The stale-list is only(!) there to assert that the caller didn't mess up the counting. It could be omitted, but it's really a trivial implementation detail of NMSleepMonitor.

The handle is necessary, because an internal event can invalidate the handle -- that is, after take_inhibitor(), a release() must do nothing. You cannot express that with a counter.

With a counter only, the caller must know when (or when not) to release.

Comment 6 Thomas Haller 2016-05-09 14:14:53 UTC

Ok, maybe you can also do... dunno.


static void
sleeping_cb (NMSleepMonitor *monitor, gboolean is_about_to_suspend, gpoi...
{
    NMManager *self = user_data;

    _LOGD (LOGD_SUSPEND, "Received %s signal", is_about_to_suspend ?...
    clear_sleep_devices (self);
    _internal_sleep (self, is_about_to_suspend);
}

Comment 7 Beniamino Galvani 2016-05-11 10:09:55 UTC

(In reply to Thomas Haller from comment #5)
>  - receive sleep_signal(is_about_to_suspend=TRUE)
>    - emit SLEEPING
>      - NMManager calls keep_inhibitor()
> 
> shortly after:
> 
>   - receive sleep_signal(is_about_to_suspend=FALSE)
>     - take_inhibitor() -- which resets inhibit_count=0
>     - emit SLEEPING
>       - if (manager_sleeping (self)) 
>            (a) might register a second sleep-inihibitor(?)
>         else
>            (b) clear_sleep_devices(). Ok, in this case it's all good.
>                After take_inhibitor() we forget about release_inhibit().
> 
> after (a), when device_sleep_cb() gets called and wants to call
> release_inhibitor(). Which is wrong and triggers an assertion.

When the manager receives the SLEEPING signal after a wake up (a) should never be hit... or am I reading it wrong?

> So, at least you always must clear_sleep_devices() after
> sleeping_cb(is_about_to_suspend=FALSE);
> 

> I think, the pattern should be that you must exactly return the handle once.
> Everything else, is hard to get right. And it's even harder to verify/review.

Yeah, this makes sense, and actually isn't much more complicated for callers.
 
> -- It gets more complicated, because do_sleep_wake() is also triggered by
> `nmcli networking on|off`. So, how do they play together?

On networking off the manager should not take the inhibitor, apart from this I think both sleep and network-off should be handled in the same way (being careful not to deactivate an interface which is already deactivating). 

How about bg/disconnect-on-suspend-rh1330694-v2 ?

Comment 8 Thomas Haller 2016-05-11 11:09:48 UTC

(In reply to Beniamino Galvani from comment #7)
> (In reply to Thomas Haller from comment #5)

> How about bg/disconnect-on-suspend-rh1330694-v2 ?

I like this. It may be more lines of code, but something that I can understand (simply make sure not to loose a handle without returning it).


I think 
  if (!g_hash_table_contains (priv->sleep_devices, device)) {
is wrong. How about the fixup?

Comment 9 Beniamino Galvani 2016-05-11 11:54:29 UTC

(In reply to Thomas Haller from comment #8)

> I think 
>   if (!g_hash_table_contains (priv->sleep_devices, device)) {
> is wrong. How about the fixup?

Looks good, thanks!

Comment 10 Thomas Haller 2016-05-11 16:06:10 UTC

lgtm now

Comment 11 Francesco Giudici 2016-05-13 11:48:35 UTC

lgtm

Comment 12 Beniamino Galvani 2016-05-13 12:45:34 UTC

Merged to master:

https://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=3fdffab9550bf7e55815d7a4453854057de55f0a

Comment 14 Vladimir Benes 2016-09-22 10:55:27 UTC

pre-down script in /etc/NetworkManager/dispatcher.d/pre-down.d/97-disp was successfully executed during suspend/resume cycle

Comment 16 errata-xmlrpc 2016-11-03 19:09:06 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2581.html