Bug 1259039

Summary: tuned-profiles-nfv: sometimes tuned_params is empty in grub.cfg
Product: Red Hat Enterprise Linux 7
Reporter: Luiz Capitulino <lcapitulino>
Component: tuned
Assignee: Jaroslav Škarvada <jskarvad>
Status: CLOSED ERRATA
QA Contact: Tereza Cerna <tcerna>
Severity: unspecified
Docs Contact:
Priority: medium
Version: 7.2
CC: bhu, hhuang, jeder, jen, jskarvad, juzhang, lcapitulino, mtosatti, pagupta, tcerna, xfu
Target Milestone: rc
Flags: tcerna: needinfo-
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: tuned-2.7.0-1.el7
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-11-04 07:24:27 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Bug Depends On:
Bug Blocks: 1240765

Description Luiz Capitulino 2015-09-01 20:03:44 UTC
Description of problem:

After a fresh install of a KVM-RT host and the new nfv profiles, I sometimes get an empty tuned_params= in grub.cfg. Like this:

### BEGIN /etc/grub.d/00_tuned ###
set tuned_params=""
### END /etc/grub.d/00_tuned ###

I'm not sure this problem is 100% reproducible, because I got it only twice when setting up a new KVM-RT host. Here are the steps I recall performing:

1. Do a fresh RHEL7.2 install (this seems to always install tuned by default)
2. Install tuned-profiles-realtime and tuned-profiles-nfv
3. Edit /etc/tuned/realtime-virtual-host-variables.conf to isolate cores
4. Activate the realtime-virtual-host profile
  # tuned-adm profile realtime-virtual-host
5. tuned_params= is empty in /boot/grub2/grub.cfg

When I got this problem the second time, I tried to activate the realtime-virtual-host profile again. This caused tuned-adm to hang. Then I checked 'systemctl status tuned' and saw that it was deactivated... I killed tuned by hand, and suddenly it was back and tuned_params was added automatically.

Two important observations:

1. This doesn't happen on a VM, so you have to try it on a host
2. I'm very suspicious of the block of code in /usr/lib/tuned/realtime-virtual-host/script.sh which calls run-tscdeadline-latency.sh, which in turn calls QEMU. This operation can take several seconds. I wonder if something unexpected is happening because tuned is temporarily hung (like systemd killing tuned). But I may be wrong.

I'll keep working on getting this more reproducible.

Version-Release number of selected component (if applicable): tuned-2.5.1-3.el7.noarch, tuned-profiles-nfv-2.5.1-3.el7.noarch, tuned-profiles-realtime-2.5.1-3.el7.noarch

Comment 1 Luiz Capitulino 2015-09-02 14:01:30 UTC
OK, I know what's happening. Here's a simple reproducer:

1. If your active profile is realtime-virtual-host, change to another profile

  # tuned-adm active
  Current active profile: realtime-virtual-host
  # tuned-adm profile virtual-host

2. Check tuned_params in grub.cfg

  # grep tuned_params= /boot/grub2/grub.cfg
  set tuned_params=""

3. Activate the realtime-virtual-host profile and confirm it succeeded

 # tuned-adm profile realtime-virtual-host
 # echo $?
 0

4. Check tuned_params in grub.cfg

  # grep tuned_params= /boot/grub2/grub.cfg
  set tuned_params=""

If you reboot the machine at this point, you end up with no kernel command-line options. This has happened to me twice already. If you wait a few minutes (2 minutes on the machine I'm testing), grub.cfg will eventually be updated. Also, after you reboot, grub.cfg will eventually be updated too.

I believe this is because of run-tscdeadline-latency.sh, which takes a long time to run.
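
One way to avoid rebooting while grub.cfg is still stale is to poll the file after activating the profile. The following is a minimal sketch, not part of tuned: it assumes the 'set tuned_params="..."' line format shown above, and the function names tuned_params_value and wait_for_tuned_params, as well as the timeout and polling interval, are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the value of tuned_params found in a
# grub.cfg-style file (prints an empty line if the value is empty).
tuned_params_value() {
    # $1: path to grub.cfg, e.g. /boot/grub2/grub.cfg
    sed -n 's/^set tuned_params="\(.*\)"$/\1/p' "$1" | head -n 1
}

# Hypothetical helper: wait until tuned_params is non-empty.
# $1: path to grub.cfg, $2: timeout in seconds, $3: poll interval in seconds.
# Returns 0 once a non-empty value is seen, 1 if the timeout expires first.
wait_for_tuned_params() {
    cfg=$1; timeout=${2:-300}; interval=${3:-5}; waited=0
    while [ -z "$(tuned_params_value "$cfg")" ]; do
        [ "$waited" -ge "$timeout" ] && return 1
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}

# Possible usage on the host (the 300 s / 5 s values are assumptions):
#   tuned-adm profile realtime-virtual-host
#   wait_for_tuned_params /boot/grub2/grub.cfg 300 5 ||
#       echo "tuned_params is still empty, do not reboot yet" >&2
```

This only observes the symptom from the outside; it does not change when tuned-adm returns.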

Comment 3 Jaroslav Škarvada 2015-11-09 16:55:22 UTC
Can we lower the number of iterations the script does, or speed it up some other way?

Comment 5 Marcelo Tosatti 2015-11-19 23:25:46 UTC
(In reply to Jaroslav Škarvada from comment #3)
> Can we lower the number of iterations the script does, or speed it up
> some other way?

It's similar to a calibration process, so let's assume we could reduce it from 2 minutes to 30 seconds.
The problem would still exist, even if the time was reduced, right?

So I don't see why reducing the runtime of run-tscdeadline-latency.sh gets rid of the problem.

Comment 6 Luiz Capitulino 2015-11-20 17:12:04 UTC
Jarda,

Although this issue only happens with tuned-profiles-nfv right now, I'm under the impression that this is a more general design issue with tuned.

Is tuned designed for profiles which may block or take a long time to finish?

Marcelo can answer whether run-tscdeadline-latency.sh can finish sooner or whether we could come up with some trick, but the issue at hand is that we have a profile which may need minutes to finish. Is this allowed at all in tuned? If not, why not?

Comment 7 Jaroslav Škarvada 2015-11-20 17:25:49 UTC
(In reply to Luiz Capitulino from comment #6)
> Jarda,
> 
> Although this issue only happens with tuned-profiles-nfv right now, I'm
> under the impression that this is a more general design issue with tuned.
> 
> Is tuned designed for profiles which may block or take a long time to finish?
> 
IMHO it shouldn't be a problem. The only limit which could affect it is IIRC the 3-minute (or so) systemd service timeout, but IMHO it shouldn't be in effect for profile loading (I will check it).

Back to comment 1, it says that the problem occurs if the machine is rebooted during the calibration. I am afraid we cannot do much about it - attempts to block or postpone reboots/shutdowns are bad and usually ineffective (there is always a power switch somewhere which can be cut). Is there another reproducer?

Comment 8 Jaroslav Škarvada 2015-11-20 17:29:01 UTC
(In reply to Jaroslav Škarvada from comment #7)
> (In reply to Luiz Capitulino from comment #6)
> > Jarda,
> > 
> > Although this issue only happens with tuned-profiles-nfv right now, I'm
> > under the impression that this is a more general design issue with tuned.
> > 
> > Is tuned designed for profiles which may block or take a long time to finish?
> > 
> IMHO it shouldn't be a problem. The only limit which could affect it is IIRC
> the 3-minute (or so) systemd service timeout, but IMHO it shouldn't be in
> effect for profile loading (I will check it).
> 
> Back to comment 1, it says that the problem occurs if the machine is
> rebooted during the calibration. I am afraid we cannot do much about it -
> attempts to block or postpone reboots/shutdowns are bad and usually
> ineffective (there is always a power switch somewhere which can be cut). Is
> there another reproducer?

Maybe we could re-run the calibration on the next tuned start if the first calibration failed or was interrupted. I will check.

Comment 9 Luiz Capitulino 2015-11-20 18:23:34 UTC
That could be helpful. But IMHO the problem is that tuned-adm profile returns success while the profile configuration is still taking place. If, instead of returning success, it printed a warning saying that this profile takes longer than usual to finish and then waited for it to finish, this would solve the problem.

Comment 10 Jeff Nelson 2015-11-23 16:07:43 UTC
In Comment 9, Luiz writes:
>IMHO the problem is that tuned-adm profile returns success
>while the profile configuration is still taking place. 

Indeed, this is the fundamental flaw. Why? Because the profile configuration is going to alter some configuration files. This must be allowed to complete or the setup is not finished, but it appears to the user that it was successful.

Consider that the user may also want to make changes to the same files that tuned will modify. There is a race condition here and a possible loss of data if one overwrites the settings made by the other.

The user really needs to know when it's safe to proceed.

In Comment 8, Jaroslav writes:
>Maybe we could re-run the calibration on the next tuned start
>if the first calibration failed or was interrupted.

This would help. I encourage you to inform the user that this happened.

Comment 11 Jaroslav Škarvada 2016-07-14 16:46:19 UTC
I think this is resolved by sync mode, which is on by default, i.e. 'tuned-adm profile' doesn't return until the profile is fully applied.

There may still be a problem when the profile is autodetected and enabled upon tuned installation and the machine is then quickly rebooted. IMHO tuned now waits until the configuration is finished and then shuts down. It may still be hard-killed by systemd after the 90-second timeout, but it's better than nothing.

Even in such a case not much harm should be done - the profile is simply re-applied and the script re-run after the reboot.

For profiles with long-running initialization we should fix their script code to re-run the initialization if the previous initialization wasn't completed. I think it's the best we can do.
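
The "re-run the initialization if the previous one wasn't completed" idea can be sketched with a completion marker that is written only after the slow step succeeds. This is an illustration under stated assumptions, not the actual NFV profile script: the function name run_init_once and the marker path are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: run a slow initialization command at most once
# successfully. If an earlier run was interrupted (reboot, kill), the
# marker file is missing and the command is run again.
# $1: path to the completion marker, remaining args: command to run.
run_init_once() {
    marker=$1; shift
    # A previous run completed: nothing to do.
    [ -e "$marker" ] && return 0
    # Run the initialization; leave no marker behind on failure.
    "$@" || return 1
    # Record completion only after the command succeeded.
    : > "$marker"
}

# Possible usage from a profile script (the marker path is an assumption):
#   run_init_once /var/lib/tuned/example-calibration.done \
#       ./run-tscdeadline-latency.sh
```

Because the marker is created last, an interruption at any earlier point leaves the system in the "not initialized" state, so the next tuned start repeats the step.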

Comment 12 Jaroslav Škarvada 2016-07-14 16:52:39 UTC
(In reply to Jaroslav Škarvada from comment #11)
> I think this is resolved by sync mode, which is on by default, i.e.
> 'tuned-adm profile' doesn't return until the profile is fully applied.
> 
> There may still be a problem when the profile is autodetected and enabled
> upon tuned installation and the machine is then quickly rebooted. IMHO tuned
> now waits until the configuration is finished and then shuts down. It may
> still be hard-killed by systemd after the 90-second timeout, but it's better
> than nothing.
> 
> Even in such a case not much harm should be done - the profile is simply
> re-applied and the script re-run after the reboot.
> 
> For profiles with long-running initialization we should fix their script
> code to re-run the initialization if the previous initialization wasn't
> completed. I think it's the best we can do.

Regarding the NFV profile scripts, it seems all the relevant "helper" files are recreated if they don't exist. But I only did a quick code check.

Comment 15 Luiz Capitulino 2016-07-20 15:10:57 UTC
Jaroslav,

Just to make sure I understood correctly, you backported sync mode to 7.3 (or maybe did a full rebase upstream) and it fixes this issue?

Comment 16 Jaroslav Škarvada 2016-07-20 15:24:04 UTC
(In reply to Luiz Capitulino from comment #15)
> Jaroslav,
> 
> Just to make sure I understood correctly, you backported sync mode to 7.3
> (or maybe did a full rebase upstream) and it fixes this issue?

RHEL 7.3 is receiving the latest tuned-2.7.0 with sync mode enabled by default. I wasn't able to reproduce the problem, but there may still be some different problem. Feel free to test and let me know.

Comment 21 Tereza Cerna 2016-09-19 13:23:38 UTC
Closing as a SanityOnly bug.

Comment 24 errata-xmlrpc 2016-11-04 07:24:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2479.html