Bug 1285922

Summary: Invalid option in nova.conf
Product: Red Hat OpenStack Reporter: Andrew Beekhof <abeekhof>
Component: openstack-nova Assignee: Artom Lifshitz <alifshit>
Status: CLOSED NOTABUG QA Contact: nlevinki <nlevinki>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 8.0 (Liberty) CC: abeekhof, alifshit, berrange, dasmith, eglynn, fdinitto, kchamart, michele, rscarazz, sbauza, sferdjao, sgordon, srevivo, vromanso
Target Milestone: --- Keywords: ZStream
Target Release: 8.0 (Liberty)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-06-06 14:14:57 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1185030    
Attachments:
Description Flags
nova.conf from a clean ospd8 installation. none

Description Andrew Beekhof 2015-11-27 00:49:39 UTC
Description of problem:

/etc/nova/nova.conf contains the deprecated/removed 'verbose' option.
On compute nodes, where the NovaCompute resource agent is used, this prevents startup when the service is not started from the foreground.

Version-Release number of selected component (if applicable):

openstack-nova-common-12.0.0-2.el7ost.noarch

How reproducible:

100%

Steps to Reproduce:
1. Install with director
2. Perform the steps from https://access.redhat.com/articles/1544823
3. Run: crm_resource -r nova-compute --force-start -V

Actual results:

Nothing sent to /var/log/nova/nova-compute.log
Process does not stay up for more than a few seconds

Expected results:

After removing the verbose option completely:
- Logs are populated
- Process stays up

Additional info:

I'm not sure who populates nova.conf or what the defaults are, so this may well be a director bug instead.

I will leave it as an exercise for the nova team to work out why the process only exits silently if the script that calls it has been fork()+exec()'d with pipes for stdout and stderr.
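
One way to approximate that launch mode outside Pacemaker is to run the command with both stdout and stderr attached to a pipe and compare its exit status with a foreground run. The sketch below only assumes that a pipe on both descriptors is enough to mimic the condition; `run_piped` is a hypothetical helper, not part of any resource agent.

```shell
# Hypothetical helper: run a command with stdout and stderr both attached
# to a pipe (instead of a terminal) and return the command's exit status.
run_piped() {
    status_file=$(mktemp) || return 1
    # The braces run on the left side of the pipe, so fd 1 and fd 2 of the
    # command are the pipe that feeds cat.
    { "$@" 2>&1; echo "$?" > "$status_file"; } | cat
    rc=$(cat "$status_file")
    rm -f "$status_file"
    return "$rc"
}
```

For example, `run_piped su nova -s /bin/sh -c /usr/bin/nova-compute; echo "exit status: $?"` would show whatever the process prints before it goes away, together with its exit status.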

Comment 2 Artom Lifshitz 2015-12-08 15:50:11 UTC
Hi Andrew,

The verbose option is not deprecated and its presence should not cause nova compute to crash. In what section of the config file ([DEFAULT], [libvirt], etc) is the 'verbose' line that you are removing? Could there be something else on that line that causes the crash (non printing characters, for example?) When the bug manifests itself, is the compute log always empty, even with debug enabled?

Thanks!

Comment 3 Andrew Beekhof 2015-12-08 22:10:15 UTC
(In reply to Artom Lifshitz from comment #2)
> Hi Andrew,
> 
> The verbose option is not deprecated

The config file disagrees:

# If set to false, will disable INFO logging level, making WARNING the default.
# (boolean value)
# This option is deprecated for removal.
# Its value may be silently ignored in the future.

> and its presence should not cause nova
> compute to crash. 

I agree. But it does.

> In what section of the config file ([DEFAULT], [libvirt],
> etc) is the 'verbose' line that you are removing? 

It seems to be in DEFAULT under

#
# From oslo.log
#


> Could there be something
> else on that line that causes the crash (non printing characters, for
> example?) 

Maybe, but that one change is the only one I needed to make.
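
A quick way to rule stray bytes in or out would be something like the sketch below. `find_odd_bytes` is a hypothetical helper; it assumes the option is written as `name=value` on a single line and flags any byte that is not printable ASCII, a space or a tab (so a trailing carriage return would be reported too).

```shell
# Hypothetical helper: print every assignment of option $2 in file $1 whose
# line contains a byte outside printable ASCII, space and tab.
find_odd_bytes() {
    LC_ALL=C grep -n "^[[:space:]]*$2[[:space:]]*=" "$1" |
        LC_ALL=C grep '[^[:print:][:blank:]]'
}
```

`find_odd_bytes /etc/nova/nova.conf verbose` prints nothing (and returns non-zero) when the line is clean.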

> When the bug manifests itself, is the compute log always empty,
> even with debug enabled?

Correct. I had to litter the code with print statements until I could find a cause.

Comment 4 Artom Lifshitz 2015-12-16 18:56:28 UTC
Do you still have the original environment in which you first noticed the bug? Failing that, is the original nova.conf available? My feeling is that oslo.log is crashing because of something on that verbose line. It may be put there by osp-director, but the root cause would be in oslo.log.

Comment 5 Fabio Massimo Di Nitto 2015-12-17 07:09:06 UTC
(In reply to Artom Lifshitz from comment #4)
> Do you still have the original environment in which you first noticed the
> bug? Failing that, is the original nova.conf available? My feeling is that
> oslo.log is crashing because of something on that verbose line. It may be
> put there by osp-director, but the root cause would be in oslo.log.

The environment is deployed by OSPd 8. You should have access to such deployment including nova.conf etc.

Our development environment is wiped and reinstalled every night as part of CI. We can't put it on hold for weeks at a time for every bug we find :)

If you are unable to deploy with OSPd 8, please contact rasca or bandini on #rhos-pidone and they can provide you with a copy of nova.conf

Comment 6 Artom Lifshitz 2015-12-18 21:36:11 UTC
> Our development environment is wiped and reinstalled every night as part of
> CI. We can't put it on hold for weeks at a time for every bug we find :)

For sure :) I was hoping that perhaps the offending nova.conf had been saved somewhere as part of this BZ.

> If you are unable to deploy with OSPd 8, please contact rasca or bandini on
> #rhos-pidone and they can provide you with a copy of nova.conf

Am I understanding correctly that rasca or bandini would be able to provide a nova.conf as deployed by OSPd? And that therefore said nova.conf should contain whatever is triggering this bug (if my theory is correct)?

Comment 7 Fabio Massimo Di Nitto 2015-12-19 03:52:42 UTC
(In reply to Artom Lifshitz from comment #6)
> > Our development environment is wiped and reinstalled every night as part of
> > CI. We can't put it on hold for weeks at a time for every bug we find :)
> 
> For sure :) I was hoping that perhaps the offending nova.conf had been saved
> somewhere as part of this BZ.
> 
> > If you are unable to deploy with OSPd 8, please contact rasca or bandini on
> > #rhos-pidone and they can provide you with a copy of nova.conf
> 
> Am I understanding correctly that rasca or bandini would be able to provide
> a nova.conf as deployed by OSPd? And that therefore said nova.conf should
> contain whatever is triggering this bug (if my theory is correct)?

Yes.

Comment 8 Raoul Scarazzini 2015-12-21 09:42:56 UTC
Created attachment 1108270
nova.conf from a clean ospd8 installation.

Comment 9 Raoul Scarazzini 2015-12-21 09:43:21 UTC
I attached the nova.conf file that comes from our ospd8 deployment.

As Andrew said the two options on which we're working now to make things work are:

debug=False (which is the same as the default)
verbose=False (which is NOT the default)

If you need something else don't hesitate to ask.
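
To compare these two values between deployments, a small reader along the following lines could be used. This is a sketch under assumptions: `conf_get` is a hypothetical helper, it only understands `[section]` headers, `#` comment lines and `name = value` lines, and it reports the last assignment it sees in the section, which may or may not match how the service itself resolves duplicates.

```shell
# Hypothetical helper: print the value of option $3 in section $2 of the
# INI-style file $1 (last assignment in that section wins).
conf_get() {
    awk -v section="$2" -v key="$3" '
        /^[ \t]*#/ { next }                      # skip comment lines
        /^[ \t]*\[.*\][ \t]*$/ {                 # section header
            current = $0
            sub(/^[ \t]*\[/, "", current)
            sub(/\][ \t]*$/, "", current)
            next
        }
        current == section {
            pos = index($0, "=")
            if (pos == 0) next
            name = substr($0, 1, pos - 1)
            value = substr($0, pos + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            gsub(/^[ \t]+|[ \t]+$/, "", value)
            if (name == key) { found = value; seen = 1 }
        }
        END { if (seen) print found }
    ' "$1"
}
```

For instance, `conf_get /etc/nova/nova.conf DEFAULT verbose` and `conf_get /etc/nova/nova.conf DEFAULT debug` print the configured values, or nothing if the option is absent or commented out.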

Comment 10 Raoul Scarazzini 2015-12-24 15:46:50 UTC
So after a lot of unsuccessful tries, we did manage to reproduce the problem under strace.
We have copied two strace outputs at these links:

The one not working: http://file.rdu.redhat.com/~rscarazz/BZ1285922/strace-notworking.log.gz
The one working: http://file.rdu.redhat.com/~rscarazz/BZ1285922/strace-working-s128.log.gz (note that this one had the -s128 parameter to strace so the strings are longer)

The straces were taken by manually launching the resource agent, like this:
export OCF_ROOT=/usr/lib/ocf
export OCF_RESKEY_auth_url=http://172.20.0.10:5000/v2.0/
export OCF_RESKEY_username=admin
export OCF_RESKEY_password=KAcGkxF6Nkw2AgEFJ8yUqEQu2
export OCF_RESKEY_tenant_name=admin
export OCF_RESKEY_domain=localdomain
strace -f -o /tmp/strace.log /usr/lib/ocf/resource.d/openstack/NovaCompute start

What is interesting is that in the non-working case, there is no invocation at all
of nova-compute, which seems to point the finger at something fishy in the OCF script.
Specifically, the nova_pid function does not seem very robust.
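
The comparison between the two traces can be repeated with a one-line check such as the sketch below. `count_execve` is a hypothetical helper; it assumes an uncompressed log written by `strace -f -o`, where each exec attempt appears as an `execve("<path>", ...)` line.

```shell
# Hypothetical helper: count the execve() calls of binary $2 recorded in
# the strace log $1 (as written by "strace -f -o").
count_execve() {
    grep -cF "execve(\"$2\"" "$1"
}
```

`count_execve /tmp/strace.log /usr/bin/nova-compute` prints 0 for a trace in which nova-compute was never exec'd.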

We will continue the analysis once back from PTO.

Comment 11 Andrew Beekhof 2016-01-04 04:48:35 UTC
(In reply to Raoul Scarazzini from comment #10)

> What is interesting is that in the non-working case, there is no invocation
> at all
> of nova-compute, which seems to point the finger at something fishy in the
> OCF script.

I disagree, but for the sake of argument let's pretend I agree... how do you explain that removing that one logging option changes a "never works" situation into an "always works" one?

> Specifically the nova_pid function seems not all too robust.

Shouldn't be related since the process would be stopped and there would be nothing to detect.

Comment 12 Raoul Scarazzini 2016-01-04 16:32:27 UTC
(In reply to Andrew Beekhof from comment #11)
> I disagree, but for the sake of argument let's pretend I agree... how do you
> explain that removing that one logging option changes a "never works"
> situation into an "always works" one?

This never happened on the environment I'm testing (OSPd Liberty with the 20 steps from the Knowledge Base). That option never had an influence on the process.
What I've done to make things work is this:

1) Install a clean OSPd 7 environment with 3 controllers and 4 computes;
2) Download the upstream scripts for NovaEvacuate, NovaCompute and fence_compute.py;
*** The most important step ***
3) Apply this patch in NovaCompute to use systemctl instead of nova-compute to launch the service:

--- NovaCompute.upstream        2016-01-04 13:05:18.038937203 +0000
+++ NovaCompute.rasca   2016-01-04 13:06:41.830694586 +0000
@@ -150,7 +150,7 @@
 }
 
 nova_pid() {
-    ps axf | grep python.*nova-compute | grep -v grep | awk '{print $1}'
+    pgrep -u nova -f '/usr/bin/python2.*/usr/bin/nova-compute.*'
 }
 
 nova_start() {
@@ -180,7 +180,7 @@
     fi
 
     export LIBGUESTFS_ATTACH_METHOD=appliance
-    su nova -s /bin/sh -c /usr/bin/nova-compute &
+    /bin/systemctl start openstack-nova-compute
 
     rc=$OCF_NOT_RUNNING
     ocf_log info "Waiting for nova to start"
@@ -213,7 +213,7 @@
 nova_stop() {
     pid=$(nova_pid)
     if [ "x$pid" != x ]; then
-       su nova -c "kill -TERM $pid" -s /bin/bash
+        /bin/systemctl stop openstack-nova-compute
     fi
 
     while [ "x$pid" != x ]; do
4) Copy NovaEvacuate, NovaCompute and fence_compute.py on all the controllers (not really necessary) and computes;
********************************
5) Apply all the 20 steps from the KB article, changing some of them to add no_shared_storage option on nova-evacuate, nova-compute (which is cloned) and also fence-nova stonith resource (which is based upon fence_compute stonith agent);
6) Launch 6 instances on the same compute node;
7) Schedule an "echo c > /proc/sysrq-trigger" on the chosen compute node at a certain time;

After this all the instances are successfully migrated to a different host, so everything works as expected. Note that this never happened before, at least for me.

All the steps above are reproducible starting from a clean setup.

Comment 13 Andrew Beekhof 2016-01-04 21:40:19 UTC
Please see the additional info from the description.
It's always been known that the symptoms are not reproducible that way, but switching to systemd just avoids rather than fixes the underlying problem.

Comment 14 Raoul Scarazzini 2016-01-05 13:48:57 UTC
OK, I agree with this. What I don't understand is that everything seems to start from a nova.conf misconfiguration, but in my tests this does not affect anything (with either a true or a false setting).

So the point is to understand why the service works under systemd but not when invoked inside the resource agent, even though the way it is launched is apparently the same.
This is the content of the systemd "Service" declaration:

Environment=LIBGUESTFS_ATTACH_METHOD=appliance
Type=notify
NotifyAccess=all
TimeoutStartSec=0
Restart=always
User=nova
ExecStart=/usr/bin/nova-compute

and this is how it is invoked inside the resource agent:

export LIBGUESTFS_ATTACH_METHOD=appliance
su nova -s /bin/sh -c /usr/bin/nova-compute &

So, apparently, everything is the same.
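
One difference the two snippets do not show is what the process's standard descriptors are attached to in each case. A sketch for checking that on a running process follows; `show_stdio` is a hypothetical helper and relies on the Linux /proc filesystem and the `readlink` utility.

```shell
# Hypothetical helper: show what fds 0, 1 and 2 of process $1 point at,
# using the Linux /proc filesystem.
show_stdio() {
    for fd in 0 1 2; do
        target=$(readlink "/proc/$1/fd/$fd" 2>/dev/null) || target="(unavailable)"
        echo "fd $fd -> $target"
    done
}
```

Running `show_stdio <pid>` against nova-compute in both setups (as root or as the nova user) would show whether stdout and stderr are pipes, sockets, a terminal or a file in each launch mode.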

As discussed, I will try to separate the evacuation part from the nova-compute service, which will become a standalone resource dependent on the first one, and then I'll update this BZ.

Comment 15 Artom Lifshitz 2016-06-06 14:14:57 UTC
Hello,

This bug has been open for a while without any updates. I'm going to close it. If you feel there's anything here that a Nova engineer can help with, don't hesitate to reopen this bug.

Cheers!