Description of problem:
/etc/nova/nova.conf contains the deprecated/removed 'verbose' option. On compute nodes, where NovaCompute is used, this prevents startup when not started from the foreground.

Version-Release number of selected component (if applicable):
openstack-nova-common-12.0.0-2.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Install with director
2. Perform the steps from https://access.redhat.com/articles/1544823
3. Run: crm_resource -r nova-compute --force-start -V

Actual results:
Nothing sent to /var/log/nova/nova-compute.log
Process does not stay up for more than a few seconds

Expected results:
Once the verbose option is removed completely:
- Logs are populated
- Process stays up

Additional info:
I'm not sure who populates nova.conf or what the defaults are, so this may well be a director bug instead. I will leave it as an exercise for the nova team to work out why the process only exits silently if the script that calls it has been fork()+exec()'d with pipes for stdout and stderr.
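For reference, a rough sketch of the workaround applied here (assuming the option sits in the [DEFAULT] section; crudini is just one convenient way to drop it and is assumed to be installed, editing the file by hand works too):

# Show where the option is set
grep -n '^verbose' /etc/nova/nova.conf

# Remove it, then retry the resource start
crudini --del /etc/nova/nova.conf DEFAULT verbose
crm_resource -r nova-compute --force-start -V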
Hi Andrew,

The verbose option is not deprecated and its presence should not cause nova compute to crash. In what section of the config file ([DEFAULT], [libvirt], etc.) is the 'verbose' line that you are removing? Could there be something else on that line that causes the crash (non-printing characters, for example)? When the bug manifests itself, is the compute log always empty, even with debug enabled?

Thanks!
(In reply to Artom Lifshitz from comment #2)
> Hi Andrew,
>
> The verbose option is not deprecated

The config file disagrees:

# If set to false, will disable INFO logging level, making WARNING the default.
# (boolean value)
# This option is deprecated for removal.
# Its value may be silently ignored in the future.

> and its presence should not cause nova
> compute to crash.

I agree. But it does.

> In what section of the config file ([DEFAULT], [libvirt],
> etc.) is the 'verbose' line that you are removing?

It seems to be in DEFAULT under

#
# From oslo.log
#

> Could there be something
> else on that line that causes the crash (non-printing characters, for
> example)?

Maybe, but that one change is the only one I needed to make.

> When the bug manifests itself, is the compute log always empty,
> even with debug enabled?

Correct. I had to litter the code with print statements until I could find a cause.
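In case it helps to double-check on any affected node, the deprecation notice can be confirmed straight from the deployed config file (sketch; the path is the standard one and may differ):

grep -n -B2 'deprecated for removal' /etc/nova/nova.conf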
Do you still have the original environment in which you first noticed the bug? Failing that, is the original nova.conf available? My feeling is that oslo.log is crashing because of something on that verbose line. It may be put there by osp-director, but the root cause would be in oslo.log.
(In reply to Artom Lifshitz from comment #4)
> Do you still have the original environment in which you first noticed the
> bug? Failing that, is the original nova.conf available? My feeling is that
> oslo.log is crashing because of something on that verbose line. It may be
> put there by osp-director, but the root cause would be in oslo.log.

The environment is deployed by OSPd 8. You should have access to such a deployment, including nova.conf etc.

Our development environment is wiped and reinstalled every night as part of CI. We can't put it on hold for weeks at a time for every bug we find :)

If you are unable to deploy with OSPd 8, please contact rasca or bandini on #rhos-pidone and they can provide you with a copy of nova.conf
> Our development environment is wiped and reinstalled every night as part of
> CI. We can't put it on hold for weeks at a time for every bug we find :)

For sure :) I was hoping that perhaps the offending nova.conf had been saved somewhere as part of this BZ.

> If you are unable to deploy with OSPd 8, please contact rasca or bandini on
> #rhos-pidone and they can provide you with a copy of nova.conf

Am I understanding correctly that rasca or bandini would be able to provide a nova.conf as deployed by OSPd? And that therefore said nova.conf should contain whatever is triggering this bug (if my theory is correct)?
(In reply to Artom Lifshitz from comment #6)
> > Our development environment is wiped and reinstalled every night as part of
> > CI. We can't put it on hold for weeks at a time for every bug we find :)
>
> For sure :) I was hoping that perhaps the offending nova.conf had been saved
> somewhere as part of this BZ.
>
> > If you are unable to deploy with OSPd 8, please contact rasca or bandini on
> > #rhos-pidone and they can provide you with a copy of nova.conf
>
> Am I understanding correctly that rasca or bandini would be able to provide
> a nova.conf as deployed by OSPd? And that therefore said nova.conf should
> contain whatever is triggering this bug (if my theory is correct)?

Yes.
Created attachment 1108270 [details] nova.conf from a clean ospd8 installation.
I attached the nova.conf file that comes from our ospd8 deployment. As Andrew said, the two options we're working on now to make things work are:

debug=False (which is the same as the default)
verbose=False (which is NOT the default)

If you need anything else, don't hesitate to ask.
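If it helps triage, this is roughly how we pull those two lines out of the deployed file (sketch; assumes both options are set in [DEFAULT] as in the attachment):

grep -nE '^(debug|verbose)[[:space:]]*=' /etc/nova/nova.conf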
So after a lot of unsuccessful tries, we did manage to reproduce the problem under strace. We have copied two strace outputs to these links:

The one not working:
http://file.rdu.redhat.com/~rscarazz/BZ1285922/strace-notworking.log.gz

The one working:
http://file.rdu.redhat.com/~rscarazz/BZ1285922/strace-working-s128.log.gz
(note that this one had the -s128 parameter to strace so the strings are longer)

The straces were taken by manually launching the resource agent, like this:

export OCF_ROOT=/usr/lib/ocf
export OCF_RESKEY_auth_url=http://172.20.0.10:5000/v2.0/
export OCF_RESKEY_username=admin
export OCF_RESKEY_password=KAcGkxF6Nkw2AgEFJ8yUqEQu2
export OCF_RESKEY_tenant_name=admin
export OCF_RESKEY_domain=localdomain
strace -f -o /tmp/strace.log /usr/lib/ocf/resource.d/openstack/NovaCompute start

What is interesting is that in the non-working case there is no invocation of nova-compute at all, which seems to point the finger at something fishy in the OCF script. Specifically, the nova_pid function does not seem all that robust. We will continue the analysis once back from PTO.
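A quick way to see that difference between the two traces (sketch; file names as uploaded above, and the pattern assumes strace's usual execve formatting):

# Should print the execve of nova-compute only for the working trace
zgrep 'execve(.*nova-compute' strace-working-s128.log.gz
zgrep 'execve(.*nova-compute' strace-notworking.log.gz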
(In reply to Raoul Scarazzini from comment #10)
> What is interesting is that in the non-working case there is no invocation
> of nova-compute at all, which seems to point the finger at something fishy
> in the OCF script.

I disagree, but for the sake of argument let's pretend I agree... how do you explain that removing that one logging option changes a "never works" situation into an "always works" one?

> Specifically, the nova_pid function does not seem all that robust.

Shouldn't be related, since the process would be stopped and there would be nothing to detect.
(In reply to Andrew Beekhof from comment #11)
> I disagree, but for the sake of argument let's pretend I agree... how do you
> explain that removing that one logging option changes a "never works"
> situation into an "always works" one?

This never happened on the environment I'm testing (OSPd Liberty with the 20 steps from the Knowledge Base). That option never had an influence on the process. What I've done to make things work is this:

1) Install a clean OSPd 7 environment with 3 controllers and 4 computes;

2) Download the upstream scripts for NovaEvacuate, NovaCompute and fence_compute.py;

*** The most important step ***

3) Apply this patch to NovaCompute to use systemctl instead of nova-compute to launch the service:

--- NovaCompute.upstream	2016-01-04 13:05:18.038937203 +0000
+++ NovaCompute.rasca	2016-01-04 13:06:41.830694586 +0000
@@ -150,7 +150,7 @@
 }

 nova_pid() {
-    ps axf | grep python.*nova-compute | grep -v grep | awk '{print $1}'
+    pgrep -u nova -f '/usr/bin/python2.*/usr/bin/nova-compute.*'
 }

 nova_start() {
@@ -180,7 +180,7 @@
     fi

     export LIBGUESTFS_ATTACH_METHOD=appliance
-    su nova -s /bin/sh -c /usr/bin/nova-compute &
+    /bin/systemctl start openstack-nova-compute

     rc=$OCF_NOT_RUNNING
     ocf_log info "Waiting for nova to start"
@@ -213,7 +213,7 @@
 nova_stop() {
     pid=$(nova_pid)
     if [ "x$pid" != x ]; then
-        su nova -c "kill -TERM $pid" -s /bin/bash
+        /bin/systemctl stop openstack-nova-compute
     fi

     while [ "x$pid" != x ]; do

4) Copy NovaEvacuate, NovaCompute and fence_compute.py on all the controllers (not really necessary) and computes;

********************************

5) Apply all the 20 steps from the KB article, changing some of them to add the no_shared_storage option on nova-evacuate, nova-compute (which is cloned) and also the fence-nova stonith resource (which is based upon the fence_compute stonith agent);

6) Launch 6 instances on the same compute node;

7) Schedule an "echo c > /proc/sysrq-trigger" on the chosen compute node at a certain time;

After this all the instances are successfully migrated to a different host, so everything works as expected. Note that this never happened before, at least for me. All the steps above are reproducible starting from a clean setup.
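For anyone wanting to see why the nova_pid change in step 3 matters, here is a rough comparison that can be run on a compute node while nova-compute is up (sketch only; the two pipelines are the same ones shown in the patch above):

# Old approach: ps output piped through grep, which can pick up unrelated
# processes matching the pattern and depends on ps formatting
ps axf | grep 'python.*nova-compute' | grep -v grep | awk '{print $1}'

# New approach: match the full command line of processes owned by the nova user
pgrep -u nova -f '/usr/bin/python2.*/usr/bin/nova-compute.*'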
Please see the additional info from the description. It's always been known that the symptoms are not reproducible that way, but switching to systemd just avoids rather than fixes the underlying problem.
Ok, I agree with this. What I don't understand is that everything seems to start from a nova.conf misconfiguration, but in my tests this does not affect anything (with either the true or the false setting). So the point is to understand why the service works under systemd but not when invoked inside the resource agent, even though the way it is launched is apparently the same.

This is the content of the systemd "Service" declaration:

Environment=LIBGUESTFS_ATTACH_METHOD=appliance
Type=notify
NotifyAccess=all
TimeoutStartSec=0
Restart=always
User=nova
ExecStart=/usr/bin/nova-compute

and this is how it is invoked inside the resource agent:

export LIBGUESTFS_ATTACH_METHOD=appliance
su nova -s /bin/sh -c /usr/bin/nova-compute &

So, apparently, everything is the same. As discussed, I will try to separate the evacuation part from the nova-compute service, making it a standalone resource that depends on the first one, and then I'll update this BZ.
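One difference worth keeping in mind, per the "Additional info" in the description: under pacemaker the process ends up with pipes on stdout/stderr rather than a terminal or the journal. A quick way to mimic that by hand and see whether nova-compute still exits silently (sketch; run on a compute node, same invocation as in the resource agent above):

export LIBGUESTFS_ATTACH_METHOD=appliance
# Piping through cat gives nova-compute a pipe for stdout/stderr,
# similar to how the RA is spawned by pacemaker
su nova -s /bin/sh -c /usr/bin/nova-compute 2>&1 | cat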
Hello, This bug has been open for a while without any updates. I'm going to close it. If you feel there's anything here that a Nova engineer can help with, don't hesitate to reopen this bug. Cheers!