Bug 1230592

Summary: possible race condition in nova-compute
Product: Red Hat OpenStack Reporter: Fabio Massimo Di Nitto <fdinitto>
Component: openstack-nova Assignee: Eoghan Glynn <eglynn>
Status: CLOSED WONTFIX QA Contact: nlevinki <nlevinki>
Severity: high Docs Contact:
Priority: medium    
Version: 7.0 (Kilo) CC: berrange, dasmith, eglynn, kchamart, owalsh, rscarazz, sbauza, sferdjao, sgordon, srevivo, vromanso
Target Milestone: --- Keywords: ZStream
Target Release: 7.0 (Kilo)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-01-17 15:12:48 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1185030, 1251948, 1261487    
Attachments:
Description Flags
logs from one compute node none

Description Fabio Massimo Di Nitto 2015-06-11 08:34:28 UTC
Created attachment 1037552 [details]
logs from one compute node

Description of problem:

This issue was originally reported via email. It is not critical yet, and it can only be triggered by a very specific sequence of events based on the old implementation of the NovaCompute resource agent.

Reproducing it requires a fresh install of both controllers and compute nodes every time. Rolling back the db or similar is not enough.

The resource agent code (as provided to us) used to do:

    export LIBGUESTFS_ATTACH_METHOD=appliance
    su nova -s /bin/sh -c /usr/bin/nova-compute &

    rc=$OCF_NOT_RUNNING
    ocf_log info "Waiting for nova to start"
    while [ $rc != $OCF_SUCCESS ]; do
        nova_monitor
        rc=$?
    done

    if [ "x${OCF_RESKEY_domain}" != x ]; then
       export service_host="${NOVA_HOST}.${OCF_RESKEY_domain}"
    else
       export service_host="${NOVA_HOST}"
    fi

    python -c "import os; from novaclient import client as nova_client; nova = nova_client.Client('2', os.environ.get('OCF_RESKEY_username'), os.environ.get('OCF_RESKEY_password'), os.environ.get('OCF_RESKEY_tenant_name'), os.environ.get('OCF_RESKEY_auth_url')); nova.services.enable(os.environ.get('service_host'), 'nova-compute');"

It appears, from what we were able to see, that nova-compute starts and, while it is still registering itself as a hypervisor, the subsequent call to nova happens "too fast", or in a racy manner, and leaves the db in an inconsistent state.
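One way to take the timing out of the picture would be to enable the service only once its record is visible through the API. The following is a minimal sketch of that idea, not the resource agent's actual code. `list_services` and `enable_service` are hypothetical callables standing in for the API client, and the timeout and interval values are arbitrary. Whether waiting for the service record alone would avoid the inconsistent state was never confirmed in this report.

```python
import time


def wait_until(predicate, timeout=60.0, interval=1.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll predicate() until it is truthy or `timeout` seconds elapse."""
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def service_host(nova_host, domain=None):
    """Mirror the shell logic: append the domain only when one is set."""
    return "%s.%s" % (nova_host, domain) if domain else nova_host


def enable_when_registered(list_services, enable_service, host,
                           timeout=60.0, interval=1.0, sleep=time.sleep):
    """Enable nova-compute on `host` only after its service record exists.

    list_services(host, binary) -> list and enable_service(host, binary)
    are hypothetical callables supplied by the caller (assumption).
    Returns True if the service was enabled, False on timeout.
    """
    binary = "nova-compute"
    registered = wait_until(
        lambda: bool(list_services(host, binary)),
        timeout=timeout, interval=interval, sleep=sleep,
    )
    if not registered:
        return False
    enable_service(host, binary)
    return True
```

With the `nova` client object built as in the one-liner above, the callables could plausibly be `lambda h, b: nova.services.list(host=h, binary=b)` and `nova.services.enable`, assuming the installed python-novaclient accepts those arguments.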

Any attempt to start an instance on that given compute node would fail.

After a full environment reset, with the call to python dropped, everything worked just fine.

Version-Release number of selected component (if applicable):

controllers:
openstack-nova-common-2015.1.0-4.el7ost.noarch
openstack-nova-console-2015.1.0-4.el7ost.noarch
openstack-nova-scheduler-2015.1.0-4.el7ost.noarch
openstack-nova-novncproxy-2015.1.0-4.el7ost.noarch
openstack-nova-conductor-2015.1.0-4.el7ost.noarch
openstack-nova-api-2015.1.0-4.el7ost.noarch
python-nova-2015.1.0-4.el7ost.noarch
python-novaclient-2.23.0-1.el7ost.noarch

computes:
openstack-nova-common-2015.1.0-4.el7ost.noarch
python-nova-2015.1.0-4.el7ost.noarch
python-novaclient-2.23.0-1.el7ost.noarch
openstack-nova-compute-2015.1.0-4.el7ost.noarch

How reproducible:

always

Steps to Reproduce:
1. Install a completely fresh environment without compute nodes.
2. Prepare one compute node (configuration et al.) WITHOUT starting nova-compute.
3. Put the above code in a small shell script so that it is executed as fast as pacemaker would execute it when starting the resource agent.
4. Execute it to start nova-compute.
5. Try to fire up an instance.

Actual results:

Instances fail to start.

Expected results:

Instances should start.

Additional info:

A nova-compute.log is attached from an old run. Hopefully it is good enough to be useful, but otherwise it might take a few days before we can grab another one.

Also note that we temporarily worked around this issue by disabling the call to python (it provides an optimization but is not on the critical path), though a user can potentially trigger the same problem by starting nova-compute via systemctl and enabling the service right away.

Comment 8 Artom Lifshitz 2018-01-17 15:12:48 UTC
Given the age of this thing, the rather specific reproduction steps, and the lack of a customer case, I think we can safely close this.