Bug 1230592 - possible race condition in nova-compute
Summary: possible race condition in nova-compute
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 7.0 (Kilo)
Assignee: Eoghan Glynn
QA Contact: nlevinki
URL:
Whiteboard:
Depends On:
Blocks: 1185030 1251948 1261487
 
Reported: 2015-06-11 08:34 UTC by Fabio Massimo Di Nitto
Modified: 2020-12-21 19:39 UTC
CC List: 11 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-01-17 15:12:48 UTC
Target Upstream Version:
Embargoed:


Attachments
logs from one compute node (24.69 KB, text/plain)
2015-06-11 08:34 UTC, Fabio Massimo Di Nitto

Description Fabio Massimo Di Nitto 2015-06-11 08:34:28 UTC
Created attachment 1037552 [details]
logs from one compute node

Description of problem:

This issue was originally reported via email. It is not critical yet, and it can only be triggered by a very specific sequence of events based on the old implementation of the NovaCompute resource agent.

Reproducing it requires a fresh install of both controllers and compute nodes every time; rolling back the db or similar won't do it.

The resource agent code (as provided to us) used to do:

    export LIBGUESTFS_ATTACH_METHOD=appliance
    su nova -s /bin/sh -c /usr/bin/nova-compute &

    rc=$OCF_NOT_RUNNING
    ocf_log info "Waiting for nova to start"
    while [ $rc != $OCF_SUCCESS ]; do
        nova_monitor
        rc=$?
    done

    if [ "x${OCF_RESKEY_domain}" != x ]; then
       export service_host="${NOVA_HOST}.${OCF_RESKEY_domain}"
    else
       export service_host="${NOVA_HOST}"
    fi

    python -c "import os; from novaclient import client as nova_client; nova = nova_client.Client('2', os.environ.get('OCF_RESKEY_username'), os.environ.get('OCF_RESKEY_password'), os.environ.get('OCF_RESKEY_tenant_name'), os.environ.get('OCF_RESKEY_auth_url')); nova.services.enable(os.environ.get('service_host'), 'nova-compute');"

It appears, from what we were able to see, that nova-compute would start, and while it was still registering itself as a hypervisor, the subsequent nova call would happen "too fast", or in a racy manner, leaving the db in an inconsistent state.

Any attempt to start an instance on that given compute node would fail.

After a full environment reset with the call to python dropped, everything would work just fine.
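
For reference, an alternative to dropping the call entirely would be to make it wait for the compute service record to exist before enabling it. The following is only a rough sketch, using the nova CLI instead of the python client, and it assumes admin OS_* credentials are already exported; it is not the agent's actual code:

    # Sketch only: poll until nova-compute has registered itself on this
    # host before enabling it, instead of enabling immediately after the
    # monitor succeeds. Assumes admin OS_* credentials are exported.
    timeout=60
    while [ $timeout -gt 0 ]; do
        if nova service-list --host "${service_host}" --binary nova-compute \
                | grep -q nova-compute; then
            break
        fi
        sleep 1
        timeout=$((timeout - 1))
    done
    nova service-enable "${service_host}" nova-compute

The same check could presumably be done from the existing python snippet via nova.services.list() before calling nova.services.enable().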

Version-Release number of selected component (if applicable):

controllers:
openstack-nova-common-2015.1.0-4.el7ost.noarch
openstack-nova-console-2015.1.0-4.el7ost.noarch
openstack-nova-scheduler-2015.1.0-4.el7ost.noarch
openstack-nova-novncproxy-2015.1.0-4.el7ost.noarch
openstack-nova-conductor-2015.1.0-4.el7ost.noarch
openstack-nova-api-2015.1.0-4.el7ost.noarch
python-nova-2015.1.0-4.el7ost.noarch
python-novaclient-2.23.0-1.el7ost.noarch

computes:
openstack-nova-common-2015.1.0-4.el7ost.noarch
python-nova-2015.1.0-4.el7ost.noarch
python-novaclient-2.23.0-1.el7ost.noarch
openstack-nova-compute-2015.1.0-4.el7ost.noarch

How reproducible:

always

Steps to Reproduce:
1. install a super fresh environment without compute nodes
2. prepare one compute node (configuration and all) WITHOUT starting nova-compute
3. put the above code in a small shell script so it is executed as fast as pacemaker would execute it when starting the resource agent (see the sketch after this list)
4. execute it to start nova-compute
5. try to fire up an instance.
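
For step 3, a minimal standalone wrapper might look like the sketch below. Everything here is a placeholder guess at the environment (credentials, auth URL, and a simplified stand-in for the agent's nova_monitor); it just runs the same start/enable sequence back to back:

    #!/bin/sh
    # Hypothetical reproducer: stubs the OCF bits the real resource agent
    # provides, then runs the same start/enable sequence back to back.
    OCF_SUCCESS=0
    OCF_NOT_RUNNING=7
    NOVA_HOST=$(hostname -s)
    export OCF_RESKEY_username=nova
    export OCF_RESKEY_password=CHANGEME
    export OCF_RESKEY_tenant_name=services
    export OCF_RESKEY_auth_url=http://controller:5000/v2.0

    # Simplified stand-in for the agent's nova_monitor: report success as
    # soon as the nova-compute process is up.
    nova_monitor() {
        pgrep -f /usr/bin/nova-compute >/dev/null && return $OCF_SUCCESS
        return $OCF_NOT_RUNNING
    }

    export LIBGUESTFS_ATTACH_METHOD=appliance
    su nova -s /bin/sh -c /usr/bin/nova-compute &

    rc=$OCF_NOT_RUNNING
    while [ $rc != $OCF_SUCCESS ]; do
        nova_monitor
        rc=$?
    done

    export service_host="${NOVA_HOST}"
    python -c "import os; from novaclient import client as nova_client; nova = nova_client.Client('2', os.environ.get('OCF_RESKEY_username'), os.environ.get('OCF_RESKEY_password'), os.environ.get('OCF_RESKEY_tenant_name'), os.environ.get('OCF_RESKEY_auth_url')); nova.services.enable(os.environ.get('service_host'), 'nova-compute')"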

Actual results:

Instances will fail to start

Expected results:

Instances should start.

Additional info:

A nova-compute.log from an old run is attached. Hopefully it is good enough to be useful; otherwise it might take a few days before we can grab another one.

Also note that we worked around this issue temporarily by disabling the call to python (it provides an optimization but is not on the critical path), though a user could potentially trigger the same issue by starting nova-compute via systemctl and enabling the service right away.
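
For illustration, that manual sequence would be roughly the following (a sketch only; it assumes admin OS_* credentials are exported and that the host name matches the one nova-compute registers under):

    # Illustration only: the same racy pattern outside the resource agent.
    systemctl start openstack-nova-compute
    nova service-enable "$(hostname)" nova-compute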

Comment 8 Artom Lifshitz 2018-01-17 15:12:48 UTC
Given the age of this thing, the rather specific reproduction steps, and the lack of a customer case, I think we can safely close this.

