Bug 1321985 - Inconsistent failures with introspection and deploy build on Baremetal - OSP 8 poodle
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 8.0 (Liberty)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 10.0 (Newton)
Assignee: Angus Thomas
QA Contact: Arik Chernetsky
 
Reported: 2016-03-29 14:06 UTC by Ronelle Landy
Modified: 2016-04-18 07:47 UTC
CC List: 8 users

Doc Type: Bug Fix
Last Closed: 2016-04-18 07:47:50 UTC


Attachments
heat stack list and event list output (15.33 KB, text/plain)
2016-03-29 14:06 UTC, Ronelle Landy
ironic errors (77.02 KB, text/plain)
2016-03-29 14:07 UTC, Ronelle Landy
Heat log errors (64.42 KB, text/plain)
2016-03-29 14:07 UTC, Ronelle Landy

Description Ronelle Landy 2016-03-29 14:06:17 UTC
Created attachment 1141264 [details]
heat stack list and event list output

Description of problem:

With the OSP 8 poodle runs on 03/27 and 03/28, we are hitting a different error on each run, all related to introspection or the initial stages of the deploy.

Examples of the failures are included below:

========================

If introspection is run one node at a time (not bulk) and a node gets stuck downloading agent.ramdisk, then even after introspection is rerun and passes that point, 'openstack baremetal introspection status' still shows:

+----------+-------+
| Field    | Value |
+----------+-------+
| error    | None  |
| finished | False |
+----------+-------+
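
For reference, a minimal sketch of the single-node rerun and status check described above, assuming the inspector client's single-node start command and a placeholder UUID:

[stack@host15 ~]$ openstack baremetal introspection start <node-uuid>
[stack@host15 ~]$ openstack baremetal introspection status <node-uuid>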

=========================

With disk hints in use, introspection fails on some runs and passes on others (a sketch of how the hints are set follows the passing run below):

Run with failure:
-----------------

09:35:24 cmd: source /home/stack/stackrc; openstack baremetal introspection bulk start;
09:35:24 
09:35:24 start: 2016-03-29 05:30:41.377246
09:35:24 
09:35:24 end: 2016-03-29 05:35:28.776619
09:35:24 
09:35:24 delta: 0:04:47.399373
09:35:24 
09:35:24 stdout: Setting nodes for introspection to manageable...
09:35:24 Starting introspection of node: 04a4308c-5e43-4221-af70-7c26299e7854
09:35:24 Starting introspection of node: 249b039d-7e67-41ce-a3d7-575fe449f49e
09:35:24 Starting introspection of node: c1db408c-cc5a-4994-a018-5f00fc809581
09:35:24 Starting introspection of node: c3b0a5e3-17ee-4882-b003-f8719fe6dd78
09:35:24 Waiting for introspection to finish...
09:35:24 Introspection for UUID c1db408c-cc5a-4994-a018-5f00fc809581 finished successfully.
09:35:24 Introspection for UUID 04a4308c-5e43-4221-af70-7c26299e7854 finished successfully.
09:35:24 Introspection for UUID 249b039d-7e67-41ce-a3d7-575fe449f49e finished with error: No disks satisfied root device hints for node 249b039d-7e67-41ce-a3d7-575fe449f49e
09:35:24 Introspection for UUID c3b0a5e3-17ee-4882-b003-f8719fe6dd78 finished successfully.
09:35:24 Setting manageable nodes to available...
09:35:24 Node 04a4308c-5e43-4221-af70-7c26299e7854 has been set to available.
09:35:24 Node c1db408c-cc5a-4994-a018-5f00fc809581 has been set to available.
09:35:24 Node c3b0a5e3-17ee-4882-b003-f8719fe6dd78 has been set to available.
09:35:24 
09:35:24 stderr: Introspection completed with errors:
09:35:24 249b039d-7e67-41ce-a3d7-575fe449f49e: No disks satisfied root device hints for node 249b039d-7e67-41ce-a3d7-575fe449f49e
09:35:24 

The same set of nodes (using the same disk hints) passing introspection on another run:
-------------------------------------------------------------------------------

[stack@host15 ~]$ openstack baremetal introspection bulk start
Setting nodes for introspection to manageable...
Starting introspection of node: f75e9dff-56e8-4932-b061-8e96436f7241
Starting introspection of node: db951e4e-15a6-48d4-8cdc-bbb1f6116832
Starting introspection of node: 367130e7-d1e5-4515-b3d2-63fcfa2d3f02
Starting introspection of node: 7c932423-9457-44df-abdd-385d49035157
Waiting for introspection to finish...
Introspection for UUID db951e4e-15a6-48d4-8cdc-bbb1f6116832 finished successfully.
Introspection for UUID 367130e7-d1e5-4515-b3d2-63fcfa2d3f02 finished successfully.
Introspection for UUID f75e9dff-56e8-4932-b061-8e96436f7241 finished successfully.
Introspection for UUID 7c932423-9457-44df-abdd-385d49035157 finished successfully.
Setting manageable nodes to available...
Node f75e9dff-56e8-4932-b061-8e96436f7241 has been set to available.
Node db951e4e-15a6-48d4-8cdc-bbb1f6116832 has been set to available.
Node 367130e7-d1e5-4515-b3d2-63fcfa2d3f02 has been set to available.
Node 7c932423-9457-44df-abdd-385d49035157 has been set to available.
Introspection completed.
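
For context, the "disk hints" above are root device hints stored on each node's properties. A minimal sketch of how they are set, with a placeholder UUID and an example size value:

[stack@host15 ~]$ ironic node-update <node-uuid> add properties/root_device='{"size": 200}'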

================================

There are runs where introspection passes but the deploy fails quickly, with some nodes in error:

23:07:35 2016-03-28 18:09:37.819 66160 ERROR heat.engine.resource ResourceInError: Went to status ERROR due to "Message: Build of instance e39d0ca0-c9df-4211-bed9-7332dbe04ed2 aborted: Could not clean up failed build, not rescheduling, Code: 500"

....

23:07:35 2016-03-28 18:54:24.178 66160 ERROR heat.engine.resource RemoteError: Remote error: OperationalError (pymysql.err.OperationalError) (1040, u'Too many connections')

See the attached error and debug outputs.

There are possibly several different issues here, but each run picks up a slightly different failure.
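
As a quick diagnostic for the "Too many connections" error above, the current connection count can be compared against the server limit on the undercloud database (a sketch; credentials and socket paths may differ):

[stack@host15 ~]$ sudo mysql -e "SHOW STATUS LIKE 'Threads_connected'; SHOW VARIABLES LIKE 'max_connections';"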

Version-Release number of selected component (if applicable):

[stack@host15 ~]$ rpm -qa | grep openstack
openstack-tripleo-heat-templates-0.8.14-1.el7ost.noarch
openstack-ceilometer-api-5.0.2-2.el7ost.noarch
openstack-heat-api-cfn-5.0.1-4.el7ost.noarch
openstack-swift-object-2.5.0-2.el7ost.noarch
openstack-glance-11.0.1-4.el7ost.noarch
openstack-nova-compute-12.0.2-4.el7ost.noarch
openstack-neutron-ml2-7.0.1-14.el7ost.noarch
openstack-nova-scheduler-12.0.2-4.el7ost.noarch
openstack-nova-api-12.0.2-4.el7ost.noarch
openstack-neutron-common-7.0.1-14.el7ost.noarch
openstack-heat-templates-0-0.1.20151019.el7ost.noarch
openstack-ceilometer-common-5.0.2-2.el7ost.noarch
openstack-ironic-inspector-2.2.5-2.el7ost.noarch
openstack-heat-common-5.0.1-4.el7ost.noarch
openstack-swift-account-2.5.0-2.el7ost.noarch
openstack-tripleo-0.0.7-1.el7ost.noarch
openstack-ceilometer-collector-5.0.2-2.el7ost.noarch
openstack-tripleo-heat-templates-kilo-0.8.14-1.el7ost.noarch
openstack-tripleo-puppet-elements-0.0.5-1.el7ost.noarch
openstack-ceilometer-polling-5.0.2-2.el7ost.noarch
openstack-ironic-conductor-4.2.2-4.el7ost.noarch
openstack-nova-conductor-12.0.2-4.el7ost.noarch
openstack-neutron-openvswitch-7.0.1-14.el7ost.noarch
openstack-heat-api-cloudwatch-5.0.1-4.el7ost.noarch
openstack-selinux-0.6.58-1.el7ost.noarch
openstack-swift-plugin-swift3-1.9-1.el7ost.noarch
openstack-aodh-common-1.1.2-1.el7ost.noarch
openstack-utils-2014.2-1.el7ost.noarch
openstack-neutron-7.0.1-14.el7ost.noarch
openstack-aodh-notifier-1.1.2-1.el7ost.noarch
python-openstackclient-1.7.2-1.el7ost.noarch
openstack-aodh-listener-1.1.2-1.el7ost.noarch
openstack-ceilometer-alarm-5.0.2-2.el7ost.noarch
openstack-keystone-8.0.1-1.el7ost.noarch
openstack-nova-cert-12.0.2-4.el7ost.noarch
openstack-puppet-modules-7.0.17-1.el7ost.noarch
openstack-aodh-evaluator-1.1.2-1.el7ost.noarch
openstack-tripleo-image-elements-0.9.9-1.el7ost.noarch
openstack-swift-2.5.0-2.el7ost.noarch
openstack-ironic-api-4.2.2-4.el7ost.noarch
openstack-heat-engine-5.0.1-4.el7ost.noarch
openstack-swift-container-2.5.0-2.el7ost.noarch
openstack-nova-common-12.0.2-4.el7ost.noarch
openstack-ceilometer-central-5.0.2-2.el7ost.noarch
openstack-tripleo-common-0.3.1-1.el7ost.noarch
openstack-ironic-common-4.2.2-4.el7ost.noarch
openstack-aodh-api-1.1.2-1.el7ost.noarch
openstack-heat-api-5.0.1-4.el7ost.noarch
openstack-swift-proxy-2.5.0-2.el7ost.noarch
openstack-ceilometer-notification-5.0.2-2.el7ost.noarch


How reproducible:

Introspection or the deploy build fails on every job, in a slightly different place each time.

Steps to Reproduce:
1. Install OSPD 8 on baremetal from the latest poodle
2. Run introspection (try both bulk and single-node)
3. Run the deploy (a sketch of the commands behind steps 2 and 3 is below)
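
A minimal sketch of the commands behind steps 2 and 3 (the deploy flags are placeholders; the actual CI invocation passes more options):

[stack@host15 ~]$ openstack baremetal introspection bulk start
[stack@host15 ~]$ openstack overcloud deploy --templates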

Actual results:
Either step 2 or 3 fails

Expected results:
Successful deployment

Additional info:

The undercloud box will be made available for debugging.

Comment 1 Ronelle Landy 2016-03-29 14:07:31 UTC
Created attachment 1141265 [details]
ironic errors

Comment 2 Ronelle Landy 2016-03-29 14:07:57 UTC
Created attachment 1141266 [details]
Heat log errors

Comment 4 Lucas Alvares Gomes 2016-03-29 15:26:45 UTC
Just an update here.

I got "openstack baremetal introspection bulk start" to finish successfully after putting node 249b039d-7e67-41ce-a3d7-575fe449f49e into maintenance mode [0].

The node seems problematic: I can't even access the web console, and attempts to change the power state result in a failure:

2016-03-29 10:46:31.445 78405 ERROR ironic.drivers.modules.ipmitool [-] IPMI Error while attempting "ipmitool -I lanplus -H 10.9.10.130 -L ADMINISTRATOR -U root -R 3 -N 5 -f /tmp/tmp40Vm64 power status" for node 249b039d-7e67-41ce-a3d7-575fe449f49e. Error: Unexpected error while running command.
Command: ipmitool -I lanplus -H 10.9.10.130 -L ADMINISTRATOR -U root -R 3 -N 5 -f /tmp/tmp40Vm64 power status
Exit code: 1
Stdout: u''
Stderr: u'Error: Unable to establish IPMI v2 / RMCP+ session\n'

I tried using IPMI protocol version 1.5 [1] to see if it would fix the problem, but it did not; I got a similar error.
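
For reference, the BMC can be exercised directly from the undercloud with both protocol versions (a sketch; Ironic passes the password via a temporary -f file, a plain -P flag is shown here for brevity):

ipmitool -I lanplus -H 10.9.10.130 -U root -P <password> power status   # IPMI v2.0 (lanplus)
ipmitool -I lan -H 10.9.10.130 -U root -P <password> power status       # IPMI v1.5 (lan)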

...

So, judging at first glance, it seems the run failed because some nodes are flaky :-/

...

BMCs are usually problematic and very fragile. One thing I noted is that the configuration option "sync_power_state_interval" is at its default value of 60, which means Ironic will try to sync the power state of each node every minute. Perhaps we could increase that value a little just to make sure Ironic won't be hammering the BMC too much (a sketch follows the notes below).

[0] ironic node-set-maintenance 249b039d-7e67-41ce-a3d7-575fe449f49e on
[1] ironic node-update 249b039d-7e67-41ce-a3d7-575fe449f49e add driver_info/ipmi_protocol_version="1.5"
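
A sketch of the interval tweak suggested above, assuming the undercloud's /etc/ironic/ironic.conf (the 300-second value is only an example), followed by a restart of openstack-ironic-conductor:

[conductor]
sync_power_state_interval = 300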

Comment 5 Mike Burns 2016-04-07 21:36:02 UTC
This bug did not make the OSP 8.0 release.  It is being deferred to OSP 10.

Comment 6 Dmitry Tantsur 2016-04-08 13:45:55 UTC
Hi! Do you still experience this problem? Judging by Lucas' comment it was "fixed" by removing the misbehaving node, right?

Comment 7 Ronelle Landy 2016-04-08 14:30:32 UTC
There were a number of fixes that went into solving the introspection and deploy issues - most notably the "max-connections" fix detailed in https://bugzilla.redhat.com/show_bug.cgi?id=1323728.

After we had this fix in place, reset the iDRAC on the misbehaving nodes, and reworked the CI code to better recognize disk sizes for the ironic hints, we managed to get successful runs.
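
For reference, the "max-connections" fix amounted to raising the MariaDB connection limit on the undercloud. A minimal sketch (the file path and value are assumptions; see bug 1323728 for the actual change):

# /etc/my.cnf.d/server.cnf
[mysqld]
max_connections = 4096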

Comment 8 Dmitry Tantsur 2016-04-18 07:47:50 UTC
Great, so I assume we can close this one, right? Please reopen or open a new one if you experience more problems with introspection/deployment.

