Description of problem:

We are seeing this issue in the RDO CI as well as in the internal pipelines we use to inform imports of RDO Newton --> OSP 10. When the jobs call "openstack baremetal import --json instackenv.json", the command fails.

The failing command (and the log of that execution):
- https://thirdparty-logs.rdoproject.org/jenkins-tripleo-quickstart-periodic-newton-delorean-ha_192gb-3/undercloud/home/stack/overcloud-prep-images.sh
- https://thirdparty-logs.rdoproject.org/jenkins-tripleo-quickstart-periodic-newton-delorean-ha_192gb-3/undercloud/home/stack/overcloud_prep_images.log.gz

The ironic log from that job:
- https://thirdparty-logs.rdoproject.org/jenkins-tripleo-quickstart-periodic-newton-delorean-ha_192gb-3/undercloud/var/log/ironic/ironic-conductor.log.gz

The ironic conf file from that job:
- https://thirdparty-logs.rdoproject.org/jenkins-tripleo-quickstart-periodic-newton-delorean-ha_192gb-3/undercloud/etc/ironic/ironic.conf.gz

Version-Release number of selected component (if applicable):

RDO Newton; OSP 10 builds will shortly exhibit this as well. We first noticed it here:
- https://trunk.rdoproject.org/centos7-newton/ef/c1/efc1fbe18783e9d24e46b99edb3282c87eb85244_96121e15
which RDO promoted via this job:
- https://ci.centos.org/view/rdo/view/promotion-pipeline/job/rdo-delorean-promote-newton/128

---

I have put the following workaround in place for my jobs, and so far results are good. The workaround simply restarts openstack-ironic-conductor just prior to executing the baremetal import command:
- https://review.gerrithub.io/#/c/300572/

This will, however, impact QE CI that is not based on tripleo-quickstart.

---

The ironic logfile also contains the following exception. It is not clear whether it is related to the failure, so please advise!
- https://thirdparty-logs.rdoproject.org/jenkins-tripleo-quickstart-periodic-newton-delorean-ha_192gb-3/undercloud/var/log/ironic/ironic-conductor.log.gz#_2016-11-02_13_26_25_076

2016-11-02 13:26:25.076 14628 ERROR oslo_db.sqlalchemy.exc_filters [-] DBAPIError exception wrapped from (pymysql.err.InternalError) (1927, u'Connection was killed') [SQL: u'SELECT 1']
2016-11-02 13:26:25.076 14628 ERROR oslo_db.sqlalchemy.exc_filters Traceback (most recent call last):
2016-11-02 13:26:25.076 14628 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1139, in _execute_context
2016-11-02 13:26:25.076 14628 ERROR oslo_db.sqlalchemy.exc_filters     context)
2016-11-02 13:26:25.076 14628 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/default.py", line 450, in do_execute
2016-11-02 13:26:25.076 14628 ERROR oslo_db.sqlalchemy.exc_filters     cursor.execute(statement, parameters)
2016-11-02 13:26:25.076 14628 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib/python2.7/site-packages/pymysql/cursors.py", line 146, in execute
2016-11-02 13:26:25.076 14628 ERROR oslo_db.sqlalchemy.exc_filters     result = self._query(query)
2016-11-02 13:26:25.076 14628 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib/python2.7/site-packages/pymysql/cursors.py", line 296, in _query
2016-11-02 13:26:25.076 14628 ERROR oslo_db.sqlalchemy.exc_filters     conn.query(q)
2016-11-02 13:26:25.076 14628 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 781, in query
2016-11-02 13:26:25.076 14628 ERROR oslo_db.sqlalchemy.exc_filters     self._affected_rows = self._read_query_result(unbuffered=unbuffered)
2016-11-02 13:26:25.076 14628 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 942, in _read_query_result
2016-11-02 13:26:25.076 14628 ERROR oslo_db.sqlalchemy.exc_filters     result.read()
2016-11-02 13:26:25.076 14628 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 1138, in read
2016-11-02 13:26:25.076 14628 ERROR oslo_db.sqlalchemy.exc_filters     first_packet = self.connection._read_packet()
2016-11-02 13:26:25.076 14628 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 906, in _read_packet
2016-11-02 13:26:25.076 14628 ERROR oslo_db.sqlalchemy.exc_filters     packet.check_error()
2016-11-02 13:26:25.076 14628 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 367, in check_error
2016-11-02 13:26:25.076 14628 ERROR oslo_db.sqlalchemy.exc_filters     err.raise_mysql_exception(self._data)
2016-11-02 13:26:25.076 14628 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib/python2.7/site-packages/pymysql/err.py", line 120, in raise_mysql_exception
2016-11-02 13:26:25.076 14628 ERROR oslo_db.sqlalchemy.exc_filters     _check_mysql_exception(errinfo)
2016-11-02 13:26:25.076 14628 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib/python2.7/site-packages/pymysql/err.py", line 115, in _check_mysql_exception
2016-11-02 13:26:25.076 14628 ERROR oslo_db.sqlalchemy.exc_filters     raise InternalError(errno, errorvalue)
2016-11-02 13:26:25.076 14628 ERROR oslo_db.sqlalchemy.exc_filters InternalError: (1927, u'Connection was killed')
2016-11-02 13:26:25.076 14628 ERROR oslo_db.sqlalchemy.exc_filters

---
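For context on that traceback: the "SELECT 1" looks like the liveness ping oslo.db issues when it takes a connection out of the SQLAlchemy pool, so the error suggests MariaDB killed a connection that ironic-conductor still had checked into its pool. A quick way to look for server-side evidence on the undercloud is sketched below; the paths and option names are assumptions for a Newton undercloud, not taken from the job above.

  # How long MariaDB keeps idle connections before dropping them.
  sudo mysql -e "SHOW GLOBAL VARIABLES LIKE '%timeout%';"

  # What ironic is configured to do with its database connection pool.
  sudo grep -E '^(connection|idle_timeout|max_pool_size|max_overflow)' /etc/ironic/ironic.conf

  # Any server-side trace of connections being aborted/killed around the failure time.
  sudo grep -iE 'aborted|killed' /var/log/mariadb/mariadb.log | tail -n 20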
Current details are being tracked here: - https://review.rdoproject.org/etherpad/p/rdo-internal-issues #55
When exactly did you first see this failure?
We first saw the error on 11/2, using an image from RDO (RDO Newton + CentOS). Subsequent builds using RDO Newton RPMs (from the hash above) atop a RHEL 7.2 base also exhibited the issue.
The workaround for tripleo-quickstart-based CI jobs has landed here:
- https://github.com/redhat-openstack/ansible-role-tripleo-overcloud-prep-images/commit/0aa9d3691bab401a60dd10933136d27f18c5f761
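For anyone not using tripleo-quickstart who wants the same mitigation, the workaround boils down to restarting the conductor immediately before the import. A minimal sketch is below; the linked commit and the gerrithub review are the authoritative versions, and the sleep value and instackenv.json path here are assumptions.

  # Run on the undercloud as the stack user.
  source ~/stackrc

  # Restart the conductor so it rebuilds its database connections,
  # then give it a moment to come back up before importing nodes.
  sudo systemctl restart openstack-ironic-conductor
  sleep 30

  openstack baremetal import --json ~/instackenv.json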
Adding external tracker - LP#1639013 -- We are seeing this in all pipelines
What's your mariadb version? The timing and the MySQL connection errors align suspiciously with https://bugs.launchpad.net/tripleo/+bug/1638864
The run where we initially saw this was using mariadb-10.1.18-1.el7:
- https://thirdparty-logs.rdoproject.org/jenkins-tripleo-quickstart-periodic-newton-delorean-ha_192gb-3/undercloud/var/log/extra/import-delorean-testing.txt.gz

---

In other news, I have added a GitHub issue to track removal of the workaround we have in place in CI:
- https://github.com/redhat-openstack/ansible-role-tripleo-overcloud-prep-images/issues/1
According to https://bugs.launchpad.net/tripleo/+bug/1638864, this version has problems. Could you please try mariadb-10.1.18-3? The issue does not look directly related, but I'd prefer to rule out this possibility completely.
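If it helps while waiting for a newer image, a hedged sketch for confirming and bumping the package on an already-deployed undercloud is below; the package glob and service names are assumptions, and redeploying from a hash that already carries 10.1.18-3 is the cleaner test.

  # Confirm what is currently installed (expect mariadb-10.1.18-3.el7 after the update).
  rpm -qa 'mariadb*' | sort

  # Update the MariaDB packages and restart the database.
  sudo yum update -y 'mariadb*'
  sudo systemctl restart mariadb

  # Restart the ironic services so they rebuild their DB connection pools.
  sudo systemctl restart openstack-ironic-api openstack-ironic-conductor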
RDO Newton just promoted a new hash:
- https://trunk.rdoproject.org/centos7-newton/8e/bb/8ebb715a52afef8c5eea6fa343a915d97910907c_13bba89f
via:
- https://ci.centos.org/view/rdo/view/promotion-pipeline/job/rdo-delorean-promote-newton/138/
- https://ci.centos.org/job/tripleo-quickstart-promote-newton-delorean-minimal/95/
containing mariadb-10.1.18-3.el7:
- https://ci.centos.org/artifacts/rdo/jenkins-tripleo-quickstart-promote-newton-delorean-minimal-95/undercloud/var/log/extra/import-delorean-testing.txt.gz

The internal pipelines are running on this hash now. Both the internal and the ci.centos pipelines have the workaround in place. We can do a separate run without the workaround to confirm, but not until a little later today when the current jobs complete. I'll circle back with details.
I've just kicked off the run mentioned in comment #10.
Hi! Any updates here? Did the MariaDB update fix the jobs?
Yup. I ran the pipeline once with the workaround disabled, and it appears to have worked; the failure did not reproduce. Before I could get a second pass, we ran into a series of other issues (some infra, some not). We've been chasing these and should ideally be back to green status today, whereupon I can layer in removal of the workaround. In a perfect world I could do this in parallel, but that's not possible with current HW resources. I'll circle back tonight, or over the next few days, to confirm; but before removing the workaround that is currently in place (and affecting all tripleo-quickstart jobs, both ci.centos and internal), I would like a second data point.
Thanks! I'll close it for now, but please feel free to reopen.