Description of problem:

We are seeing this issue in the RDO CI as well as in the internal pipelines we use to inform imports of RDO Newton --> OSP 10. When the jobs call "openstack baremetal import --json instackenv.json", the command fails.

The failing command (and the log of that execution):
- https://thirdparty-logs.rdoproject.org/jenkins-tripleo-quickstart-periodic-newton-delorean-ha_192gb-3/undercloud/home/stack/overcloud-prep-images.sh
- https://thirdparty-logs.rdoproject.org/jenkins-tripleo-quickstart-periodic-newton-delorean-ha_192gb-3/undercloud/home/stack/overcloud_prep_images.log.gz

The ironic log from that job:
- https://thirdparty-logs.rdoproject.org/jenkins-tripleo-quickstart-periodic-newton-delorean-ha_192gb-3/undercloud/var/log/ironic/ironic-conductor.log.gz

The ironic conf file from that job:
- https://thirdparty-logs.rdoproject.org/jenkins-tripleo-quickstart-periodic-newton-delorean-ha_192gb-3/undercloud/etc/ironic/ironic.conf.gz

Version-Release number of selected component (if applicable):

RDO Newton; OSP 10 builds will shortly exhibit this as well. We first noticed it here:
- https://trunk.rdoproject.org/centos7-newton/ef/c1/efc1fbe18783e9d24e46b99edb3282c87eb85244_96121e15
which RDO promoted via this job:
- https://ci.centos.org/view/rdo/view/promotion-pipeline/job/rdo-delorean-promote-newton/128

---

I have put the following workaround in place for my jobs, and so far results are good. The workaround simply restarts openstack-ironic-conductor just prior to executing the baremetal import command:
- https://review.gerrithub.io/#/c/300572/

This will, however, impact QE CI that is not based on tripleo-quickstart.

---

The ironic logfile also contains the following exception. It is not clear whether it is related to the failure, so please advise!
- https://thirdparty-logs.rdoproject.org/jenkins-tripleo-quickstart-periodic-newton-delorean-ha_192gb-3/undercloud/var/log/ironic/ironic-conductor.log.gz#_2016-11-02_13_26_25_076

2016-11-02 13:26:25.076 14628 ERROR oslo_db.sqlalchemy.exc_filters [-] DBAPIError exception wrapped from (pymysql.err.InternalError) (1927, u'Connection was killed') [SQL: u'SELECT 1']
2016-11-02 13:26:25.076 14628 ERROR oslo_db.sqlalchemy.exc_filters Traceback (most recent call last):
2016-11-02 13:26:25.076 14628 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1139, in _execute_context
2016-11-02 13:26:25.076 14628 ERROR oslo_db.sqlalchemy.exc_filters     context)
2016-11-02 13:26:25.076 14628 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/default.py", line 450, in do_execute
2016-11-02 13:26:25.076 14628 ERROR oslo_db.sqlalchemy.exc_filters     cursor.execute(statement, parameters)
2016-11-02 13:26:25.076 14628 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib/python2.7/site-packages/pymysql/cursors.py", line 146, in execute
2016-11-02 13:26:25.076 14628 ERROR oslo_db.sqlalchemy.exc_filters     result = self._query(query)
2016-11-02 13:26:25.076 14628 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib/python2.7/site-packages/pymysql/cursors.py", line 296, in _query
2016-11-02 13:26:25.076 14628 ERROR oslo_db.sqlalchemy.exc_filters     conn.query(q)
2016-11-02 13:26:25.076 14628 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 781, in query
2016-11-02 13:26:25.076 14628 ERROR oslo_db.sqlalchemy.exc_filters     self._affected_rows = self._read_query_result(unbuffered=unbuffered)
2016-11-02 13:26:25.076 14628 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 942, in _read_query_result
2016-11-02 13:26:25.076 14628 ERROR oslo_db.sqlalchemy.exc_filters     result.read()
2016-11-02 13:26:25.076 14628 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 1138, in read
2016-11-02 13:26:25.076 14628 ERROR oslo_db.sqlalchemy.exc_filters     first_packet = self.connection._read_packet()
2016-11-02 13:26:25.076 14628 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 906, in _read_packet
2016-11-02 13:26:25.076 14628 ERROR oslo_db.sqlalchemy.exc_filters     packet.check_error()
2016-11-02 13:26:25.076 14628 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 367, in check_error
2016-11-02 13:26:25.076 14628 ERROR oslo_db.sqlalchemy.exc_filters     err.raise_mysql_exception(self._data)
2016-11-02 13:26:25.076 14628 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib/python2.7/site-packages/pymysql/err.py", line 120, in raise_mysql_exception
2016-11-02 13:26:25.076 14628 ERROR oslo_db.sqlalchemy.exc_filters     _check_mysql_exception(errinfo)
2016-11-02 13:26:25.076 14628 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib/python2.7/site-packages/pymysql/err.py", line 115, in _check_mysql_exception
2016-11-02 13:26:25.076 14628 ERROR oslo_db.sqlalchemy.exc_filters     raise InternalError(errno, errorvalue)
2016-11-02 13:26:25.076 14628 ERROR oslo_db.sqlalchemy.exc_filters InternalError: (1927, u'Connection was killed')
2016-11-02 13:26:25.076 14628 ERROR oslo_db.sqlalchemy.exc_filters

---
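For context on that traceback: the "SELECT 1" looks like the liveness ping oslo.db issues when it takes a connection out of the SQLAlchemy pool, so the error suggests MariaDB killed a connection that ironic-conductor still had checked into its pool. A quick way to look for server-side evidence on the undercloud is sketched below; the paths and option names are assumptions for a Newton undercloud, not taken from the job above.

  # How long MariaDB keeps idle connections before dropping them.
  sudo mysql -e "SHOW GLOBAL VARIABLES LIKE '%timeout%';"

  # What ironic is configured to do with its database connection pool.
  sudo grep -E '^(connection|idle_timeout|max_pool_size|max_overflow)' /etc/ironic/ironic.conf

  # Any server-side trace of connections being aborted/killed around the failure time.
  sudo grep -iE 'aborted|killed' /var/log/mariadb/mariadb.log | tail -n 20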
Current details are being tracked here: - https://review.rdoproject.org/etherpad/p/rdo-internal-issues #55
When exactly did you first see this failure?
We first saw the error on 11/2, using an image from RDO (RDO Newton + CentOS). Subsequent builds using RDO Newton RPMs (from the hash above) atop a RHEL 7.2 base also exhibited the issue.
The workaround for tripleo-quickstart-based CI jobs has landed here:
- https://github.com/redhat-openstack/ansible-role-tripleo-overcloud-prep-images/commit/0aa9d3691bab401a60dd10933136d27f18c5f761
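For anyone not using tripleo-quickstart who wants the same mitigation, the workaround boils down to restarting the conductor immediately before the import. A minimal sketch is below; the linked commit and the gerrithub review are the authoritative versions, and the sleep value and instackenv.json path here are assumptions.

  # Run on the undercloud as the stack user.
  source ~/stackrc

  # Restart the conductor so it rebuilds its database connections,
  # then give it a moment to come back up before importing nodes.
  sudo systemctl restart openstack-ironic-conductor
  sleep 30

  openstack baremetal import --json ~/instackenv.json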
Adding external tracker - LP#1639013 -- We are seeing this in all pipelines
What's your mariadb version? The timing and the MySQL connection errors align suspiciously with https://bugs.launchpad.net/tripleo/+bug/1638864
The run where we initially saw this was using mariadb-10.1.18-1.el7:
- https://thirdparty-logs.rdoproject.org/jenkins-tripleo-quickstart-periodic-newton-delorean-ha_192gb-3/undercloud/var/log/extra/import-delorean-testing.txt.gz

---

In other news, I have added a GitHub issue to track removal of the workaround we have in place in CI:
- https://github.com/redhat-openstack/ansible-role-tripleo-overcloud-prep-images/issues/1
According to https://bugs.launchpad.net/tripleo/+bug/1638864, this version has problems. Could you please try mariadb-10.1.18-3? The issue does not look directly related, but I'd prefer to rule out this possibility completely.
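If it helps while waiting for a newer image, a hedged sketch for confirming and bumping the package on an already-deployed undercloud is below; the package glob and service names are assumptions, and redeploying from a hash that already carries 10.1.18-3 is the cleaner test.

  # Confirm what is currently installed (expect mariadb-10.1.18-3.el7 after the update).
  rpm -qa 'mariadb*' | sort

  # Update the MariaDB packages and restart the database.
  sudo yum update -y 'mariadb*'
  sudo systemctl restart mariadb

  # Restart the ironic services so they rebuild their DB connection pools.
  sudo systemctl restart openstack-ironic-api openstack-ironic-conductor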
RDO Newton just promoted a new hash:
- https://trunk.rdoproject.org/centos7-newton/8e/bb/8ebb715a52afef8c5eea6fa343a915d97910907c_13bba89f
via:
- https://ci.centos.org/view/rdo/view/promotion-pipeline/job/rdo-delorean-promote-newton/138/
- https://ci.centos.org/job/tripleo-quickstart-promote-newton-delorean-minimal/95/
containing mariadb-10.1.18-3.el7:
- https://ci.centos.org/artifacts/rdo/jenkins-tripleo-quickstart-promote-newton-delorean-minimal-95/undercloud/var/log/extra/import-delorean-testing.txt.gz

The internal pipelines are running on this hash now. Both the internal and the ci.centos pipelines have the workaround in place. We can do a separate run without the workaround to confirm, but not until a little later today when the current jobs complete. I'll circle back with details.
I've just kicked off the run mentioned in comment #10.
Hi! Any updates here? Did the MariaDB update fix the jobs?
Yup. I ran the pipeline once with the workaround disabled, and it appears to have worked; the failure did not reproduce. Before I could get a second pass, we ran into a series of other issues (some infra, some not). We've been chasing these and should ideally be back to green status today, whereupon I can layer in removal of the workaround. In a perfect world I could do this in parallel, but that's not possible with current HW resources. I'll circle back tonight, or over the next few days, to confirm; but before removing the workaround that is currently in place (and affecting all tripleo-quickstart jobs, both ci.centos and internal), I would like a second data point.
Thanks! I'll close it for now, but please feel free to reopen.