Bug 1427569

Summary: OSP10 -> OSP11 upgrade fails when Nova services are running on a standalone node
Product: Red Hat OpenStack
Reporter: Marius Cornea <mcornea>
Component: openstack-tripleo-heat-templates
Assignee: Sofer Athlan-Guyot <sathlang>
Status: CLOSED ERRATA
QA Contact: Marius Cornea <mcornea>
Severity: urgent
Priority: unspecified
Version: 11.0 (Ocata)
CC: aschultz, dbecker, jcoufal, jschluet, mburns, morazi, panbalag, rhel-osp-director-maint, sathlang, slinaber
Target Milestone: rc
Keywords: Triaged
Target Release: 11.0 (Ocata)
Hardware: Unspecified
OS: Unspecified
Fixed In Version: openstack-tripleo-heat-templates-6.0.0-4.el7ost
Last Closed: 2017-05-17 20:02:33 UTC
Type: Bug

Description Marius Cornea 2017-02-28 16:00:27 UTC
Description of problem:
OSP10 -> OSP11 upgrade fails when Nova services are running on a standalone role. 

roles_data file:
http://paste.openstack.org/show/600798/

Upgrade fails during major-upgrade-composable-steps.yaml with the following error:

stdout: overcloud.AllNodesDeploySteps.ControllerUpgrade_Step2:
  resource_type: OS::Heat::SoftwareDeploymentGroup
  physical_resource_id: 170d8e1d-58e0-4720-8149-a9fd4f2b9e1d
  status: CREATE_FAILED
  status_reason: |
    CREATE aborted
overcloud.AllNodesDeploySteps.NovacontrolUpgrade_Step5.0:
  resource_type: OS::Heat::SoftwareDeployment
  physical_resource_id: 5cf72dbf-9b22-4f63-8b98-25061864df35
  status: CREATE_FAILED
  status_reason: |
    Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 2
  deploy_stdout: |
    ...
    TASK [Run puppet apply to set tranport_url in nova.conf] ***********************
    changed: [localhost]
    
    TASK [Setup cell_v2 (map cell0)] ***********************************************
    fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["nova-manage", "cell_v2", "map_cell0"], "delta": "0:02:12.569490", "end": "2017-02-28 15:41:23.802908", "failed": true, "rc": 1, "start": "2017-02-28 15:39:11.233418", "stderr": "", "stdout": "An error has occurred:
Traceback (most recent call last):
  File \"/usr/lib/python2.7/site-packages/nova/cmd/manage.py\", line 1594, in main
    ret = fn(*fn_args, **fn_kwargs)
  File \"/usr/lib/python2.7/site-packages/nova/cmd/manage.py\", line 1140, in map_cell0
    self._map_cell0(database_connection=database_connection)
  File \"/usr/lib/python2.7/site-packages/nova/cmd/manage.py\", line 1170, in _map_cell0
    cell_mapping.create()
  File \"/usr/lib/python2.7/site-packages/oslo_versionedobjects/base.py\", line 226, in wrapper
    return fn(self, *args, **kwargs)
  File \"/usr/lib/python2.7/site-packages/nova/objects/cell_mapping.py\", line 71, in create
    db_mapping = self._create_in_db(self._context, self.obj_get_changes())
  File \"/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py\", line 893, in wrapper
    with self._transaction_scope(context):
  File \"/usr/lib64/python2.7/contextlib.py\", line 17, in __enter__
    return self.gen.next()
  File \"/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py\", line 944, in _transaction_scope
    allow_async=self._allow_async) as resource:
  File \"/usr/lib64/python2.7/contextlib.py\", line 17, in __enter__
    return self.gen.next()
  File \"/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py\", line 558, in _session
    bind=self.connection, mode=self.mode)
  File \"/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py\", line 317, in _create_session
    self._start()
  File \"/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py\", line 403, in _start
    engine_args, maker_args)
  File \"/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py\", line 427, in _setup_for_connection
    sql_connection=sql_connection, **engine_kwargs)
  File \"/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/engines.py\", line 155, in create_engine
    test_conn = _test_connection(engine, max_retries, retry_interval)
  File \"/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/engines.py\", line 339, in _test_connection
    six.reraise(type(de_ref), de_ref)
  File \"<string>\", line 2, in reraise
DBConnectionError: (pymysql.err.OperationalError) (2003, \"Can't connect to MySQL server on '172.17.1.13' ([Errno 113] EHOSTUNREACH)\")", "warnings": []}
    	to retry, use: --limit @/var/lib/heat-config/heat-config-ansible/b106b80f-8c24-4896-98d3-06ddf74f7508_playbook.retry



Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Deploy an OSP10 overcloud with a standalone role running the Nova control plane services
2. Upgrade OSP10 to OSP11

Actual results:
Upgrade fails while running the "Setup cell_v2 (map cell0)" step.

Expected results:
Upgrade succeeds.

Additional info:
172.17.1.13 is the internal API VIP, but it cannot be reached because the cluster is not running at the point where this step executes.
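As a stop-gap before the fixed templates land, a guard in front of the cell_v2 task would at least fail fast with a clearer error. A minimal Ansible sketch, assuming MySQL listens on the VIP's default port 3306; the task list below is illustrative, not the actual upgrade playbook:

    # Illustrative guard, not from tripleo-heat-templates: wait until the DB
    # VIP accepts connections before attempting the cell_v2 mapping.
    - name: Wait for MySQL on the internal API VIP (assumed port 3306)
      wait_for:
        host: 172.17.1.13
        port: 3306
        timeout: 300

    - name: Setup cell_v2 (map cell0)
      command: nova-manage cell_v2 map_cell0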

Comment 2 Sofer Athlan-Guyot 2017-03-24 12:20:57 UTC
Got a successful run in CI, so moving this one to POST.  Checking if it's still working with the latest puddle.

Comment 3 Marius Cornea 2017-03-24 15:26:09 UTC
(In reply to Sofer Athlan-Guyot from comment #2)
> Got a successful run in CI, so moving this one to POST.  Checking if it's
> still working with the latest puddle.

I wasn't able to reproduce this issue with the latest puddle. I think we're good on this one.

Comment 5 Sofer Athlan-Guyot 2017-04-03 11:01:40 UTC
Adding compute for visibility.

Comment 6 Sofer Athlan-Guyot 2017-04-03 11:33:11 UTC
Removing compute, as it's unrelated.  The pcs cluster is not started, which makes the database migration fail because the VIP configured in nova::cell0_database_connection isn't reachable.  But this happens at step 5, while all of the databases should be back up by step 4.
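To confirm that state by hand before step 5 runs, a diagnostic task along these lines can help. A sketch only: the check string 'galera' assumes the default Pacemaker resource name and is not part of the shipped upgrade playbooks:

    # Illustrative diagnostic, not from the shipped upgrade playbooks:
    # fail early if Pacemaker does not report the galera resource as running.
    - name: Check that the pcs cluster and galera are up
      command: pcs status
      register: pcs_status
      changed_when: false
      failed_when: "'galera' not in pcs_status.stdout"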

Comment 7 Sofer Athlan-Guyot 2017-04-03 15:59:02 UTC
Hi,

so the upgrade of the custom role Novacontrol is happening at the same
time as the upgrade of the controller node.

I prefix Novacontrol log entries with N and controller log entries with C:

 - C: step0: Apr 03 09:08:55

 - N: step0: Apr 03 09:07:36
 - N: step1: Apr 03 09:08:27
 - N: step2: Apr 03 09:08:52

 - C: step1: Apr 03 09:13:56

 - N: step3: Apr 03 09:14:24
 - N: step4: Apr 03 09:14:40

 - C: step2: Apr 03 09:15:01

 - N: step5: Apr 03 09:17:28

 - C: step3: Apr 03 09:20:48
 - C: step4: never happened
 - C: step5: never happened
 

So the Novacontrol role had time to reach step 5 while the
controller was still at step 3.

We shouldn't have this kind of interleaved upgrade happening.  Will
check further on why this happens.
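One way to enforce lockstep ordering in the Heat templates is a cross-role depends_on barrier between consecutive steps. This is a conceptual sketch only, not the actual template code: the resource names follow the ones in the error output above, and the deployment properties are elided.

    # Conceptual sketch: step N of each role depends on step N-1 of *every*
    # role, so no role can run ahead of the others.
    ControllerUpgrade_Step2:
      type: OS::Heat::SoftwareDeploymentGroup
      depends_on:
        - ControllerUpgrade_Step1
        - NovacontrolUpgrade_Step1
      properties: {}  # servers/config elided in this sketch

    NovacontrolUpgrade_Step2:
      type: OS::Heat::SoftwareDeploymentGroup
      depends_on:
        - ControllerUpgrade_Step1
        - NovacontrolUpgrade_Step1
      properties: {}  # servers/config elided in this sketch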

Comment 8 Sofer Athlan-Guyot 2017-04-06 09:57:55 UTC
In stable/ocata.

Comment 11 errata-xmlrpc 2017-05-17 20:02:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1245