Bug 1434279

Summary: Nova database sync times out during deployment
Product: Red Hat OpenStack
Reporter: Dan Macpherson <dmacpher>
Component: puppet-nova
Assignee: OSP DFG:Compute <osp-dfg-compute>
Status: CLOSED NEXTRELEASE
QA Contact: Joe H. Rahme <jhakimra>
Severity: medium
Priority: low
Version: 11.0 (Ocata)
CC: akaris, aschultz, dmacpher, ipilcher, jjoyce, jschluet, mbooth, mschuppe, owalsh, slinaber, tvignaud
Keywords: Triaged, ZStream
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Story Points: ---
Clones: 1434520
Last Closed: 2018-06-19 13:38:01 UTC
Type: Bug
Bug Blocks: 1434520    

Description Dan Macpherson 2017-03-21 07:33:40 UTC
During a deployment on lower-spec systems, "nova-manage db sync" can take longer than five minutes. However, when deploying via the director, the Nova Puppet module uses a db_sync_timeout of 300 seconds, which can cause director-based deployments to fail. For example, here's the Puppet log during the nova-manage db sync of my test:

Error: /Stage[main]/Nova::Db::Sync/Exec[nova-db-sync]: Failed to call refresh: Command exceeded timeout
Error: /Stage[main]/Nova::Db::Sync/Exec[nova-db-sync]: Command exceeded timeout

As the nova schema changes in the future, it might be a good idea to bump the timeout to something higher.

As a workaround, you can set the timeout to a larger value via an environment file using the nova::db::sync::db_sync_timeout hieradata key. For example:

parameter_defaults:
  ExtraConfig:
    nova::db::sync::db_sync_timeout: 600
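
After saving this to an environment file, include it in the overcloud deployment with -e. A minimal sketch of the usual director invocation, assuming a hypothetical file path of /home/stack/templates/nova-db-timeout.yaml:

# Hypothetical path; substitute the environment file created above
openstack overcloud deploy --templates \
  -e /home/stack/templates/nova-db-timeout.yaml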

Comment 2 Dan Macpherson 2017-03-21 08:27:57 UTC
Just to note, I'm testing this on a set of three VMs as Controller nodes, each with 2 vCPUs and 10 GB of memory.

Here's a head and tail of nova-manage.log:

[root@overcloud-controller-0 nova]# head -n4 nova-manage.log 
2017-03-21 08:11:01.472 45523 INFO migrate.versioning.api [-] 0 -> 1... 
2017-03-21 08:11:02.346 45523 INFO migrate.versioning.api [-] done
2017-03-21 08:11:02.346 45523 INFO migrate.versioning.api [-] 1 -> 2... 
2017-03-21 08:11:03.501 45523 INFO migrate.versioning.api [-] done
[root@overcloud-controller-0 nova]# tail -n4 nova-manage.log 
2017-03-21 08:19:48.633 49098 INFO migrate.versioning.api [req-9f48372f-ab93-4286-9f21-7dd10662282c - - - - -] 345 -> 346... 
2017-03-21 08:19:51.867 49098 INFO migrate.versioning.api [req-9f48372f-ab93-4286-9f21-7dd10662282c - - - - -] done
2017-03-21 08:19:51.868 49098 INFO migrate.versioning.api [req-9f48372f-ab93-4286-9f21-7dd10662282c - - - - -] 346 -> 347... 
2017-03-21 08:19:52.477 49098 INFO migrate.versioning.api [req-9f48372f-ab93-4286-9f21-7dd10662282c - - - - -] done

Total time for db sync is 8 minutes and 51 seconds.
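
For reference, that figure can be reproduced from the first and last log timestamps above; a quick sketch using GNU date:

# Timestamps copied from the head/tail output above
start=$(date -d "2017-03-21 08:11:01" +%s)
end=$(date -d "2017-03-21 08:19:52" +%s)
echo "$(( (end - start) / 60 ))m $(( (end - start) % 60 ))s"   # prints: 8m 51s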

Granted, enterprise environments will have higher specs and therefore faster db syncs, but I can see a lot of people testing with lower-spec PoCs encountering this issue.

Comment 3 Ollie Walsh 2017-03-23 18:58:15 UTC
Alex - IIRC the timeout was increased to 900 or 600 seconds. Has this been reverted?

Comment 4 Alex Schultz 2017-03-23 19:04:55 UTC
No, this was only changed for the undercloud, not the overcloud.

https://github.com/openstack/instack-undercloud/blob/master/elements/puppet-stack-config/puppet-stack-config.yaml.template#L447

The overcloud db sync should never take that long, and the environment this is being deployed to is not sufficient.

Comment 5 Ollie Walsh 2017-03-23 20:09:38 UTC
Dan, could you provide more details on the system spec? Storage spec especially.

Comment 6 Ian Pilcher 2017-06-01 23:38:29 UTC
FYI, I just hit this with a customer while deploying OSP 10 on HPE 460c blades with 512 GB of memory and 28 cores.  I'm pretty sure that this is "sufficient."  :-)

I suspect that we hit this because we are using FCoE storage on a NetApp 8040.  The storage is plenty fast, but its latency is a bit higher than local storage, which hits hard during database creation.

Comment 7 Ollie Walsh 2017-06-02 01:04:27 UTC
Any info on the higher latency? Are we talking orders of magnitude or a few ms?

It depends a lot on what's behind the NetApp 8040 too (how many clients there are, what those clients are doing, and so on), but if it's not ridiculously over-subscribed and the latency isn't insane as a result, then we should probably bump up the timeouts a little, at least for non-local storage.

Comment 8 Ian Pilcher 2017-06-02 14:10:44 UTC
(In reply to Ollie Walsh from comment #7)
> Any info on the higher latency? Are we talking orders of magnitude or a few
> ms?

The customer was showing me the pretty graphs from the NetApp yesterday.  IIRC, the highest we saw was around 10 ms.  I'm not a storage/performance expert, so I don't really know if that's decent/terrible/pathological.

Any objection to my adding the customer directly to this BZ?

Comment 9 Ollie Walsh 2017-06-02 14:26:19 UTC
*** Bug 1456608 has been marked as a duplicate of this bug. ***

Comment 10 Ollie Walsh 2017-06-02 14:39:02 UTC
No objection, but it would be good to raise another BZ with logs for this specific config.

Comment 14 Martin Schuppert 2018-06-19 13:38:01 UTC
Starting with OSP 12, via BZ 1434520, DatabaseSyncTimeout is set to 900 in environments/low-memory-usage.yaml, which can be used for systems with low resources.
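
For example, a low-resource deployment could include that environment file directly; a sketch assuming the usual tripleo-heat-templates location on the undercloud (verify the path for your release):

# Path is the typical default; adjust if your templates live elsewhere
openstack overcloud deploy --templates \
  -e /usr/share/openstack-tripleo-heat-templates/environments/low-memory-usage.yaml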

Closing this BZ as OSP 11 is EOL. Please feel free to reopen it if the above-mentioned solution is not enough.