Bug 1434279 - Nova database sync times out during deployment
Summary: Nova database sync times out during deployment
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: puppet-nova
Version: 11.0 (Ocata)
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: medium
Target Milestone: ---
Assignee: OSP DFG:Compute
QA Contact: Joe H. Rahme
URL:
Whiteboard:
Duplicates: 1456608 (view as bug list)
Depends On:
Blocks: 1434520
 
Reported: 2017-03-21 07:33 UTC by Dan Macpherson
Modified: 2023-03-21 17:54 UTC
CC: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1434520 (view as bug list)
Environment:
Last Closed: 2018-06-19 13:38:01 UTC
Target Upstream Version:
Embargoed:



Description Dan Macpherson 2017-03-21 07:33:40 UTC
During a deployment on lower-spec'd systems, the "nova-manage db sync" step can take longer than five minutes. However, when deploying via the director, the Nova Puppet module uses a db_sync_timeout of 300 seconds, which can cause director-based deployments to fail. For example, here's the Puppet log from the nova-manage db sync in my test:

Error: /Stage[main]/Nova::Db::Sync/Exec[nova-db-sync]: Failed to call refresh: Command exceeded timeout
Error: /Stage[main]/Nova::Db::Sync/Exec[nova-db-sync]: Command exceeded timeout

As the nova schema changes in the future, it might be a good idea to bump the timeout to something higher.
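To make the failure mode concrete, here is a minimal sketch (an illustration only, not the actual Puppet code path), assuming a POSIX shell with coreutils: the Exec resource's timeout behaves like wrapping the command in timeout(1), so any sync that runs longer than the cap is killed regardless of whether it would eventually succeed.

```sh
# A "sync" that needs 2s against a 1s cap is killed, just as an
# otherwise-healthy nova-manage db sync is killed by the default
# 300s db_sync_timeout.
if timeout 1 sleep 2; then
    echo "done"
else
    # timeout(1) exits with status 124 when the command is killed
    echo "Command exceeded timeout"
fi
```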

As a workaround, you can set the timeout to something larger via an environment file using the nova::db::sync::db_sync_timeout hieradata. For example:

parameter_defaults:
  ExtraConfig:
    nova::db::sync::db_sync_timeout: 600
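The environment file above is then passed to the deploy command with -e. The file name db-sync-timeout.yaml below is hypothetical; use whatever name you saved the parameter_defaults snippet under:

```sh
# Include the custom environment file alongside your existing ones
openstack overcloud deploy --templates \
  -e db-sync-timeout.yaml
  # ...plus any other -e environment files you already pass
```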

Comment 1 Dan Macpherson 2017-03-21 07:35:56 UTC
"And the nova schema changes in the future"

Meant to say:

"As the nova schema changes in the future"

Comment 2 Dan Macpherson 2017-03-21 08:27:57 UTC
Just to note, I'm testing this on a set of 3 VMs for Controller nodes, each with 2 vCPUs and 10 GB of memory. 

Here's a head and tail of nova-manage.log:

[root@overcloud-controller-0 nova]# head -n4 nova-manage.log 
2017-03-21 08:11:01.472 45523 INFO migrate.versioning.api [-] 0 -> 1... 
2017-03-21 08:11:02.346 45523 INFO migrate.versioning.api [-] done
2017-03-21 08:11:02.346 45523 INFO migrate.versioning.api [-] 1 -> 2... 
2017-03-21 08:11:03.501 45523 INFO migrate.versioning.api [-] done
[root@overcloud-controller-0 nova]# tail -n4 nova-manage.log 
2017-03-21 08:19:48.633 49098 INFO migrate.versioning.api [req-9f48372f-ab93-4286-9f21-7dd10662282c - - - - -] 345 -> 346... 
2017-03-21 08:19:51.867 49098 INFO migrate.versioning.api [req-9f48372f-ab93-4286-9f21-7dd10662282c - - - - -] done
2017-03-21 08:19:51.868 49098 INFO migrate.versioning.api [req-9f48372f-ab93-4286-9f21-7dd10662282c - - - - -] 346 -> 347... 
2017-03-21 08:19:52.477 49098 INFO migrate.versioning.api [req-9f48372f-ab93-4286-9f21-7dd10662282c - - - - -] done

Total time for db sync is 8 minutes and 51 seconds.

Granted, enterprise environments will have higher specs, which means faster db syncs, but I can see a lot of people testing with lower-spec PoCs that will encounter this issue.

Comment 3 Ollie Walsh 2017-03-23 18:58:15 UTC
Alex - IIRC the timeout was increased to 900 or 600s, has this been reverted?

Comment 4 Alex Schultz 2017-03-23 19:04:55 UTC
No, this was only changed for the undercloud, not the overcloud. 

https://github.com/openstack/instack-undercloud/blob/master/elements/puppet-stack-config/puppet-stack-config.yaml.template#L447

The overcloud should never take that long, and the environment this is being deployed to is not sufficient.

Comment 5 Ollie Walsh 2017-03-23 20:09:38 UTC
Dan, could you provide more details on the system spec? Storage spec especially.

Comment 6 Ian Pilcher 2017-06-01 23:38:29 UTC
FYI, I just hit this with a customer while deploying OSP 10 on HPE 460c blades with 512 GB of memory and 28 cores.  I'm pretty sure that this is "sufficient."  :-)

I suspect that we hit this because we are using FCoE storage on a NetApp 8040.  The storage is plenty fast, but its latency is a bit higher than local storage's, which hits hard during database creation.

Comment 7 Ollie Walsh 2017-06-02 01:04:27 UTC
Any info on the higher latency? Are we talking orders of magnitude or a few ms?

It also depends a lot on what's behind the NetApp 8040: how many clients there are, what those clients are doing, and so on. But if it's not ridiculously over-subscribed and the latency isn't insane as a result, then we should probably bump up the timeouts a little, at least for non-local storage.

Comment 8 Ian Pilcher 2017-06-02 14:10:44 UTC
(In reply to Ollie Walsh from comment #7)
> Any info on the higher latency? Are we talking orders of magnitude or a few
> ms?

The customer was showing me the pretty graphs from the NetApp yesterday.  IIRC, the highest we saw was around 10 ms.  I'm not a storage/performance expert, so I don't really know if that's decent/terrible/pathological.

Any objection to my adding the customer directly to this BZ?

Comment 9 Ollie Walsh 2017-06-02 14:26:19 UTC
*** Bug 1456608 has been marked as a duplicate of this bug. ***

Comment 10 Ollie Walsh 2017-06-02 14:39:02 UTC
No objection but would be good to raise another BZ with logs for this specific config.

Comment 14 Martin Schuppert 2018-06-19 13:38:01 UTC
Starting with OSP 12, via BZ 1434520, the DatabaseSyncTimeout is set to 900 seconds in environments/low-memory-usage.yaml, which can be used for systems with low resources. 

Closing this BZ as OSP 11 is EOL. Please feel free to reopen it in case the above-mentioned solution is not enough.
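For reference, on OSP 12 and later that file ships with the overcloud templates. Assuming the default template location (adjust the path if your install differs), it is included like any other environment file:

```sh
openstack overcloud deploy --templates \
  -e /usr/share/openstack-tripleo-heat-templates/environments/low-memory-usage.yaml
```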

