During a deployment on lower-spec'd systems, "nova-manage db sync" can take longer than five minutes. However, when deploying via the director, the Nova Puppet module has a db_sync_timeout of 300 seconds. This can cause director-based deployments to fail. For example, here's the Puppet log during the nova-manage db sync in my test:

  Error: /Stage[main]/Nova::Db::Sync/Exec[nova-db-sync]: Failed to call refresh: Command exceeded timeout
  Error: /Stage[main]/Nova::Db::Sync/Exec[nova-db-sync]: Command exceeded timeout

As the nova schema changes in the future, it might be a good idea to bump the timeout to something higher.

As a workaround, you can set the timeout to something larger via an environment file using the nova::db::sync::db_sync_timeout hieradata. For example:

  parameter_defaults:
    ExtraConfig:
      nova::db::sync::db_sync_timeout: 600
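To make the workaround concrete, here is a minimal sketch of creating such an environment file and passing it to the deploy command. The file name is my own choice; the hieradata key and value are the ones from the comment above, and the deploy invocation is only an illustrative example of the usual "-e" mechanism:

```shell
# Write an environment file (file name is arbitrary/hypothetical)
cat > nova-db-sync-timeout.yaml <<'EOF'
parameter_defaults:
  ExtraConfig:
    nova::db::sync::db_sync_timeout: 600
EOF

# Include it in the overcloud deployment, e.g.:
#   openstack overcloud deploy --templates -e nova-db-sync-timeout.yaml
echo "wrote nova-db-sync-timeout.yaml"
```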
Just to note, I'm testing this on a set of 3 VMs for Controller nodes, each with 2 vCPUs and 10 GB of memory. Here's a head and tail of nova-manage.log:

  [root@overcloud-controller-0 nova]# head -n4 nova-manage.log
  2017-03-21 08:11:01.472 45523 INFO migrate.versioning.api [-] 0 -> 1...
  2017-03-21 08:11:02.346 45523 INFO migrate.versioning.api [-] done
  2017-03-21 08:11:02.346 45523 INFO migrate.versioning.api [-] 1 -> 2...
  2017-03-21 08:11:03.501 45523 INFO migrate.versioning.api [-] done

  [root@overcloud-controller-0 nova]# tail -n4 nova-manage.log
  2017-03-21 08:19:48.633 49098 INFO migrate.versioning.api [req-9f48372f-ab93-4286-9f21-7dd10662282c - - - - -] 345 -> 346...
  2017-03-21 08:19:51.867 49098 INFO migrate.versioning.api [req-9f48372f-ab93-4286-9f21-7dd10662282c - - - - -] done
  2017-03-21 08:19:51.868 49098 INFO migrate.versioning.api [req-9f48372f-ab93-4286-9f21-7dd10662282c - - - - -] 346 -> 347...
  2017-03-21 08:19:52.477 49098 INFO migrate.versioning.api [req-9f48372f-ab93-4286-9f21-7dd10662282c - - - - -] done

Total time for the db sync is 8 minutes and 51 seconds. Granted, enterprise environments will have higher specs and therefore faster db syncs, but I can see a lot of people testing with lower-spec PoCs that will encounter this issue.
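The 8m51s figure can be double-checked from the first and last timestamps in the log excerpts above; a quick sketch using GNU date (available on these systems):

```shell
# First and last timestamps copied from nova-manage.log above
start="2017-03-21 08:11:01"
end="2017-03-21 08:19:52"

# Convert to epoch seconds and subtract
elapsed=$(( $(date -d "$end" +%s) - $(date -d "$start" +%s) ))
echo "db sync took ${elapsed}s (~$((elapsed / 60))m$((elapsed % 60))s)"
# prints: db sync took 531s (~8m51s)
```

So the sync overran the 300-second db_sync_timeout by well over three minutes.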
Alex - IIRC the timeout was increased to 900 or 600s, has this been reverted?
No, this was only changed for the undercloud, not the overcloud: https://github.com/openstack/instack-undercloud/blob/master/elements/puppet-stack-config/puppet-stack-config.yaml.template#L447. The overcloud should never take that long; the environment this is being deployed to is not sufficient.
Dan, could you provide more details on the system spec? Storage spec especially.
FYI, I just hit this with a customer while deploying OSP 10 on HPE 460c blades with 512 GB of memory and 28 cores. I'm pretty sure that this is "sufficient." :-) I suspect that we hit this because we are using FCoE storage on a NetApp 8040. The storage is plenty fast, but its latency is a bit higher than local storage, which hits hard during database creation.
Any info on the higher latency? Are we talking orders of magnitude or a few ms? It depends a lot on what's behind the NetApp 8040 too, how many clients there are, what those clients are doing, etc... but if it's not ridiculously over-subscribed and the latency isn't insane as a result, then we probably should bump up the timeouts a little, for non-local storage at least.
(In reply to Ollie Walsh from comment #7)
> Any info on the higher latency? Are we talking orders of magnitude or a few
> ms?

The customer was showing me the pretty graphs from the NetApp yesterday. IIRC, the highest we saw was around 10 ms. I'm not a storage/performance expert, so I don't really know if that's decent/terrible/pathological.

Any objection to my adding the customer directly to this BZ?
*** Bug 1456608 has been marked as a duplicate of this bug. ***
No objection but would be good to raise another BZ with logs for this specific config.
Starting with OSP12, via BZ 1434520, the DatabaseSyncTimeout is set to 900 in environments/low-memory-usage.yaml, which can be used for systems with low resources. Closing this BZ as OSP11 is EOL. Please feel free to reopen it in case the above-mentioned solution is not enough.
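For reference, a minimal sketch of using that timeout on OSP12+. The first option passes the shipped low-memory-usage.yaml (the path shown is the usual tripleo-heat-templates install location, but verify it on your system); the second writes a custom environment file overriding only DatabaseSyncTimeout, with a hypothetical file name:

```shell
# Option 1: use the shipped environment (path may vary by install):
#   openstack overcloud deploy --templates \
#     -e /usr/share/openstack-tripleo-heat-templates/environments/low-memory-usage.yaml

# Option 2: override just the db sync timeout (file name is arbitrary)
cat > db-sync-timeout.yaml <<'EOF'
parameter_defaults:
  DatabaseSyncTimeout: 900
EOF
#   openstack overcloud deploy --templates -e db-sync-timeout.yaml
echo "wrote db-sync-timeout.yaml"
```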