Created attachment 1426083 [details]
output of openstack software deployment show for the failed ID
Description of problem:
Overcloud deployment fails at step 3 when it runs cinder-manage db sync. The failure output from `openstack software deployment show <failed-id> --long` shows a cinder db sync timeout error.
Checking /var/log/cinder/cinder-manage.log further shows a mysql connection error: cinder-manage is unable to reach glance-api.
It tries to connect to an IP that is not present anywhere in the overcloud or undercloud. Not sure why cinder.conf has the glance-api server's IP set to something other than the host's IP in the internal network CIDR, since there is only one controller.
This failure started after updating the containers to the latest tag available in Pike, `12.0-20180405.1`.
It used to work with tag `12.0-20180124.1`.
Version-Release number of selected component (if applicable):
glance-api tag 12.0-20180405.1
100% reproducible when deploying the overcloud with the Big Switch Networks plugin for neutron ml2. However, the problem doesn't seem to stem from that.
Steps to Reproduce:
1. fetch the latest container images for overcloud
2. start overcloud deployment with Big Switch Networks neutron ml2 plugin enabled
3. deployment fails at step 3
deployment fails at step 3 due to cinder-manage db sync timeout
Attached the output of the software deployment show command. Will also run sosreport shortly after creating this BZ ticket.
Found a similar BZ, bug #1539682, and checked the related bug #1539192 and its dupe about a ceph deployment error. However, it does not fit this deployment case, since we do not have ceph storage.
Let me know if you need the env files passed to the overcloud deploy command and the command itself. Not sure if that's included in the sosreport.
Also, I set the component to openstack-cinder, but feel free to change it based on your analysis.
Hi, the sosreport is larger than 20MB. Is there an alternative way to upload or share it?
Does this need extra info from our end to debug this further?
Also, can you point to some doc that describes how to share attachments that are larger than 20MB?
This seems more of a neutron issue if cinder is failing due to a network connectivity problem.
Another thing to note is that in OSP-12, cinder runs on the baremetal host.
> Another thing to note is that in OSP-12, cinder runs on the baremetal host.
Yep - cinder is on baremetal, but glance-api is containerized.
And the error started happening after updating the container images to latest available tag. So I thought it would be somehow related :)
It could be a neutron issue. However, the IP configured for glance-api in cinder.conf is not present in the setup (checked by running `ip a` on the controller and compute nodes). Not sure if it's a VIP (virtual IP), which is why I thought it might be a configuration issue for cinder.
Just my 2 cents.
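For anyone wanting to repeat the cinder.conf check described above, here is a minimal Python sketch. It assumes the glance endpoint is set as `glance_api_servers` in the `[DEFAULT]` section of cinder.conf as a comma-separated list of URLs. The function name, the sample IP and the sample CIDR are hypothetical and are not taken from this deployment.

```python
import configparser
import ipaddress
from urllib.parse import urlparse


def glance_hosts_outside_cidr(conf_text, cidr):
    """Return glance-api IPs from cinder.conf text that fall outside cidr."""
    # interpolation=None so '%' in values is not treated specially;
    # strict=False tolerates repeated options.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    # Assumption: endpoints live in [DEFAULT]/glance_api_servers.
    raw = parser.get("DEFAULT", "glance_api_servers", fallback="")
    network = ipaddress.ip_network(cidr)
    outside = []
    for entry in raw.split(","):
        entry = entry.strip()
        if not entry:
            continue
        # Accept both "http://host:port" and bare "host:port" entries.
        host = urlparse(entry if "://" in entry else "//" + entry).hostname
        try:
            addr = ipaddress.ip_address(host)
        except ValueError:
            # Hostnames cannot be compared against a CIDR here; skip them.
            continue
        if addr not in network:
            outside.append(host)
    return outside


# Hypothetical values, for illustration only.
sample_conf = "[DEFAULT]\nglance_api_servers = http://172.17.1.10:9292\n"
print(glance_hosts_outside_cidr(sample_conf, "172.16.2.0/24"))
```

An address reported here is not necessarily wrong: it may be a VIP rather than a host address, so it still needs to be compared against the VIPs defined for the deployment.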
Please attach an sosreport.
Can you let me know how to attach or share an sosreport that is larger than 20MB?
ah, nvm. found it in the KB: https://access.redhat.com/solutions/2112
Created attachment 1427369 [details]
sosreport part 1
Created attachment 1427375 [details]
sosreport part 2
Created attachment 1427376 [details]
sosreport part 3
Created attachment 1427377 [details]
sosreport part 4
I've attached the sosreport by splitting it into 4 parts, since the file was larger than 20MB.
Please let me know if I can provide any other information to help debug it.
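In case it helps others who hit the same attachment size limit, the splitting can be sketched in Python as below. This is only an illustration of the idea, not necessarily the method the KB article describes. The function name and the `.partNN` naming are hypothetical.

```python
def split_file(path, max_bytes):
    """Split path into numbered parts of at most max_bytes each.

    Returns the list of part paths in order. Concatenating the parts
    in that order reproduces the original file.
    """
    parts = []
    with open(path, "rb") as src:
        index = 1
        while True:
            chunk = src.read(max_bytes)
            if not chunk:
                break
            part_path = f"{path}.part{index:02d}"
            with open(part_path, "wb") as dst:
                dst.write(chunk)
            parts.append(part_path)
            index += 1
    return parts
```

The receiving side can rebuild the archive by concatenating the parts in numeric order.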
We found a similar issue in this BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1452082 - which might be the same root cause. We have a single overcloud controller as well. Does that require a change to the deploy command?
Our current deployment command looks like this:
openstack overcloud deploy --templates \
  -r /home/stack/templates/roles_data.yaml \
  -e /home/stack/templates/node-info.yaml \
  -e /home/stack/templates/overcloud_images.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
  -e /home/stack/templates/network-environment.yaml \
  -e /home/stack/templates/bigswitch-p.yaml \
  -e /home/stack/templates/bigswitch_images.yaml \
  --ntp-server 10.8.29.9 --timeout 150
The roles_data.yaml override only enables an extra service on compute for BSN. Everything else is default.
Please let us know if this helps and if we can provide more info.
moving needinfo to target Assaf directly
We tried an HA setup as well. Still the same problem: the deployment failed at step 3.
(undercloud) [stack@rhosp12-director ~]$ openstack stack failures list --long overcloud | grep Error
Error: resources: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 2
"Error: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]: Failed to call refresh: Command exceeded timeout",
"Error: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]: Command exceeded timeout",
(undercloud) [stack@rhosp12-director ~]$
"Debug: Finishing transaction 55716760",
"Debug: Storing state",
"Debug: Stored state in 0.09 seconds",
"Notice: Applied catalog in 330.24 seconds",
"Debug: Applying settings catalog for sections reporting, metrics",
"Debug: Finishing transaction 99257380",
"Debug: Received report to process from overcloud-controller-0.bigswitch.com",
"Debug: Processing report from overcloud-controller-0.bigswitch.com with processor Puppet::Reports::Store"
to retry, use: --limit @/var/lib/heat-config/heat-config-ansible/69104930-415d-4e85-9e18-2210bebe06a7_playbook.retry
PLAY RECAP *********************************************************************
localhost : ok=4 changed=1 unreachable=0 failed=1
The issue is seen after updating the overcloud images with overcloud-full-latest-12.0.tar -> /usr/share/rhosp-director-images/overcloud-full-12.0-20180404.1.el7ost.tar
The previous overcloud image, "overcloud-full-12.0-20180126.1.el7ost.tar", worked fine.
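To pull just the Puppet resource errors out of long failure output like the above, a small Python filter can help. This is a rough sketch that assumes the error lines keep the `Error: /Stage[...]/...: message` shape seen in this output. The regex and the function name are hypothetical and not part of any OpenStack tool.

```python
import re

# Matches Puppet error lines such as:
#   "Error: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]: Command exceeded timeout",
PUPPET_ERROR = re.compile(
    r"Error: (?P<resource>/Stage\[[^\]]*\]/.+?\]): (?P<message>.*)"
)


def puppet_errors(lines):
    """Return (resource, message) tuples for Puppet resource errors."""
    results = []
    for line in lines:
        match = PUPPET_ERROR.search(line)
        if match:
            # Drop the trailing comma and quote left over from the JSON-style dump.
            message = match.group("message").rstrip().rstrip(",").rstrip('"')
            results.append((match.group("resource"), message))
    return results
```

Lines that start with `Error:` but do not name a `/Stage[...]` resource, such as the generic "Deployment exited with non-zero status code" line, are skipped on purpose.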
We aren't able to reproduce this. I would advise opening a support ticket and working with GSS to figure this out.