Bug 1571309

Summary: overcloud deployment fails due to cinder-manage db sync timeout exceeded
Product: Red Hat OpenStack Reporter: bigswitch <rhosp-bugs-internal>
Component: openstack-neutron Assignee: Assaf Muller <amuller>
Status: CLOSED WORKSFORME QA Contact: Toni Freger <tfreger>
Severity: high Docs Contact:
Priority: unspecified    
Version: 12.0 (Pike) CC: abishop, amuller, chrisw, mburns, nyechiel, srevivo
Target Milestone: --- Keywords: Reopened
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-05-02 14:01:30 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
  output of openstack software deployment show for the failed ID (flags: none)
  sosreport part 1 (flags: none)
  sosreport part 2 (flags: none)
  sosreport part 3 (flags: none)
  sosreport part 4 (flags: none)

Description bigswitch 2018-04-24 13:40:00 UTC
Created attachment 1426083 [details]
output of openstack software deployment show for the failed ID

Description of problem:
Overcloud deployment fails at step 3 when it runs cinder-manage db sync. Checking the failure logs via `openstack software deployment show <failed-id> --long` shows a cinder db sync timeout error.

Further checking /var/log/cinder/cinder-manage.log shows a MySQL connection error: cinder-manage is unable to reach glance-api.
It tries to connect to an IP that is not present anywhere in the overcloud/undercloud. It is not clear why cinder.conf has the glance-api server's IP set to something other than the host's IP in the internal network CIDR, since there is only one controller.
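For reference, this is roughly how the symptom was checked on the controller; the <failed-id> and <glance-ip> placeholders are illustrative, not actual values from this setup:

  # show the failed Heat software deployment, including deploy_stdout/stderr
  openstack software deployment show <failed-id> --long

  # which glance endpoint and DB connection is cinder configured with?
  sudo grep -E '^(glance_api_servers|connection)' /etc/cinder/cinder.conf

  # is the configured glance IP actually reachable on the glance-api port?
  nc -vz <glance-ip> 9292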


This failure occurred after updating the containers to the latest tag available for Pike, `12.0-20180405.1`.
It used to work with tag `12.0-20180124.1`.

Version-Release number of selected component (if applicable):
glance-api tag 12.0-20180405.1

How reproducible:
100% reproducible when deploying the overcloud with the Big Switch Networks plugin for neutron ML2; however, the problem does not appear to stem from the plugin itself.

Steps to Reproduce:
1. fetch the latest container images for overcloud
2. start overcloud deployment with Big Switch Networks neutron ml2 plugin enabled
3. deployment fails at step 3

Actual results:
deployment fails at step 3 due to cinder-manage db sync timeout


Expected results:
deployment succeeds

Additional info:
Attached the output of the software deployment show command; also running sosreport shortly after creating this BZ ticket.
Found a similar BZ, bug #1539682, and checked the related bug #1539192 and its dupe about a Ceph deployment error. However, it does not fit this deployment, since we do not have Ceph storage.

Let me know if you need the environment files passed to the overcloud deploy command and the command itself; not sure whether those are included in the sosreport.
Also, I set the component to openstack-cinder, but feel free to change it based on your analysis.

Comment 1 bigswitch 2018-04-24 14:18:12 UTC
Hi, the sosreport is larger than 20 MB. Is there an alternative way to upload/share it?

Comment 2 bigswitch 2018-04-26 12:43:42 UTC
Hello,

Does this need extra info from our end to debug this further?
Also, can you point to some doc that describes how to share attachments that are larger than 20MB?

Thanks!
Aditya Vaja

Comment 3 Alan Bishop 2018-04-26 13:14:44 UTC
This seems more like a neutron issue if cinder is failing due to a network connectivity problem.

Another thing to note is that in OSP-12, cinder runs on the baremetal host.

Comment 4 bigswitch 2018-04-26 14:00:39 UTC
> Another thing to note is that in OSP-12, cinder runs on the baremetal host.

Yep - cinder is on baremetal, but glance-api is containerized.
And the error started happening after updating the container images to latest available tag. So I thought it would be somehow related :)

It could be a neutron issue. However, the IP configured for glance-api in cinder.conf is not observed anywhere in the setup (when checking by running `ip a` on the controller and compute nodes). Not sure if it's a VIP (virtual IP), which is why I thought it might be a configuration issue for cinder.
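
For what it's worth, one way to check whether that address is a VIP rather than a fixed host IP, assuming the usual TripleO layout where VIPs appear as extra addresses on the controller and in the rendered hieradata; <glance-ip> is a placeholder:

  # does any interface on the controller hold the address, even as a secondary?
  sudo ip -o addr show | grep '<glance-ip>'

  # VIPs are usually listed in the hieradata that TripleO renders on the node
  sudo grep -ri '<glance-ip>' /etc/puppet/hieradata/ 2>/dev/null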

Just my 2 cents.

- Aditya

Comment 5 Assaf Muller 2018-04-26 17:56:33 UTC
Please attach an sosreport.

Comment 6 bigswitch 2018-04-26 18:00:58 UTC
Hi Assaf,

Can you let me know how to attach or share a sosreport that is larger than 20 MB?

Thanks!
- Aditya

Comment 7 bigswitch 2018-04-26 18:12:01 UTC
Ah, never mind. Found it in the KB: https://access.redhat.com/solutions/2112

Comment 8 bigswitch 2018-04-26 18:24:21 UTC
Created attachment 1427369 [details]
sosreport part 1

Comment 9 bigswitch 2018-04-26 18:25:31 UTC
Created attachment 1427375 [details]
sosreport part 2

Comment 10 bigswitch 2018-04-26 18:26:43 UTC
Created attachment 1427376 [details]
sosreport part 3

Comment 11 bigswitch 2018-04-26 18:27:47 UTC
Created attachment 1427377 [details]
sosreport part 4

Comment 12 bigswitch 2018-04-26 18:30:54 UTC
Hello Assaf,

I've attached the sosreport by splitting it into 4 parts, since the file was larger than 20MB.
Please let me know if I can provide any other information to help debug it.

Thanks!
- Aditya

Comment 13 bigswitch 2018-05-01 05:27:31 UTC
Hi,

We found a similar issue in this BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1452082 - which might be the same root cause. We have a single overcloud controller as well. Does that require a change to the deploy command?

Our current deployment command looks like this:
openstack overcloud deploy --templates \
  -r /home/stack/templates/roles_data.yaml \
  -e /home/stack/templates/node-info.yaml \
  -e /home/stack/templates/overcloud_images.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
  -e /home/stack/templates/network-environment.yaml \
  -e /home/stack/templates/bigswitch-p.yaml \
  -e /home/stack/templates/bigswitch_images.yaml \
  --ntp-server 10.8.29.9 --timeout 150

The roles_data.yaml override is only there to enable an extra service on the compute role for BSN; everything else is default.

Please let us know if this helps and if we can provide more info.

Thanks!
- Aditya

Comment 14 Mike Burns 2018-05-01 17:42:25 UTC
moving needinfo to target Assaf directly

Comment 15 bigswitch 2018-05-02 05:50:40 UTC
We tried an HA setup as well; still the same problem where the deployment fails at step 3.

(undercloud) [stack@rhosp12-director ~]$ openstack stack failures list --long overcloud | grep Error
    Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 2
            "Error: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]: Failed to call refresh: Command exceeded timeout",
            "Error: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]: Command exceeded timeout",
(undercloud) [stack@rhosp12-director ~]$




          "Debug: Finishing transaction 55716760",
            "Debug: Storing state",
            "Debug: Stored state in 0.09 seconds",
            "Notice: Applied catalog in 330.24 seconds",
            "Debug: Applying settings catalog for sections reporting, metrics",
            "Debug: Finishing transaction 99257380",
            "Debug: Received report to process from overcloud-controller-0.bigswitch.com",
            "Debug: Processing report from overcloud-controller-0.bigswitch.com with processor Puppet::Reports::Store"
        ],
        "failed_when_result": true
    }
        to retry, use: --limit @/var/lib/heat-config/heat-config-ansible/69104930-415d-4e85-9e18-2210bebe06a7_playbook.retry

    PLAY RECAP *********************************************************************
    localhost                  : ok=4    changed=1    unreachable=0    failed=1

  deploy_stderr: |
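
As a side note, the "Command exceeded timeout" messages come from the Puppet exec that wraps cinder-manage db_sync; if the sync were only slow rather than hung, the timeout could presumably be raised with an extra environment file. This is an untested sketch, and the parameter name is assumed from puppet-cinder, so please verify it exists in this release:

  # hypothetical extra environment file
  cat > /home/stack/templates/cinder-db-sync-timeout.yaml <<'EOF'
  parameter_defaults:
    ExtraConfig:
      cinder::db::sync::db_sync_timeout: 900
  EOF

  # then pass it with an additional -e on the deploy command above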

Comment 16 bigswitch 2018-05-02 05:52:24 UTC
Issue seen after updating the overcloud images with overcloud-full-latest-12.0.tar -> /usr/share/rhosp-director-images/overcloud-full-12.0-20180404.1.el7ost.tar.


The previous overcloud image, "overcloud-full-12.0-20180126.1.el7ost.tar", worked fine.
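
For completeness, the images were refreshed roughly like this (from memory, so treat the exact paths and flags as approximate rather than the literal commands run):

  mkdir -p ~/images
  tar -xf /usr/share/rhosp-director-images/overcloud-full-12.0-20180404.1.el7ost.tar -C ~/images
  openstack overcloud image upload --image-path ~/images --update-existing
  openstack baremetal configure boot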

Comment 17 Assaf Muller 2018-05-02 14:01:30 UTC
We aren't able to reproduce this; I would advise opening a support ticket and working with GSS to figure it out.