Bug 1538336
Summary: | [OSP13][Deployment] Redeployment of overcloud fails during ControllerDeployment_Step4.2 when /usr/bin/gnocchi-upgrade fails badly. | ||
---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Omri Hochman <ohochman> |
Component: | openstack-tripleo-heat-templates | Assignee: | Pradeep Kilambi <pkilambi> |
Status: | CLOSED ERRATA | QA Contact: | Sasha Smolyak <ssmolyak> |
Severity: | urgent | Docs Contact: | |
Priority: | urgent | ||
Version: | 13.0 (Queens) | CC: | agurenko, apannu, apevec, houyatao, jjoyce, johfulto, jschluet, lhh, mburns, ohochman, rhel-osp-director-maint, sasha, sclewis |
Target Milestone: | beta | Keywords: | Reopened, Triaged |
Target Release: | 13.0 (Queens) | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | openstack-tripleo-heat-templates-8.0.2-0.20180327213843.f25e2d8.el7ost | Doc Type: | If docs needed, set a value |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2018-06-27 13:43:23 UTC | Type: | Bug |
Description
Omri Hochman
2018-01-24 20:51:49 UTC
[root@undercloud74 ~]# skopeo inspect --tls-verify=false docker://docker-registry.engineering.redhat.com/rhosp13/openstack-gnocchi-metricd:2018-01-22.1
{
    "Name": "docker-registry.engineering.redhat.com/rhosp13/openstack-gnocchi-metricd",
    "Tag": "latest",
    "Digest": "sha256:2cfc66bc2b99de2d358653f8d5200da38ae85b8e3e2d501944e129506be4b821",
    "RepoTags": [
        "latest",
        "20180112.1",
        "13.0",
        "13.0-20180113.1",
        "2018-01-22.1",
        "13.0-20180112.1",
        "2017-12-20.1",
        "2018-01-03.2",
        "2018-01-10.4",
        "2018-01-12.2",
        "2018-01-17.2",
        "2018-01-19.1"
    ],
    "Created": "2018-01-16T16:58:28.936808Z",
    "DockerVersion": "1.12.6",
    "Labels": {
        "Kolla-SHA": "6.0.0.0b2-54-gbee9ea39",
        "architecture": "x86_64",
        "authoritative-source-url": "registry.access.redhat.com",
        "build-date": "2018-01-16T16:43:56.917775",
        "com.redhat.build-host": "ip-10-29-120-186.ec2.internal",
        "com.redhat.component": "openstack-gnocchi-metricd-docker",
        "description": "Red Hat OpenStack Platform 13.0 gnocchi-metricd",
        "distribution-scope": "public",
        "io.k8s.description": "Red Hat OpenStack Platform 13.0 gnocchi-metricd",
        "io.k8s.display-name": "Red Hat OpenStack Platform 13.0 gnocchi-metricd",
        "io.openshift.tags": "rhosp osp openstack osp-13.0",
        "kolla_version": "bee9ea39ff1b1c960c5f4f8f1a26fcced71a4ec3",
        "name": "rhosp13/openstack-gnocchi-metricd",
        "release": "3",
        "summary": "Red Hat OpenStack Platform 13.0 gnocchi-metricd",
        "tripleo-common_version": "8.3.0-2-g04317ff",
        "url": "https://access.redhat.com/containers/#/registry.access.redhat.com/rhosp13/openstack-gnocchi-metricd/images/13.0-3",
        "vcs-ref": "987c66824e1f7a0072f2c6ad8284a6b1d03762fa",
        "vcs-type": "git",
        "vendor": "Red Hat, Inc.",
        "version": "13.0",
        "version-release": "13.0-20180112.1"
    },
    "Architecture": "amd64",
    "Os": "linux",
    "Layers": [
        "sha256:9cadd93b16ff2a0c51ac967ea2abfadfac50cfa3af8b5bf983d89b8f8647f3e4",
        "sha256:4aa565ad8b7a87248163ce7dba1dd3894821aac97e846b932ff6b8ef9a8a508a",
        "sha256:a45298a8cdc01417dbd6afa9bd3fd661e746b2537a3a952f4bba91fc1c78e824",
        "sha256:50414ef47dd94a43681dc858f0ccfcfc7c8bf26f6cd182807ccafb372ded330b",
        "sha256:602f31cb60adedc3c9d9f8f1e5907142731a7691aabf353b8d16bad9a6a002f1",
        "sha256:d91fde1655b99c5958b36f01c4c448fa8920cf939c0102c5d1870e05f8f02f7c"
    ]
}

(In reply to Pradeep Kilambi from comment #3)
> This doesnt seem like a fresh deploy or even an upgrade scenario? Looks like
> there is already a gnocchi db on disk, when you started the install. Did
> your initial deploy fail for some reason or was aborted and you re-launched
> the deploy? Can you try another fresh deploy and see if you can reproduce
> this?

The issue reproduced a second time on the same environment, but you are right: the initial deployment failed because of a problem pulling containers from the local registry. After fixing that registry issue, I executed:

#openstack overcloud delete overcloud

When the stack was removed, I ran the deployment again, and then the bug happened.

- After chatting about it on IRC, this is still a valid bug, although it happened only after deleting the stack and running the deployment again. The theory for the root cause is that the delete leaves an unclean environment (DB or containers on the nodes), and therefore the issue occurs.

(In reply to Omri Hochman from comment #4)
> - Now after chatting about it in the IRC this is still a valid bug although
> it happened only after deleting the Stack and running deployment again.
> the theory for root cause of the problem could be that the delete leaves
> unclean environment (DB or containers on the nodes) and therefore the issue
> occurs.

After running:
-----------------
- openstack overcloud delete overcloud
- start the nodes using ironic power on
- ssh to the nodes
- run sudo docker ps / run sudo docker images

Results:
---------
It seems that we still have docker containers and images on the overcloud nodes after running "#openstack overcloud delete overcloud".
Whether or not that is the source of the issue in this bug, I'll open a separate bug for that:
https://bugzilla.redhat.com/show_bug.cgi?id=1538777

Re-opened: the issue reproduced on a clean deployment:

\"ObjectNotFound: error opening pool 'metrics'\",

(undercloud) [stack@undercloud74 ~]$ echo -e `heat deployment-show 37bc65f5-6986-4f27-b583-986333b648a4`|grep -i error
WARNING (shell) "heat deployment-show" is deprecated, please use "openstack software deployment show" instead
/usr/lib/python2.7/site-packages/requests/packages/urllib3/connection.py:344: SubjectAltNameWarning: Certificate for 192.168.0.2 has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
SubjectAltNameWarning
/usr/lib/python2.7/site-packages/requests/packages/urllib3/connection.py:344: SubjectAltNameWarning: Certificate for 192.168.0.2 has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
SubjectAltNameWarning
\"Error running ['docker', 'run', '--name', 'gnocchi_db_sync', '--label', 'config_id=tripleo_step4', '--label', 'container_name=gnocchi_db_sync', '--label', 'managed_by=paunch', '--label', 'config_data={\\"environment\\": [\\"KOLLA_CONFIG_STRATEGY=COPY_ALWAYS\\", \\"TRIPLEO_CONFIG_HASH=1a569d012dc804939398b671bf257703\\"], \\"user\\": \\"root\\", \\"volumes\\": [\\"/etc/hosts:/etc/hosts:ro\\", \\"/etc/localtime:/etc/localtime:ro\\", \\"/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro\\", \\"/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro\\", \\"/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro\\", \\"/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro\\", \\"/dev/log:/dev/log\\", \\"/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro\\", \\"/etc/puppet:/etc/puppet:ro\\", \\"/var/lib/kolla/config_files/gnocchi_db_sync.json:/var/lib/kolla/config_files/config.json:ro\\", \\"/var/lib/config-data/puppet-generated/gnocchi/:/var/lib/kolla/config_files/src:ro\\", \\"/var/log/containers/gnocchi:/var/log/gnocchi\\", \\"/var/log/containers/httpd/gnocchi-api:/var/log/httpd\\", \\"/etc/ceph:/var/lib/kolla/config_files/src-ceph:ro\\"], \\"image\\": \\"192.168.0.1:8787/rhosp13/openstack-gnocchi-api:13.0-20180112.1\\", \\"detach\\": false, \\"net\\": \\"host\\", \\"privileged\\": false}', '--env=KOLLA_CONFIG_STRATEGY=COPY_ALWAYS', '--env=TRIPLEO_CONFIG_HASH=1a569d012dc804939398b671bf257703', '--net=host', '--privileged=false', '--user=root', '--volume=/etc/hosts:/etc/hosts:ro', '--volume=/etc/localtime:/etc/localtime:ro', '--volume=/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro', '--volume=/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro', '--volume=/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro', '--volume=/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro', '--volume=/dev/log:/dev/log', '--volume=/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro', '--volume=/etc/puppet:/etc/puppet:ro', '--volume=/var/lib/kolla/config_files/gnocchi_db_sync.json:/var/lib/kolla/config_files/config.json:ro', '--volume=/var/lib/config-data/puppet-generated/gnocchi/:/var/lib/kolla/config_files/src:ro', '--volume=/var/log/containers/gnocchi:/var/log/gnocchi', '--volume=/var/log/containers/httpd/gnocchi-api:/var/log/httpd', '--volume=/etc/ceph:/var/lib/kolla/config_files/src-ceph:ro', '192.168.0.1:8787/rhosp13/openstack-gnocchi-api:13.0-20180112.1']. [1]\",
\"ObjectNotFound: error opening pool 'metrics'\",

(undercloud) [stack@undercloud74 ~]$ openstack stack list
+--------------------------------------+------------+----------------------------------+---------------+----------------------+--------------+
| ID                                   | Stack Name | Project                          | Stack Status  | Creation Time        | Updated Time |
+--------------------------------------+------------+----------------------------------+---------------+----------------------+--------------+
| 3b94d14f-b2cf-4fbc-9cff-e4533293c1a3 | overcloud  | d2ad266cecf9419f9fd906d2c916d998 | CREATE_FAILED | 2018-01-28T15:25:13Z | None         |
+--------------------------------------+------------+----------------------------------+---------------+----------------------+--------------+

(In reply to Omri Hochman from comment #7)
> Re-opened the issue reproduced on clean deployment :
> \"Error running ['docker', 'run', '--name', 'gnocchi_db_sync', '--label',
> [...]
> '192.168.0.1:8787/rhosp13/openstack-gnocchi-api:13.0-20180112.1']. [1]\",
> \"ObjectNotFound: error opening pool 'metrics'\",

This is not the same issue as the original. In this case it looks like your Ceph cluster does not have a metrics pool. Is this external Ceph? If so, make sure you create those pools accordingly.

(In reply to Pradeep Kilambi from comment #8)
> (In reply to Omri Hochman from comment #7)
> > Re-opened the issue reproduced on clean deployment :
>
> This is not the same issue as the original. IN this case looks like your
> ceph cluster does not have metrics pool. Is this external ceph? if so, make
> sure you create those pools accordingly.

Thanks for checking on that. It's internal Ceph with a standard deployment: 3x controller, 1x compute, 3x ceph.

The problem is how TripleO set up the ceph-ansible deployment:

- the ceph-ansible call has no input via extra vars [1]
- the inventory has no input [2]
- thus no arguments were passed to ceph-ansible
- thus ceph-ansible skipped all of its tasks, since no input was provided, and the playbook run returned no error [3]

[1]
2018-01-28 11:27:28.996 31461 DEBUG oslo_concurrency.processutils [req-1b57e863-20e4-414b-8b0c-62d37514f64b f8716113cb2d44259eeebf97c3570146 d2ad266cecf9419f9fd906d2c916d998 - default default] CMD "ansible-playbook /usr/share/ceph-ansible/site-docker.yml.sample --user tripleo-admin --become --become-user root --inventory-file /tmp/ansible-mistral-action60aVA0/inventory.yaml --private-key /tmp/ansible-mistral-action60aVA0/ssh_private_key --skip-tags package-install,with_pkg" returned: 0 in 873.287s execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:409

[2]
As per https://github.com/fultonj/tripleo-ceph-ansible/blob/master/get-inventory.sh:

Inventory from 2018-01-28 16:27:30
{
    "mgr_ips": [
        "192.168.0.8",
        "192.168.0.19",
        "192.168.0.17"
    ],
    "mon_ips": [
        "192.168.0.8",
        "192.168.0.19",
        "192.168.0.17"
    ],
    "mds_ips": [],
    "osd_ips": [
        "192.168.0.16",
        "192.168.0.12",
        "192.168.0.13"
    ],
    "rbdmirror_ips": [],
    "rgw_ips": [],
    "client_ips": [
        "192.168.0.15"
    ],
    "nfs_ips": []
}

[3]
2018-01-28 11:16:15,456 p=3697 u=mistral | TASK [ceph-mon : create openstack pool(s)] *************************************
2018-01-28 11:16:15,513 p=3697 u=mistral | skipping: [192.168.0.8] => (item={u'rule_name': u'', u'pg_num': 128, u'name': u'images'})
2018-01-28 11:16:15,536 p=3697 u=mistral | skipping: [192.168.0.8] => (item={u'rule_name': u'', u'pg_num': 128, u'name': u'metrics'})
2018-01-28 11:16:15,559 p=3697 u=mistral | skipping: [192.168.0.8] => (item={u'rule_name': u'', u'pg_num': 128, u'name': u'backups'})
2018-01-28 11:16:15,581 p=3697 u=mistral | skipping: [192.168.0.8] => (item={u'rule_name': u'', u'pg_num': 128, u'name': u'vms'})
2018-01-28 11:16:15,600 p=3697 u=mistral | skipping: [192.168.0.8] => (item={u'rule_name': u'', u'pg_num': 128, u'name': u'volumes'})

The issue from the last comments is going to be taken care of and tracked in:
https://bugzilla.redhat.com/show_bug.cgi?id=1539852

*** This bug has been marked as a duplicate of bug 1538777 ***

Reopening, as it reproduced again with a clean deploy.

We thought it might be related to #1552685, but the error is different.

Next move: trying to work around it by adding the following to the deploy command:
-e /usr/share/openstack-tripleo-heat-templates/environments/disable-telemetry.yaml

(In reply to Omri Hochman from comment #13)
> reopen , as it reproduced again with clean deploy.
>
> we thought it might be related to this #1552685 but the error is different,
>
> Next move, trying to W/A it by adding to the deploy_command:
> -e /usr/share/openstack-tripleo-heat-templates/environments/disable-telemetry.yaml

With gnocchi disabled, the deployment reached CREATE_COMPLETE.

Seeing it for the last 2 days with all RHOS 13 IR deployments with the latest puddle. I've added the workaround to the CI jobs and re-triggered all of them.
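For reference, the workaround described above (passing disable-telemetry.yaml as an extra environment file) could be applied to a scripted deploy command roughly as follows. This is only a sketch: the helper name `with_telemetry_disabled` and the base command are hypothetical, since the actual CI deploy command is not shown in this bug; only the `-e` option and the environment file path come from the comments above.

```python
# Environment file named in the workaround comments above.
DISABLE_TELEMETRY_ENV = (
    "/usr/share/openstack-tripleo-heat-templates/environments/disable-telemetry.yaml"
)


def with_telemetry_disabled(deploy_cmd):
    """Return a copy of deploy_cmd with the disable-telemetry environment added.

    deploy_cmd is a list of command-line arguments. The original list is not
    modified, and the environment file is appended only if it is missing.
    """
    cmd = list(deploy_cmd)
    if DISABLE_TELEMETRY_ENV not in cmd:
        cmd += ["-e", DISABLE_TELEMETRY_ENV]
    return cmd


# Hypothetical base command; a real deployment passes many more arguments.
base_cmd = ["openstack", "overcloud", "deploy", "--templates"]
print(" ".join(with_telemetry_disabled(base_cmd)))
```

Appending the file only when it is absent keeps the helper safe to call from several CI job definitions without duplicating the `-e` argument.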
Unable to reproduce with:
openstack-tripleo-heat-templates-8.0.2-0.20180327213843.f25e2d8.el7ost.noarch

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086
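For anyone triaging the "error opening pool 'metrics'" variant discussed earlier in this bug: the ceph-ansible log excerpt showed the "create openstack pool(s)" task being skipped for every pool while the playbook still returned 0. A minimal sketch of how such a log could be scanned for skipped pools is below. It assumes the log line format matches the excerpt quoted in this bug; the function name `skipped_pools` and the regular expression are illustrative, not part of ceph-ansible or TripleO.

```python
import re

# Task name as it appears in the ceph-ansible log excerpt in this bug.
POOL_TASK = "ceph-mon : create openstack pool(s)"

# Matches e.g.:
#   skipping: [192.168.0.8] => (item={u'rule_name': u'', u'pg_num': 128, u'name': u'metrics'})
SKIP_RE = re.compile(
    r"skipping: \[(?P<host>[^\]]+)\] => \(item=\{.*u'name': u'(?P<pool>[^']+)'\}\)"
)


def skipped_pools(log_text):
    """Return the pool names whose creation was skipped under POOL_TASK."""
    current_task = None
    skipped = []
    for line in log_text.splitlines():
        if "TASK [" in line:
            # Remember which task the following result lines belong to.
            current_task = line.split("TASK [", 1)[1].split("]", 1)[0]
        elif current_task == POOL_TASK:
            match = SKIP_RE.search(line)
            if match:
                skipped.append(match.group("pool"))
    return skipped


# Sample taken from the log excerpt quoted in this bug.
sample_log = """\
2018-01-28 11:16:15,456 p=3697 u=mistral | TASK [ceph-mon : create openstack pool(s)] ***
2018-01-28 11:16:15,513 p=3697 u=mistral | skipping: [192.168.0.8] => (item={u'rule_name': u'', u'pg_num': 128, u'name': u'images'})
2018-01-28 11:16:15,536 p=3697 u=mistral | skipping: [192.168.0.8] => (item={u'rule_name': u'', u'pg_num': 128, u'name': u'metrics'})
"""

if "metrics" in skipped_pools(sample_log):
    print("creation of the 'metrics' pool was skipped by ceph-ansible")
```

A skipped task does not fail the playbook, which is consistent with the "returned: 0" seen in the mistral log, so a check like this has to look at the task results rather than the exit code.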