Description of problem:

While upgrading the overcloud from OSP14 to OSP15, after reinstalling the first controller on RHEL8 and running the upgrade run step on that node, the upgrade gets stuck while creating containers in STEP 3. The cinder container is stuck trying to perform the cinder-manage db sync step; however, mysql is not running.

Version-Release number of selected component (if applicable):
RHOSP 15

How reproducible:
always

Actual results:
openstack upgrade run keeps running in step 3, waiting on the cinder_api container

Expected results:
openstack upgrade run on controller-1 succeeds

Additional info:
Some more information:

- cinder-manage db sync failed because the galera containers are down.
- It seems the DB is down because podman was not able to properly remove the old containers:

"<13>Nov 6 15:36:34 puppet-user: Transaction evaluation: 0.36",
"<13>Nov 6 15:36:34 puppet-user: Catalog application: 0.37",
"<13>Nov 6 15:36:34 puppet-user: Config retrieval: 0.80",
"<13>Nov 6 15:36:34 puppet-user: Last run: 1573054594",
"<13>Nov 6 15:36:34 puppet-user: Resources: 0.00",
"<13>Nov 6 15:36:34 puppet-user: Total: 0.37",
"<13>Nov 6 15:36:34 puppet-user: Version:",
"<13>Nov 6 15:36:34 puppet-user: Config: 1573054593",
"<13>Nov 6 15:36:34 puppet-user: Puppet: 5.5.10",
"+ '[' -f /root/.my.cnf -a -f /var/lib/config-data/ceilometer/root/.my.cnf ']'",
"+ rsync -a -R --delay-updates --delete-after --exclude=/etc/puppetlabs/ --exclude=/opt/puppetlabs/ /etc /root /opt /var/spool/cron /var/lib/config-data/ceilometer",
"++ stat -c %y /var/lib/config-data/ceilometer.origin_of_time",
"+ echo 'Gathering files modified after 2019-11-06 15:36:21.956335971 +0000'",
"+ mkdir -p /var/lib/config-data/puppet-generated/ceilometer",
"+ rsync -a -R -0 --delay-updates --delete-after --exclude=/etc/puppetlabs/ --exclude=/opt/puppetlabs/ --files-from=/dev/fd/63 / /var/lib/config-data/puppet-generated/ceilometer",
"++ find /etc /root /opt /var/spool/cron -newer /var/lib/config-data/ceilometer.origin_of_time -not -path '/etc/puppet*' -print0",
"+ tar -c --mtime=1970-01-01 '--exclude=*/etc/swift/backups/*' '--exclude=*/etc/libvirt/passwd.db' -f - /var/lib/config-data/ceilometer",
"+ tar -c --mtime=1970-01-01 '--exclude=*/etc/swift/backups/*' '--exclude=*/etc/libvirt/passwd.db' -f - /var/lib/config-data/puppet-generated/ceilometer --mtime=1970-01-01",
"2019-11-06 15:36:35,230 INFO: 276650 -- Removing container: container-puppet-ceilometer-hw1xea3h",
"2019-11-06 15:36:35,368 DEBUG: 276650 -- container-puppet-ceilometer-hw1xea3h",
"2019-11-06 15:36:35,368 DEBUG: 276650 -- Error: refusing to remove \"container-puppet-ceilometer-hw1xea3h\" as it exists in libpod as container 7b398d0d0d5ab98c000cf4c9e37e486bbeb21532d9b01289e03a15d1f471c6e6: container already exists",
"2019-11-06 15:36:35,368 INFO: 276650 -- Finished processing puppet configs for ceilometer",
"2019-11-06 15:36:35,369 INFO: 276650 -- Starting configuration of cinder using image 172.16.0.1:8787/rhosp15-rhel8/openstack-cinder-api:15.0-70",
"2019-11-06 15:36:35,369 DEBUG: 276650 -- config_volume cinder",
"2019-11-06 15:36:35,369 DEBUG: 276650 -- puppet_tags file,file_line,concat,augeas,cron,cinder_config,cinder_type,file,concat,file_line,cinder_config,file,concat,file_line,cinder_config,file,concat,file_line",
"2019-11-06 15:36:35,369 DEBUG: 276650 -- manifest include ::tripleo::profile::base::cinder::api",
"include ::tripleo::profile::base::cinder::scheduler",
"include ::tripleo::profile::base::lvm",
"2019-11-06 15:36:35,369 DEBUG: 276650 -- config_image 172.16.0.1:8787/rhosp15-rhel8/openstack-cinder-api:15.0-70",
"2019-11-06 15:36:35,369 DEBUG: 276650 -- volumes []",
"2019-11-06 15:36:35,369 DEBUG: 276650 -- privileged False",
"2019-11-06 15:36:35,369 DEBUG: 276650 -- check_mode 0",
"2019-11-06 15:36:35,452 INFO: 276650 -- Removing container: container-puppet-cinder-6d35s3eu",
"2019-11-06 15:36:35,530 DEBUG: 276650 -- container-puppet-cinder-6d35s3eu",
"2019-11-06 15:36:35,530 DEBUG: 276650 -- Error: no container with ID or name \"container-puppet-cinder-6d35s3eu\" found: no such container",

- The deployment does not stop here; it moves on from step 1 to step 3 even after the failure, which is probably a different bug.
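As a workaround sketch, the stale container that podman refuses to remove by name can be removed by its libpod ID instead. A minimal illustration, using the error text from the log above as input (the parsing approach and the `podman rm` invocation are my suggestion, not something the deployment tooling does):

```shell
# Extract the 64-character libpod container ID from the "refusing to remove"
# error above, so the stale container can be removed by ID instead of by name.
err='Error: refusing to remove "container-puppet-ceilometer-hw1xea3h" as it exists in libpod as container 7b398d0d0d5ab98c000cf4c9e37e486bbeb21532d9b01289e03a15d1f471c6e6: container already exists'
cid=$(printf '%s\n' "$err" | sed -n 's/.*as container \([0-9a-f]\{64\}\).*/\1/p')
echo "$cid"
# On the affected controller one would then run (not executed here):
#   sudo podman rm -f "$cid"
```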
The source of the issue is in the pacemaker-managed services. When trying to create the resources during the deploy steps, we consistently get the following error: "Error: unable to get cib":

2019-11-07 07:12:19,839 p=55349 u=mistral | TASK [Start containers for step 2] *********************************************
2019-11-07 07:12:19,839 p=55349 u=mistral | Thursday 07 November 2019 07:12:19 -0500 (0:00:00.127) 0:12:43.998 *****
2019-11-07 07:19:57,461 p=55349 u=mistral | ok: [lab-controller01] => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}
2019-11-07 07:19:57,569 p=55349 u=mistral | TASK [Debug output for task: Start containers for step 2] **********************
2019-11-07 07:19:57,569 p=55349 u=mistral | Thursday 07 November 2019 07:19:57 -0500 (0:07:37.730) 0:20:21.728 *****
2019-11-07 07:19:57,619 p=55349 u=mistral | fatal: [lab-controller01]: FAILED! => {
"failed_when_result": true,
"outputs.stdout_lines | default([]) | union(outputs.stderr_lines | default([]))": [
"stdout: 42d32709434380a345435f46d32e76ce63a0fdd0515316d5d600b496315a6c31",
"",
"stderr: Trying to pull 172.16.0.1:8787/rhosp15-rhel8/openstack-aodh-api:15.0-66...Getting image source signatures",
"Copying blob sha256:a60e73ae88b1d89e3692266bef07acc59a73db24382fdd19fc273c32aa5e97fb",
"Copying blob sha256:641d7cc5cbc48a13c68806cf25d5bcf76ea2157c3181e1db4f5d0edae34954ac",
"Copying blob sha256:c65691897a4d140d441e2024ce086de996b1c4620832b90c973db81329577274",
"Copying blob sha256:88ae1403b98ee0e6074e30cd01575ab6dec0fe566a10297a504b98b35791360a",
"Copying blob sha256:f87dfcac8eff5df8645ed06bc1809d522ce69d0280517d38d89a988b68bcc419",
"Copying blob sha256:4729a68b42a362a0a3b210060b03f45e54bf9293fecc10235d4ce9fcaa7c251b",
"Copying config sha256:42d32709434380a345435f46d32e76ce63a0fdd0515316d5d600b496315a6c31",
.................
"+ rc=6",
"Error running ['podman', 'run', '--name', 'haproxy_init_bundle', '--label', 'config_id=tripleo_step2', '--label', 'container_name=haproxy_init_bundle', '--label', 'managed_by=paunch', '--label', 'config_data={\"command\": [\"/container_puppet_apply.sh\", \"2\", \"file,file_line,concat,augeas,pacemaker::resource::bundle,pacemaker::property,pacemaker::resource::ip,pacemaker::resource::ocf,pacemaker::constraint::order,pacemaker::constraint::colocation\", \"include ::tripleo::profile::base::pacemaker; include ::tripleo::profile::pacemaker::haproxy_bundle\", \"\"], \"detach\": false, \"environment\": [\"TRIPLEO_DEPLOY_IDENTIFIER=1573126751\"], \"image\": \"172.16.0.1:8787/rhosp15-rhel8/openstack-haproxy:15.0-76\", \"ipc\": \"host\", \"net\": \"host\", \"privileged\": true, \"start_order\": 3, \"user\": \"root\", \"volumes\": [\"/etc/hosts:/etc/hosts:ro\", \"/etc/localtime:/etc/localtime:ro\", \"/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro\", \"/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro\", \"/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro\", \"/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro\", \"/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro\", \"/dev/log:/dev/log\", \"/var/lib/container-config-scripts/container_puppet_apply.sh:/container_puppet_apply.sh:ro\", \"/etc/puppet:/tmp/puppet-etc:ro\", \"/usr/share/openstack-puppet/modules:/usr/share/openstack-puppet/modules:ro\", \"/etc/pki/tls/private/overcloud_endpoint.pem:/etc/pki/tls/private/overcloud_endpoint.pem:ro\"]}', '--conmon-pidfile=/var/run/haproxy_init_bundle.pid', '--log-driver', 'k8s-file', '--log-opt', 'path=/var/log/containers/stdouts/haproxy_init_bundle.log', '--env=TRIPLEO_DEPLOY_IDENTIFIER=1573126751', '--net=host', '--ipc=host', '--privileged=true', '--user=root', '--volume=/etc/hosts:/etc/hosts:ro', '--volume=/etc/localtime:/etc/localtime:ro', '--volume=/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro', '--volume=/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro', '--volume=/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro', '--volume=/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro', '--volume=/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro', '--volume=/dev/log:/dev/log', '--volume=/var/lib/container-config-scripts/container_puppet_apply.sh:/container_puppet_apply.sh:ro', '--volume=/etc/puppet:/tmp/puppet-etc:ro', '--volume=/usr/share/openstack-puppet/modules:/usr/share/openstack-puppet/modules:ro', '--volume=/etc/pki/tls/private/overcloud_endpoint.pem:/etc/pki/tls/private/overcloud_endpoint.pem:ro', '172.16.0.1:8787/rhosp15-rhel8/openstack-haproxy:15.0-76', '/container_puppet_apply.sh', '2', 'file,file_line,concat,augeas,pacemaker::resource::bundle,pacemaker::property,pacemaker::resource::ip,pacemaker::resource::ocf,pacemaker::constraint::order,pacemaker::constraint::colocation', 'include ::tripleo::profile::base::pacemaker; include ::tripleo::profile::pacemaker::haproxy_bundle', '']. [6]",
"Notice: Scope(Class[Tripleo::Firewall::Post]): At this stage, all network traffic is blocked.",
..............
"Info: Haproxy::Config[haproxy]: Unscheduling all events on Haproxy::Config[haproxy]",
"Notice: Applied catalog in 5.24 seconds",
" Total: 2",
" Success: 2",
" Failure: 8",
" Total: 10",
" Changed: 1",
" Skipped: 196",
" Failed: 8",
" Out of sync: 9",
" Total: 249",
" Concat file: 0.00",
" Concat fragment: 0.00",
" File: 0.08",
" Pcmk property: 1.70",
" Last run: 1573129196",
" Pcmk resource: 2.55",
" Config retrieval: 2.76",
" Transaction evaluation: 5.23",
" Catalog application: 5.24",
" Total: 5.24",
" Config: 1573129188",
"+ TAGS=file,file_line,concat,augeas,pacemaker::resource::bundle,pacemaker::property,pacemaker::resource::ip,pacemaker::resource::ocf,pacemaker::constraint::order,pacemaker::constraint::colocation",
"+ CONFIG='include ::tripleo::profile::base::pacemaker; include ::tripleo::profile::pacemaker::haproxy_bundle'",
"+ puppet apply --verbose --detailed-exitcodes --summarize --color=false --modulepath /etc/puppet/modules:/opt/stack/puppet-modules:/usr/share/openstack-puppet/modules --tags file,file_line,concat,augeas,pacemaker::resource::bundle,pacemaker::property,pacemaker::resource::ip,pacemaker::resource::ocf,pacemaker::constraint::order,pacemaker::constraint::colocation -e 'noop_resource('\''package'\''); include ::tripleo::profile::base::pacemaker; include ::tripleo::profile::pacemaker::haproxy_bundle'",
"Warning: tag is a metaparam; this value will inherit to all contained resources in the tripleo::firewall::rule definition",
"Warning: ModuleLoader: module 'concat' has unresolved dependencies - it will only see those that are resolved. Use 'puppet module list --tree' to see information about modules\n (file & line not available)",
" with Stdlib::Compat::Hash. There is further documentation for validate_legacy function in the README. at [\"/etc/puppet/modules/tripleo/manifests/firewall/rule.pp\", 148]:",
"Warning: This method is deprecated, please use match expressions with Stdlib::Compat::Ipv6 instead. They are described at https://docs.puppet.com/puppet/latest/reference/lang_data_type.html#match-expressions. at [\"/etc/puppet/modules/tripleo/manifests/pacemaker/haproxy_with_vip.pp\", 75]:",
"Warning: Scope(Haproxy::Config[haproxy]): haproxy: The $merge_options parameter will default to true in the next major release. Please review the documentation regarding the implications.",
"Error: /Stage[main]/Pacemaker::Stonith/Pacemaker::Property[Disable STONITH]/Pcmk_property[property--stonith-enabled]: Could not evaluate: backup_cib: Running: pcs cluster cib /var/lib/pacemaker/cib/puppet-cib-backup20191107-10-1mteb0l failed with code: 1 -> Error: unable to get cib",
"Error: /Stage[main]/Tripleo::Profile::Pacemaker::Haproxy_bundle/Pacemaker::Property[haproxy-role-lab-controller01]/Pcmk_property[property-lab-controller01-haproxy-role]: Could not evaluate: backup_cib: Running: pcs cluster cib /var/lib/pacemaker/cib/puppet-cib-backup20191107-10-lsb5qu failed with code: 1 -> Error: unable to get cib",
"Error: /Stage[main]/Tripleo::Profile::Pacemaker::Haproxy_bundle/Tripleo::Pacemaker::Haproxy_with_vip[haproxy_and_control_vip]/Pacemaker::Resource::Ip[control_vip]/Pcmk_resource[ip-172.16.0.250]: Could not evaluate: backup_cib: Running: pcs cluster cib /var/lib/pacemaker/cib/puppet-cib-backup20191107-10-3k6ivn failed with code: 1 -> Error: unable to get cib",

[heat-admin@lab-controller01 ~]$ sudo podman ps --all | grep "Exited (6)"
32803f7b8932  172.16.0.1:8787/rhosp15-rhel8/openstack-haproxy:15.0-76  dumb-init --singl...  3 days ago  Exited (6) 3 days ago  haproxy_init_bundle
c0808beb412c  172.16.0.1:8787/rhosp15-rhel8/openstack-redis:15.0-72    dumb-init --singl...  3 days ago  Exited (6) 3 days ago  redis_init_bundle

The origin of the problem seems to lie in the different pacemaker package versions found in the container images and on the controller:

CONTAINER:
[heat-admin@lab-controller01 ~]$ sudo podman run -it --name haproxy_init_test --net host 172.16.0.1:8787/rhosp15-rhel8/openstack-haproxy:15.0-76 bash
()[root@lab-controller01 /]# rpm -qa | grep pcs
pcs-0.10.1-4.el8_0.4.x86_64
()[root@lab-controller01 /]# rpm -qa | grep pacemaker
pacemaker-cli-2.0.1-4.el8_0.4.x86_64
pacemaker-libs-2.0.1-4.el8_0.4.x86_64
pacemaker-remote-2.0.1-4.el8_0.4.x86_64
pacemaker-2.0.1-4.el8_0.4.x86_64
puppet-pacemaker-0.7.3-0.20190807230458.8b30131.el8ost.noarch
pacemaker-schemas-2.0.1-4.el8_0.4.noarch
pacemaker-cluster-libs-2.0.1-4.el8_0.4.x86_64
()[root@lab-controller01 /]# exit
exit

CONTROLLER-0:
[heat-admin@lab-controller01 ~]$ sudo rpm -qa | grep pcs
pcs-0.10.2-4.el8.x86_64
[heat-admin@lab-controller01 ~]$ sudo rpm -qa | grep pacemaker
puppet-pacemaker-0.7.3-0.20190807230458.8b30131.el8ost.noarch
pacemaker-schemas-2.0.2-3.el8_1.2.noarch
pacemaker-cluster-libs-2.0.2-3.el8_1.2.x86_64
pacemaker-2.0.2-3.el8_1.2.x86_64
pacemaker-libs-2.0.2-3.el8_1.2.x86_64
pacemaker-cli-2.0.2-3.el8_1.2.x86_64

Note that the job is using the stage CDN and registry.redhat.io to provide the containers. Digging a little further, it looks like the containers are built on RHEL 8.0 while the controller is already on RHEL 8.1:

[heat-admin@lab-controller01 ~]$ sudo podman run -it --name haproxy_init_test --net host 172.16.0.1:8787/rhosp15-rhel8/openstack-haproxy:15.0-76 bash
()[root@lab-controller01 /]# cat /etc/redhat-release
Red Hat Enterprise Linux release 8.0 (Ootpa)
[heat-admin@lab-controller01 ~]$ cat /etc/redhat-release
Red Hat Enterprise Linux release 8.1 (Ootpa)

In order to bypass this issue, we need the very same pacemaker packages in the pacemaker-managed container images as we have on the host.
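The mismatch can also be checked mechanically. A small sketch comparing the two pacemaker version strings reported above with `sort -V` (in practice the strings would be captured from `rpm -qa` on the host and inside a container shell, as shown; the hardcoded values here are just the ones from this report):

```shell
# Version strings taken from the rpm -qa output above: the image carries the
# RHEL 8.0 pacemaker build, the host the RHEL 8.1 one.
container_ver="2.0.1-4.el8_0.4"
host_ver="2.0.2-3.el8_1.2"
if [ "$container_ver" = "$host_ver" ]; then
  echo "pacemaker versions match"
else
  # sort -V orders version strings; the last line is the newest version.
  newest=$(printf '%s\n%s\n' "$container_ver" "$host_ver" | sort -V | tail -n1)
  echo "pacemaker version mismatch, newest: $newest"
fi
```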
This bug blocked us from continuing with our upgrade testing during the EMEA Hackfest in two environments, so it is quite important to get it solved: an upgrade using the CDN + Red Hat's registry is not possible at the moment.
So, unfortunately, pcmk requires the same content across the host and the HA containers (to remove this restriction in pcmk/pcs, https://bugzilla.redhat.com/show_bug.cgi?id=1603613 would have to be fixed). So right now the only ways to 'fix' this are:

A) use a RHEL 8.0 overcloud with the RHEL 8.0 based containers
B) get some RHEL 8.1 HA containers built:
   B.1) either we have releng rebuild them based on 8.1 content, or
   B.2) we build them ourselves.

On B.2), we could probably work on some commands to do exactly that if there is interest/need?
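For option B.2, a hypothetical sketch of what such a rebuild could look like: layer a `dnf update` of the pacemaker stack on top of the existing image. The image path and tag come from this environment; the `-el81` tag, the Containerfile name, and the assumption that valid RHEL 8.1 repos/entitlements are reachable from the build host are mine:

```shell
# Hypothetical Containerfile for rebuilding the HA image on RHEL 8.1 content.
cat > Containerfile.haproxy-el81 <<'EOF'
FROM 172.16.0.1:8787/rhosp15-rhel8/openstack-haproxy:15.0-76
# Assumes RHEL 8.1 repos/entitlements are available in the build environment.
RUN dnf update -y pacemaker pacemaker-cli pacemaker-libs pacemaker-remote \
        pacemaker-schemas pacemaker-cluster-libs pcs && \
    dnf clean all
EOF
# Build and retag (not executed here; tag name is illustrative):
#   sudo podman build -t 172.16.0.1:8787/rhosp15-rhel8/openstack-haproxy:15.0-76-el81 \
#       -f Containerfile.haproxy-el81 .
grep -c '^FROM' Containerfile.haproxy-el81
```

The same pattern would apply to the other pacemaker-managed images (redis, galera, etc.), since all of them run `pcs` against the host cluster.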
To avoid tracking this issue in multiple places, let's close this bugzilla as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1769408.

*** This bug has been marked as a duplicate of bug 1769408 ***