Bug 1769473 - OSP14->OSP15 Major upgrade stuck in Error: unable to get cib
Summary: OSP14->OSP15 Major upgrade stuck in Error: unable to get cib
Keywords:
Status: CLOSED DUPLICATE of bug 1769408
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo
Version: 15.0 (Stein)
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: James Slagle
QA Contact: Arik Chernetsky
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-11-06 17:07 UTC by Mauro Oddi
Modified: 2019-11-11 14:15 UTC
CC: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-11-11 14:15:11 UTC
Target Upstream Version:
Embargoed:



Description Mauro Oddi 2019-11-06 17:07:30 UTC
Description of problem:
While upgrading the overcloud from OSP14 to OSP15, after reinstalling the first controller with RHEL 8 and running the upgrade run step on that node, the upgrade gets stuck while creating containers in STEP 3.
The cinder container is stuck trying to perform a cinder-manage db sync step, but mysql is not running.
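A quick way to confirm this state (a sketch; the container names assume the standard OSP15/TripleO naming):

    sudo podman ps --all | grep -E 'galera|mysql'           # galera/mysql containers are down
    sudo podman logs cinder_api_db_sync 2>&1 | tail -n 20   # db sync failing to reach the DB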

Version-Release number of selected component (if applicable):
RHOSP 15

How reproducible:
always


Actual results:
openstack upgrade run keeps running in step 3, waiting on the cinder_api container

Expected results:
openstack upgrade run on controller-1 succeeds

Additional info:

Comment 2 Mauro Oddi 2019-11-07 08:23:28 UTC
Some more information:


 - cinder-manage db sync failed because galera containers are down

 - It seems the DB is down because podman was not able to properly remove the old containers:

        "<13>Nov  6 15:36:34 puppet-user:    Transaction evaluation: 0.36",
        "<13>Nov  6 15:36:34 puppet-user:    Catalog application: 0.37",
        "<13>Nov  6 15:36:34 puppet-user:    Config retrieval: 0.80",
        "<13>Nov  6 15:36:34 puppet-user:          Last run: 1573054594",
        "<13>Nov  6 15:36:34 puppet-user:         Resources: 0.00",
        "<13>Nov  6 15:36:34 puppet-user:             Total: 0.37",
        "<13>Nov  6 15:36:34 puppet-user: Version:",
        "<13>Nov  6 15:36:34 puppet-user:            Config: 1573054593",
        "<13>Nov  6 15:36:34 puppet-user:            Puppet: 5.5.10",
        "+ '[' -f /root/.my.cnf -a -f /var/lib/config-data/ceilometer/root/.my.cnf ']'",                                                                                    
        "+ rsync -a -R --delay-updates --delete-after --exclude=/etc/puppetlabs/ --exclude=/opt/puppetlabs/ /etc /root /opt /var/spool/cron /var/lib/config-data/ceilometer",
        "++ stat -c %y /var/lib/config-data/ceilometer.origin_of_time",
        "+ echo 'Gathering files modified after 2019-11-06 15:36:21.956335971 +0000'",                                                                                      
        "+ mkdir -p /var/lib/config-data/puppet-generated/ceilometer",
        "+ rsync -a -R -0 --delay-updates --delete-after --exclude=/etc/puppetlabs/ --exclude=/opt/puppetlabs/ --files-from=/dev/fd/63 / /var/lib/config-data/puppet-generate
d/ceilometer",
        "++ find /etc /root /opt /var/spool/cron -newer /var/lib/config-data/ceilometer.origin_of_time -not -path '/etc/puppet*' -print0",                                  
        "+ tar -c --mtime=1970-01-01 '--exclude=*/etc/swift/backups/*' '--exclude=*/etc/libvirt/passwd.db' -f - /var/lib/config-data/ceilometer",                           
        "+ tar -c --mtime=1970-01-01 '--exclude=*/etc/swift/backups/*' '--exclude=*/etc/libvirt/passwd.db' -f - /var/lib/config-data/puppet-generated/ceilometer --mtime=1970
-01-01",
        "2019-11-06 15:36:35,230 INFO: 276650 -- Removing container: container-puppet-ceilometer-hw1xea3h",                                                                 
        "2019-11-06 15:36:35,368 DEBUG: 276650 -- container-puppet-ceilometer-hw1xea3h",                                                                                    
        "2019-11-06 15:36:35,368 DEBUG: 276650 -- Error: refusing to remove \"container-puppet-ceilometer-hw1xea3h\" as it exists in libpod as container 7b398d0d0d5ab98c000c
f4c9e37e486bbeb21532d9b01289e03a15d1f471c6e6: container already exists",
        "2019-11-06 15:36:35,368 INFO: 276650 -- Finished processing puppet configs for ceilometer",                                                                        
        "2019-11-06 15:36:35,369 INFO: 276650 -- Starting configuration of cinder using image 172.16.0.1:8787/rhosp15-rhel8/openstack-cinder-api:15.0-70",                  
        "2019-11-06 15:36:35,369 DEBUG: 276650 -- config_volume cinder",
        "2019-11-06 15:36:35,369 DEBUG: 276650 -- puppet_tags file,file_line,concat,augeas,cron,cinder_config,cinder_type,file,concat,file_line,cinder_config,file,concat,fil
e_line,cinder_config,file,concat,file_line",
        "2019-11-06 15:36:35,369 DEBUG: 276650 -- manifest include ::tripleo::profile::base::cinder::api",                                                                  
        "include ::tripleo::profile::base::cinder::scheduler",
        "include ::tripleo::profile::base::lvm",
        "2019-11-06 15:36:35,369 DEBUG: 276650 -- config_image 172.16.0.1:8787/rhosp15-rhel8/openstack-cinder-api:15.0-70",                                                 
        "2019-11-06 15:36:35,369 DEBUG: 276650 -- volumes []",
        "2019-11-06 15:36:35,369 DEBUG: 276650 -- privileged False",
        "2019-11-06 15:36:35,369 DEBUG: 276650 -- check_mode 0",
        "2019-11-06 15:36:35,452 INFO: 276650 -- Removing container: container-puppet-cinder-6d35s3eu",                                                                     
        "2019-11-06 15:36:35,530 DEBUG: 276650 -- container-puppet-cinder-6d35s3eu",
        "2019-11-06 15:36:35,530 DEBUG: 276650 -- Error: no container with ID or name \"container-puppet-cinder-6d35s3eu\" found: no such container",  


 - The deployment does not stop here; it moves on from step 1 to step 3 even after the failure, which is probably a different bug.
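For reference, a manual cleanup along these lines might unblock the removal (hedged sketch; the ID is the libpod container ID from the error above):

    sudo podman rm -f 7b398d0d0d5ab98c000cf4c9e37e486bbeb21532d9b01289e03a15d1f471c6e6
    sudo podman ps --all | grep container-puppet-ceilometer   # confirm it is gone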

Comment 3 Jose Luis Franco 2019-11-11 11:17:49 UTC
The source of the issue lies in the pacemaker-managed services. When trying to create the resources during the deploy steps, we consistently get the following error:
Error: unable to get cib

2019-11-07 07:12:19,839 p=55349 u=mistral |  TASK [Start containers for step 2] *********************************************
2019-11-07 07:12:19,839 p=55349 u=mistral |  Thursday 07 November 2019  07:12:19 -0500 (0:00:00.127)       0:12:43.998 *****
2019-11-07 07:19:57,461 p=55349 u=mistral |  ok: [lab-controller01] => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}
2019-11-07 07:19:57,569 p=55349 u=mistral |  TASK [Debug output for task: Start containers for step 2] **********************
2019-11-07 07:19:57,569 p=55349 u=mistral |  Thursday 07 November 2019  07:19:57 -0500 (0:07:37.730)       0:20:21.728 *****
2019-11-07 07:19:57,619 p=55349 u=mistral |  fatal: [lab-controller01]: FAILED! => {
    "failed_when_result": true,
    "outputs.stdout_lines | default([]) | union(outputs.stderr_lines | default([]))": [
        "stdout: 42d32709434380a345435f46d32e76ce63a0fdd0515316d5d600b496315a6c31",
        "",
        "stderr: Trying to pull 172.16.0.1:8787/rhosp15-rhel8/openstack-aodh-api:15.0-66...Getting image source signatures",
        "Copying blob sha256:a60e73ae88b1d89e3692266bef07acc59a73db24382fdd19fc273c32aa5e97fb",
        "Copying blob sha256:641d7cc5cbc48a13c68806cf25d5bcf76ea2157c3181e1db4f5d0edae34954ac",
        "Copying blob sha256:c65691897a4d140d441e2024ce086de996b1c4620832b90c973db81329577274",
        "Copying blob sha256:88ae1403b98ee0e6074e30cd01575ab6dec0fe566a10297a504b98b35791360a",
        "Copying blob sha256:f87dfcac8eff5df8645ed06bc1809d522ce69d0280517d38d89a988b68bcc419",
        "Copying blob sha256:4729a68b42a362a0a3b210060b03f45e54bf9293fecc10235d4ce9fcaa7c251b",
        "Copying config sha256:42d32709434380a345435f46d32e76ce63a0fdd0515316d5d600b496315a6c31",

.................

        "+ rc=6",
        "Error running ['podman', 'run', '--name', 'haproxy_init_bundle', '--label', 'config_id=tripleo_step2', '--label', 'container_name=haproxy_init_bundle', '--label', 'managed_by=paunch', '--label', 'config_data={\"command\": [\"/container_puppet_apply.sh\", \"2\", \"file,file_line,concat,augeas,pacemaker::resource::bundle,pacemaker::property,pacemaker::resource::ip,pacemaker::resource::ocf,pacemaker::constraint::order,pacemaker::constraint::colocation\", \"include ::tripleo::profile::base::pacemaker; include ::tripleo::profile::pacemaker::haproxy_bundle\", \"\"], \"detach\": false, \"environment\": [\"TRIPLEO_DEPLOY_IDENTIFIER=1573126751\"], \"image\": \"172.16.0.1:8787/rhosp15-rhel8/openstack-haproxy:15.0-76\", \"ipc\": \"host\", \"net\": \"host\", \"privileged\": true, \"start_order\": 3, \"user\": \"root\", \"volumes\": [\"/etc/hosts:/etc/hosts:ro\", \"/etc/localtime:/etc/localtime:ro\", \"/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro\", \"/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro\", \"/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro\", \"/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro\", \"/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro\", \"/dev/log:/dev/log\", \"/var/lib/container-config-scripts/container_puppet_apply.sh:/container_puppet_apply.sh:ro\", \"/etc/puppet:/tmp/puppet-etc:ro\", \"/usr/share/openstack-puppet/modules:/usr/share/openstack-puppet/modules:ro\", \"/etc/pki/tls/private/overcloud_endpoint.pem:/etc/pki/tls/private/overcloud_endpoint.pem:ro\"]}', '--conmon-pidfile=/var/run/haproxy_init_bundle.pid', '--log-driver', 'k8s-file', '--log-opt', 'path=/var/log/containers/stdouts/haproxy_init_bundle.log', '--env=TRIPLEO_DEPLOY_IDENTIFIER=1573126751', '--net=host', '--ipc=host', '--privileged=true', '--user=root', '--volume=/etc/hosts:/etc/hosts:ro', '--volume=/etc/localtime:/etc/localtime:ro', '--volume=/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro', '--volume=/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro', '--volume=/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro', '--volume=/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro', '--volume=/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro', '--volume=/dev/log:/dev/log', '--volume=/var/lib/container-config-scripts/container_puppet_apply.sh:/container_puppet_apply.sh:ro', '--volume=/etc/puppet:/tmp/puppet-etc:ro', '--volume=/usr/share/openstack-puppet/modules:/usr/share/openstack-puppet/modules:ro', '--volume=/etc/pki/tls/private/overcloud_endpoint.pem:/etc/pki/tls/private/overcloud_endpoint.pem:ro', '172.16.0.1:8787/rhosp15-rhel8/openstack-haproxy:15.0-76', '/container_puppet_apply.sh', '2', 'file,file_line,concat,augeas,pacemaker::resource::bundle,pacemaker::property,pacemaker::resource::ip,pacemaker::resource::ocf,pacemaker::constraint::order,pacemaker::constraint::colocation', 'include ::tripleo::profile::base::pacemaker; include ::tripleo::profile::pacemaker::haproxy_bundle', '']. [6]",
        "Notice: Scope(Class[Tripleo::Firewall::Post]): At this stage, all network traffic is blocked.",

..............

        "Info: Haproxy::Config[haproxy]: Unscheduling all events on Haproxy::Config[haproxy]",
        "Notice: Applied catalog in 5.24 seconds",
        "            Total: 2",
        "          Success: 2",
        "          Failure: 8",
        "            Total: 10",
        "          Changed: 1",
        "          Skipped: 196",
        "           Failed: 8",
        "      Out of sync: 9",
        "            Total: 249",
        "      Concat file: 0.00",
        "   Concat fragment: 0.00",
        "             File: 0.08",
        "    Pcmk property: 1.70",
        "         Last run: 1573129196",
        "    Pcmk resource: 2.55",
        "   Config retrieval: 2.76",
        "   Transaction evaluation: 5.23",
        "   Catalog application: 5.24",
        "            Total: 5.24",
        "           Config: 1573129188",

        "+ TAGS=file,file_line,concat,augeas,pacemaker::resource::bundle,pacemaker::property,pacemaker::resource::ip,pacemaker::resource::ocf,pacemaker::constraint::order,pacemaker::constraint::colocation",
        "+ CONFIG='include ::tripleo::profile::base::pacemaker; include ::tripleo::profile::pacemaker::haproxy_bundle'",
        "+ puppet apply --verbose --detailed-exitcodes --summarize --color=false --modulepath /etc/puppet/modules:/opt/stack/puppet-modules:/usr/share/openstack-puppet/modules --tags file,file_line,concat,augeas,pacemaker::resource::bundle,pacemaker::property,pacemaker::resource::ip,pacemaker::resource::ocf,pacemaker::constraint::order,pacemaker::constraint::colocation -e 'noop_resource('\\''package'\\''); include ::tripleo::profile::base::pacemaker; include ::tripleo::profile::pacemaker::haproxy_bundle'",
        "Warning: tag is a metaparam; this value will inherit to all contained resources in the tripleo::firewall::rule definition",
        "Warning: ModuleLoader: module 'concat' has unresolved dependencies - it will only see those that are resolved. Use 'puppet module list --tree' to see information about modules\\n   (file & line not available)",
        "                    with Stdlib::Compat::Hash. There is further documentation for validate_legacy function in the README. at [\"/etc/puppet/modules/tripleo/manifests/firewall/rule.pp\", 148]:",
        "Warning: This method is deprecated, please use match expressions with Stdlib::Compat::Ipv6 instead. They are described at https://docs.puppet.com/puppet/latest/reference/lang_data_type.html#match-expressions. at [\"/etc/puppet/modules/tripleo/manifests/pacemaker/haproxy_with_vip.pp\", 75]:",
        "Warning: Scope(Haproxy::Config[haproxy]): haproxy: The $merge_options parameter will default to true in the next major release. Please review the documentation regarding the implications.",
        "Error: /Stage[main]/Pacemaker::Stonith/Pacemaker::Property[Disable STONITH]/Pcmk_property[property--stonith-enabled]: Could not evaluate: backup_cib: Running: pcs cluster cib /var/lib/pacemaker/cib/puppet-cib-backup20191107-10-1mteb0l failed with code: 1 -> Error: unable to get cib",
        "Error: /Stage[main]/Tripleo::Profile::Pacemaker::Haproxy_bundle/Pacemaker::Property[haproxy-role-lab-controller01]/Pcmk_property[property-lab-controller01-haproxy-role]: Could not evaluate: backup_cib: Running: pcs cluster cib /var/lib/pacemaker/cib/puppet-cib-backup20191107-10-lsb5qu failed with code: 1 -> Error: unable to get cib",
        "Error: /Stage[main]/Tripleo::Profile::Pacemaker::Haproxy_bundle/Tripleo::Pacemaker::Haproxy_with_vip[haproxy_and_control_vip]/Pacemaker::Resource::Ip[control_vip]/Pcmk_resource[ip-172.16.0.250]: Could not evaluate: backup_cib: Running: pcs cluster cib /var/lib/pacemaker/cib/puppet-cib-backup20191107-10-3k6ivn failed with code: 1 -> Error: unable to get cib",

[heat-admin@lab-controller01 ~]$ sudo podman ps --all | grep "Exited (6)"
32803f7b8932  172.16.0.1:8787/rhosp15-rhel8/openstack-haproxy:15.0-76                  dumb-init --singl...  3 days ago  Exited (6) 3 days ago           haproxy_init_bundle
c0808beb412c  172.16.0.1:8787/rhosp15-rhel8/openstack-redis:15.0-72                    dumb-init --singl...  3 days ago  Exited (6) 3 days ago           redis_init_bundle
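As a sanity check, the exact command from the Pcmk_property errors above can also be run on the host itself; if it fails there too, the cluster is simply down, whereas if it succeeds, the failure is specific to the container images (a sketch, run on lab-controller01):

    sudo pcs cluster cib /tmp/cib-check.xml && echo 'CIB retrieved OK'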



The origin of the problem seems to lie in the different pacemaker package versions found in the container images versus the controller host:

CONTAINER:

[heat-admin@lab-controller01 ~]$ sudo podman run -it --name haproxy_init_test --net host 172.16.0.1:8787/rhosp15-rhel8/openstack-haproxy:15.0-76 bash
()[root@lab-controller01 /]# rpm -qa | grep pcs
pcs-0.10.1-4.el8_0.4.x86_64
()[root@lab-controller01 /]# rpm -qa | grep pacemaker
pacemaker-cli-2.0.1-4.el8_0.4.x86_64
pacemaker-libs-2.0.1-4.el8_0.4.x86_64
pacemaker-remote-2.0.1-4.el8_0.4.x86_64
pacemaker-2.0.1-4.el8_0.4.x86_64
puppet-pacemaker-0.7.3-0.20190807230458.8b30131.el8ost.noarch
pacemaker-schemas-2.0.1-4.el8_0.4.noarch
pacemaker-cluster-libs-2.0.1-4.el8_0.4.x86_64
()[root@lab-controller01 /]# exit
exit

CONTROLLER-0
[heat-admin@lab-controller01 ~]$ sudo rpm -qa | grep pcs
pcs-0.10.2-4.el8.x86_64
[heat-admin@lab-controller01 ~]$ sudo rpm -qa | grep pacemaker
puppet-pacemaker-0.7.3-0.20190807230458.8b30131.el8ost.noarch
pacemaker-schemas-2.0.2-3.el8_1.2.noarch
pacemaker-cluster-libs-2.0.2-3.el8_1.2.x86_64
pacemaker-2.0.2-3.el8_1.2.x86_64
pacemaker-libs-2.0.2-3.el8_1.2.x86_64
pacemaker-cli-2.0.2-3.el8_1.2.x86_64


However, the job is using the stage CDN and registry.redhat.io to provide the containers. Digging a little further, it looks like the containers are built on RHEL 8.0 while the controller is already on RHEL 8.1:

[heat-admin@lab-controller01 ~]$ sudo podman run -it --name haproxy_init_test --net host 172.16.0.1:8787/rhosp15-rhel8/openstack-haproxy:15.0-76 bash
()[root@lab-controller01 /]# cat /etc/redhat-release 
Red Hat Enterprise Linux release 8.0 (Ootpa)

[heat-admin@lab-controller01 ~]$ cat /etc/redhat-release 
Red Hat Enterprise Linux release 8.1 (Ootpa)
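A quick way to diff the relevant packages between the host and an image (a sketch, using the haproxy image from this report):

    for pkg in pacemaker pacemaker-libs pcs; do
        echo "host:      $(rpm -q $pkg)"
        echo "container: $(sudo podman run --rm 172.16.0.1:8787/rhosp15-rhel8/openstack-haproxy:15.0-76 rpm -q $pkg)"
    done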


In order to bypass this issue, we need the very same pacemaker packages in the pacemaker-managed container images as we have on the host.

This bug blocked our upgrade testing in two environments during the EMEA Hackfest, so it is quite important to get it solved: at the moment, an upgrade using the CDN plus Red Hat's registry is not possible.

Comment 4 Michele Baldessari 2019-11-11 11:32:10 UTC
Unfortunately, pcmk requires the same content across the host and the HA containers (removing this restriction would require fixing https://bugzilla.redhat.com/show_bug.cgi?id=1603613 in pcmk/pcs). So right now the only way to 'fix' this is to either:
A) use a rhel 8.0 overcloud with the rhel-8.0 based containers
B) get some rhel 8.1 HA containers built
B.1) Either we have releng rebuild them based on 8.1 content or
B.2) we build them ourselves.

Now, on to B.2): we could probably work out some commands to do exactly that if there is interest/need; a rough sketch follows.
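Something along these lines could be a starting point (not tested; it assumes the RHEL 8.1 HA repos are reachable during the build and that the local registry at 172.16.0.1:8787 accepts pushes):

    cat > Containerfile.ha-el81 <<'EOF'
    FROM 172.16.0.1:8787/rhosp15-rhel8/openstack-haproxy:15.0-76
    # Bring the pacemaker/pcs packages up to the versions installed on the 8.1 host
    RUN dnf -y update 'pacemaker*' pcs && dnf clean all
    EOF
    sudo podman build -t 172.16.0.1:8787/rhosp15-rhel8/openstack-haproxy:15.0-76.el81 -f Containerfile.ha-el81 .
    sudo podman push 172.16.0.1:8787/rhosp15-rhel8/openstack-haproxy:15.0-76.el81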

Comment 5 Jose Luis Franco 2019-11-11 14:15:11 UTC
To avoid tracking this issue in multiple places, let's close this bugzilla as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1769408.

*** This bug has been marked as a duplicate of bug 1769408 ***

