Bug 1598038 - [RHOSP13] [Instance HA] Instance HA deployment on OSP13 via director fails
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: puppet-tripleo
Version: 13.0 (Queens)
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: z2
Target Release: 13.0 (Queens)
Assignee: Michele Baldessari
QA Contact: pkomarov
URL:
Whiteboard:
Duplicates: 1579469
Depends On:
Blocks:
Reported: 2018-07-04 08:15 UTC by MD Sufiyan
Modified: 2022-08-09 09:27 UTC (History)

Fixed In Version: puppet-tripleo-8.3.4-2.el7ost puppet-pacemaker-0.7.2-0.20180423212250.el7ost
Doc Type: Bug Fix
Doc Text:
Instance HA deployments failed because of a race condition: Pacemaker properties were set on the Compute nodes before the Pacemaker cluster was fully up, so the operation failed with 'Error: unable to get cib'. With this fix, deployments that use Instance HA (IHA) no longer hit this error.
Clone Of:
Environment:
Last Closed: 2018-12-03 15:42:01 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 569565 0 'None' MERGED Fix a small race window when setting up pcmk remotes 2021-02-18 18:56:24 UTC
OpenStack gerrit 581017 0 'None' MERGED Make sure remotes are fully up before proceeding 2021-02-18 18:56:25 UTC
Red Hat Issue Tracker OSP-9261 0 None None None 2022-08-09 09:27:29 UTC
Red Hat Product Errata RHBA-2018:2574 0 None None None 2018-08-29 16:38:52 UTC

Description MD Sufiyan 2018-07-04 08:15:48 UTC
Description of problem:
Instance HA deployment on OSP13 via director fails

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Deploy OSP13 (3 controllers + 2 computes) via IR.
2. Delete the overcloud.
3. Redeploy the overcloud with compute-instanceha.yaml, fencing.yaml, and roles_data.yaml (for the Controller and IHA roles).

Actual results:

The deployment failed with the following error:

~~~
 Stack overcloud CREATE_FAILED

overcloud.AllNodesDeploySteps.ComputeInstanceHADeployment_Step2.0:
  resource_type: OS::Heat::StructuredDeployment
  physical_resource_id: 3b9d347b-cf80-4dd6-8386-75abfeb065bf
  status: CREATE_FAILED
  status_reason: |
    Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 2
  deploy_stdout: |
    ...
            "                    with Stdlib::Compat::Hash. There is further documentation for validate_legacy function in the README. at [\"/etc/puppet/modules/ssh/manifests/server.pp\", 12]:[\"/var/lib/tripleo-config/puppet_step_config.pp\", 41]",
            "Error: unable to get cib",
            "Error: /Stage[main]/Tripleo::Profile::Pacemaker::Compute_instanceha/Pacemaker::Property[compute-instanceha-role-node-property]/Pcmk_property[property-overcloud-novacomputeiha-0-compute-instanceha-role]: Could not evaluate: backup_cib: Running: /usr/sbin/pcs cluster cib /var/lib/pacemaker/cib/puppet-cib-backup20180703-18740-1uwe4us failed with code: 1 -> "
        ]
    }
        to retry, use: --limit @/var/lib/heat-config/heat-config-ansible/1b9b1b44-509a-45b0-b2f1-663a8f41bc7d_playbook.retry

    PLAY RECAP *********************************************************************
    localhost                  : ok=4    changed=1    unreachable=0    failed=1

    (truncated, view all with --long)
  deploy_stderr: |

~~~
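
The 'unable to get cib' line is emitted when `pcs cluster cib` exits non-zero, which the Doc Text attributes to the Pacemaker cluster not being fully up yet. Below is a minimal sketch of a manual wait-and-check that could be run on the affected node. The helper name `retry_until`, the attempt count, and the delay are illustrative assumptions and are not taken from the puppet-tripleo fix.

```shell
#!/bin/bash
# retry_until ATTEMPTS DELAY CMD [ARGS...]
# Hypothetical helper: runs CMD until it exits 0 or ATTEMPTS runs have failed.
# Returns 0 on the first success, 1 if every attempt failed.
retry_until() {
    local attempts="$1" delay="$2" i=1
    shift 2
    while [ "$i" -le "$attempts" ]; do
        # Discard the command's output; only its exit status matters here.
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        # Sleep between attempts, but not after the last one.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}
```

For example, `retry_until 30 10 pcs cluster cib` would poll for roughly five minutes and return non-zero if the CIB never became readable. This only helps to confirm the timing issue by hand; it is not a substitute for the packaged fix.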

Expected results:

Deployment should pass

Additional info:

1) Deployment Script

~~~
#!/bin/bash

nohup openstack overcloud deploy \
--timeout 100 \
--templates /usr/share/openstack-tripleo-heat-templates \
--stack overcloud \
--libvirt-type kvm \
--ntp-server clock.redhat.com \
-r /home/stack/virt/roles_data.yaml \
-e /home/stack/virt/config_lvm.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
-e /home/stack/virt/network/network-environment.yaml \
-e /home/stack/virt/hostnames.yml \
-e /home/stack/virt/debug.yaml \
-e /home/stack/virt/nodes_data.yaml \
-e /home/stack/virt/docker-images.yaml \
-e /home/stack/virt/fencing.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/compute-instanceha.yaml \
--log-file overcloud_deployment_37.log &
~~~

2) openstack stack failures list overcloud --long >> failure.logs

http://pastebin.test.redhat.com/612917

~~~
[stack@undercloud-0 ~]$ tail -20 failure.logs 
            "   (at /etc/puppet/modules/stdlib/lib/puppet/functions/deprecation.rb:28:in `deprecation')", 
            "Warning: This method is deprecated, please use the stdlib validate_legacy function,", 
            "                    with Stdlib::Compat::Bool. There is further documentation for validate_legacy function in the README. at [\"/etc/puppet/modules/ntp/manifests/init.pp\", 54]:[\"/etc/puppet/modules/tripleo/manifests/profile/base/time/ntp.pp\", 29]", 
            "                    with Stdlib::Compat::Absolute_Path. There is further documentation for validate_legacy function in the README. at [\"/etc/puppet/modules/ntp/manifests/init.pp\", 55]:[\"/etc/puppet/modules/tripleo/manifests/profile/base/time/ntp.pp\", 29]", 
            "                    with Stdlib::Compat::String. There is further documentation for validate_legacy function in the README. at [\"/etc/puppet/modules/ntp/manifests/init.pp\", 56]:[\"/etc/puppet/modules/tripleo/manifests/profile/base/time/ntp.pp\", 29]", 
            "                    with Stdlib::Compat::Array. There is further documentation for validate_legacy function in the README. at [\"/etc/puppet/modules/ntp/manifests/init.pp\", 66]:[\"/etc/puppet/modules/tripleo/manifests/profile/base/time/ntp.pp\", 29]", 
            "                    with Pattern[]. There is further documentation for validate_legacy function in the README. at [\"/etc/puppet/modules/ntp/manifests/init.pp\", 68]:[\"/etc/puppet/modules/tripleo/manifests/profile/base/time/ntp.pp\", 29]", 
            "                    with Stdlib::Compat::Numeric. There is further documentation for validate_legacy function in the README. at [\"/etc/puppet/modules/ntp/manifests/init.pp\", 76]:[\"/etc/puppet/modules/tripleo/manifests/profile/base/time/ntp.pp\", 29]", 
            "                    with Stdlib::Compat::Hash. There is further documentation for validate_legacy function in the README. at [\"/etc/puppet/modules/ssh/manifests/server.pp\", 12]:[\"/var/lib/tripleo-config/puppet_step_config.pp\", 41]", 
            "Error: unable to get cib", 
            "Error: /Stage[main]/Tripleo::Profile::Pacemaker::Compute_instanceha/Pacemaker::Property[compute-instanceha-role-node-property]/Pcmk_property[property-overcloud-novacomputeiha-0-compute-instanceha-role]: Could not evaluate: backup_cib: Running: /usr/sbin/pcs cluster cib /var/lib/pacemaker/cib/puppet-cib-backup20180703-18740-1uwe4us failed with code: 1 -> "
        ]
    }
    	to retry, use: --limit @/var/lib/heat-config/heat-config-ansible/1b9b1b44-509a-45b0-b2f1-663a8f41bc7d_playbook.retry
    
    PLAY RECAP *********************************************************************
    localhost                  : ok=4    changed=1    unreachable=0    failed=1   
    
  deploy_stderr: |

[stack@undercloud-0 ~]$ 
~~~
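
Most of the captured output is Puppet deprecation warnings, so the two relevant `Error:` lines are easy to miss. Here is a small sketch for pulling them out of the saved `failure.logs` file; the helper name `show_errors` is hypothetical and used only for illustration.

```shell
#!/bin/bash
# show_errors FILE
# Hypothetical helper: prints the quoted log lines that start with "Error:",
# stripping the leading indentation and the surrounding JSON-style quoting.
show_errors() {
    grep -E '^[[:space:]]*"Error:' "$1" |
        sed -E 's/^[[:space:]]*"//; s/",?[[:space:]]*$//'
}
```

Running `show_errors failure.logs` on the output above should leave only the 'unable to get cib' line and the `Pcmk_property` evaluation failure.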

Comment 9 Michele Baldessari 2018-07-20 20:24:30 UTC
*** Bug 1579469 has been marked as a duplicate of this bug. ***

Comment 20 pkomarov 2018-08-01 12:21:37 UTC
Verified.
Tested:
https://url.corp.redhat.com/05c732c

Comment 21 Joanne O'Flynn 2018-08-15 08:06:04 UTC
This bug is marked for inclusion in the errata but does not currently contain draft documentation text. To ensure the timely release of this advisory please provide draft documentation text for this bug as soon as possible.

If you do not think this bug requires errata documentation, set the requires_doc_text flag to "-".


To add draft documentation text:

* Select the documentation type from the "Doc Type" drop down field.

* A template will be provided in the "Doc Text" field based on the "Doc Type" value selected. Enter draft text in the "Doc Text" field.

Comment 23 errata-xmlrpc 2018-08-29 16:37:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2574

Comment 26 Joachim von Thadden 2018-12-03 14:41:00 UTC
Two people are hitting this issue again, so I am reopening the bug.

Comment 27 Joachim von Thadden 2018-12-03 15:28:07 UTC
It seems that simply redeploying over the failed environment, without changing anything, heals the stack.

Comment 28 Michele Baldessari 2018-12-03 15:42:01 UTC
Let's track this race condition (likely introduced by another fix related to reconnect_interval) over at https://bugzilla.redhat.com/show_bug.cgi?id=1624441

This BZ has sailed.

