Bug 1777318 - [RHOSP14][InstanceHA] Deployment fails - backup_cib: Running: pcs cluster cib /var/lib/pacemaker/cib(...) failed with code: 1 -> Error: unable to get cib
Status: CLOSED DUPLICATE of bug 1767160
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 14.0 (Rocky)
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: RHOS Maint
QA Contact: Sasha Smolyak
Duplicates: 1777347
 
Reported: 2019-11-27 11:40 UTC by Rafal Szmigiel
Modified: 2023-03-24 16:12 UTC
CC: 7 users

Last Closed: 2019-11-27 14:48:26 UTC



Description Rafal Szmigiel 2019-11-27 11:40:16 UTC
Description of problem:

Deployment of OSP14 with InstanceHA enabled fails with the following error:

2019-11-26 07:01:18,612 p=1150 u=mistral |  fatal: [overcloud-novacomputeiha-0]: FAILED! => {
    "failed_when_result": true,
    "outputs.stdout_lines | default([]) | union(outputs.stderr_lines | default([]))": [
        "Notice: hiera(): Cannot load backend module_data: cannot load such file -- hiera/backend/module_data_backend",
        "Notice: Scope(Class[Tripleo::Firewall::Post]): At this stage, all network traffic is blocked.",
        "Notice: Compiled catalog for overcloud-novacomputeiha-0.localdomain in environment production in 2.10 seconds",
        "Notice: /Stage[main]/Main/Package_manifest[/var/lib/tripleo/installed-packages/overcloud_ComputeInstanceHA2]/ensure: created",
        "Notice: /Stage[main]/Firewall::Linux::Redhat/Service[firewalld]/ensure: ensure changed 'running' to 'stopped'",
        "Notice: /Stage[main]/Firewall::Linux::Redhat/Service[iptables]/ensure: ensure changed 'stopped' to 'running'",
        "Notice: /Stage[main]/Firewall::Linux::Redhat/Service[ip6tables]/ensure: ensure changed 'stopped' to 'running'",
        "Notice: /Stage[main]/Tripleo::Profile::Base::Kernel/Kmod::Load[nf_conntrack_proto_sctp]/Exec[modprobe nf_conntrack_proto_sctp]/returns: executed successfully",
        "Notice: Applied catalog in 3.96 seconds",
        "Changes:",
        "            Total: 5",
        "Events:",
        "          Failure: 1",
        "          Success: 5",
        "            Total: 6",
        "Resources:",
        "           Failed: 1",
        "            Total: 158",
        "   Corrective change: 4",
        "          Changed: 5",
        "      Out of sync: 6",
        "Time:",
        "       Filebucket: 0.00",
        "   Concat fragment: 0.00",
        "      Concat file: 0.00",
        "         Schedule: 0.00",
        "           Anchor: 0.00",
        "   Package manifest: 0.00",
        "           Sysctl: 0.01",
        "   Sysctl runtime: 0.01",
        "           Augeas: 0.02",
        "             File: 0.05",
        "         Firewall: 0.05",
        "             Exec: 0.27",
        "    Pcmk property: 0.41",
        "          Package: 0.52",
        "          Service: 1.71",
        "         Last run: 1574769670",
        "   Config retrieval: 2.59",
        "            Total: 5.64",
        "Version:",
        "           Config: 1574769664",
        "           Puppet: 4.8.2",
        "Warning: Undefined variable '::deploy_config_name'; ",
        "   (file & line not available)",
        "Warning: Undefined variable 'deploy_config_name'; ",
        "Warning: Unknown variable: '::deployment_type'. at /etc/puppet/modules/tripleo/manifests/profile/base/database/mysql/client.pp:85:31",
        "Warning: This method is deprecated, please use the stdlib validate_legacy function,",
        "                    with Stdlib::Compat::Bool. There is further documentation for validate_legacy function in the README. at [\"/etc/puppet/modules/ntp/manifests/init.pp\", 54]:[\"/etc/puppet/modules/tripleo/manifests/profile/base/time/ntp.pp\", 34]",
        "   (at /etc/puppet/modules/stdlib/lib/puppet/functions/deprecation.rb:28:in `deprecation')",
        "                    with Stdlib::Compat::Absolute_Path. There is further documentation for validate_legacy function in the README. at [\"/etc/puppet/modules/ntp/manifests/init.pp\", 55]:[\"/etc/puppet/modules/tripleo/manifests/profile/base/time/ntp.pp\", 34]",
        "                    with Stdlib::Compat::String. There is further documentation for validate_legacy function in the README. at [\"/etc/puppet/modules/ntp/manifests/init.pp\", 56]:[\"/etc/puppet/modules/tripleo/manifests/profile/base/time/ntp.pp\", 34]",
        "                    with Stdlib::Compat::Array. There is further documentation for validate_legacy function in the README. at [\"/etc/puppet/modules/ntp/manifests/init.pp\", 66]:[\"/etc/puppet/modules/tripleo/manifests/profile/base/time/ntp.pp\", 34]",
        "                    with Pattern[]. There is further documentation for validate_legacy function in the README. at [\"/etc/puppet/modules/ntp/manifests/init.pp\", 68]:[\"/etc/puppet/modules/tripleo/manifests/profile/base/time/ntp.pp\", 34]",
        "                    with Stdlib::Compat::Numeric. There is further documentation for validate_legacy function in the README. at [\"/etc/puppet/modules/ntp/manifests/init.pp\", 76]:[\"/etc/puppet/modules/tripleo/manifests/profile/base/time/ntp.pp\", 34]",
        "Warning: ModuleLoader: module 'pacemaker' has unresolved dependencies - it will only see those that are resolved. Use 'puppet module list --tree' to see information about modules",
        "Warning: tag is a metaparam; this value will inherit to all contained resources in the tripleo::firewall::rule definition",
        "                    with Stdlib::Compat::Hash. There is further documentation for validate_legacy function in the README. at [\"/etc/puppet/modules/tripleo/manifests/firewall/rule.pp\", 148]:",
        "Error: /Stage[main]/Tripleo::Profile::Pacemaker::Compute_instanceha/Pacemaker::Property[compute-instanceha-role-node-property]/Pcmk_property[property-overcloud-novacomputeiha-0-compute-instanceha-role]: Could not evaluate: backup_cib: Running: pcs cluster cib /var/lib/pacemaker/cib/puppet-cib-backup20191126-41449-14tc4hm failed with code: 1 -> Error: unable to get cib"
    ]
}
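The failing resource is puppet-pacemaker's Pcmk_property provider, which backs up the CIB with "pcs cluster cib <file>" before applying a change and treats a non-zero exit as fatal. A minimal shell sketch of that check (hypothetical diagnostic; the backup path is illustrative, not the real tempfile name):

```shell
# Hypothetical diagnostic mirroring the failing puppet-pacemaker step:
# "pcs cluster cib <backup-file>" succeeds on a healthy, quorate cluster
# and fails with "unable to get cib" when the node has lost the cluster.
backup="/tmp/puppet-cib-backup.xml"   # illustrative path
if pcs cluster cib "$backup" 2>/dev/null; then
    result="ok"
else
    result="unable to get cib"        # the error the deploy surfaced
fi
echo "$result"
```

Running this on the affected compute node while corosync has lost quorum reproduces the same error string seen in the ansible output above.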

Version-Release number of selected component (if applicable):

[root@overcloud-novacomputeiha-0 heat-admin]# rpm -qa | grep -E 'openstack|rhosp|pacemaker|tripleo'
puppet-openstacklib-13.3.2-0.20190420090713.05a84dd.el7ost.noarch
python-openstackclient-lang-3.16.2-3.el7ost.noarch
pacemaker-cli-1.1.20-5.el7_7.1.x86_64
python2-openstacksdk-0.17.2-0.20180809182657.3ad9dab.el7ost.noarch
python2-openstackclient-3.16.2-3.el7ost.noarch
pacemaker-1.1.20-5.el7_7.1.x86_64
ansible-pacemaker-1.0.4-0.20180827141254.0e4d7c0.el7ost.noarch
openstack-selinux-0.8.18-1.el7ost.noarch
puppet-pacemaker-0.7.2-0.20181008172522.9a4bc2d.el7ost.noarch
openstack-heat-agents-1.7.1-0.20190420000616.41c7faf.el7ost.noarch
pacemaker-libs-1.1.20-5.el7_7.1.x86_64
pacemaker-remote-1.1.20-5.el7_7.1.x86_64
puppet-tripleo-9.4.1-0.20190508182410.el7ost.noarch
rhosp-openvswitch-2.11-0.6.el7ost.noarch
rhosp-release-14.0.4-1.el7ost.noarch
puppet-openstack_extras-13.3.2-0.20190420072608.d650bd8.el7ost.noarch
pacemaker-cluster-libs-1.1.20-5.el7_7.1.x86_64


(undercloud) [stack@director ~]$ rpm -qa | grep -i tripleo
openstack-tripleo-heat-templates-9.3.1-0.20190513171768.el7ost.noarch
ansible-role-tripleo-modify-image-1.0.1-0.20190419231031.f1dfdc6.el7ost.noarch
openstack-tripleo-common-containers-9.5.0-8.el7ost.noarch
ansible-tripleo-ipsec-9.1.0-2.el7ost.noarch
python2-tripleo-common-9.5.0-8.el7ost.noarch
python-tripleoclient-heat-installer-10.6.2-0.20190425150607.el7ost.noarch
openstack-tripleo-validations-9.3.2-0.20190420045628.361061f.el7ost.noarch
openstack-tripleo-common-9.5.0-8.el7ost.noarch
python-tripleoclient-10.6.2-0.20190425150607.el7ost.noarch
openstack-tripleo-puppet-elements-9.0.1-5.el7ost.noarch
openstack-tripleo-image-elements-9.0.1-0.20181102144447.9f1c800.el7ost.noarch
puppet-tripleo-9.4.1-0.20190508182410.el7ost.noarch

How reproducible:

Always

Steps to Reproduce:
1. Deploy OSP14 with InstanceHA enabled, following the official documentation.
2. Observe that the deployment fails.

Actual results:

Deployment fails.


Expected results:

Deployment finishes successfully. 

Additional info:

Comment 1 Luca Miccini 2019-11-27 13:53:04 UTC
Nov 27 10:35:32 overcloud-controller-0 ansible-systemd[38383]: Invoked with no_block=False force=None name=firewalld enabled=True daemon_reload=False state=started user=False masked=None
Nov 27 10:35:32 overcloud-controller-0 systemd[1]: Reloading.
Nov 27 10:35:32 overcloud-controller-0 systemd[1]: Stopping IPv6 firewall with ip6tables...
Nov 27 10:35:32 overcloud-controller-0 systemd[1]: Starting firewalld - dynamic firewall daemon...
Nov 27 10:35:32 overcloud-controller-0 ip6tables.init[38405]: ip6tables: Setting chains to policy ACCEPT: filter [  OK  ]
Nov 27 10:35:32 overcloud-controller-0 ip6tables.init[38405]: ip6tables: Flushing firewall rules: [  OK  ]
Nov 27 10:35:32 overcloud-controller-0 systemd[1]: Stopped IPv6 firewall with ip6tables.
Nov 27 10:35:32 overcloud-controller-0 systemd[1]: Stopping IPv4 firewall with iptables...
Nov 27 10:35:32 overcloud-controller-0 iptables.init[38423]: iptables: Setting chains to policy ACCEPT: filter [  OK  ]
Nov 27 10:35:32 overcloud-controller-0 iptables.init[38423]: iptables: Flushing firewall rules: [  OK  ]
Nov 27 10:35:32 overcloud-controller-0 systemd[1]: Stopped IPv4 firewall with iptables.
Nov 27 10:35:33 overcloud-controller-0 systemd[1]: Started firewalld - dynamic firewall daemon.
Nov 27 10:35:33 overcloud-controller-0 sudo[38378]: pam_unix(sudo:session): session closed for user root
Nov 27 10:35:33 overcloud-controller-0 kernel: Ebtables v2.0 registered
Nov 27 10:35:33 overcloud-controller-0 kernel: Netfilter messages via NETLINK v0.30.
Nov 27 10:35:33 overcloud-controller-0 kernel: ip_set: protocol 7
Nov 27 10:35:33 overcloud-controller-0 sudo[38503]: tripleo-admin : TTY=unknown ; PWD=/home/tripleo-admin ; USER=root ; COMMAND=/bin/sh -c echo BECOME-SUCCESS-epwijyfyortpwxkxzyksvtqxnolxumms; /usr/bin/python
Nov 27 10:35:33 overcloud-controller-0 sudo[38503]: pam_unix(sudo:session): session opened for user root by (uid=0)
Nov 27 10:35:34 overcloud-controller-0 ansible-firewalld[38517]: Invoked with service=ceph-mon zone=public masquerade=None immediate=True source=172.16.1.0/24 state=enabled permanent=True timeout=0 interface=None offline=None port=None rich_rule=None
Nov 27 10:35:34 overcloud-controller-0 sudo[38503]: pam_unix(sudo:session): session closed for user root
Nov 27 10:35:34 overcloud-controller-0 sudo[38572]: tripleo-admin : TTY=unknown ; PWD=/home/tripleo-admin ; USER=root ; COMMAND=/bin/sh -c echo BECOME-SUCCESS-ioghdksyqqggwciqeqibusmixazlscav; /usr/bin/python
Nov 27 10:35:34 overcloud-controller-0 sudo[38572]: pam_unix(sudo:session): session opened for user root by (uid=0)
Nov 27 10:35:34 overcloud-controller-0 ansible-firewalld[38576]: Invoked with service=ceph zone=public masquerade=None immediate=True source=172.16.1.0/24 state=enabled permanent=True timeout=0 interface=None offline=None port=None rich_rule=None
Nov 27 10:35:34 overcloud-controller-0 sudo[38572]: pam_unix(sudo:session): session closed for user root
Nov 27 10:35:35 overcloud-controller-0 sudo[38590]: tripleo-admin : TTY=unknown ; PWD=/home/tripleo-admin ; USER=root ; COMMAND=/bin/sh -c echo BECOME-SUCCESS-pduyziuhzinpowluoycnualeqyouxcwt; /usr/bin/python
Nov 27 10:35:35 overcloud-controller-0 sudo[38590]: pam_unix(sudo:session): session opened for user root by (uid=0)
Nov 27 10:35:35 overcloud-controller-0 ansible-firewalld[38594]: Invoked with zone=public service=ceph masquerade=None immediate=True source=172.16.1.0/24 state=enabled permanent=True timeout=0 interface=None offline=None port=None rich_rule=None
Nov 27 10:35:35 overcloud-controller-0 sudo[38590]: pam_unix(sudo:session): session closed for user root
Nov 27 10:35:41 overcloud-controller-0 sudo[38610]: tripleo-admin : TTY=unknown ; PWD=/home/tripleo-admin ; USER=root ; COMMAND=/bin/sh -c echo BECOME-SUCCESS-dxgtxmtnahrrataytigwpwxmzhwltmbg; /usr/bin/python
Nov 27 10:35:41 overcloud-controller-0 sudo[38610]: pam_unix(sudo:session): session opened for user root by (uid=0)
Nov 27 10:35:41 overcloud-controller-0 ansible-systemd[38614]: Invoked with no_block=False force=None name=firewalld enabled=True daemon_reload=False state=restarted user=False masked=None
Nov 27 10:35:41 overcloud-controller-0 systemd[1]: Stopping firewalld - dynamic firewall daemon...
Nov 27 10:35:42 overcloud-controller-0 kernel: Ebtables v2.0 unregistered
Nov 27 10:35:42 overcloud-controller-0 systemd[1]: Stopped firewalld - dynamic firewall daemon.
Nov 27 10:35:42 overcloud-controller-0 systemd[1]: Starting firewalld - dynamic firewall daemon...
Nov 27 10:35:42 overcloud-controller-0 systemd[1]: Started firewalld - dynamic firewall daemon.
Nov 27 10:35:42 overcloud-controller-0 kernel: ip_tables: (C) 2000-2006 Netfilter Core Team
Nov 27 10:35:42 overcloud-controller-0 sudo[38610]: pam_unix(sudo:session): session closed for user root
Nov 27 10:35:42 overcloud-controller-0 kernel: ip6_tables: (C) 2000-2006 Netfilter Core Team
Nov 27 10:35:42 overcloud-controller-0 kernel: Ebtables v2.0 registered
Nov 27 10:35:44 overcloud-controller-0 sudo[38784]: tripleo-admin : TTY=unknown ; PWD=/home/tripleo-admin ; USER=root ; COMMAND=/bin/sh -c echo BECOME-SUCCESS-gerhftfmiwatqccnakopeaoeohoeuups; /usr/bin/python
Nov 27 10:35:44 overcloud-controller-0 sudo[38784]: pam_unix(sudo:session): session opened for user root by (uid=0)
Nov 27 10:35:44 overcloud-controller-0 ansible-command[38789]: Invoked with warn=True executable=None _uses_shell=False _raw_params=docker ps -q --filter='name=ceph-mon-overcloud-controller-0' removes=None argv=None creates=None chdir=None stdin=None
Nov 27 10:35:44 overcloud-controller-0 sudo[38784]: pam_unix(sudo:session): session closed for user root
Nov 27 10:35:46 overcloud-controller-0 corosync[25607]:  [TOTEM ] A processor failed, forming new configuration.
Nov 27 10:35:46 overcloud-controller-0 sudo[38814]: tripleo-admin : TTY=unknown ; PWD=/home/tripleo-admin ; USER=root ; COMMAND=/bin/sh -c echo BECOME-SUCCESS-qitrlaolsjxdmlisuferqeyyvajrpxhp; /usr/bin/python
Nov 27 10:35:46 overcloud-controller-0 sudo[38814]: pam_unix(sudo:session): session opened for user root by (uid=0)
Nov 27 10:35:46 overcloud-controller-0 ansible-command[38819]: Invoked with warn=True executable=None _uses_shell=False _raw_params=docker ps -q --filter='name=ceph-mgr-overcloud-controller-0' removes=None argv=None creates=None chdir=None stdin=None
Nov 27 10:35:46 overcloud-controller-0 sudo[38814]: pam_unix(sudo:session): session closed for user root
Nov 27 10:35:58 overcloud-controller-0 corosync[25607]:  [TOTEM ] A new membership (172.16.2.6:16) was formed. Members left: 2 3
Nov 27 10:35:58 overcloud-controller-0 corosync[25607]:  [TOTEM ] Failed to receive the leave message. failed: 2 3
Nov 27 10:35:58 overcloud-controller-0 stonith-ng[25632]:   notice: Node overcloud-controller-1 state is now lost
Nov 27 10:35:58 overcloud-controller-0 corosync[25607]:  [QUORUM] This node is within the non-primary component and will NOT provide any services.
Nov 27 10:35:58 overcloud-controller-0 corosync[25607]:  [QUORUM] Members[1]: 1
Nov 27 10:35:58 overcloud-controller-0 stonith-ng[25632]:   notice: Purged 1 peer with id=2 and/or uname=overcloud-controller-1 from the membership cache
Nov 27 10:35:58 overcloud-controller-0 corosync[25607]:  [MAIN  ] Completed service synchronization, ready to provide service.
Nov 27 10:35:58 overcloud-controller-0 stonith-ng[25632]:   notice: Node overcloud-controller-2 state is now lost
Nov 27 10:35:58 overcloud-controller-0 stonith-ng[25632]:   notice: Purged 1 peer with id=3 and/or uname=overcloud-controller-2 from the membership cache
Nov 27 10:35:58 overcloud-controller-0 pacemakerd[25630]:  warning: Quorum lost
Nov 27 10:35:58 overcloud-controller-0 crmd[25636]:  warning: Quorum lost
Nov 27 10:35:58 overcloud-controller-0 pacemakerd[25630]:   notice: Node overcloud-controller-1 state is now lost
Nov 27 10:35:58 overcloud-controller-0 crmd[25636]:   notice: Node overcloud-controller-2 state is now lost
Nov 27 10:35:58 overcloud-controller-0 pacemakerd[25630]:   notice: Node overcloud-controller-2 state is now lost
Nov 27 10:35:58 overcloud-controller-0 crmd[25636]:   notice: Node overcloud-controller-1 state is now lost
Nov 27 10:35:58 overcloud-controller-0 crmd[25636]:  warning: Our DC node (overcloud-controller-1) left the cluster
Nov 27 10:35:58 overcloud-controller-0 cib[25631]:   notice: Node overcloud-controller-1 state is now lost
Nov 27 10:35:58 overcloud-controller-0 crmd[25636]:   notice: State transition S_NOT_DC -> S_ELECTION
Nov 27 10:35:58 overcloud-controller-0 cib[25631]:   notice: Purged 1 peer with id=2 and/or uname=overcloud-controller-1 from the membership cache
Nov 27 10:35:58 overcloud-controller-0 cib[25631]:   notice: Node overcloud-controller-2 state is now lost
Nov 27 10:35:58 overcloud-controller-0 cib[25631]:   notice: Purged 1 peer with id=3 and/or uname=overcloud-controller-2 from the membership cache
Nov 27 10:35:58 overcloud-controller-0 crmd[25636]:   notice: State transition S_ELECTION -> S_INTEGRATION


controller-0 is isolated from roughly 10:35:40 onwards.

Looking at the ceph-ansible logs:

2019-11-27 05:35:30,207 p=96066 u=mistral |  TASK [ceph-infra : include_tasks configure_firewall.yml] ***********************
2019-11-27 05:35:31,048 p=96066 u=mistral |  TASK [ceph-infra : check firewalld installation on redhat or suse] *************
2019-11-27 05:35:31,869 p=96066 u=mistral |  TASK [ceph-infra : start firewalld] ********************************************
2019-11-27 05:35:33,303 p=96066 u=mistral |  TASK [ceph-infra : open monitor and manager ports] *****************************
2019-11-27 05:35:35,050 p=96066 u=mistral |  TASK [ceph-infra : open manager ports] *****************************************
2019-11-27 05:35:35,855 p=96066 u=mistral |  TASK [ceph-infra : open osd ports] *********************************************
2019-11-27 05:35:37,547 p=96066 u=mistral |  TASK [ceph-infra : open rgw ports] *********************************************
2019-11-27 05:35:38,019 p=96066 u=mistral |  TASK [ceph-infra : open mds ports] *********************************************
2019-11-27 05:35:38,509 p=96066 u=mistral |  TASK [ceph-infra : open nfs ports] *********************************************
2019-11-27 05:35:39,005 p=96066 u=mistral |  TASK [ceph-infra : open nfs ports (portmapper)] ********************************
2019-11-27 05:35:39,516 p=96066 u=mistral |  TASK [ceph-infra : open rbdmirror ports] ***************************************
2019-11-27 05:35:40,061 p=96066 u=mistral |  TASK [ceph-infra : open iscsi target ports] ************************************
2019-11-27 05:35:40,514 p=96066 u=mistral |  TASK [ceph-infra : open iscsi api ports] ***************************************
2019-11-27 05:35:42,712 p=96066 u=mistral |  TASK [ceph-infra : include_tasks setup_ntp.yml] ********************************

The timestamps line up once the ceph-ansible log's local-time offset is accounted for: the "start firewalld" task coincides with the firewalld restart and the subsequent corosync membership loss on controller-0.
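The ceph-ansible log above is stamped in the undercloud's local time while the controller journal is in UTC; assuming a UTC-5 offset (inferred from the five-hour gap, not stated in the logs), the firewalld restart and the journal's firewall teardown coincide to within a second:

```shell
# Correlate the ceph-ansible "start firewalld" task (local time, assumed UTC-5)
# with the journal's "Stopping IPv6 firewall" entry (UTC). GNU date converts
# both timestamps to epoch seconds so the gap can be compared directly.
ansible_epoch=$(date -u -d "2019-11-27 05:35:31 -0500" +%s)
journal_epoch=$(date -u -d "2019-11-27 10:35:32 +0000" +%s)
echo "$(( journal_epoch - ansible_epoch ))s apart"   # prints "1s apart"
```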

Comment 2 Luca Miccini 2019-11-27 14:00:45 UTC
Running "systemctl restart firewalld" triggers the issue:

Node controller-1: UNCLEAN (offline)
Node controller-2: UNCLEAN (offline)
Online: [ controller-0 ]
RemoteOnline: [ compute-1 ]
RemoteOFFLINE: [ compute-0 ]
GuestOnline: [ galera-bundle-1@controller-0 rabbitmq-bundle-1@controller-0 redis-bundle-1@controller-0 ]

Comment 4 Luca Miccini 2019-11-27 14:27:31 UTC
*** Bug 1777347 has been marked as a duplicate of this bug. ***

Comment 6 Luca Miccini 2019-11-27 14:48:26 UTC
We confirmed that the patch was indeed missing, so I am marking this as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1767160.

Thanks
Luca

*** This bug has been marked as a duplicate of bug 1767160 ***

