Bug 1420432

Summary:

osp-d 11 - ipv6/vlan deployment fails on "unable to get cib" - missing corosync.conf

Product:

Red Hat OpenStack

Reporter:

Pavel Sedlák <psedlak>

Component:

openstack-tripleo-image-elements

Assignee:

Michele Baldessari <michele>

Status:

CLOSED ERRATA

QA Contact:

Amit Ugol <augol>

Severity:

urgent

Docs Contact:

Priority:

unspecified

Version:

11.0 (Ocata)

CC:

aschultz, dbecker, jschluet, mburns, mcornea, michele, mkrcmari, morazi, psedlak, rhel-osp-director-maint, royoung

Target Milestone:

Keywords:

Automation

Target Release:

11.0 (Ocata)

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

openstack-tripleo-image-elements-6.0.0-0.20170131024050.8597926

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2017-05-17 19:57:45 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
/home/stack/virt/debug.yaml	none
/home/stack/virt/hostnames.yml	none
/home/stack/virt/network/network-environment-v6.yaml	none

Description Pavel Sedlák 2017-02-08 15:52:36 UTC

Description of problem:
OSP-D overcloud deploy fails on:

> Error: unable to get cib
> Error: /Stage[main]/Pacemaker::Stonith/Pacemaker::Property[Disable STONITH]/Pcmk_property[property--stonith-enabled]: Could not evaluate: backup_cib: Running: /usr/sbin/pcs cluster cib /var/lib/pacemaker/cib/puppet-cib-backup20170208-14072-1iiyj1e failed with code: 1 -> 

Unlike a few other configurations (vxlan, ipv4, ...), this has so far happened only in the ipv6-vlan setup (3 controllers + 1 or 2 computes, virthost).


When "pcs cluster cib" is run on controller-0, it cannot fetch any data. It seems pacemaker is not running correctly, and the /etc/corosync/corosync.conf file is missing completely.


Version-Release number of selected component (if applicable):

> openstack-tripleo-0.0.8-0.2.4de13b3git.el7ost.noarch
> openstack-tripleo-puppet-elements-6.0.0-0.20170126053436.688584c.el7ost.noarch
> puppet-tripleo-6.1.0-0.20170127040716.d427c2a.el7ost.noarch
> python-tripleoclient-6.0.1-0.20170127055753.8ea289c.el7ost.noarch
> rhosp-director-images-11.0-20170201.1.el7ost.noarch
> openstack-tripleo-ui-2.0.1-0.20170126144317.f3bd97e.el7ost.noarch
> openstack-tripleo-validations-5.3.1-0.20170125194508.6b928f1.el7ost.noarch
> openstack-tripleo-heat-templates-6.0.0-0.20170127041112.ce54697.el7ost.1.noarch
> openstack-tripleo-image-elements-6.0.0-0.20170126135810.00b9869.el7ost.noarch
> rhosp-director-images-ipa-11.0-20170201.1.el7ost.noarch
> openstack-tripleo-common-5.7.1-0.20170126235054.c75d3c6.el7ost.noarch

How reproducible:
Always, executed like:
> openstack overcloud deploy --debug \
> --templates \
> --libvirt-type kvm \
> --ntp-server ntp.example.org \
> --control-scale 3 \
> --control-flavor controller \
> --compute-scale 1 \
> --compute-flavor compute \
> --environment-file /usr/share/openstack-tripleo-heat-templates/environments/services/sahara.yaml \
> --environment-file /usr/share/openstack-tripleo-heat-templates/environments/cinder-backup.yaml \
> -e /home/stack/virt/network/network-environment-v6.yaml \
> -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation-v6.yaml \
> -e /home/stack/virt/hostnames.yml \
> -e /home/stack/virt/debug.yaml \
> --log-file overcloud_deployment_37.log

mkrcmari is trying to get more detailed info (with ConfigDebug enabled) to pin down a more specific cause of the failure.

Comment 1 Pavel Sedlák 2017-02-08 15:53:31 UTC

Created attachment 1248632 [details]
/home/stack/virt/debug.yaml

Comment 2 Pavel Sedlák 2017-02-08 15:53:57 UTC

Created attachment 1248633 [details]
/home/stack/virt/hostnames.yml

Comment 3 Pavel Sedlák 2017-02-08 15:55:10 UTC

Created attachment 1248635 [details]
/home/stack/virt/network/network-environment-v6.yaml

Original ipv6/ipv4 ranges were replaced with dummy examples.

Comment 5 Marian Krcmarik 2017-02-08 20:04:10 UTC

It seems to be caused by a failure of the command that sets cluster authentication: "/sbin/pcs cluster auth controller-0 controller-1 controller-2 -u hacluster -p ***** --force". Pacemaker communication is blocked by iptables at the time the command is executed; the ipv6 rules that allow it are added only after the command has run. It is reproducible only on ipv6-based deployments because the ipv4 iptables rules are empty at the time of command execution:

[heat-admin@controller-0 ~]$ sudo iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination         

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination         

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination 

because:
[heat-admin@controller-0 ~]$ sudo cat /etc/sysconfig/iptables
# empty ruleset created by tripleo-image-elements


But ip6tables already includes some rules at the time of command execution:
[heat-admin@controller-0 ~]$ sudo ip6tables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination         
ACCEPT     all      anywhere             anywhere             state RELATED,ESTABLISHED
ACCEPT     ipv6-icmp    anywhere             anywhere            
ACCEPT     all      anywhere             anywhere            
ACCEPT     tcp      anywhere             anywhere             state NEW tcp dpt:ssh
ACCEPT     udp      anywhere             fe80::/64            udp dpt:dhcpv6-client state NEW
REJECT     all      anywhere             anywhere             reject-with icmp6-adm-prohibited

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination         
REJECT     all      anywhere             anywhere             reject-with icmp6-adm-prohibited

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

And

[heat-admin@controller-0 ~]$ sudo cat /etc/sysconfig/ip6tables
# sample configuration for ip6tables service
# you can edit this manually or use system-config-firewall
# please do not ask us to add additional ports/services to this default configuration
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
-A INPUT -p ipv6-icmp -j ACCEPT
-A INPUT -i lo -j ACCEPT
-A INPUT -p tcp -m state --state NEW -m tcp --dport 22 -j ACCEPT
-A INPUT -d fe80::/64 -p udp -m udp --dport 546 -m state --state NEW -j ACCEPT
-A INPUT -j REJECT --reject-with icmp6-adm-prohibited
-A FORWARD -j REJECT --reject-with icmp6-adm-prohibited
COMMIT
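
The difference between the two rule files above can be checked mechanically. The following is only a rough sketch, assuming the iptables-save format shown above and assuming that "pcs cluster auth" talks to pcsd on its default TCP port 2224. The helper name blocks_port is hypothetical, and the check is a heuristic: it does not emulate full netfilter matching.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) if an iptables-save style rules file
# has a catch-all REJECT/DROP on INPUT that is not preceded by an explicit
# ACCEPT for the given destination port.
blocks_port() {
    rules_file=$1
    port=$2
    awk -v port="$port" '
        # Remember an explicit ACCEPT rule for the requested --dport.
        /^-A INPUT / && /--dport/ && /-j ACCEPT/ {
            for (i = 1; i < NF; i++)
                if ($i == "--dport" && $(i + 1) == port) { accepted = 1 }
        }
        # First unconditional REJECT/DROP decides the outcome.
        /^-A INPUT -j (REJECT|DROP)/ {
            if (!accepted) { blocked = 1 }
            exit
        }
        END { exit blocked ? 0 : 1 }
    ' "$rules_file"
}
```

On a controller this could be run as root, for example: blocks_port /etc/sysconfig/ip6tables 2224 && echo "pcsd port would be rejected". With the stock ip6tables file above it should report the port as blocked, while the empty ipv4 ruleset should not.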

Comment 7 Michele Baldessari 2017-02-09 12:38:30 UTC

From an initial look, this is the downstream ipv6 manifestation of this bug: https://bugs.launchpad.net/tripleo/+bug/1657108/. The super short version is that if we start from an image that has prepopulated /etc/sysconfig/ip[6]tables rules (and the iptables package does ship such rules, which only allow ssh and icmp), pcs is executed before the firewall module has kicked in to open up the pacemaker/pcs ports, and so it fails.

To verify/disprove this theory can you try the following on the undercloud:
echo '' > /tmp/iptables
echo '' > /tmp/ip6tables
virt-copy-in -a overcloud-full.qcow2 /tmp/iptables /etc/sysconfig/
virt-copy-in -a overcloud-full.qcow2 /tmp/ip6tables /etc/sysconfig/
openstack overcloud image upload --image-path . --update-existing

And then try and redeploy? Note that we already have fixes in order to empty these stock rules from the image building process. I assume that they have not yet hit downstream, because if that were the case we would not see the entries in ip[6]tables at comment 5.
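
Before redeploying, it may help to confirm that the rule files inside the image really are empty. A minimal sketch, assuming the files use the iptables-save format seen in comment 5; the helper name count_rules is hypothetical, and virt-cat (from libguestfs, like virt-copy-in) is only mentioned in the comments as one way to extract the file.

```shell
#!/bin/sh
# Hypothetical helper: print the number of "-A" (append) rules in a rules file.
# grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
count_rules() {
    grep -c '^-A ' "$1" || true
}

# Possible use on the undercloud (not executed here):
#   virt-cat -a overcloud-full.qcow2 /etc/sysconfig/ip6tables > /tmp/ip6tables.check
#   count_rules /tmp/ip6tables.check
```

A result of 0 for both the iptables and ip6tables files would indicate that the stock rules are no longer present in the image.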

Comment 8 Marian Krcmarik 2017-02-09 14:30:58 UTC

(In reply to Michele Baldessari from comment #7)
> And then try and redeploy? Note that we already have fixes in order to empty
> these stock rules from the image building process. I assume that they have
> not yet hit downstream, because if that were the case we would not see the
> entries in ip[6]tables at comment 5.

I can confirm Michele's assumption: the deployment was successful after placing empty iptables rules into the overcloud image and relabeling selinux.

Comment 9 Michele Baldessari 2017-02-14 12:53:47 UTC

Mike,

any idea when we will build images that have the following t-i-e patch?
https://review.openstack.org/#/c/426144/

thanks,
Michele

Comment 13 errata-xmlrpc 2017-05-17 19:57:45 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1245