1420432 – osp-d 11 - ipv6/vlan deployment fails on "unable to get cib" - missing corosync.conf

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openstack-tripleo-image-elements
Sub Component:
Version:	11.0 (Ocata)
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	urgent
Target Milestone:	rc
Target Release:	11.0 (Ocata)
Assignee:	Michele Baldessari
QA Contact:	Amit Ugol
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2017-02-08 15:52 UTC by Pavel Sedlák
Modified:	2017-05-17 19:57 UTC
CC List:	11 users
Fixed In Version:	openstack-tripleo-image-elements-6.0.0-0.20170131024050.8597926
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2017-05-17 19:57:45 UTC
Target Upstream Version:
Embargoed:

Attachments
/home/stack/virt/debug.yaml (39 bytes, text/x-vhdl) 2017-02-08 15:53 UTC, Pavel Sedlák	no flags
/home/stack/virt/hostnames.yml (214 bytes, text/plain) 2017-02-08 15:53 UTC, Pavel Sedlák	no flags
/home/stack/virt/network/network-environment-v6.yaml (1.99 KB, text/x-vhdl) 2017-02-08 15:55 UTC, Pavel Sedlák	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Launchpad	1657108	0	None	None	None	2017-02-09 14:30:58 UTC
Red Hat Product Errata	RHEA-2017:1245	0	normal	SHIPPED_LIVE	Red Hat OpenStack Platform 11.0 Bug Fix and Enhancement Advisory	2017-05-17 23:01:50 UTC

Description Pavel Sedlák 2017-02-08 15:52:36 UTC

Description of problem:
OSP-D overcloud deploy fails on:

> Error: unable to get cib
> Error: /Stage[main]/Pacemaker::Stonith/Pacemaker::Property[Disable STONITH]/Pcmk_property[property--stonith-enabled]: Could not evaluate: backup_cib: Running: /usr/sbin/pcs cluster cib /var/lib/pacemaker/cib/puppet-cib-backup20170208-14072-1iiyj1e failed with code: 1 -> 

Unlike a few other configurations (vxlan, ipv4, ...), it has so far happened only in the ipv6-vlan setup (3 controllers + 1 or 2 computes, on a virthost).


When "pcs cluster cib" is run on controller-0, it cannot fetch the data. Pacemaker does not seem to be running correctly, and the /etc/corosync/corosync.conf file is missing completely.


Version-Release number of selected component (if applicable):

> openstack-tripleo-0.0.8-0.2.4de13b3git.el7ost.noarch
> openstack-tripleo-puppet-elements-6.0.0-0.20170126053436.688584c.el7ost.noarch
> puppet-tripleo-6.1.0-0.20170127040716.d427c2a.el7ost.noarch
> python-tripleoclient-6.0.1-0.20170127055753.8ea289c.el7ost.noarch
> rhosp-director-images-11.0-20170201.1.el7ost.noarch
> openstack-tripleo-ui-2.0.1-0.20170126144317.f3bd97e.el7ost.noarch
> openstack-tripleo-validations-5.3.1-0.20170125194508.6b928f1.el7ost.noarch
> openstack-tripleo-heat-templates-6.0.0-0.20170127041112.ce54697.el7ost.1.noarch
> openstack-tripleo-image-elements-6.0.0-0.20170126135810.00b9869.el7ost.noarch
> rhosp-director-images-ipa-11.0-20170201.1.el7ost.noarch
> openstack-tripleo-common-5.7.1-0.20170126235054.c75d3c6.el7ost.noarch

How reproducible:
Always, executed like:
> openstack overcloud deploy --debug \
> --templates \
> --libvirt-type kvm \
> --ntp-server ntp.example.org \
> --control-scale 3 \
> --control-flavor controller \
> --compute-scale 1 \
> --compute-flavor compute \
> --environment-file /usr/share/openstack-tripleo-heat-templates/environments/services/sahara.yaml \
> --environment-file /usr/share/openstack-tripleo-heat-templates/environments/cinder-backup.yaml \
> -e /home/stack/virt/network/network-environment-v6.yaml \
> -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation-v6.yaml \
> -e /home/stack/virt/hostnames.yml \
> -e /home/stack/virt/debug.yaml \
> --log-file overcloud_deployment_37.log

mkrcmari is trying to get more detailed information (with ConfigDebug enabled) to pin down a more specific cause of the failure.

Comment 1 Pavel Sedlák 2017-02-08 15:53:31 UTC

Created attachment 1248632 [details]
/home/stack/virt/debug.yaml

Comment 2 Pavel Sedlák 2017-02-08 15:53:57 UTC

Created attachment 1248633 [details]
/home/stack/virt/hostnames.yml

Comment 3 Pavel Sedlák 2017-02-08 15:55:10 UTC

Created attachment 1248635 [details]
/home/stack/virt/network/network-environment-v6.yaml

Original ipv6/ipv4 ranges replaced with dummy examples.

Comment 5 Marian Krcmarik 2017-02-08 20:04:10 UTC

It seems to be caused by a failure of the command that sets up cluster authentication: "/sbin/pcs cluster auth controller-0 controller-1 controller-2 -u hacluster -p ***** --force". Pacemaker communication is blocked by iptables at the time the command runs; the ipv6 rules that open it are added only after the command has been executed. It is reproducible only on ipv6-based deployments because the ipv4 iptables rules are empty when the command runs:

[heat-admin@controller-0 ~]$ sudo iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination         

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination         

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination 

because:
[heat-admin@controller-0 ~]$ sudo cat /etc/sysconfig/iptables
# empty ruleset created by tripleo-image-elements


But ip6tables already includes some rules at the time of command execution:
[heat-admin@controller-0 ~]$ sudo ip6tables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination         
ACCEPT     all      anywhere             anywhere             state RELATED,ESTABLISHED
ACCEPT     ipv6-icmp    anywhere             anywhere            
ACCEPT     all      anywhere             anywhere            
ACCEPT     tcp      anywhere             anywhere             state NEW tcp dpt:ssh
ACCEPT     udp      anywhere             fe80::/64            udp dpt:dhcpv6-client state NEW
REJECT     all      anywhere             anywhere             reject-with icmp6-adm-prohibited

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination         
REJECT     all      anywhere             anywhere             reject-with icmp6-adm-prohibited

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

And

[heat-admin@controller-0 ~]$ sudo cat /etc/sysconfig/ip6tables
# sample configuration for ip6tables service
# you can edit this manually or use system-config-firewall
# please do not ask us to add additional ports/services to this default configuration
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
-A INPUT -p ipv6-icmp -j ACCEPT
-A INPUT -i lo -j ACCEPT
-A INPUT -p tcp -m state --state NEW -m tcp --dport 22 -j ACCEPT
-A INPUT -d fe80::/64 -p udp -m udp --dport 546 -m state --state NEW -j ACCEPT
-A INPUT -j REJECT --reject-with icmp6-adm-prohibited
-A FORWARD -j REJECT --reject-with icmp6-adm-prohibited
COMMIT
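
A rough way to check a saved ruleset for the condition described above can be sketched as a small shell helper. This is only an illustration, not part of the original analysis: the function name port_blocked is hypothetical, it assumes pcsd listens on its default TCP port 2224, it assumes the INPUT chain policy is ACCEPT (as in both files shown), and it does a plain text scan of the INPUT rules rather than a real iptables evaluation.

```shell
#!/bin/sh
# Hypothetical helper: print "blocked" if a saved ip(6)tables ruleset reaches a
# catch-all REJECT/DROP in the INPUT chain before any rule that accepts the
# given destination port, otherwise print "open".
# Assumptions: INPUT policy is ACCEPT; rules are in iptables-save format;
# this is a text scan only, not a full rule evaluation.
port_blocked() {
    rules_file=$1
    port=$2
    awk -v port="$port" '
        /^-A INPUT/ {
            # An explicit ACCEPT for the port, seen first, means the port is open.
            if ($0 ~ ("--dport " port "( |$)") && $0 ~ /-j ACCEPT/) {
                print "open"; found = 1; exit
            }
            # An unconditional ACCEPT also lets the traffic through.
            if ($0 ~ /^-A INPUT -j ACCEPT/) {
                print "open"; found = 1; exit
            }
            # An unconditional REJECT or DROP blocks everything not accepted so far.
            if ($0 ~ /^-A INPUT -j (REJECT|DROP)/) {
                print "blocked"; found = 1; exit
            }
        }
        # No decisive rule found: the ACCEPT policy applies.
        END { if (!found) print "open" }
    ' "$rules_file"
}
```

On a controller this could be called as "port_blocked /etc/sysconfig/ip6tables 2224" (reading the file may need root). For the ip6tables file quoted above, the scan reaches the final REJECT rule without finding an ACCEPT for port 2224, whereas the comment-only iptables file contains no INPUT rules at all.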

Comment 7 Michele Baldessari 2017-02-09 12:38:30 UTC

So from an initial look, this is the downstream ipv6 manifestation of this bug: https://bugs.launchpad.net/tripleo/+bug/1657108/. The super short version is that if we start off from an image that has prepopulated /etc/sysconfig/ip[6]tables rules (and the iptables package does ship such rules, which only allow ssh and icmp), pcs will be executed before the firewall module has kicked in to open up the pacemaker/pcs ports, and so it will fail.

To verify/disprove this theory, can you try the following on the undercloud:
echo '' > /tmp/iptables
echo '' > /tmp/ip6tables
virt-copy-in -a overcloud-full.qcow2 /tmp/iptables /etc/sysconfig/
virt-copy-in -a overcloud-full.qcow2 /tmp/ip6tables /etc/sysconfig/
openstack overcloud image upload --image-path . --update-existing

And then try and redeploy? Note that we already have fixes in order to empty these stock rules from the image building process. I assume that they have not yet hit downstream, because if that were the case we would not see the entries in ip[6]tables at comment 5.
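
Before uploading, it may help to confirm that the rule files inside the image really are empty. The following is a minimal sketch, not something from the original thread: count_rules is a hypothetical helper that counts the lines on standard input that are neither blank nor comments, so a result of 0 means no rules are present.

```shell
#!/bin/sh
# Hypothetical helper: count lines on stdin that are neither blank nor
# "#" comments. A count of 0 means the ruleset file carries no rules.
count_rules() {
    grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' | wc -l | tr -d ' '
}
```

Assuming the libguestfs tools are installed next to virt-copy-in, the file can be read straight from the image with "virt-cat -a overcloud-full.qcow2 /etc/sysconfig/ip6tables | count_rules", and likewise for /etc/sysconfig/iptables.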

Comment 8 Marian Krcmarik 2017-02-09 14:30:58 UTC

(In reply to Michele Baldessari from comment #7)
> And then try and redeploy? Note that we already have fixes in order to empty
> these stock rules from the image building process. I assume that they have
> not yet hit downstream, because if that were the case we would not see the
> entries in ip[6]tables at comment 5.

I can confirm Michele's assumption: the deployment was successful after placing empty iptables rules into the overcloud image and relabeling SELinux.

Comment 9 Michele Baldessari 2017-02-14 12:53:47 UTC

Mike,

any idea when we will build images that have the following t-i-e patch?
https://review.openstack.org/#/c/426144/

thanks,
Michele

Comment 13 errata-xmlrpc 2017-05-17 19:57:45 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1245
