Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1728678

Summary: OSP 14->15: haproxy_init_bundle fails with "unable to get cib"
Product: Red Hat OpenStack
Reporter: Jiri Stransky <jstransk>
Component: openstack-tripleo-heat-templates
Assignee: RHOS Maint <rhos-maint>
Status: CLOSED WORKSFORME
QA Contact: Sasha Smolyak <ssmolyak>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 15.0 (Stein)
CC: bperkins, mburns
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-07-11 12:01:55 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1727807    

Description Jiri Stransky 2019-07-10 12:23:23 UTC
We had previously gotten past this point, but recently started hitting a new bug during the upgrade of controller-0. Snippet of the output of `openstack upgrade run --limit controller-0 --skip-tags validation`:

http://pastebin.test.redhat.com/778720

But on the bare metal host, the cluster looks OK:

http://pastebin.test.redhat.com/778722

and I am able to run `pcs cluster cib`.

Comment 1 Jiri Stransky 2019-07-11 09:02:32 UTC
I was able to get this "minimal" reproducer outside the deployment/upgrade tooling.

First I check that the cluster is running on the node:

[root@controller-0 ~]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-0 (version 2.0.1-4.el8_0.3-0eb7991564) - partition with quorum
Last updated: Thu Jul 11 08:57:25 2019
Last change: Wed Jul 10 10:20:34 2019 by root via cibadmin on controller-0

1 node configured
0 resources configured

Online: [ controller-0 ]

No resources


Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled


Then I look at the command used for the haproxy_init_bundle container:

[root@controller-0 ~]# paunch debug --action print-cmd --file /var/lib/tripleo-config/container-startup-config-step_2.json --container haproxy_init_bundle                                                         
podman run --name haproxy_init_bundle-2l2qbc2v --conmon-pidfile=/var/run/haproxy_init_bundle.pid --env=TRIPLEO_DEPLOY_IDENTIFIER=1562750152 --net=host --ipc=host --privileged=true --user=root --volume=/etc/hosts:/etc/hosts:ro --volume=/etc/localtime:/etc/localtime:ro --volume=/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro --volume=/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro --volume=/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro --volume=/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro --volume=/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro --volume=/dev/log:/dev/log --volume=/var/lib/container-config-scripts/container_puppet_apply.sh:/container_puppet_apply.sh:ro --volume=/etc/puppet:/tmp/puppet-etc:ro --volume=/usr/share/openstack-puppet/modules:/usr/share/openstack-puppet/modules:ro --volume=/etc/corosync/corosync.conf:/etc/corosync/corosync.conf:ro brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-haproxy:latest /container_puppet_apply.sh 2 file,file_line,concat,augeas,pacemaker::resource::bundle,pacemaker::property,pacemaker::resource::ip,pacemaker::resource::ocf,pacemaker::constraint::order,pacemaker::constraint::colocation include ::tripleo::profile::base::pacemaker; include ::tripleo::profile::pacemaker::haproxy_bundle 

And I edit the command to run just `pcs status` instead of Puppet:

[root@controller-0 ~]# podman run --rm -ti --name haproxy_init_bundle-test --conmon-pidfile=/var/run/haproxy_init_bundle.pid --env=TRIPLEO_DEPLOY_IDENTIFIER=1562750152 --net=host --ipc=host --privileged=true --user=root --volume=/etc/hosts:/etc/hosts:ro --volume=/etc/localtime:/etc/localtime:ro --volume=/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro --volume=/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro --volume=/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro --volume=/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro --volume=/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro --volume=/dev/log:/dev/log --volume=/var/lib/container-config-scripts/container_puppet_apply.sh:/container_puppet_apply.sh:ro --volume=/etc/puppet:/tmp/puppet-etc:ro --volume=/usr/share/openstack-puppet/modules:/usr/share/openstack-puppet/modules:ro --volume=/etc/corosync/corosync.conf:/etc/corosync/corosync.conf:ro brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-haproxy:latest pcs status
Error: cluster is not currently running on this node

Inside the container, `pcs status` reports that the cluster is not running on this node, even though it is running there :).

Comment 2 Jiri Stransky 2019-07-11 09:12:30 UTC
Not sure if it is related, but the environment had a workaround for https://bugzilla.redhat.com/show_bug.cgi?id=1726680 applied: before running the upgrade, I ran `userdel hacluster` to force recreation of the user and re-authentication. The workaround itself worked, in that the hacluster user was recreated and the Pacemaker cluster was formed correctly, before the upgrade hit the issue with the haproxy and redis init containers.

Comment 3 Jiri Stransky 2019-07-11 10:04:20 UTC
This is likely caused by the workaround for https://bugzilla.redhat.com/show_bug.cgi?id=1726680, which changes the hacluster UID/GID from the well-known values to "random" ones. We need a different workaround: just changing the password instead of deleting the user, which forces Puppet to refresh its resources without breaking the UID/GID expectations on the hacluster user. I'm testing this, and if it fixes the problem, I'll close this as WFM.
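
One way to check the suspected UID/GID mismatch is to compare the hacluster entry on the host with the one baked into the container image. The following is only a sketch under assumptions: the helper names `passwd_ids` and `compare_ids` are made up for illustration, and it assumes `getent` is available both on the host and inside the image.

```shell
# Hypothetical helpers (not part of TripleO or pcs).

# Print "uid:gid" for the given user from passwd-format text on stdin.
passwd_ids() {
    awk -F: -v user="$1" '$1 == user { print $3 ":" $4 }'
}

# Compare the host and container "uid:gid" strings and report the result.
compare_ids() {
    if [ "$1" = "$2" ]; then
        echo "match"
    else
        echo "mismatch: host=$1 container=$2"
    fi
}

# Usage on a controller node (shown commented out; IMAGE is whatever
# image the init bundle container uses):
#   host_ids=$(getent passwd hacluster | passwd_ids hacluster)
#   ctr_ids=$(podman run --rm "$IMAGE" getent passwd hacluster | passwd_ids hacluster)
#   compare_ids "$host_ids" "$ctr_ids"
```

If the two values differ after `userdel hacluster` and recreation of the user, that would be consistent with the explanation above; it does not by itself prove the cause.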

Comment 4 Jiri Stransky 2019-07-11 12:01:55 UTC
The new workaround for bug 1726680 doesn't cause this issue. Closing.