Bug 1719421 - rhos-15 standalone deployments are failing with "Error: /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq_bundle/Pacemaker::Property[rabbitmq-role-standalone]/Pcmk_property[property-standalone-rabbitmq-role]: Could not evaluate: backup_cib"
Summary: rhos-15 standalone deployments are failing with "Error: /Stage[main]/Tripleo...
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: puppet-pacemaker
Version: 15.0 (Stein)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: RHOS Maint
QA Contact: nlevinki
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-06-11 19:13 UTC by Ronelle Landy
Modified: 2019-06-24 08:13 UTC
CC: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-24 08:13:15 UTC
Target Upstream Version:
Embargoed:



Description Ronelle Landy 2019-06-11 19:13:50 UTC
Description of problem:

rhos-15 standalone deployments have been failing since 06/07 with pacemaker errors:

2019-06-11 19:00:42 |     "Error: Facter: error while resolving custom fact \"rabbitmq_nodename\": undefined method `[]' for nil:NilClass",
2019-06-11 19:00:42 |     "Error: /Stage[main]/Pacemaker::Stonith/Pacemaker::Property[Disable STONITH]/Pcmk_property[property--stonith-enabled]: Could not evaluate: backup_cib: Running: pcs cluster cib /var/lib/pacemaker/cib/puppet-cib-backup20190611-9-2wdpkm failed with code: 1 -> ",
2019-06-11 19:00:42 |     "Error: /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq_bundle/Pacemaker::Property[rabbitmq-role-standalone]/Pcmk_property[property-standalone-rabbitmq-role]: Could not evaluate: backup_cib: Running: pcs cluster cib /var/lib/pacemaker/cib/puppet-cib-backup20190611-9-1lijsnv failed with code: 1 -> ",
2019-06-11 19:00:42 |     "Warning: /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq_bundle/Pacemaker::Resource::Bundle[rabbitmq-bundle]/Pcmk_bundle[rabbitmq-bundle]: Skipping because of failed dependencies",
2019-06-11 19:00:42 |     "Warning: /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq_bundle/Pacemaker::Resource::Ocf[rabbitmq]/Pcmk_resource[rabbitmq]: Skipping because of failed dependencies",
2019-06-11 19:00:42 |     "Warning: /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq_bundle/Exec[rabbitmq-ready]: Skipping because of failed dependencies",
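The backup_cib failure above means the puppet-pacemaker property provider could not dump the cluster CIB (pcs cluster cib exited non-zero). As a minimal manual check on the failing standalone node (assuming root shell access; the backup file name below is arbitrary), the same command the provider wraps can be re-run by hand and the cluster state inspected:

# Re-run the command that backup_cib wraps, per the errors above
pcs cluster cib /tmp/puppet-cib-backup-manual.xml
echo "pcs cluster cib exit code: $?"
# If the dump fails, check whether the cluster stack is actually up
pcs status
systemctl status corosync pacemaker pcsd --no-pager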

Full deployment log is below:

https://sf.hosted.upshift.rdu2.redhat.com/logs/45/171845/3/check/periodic-tripleo-ci-rhel-8-standalone-latest-rhos-15/514702b/logs/undercloud/home/zuul/standalone_deploy.log.txt.gz
Version-Release number of selected component (if applicable):

An rpm diff between the failing job and the last known good job: https://www.diffchecker.com/nOTpboiT
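For reference, a package-level diff like the one above can be regenerated with a short shell sketch (hypothetical file names; run the rpm command on each node, then diff the two outputs):

rpm -qa | sort > /tmp/rpms-failing.txt    # on the failing job's node
rpm -qa | sort > /tmp/rpms-lastgood.txt   # on the last known good node
diff -u /tmp/rpms-lastgood.txt /tmp/rpms-failing.txt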

How reproducible:

Consistently failing since 06/07; it was passing before then.

Steps to Reproduce:
1. Run periodic-tripleo-ci-rhel-8-standalone-latest-rhos-15 (uses needs_images) or tripleo-ci-rhel-8-standalone-rhos-15
2. See the standalone deployment failures
3. Compare to passing log from https://sf.hosted.upshift.rdu2.redhat.com/logs/periodic/code.engineering.redhat.com/openstack/tripleo-ci-internal-jobs/master/periodic-tripleo-ci-rhel-8-standalone-need-images-rhos-15/2d620ac/logs/undercloud/home/zuul/standalone_deploy.log.txt.gz

The suggestion was to escalate to the pidone team for assistance.

Actual results:

The standalone deployment fails with the backup_cib/Pcmk_property errors shown above.

Expected results:

The standalone deployment completes successfully.

Additional info:

Comment 1 Michele Baldessari 2019-06-12 05:52:44 UTC
Hi Ronelle,

Could it be that we're missing the logs related to the failure in question?

From https://sf.hosted.upshift.rdu2.redhat.com/logs/45/171845/3/check/periodic-tripleo-ci-rhel-8-standalone-latest-rhos-15/514702b/logs/undercloud/home/zuul/standalone_deploy.log.txt.gz I see:
2019-06-11 19:00:42 |     "Error: /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq_bundle/Pacemaker::Property[rabbitmq-role-standalone]/Pcmk_property[property-standalone-rabbitmq-role]: Could not evaluate: backup_cib: Running: pcs cluster cib /var/lib/pacemaker/cib/puppet-cib-backup20190611-9-1lijsnv failed with code: 1 -> ",

But in the pacemaker log, the dates only go up to 06/06: https://sf.hosted.upshift.rdu2.redhat.com/logs/periodic/code.engineering.redhat.com/openstack/tripleo-ci-internal-jobs/master/periodic-tripleo-ci-rhel-8-standalone-need-images-rhos-15/2d620ac/logs/undercloud/var/log/pacemaker/pacemaker.log.txt.gz

Or am I looking in the wrong place?

Thanks,
Michele

Comment 3 Michele Baldessari 2019-06-12 15:14:24 UTC
Hi Ronelle,

thanks a lot for the logs. So this is a bit weird because:
1) We see the failure at https://sf.hosted.upshift.rdu2.redhat.com/logs/periodic/code.engineering.redhat.com/openstack/tripleo-ci-internal-jobs/master/tripleo-ci-rhel-8-standalone-rhos-15/0dca913/logs/undercloud/var/log/containers/stdouts/rabbitmq_init_bundle.log.txt.gz:
2019-06-12T00:33:27.798245617+00:00 stderr F Error: unable to get cib
2019-06-12T00:33:27.827888957+00:00 stderr F Error: /Stage[main]/Pacemaker::Stonith/Pacemaker::Property[Disable STONITH]/Pcmk_property[property--stonith-enabled]: Could not evaluate: backup_cib: Running: pcs cluster cib /var/lib/pacemaker/cib/puppet-cib-backup20190612-9-1mdrtu8 failed with code: 1 ->
2019-06-12T00:33:28.341700609+00:00 stderr F Error: unable to get cib
2019-06-12T00:33:28.370270346+00:00 stderr F Error: /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq_bundle/Pacemaker::Property[rabbitmq-role-standalone]/Pcmk_property[property-standalone-rabbitmq-role]: Could not evaluate: backup_cib: Running: pcs cluster cib /var/lib/pacemaker/cib/puppet-cib-backup20190612-9-4lkxnu failed with code: 1 ->


2) Yet pcsd gives us https://sf.hosted.upshift.rdu2.redhat.com/logs/periodic/code.engineering.redhat.com/openstack/tripleo-ci-internal-jobs/master/tripleo-ci-rhel-8-standalone-rhos-15/0dca913/logs/undercloud/var/log/pcsd/pcsd.log.txt.gz:

I, [2019-06-12T00:30:42.859 #00000]     INFO -- : 200 GET /remote/get_configs?cluster_name=tripleo_cluster (192.168.24.1) 416.56ms
I, [2019-06-12T00:30:42.867 #00010]     INFO -- : SRWT Node: standalone Request: get_configs
I, [2019-06-12T00:30:42.867 #00010]     INFO -- : Connecting to: https://standalone:2224/remote/get_configs?cluster_name=tripleo_cluster
I, [2019-06-12T00:31:42.984 #00000]     INFO -- : 200 GET /remote/get_configs?cluster_name=tripleo_cluster (192.168.24.1) 457.94ms
I, [2019-06-12T00:31:42.992 #00011]     INFO -- : SRWT Node: standalone Request: get_configs
I, [2019-06-12T00:31:42.993 #00011]     INFO -- : Connecting to: https://standalone:2224/remote/get_configs?cluster_name=tripleo_cluster
I, [2019-06-12T00:32:42.872 #00000]     INFO -- : 200 GET /remote/get_configs?cluster_name=tripleo_cluster (192.168.24.1) 403.18ms
I, [2019-06-12T00:32:42.880 #00012]     INFO -- : SRWT Node: standalone Request: get_configs
I, [2019-06-12T00:32:42.880 #00012]     INFO -- : Connecting to: https://standalone:2224/remote/get_configs?cluster_name=tripleo_cluster
I, [2019-06-12T00:33:42.895 #00000]     INFO -- : 200 GET /remote/get_configs?cluster_name=tripleo_cluster (192.168.24.1) 420.13ms
I, [2019-06-12T00:33:42.903 #00013]     INFO -- : SRWT Node: standalone Request: get_configs
I, [2019-06-12T00:33:42.903 #00013]     INFO -- : Connecting to: https://standalone:2224/remote/get_configs?cluster_name=tripleo_cluster
I, [2019-06-12T00:34:42.880 #00000]     INFO -- : 200 GET /remote/get_configs?cluster_name=tripleo_cluster (192.168.24.1) 413.35ms
I, [2019-06-12T00:34:42.887 #00014]     INFO -- : SRWT Node: standalone Request: get_configs
I, [2019-06-12T00:34:42.887 #00014]     INFO -- : Connecting to: https://standalone:2224/remote/get_configs?cluster_name=tripleo_cluster

So it seems that at least at 00:32:42 some get_configs calls were answered correctly. Now either we make too many of them and pcsd can't keep up, or something else is at play here.
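One way to test the "too many calls" hypothesis is to bucket the get_configs requests per minute straight out of pcsd.log (a sketch, assuming the log format excerpted above):

# Count get_configs requests per minute; $2 is the "[2019-06-12T00:30:42.859" field
grep get_configs pcsd.log.txt | awk '{print substr($2, 2, 16)}' | sort | uniq -c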

This interesting case made me realize that puppet does not capture stderr when running the CIB backup/push commands. The fix for that is here: https://review.opendev.org/664955

I will try to reproduce this locally (note that starting Friday I am on PTO for two weeks), but if you could throw an environment my way where the issue reproduces, that would help a lot as well.
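If such an environment shows up, a quick way to confirm it is the same failure (path as in the CI logs linked in point 1 above) would be:

grep -E 'unable to get cib|backup_cib' /var/log/containers/stdouts/rabbitmq_init_bundle.log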

Comment 4 Michele Baldessari 2019-06-13 07:50:12 UTC
So with the help of lmiccini we tried to reproduce the problem, to no avail; the deployment succeeded:
PLAY RECAP *********************************************************************************************************
standalone-0               : ok=273  changed=91   unreachable=0    failed=0    skipped=416  rescued=0    ignored=1  
undercloud                 : ok=11   changed=7    unreachable=0    failed=0    skipped=32   rescued=0    ignored=0  
                                                                                                                    
Not cleaning working directory /home/cloud-user/tripleo-heat-installer-templates                                    
Not cleaning ansible directory /home/cloud-user/undercloud-ansible-l8_41rcl                                         
Install artifact is located at /home/cloud-user/undercloud-install-20190613074647.tar.bzip2                         
                                                                                                                    
########################################################                                                            
                                                                                                                    
Deployment successful!                                                                                              
                                                                                                                    
########################################################                                                            


Now that https://review.opendev.org/664955 has merged, once a new compose with said review comes out we will have a better chance of understanding what is going on. Ideally, if you could reproduce it with the following hiera key set, that would be helpful as well:
pacemaker::corosync::pcsd_debug: true

That should be a bit more telling as to what is going on.
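On an already-deployed reproducer, a similar effect can likely be had without a redeploy (assumption: the hiera key above ultimately sets PCSD_DEBUG in /etc/sysconfig/pcsd on RHEL 8):

sed -i 's/^PCSD_DEBUG=.*/PCSD_DEBUG=true/' /etc/sysconfig/pcsd
systemctl restart pcsd
# debug output then lands in /var/log/pcsd/pcsd.log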

Thanks,
Michele

Comment 5 Ronelle Landy 2019-06-21 12:53:16 UTC
We have rhos-15 deployments working again.
Looking at the base image we were using for rhos-15, it was pulling in some dev repos, which may have landed us with unsupported versions.
We fixed that problem and have passing jobs now.
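For the record, unexpected dev repos in a base image can be spotted quickly with something like this (a sketch; run on the deployed node or inside the image):

dnf repolist
grep -rl 'enabled=1' /etc/yum.repos.d/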

Comment 6 Luca Miccini 2019-06-24 08:13:15 UTC
Thanks for the feedback; closing this for now then.

