Description of problem:

rhos-15 standalone deployments have been failing since 06/07 with pacemaker errors:

2019-06-11 19:00:42 | "Error: Facter: error while resolving custom fact \"rabbitmq_nodename\": undefined method `[]' for nil:NilClass"
2019-06-11 19:00:42 | "Error: /Stage[main]/Pacemaker::Stonith/Pacemaker::Property[Disable STONITH]/Pcmk_property[property--stonith-enabled]: Could not evaluate: backup_cib: Running: pcs cluster cib /var/lib/pacemaker/cib/puppet-cib-backup20190611-9-2wdpkm failed with code: 1 -> "
2019-06-11 19:00:42 | "Error: /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq_bundle/Pacemaker::Property[rabbitmq-role-standalone]/Pcmk_property[property-standalone-rabbitmq-role]: Could not evaluate: backup_cib: Running: pcs cluster cib /var/lib/pacemaker/cib/puppet-cib-backup20190611-9-1lijsnv failed with code: 1 -> "
2019-06-11 19:00:42 | "Warning: /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq_bundle/Pacemaker::Resource::Bundle[rabbitmq-bundle]/Pcmk_bundle[rabbitmq-bundle]: Skipping because of failed dependencies"
2019-06-11 19:00:42 | "Warning: /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq_bundle/Pacemaker::Resource::Ocf[rabbitmq]/Pcmk_resource[rabbitmq]: Skipping because of failed dependencies"
2019-06-11 19:00:42 | "Warning: /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq_bundle/Exec[rabbitmq-ready]: Skipping because of failed dependencies"

Full deployment log:
https://sf.hosted.upshift.rdu2.redhat.com/logs/45/171845/3/check/periodic-tripleo-ci-rhel-8-standalone-latest-rhos-15/514702b/logs/undercloud/home/zuul/standalone_deploy.log.txt.gz

Version-Release number of selected component (if applicable):

rpm diff between the failing job and the last known good run:
https://www.diffchecker.com/nOTpboiT

How reproducible:

Consistently failing since 06/07 - was passing before then.

Steps to Reproduce:
1. Run periodic-tripleo-ci-rhel-8-standalone-latest-rhos-15 (uses needs_images) or tripleo-ci-rhel-8-standalone-rhos-15
2. See the standalone deployment failures
3. Compare to the passing log from https://sf.hosted.upshift.rdu2.redhat.com/logs/periodic/code.engineering.redhat.com/openstack/tripleo-ci-internal-jobs/master/periodic-tripleo-ci-rhel-8-standalone-need-images-rhos-15/2d620ac/logs/undercloud/home/zuul/standalone_deploy.log.txt.gz

The suggestion was to escalate to the pidone team for assistance.

Actual results:

Expected results:

Additional info:
Hi Ronelle,

could it be that we're missing the logs related to the failure in question? From
https://sf.hosted.upshift.rdu2.redhat.com/logs/45/171845/3/check/periodic-tripleo-ci-rhel-8-standalone-latest-rhos-15/514702b/logs/undercloud/home/zuul/standalone_deploy.log.txt.gz
I see:

2019-06-11 19:00:42 | "Error: /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq_bundle/Pacemaker::Property[rabbitmq-role-standalone]/Pcmk_property[property-standalone-rabbitmq-role]: Could not evaluate: backup_cib: Running: pcs cluster cib /var/lib/pacemaker/cib/puppet-cib-backup20190611-9-1lijsnv failed with code: 1 -> "

But the pacemaker logs only go up to 06/06:
https://sf.hosted.upshift.rdu2.redhat.com/logs/periodic/code.engineering.redhat.com/openstack/tripleo-ci-internal-jobs/master/periodic-tripleo-ci-rhel-8-standalone-need-images-rhos-15/2d620ac/logs/undercloud/var/log/pacemaker/pacemaker.log.txt.gz

Or am I looking in the wrong place?

Thanks,
Michele
Hi Michele,

Below is the link to the full failing log:
https://sf.hosted.upshift.rdu2.redhat.com/logs/periodic/code.engineering.redhat.com/openstack/tripleo-ci-internal-jobs/master/tripleo-ci-rhel-8-standalone-rhos-15/0dca913/

and the pacemaker log:
https://sf.hosted.upshift.rdu2.redhat.com/logs/periodic/code.engineering.redhat.com/openstack/tripleo-ci-internal-jobs/master/tripleo-ci-rhel-8-standalone-rhos-15/0dca913/logs/undercloud/var/log/pacemaker/pacemaker.log.txt.gz
Hi Ronelle,

thanks a lot for the logs. This is a bit weird, because:

1) We see the failure at https://sf.hosted.upshift.rdu2.redhat.com/logs/periodic/code.engineering.redhat.com/openstack/tripleo-ci-internal-jobs/master/tripleo-ci-rhel-8-standalone-rhos-15/0dca913/logs/undercloud/var/log/containers/stdouts/rabbitmq_init_bundle.log.txt.gz:

2019-06-12T00:33:27.798245617+00:00 stderr F Error: unable to get cib
2019-06-12T00:33:27.827888957+00:00 stderr F Error: /Stage[main]/Pacemaker::Stonith/Pacemaker::Property[Disable STONITH]/Pcmk_property[property--stonith-enabled]: Could not evaluate: backup_cib: Running: pcs cluster cib /var/lib/pacemaker/cib/puppet-cib-backup20190612-9-1mdrtu8 failed with code: 1 ->
2019-06-12T00:33:28.341700609+00:00 stderr F Error: unable to get cib
2019-06-12T00:33:28.370270346+00:00 stderr F Error: /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq_bundle/Pacemaker::Property[rabbitmq-role-standalone]/Pcmk_property[property-standalone-rabbitmq-role]: Could not evaluate: backup_cib: Running: pcs cluster cib /var/lib/pacemaker/cib/puppet-cib-backup20190612-9-4lkxnu failed with code: 1 ->

2) Yet pcsd gives us this in https://sf.hosted.upshift.rdu2.redhat.com/logs/periodic/code.engineering.redhat.com/openstack/tripleo-ci-internal-jobs/master/tripleo-ci-rhel-8-standalone-rhos-15/0dca913/logs/undercloud/var/log/pcsd/pcsd.log.txt.gz:

I, [2019-06-12T00:30:42.859 #00000] INFO -- : 200 GET /remote/get_configs?cluster_name=tripleo_cluster (192.168.24.1) 416.56ms
I, [2019-06-12T00:30:42.867 #00010] INFO -- : SRWT Node: standalone Request: get_configs
I, [2019-06-12T00:30:42.867 #00010] INFO -- : Connecting to: https://standalone:2224/remote/get_configs?cluster_name=tripleo_cluster
I, [2019-06-12T00:31:42.984 #00000] INFO -- : 200 GET /remote/get_configs?cluster_name=tripleo_cluster (192.168.24.1) 457.94ms
I, [2019-06-12T00:31:42.992 #00011] INFO -- : SRWT Node: standalone Request: get_configs
I, [2019-06-12T00:31:42.993 #00011] INFO -- : Connecting to: https://standalone:2224/remote/get_configs?cluster_name=tripleo_cluster
I, [2019-06-12T00:32:42.872 #00000] INFO -- : 200 GET /remote/get_configs?cluster_name=tripleo_cluster (192.168.24.1) 403.18ms
I, [2019-06-12T00:32:42.880 #00012] INFO -- : SRWT Node: standalone Request: get_configs
I, [2019-06-12T00:32:42.880 #00012] INFO -- : Connecting to: https://standalone:2224/remote/get_configs?cluster_name=tripleo_cluster
I, [2019-06-12T00:33:42.895 #00000] INFO -- : 200 GET /remote/get_configs?cluster_name=tripleo_cluster (192.168.24.1) 420.13ms
I, [2019-06-12T00:33:42.903 #00013] INFO -- : SRWT Node: standalone Request: get_configs
I, [2019-06-12T00:33:42.903 #00013] INFO -- : Connecting to: https://standalone:2224/remote/get_configs?cluster_name=tripleo_cluster
I, [2019-06-12T00:34:42.880 #00000] INFO -- : 200 GET /remote/get_configs?cluster_name=tripleo_cluster (192.168.24.1) 413.35ms
I, [2019-06-12T00:34:42.887 #00014] INFO -- : SRWT Node: standalone Request: get_configs
I, [2019-06-12T00:34:42.887 #00014] INFO -- : Connecting to: https://standalone:2224/remote/get_configs?cluster_name=tripleo_cluster

So it seems that, at least at 00:32:42, some get_configs call was replied to correctly. Either we issue too many calls and pcs can't cope, or something else is at stake here.

This interesting case made me realize that puppet does not capture stderr when running the cib backup/push commands. The fix for that is here: https://review.opendev.org/664955

I will try to reproduce this locally (note that starting Friday I am on PTO for two weeks), but if you could throw an environment my way where the issue is reproduced, that would help a lot as well.
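To illustrate why the missing stderr matters: the deploy log shows "failed with code: 1 -> " with nothing after the arrow, while the real cause ("unable to get cib") only appears in the container stdout log. Below is a minimal Python sketch of the pattern the puppet fix introduces, i.e. capturing stderr and surfacing it in the error message. This is not the actual puppet-pacemaker code; the function name is hypothetical.

```python
import subprocess

def run_cib_cmd(cmd):
    """Run a pcs-style command; on failure, include stderr in the error.

    Without capturing stderr, all the caller sees is an opaque
    "failed with code: 1 -> " message; with it, errors such as
    "unable to get cib" become visible in the deploy log.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        raise RuntimeError(
            "%s failed with code: %d -> %s"
            % (" ".join(cmd), proc.returncode, proc.stderr.strip())
        )
    return proc.stdout

# e.g. run_cib_cmd(["pcs", "cluster", "cib", "/var/lib/pacemaker/cib/backup.xml"])
```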
So with the help of lmiccini we tried to reproduce the problem, to no avail:

PLAY RECAP *********************************************************************************************************
standalone-0               : ok=273  changed=91  unreachable=0  failed=0  skipped=416  rescued=0  ignored=1
undercloud                 : ok=11   changed=7   unreachable=0  failed=0  skipped=32   rescued=0  ignored=0

Not cleaning working directory /home/cloud-user/tripleo-heat-installer-templates
Not cleaning ansible directory /home/cloud-user/undercloud-ansible-l8_41rcl
Install artifact is located at /home/cloud-user/undercloud-install-20190613074647.tar.bzip2
########################################################
Deployment successful!
########################################################

Now that https://review.opendev.org/664955 has merged, once a new compose containing that review comes out, we will have a better chance of understanding what is going on. Ideally, if you could reproduce the issue with the following hiera key set, that would be helpful as well:

pacemaker::corosync::pcsd_debug: true

That should be a bit more telling as to what is going on.

Thanks,
Michele
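For reference, one way to feed such a hiera override into a TripleO standalone deployment is through `ExtraConfig` in a parameters environment file passed to `openstack tripleo deploy -e`. A sketch (the filename is illustrative, and the exact wiring may differ per release):

```yaml
# standalone_parameters.yaml (illustrative name)
parameter_defaults:
  ExtraConfig:
    # turn on pcsd debug logging as requested above
    pacemaker::corosync::pcsd_debug: true
```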
We have rhos-15 deployments working again. The base image we were using for rhos-15 was pulling in some dev repos, which may have landed us on unsupported package versions. We fixed that problem and have passing jobs now.
Thanks for the feedback; closing this for now, then.