Hello,

We seem to have an issue with puppet-pacemaker or its call from within TripleO/OSP deploy. It fails with this error:

Error: pcs -f resource op defaults timeout='120s' failed: . Too many tries

Full log is available here:
https://sf.hosted.upshift.rdu2.redhat.com/logs/34/441034/2/check/periodic-tripleo-ci-rhel-9-standalone-full-tempest-scenario-rhos-17.1/b5abf3a/logs/undercloud/home/zuul/standalone_deploy.log

Other logs of interest may be found by browsing the tree.

Thank you for your attention :)

Cheers,

C.
Thanks Cédric,

I had a quick look at the journal here:
https://sf.hosted.upshift.rdu2.redhat.com/logs/34/441034/2/check/periodic-tripleo-ci-rhel-9-standalone-full-tempest-scenario-rhos-17.1/b5abf3a/logs/undercloud/var/log/extra/journal.txt

Serious trouble seems to start around:

Mar 15 12:26:22 standalone.localdomain pacemaker-controld[99452]: notice: High CPU load detected: 8.890000

and then it gets worse:

Mar 15 12:26:52 standalone.localdomain pacemaker-controld[99452]: notice: High CPU load detected: 10.180000
Mar 15 12:27:22 standalone.localdomain pacemaker-controld[99452]: notice: High CPU load detected: 11.290000
Mar 15 12:27:37 standalone.localdomain kernel: INFO: task auditd:660 blocked for more than 122 seconds.
Mar 15 12:27:37 standalone.localdomain kernel: INFO: task auditd:661 blocked for more than 122 seconds.
Mar 15 12:27:37 standalone.localdomain kernel: INFO: task kworker/u8:4:715 blocked for more than 122 seconds.

until it impacts pacemaker:

Mar 15 12:28:04 standalone.localdomain pacemakerd[99446]: notice: pacemaker-schedulerd[99451] is unresponsive to ipc after 1 tries

to the point that it self-terminates in order to recover:

Mar 15 12:32:09 standalone.localdomain pacemakerd[99446]: error: pacemaker-schedulerd[99451] is unresponsive to ipc after 21 tries but we found the pid so have it killed that we can restart
Mar 15 12:32:09 standalone.localdomain pacemakerd[99446]: notice: Stopping pacemaker-schedulerd
Mar 15 12:32:10 standalone.localdomain auditd[660]: Error receiving audit netlink packet (No buffer space available)
Mar 15 12:32:11 standalone.localdomain sshd[113067]: Received disconnect from 127.0.0.1 port 38444:11: disconnected by user
Mar 15 12:32:11 standalone.localdomain sshd[113067]: Disconnected from user zuul 127.0.0.1 port 38444
Mar 15 12:32:11 standalone.localdomain sshd[113064]: pam_unix(sshd:session): session closed for user zuul
Mar 15 12:32:11 standalone.localdomain systemd-logind[702]: Session 885 logged out. Waiting for processes to exit.
Mar 15 12:32:11 standalone.localdomain systemd[1]: session-885.scope: Deactivated successfully.
Mar 15 12:32:11 standalone.localdomain systemd-logind[702]: Removed session 885.
Mar 15 12:32:11 standalone.localdomain sshd[115568]: main: sshd: ssh-rsa algorithm is disabled
Mar 15 12:32:11 standalone.localdomain pacemakerd[99446]: warning: pacemaker-schedulerd[99451] terminated with signal 9 (Killed)
Mar 15 12:32:11 standalone.localdomain pacemaker-attrd[99450]: notice: Caught 'Terminated' signal
Mar 15 12:32:11 standalone.localdomain pacemakerd[99446]: notice: Stopping pacemaker-attrd
Mar 15 12:32:11 standalone.localdomain pacemakerd[99446]: notice: Stopping pacemaker-execd
Mar 15 12:32:11 standalone.localdomain pacemaker-execd[99449]: notice: Caught 'Terminated' signal
Mar 15 12:32:11 standalone.localdomain pacemakerd[99446]: notice: Stopping pacemaker-fenced
Mar 15 12:32:11 standalone.localdomain pacemaker-fenced[99448]: notice: Caught 'Terminated' signal
Mar 15 12:32:11 standalone.localdomain pacemakerd[99446]: notice: Stopping pacemaker-based
Mar 15 12:32:11 standalone.localdomain pacemaker-based[99447]: notice: Caught 'Terminated' signal
Mar 15 12:32:11 standalone.localdomain pacemaker-based[99447]: notice: Disconnected from Corosync
Mar 15 12:32:11 standalone.localdomain pacemaker-based[99447]: notice: Disconnected from Corosync
Mar 15 12:32:11 standalone.localdomain pacemakerd[99446]: notice: Shutdown complete
Mar 15 12:32:11 standalone.localdomain pacemakerd[99446]: notice: Shutting down and staying down after fatal error

and at this point the task that you mentioned had already failed:

Mar 15 12:32:23 standalone.localdomain puppet-user[110666]: Error: pcs -f resource op defaults timeout='120s' failed: . Too many tries

IMHO it seems like something happened to this VM (network? storage?) to the point that it went belly up. Please let me know if you reproduce it so we can maybe have another look.

Cheers,

Luca
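
If the failure reproduces, the same triage of the journal could be scripted rather than done by eye. The following is only a rough sketch: the function name `summarize` and the event labels are hypothetical, and the regular expressions are derived solely from the message formats quoted in the journal excerpt above, so they may need adjusting for other pacemaker or kernel versions.

```python
import re

# Syslog-style prefix as seen in the journal: "Mar 15 12:26:22 host rest-of-line"
PREFIX = re.compile(r"^(?P<ts>\w{3} +\d+ \d{2}:\d{2}:\d{2}) (?P<host>\S+) (?P<rest>.*)$")

# Symptom patterns, taken from the messages quoted in this thread (assumed formats).
CPU_LOAD = re.compile(r"High CPU load detected: (?P<load>\d+\.\d+)")
HUNG_TASK = re.compile(r"INFO: task (?P<task>\S+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds")
IPC_STALL = re.compile(r"(?P<daemon>pacemaker-\w+)\[\d+\] is unresponsive to ipc after (?P<tries>\d+) tries")


def summarize(lines):
    """Return (timestamp, kind, value) tuples for the symptoms found, in input order."""
    events = []
    for line in lines:
        m = PREFIX.match(line.strip())
        if not m:
            continue  # skip anything that is not a syslog-style line
        ts, rest = m["ts"], m["rest"]
        if (c := CPU_LOAD.search(rest)):
            events.append((ts, "cpu_load", float(c["load"])))
        elif (h := HUNG_TASK.search(rest)):
            events.append((ts, "hung_task", h["task"]))
        elif (i := IPC_STALL.search(rest)):
            events.append((ts, "ipc_stall", int(i["tries"])))
    return events


if __name__ == "__main__":
    import sys
    # Usage: python triage.py < journal.txt
    for ts, kind, value in summarize(sys.stdin):
        print(ts, kind, value)
```

Reading the output top to bottom should give the same picture as above: rising CPU load, then hung kernel tasks, then pacemaker IPC stalls before the puppet error.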