Bug 2178923

Summary: [CIX] Unable to deploy standalone for osp-17.1 on rhel9
Product: Red Hat OpenStack
Reporter: Cédric Jeanneret <cjeanner>
Component: puppet-pacemaker
Assignee: OSP Team <rhos-maint>
Status: CLOSED NOTABUG
QA Contact: Nobody <nobody>
Severity: high
Docs Contact:
Priority: medium
Version: 17.1 (Wallaby)
CC: jjoyce, jschluet, lmiccini, rhos-maint, slinaber, tvignaud
Target Milestone: ---
Keywords: Triaged
Target Release: ---
Flags: ifrangs: needinfo? (rhos-maint)
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-03-20 10:42:19 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---

Description Cédric Jeanneret 2023-03-16 07:45:42 UTC
Hello,

We seem to have an issue with puppet-pacemaker, or with the way it is called from within the TripleO/OSP deploy. The deploy fails with this error:
Error: pcs -f  resource op defaults timeout='120s' failed: . Too many tries",
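
For reference, that message comes out of the bounded retry loop the deploy wraps around the pcs call; roughly something along these lines (the CIB path, retry count and sleep below are made up for illustration, this is not the actual puppet-pacemaker code):

# Rough bash sketch of the kind of bounded retry that ends in "Too many tries".
CIB=/tmp/puppet-cib.xml
max=20
for i in $(seq 1 "$max"); do
  out=$(pcs -f "$CIB" resource op defaults timeout='120s' 2>&1) && exit 0
  sleep 10
done
# After the last attempt the puppet resource fails with the captured pcs output;
# in the quoted error there is nothing between "failed:" and ".", i.e. no output
# was captured before the retries ran out.
echo "pcs -f $CIB resource op defaults timeout='120s' failed: $out. Too many tries" >&2
exit 1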

Full log is available here:
https://sf.hosted.upshift.rdu2.redhat.com/logs/34/441034/2/check/periodic-tripleo-ci-rhel-9-standalone-full-tempest-scenario-rhos-17.1/b5abf3a/logs/undercloud/home/zuul/standalone_deploy.log

Other logs of interest may be found by browsing the tree.

Thank you for your attention :)

Cheers,

C.

Comment 1 Luca Miccini 2023-03-16 08:13:06 UTC
Thanks Cédric,

I had a quick look at the journal here: https://sf.hosted.upshift.rdu2.redhat.com/logs/34/441034/2/check/periodic-tripleo-ci-rhel-9-standalone-full-tempest-scenario-rhos-17.1/b5abf3a/logs/undercloud/var/log/extra/journal.txt

Serious trouble seems to start around:

Mar 15 12:26:22 standalone.localdomain pacemaker-controld[99452]:  notice: High CPU load detected: 8.890000

and then it gets worse:

Mar 15 12:26:52 standalone.localdomain pacemaker-controld[99452]:  notice: High CPU load detected: 10.180000

Mar 15 12:27:22 standalone.localdomain pacemaker-controld[99452]:  notice: High CPU load detected: 11.290000

Mar 15 12:27:37 standalone.localdomain kernel: INFO: task auditd:660 blocked for more than 122 seconds.
Mar 15 12:27:37 standalone.localdomain kernel: INFO: task auditd:661 blocked for more than 122 seconds.
Mar 15 12:27:37 standalone.localdomain kernel: INFO: task kworker/u8:4:715 blocked for more than 122 seconds.
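
Those hung-task warnings mean tasks sat in uninterruptible sleep (D state) past the kernel's hung-task timeout (~120s by default), which usually points at storage/IO trouble rather than plain CPU load. If we catch a node in that state again, something like this would help tell the two apart (plain RHEL 9 tooling, nothing job-specific; iostat needs sysstat installed):

# load average plus any tasks currently stuck in uninterruptible sleep (D state)
uptime
ps -eo state,pid,comm | awk '$1 == "D"'
# the threshold the kernel uses for those "blocked for more than N seconds" messages
sysctl kernel.hung_task_timeout_secs
# disk latency/utilisation while the deploy is running (needs sysstat)
iostat -x 5 3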

until it impacts pacemaker:

Mar 15 12:28:04 standalone.localdomain pacemakerd[99446]:  notice: pacemaker-schedulerd[99451] is unresponsive to ipc after 1 tries

to the point that pacemakerd kills the unresponsive daemon and ends up shutting the whole stack down for good:

Mar 15 12:32:09 standalone.localdomain pacemakerd[99446]:  error: pacemaker-schedulerd[99451] is unresponsive to ipc after 21 tries but we found the pid so have it killed that we can restart
Mar 15 12:32:09 standalone.localdomain pacemakerd[99446]:  notice: Stopping pacemaker-schedulerd
Mar 15 12:32:10 standalone.localdomain auditd[660]: Error receiving audit netlink packet (No buffer space available)
Mar 15 12:32:11 standalone.localdomain sshd[113067]: Received disconnect from 127.0.0.1 port 38444:11: disconnected by user
Mar 15 12:32:11 standalone.localdomain sshd[113067]: Disconnected from user zuul 127.0.0.1 port 38444
Mar 15 12:32:11 standalone.localdomain sshd[113064]: pam_unix(sshd:session): session closed for user zuul
Mar 15 12:32:11 standalone.localdomain systemd-logind[702]: Session 885 logged out. Waiting for processes to exit.
Mar 15 12:32:11 standalone.localdomain systemd[1]: session-885.scope: Deactivated successfully.
Mar 15 12:32:11 standalone.localdomain systemd-logind[702]: Removed session 885.
Mar 15 12:32:11 standalone.localdomain sshd[115568]: main: sshd: ssh-rsa algorithm is disabled
Mar 15 12:32:11 standalone.localdomain pacemakerd[99446]:  warning: pacemaker-schedulerd[99451] terminated with signal 9 (Killed)
Mar 15 12:32:11 standalone.localdomain pacemaker-attrd[99450]:  notice: Caught 'Terminated' signal
Mar 15 12:32:11 standalone.localdomain pacemakerd[99446]:  notice: Stopping pacemaker-attrd
Mar 15 12:32:11 standalone.localdomain pacemakerd[99446]:  notice: Stopping pacemaker-execd
Mar 15 12:32:11 standalone.localdomain pacemaker-execd[99449]:  notice: Caught 'Terminated' signal
Mar 15 12:32:11 standalone.localdomain pacemakerd[99446]:  notice: Stopping pacemaker-fenced
Mar 15 12:32:11 standalone.localdomain pacemaker-fenced[99448]:  notice: Caught 'Terminated' signal
Mar 15 12:32:11 standalone.localdomain pacemakerd[99446]:  notice: Stopping pacemaker-based
Mar 15 12:32:11 standalone.localdomain pacemaker-based[99447]:  notice: Caught 'Terminated' signal
Mar 15 12:32:11 standalone.localdomain pacemaker-based[99447]:  notice: Disconnected from Corosync
Mar 15 12:32:11 standalone.localdomain pacemaker-based[99447]:  notice: Disconnected from Corosync
Mar 15 12:32:11 standalone.localdomain pacemakerd[99446]:  notice: Shutdown complete
Mar 15 12:32:11 standalone.localdomain pacemakerd[99446]:  notice: Shutting down and staying down after fatal error

and by this point the task you mentioned had already failed:

Mar 15 12:32:23 standalone.localdomain puppet-user[110666]: Error: pcs -f  resource op defaults timeout='120s' failed: . Too many tries
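
Once pacemakerd logs "Shutting down and staying down after fatal error" it will not restart the stack by itself, so any later pcs call from puppet would keep failing too. On a fresh reproducer I'd grab the cluster state right away with something like this (the timestamps are just the window from the excerpt above):

systemctl status pacemaker corosync --no-pager
pcs status --full
# scheduler/controld messages around the ipc timeouts
journalctl -u pacemaker --since "2023-03-15 12:25" --until "2023-03-15 12:35" --no-pager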


IMHO something happened to this VM (network? storage?) to the point where it simply went belly up.

Please let me know if you reproduce it so we can maybe have another look.

Cheers
Luca