Bug 2178923 - [CIX] Unable to deploy standalone for osp-17.1 on rhel9 [NEEDINFO]
Summary: [CIX] Unable to deploy standalone for osp-17.1 on rhel9
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: puppet-pacemaker
Version: 17.1 (Wallaby)
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: OSP Team
QA Contact: Nobody
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2023-03-16 07:45 UTC by Cédric Jeanneret
Modified: 2023-08-03 15:46 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-03-20 10:42:19 UTC
Target Upstream Version:
Embargoed:
ifrangs: needinfo? (rhos-maint)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker OSP-23127 0 None None None 2023-03-16 07:46:16 UTC

Description Cédric Jeanneret 2023-03-16 07:45:42 UTC
Hello,

We seem to have an issue with puppet-pacemaker or its call from within TripleO/OSP deploy. It fails with this error:
Error: pcs -f  resource op defaults timeout='120s' failed: . Too many tries",

Full log is available here:
https://sf.hosted.upshift.rdu2.redhat.com/logs/34/441034/2/check/periodic-tripleo-ci-rhel-9-standalone-full-tempest-scenario-rhos-17.1/b5abf3a/logs/undercloud/home/zuul/standalone_deploy.log

Other logs of interest may be found by browsing the tree.

Thank you for your attention :)

Cheers,

C.

Comment 1 Luca Miccini 2023-03-16 08:13:06 UTC
Thanks Cédric,

I had a quick look at the journal here https://sf.hosted.upshift.rdu2.redhat.com/logs/34/441034/2/check/periodic-tripleo-ci-rhel-9-standalone-full-tempest-scenario-rhos-17.1/b5abf3a/logs/undercloud/var/log/extra/journal.txt 

serious trouble seems to start around:

Mar 15 12:26:22 standalone.localdomain pacemaker-controld[99452]:  notice: High CPU load detected: 8.890000

and then it gets worse:

Mar 15 12:26:52 standalone.localdomain pacemaker-controld[99452]:  notice: High CPU load detected: 10.180000

Mar 15 12:27:22 standalone.localdomain pacemaker-controld[99452]:  notice: High CPU load detected: 11.290000

Mar 15 12:27:37 standalone.localdomain kernel: INFO: task auditd:660 blocked for more than 122 seconds.
Mar 15 12:27:37 standalone.localdomain kernel: INFO: task auditd:661 blocked for more than 122 seconds.
Mar 15 12:27:37 standalone.localdomain kernel: INFO: task kworker/u8:4:715 blocked for more than 122 seconds.

until it impacts pacemaker:

Mar 15 12:28:04 standalone.localdomain pacemakerd[99446]:  notice: pacemaker-schedulerd[99451] is unresponsive to ipc after 1 tries

to the point that it self-terminates in order to recover:

Mar 15 12:32:09 standalone.localdomain pacemakerd[99446]:  error: pacemaker-schedulerd[99451] is unresponsive to ipc after 21 tries but we found the pid so have it killed that we can restart
Mar 15 12:32:09 standalone.localdomain pacemakerd[99446]:  notice: Stopping pacemaker-schedulerd
Mar 15 12:32:10 standalone.localdomain auditd[660]: Error receiving audit netlink packet (No buffer space available)
Mar 15 12:32:11 standalone.localdomain sshd[113067]: Received disconnect from 127.0.0.1 port 38444:11: disconnected by user
Mar 15 12:32:11 standalone.localdomain sshd[113067]: Disconnected from user zuul 127.0.0.1 port 38444
Mar 15 12:32:11 standalone.localdomain sshd[113064]: pam_unix(sshd:session): session closed for user zuul
Mar 15 12:32:11 standalone.localdomain systemd-logind[702]: Session 885 logged out. Waiting for processes to exit.
Mar 15 12:32:11 standalone.localdomain systemd[1]: session-885.scope: Deactivated successfully.
Mar 15 12:32:11 standalone.localdomain systemd-logind[702]: Removed session 885.
Mar 15 12:32:11 standalone.localdomain sshd[115568]: main: sshd: ssh-rsa algorithm is disabled
Mar 15 12:32:11 standalone.localdomain pacemakerd[99446]:  warning: pacemaker-schedulerd[99451] terminated with signal 9 (Killed)
Mar 15 12:32:11 standalone.localdomain pacemaker-attrd[99450]:  notice: Caught 'Terminated' signal
Mar 15 12:32:11 standalone.localdomain pacemakerd[99446]:  notice: Stopping pacemaker-attrd
Mar 15 12:32:11 standalone.localdomain pacemakerd[99446]:  notice: Stopping pacemaker-execd
Mar 15 12:32:11 standalone.localdomain pacemaker-execd[99449]:  notice: Caught 'Terminated' signal
Mar 15 12:32:11 standalone.localdomain pacemakerd[99446]:  notice: Stopping pacemaker-fenced
Mar 15 12:32:11 standalone.localdomain pacemaker-fenced[99448]:  notice: Caught 'Terminated' signal
Mar 15 12:32:11 standalone.localdomain pacemakerd[99446]:  notice: Stopping pacemaker-based
Mar 15 12:32:11 standalone.localdomain pacemaker-based[99447]:  notice: Caught 'Terminated' signal
Mar 15 12:32:11 standalone.localdomain pacemaker-based[99447]:  notice: Disconnected from Corosync
Mar 15 12:32:11 standalone.localdomain pacemaker-based[99447]:  notice: Disconnected from Corosync
Mar 15 12:32:11 standalone.localdomain pacemakerd[99446]:  notice: Shutdown complete
Mar 15 12:32:11 standalone.localdomain pacemakerd[99446]:  notice: Shutting down and staying down after fatal error

and by this point the task that you mentioned has already failed:

Mar 15 12:32:23 standalone.localdomain puppet-user[110666]: Error: pcs -f  resource op defaults timeout='120s' failed: . Too many tries


IMHO something seems to have happened to this VM (network? storage?), to the point that it went belly up.

Please let me know if you reproduce it so we can maybe have another look.
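
If it does reproduce, a small filter like the one below may help pull the same timeline out of journal.txt quickly. This is only a rough sketch: the function name `timeline`, the category labels and the regular expressions are my own assumptions, written against the journal lines quoted above, and the sample lines are copied from those excerpts. It is not part of pacemaker or TripleO tooling.

```python
import re

# Patterns written against the journal lines quoted in this comment.
# The category names are hypothetical labels chosen for this sketch.
PATTERNS = {
    "cpu_load": re.compile(r"High CPU load detected: (?P<value>[\d.]+)"),
    "hung_task": re.compile(r"task (?P<value>\S+) blocked for more than \d+ seconds"),
    "ipc_unresponsive": re.compile(r"(?P<value>pacemaker-\w+)\[\d+\] is unresponsive to ipc"),
    "pcs_failure": re.compile(r"Error: (?P<value>pcs .*Too many tries)"),
}

# Journal prefix: "Mar 15 12:26:22 <host> <rest of line>"
PREFIX = re.compile(r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ (?P<rest>.*)$")


def timeline(lines):
    """Yield (timestamp, category, detail) for each line matching a pattern."""
    for line in lines:
        m = PREFIX.match(line.rstrip("\n"))
        if not m:
            continue  # not a journal-style line
        for category, pattern in PATTERNS.items():
            hit = pattern.search(m.group("rest"))
            if hit:
                yield m.group("ts"), category, hit.group("value")
                break  # report at most one category per line


# Sample input copied from the journal excerpts above.
SAMPLE = [
    "Mar 15 12:26:22 standalone.localdomain pacemaker-controld[99452]:  notice: High CPU load detected: 8.890000",
    "Mar 15 12:27:37 standalone.localdomain kernel: INFO: task auditd:660 blocked for more than 122 seconds.",
    "Mar 15 12:28:04 standalone.localdomain pacemakerd[99446]:  notice: pacemaker-schedulerd[99451] is unresponsive to ipc after 1 tries",
    "Mar 15 12:32:11 standalone.localdomain pacemakerd[99446]:  notice: Shutdown complete",
    "Mar 15 12:32:23 standalone.localdomain puppet-user[110666]: Error: pcs -f  resource op defaults timeout='120s' failed: . Too many tries",
]

for ts, category, detail in timeline(SAMPLE):
    print(f"{ts}  {category:<17} {detail}")
```

To run it against a real log, replace `SAMPLE` with an open file handle, for example `open("journal.txt", errors="replace")`. Lines that match none of the patterns (such as the "Shutdown complete" line in the sample) are skipped, so the output only shows the load, hung-task, IPC and pcs events in order.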

Cheers
Luca

