Description of problem:

The environment is a pre-release of RH-OSP10 over RHEL7.3, with corosync-2.4.0-4.el7.x86_64. The system has been deployed in a virtual environment with one controller and one compute node. After a few days of usage, corosync on the controller is now using 100% of the CPU.

$ strace -p <corosync_pid> -y

shows a loop of:

write(7<pipe:[82267]>, "\v\0\0\0", 4) = -1 EAGAIN (Resource temporarily unavailable)

There are no apparently relevant logs around that time.
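For context (this is not corosync/libqb source, just a minimal self-contained illustration of the syscall pattern above): a non-blocking write to a full pipe fails immediately with EAGAIN, so a caller that retries without waiting for the pipe to drain spins at 100% CPU, which is exactly what the strace loop looks like. A sketch in C:

/* Illustration only: spin on EAGAIN when writing to a full non-blocking pipe. */
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

int main(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return 1;

    /* Make the write end non-blocking (libqb uses a non-blocking wakeup pipe). */
    fcntl(fds[1], F_SETFL, fcntl(fds[1], F_GETFL) | O_NONBLOCK);

    /* Fill the pipe completely so further writes cannot succeed. */
    char c = 0;
    while (write(fds[1], &c, 1) > 0)
        ;

    /* Busy retry loop: every write fails with EAGAIN because nothing ever
     * drains the read end, so the process makes no progress and burns CPU. */
    uint32_t msg = 11; /* the "\v\0\0\0" payload from strace is the value 11 */
    for (;;) {
        ssize_t rc = write(fds[1], &msg, sizeof(msg));
        if (rc == (ssize_t)sizeof(msg))
            break;
        if (rc < 0 && errno != EAGAIN)
            break;
        /* no poll()/sleep here -> 100% CPU */
    }
    return 0;
}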
Additional note: the machine shows enough memory available, though the status may have been different when the issue happened:

# free -m
              total        used        free      shared  buff/cache   available
Mem:          11854        9305         309          42        2239        1958
Swap:             0           0           0
The strace log looks interesting. Any chance I can get access to that machine (ideally with debug info installed) when corosync gets into the 100% CPU usage loop?
Also, to reduce the problem area as quickly as possible: does the problem happen regularly (i.e. we have a reproducer), or is this the first time it has happened? If it is happening regularly, would it be possible to install (I know it's ugly, but it can help) the RHEL 7.2 libqb (latest update, so 0.17.1-2.1)?
My findings:

- The program is stuck in a libqb write to a pipe while handling a signal. This is definitely something to "improve", but it is not the root cause of the problem.
- The root cause is a call to trie_node_next which results in a segfault. It's impossible to say whether the problem was hidden in the caller code (corosync icmap), in the libqb trie implementation, or whether memory was simply overwritten earlier (so the whole icmap failure is unrelated).
- In a production cluster, this problem is "masked" by power fencing, so the user just notices a reset of the node.

For now there is not too much I can do with this bug. I would recommend running whatever tests you were running, and if the problem appears again, please try to collect a backtrace (gcore PID_OF_COROSYNC; gdb corosync core.PID_OF_COROSYNC and in the gdb cli, thread apply all bt; all with debug information installed) and contact me.

Complete BT:

#0  0x00007fd470f1743d in write () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007fd4711317bd in _handle_real_signal_ (signal_num=<optimized out>, si=<optimized out>, context=<optimized out>) at loop_poll.c:475
#2  <signal handler called>
#3  0x00007fd470f1743d in write () at ../sysdeps/unix/syscall-template.S:81
#4  0x00007fd4711317bd in _handle_real_signal_ (signal_num=<optimized out>, si=<optimized out>, context=<optimized out>) at loop_poll.c:475
#5  <signal handler called>
#6  0x00007fd47113edb8 in trie_node_next (node=0x7fd47217e7c0, root=0x7fd471cf7980, all=<optimized out>) at trie.c:116
#7  0x00007fd47113eec1 in trie_iter_next (i=0x7fd471d79900, value=0x7ffc3a165680) at trie.c:757
#8  0x00007fd471a019b4 in icmap_iter_next (iter=<optimized out>, value_len=value_len@entry=0x7ffc3a1656d0, type=type@entry=0x7ffc3a1656b0) at icmap.c:1108
#9  0x00007fd4719f0c4d in message_handler_req_lib_cmap_iter_next (conn=0x7fd4721818a0, message=0x7fd4665ee820) at cmap.c:611
#10 0x00007fd471a026ca in cs_ipcs_msg_process (c=0x7fd4721818a0, data=<optimized out>, size=<optimized out>) at ipc_glue.c:647
#11 0x00007fd471134c61 in _process_request_ (ms_timeout=10, c=0x7fd4721818a0) at ipcs.c:700
#12 qb_ipcs_dispatch_connection_request (fd=<optimized out>, revents=<optimized out>, data=0x7fd4721818a0) at ipcs.c:802
#13 0x00007fd47113183f in _poll_dispatch_and_take_back_ (item=0x7fd472181e70, p=<optimized out>) at loop_poll.c:109
#14 0x00007fd4711313d0 in qb_loop_run_level (level=0x7fd471d039b0) at loop.c:43
#15 qb_loop_run (lp=<optimized out>) at loop.c:210
#16 0x00007fd4719e67d0 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at main.c:1405
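To make the failing path more concrete, here is a hedged client-side sketch of the kind of call that drives it: each cmap_iter_next() issued by a libcmap client becomes an IPC request that corosync handles via message_handler_req_lib_cmap_iter_next -> icmap_iter_next -> trie_iter_next (frames #9 down to #6 above). The prefix and output here are illustrative only; build with -lcmap.

/* Sketch of a libcmap client iterating keys, which exercises the server-side
 * icmap/trie iteration path shown in the backtrace. Illustrative only. */
#include <stdio.h>
#include <corosync/corotypes.h>
#include <corosync/cmap.h>

int main(void)
{
    cmap_handle_t handle;
    cmap_iter_handle_t iter;
    char key_name[CMAP_KEYNAME_MAXLEN + 1];
    size_t value_len;
    cmap_value_types_t type;

    if (cmap_initialize(&handle) != CS_OK)
        return 1;

    /* Iterate keys under an example prefix; each cmap_iter_next() is one
     * round trip into corosync's cmap/icmap iteration code. */
    if (cmap_iter_init(handle, "totem.", &iter) == CS_OK) {
        while (cmap_iter_next(handle, iter, key_name, &value_len, &type) == CS_OK)
            printf("%s (len=%zu, type=%d)\n", key_name, value_len, (int)type);
        cmap_iter_finalize(handle, iter);
    }

    cmap_finalize(handle);
    return 0;
}

Incidentally, assuming _handle_real_signal_ writes the signal number to the pipe, the "\v\0\0\0" payload in the reporter's strace (the value 11, i.e. SIGSEGV on Linux) would tie the busy write loop directly to the segfault described above.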
Because we don't have a reproducer and the logs are not very helpful, I've decided to close this BZ for now. If the bug appears again, please reopen it.