Description of problem:

The environment is a pre-release of RH-OSP10 over RHEL7.3, with corosync-2.4.0-4.el7.x86_64. The system has been deployed in a virtual environment with one controller and one compute node. After a few days of usage, corosync on the controller is now using 100% of the CPU.

$ strace -p <corosync_pid> -y

shows a loop of:

write(7<pipe:[82267]>, "\v\0\0\0", 4) = -1 EAGAIN (Resource temporarily unavailable)

There are no apparently relevant logs around that time.
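For context (this is not corosync/libqb source, just a minimal self-contained illustration of the syscall pattern above): a non-blocking write to a full pipe fails immediately with EAGAIN, so a caller that retries without waiting for the pipe to drain spins at 100% CPU, which is exactly what the strace loop looks like. A sketch in C:

/* Illustration only: spin on EAGAIN when writing to a full non-blocking pipe. */
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

int main(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return 1;

    /* Make the write end non-blocking (libqb uses a non-blocking wakeup pipe). */
    fcntl(fds[1], F_SETFL, fcntl(fds[1], F_GETFL) | O_NONBLOCK);

    /* Fill the pipe completely so further writes cannot succeed. */
    char c = 0;
    while (write(fds[1], &c, 1) > 0)
        ;

    /* Busy retry loop: every write fails with EAGAIN because nothing ever
     * drains the read end, so the process makes no progress and burns CPU. */
    uint32_t msg = 11; /* the "\v\0\0\0" payload from strace is the value 11 */
    for (;;) {
        ssize_t rc = write(fds[1], &msg, sizeof(msg));
        if (rc == (ssize_t)sizeof(msg))
            break;
        if (rc < 0 && errno != EAGAIN)
            break;
        /* no poll()/sleep here -> 100% CPU */
    }
    return 0;
}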
Additional note: the machine shows enough memory available, though the status may have been different when the issue happened:

# free -m
              total        used        free      shared  buff/cache   available
Mem:          11854        9305         309          42        2239        1958
Swap:             0           0           0
The strace log looks interesting. Any chance I can get access to that machine (ideally with debug info installed) when corosync gets into the 100% CPU usage loop?
Also, to reduce the problem area as quickly as possible: does the problem happen regularly (i.e. we have a reproducer), or is this the first time it has happened? If it is happening regularly, would it be possible to install (I know it's ugly, but it can help) the RHEL 7.2 libqb (latest update, so 0.17.1-2.1)?
My findings:

- The program is stuck in a libqb write to a pipe while handling a signal. This is definitely something to "improve", but it is not the root cause of the problem.
- The root cause is a call to trie_node_next which results in a segfault. It's impossible to say whether the problem was hidden in the caller code (corosync icmap), in the libqb trie implementation, or whether memory was simply overwritten earlier (so the whole icmap failure is unrelated).
- In a production cluster, this problem is "masked" by power fencing, so the user just notices a reset of the node.

For now there is not too much I can do with this bug. I would recommend running whatever tests you were running, and if the problem appears again, please try to collect a backtrace (gcore PID_OF_COROSYNC; gdb corosync core.PID_OF_COROSYNC and in the gdb cli, thread apply all bt; all with debug information installed) and contact me.

Complete BT:

#0  0x00007fd470f1743d in write () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007fd4711317bd in _handle_real_signal_ (signal_num=<optimized out>, si=<optimized out>, context=<optimized out>) at loop_poll.c:475
#2  <signal handler called>
#3  0x00007fd470f1743d in write () at ../sysdeps/unix/syscall-template.S:81
#4  0x00007fd4711317bd in _handle_real_signal_ (signal_num=<optimized out>, si=<optimized out>, context=<optimized out>) at loop_poll.c:475
#5  <signal handler called>
#6  0x00007fd47113edb8 in trie_node_next (node=0x7fd47217e7c0, root=0x7fd471cf7980, all=<optimized out>) at trie.c:116
#7  0x00007fd47113eec1 in trie_iter_next (i=0x7fd471d79900, value=0x7ffc3a165680) at trie.c:757
#8  0x00007fd471a019b4 in icmap_iter_next (iter=<optimized out>, value_len=value_len@entry=0x7ffc3a1656d0, type=type@entry=0x7ffc3a1656b0) at icmap.c:1108
#9  0x00007fd4719f0c4d in message_handler_req_lib_cmap_iter_next (conn=0x7fd4721818a0, message=0x7fd4665ee820) at cmap.c:611
#10 0x00007fd471a026ca in cs_ipcs_msg_process (c=0x7fd4721818a0, data=<optimized out>, size=<optimized out>) at ipc_glue.c:647
#11 0x00007fd471134c61 in _process_request_ (ms_timeout=10, c=0x7fd4721818a0) at ipcs.c:700
#12 qb_ipcs_dispatch_connection_request (fd=<optimized out>, revents=<optimized out>, data=0x7fd4721818a0) at ipcs.c:802
#13 0x00007fd47113183f in _poll_dispatch_and_take_back_ (item=0x7fd472181e70, p=<optimized out>) at loop_poll.c:109
#14 0x00007fd4711313d0 in qb_loop_run_level (level=0x7fd471d039b0) at loop.c:43
#15 qb_loop_run (lp=<optimized out>) at loop.c:210
#16 0x00007fd4719e67d0 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at main.c:1405
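To make the failing path more concrete, here is a hedged client-side sketch of the kind of call that drives it: each cmap_iter_next() issued by a libcmap client becomes an IPC request that corosync handles via message_handler_req_lib_cmap_iter_next -> icmap_iter_next -> trie_iter_next (frames #9 down to #6 above). The prefix and output here are illustrative only; build with -lcmap.

/* Sketch of a libcmap client iterating keys, which exercises the server-side
 * icmap/trie iteration path shown in the backtrace. Illustrative only. */
#include <stdio.h>
#include <corosync/corotypes.h>
#include <corosync/cmap.h>

int main(void)
{
    cmap_handle_t handle;
    cmap_iter_handle_t iter;
    char key_name[CMAP_KEYNAME_MAXLEN + 1];
    size_t value_len;
    cmap_value_types_t type;

    if (cmap_initialize(&handle) != CS_OK)
        return 1;

    /* Iterate keys under an example prefix; each cmap_iter_next() is one
     * round trip into corosync's cmap/icmap iteration code. */
    if (cmap_iter_init(handle, "totem.", &iter) == CS_OK) {
        while (cmap_iter_next(handle, iter, key_name, &value_len, &type) == CS_OK)
            printf("%s (len=%zu, type=%d)\n", key_name, value_len, (int)type);
        cmap_iter_finalize(handle, iter);
    }

    cmap_finalize(handle);
    return 0;
}

Incidentally, assuming _handle_real_signal_ writes the signal number to the pipe, the "\v\0\0\0" payload in the reporter's strace (the value 11, i.e. SIGSEGV on Linux) would tie the busy write loop directly to the segfault described above.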
Because we don't have a reproducer and the logs are not very helpful, I've decided to close this BZ for now. If the bug appears again, please reopen it.