Description of problem:
After a customer upgraded RHEV-H to a 6.5 version, the main vdsm process is using a lot of CPU time on an otherwise idle system. Two hosts were upgraded and both experienced this. One was downgraded to 6.4 and the problem no longer occurs. The other hosts in the cluster are still on 6.4 and do not exhibit this problem.

Version-Release number of selected component (if applicable):
RHEV-H 6.5 20140213 and 20140217.

How reproducible:
Only reproducible so far in this customer's environment.

Steps to Reproduce:
1.
2.
3.

Actual results:
On RHEV-H 6.5 with no active VMs, the main vdsm process is using a high amount of CPU time.

Expected results:
As seen on our lab systems, when the system is otherwise idle the main vdsm process uses something in the range of 6% CPU.

Additional info:
Details will be provided in subsequent updates.
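(For reference only, not from the original report: one way to quantify the idle CPU usage described above. The ps pattern used to find the vdsm main process is an assumption and may need adjusting for the installed version.)

# locate the main vdsm process (pattern is a guess) and sample its CPU use in 5-second intervals
VDSM_PID=$(ps -eo pid,args | awk '/vdsm\/vdsm/ && !/awk/ {print $1; exit}')
pidstat -u -p "$VDSM_PID" 5 6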
Created attachment 876092 [details] vgs output on a hypervisor with high CPU
Created attachment 876093 [details] vgs output on 6.4 hypervisor - normal CPU
Marina,
Can you please also attach the actual vdsm logs?
Thanks
Hello
Based on the straces taken, we see that most of the time in vdsm is spent in futex waits; see below. Would an ltrace be more useful for profiling here?

[root@do-rhevh3 vdsm]# strace -p 25815 -f -v
[pid 25876] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
[pid 25861] <... futex resumed> ) = 0
[pid 13289] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4176] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4174] <... futex resumed> ) = 0
[pid  4166] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4163] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4157] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4153] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4150] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4147] futex(0x1ef9420, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid  4138] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4137] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4132] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4129] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4122] futex(0x1ef9420, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid  4120] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4114] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4073] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4068] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4067] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4064] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4063] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4061] futex(0x1ef9420, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid  4058] futex(0x1ef9420, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid  4056] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 25876] futex(0x1ef9420, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 25861] futex(0x1ef9420, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 13289] <... futex resumed> ) = 0
[pid  4176] <... futex resumed> ) = 0
[pid  4174] futex(0x1ef9420, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4166] <... futex resumed> ) = 0
[pid  4163] <... futex resumed> ) = 0
[pid  4157] <... futex resumed> ) = 0
[pid  4153] <... futex resumed> ) = 0
[pid  4150] <... futex resumed> ) = 0
[pid  4147] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
[pid  4138] <... futex resumed> ) = 0
[pid  4137] <... futex resumed> ) = 0
[pid  4132] <... futex resumed> ) = 0
[pid  4129] <... futex resumed> ) = 0
[pid  4122] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
[pid  4120] <... futex resumed> ) = 0
[pid  4114] <... futex resumed> ) = 0
[pid  4073] <... futex resumed> ) = 0
[pid  4068] <... futex resumed> ) = 0
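(Side note, not from the original comment: instead of a raw trace, strace can aggregate time and call counts per syscall, which may show more directly how much wall time goes to futex. The PID below is the one from the trace above and is otherwise a placeholder.)

# -c prints a per-syscall summary on exit, -f follows the child threads
strace -c -f -p 25815
# let it run for ~30 seconds, then stop with Ctrl-C to print the summary table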
Hello
I called the customer in Dubai early this morning. Theo (the Red Hat consultant) has provisioned and installed two new RHEV-H hosts using the 3PAR storage, and the problem is no longer apparent: the high-CPU vdsm issue on RHEV-H 6.5 does not occur on these new hosts.

He has one host with the old configuration left, on which he was attempting to install RHEL so he could stage a "fat" RHEV setup for further debugging, but he has run into some issues there. We may never get root cause here, and we have not been able to reproduce the problem in-house. If he does get RHEL and RHEV-H installed, I have asked him to capture a vmcore and to try the older 6.4 kernel, but that may not happen.

So for now I would lower the severity of this issue and accept that there was some anomaly in the original installation that we may never get to the root cause of.

Thank You
Laurence
This seems to be a duplicate of bug 1090664. Although this bug was opened first, we still have insufficient data here, while the other bug has profiling results. I suggest closing this one as a duplicate and continuing to track the issue in bug 1090664.