1141456 – Fedora 19 & 20 (64bit HOST): Idle Fedora LXC guests causes immediate HIGH CPU temps. / Fan Speeds. Why?

Status:	CLOSED EOL
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	lxc
Version:	20
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Assignee:	Thomas Moschny
QA Contact:	Fedora Extras Quality Assurance
Duplicates (1):	1195945

Reported:	2014-09-13 18:58 UTC by nmvega
Modified:	2015-06-30 01:09 UTC
CC List:	6 users
Last Closed:	2015-06-30 01:09:01 UTC
Type:	Bug

Description nmvega 2014-09-13 18:58:46 UTC

Hello:

This bug was originally opened for Fedora 19 and closed without resolution (when an entire batch of bugs was closed with the request that 'if the problem persisted, please re-open it').

This problem is *100%* reproducible no matter what the computer type (from high-end laptops to super high-end servers), and across all YUM updated versions of Fedora 19 and 20, including all kernels that have been released for them.

This issue is *urgent* because it has, for a long time now, prevented Fedora from being used as a data center HOST for guest LXC containers.

In fact, many months ago we abandoned our intention to use more efficient LXCs and reverted back to heavier KVMs because of this issue, hoping that by Fedora-20 (and certainly by now) this would be resolved; but it isn't.

============================================================
Again the issue (as originally stated):
============================================================
(1) Starting a basic LXC container, which is not configured to do anything at all, *immediately* (and without delay) raises the temperature of one of the cores *substantially*.

(2) Starting a second LXC container (also not configured to do anything) does the same as (1), but on a different core (i.e. the one that the second container uses).

(3) and so on ...
============================================================



===========================================================
Demonstration Output:
===========================================================
dstorm$ # No LXCs running.
dstorm$ sensors -f  (All is normal).
coretemp-isa-0000
Adapter: ISA adapter
Physical id 0:  +77.0°F  (high = +176.0°F, crit = +194.0°F)
Core 0:         +71.6°F  (high = +176.0°F, crit = +194.0°F)
Core 1:         +73.4°F  (high = +176.0°F, crit = +194.0°F)
Core 2:         +75.2°F  (high = +176.0°F, crit = +194.0°F)
Core 3:         +69.8°F  (high = +176.0°F, crit = +194.0°F)
Core 4:         +73.4°F  (high = +176.0°F, crit = +194.0°F)
Core 5:         +73.4°F  (high = +176.0°F, crit = +194.0°F)

dstorm$ sensors -f (All is normal).
coretemp-isa-0000
Adapter: ISA adapter
Physical id 0:  +80.6°F  (high = +176.0°F, crit = +194.0°F)
Core 0:         +73.4°F  (high = +176.0°F, crit = +194.0°F)
Core 1:         +73.4°F  (high = +176.0°F, crit = +194.0°F)
Core 2:         +75.2°F  (high = +176.0°F, crit = +194.0°F)
Core 3:         +66.2°F  (high = +176.0°F, crit = +194.0°F)
Core 4:         +71.6°F  (high = +176.0°F, crit = +194.0°F)
Core 5:         +73.4°F  (high = +176.0°F, crit = +194.0°F)

dstorm$ sudo lxc-start -d -n vps00 (Start a container).
dstorm$ sensors -f (**Immediate 27-degree jump for Core-1**).
coretemp-isa-0000
Adapter: ISA adapter
Physical id 0: +100.4°F  (high = +176.0°F, crit = +194.0°F)  <-- spike
Core 0:         +84.2°F  (high = +176.0°F, crit = +194.0°F)
Core 1:        +100.4°F  (high = +176.0°F, crit = +194.0°F)  <-- spike
Core 2:         +82.4°F  (high = +176.0°F, crit = +194.0°F)
Core 3:         +71.6°F  (high = +176.0°F, crit = +194.0°F)
Core 4:         +75.2°F  (high = +176.0°F, crit = +194.0°F)
Core 5:         +80.6°F  (high = +176.0°F, crit = +194.0°F)

dstorm$ sensors -f            
coretemp-isa-0000
Adapter: ISA adapter
Physical id 0: +100.4°F  (high = +176.0°F, crit = +194.0°F)  <-- spike
Core 0:         +86.0°F  (high = +176.0°F, crit = +194.0°F)
Core 1:        +100.4°F  (high = +176.0°F, crit = +194.0°F)  <-- spike
Core 2:         +84.2°F  (high = +176.0°F, crit = +194.0°F)
Core 3:         +71.6°F  (high = +176.0°F, crit = +194.0°F)
Core 4:         +77.0°F  (high = +176.0°F, crit = +194.0°F)
Core 5:         +80.6°F  (high = +176.0°F, crit = +194.0°F)

dstorm$ sudo lxc-start -d -n vps01 (Start a second container).
dstorm$ sensors -f  (Temperatures are even higher now).
coretemp-isa-0000
Adapter: ISA adapter
Physical id 0: +109.4°F  (high = +176.0°F, crit = +194.0°F) <-- spike
Core 0:         +89.6°F  (high = +176.0°F, crit = +194.0°F)
Core 1:        +111.2°F  (high = +176.0°F, crit = +194.0°F) <-- spike
Core 2:        +107.6°F  (high = +176.0°F, crit = +194.0°F) <-- spike
Core 3:         +75.2°F  (high = +176.0°F, crit = +194.0°F)
Core 4:         +80.6°F  (high = +176.0°F, crit = +194.0°F)
Core 5:         +84.2°F  (high = +176.0°F, crit = +194.0°F)

dstorm$ sensors -f
coretemp-isa-0000
Adapter: ISA adapter
Physical id 0: +111.2°F  (high = +176.0°F, crit = +194.0°F) <-- spike
Core 0:         +91.4°F  (high = +176.0°F, crit = +194.0°F)
Core 1:        +109.4°F  (high = +176.0°F, crit = +194.0°F) <-- spike
Core 2:        +111.2°F  (high = +176.0°F, crit = +194.0°F) <-- spike
Core 3:         +75.2°F  (high = +176.0°F, crit = +194.0°F)
Core 4:         +78.8°F  (high = +176.0°F, crit = +194.0°F)
Core 5:         +84.2°F  (high = +176.0°F, crit = +194.0°F)
=====================

At this point the fans are noticeably faster, and the temperature LED read-out on the "Digital Storm" computer reads ~410 degrees, where it normally reads ~320.

Here is what each LXC container is doing (not much); and btw. they are also running Fedora-20 with the same kernel:


dstorm$ lxc-ps --lxc
CONTAINER   PID TTY          TIME CMD
vps00      9616 ?        00:00:00 systemd
vps00      9646 ?        00:35:36 systemd-journal
vps00      9654 ?        00:00:00 systemd-udevd
vps00      9976 ?        00:00:00 firewalld
vps00      9979 ?        00:00:00 rsyslogd
vps00      9983 ?        00:00:00 dbus-daemon
vps00      9986 ?        00:00:00 systemd-logind
vps00      9993 pts/4    00:00:00 agetty
vps00      9995 pts/2    00:00:00 agetty
vps00      9998 pts/5    00:00:00 agetty
vps00      9999 pts/3    00:00:00 agetty
vps00     10006 pts/6    00:00:00 agetty
vps00     10012 ?        00:00:00 sshd

vps01     10754 ?        00:00:00 systemd
vps01     10784 ?        00:35:05 systemd-journal
vps01     10789 ?        00:00:00 systemd-udevd
vps01     11204 ?        00:00:00 firewalld
vps01     11206 ?        00:00:00 rsyslogd
vps01     11207 ?        00:00:00 dbus-daemon
vps01     11211 ?        00:00:00 systemd-logind
vps01     11232 pts/10   00:00:00 agetty
vps01     11233 pts/8    00:00:00 agetty
vps01     11234 pts/11   00:00:00 agetty
vps01     11235 pts/9    00:00:00 agetty
vps01     11236 pts/12   00:00:00 agetty
vps01     11264 ?        00:00:00 sshd
vps00     11908 ?        00:00:00 systemd
vps00     11910 ?        00:00:00 (sd-pam)
vps01     11965 ?        00:00:00 systemd
vps01     11967 ?        00:00:00 (sd-pam)

Try to launch an LXC and you will see the issue. It's easily reproducible.

Can team Fedora help us with this (please and thank you)? I am happy
to work to provide additional information, although, again, you will be able to reproduce this problem on your computers (laptops even), too.

Again, the impact of this long-running issue is that we are not able to use
Fedora as a HOST for Fedora LXC Container Guests. That has wide implications:
we would have to rebuild (and test, and operate) servers on a different HOST O/S
distribution in order to be able to safely use LXCs (not trivial). From all
indications, Fedora 19 & 20 (with any kernel) will burn out our systems, hence the urgency. Thank you again!
 

Thank you.

Comment 1 Thomas Moschny 2014-09-14 07:29:22 UTC

Just for completeness, can you please specify the LXC package versions you are using?

And I might be missing something, but is 110°F (43°C) really to be considered a high CPU temperature?

Admittedly that's a substantial increase compared to 70°F (21°C), yes. However, cores in my workstation here (Core2 Quad Q9550, that's an old, but low TDP CPU) are never colder than 34°C (93°F).
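As a quick check of the Fahrenheit and Celsius figures quoted in this thread, here is a minimal Python sketch. The helper names are illustrative and not taken from any tool mentioned here. Note that a temperature difference converts without the 32-degree offset.

```python
def f_to_c(deg_f: float) -> float:
    """Convert an absolute temperature from Fahrenheit to Celsius."""
    return (deg_f - 32.0) * 5.0 / 9.0


def c_to_f(deg_c: float) -> float:
    """Convert an absolute temperature from Celsius to Fahrenheit."""
    return deg_c * 9.0 / 5.0 + 32.0


def f_delta_to_c(delta_f: float) -> float:
    """Convert a temperature difference; the 32-degree offset cancels out."""
    return delta_f * 5.0 / 9.0


print(round(f_to_c(110.0)))                # 43
print(round(f_to_c(70.0)))                 # 21
print(round(c_to_f(34.0)))                 # 93
print(round(f_delta_to_c(100.4 - 73.4)))   # 15 (the reported 27-degree F jump on Core 1)
```

So the reported jump on Core 1 is roughly 15°C, from about 23°C to about 38°C.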

Anyway, I can contact upstream about this, although I am not sure they can do anything about it, as LXC is the userspace part. You filed bug 1050106 against the kernel component, which is in principle the right thing to do...

Comment 2 nmvega 2014-09-14 16:42:02 UTC

Hello Thomas:

(1) Here are the LXC RPMs (latest of them), although as mentioned the issue has persisted across many iterations of RPMS (lxc, kernels, etc).

user@linux$ rpm -qa | egrep 'lxc'
lxc-doc-1.0.5-5.fc20.noarch
lxc-templates-1.0.5-5.fc20.x86_64
libvirt-daemon-driver-lxc-1.1.3.5-2.fc20.x86_64
python3-lxc-1.0.5-5.fc20.x86_64
clxclient-3.6.1-9.fc20.x86_64
lxc-1.0.5-5.fc20.x86_64
lxc-extra-1.0.5-5.fc20.x86_64
lxc-devel-1.0.5-5.fc20.x86_64
lua-lxc-1.0.5-5.fc20.x86_64
lxc-libs-1.0.5-5.fc20.x86_64

(2) No one paid attention to the former bug, though I pleaded. Also, on this bug, the dropdown did not let me select 'kernel'. I think a collaborative effort (LXC and kernel) is optimal.

(3) On every computer (a wide variety of them) where we tried to run even just one or two LXCs, the temperature jumps drastically (as you see), for an LXC or two that are essentially idle. Correspondingly, all of those computers' fans react as if there is a problem because, in each case, they speed up and get noticeably loud.

Take this well equipped server:
- 64GB RAM @ 2600MHz
- i7 x 3GHz x 12 Cores
- 2TB SSD (RAID-0 H/W stripe of 1TB pair)
- No monitor
Host Fedora O/S is optimized to run only what is necessary. Everything
is disabled (no 'sendmail', no 'cron'). It's very tight.

Running one idle LXC causes a spike in temperature; run two, and the fans start
increasing. Yet nothing is really happening.

On the other hand, on that very same machine I can run 5 *fully virtualized* CentOS6 KVM guests (on Fedora-20 Host), each with 11GB RAM assigned to them; and on them run distributed Apache Hadoop/HDFS, Apache Spark and Apache Kafka to perform Real-Time distributed Machine Learning -- so those KVMs are truly doing a lot! Yet for the amount of real-time work that that KVM-based cluster is doing, (again, full virtualization now) there is very little increase in temperature, and zero increase in fan speed.

Also note that there is an 'overall' temperature LED on the front of that computer. It reads ~320 when Fedora Host is booted up and idle. I can launch those 5 KVMs, and it goes up to ~340; but launching 1 or 2 *idle* LXCs causes a jump to above ~410 immediately. Why? So it's not just 'sensors -f' output. There are LED and FAN increase indications, too.

So something is definitely going on with LXC & Kernel, and because there is, we're assuming the possibility that the temperature jump can be even higher than shown. We have to... -- to protect the systems.

I think one of the underlying components used in creating the virtual container is causing a problem (kernel iptables, chroot, resource management, etc.) or maybe a kernel mutex is spinning, or something. But this behavior is definitely problematic.

Again, we really want to use LXC because we can get better utilization from every server that way. But we are stuck.

Thank you again!

Comment 3 Michael H. Warfield 2014-09-14 18:23:50 UTC

This is a known problem with systemd-journald in containers.  If you look at the CPU time of those container processes, you will notice that systemd-journald is in a runaway condition and consuming 100% CPU.  If you were to run "top" you would see that your load average has shot through the roof and that multiple systemd-journald processes are camped out on the CPUs consuming the processors.
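One way to spot the runaway process in a listing like the `lxc-ps --lxc` output from the description is to compare the cumulative TIME column. The following is a minimal Python sketch under stated assumptions: the column layout is CONTAINER, PID, TTY, TIME, CMD as shown in this report, TIME is plain HH:MM:SS, and the function names and the 60-second threshold are illustrative choices, not part of any LXC tool.

```python
# Sample rows copied from the listing in the bug description.
SAMPLE = """\
CONTAINER   PID TTY          TIME CMD
vps00      9616 ?        00:00:00 systemd
vps00      9646 ?        00:35:36 systemd-journal
vps00      9976 ?        00:00:00 firewalld
vps01     10784 ?        00:35:05 systemd-journal
"""


def cpu_seconds(hms: str) -> int:
    """Turn an HH:MM:SS string into seconds (day prefixes are not handled)."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def busy_processes(listing: str, min_seconds: int = 60):
    """Return (container, pid, cmd, seconds) rows at or above the threshold."""
    rows = []
    for line in listing.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 4)
        if len(fields) != 5:
            continue  # ignore blank or malformed lines
        container, pid, _tty, time_str, cmd = fields
        secs = cpu_seconds(time_str)
        if secs >= min_seconds:
            rows.append((container, int(pid), cmd, secs))
    return rows


for row in busy_processes(SAMPLE):
    print(row)
```

On the sample rows, only the two systemd-journal processes are reported, each with about 35 minutes of CPU time, while every other process shows 00:00:00.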

The problem relates to having /dev/kmsg symlinked to /dev/console in the containers. That is common in a lot of cases with sysvinit or upstart, but it causes problems with systemd-journald: journald reads from kmsg and writes to console, creating a messaging loop that it then fails to detect.
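To check whether a given container has this symlink, one could test whether the kmsg entry in its /dev directory is a symlink whose target is named console. This is only a sketch: the function name is made up for illustration, and the caller has to supply the container's /dev directory, since its location depends on the LXC setup. The demo at the end runs against a temporary directory, not a real container.

```python
import os
import tempfile


def kmsg_points_to_console(dev_dir: str) -> bool:
    """True if <dev_dir>/kmsg is a symlink whose target is named 'console'."""
    kmsg = os.path.join(dev_dir, "kmsg")
    if not os.path.islink(kmsg):
        return False
    # The target may be relative ("console") or absolute ("/dev/console").
    return os.path.basename(os.readlink(kmsg)) == "console"


# Demo on a throwaway directory that mimics the problematic layout.
with tempfile.TemporaryDirectory() as demo_dev:
    os.symlink("console", os.path.join(demo_dev, "kmsg"))
    affected = kmsg_points_to_console(demo_dev)

print(affected)
```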

This problem is going to be addressed in some patches to be released shortly for templates supporting systemd based distros and also attempting to intercept the affected containers at startup with default settings.  Existing containers running systemd-journald will need to be updated with a couple of minor changes...

To address this problem in an affected container...

1) Shut down the container.

2) Edit the container config file and add the following line...

lxc.kmsg = 0

3) Remove the existing symlink for the container /dev.  Because, for systemd, this is a persistent subdirectory under the /dev/.lxc in the host devtmpfs area, it should be removed like this:

rm -f /var/lib/lxc/{container-name}/rootfs.dev/kmsg

4) Restart the container.
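
Steps 2 and 3 above could be scripted roughly as follows. This is a sketch and not a tested tool: the function name is hypothetical, it assumes the container has already been shut down, and the caller passes the config file and the kmsg symlink path explicitly instead of the script guessing them. If an lxc.kmsg line is already present it is left untouched. The demo uses a temporary directory with a made-up one-line config.

```python
import os
import tempfile


def apply_kmsg_workaround(config_path: str, kmsg_path: str) -> None:
    """Append 'lxc.kmsg = 0' to the config if the key is absent, then remove the symlink."""
    with open(config_path, encoding="utf-8") as cfg:
        text = cfg.read()
    already_set = any(
        line.split("=", 1)[0].strip() == "lxc.kmsg"
        for line in text.splitlines()
        if not line.lstrip().startswith("#")
    )
    if not already_set:
        with open(config_path, "a", encoding="utf-8") as cfg:
            if text and not text.endswith("\n"):
                cfg.write("\n")
            cfg.write("lxc.kmsg = 0\n")
    # Only remove a symlink, never a real device node or regular file.
    if os.path.islink(kmsg_path):
        os.remove(kmsg_path)


# Demo against throwaway files (not a real container).
with tempfile.TemporaryDirectory() as demo:
    config = os.path.join(demo, "config")
    kmsg = os.path.join(demo, "kmsg")
    with open(config, "w", encoding="utf-8") as fh:
        fh.write("lxc.utsname = vps00\n")
    os.symlink("console", kmsg)
    apply_kmsg_workaround(config, kmsg)
    apply_kmsg_workaround(config, kmsg)  # a second run changes nothing
    with open(config, encoding="utf-8") as fh:
        result = fh.read()
    link_gone = not os.path.lexists(kmsg)

print(result)
```

Running it twice is harmless, which matters when the same script is applied to several containers in a loop.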

Comment 4 nmvega 2014-09-14 20:24:17 UTC

Hi Michael:

Thank you for taking the time to articulate the issue as you did (appreciated!).

And there is good news, too. I made the adjustments you prescribed above to each of the 5 LXC containers, started them, and everything looks as expected, including the front-display LED temperature reading (only ~330).

root@linux# lxc-ls --active
vps00 vps01 vps02 vps03 vps04

root@linux# sensors -f
coretemp-isa-0000
Adapter: ISA adapter
Physical id 0:  +77.0°F  (high = +176.0°F, crit = +194.0°F)
Core 0:         +73.4°F  (high = +176.0°F, crit = +194.0°F)
Core 1:         +77.0°F  (high = +176.0°F, crit = +194.0°F)
Core 2:         +77.0°F  (high = +176.0°F, crit = +194.0°F)
Core 3:         +66.2°F  (high = +176.0°F, crit = +194.0°F)
Core 4:         +71.6°F  (high = +176.0°F, crit = +194.0°F)
Core 5:         +75.2°F  (high = +176.0°F, crit = +194.0°F)

This is finally SOLVED. \o/

Thank you very much Michael & Thomas.

Comment 5 Thomas Moschny 2015-03-03 20:24:34 UTC

*** Bug 1195945 has been marked as a duplicate of this bug. ***

Comment 6 Thomas Moschny 2015-03-03 20:25:40 UTC

Fixed in commit e8a16654, will be in 1.0.8.

Comment 7 Fedora End Of Life 2015-05-29 12:52:16 UTC

This message is a reminder that Fedora 20 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 20. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora  'version'
of '20'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 20 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora 
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 8 Fedora End Of Life 2015-06-30 01:09:01 UTC

Fedora 20 changed to end-of-life (EOL) status on 2015-06-23. Fedora 20 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.