Bug 593339

Summary: xenconsoled dies randomly and makes it impossible to create new guests
Product: Red Hat Enterprise Linux 5 Reporter: Pasi Karkkainen <pasik>
Component: xenAssignee: Michal Novotny <minovotn>
Status: CLOSED ERRATA QA Contact: Virtualization Bugs <virt-bugs>
Severity: medium Docs Contact:
Priority: low    
Version: 5.5CC: alexander, areis, dallan, eblake, herrold, jentrena, laine, leiwang, lsc, minovotn, mjenner, mrezanin, pbonzini, peak, rfv781, slords, virt-maint, wolfram, w.schlich, xen-maint, yuzhang
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: xen-3.0.3-126.el5 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-07-21 09:16:21 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 590073, 606919    
Bug Blocks: 514500    
Attachments:
Description Flags
/v/l/messages when the OOM is invokde through a memory leaking libvirtd
none
script: /root/bin/load-snap.sh
none
tar ball of last couple of days showing steadily increasing mem usage the daemon process
none
unit 5 eating all ram in the libbvirtd -- drop is after a manual restart
none
xenconsoled: Fix rate-limit calculation overflow leading to console freezeTha' none

Description Pasi Karkkainen 2010-05-18 14:46:06 UTC
Description of problem:
After a while (a month or two of uptime) xenconsoled in dom0 dies. The process just disappears, or another option, it's there, but it's stuck doing nothing. I've seen both variations happening over time.

This has happened with every RHEL 5.x version.

Version-Release number of selected component (if applicable):
xen in rhel5.x

How reproducible:
Randomly.

Steps to Reproduce:
1. run RHEL5 Xen dom0 with some PV guests.
2. wait for a month or two
3. Notice that xenconsoled has died or crashed.
  
Actual results:
xenconsoled dies on it's own. 

Expected results:
xenconsoled works normally.

Additional info:

Xen PV guests don't have vfb enabled. They use the text serial console.

When xenconsoled is dead the following things happen:

- Cannot start new PV guests. The guest boot process hangs at some point and doesn't succeed.

- Existing guests, which have a lot of messages printed on their console, get stuck and go into some kind of pause mode. all the network connections die. When xenconsoled is restarted and the console is accessed again the guest starts to work normally again. The guest doesn't crash though.

Comment 1 Miroslav Rezanina 2010-05-19 05:55:21 UTC
Can you provide any logs from corrupted state?

Comment 2 Pasi Karkkainen 2010-05-19 07:14:00 UTC
Most often xenconsoled just crashes and the process disappears - nothing in the logs as far as I know.

Should there be something in some log? 

Sometimes (not often) it has gotten stuck.. then when I strace on it it doesn't show any activity really. Iirc it's stuck on select()..

Comment 3 Michal Novotny 2010-06-01 15:52:00 UTC
(In reply to comment #2)
> Most often xenconsoled just crashes and the process disappears - nothing in the
> logs as far as I know.
> 
> Should there be something in some log? 
> 
> Sometimes (not often) it has gotten stuck.. then when I strace on it it doesn't
> show any activity really. Iirc it's stuck on select()..    

Well, this issue seems really strange and hard to reproduce since the machine (dom0) should be running for a really long time.You say that it's being stuck on select() when you strace to it? I'm not sure what could cause it but isn't it possible that there was a memory leak and it was using more and more memory (can be checked using top for xenstored) and than host machine memory limit has been reached or something like that?

What configuration did you see this? Was it only one machine you were seeing this one or multiple machines?

There's an option to turn on the debugging of xenstored but it would take too much space if running for that long and I'm not sure it would help at all. Also, according to the xenstore_core.c source code there is a call to open a connection to a syslog facility using the xenstored identification and log it to LOG_DAEMON facility so you may be able to see it in the system log. Could you please have a look to the system log when it happens to find some evidence of xenstored ?

Thanks,
Michal

Comment 4 Pasi Karkkainen 2010-06-09 16:03:33 UTC
I've seen it on multiple machines.. and there are reports about it on xen mailinglists aswell, on rhel5/centos5 xen hosts.

I'll have to investigate it in more detail when it happens the next time.. one host where I see it "often" is 32bit PAE, and it has 512 MB of memory for dom0 (dom0_mem=512M), if that makes a difference.

Comment 5 Luke Crawford 2010-06-11 16:04:07 UTC
first, check dmesg;  make sure the oom-killer didn't get it.  that'd do it.

Next, make sure the userland is upgraded to a reasonable version.   I had this problem in the bad old days of RHEL 5.1 and 5.2, and because of that and other issues, I switched prgmr.com to using the xen.org kernels.  But my old box that  is still stock centos?  I think xenconsoled hasn't crashed since RHEL5.3.

Comment 6 Alexander Lindqvist 2010-06-11 17:19:03 UTC
I have seen it through all the CentOS 5.x releases and usally happens randomly after 2-4 weeks of uptime. xenconsoled just disappears. Domu's keeps running like nothing happened and everything works except attaching a console from dom0 to domU. Therefore I have not seen this as a big problem as it only requires a restart of xenconsoled and it doesn't affect the running domU's. This Xen host has had an uptime of about 200 days before with xenconsoled restarted several times during this period.


Dom0 running minimal install of CentOS 5.5
HW: HP Proliant ML530 G2 running 7 PV guests (all CentOS 5.5), 8 GB RAM. Dom0 set to use 1024 MB. Bonded Intel server nic.

I have also not found anything in the logs about this.

Comment 7 Pasi Karkkainen 2010-06-11 17:24:42 UTC
Alexander: xenconsoled has the responsibility to clear the domU console buffers.. afaik. So when xenconsoled is dead or stuck it doesn't do it - and it WILL create problems for the domUs when the domUs have enough stuff printed on the console - then they get freeze until xenconsoled is started and the console buffer is cleared.

I've seen this happening multiple times.

Comment 8 Alexander Lindqvist 2010-06-11 18:32:31 UTC
Ok. My domU's rarely print anything to the console so that might be why I have not seen this problem yet. As I run "xm console" against them fairly often it might just be that I notice it and restarts xenconsoled before the domU's freeze...

Comment 9 Michal Novotny 2010-06-15 10:14:15 UTC
Well, could you give me more information from that incident when it happens? Also, I was thinking of options you could do but I'm not sure it will work. Nevertheless, giving it a shot could be a good idea:

when you start up Dom0, start xend as usual and before creating any guest get the PID of xenconsoled process, e.g. using `ps aux | grep xenconsoled` (I'll refer to it as to $(PID) for short). Connect to this process using the `gdb -p $(PID)` and press 'c' to continue the process when loaded. Do nothing in this console and it should be running as usual. In the second console or any other console do the steps to reproduce it with the gdb on xenconsoled process running.

When you find out that xenconsoled dies, let's have a look to the console that using having gdb on xenconsoled process running and check for the signal that terminated the xenconsoled. Get the backtrace using the `bt` command inside of gdb and send those results to this bugzilla.

Also, please make sure you have xen-debuginfo package (and most probably some other debuginfos from debuginfo repository of RHEL-5) installed, otherwise it may not be giving us enough information from this incident.

Thanks,
Michal

Comment 11 Michal Novotny 2010-06-25 14:43:14 UTC
Well, I did try to do some analysis using valgrind and I've got some interesting results. The xenconsole daemon may die because of leaking memory in xenstore according to this valgrind output. Patch for memory leaks in xenstore has been created already and it's available in bug 606919 . There are some still reachable statements in the valgrind output but also some possibly lost blocks of memory:

==15103== Searching for pointers to 4 not-freed blocks
==15103== Checked 10,770,576 bytes
==15103==
==15103== 21 bytes in 1 blocks are still reachable in loss record 1 of 4
==15103==    at 0x4A05E1C: malloc (vg_replace_malloc.c:195)
==15103==    by 0x3D3AC79521: strdup (strdup.c:43)
==15103==    by 0x404538: ??? (in /usr/sbin/xenconsoled)
==15103==    by 0x3D3AC1D993: (below main) (libc-start.c:231)
==15103==
==15103== 40 bytes in 1 blocks are still reachable in loss record 2 of 4
==15103==    at 0x4A05E1C: malloc (vg_replace_malloc.c:195)
==15103==    by 0x39F7E0330B: ??? (in /usr/lib64/libxenstore.so.3.0.0)
==15103==    by 0x39F7E036B1: ??? (in /usr/lib64/libxenstore.so.3.0.0)
==15103==    by 0x3D3B80673C: start_thread (pthread_create.c:301)
==15103==    by 0x3D3ACD3D1C: clone (in /lib64/libc-2.5.so)
==15103==
==15103== 280 bytes in 1 blocks are still reachable in loss record 3 of 4
==15103==    at 0x4A05E1C: malloc (vg_replace_malloc.c:195)
==15103==    by 0x39F7E01CE8: ??? (in /usr/lib64/libxenstore.so.3.0.0)
==15103==    by 0x39F7E01E37: xs_daemon_open (in /usr/lib64/libxenstore.so.3.0.0)
==15103==    by 0x404A35: ??? (in /usr/sbin/xenconsoled)
==15103==    by 0x4047F4: ??? (in /usr/sbin/xenconsoled)
==15103==    by 0x3D3AC1D993: (below main) (libc-start.c:231)
==15103==
==15103== 288 bytes in 1 blocks are possibly lost in loss record 4 of 4
==15103==    at 0x4A05140: calloc (vg_replace_malloc.c:418)
==15103==    by 0x3D3A80FF32: _dl_allocate_tls (dl-tls.c:304)
==15103==    by 0x3D3B806EB8: pthread_create@@GLIBC_2.2.5 (allocatestack.c:574)
==15103==    by 0x39F7E02B77: xs_watch (in /usr/lib64/libxenstore.so.3.0.0)
==15103==    by 0x404AD2: ??? (in /usr/sbin/xenconsoled)
==15103==    by 0x4047F4: ??? (in /usr/sbin/xenconsoled)
==15103==    by 0x3D3AC1D993: (below main) (libc-start.c:231)
==15103==
==15103== LEAK SUMMARY:
==15103==    definitely lost: 0 bytes in 0 blocks
==15103==    indirectly lost: 0 bytes in 0 blocks
==15103==      possibly lost: 288 bytes in 1 blocks
==15103==    still reachable: 341 bytes in 3 blocks
==15103==         suppressed: 0 bytes in 0 blocks

Maybe I'm wrong but I think this may be the issue since this is what I've been running for just few minutes and this bug is about running xenstore for weeks or months. If there are memory leaks the memory that's being possibly lost could be getting still bigger resulting into the xenconsoled crash and according to the following output I think this may be related to bug 606919 since this bug was about memory leaking because of cancelled read_thread:

==15103== 288 bytes in 1 blocks are possibly lost in loss record 4 of 4
==15103==    at 0x4A05140: calloc (vg_replace_malloc.c:418)
==15103==    by 0x3D3A80FF32: _dl_allocate_tls (dl-tls.c:304)
==15103==    by 0x3D3B806EB8: pthread_create@@GLIBC_2.2.5 (allocatestack.c:574)
==15103==    by 0x39F7E02B77: xs_watch (in /usr/lib64/libxenstore.so.3.0.0)
==15103==    by 0x404AD2: ??? (in /usr/sbin/xenconsoled)
==15103==    by 0x4047F4: ??? (in /usr/sbin/xenconsoled)
==15103==    by 0x3D3AC1D993: (below main) (libc-start.c:231)

and there's the instance of xs_watch. The memory leaks were present there within the xs_watch functionality.

Nevertheless it's just an idea that it *may* be related to this one and maybe I'm wrong. Of course, the steps in comment #9 could be giving us some more information about what's going on there.

Thanks,
Michal

Comment 12 Alexander Lindqvist 2010-06-25 15:14:38 UTC
Interesting find.
Im currently unable to shutdown the guests on this production server as you requested in comment 9 to check this as the guests really needs to be running right now. Will we see updated packages soon with the patch incorporated ? Otherwise I'll try to get a window where we can restart dom0 to check this for you.

Regards / Alex

Comment 13 Michal Novotny 2010-06-25 15:19:15 UTC
(In reply to comment #12)
> Interesting find.
> Im currently unable to shutdown the guests on this production server as you
> requested in comment 9 to check this as the guests really needs to be running
> right now. Will we see updated packages soon with the patch incorporated ?
> Otherwise I'll try to get a window where we can restart dom0 to check this for
> you.
> 
> Regards / Alex    

Alex,
what about reinstalling the xen packages and trying to restart xend after xenconsoled process in no longer running and then connecting to it using gdb -p $(XENCONSOLED_PID)?

Thanks,
Michal

Comment 14 Alexander Lindqvist 2010-06-25 15:54:43 UTC
That still needs the guests to be shutdown for a brief moment and that is a problem but Ill see what I can do.

Comment 15 Michal Novotny 2010-06-28 08:47:28 UTC
Alex(In reply to comment #14)
> That still needs the guests to be shutdown for a brief moment and that is a
> problem but Ill see what I can do.    

Alex, that shouldn't need the guests to be shutdown AFAIK since you just open a new terminal, type `ps aux | grep xenconsoled` to get the $(XENCONSOLED_PID) and then just connect to gdb using the `gdb -p $(XENCONSOLED_PID)` and also, I forgot to mention this, there you have to press `c` to make the xenconsole daemon continue it's operation. The guests should stay running with no shutdown AFAIK.

Michal

Comment 16 Alexander Lindqvist 2010-07-03 22:59:40 UTC
Just to let you now that Im running gdb now and the guests seems to be running fine.. xenconsoled has not died yet since reboot on this host.

21 days of uptime at the moment.

Will post back as soon as xenconsoled dies. A wild guess is 1-2 weeks from now..

Comment 17 Alexander Lindqvist 2010-07-05 15:35:14 UTC
xenconsoled has now hanged.. unable to get any console output from any domu's via dom0. Domu's are up and running. Our usual solution to this is restart xenconsoled.

gdb window just says "Continuing."

I have not restarted xenconsoled incase you want me to check something in its current state.

Comment 18 Michal Novotny 2010-07-13 13:31:38 UTC
Alex, sorry for the long delay, I overlooked you commented this one already. So there's nothing else than "Continuing" on the gdb window? Oh, that's strange. When you did `ps aux | grep xenconsoled` was it still running but doing nothing? Or did it die? If it died, there should be something else in the gdb window. E.g. something about SIGSEGV or some other signal with prompt to run gdb debugging commands.

Michal

Comment 19 Alexander Lindqvist 2010-07-13 14:17:00 UTC
xenconsoled was still running but doing nothing and I think gdb was stopping it from just silently die as it usually dies and I just start xenconsoled again but this time I had to exit gdb and then xenconsoled disappeared. DomU's was running fine however. 

When I started gdb I got various varnings that debugging might not be possible but choosed to continue. This time after restart of xenconsoled gdb did not give me any warnings so I will run this again and keep you posted when it fails. As I can't xm console in to a guest after xenconsoled dies you have to be fairly quick giving me info on what to run before I have to restart xenconsoled.

Comment 20 Michal Novotny 2010-07-13 15:07:39 UTC
Alex, that way I don't know what may be going on since this is strange. There should be at least some signal (I was thinking that there's SIGSEGV) providing the debugging assistance. If there's nothing like this I don't know what may be going on there. I'm not having any idea by now.

What version of Xend and kernel-xen are you running right now? Maybe reinstalling the xen daemon and to those available at http://people.redhat.com/mrezanin/xen could help. Could you please try to upgrade to this version ? There's no need to shutdown the guests or reboot the machine for upgrading the tools.

Michal

Comment 21 Alexander Lindqvist 2010-07-13 18:18:30 UTC
I was running the Xend and kernel-xen from stock 5.5 install. I have now updated to the following:

xen-3.0.3-113.el5virttest30.i386.rpm
xen-libs-3.0.3-113.el5virttest30.i386.rpm
kernel-xen-2.6.18-194.8.1.el5

Comment 22 Michal Novotny 2010-07-14 09:58:44 UTC
Ok Alex, thanks. Please let me know when having the same issue. This one should be having the fix I was writing about applied.

Michal

Comment 23 Alexander Lindqvist 2010-07-15 08:27:36 UTC
Will do. 

Just FYI. Running xm top on Dom0 now errors out and gives me: 
*** Stack smashing detected. Aborted *** 

And then quits after running for about 15 minutes in a ssh connection.

Comment 24 Michal Novotny 2010-07-16 13:54:06 UTC
Interesting Alex. This is the first time I saw such a message however I was googling for the message for a while now. Did it do the core dump ? Or did it write this from within the gdb debugger? If it was in the debugger (gdb) maybe it could help if you issue the backtrace using "bt" in the gdb window and then if you write it to the new comment. Maybe this could help.

Michal

Comment 25 Michal Novotny 2010-07-16 14:01:21 UTC
(In reply to comment #23)
> Will do. 
> 
> Just FYI. Running xm top on Dom0 now errors out and gives me: 
> *** Stack smashing detected. Aborted *** 
> 
> And then quits after running for about 15 minutes in a ssh connection.    

Ok, sorry. You wrote `xm top` and not `xentop`. I overlooked it this now and I noticed this now. Since xm top is a subcommand of xm command this is the python code and this didn't help much then. On the contrary, xentop is written in C language AFAIK so this could help running gdb on xentop of the affected system if it happens with xentop as well (and not just with xm top) ? Although I honestly don't think this could be relevant to this bugzilla, it *theoretically* may be somehow connected since something strange could be going on with your xen tools.

Michal

Comment 26 Alexander Lindqvist 2010-07-16 14:59:09 UTC
OK. xm top has always worked before I installed xen-3.0.3-113.el5virttest30.i386.rpm and xen-libs-3.0.3-113.el5virttest30.i386.rpm but I will not dig into that issue at the moment. Will report back status on this if something happens. It's still running fine at the moment.

Comment 27 Michal Novotny 2010-07-16 15:42:52 UTC
Great Alex, I don't think this is related to -113 version of the package since I never saw that message at all so I guess this may be either connected with restart of dom0 needed to get rid of those messages but maybe not. Nevertheless if it's doing no harm it's good and you can ignore it now.

Michal

Comment 28 Alexander Lindqvist 2010-07-26 11:30:55 UTC
Xenconsoled died silently during the weekend. Before updating to xen-3.0.3-113.el5virttest30.i386.rpm and xen-libs-3.0.3-113.el5virttest30.i386.rpm it always hanged. But the update did not solve the problem. I did not run gdb this time.

Comment 29 Michal Novotny 2010-08-23 10:48:56 UTC
(In reply to comment #28)
> Xenconsoled died silently during the weekend. Before updating to
> xen-3.0.3-113.el5virttest30.i386.rpm and
> xen-libs-3.0.3-113.el5virttest30.i386.rpm it always hanged. But the update did
> not solve the problem. I did not run gdb this time.

Sorry for long delay with this one but I overlooked your reply last time I was looking at this BZ (if this happens feel free to write a direct e-mail to me), could you please try running under gdb again? We need to know what going on - i.e. what signal did it crash with etc.

Thanks,
Michal

Comment 30 Wolfram Schlich (LVM) 2010-09-08 17:09:29 UTC
We are also experiencing this problem on several of our virtualization hosts.
Looking forward to updates to this bug :)

Comment 31 Alexander Lindqvist 2010-09-16 07:21:01 UTC
Im now running this again with gdb. Will report back next time it silently dies...
The system is also fully updated and is no longer running the xen virttest rpm's.

Comment 32 Michal Novotny 2010-09-22 10:50:10 UTC
Any news Alexander ?

Michal

Comment 33 Alexander Lindqvist 2010-09-22 11:08:56 UTC
No it's still running without problems. Probably needs a few more days

Comment 34 Michal Novotny 2010-09-22 11:13:09 UTC
Ok, let us know when it fails but hopefully it won't. Also, some good steps to reproduce would be appreciated.

Michal

Comment 35 Alexander Lindqvist 2010-09-22 13:34:45 UTC
I has failed one time even after the updates so the problem is still there but I lost that info due to loss of power at nearly the same time :( Im now running gdb in a screen session again... Wolfram Schlich who wrote above maybe could do the same as he is also seeing this and then compare the results ?

Comment 36 Michal Novotny 2010-09-22 13:42:35 UTC
Right Alexander. That could help.

Michal

Comment 37 Shad L. Lords 2010-11-01 17:11:46 UTC
We are seeing this same problem on a cluster of 3 boxes.  Most of the time they boxes in the cluster have almost identical uptime.  However not all of them will loose xenconsoled process at the same time.  When the process does die it seem to happen between 30 and 45 days of uptime.

This set of machine is currently the backup of our primary cluster.  It used to be the primary set of machines and when it was we noticed the loss of console many times over the course of a few years.  We upgraded the primary cluster to 64-bit machines and since doing this we have had 0 (zero) issues with loss of console.  However these backup machines that are 32-bit still exibit the loss of console.  Hopefully this will help in some way.

I've looked everywhere I can thing of to find any evidence that xenconsoled has crashed or terminated and can't find anything.  I'll attempt to run gdb and see if I can produce the output you require.

Comment 38 Pasi Karkkainen 2010-11-01 17:32:20 UTC
Yep, I've seen it on 32bit hosts aswell! 30-45 days also matches what I've seen..

Comment 39 Paolo Bonzini 2010-11-01 17:34:35 UTC
Thanks for running the experiment; if xenconsoled hangs as reported in comment #19, you can try killing it with SIGINT from another terminal.

Comment 40 Alexander Lindqvist 2010-11-01 19:59:23 UTC
Xenconsoled silently died today after 44 days of uptime on this Dom0 (32bit).
Below is the output from gdb.

[root@alpha3 ~]# gdb -p 3385
GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-23.el5_5.2)
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "i386-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Attaching to process 3385
Reading symbols from /usr/sbin/xenconsoled...(no debugging symbols found)...done.
Reading symbols from /usr/lib/libxenctrl.so.3.0...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libxenctrl.so.3.0
Reading symbols from /usr/lib/libxenstore.so.3.0...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libxenstore.so.3.0
Reading symbols from /lib/i686/nosegneg/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/i686/nosegneg/libc.so.6
Reading symbols from /lib/i686/nosegneg/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
[New Thread 0xb7f3db90 (LWP 3386)]
Loaded symbols for /lib/i686/nosegneg/libpthread.so.0
Reading symbols from /lib/ld-linux.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/ld-linux.so.2
Reading symbols from /lib/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/libgcc_s.so.1
0x00d17402 in __kernel_vsyscall ()
(gdb) c
Continuing.
[Thread 0xb7f3db90 (LWP 3386) exited]

Program exited normally.
(gdb) 

Guess this doesn't help much. But the problem is still there.

Comment 41 Michal Novotny 2010-11-02 09:21:56 UTC
(In reply to comment #40)
> Xenconsoled silently died today after 44 days of uptime on this Dom0 (32bit).
> Below is the output from gdb.
> 
> [root@alpha3 ~]# gdb -p 3385
> GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-23.el5_5.2)
> Copyright (C) 2009 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "i386-redhat-linux-gnu".
> For bug reporting instructions, please see:
> <http://www.gnu.org/software/gdb/bugs/>.
> Attaching to process 3385
> Reading symbols from /usr/sbin/xenconsoled...(no debugging symbols
> found)...done.
> Reading symbols from /usr/lib/libxenctrl.so.3.0...(no debugging symbols
> found)...done.
> Loaded symbols for /usr/lib/libxenctrl.so.3.0
> Reading symbols from /usr/lib/libxenstore.so.3.0...(no debugging symbols
> found)...done.
> Loaded symbols for /usr/lib/libxenstore.so.3.0
> Reading symbols from /lib/i686/nosegneg/libc.so.6...(no debugging symbols
> found)...done.
> Loaded symbols for /lib/i686/nosegneg/libc.so.6
> Reading symbols from /lib/i686/nosegneg/libpthread.so.0...(no debugging symbols
> found)...done.
> [Thread debugging using libthread_db enabled]
> [New Thread 0xb7f3db90 (LWP 3386)]
> Loaded symbols for /lib/i686/nosegneg/libpthread.so.0
> Reading symbols from /lib/ld-linux.so.2...(no debugging symbols found)...done.
> Loaded symbols for /lib/ld-linux.so.2
> Reading symbols from /lib/libgcc_s.so.1...(no debugging symbols found)...done.
> Loaded symbols for /lib/libgcc_s.so.1
> 0x00d17402 in __kernel_vsyscall ()
> (gdb) c
> Continuing.
> [Thread 0xb7f3db90 (LWP 3386) exited]
> 
> Program exited normally.
> (gdb) 
> 
> Guess this doesn't help much. But the problem is still there.

That's right this doesn't help much Alex. Did you observe this only on 32-bit dom0 or on 64-bit dom0 with the same behaviour as well? Since this really doesn't help much I'm thinking of using some debugging in the xenconsoled context.

There are some environment variables that you could try to set before running the XenD itself (note that if you restart XenD it won't take effect so you need to put it directly before the first XenD start - also you could disable the XenD service using `chkconfig xend off`, setting up those env. variables and then running using `service xend start`).

You need to setup those variables:

XENCONSOLED_LOG_HYPERVISOR=yes
XENCONSOLED_TIMESTAMP_HYPERVISOR_LOG=yes
XENCONSOLED_LOG_GUESTS=yes
XENCONSOLED_TIMESTAMP_GUEST_LOG=yes
XENCONSOLED_LOG_DIR=/root/xenconsoled-log

to log to the xenconsoled. The guest logging (LOG_GUESTS and TIMESTAMP_GUEST_LOG variables probably won't be necessary so you can skip setting up those). This will save the log files into the /root/xenconsoled-log directory of your installation. Do you have hardware that's not production hardware where you could reproduce this issue, Alex? 44 days is a pretty long time for our testing purposes and unfortunately it doesn't seem to be leaking the descriptor according to valgrind output.

When the problem reproduces having the logging enabled please create a gzipped tarball and attach it to this BZ for investigation.

Thanks,
Michal

Comment 42 Alexander Lindqvist 2010-11-02 10:10:47 UTC
I have two 32bit Dom0 in production now (was three) and I had the problem on all three (the third one used different hw). No 64bit hosts at the moment.

I have added those variables and will try to make a scheduled restart tonight of these production systems.

Comment 43 Michal Novotny 2010-11-02 10:16:32 UTC
Thanks Alex! It's OK with 32-bit systems if you have it set. Also, just to clarify the right procedure it to set the variables even before the first start of XenD (first start of XenD service is starting xenconsoled if not started yet but xend restart does not restart the xenconsoled so you need to set those variables before the first start of XenD service).

When you reproduce the issue please attach the gzipped tarball of the logs for investigation.

Thanks,
Michal

Comment 44 Paolo Bonzini 2010-11-02 10:31:36 UTC
Also please after attaching to xenconsoled with gdb, please put a breakpoint with "b exit".

Comment 45 Michal Novotny 2010-11-02 10:48:06 UTC
(In reply to comment #44)
> Also please after attaching to xenconsoled with gdb, please put a breakpoint
> with "b exit".

Well, the log files can be very big for the guests so please enable just hypervisor logs and not guest logs. You can disable the guest logs because they can get pretty big after 45 days.

Also, the putting the breakpoint to exit ("b exit") in gdb is a good thing like Paolo advices in comment #44 so please run this as well.

Also, if there's something in the gdb output it may not show up correctly so please install the *-debuginfo package for xen component before running gdb. It should print out strings like:
...
Reading symbols from /usr/sbin/xenconsoled...done.
...

Please make sure you'll have the same version of xen package and the debuginfo package to provide the valid symbols.

Thanks,
Michal

Comment 46 Alexander Lindqvist 2010-11-02 20:31:34 UTC
I have now added this to /etc/sysconfig/xend

XENCONSOLED_LOG_HYPERVISOR=yes
XENCONSOLED_TIMESTAMP_HYPERVISOR_LOG=yes
XENCONSOLED_LOG_DIR=/root/xenconsoled-log

After that Dom0 was fully updated and restarted.
xen-debuginfo-3.0.3-105.el5_5.5.i386.rpm was installed before reboot.
gdb running in a screen session.

I wonder if I did this correctly as no log files has been created so far.

Comment 47 Michal Novotny 2010-11-03 10:02:11 UTC
(In reply to comment #46)
> I have now added this to /etc/sysconfig/xend
> 
> XENCONSOLED_LOG_HYPERVISOR=yes
> XENCONSOLED_TIMESTAMP_HYPERVISOR_LOG=yes
> XENCONSOLED_LOG_DIR=/root/xenconsoled-log
> 
> After that Dom0 was fully updated and restarted.
> xen-debuginfo-3.0.3-105.el5_5.5.i386.rpm was installed before reboot.
> gdb running in a screen session.
> 
> I wonder if I did this correctly as no log files has been created so far.

The hypervisor log should be created immediately AFAIK and the file /etc/sysconfig/xend doesn't exist in my installation. Those should be environment variables instead of some edits done to the file that doesn't exist yet but that's fine. There's a very low chance those full logs would help through so no need to restart. But the most important thing to ask now is: Did you put a breakpoint on exit function, i.e. did you issue "b exit" in the gdb window before continuing using "c" gdb command?

If yes we need just wait and when it finishes in GDB (no matter whether using some signal or exiting normally) could you please get the backtrace using "bt" gdb command and post the GDB output to the bugzilla?

Thanks,
Michal

Comment 48 Alexander Lindqvist 2010-11-03 10:36:57 UTC
break done.

(gdb) b exit
Breakpoint 1 at 0xb39016
(gdb) c
Continuing.

So another 40+ days to go...

Comment 49 Michal Novotny 2010-11-03 10:39:14 UTC
(In reply to comment #48)
> break done.
> 
> (gdb) b exit
> Breakpoint 1 at 0xb39016
> (gdb) c
> Continuing.
> 
> So another 40+ days to go...

Ok, good. Thanks and now we have to wait.

Michal

Comment 50 Michal Novotny 2010-11-08 13:30:47 UTC
Alex,
could you please get me the output of the `ps u -C xenstored` after 10, 20, 30 and then 40 days and put the output per all those 4 days into this bugzilla when it dies?

It could help us reveal whether it's the memory leak issue reproducible after some time or not. I've been checking the memory usage after 3 days but I was unable to see anything so if you can provide me those information it could help us (hopefully).

Thanks,
Michal

Comment 51 R P Herrold 2010-11-12 21:05:08 UTC
We note this happening only on 'long running' 32 bit dom0 -- we get a email notification when it goes non-responsive, and the interval of thirty to 45 days looks like what we are seeing.   

We test, but have seen no such failures on 64 bit dom0

Comment 52 R P Herrold 2010-11-12 21:07:09 UTC
I might add that the email is triggered by an incrond rule, watching the PID file of the process, and kicking out the email when it disappears from the system

Comment 53 Michal Novotny 2010-11-29 10:25:17 UTC
(In reply to comment #52)
> I might add that the email is triggered by an incrond rule, watching the PID
> file of the process, and kicking out the email when it disappears from the
> system

Unfortunately I am afraid that the PID file doesn't disappear when it silently dies.

Michal

Comment 54 Julio Entrena Perez 2010-12-07 10:04:31 UTC
Customer had similar issue in one of their 32-bit dom0.

Several attempts to start a domU, resulted in the domU kernel being hung at different points:

--8<--[ 1st try with "debug initcall_debug" ]--8<--

Started domain dom123
[...]
SMP alternatives: switching to UP code
Freeing SMP alternatives: 13k freed
Brought up 1 CPUs
sizeof(vma)=88 bytes
sizeof(page)=32 bytes
sizeof(inode)=340 bytes
sizeof(dentry)=136 bytes
sizeof(ext3inode)=492 bytes
sizeof(buffer_head)=52 bytes
sizeof(skbuff)=172 bytes
checking if image is initramfs... it is
Freeing initrd memory: 6392k freed
Calling initcall 0xc06e5915: helper_init+0x0/0x23()
Calling initcall 0xc06e5b45

--8<--[ 2nd try with "debug initcall_debug" ]--8<--

Started domain dom123
[...]
SMP alternatives: switching to UP code
Freeing SMP alternatives: 13k freed
Brought up 1 CPUs
sizeof(vma)=88 bytes
sizeof(page)=32 bytes
sizeof(inode)=340 bytes
sizeof(dentry)=136 bytes
sizeof(ext3inode)=492 bytes
sizeof(buffer_head)=52 bytes
sizeof(skbuff)=172 bytes
checking if image is initramfs... it is
Freeing initrd memory: 6392k freed
Calling initcall 0xc06e5915: helper_init+0x0/0x23()
Calling initcall 0xc06e5b45: pm_init+0x0/0x1e()
Calling initcall 0xc06e5ee8: ksysfs_init+0x0/0x1e()
Calling initcall 0xc06e78a4: filelock_init+0x0/0x28()
Calling initcall 0xc06e7f21: init_misc_binfmt+0x0/0x2f()
Calling initcall 0xc06e7f50: init_script_binfmt+0x0/0xa()
Calling initcall 0xc06e7f5a: init_elf_binfmt+0x0/0xa()
Calling initcall 0xc06e8fb8: debugfs_init+0x0/0x39()
Calling initcall 0xc06e93da: securityfs_init+0x0/0x39()

--8<--[ 3rd try with "debug initcall_debug" ]--8<--

Started domain dom123
[...]
SMP alternatives: switching to UP code
Freeing SMP alternatives: 13k freed
Brought up 1 CPUs
sizeof(vma)=88 bytes
sizeof(page)=32 bytes
sizeof(inode)=340 bytes
sizeof(dentry)=136 bytes
sizeof(ext3inode)=492 bytes
sizeof(buffer_head)=52 bytes
sizeof(skbuff)=172 bytes
checking if image is initramfs... it is
Freeing initrd memory: 6392k freed

--8<--

At this point, 'xenconsoled' was still running. An strace on it revealed that, while it was apparently running fine and processing the previous messages from the domU console, after manually restarting 'xenconsoled' the domU booted up fine.

Comment 55 Michal Novotny 2010-12-08 08:55:38 UTC
(In reply to comment #54)
> Customer had similar issue in one of their 32-bit dom0.
> 
> Several attempts to start a domU, resulted in the domU kernel being hung at
> different points:
> 
> --8<--[ 1st try with "debug initcall_debug" ]--8<--
> 
> Started domain dom123
> [...]
> SMP alternatives: switching to UP code
> Freeing SMP alternatives: 13k freed
> Brought up 1 CPUs
> sizeof(vma)=88 bytes
> sizeof(page)=32 bytes
> sizeof(inode)=340 bytes
> sizeof(dentry)=136 bytes
> sizeof(ext3inode)=492 bytes
> sizeof(buffer_head)=52 bytes
> sizeof(skbuff)=172 bytes
> checking if image is initramfs... it is
> Freeing initrd memory: 6392k freed
> Calling initcall 0xc06e5915: helper_init+0x0/0x23()
> Calling initcall 0xc06e5b45
> 
> --8<--[ 2nd try with "debug initcall_debug" ]--8<--
> 
> Started domain dom123
> [...]
> SMP alternatives: switching to UP code
> Freeing SMP alternatives: 13k freed
> Brought up 1 CPUs
> sizeof(vma)=88 bytes
> sizeof(page)=32 bytes
> sizeof(inode)=340 bytes
> sizeof(dentry)=136 bytes
> sizeof(ext3inode)=492 bytes
> sizeof(buffer_head)=52 bytes
> sizeof(skbuff)=172 bytes
> checking if image is initramfs... it is
> Freeing initrd memory: 6392k freed
> Calling initcall 0xc06e5915: helper_init+0x0/0x23()
> Calling initcall 0xc06e5b45: pm_init+0x0/0x1e()
> Calling initcall 0xc06e5ee8: ksysfs_init+0x0/0x1e()
> Calling initcall 0xc06e78a4: filelock_init+0x0/0x28()
> Calling initcall 0xc06e7f21: init_misc_binfmt+0x0/0x2f()
> Calling initcall 0xc06e7f50: init_script_binfmt+0x0/0xa()
> Calling initcall 0xc06e7f5a: init_elf_binfmt+0x0/0xa()
> Calling initcall 0xc06e8fb8: debugfs_init+0x0/0x39()
> Calling initcall 0xc06e93da: securityfs_init+0x0/0x39()
> 
> --8<--[ 3rd try with "debug initcall_debug" ]--8<--
> 
> Started domain dom123
> [...]
> SMP alternatives: switching to UP code
> Freeing SMP alternatives: 13k freed
> Brought up 1 CPUs
> sizeof(vma)=88 bytes
> sizeof(page)=32 bytes
> sizeof(inode)=340 bytes
> sizeof(dentry)=136 bytes
> sizeof(ext3inode)=492 bytes
> sizeof(buffer_head)=52 bytes
> sizeof(skbuff)=172 bytes
> checking if image is initramfs... it is
> Freeing initrd memory: 6392k freed
> 
> --8<--
> 
> At this point, 'xenconsoled' was still running. An strace on it revealed that,
> while it was apparently running fine and processing the previous messages from
> the domU console, after manually restarting 'xenconsoled' the domU booted up
> fine.

Thanks Julio. Is this a PV guest or HVM guest? I don't know whether it can be connected or not but if you give me some more information how to reproduce the behaviour (including the guest information) we can try to check.

Thanks,
Michal

Comment 57 Julio Entrena Perez 2010-12-08 09:36:50 UTC
Michal, it's a PV guest, that's all I know.

Originally, 'xm console foo' gave the following error:
xenconsole: Could not read tty from store: No such file or directory

However, this seems a bit different from the original behaviour.

Comment 63 Alexander Lindqvist 2010-12-13 20:31:42 UTC
Here is the output so far. Im now checking a few times per day to spot any difference. Anything more to look for ? Uptime is now 41 days. If I remember correctly the pid of xenstore disappears on my system just like R P Herrold above describes.

# w
 22:28:54 up 11 days,  1:38,  1 user,  load average: 0.00, 0.00, 0.00
USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU WHAT
root     pts/0    192.168.0.1      22:28    0.00s  0.06s  0.03s w
# ps u -C xenstored
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      3163  0.0  0.1   2364  1128 ?        S    Nov02   0:02 xenstored --pid-file /var/run/xenstore.pid
# pmap 3163
3163:   xenstored --pid-file /var/run/xenstore.pid
00139000     88K r-x--  /lib/i686/nosegneg/libpthread-2.5.so
0014f000      4K r-x--  /lib/i686/nosegneg/libpthread-2.5.so
00150000      4K rwx--  /lib/i686/nosegneg/libpthread-2.5.so
00151000      8K rwx--    [ anon ]
006d1000     80K r-x--  /usr/lib/libxenctrl.so.3.0.0.#prelink#.2P7fpe (deleted)
006e5000      4K rwx--  /usr/lib/libxenctrl.so.3.0.0.#prelink#.2P7fpe (deleted)
006e6000     88K rwx--    [ anon ]
00958000   1368K r-x--  /lib/i686/nosegneg/libc-2.5.so
00aae000      8K r-x--  /lib/i686/nosegneg/libc-2.5.so
00ab0000      4K rwx--  /lib/i686/nosegneg/libc-2.5.so
00ab1000     12K rwx--    [ anon ]
00ced000    108K r-x--  /lib/ld-2.5.so
00d08000      4K r-x--  /lib/ld-2.5.so
00d09000      4K rwx--  /lib/ld-2.5.so
00ef3000      4K r-x--    [ anon ]
08047000     72K r-x--  /usr/sbin/xenstored
08059000      4K rw---  /usr/sbin/xenstored
0805a000     12K rw---    [ anon ]
08358000    388K rw---    [ anon ]
b7ff5000      4K rw-s-  /proc/xen/privcmd
b7ff6000      4K rw-s-  /proc/xen/privcmd
b7ff7000      4K rw-s-  /proc/xen/privcmd
b7ff8000      8K rw---    [ anon ]
b7ffa000      4K rw-s-  /proc/xen/privcmd
b7ffb000      4K rw-s-  /proc/xen/privcmd
b7ffc000      4K rw-s-  /proc/xen/privcmd
b7ffd000      4K rw-s-  /proc/xen/privcmd
b7ffe000      4K rw-s-  /proc/xen/xsd_kva
b7fff000      4K rw---    [ anon ]
bfc4f000     84K rw---    [ stack ]
 total     2392K


# w
 22:31:34 up 22 days,  1:41,  1 user,  load average: 0.00, 0.00, 0.00
USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU WHAT
root     pts/0    192.168.0.1      22:31    0.00s  0.03s  0.02s w
# ps u -C xenstored
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      3163  0.0  0.1   2364  1128 ?        S    Nov02   0:02 xenstored --pid
# pmap 3163
3163:   xenstored --pid-file /var/run/xenstore.pid
00139000     88K r-x--  /lib/i686/nosegneg/libpthread-2.5.so
0014f000      4K r-x--  /lib/i686/nosegneg/libpthread-2.5.so
00150000      4K rwx--  /lib/i686/nosegneg/libpthread-2.5.so
00151000      8K rwx--    [ anon ]
006d1000     80K r-x--  /usr/lib/libxenctrl.so.3.0.0.#prelink#.2P7fpe (deleted)
006e5000      4K rwx--  /usr/lib/libxenctrl.so.3.0.0.#prelink#.2P7fpe (deleted)
006e6000     88K rwx--    [ anon ]
00958000   1368K r-x--  /lib/i686/nosegneg/libc-2.5.so
00aae000      8K r-x--  /lib/i686/nosegneg/libc-2.5.so
00ab0000      4K rwx--  /lib/i686/nosegneg/libc-2.5.so
00ab1000     12K rwx--    [ anon ]
00ced000    108K r-x--  /lib/ld-2.5.so
00d08000      4K r-x--  /lib/ld-2.5.so
00d09000      4K rwx--  /lib/ld-2.5.so
00ef3000      4K r-x--    [ anon ]
08047000     72K r-x--  /usr/sbin/xenstored
08059000      4K rw---  /usr/sbin/xenstored
0805a000     12K rw---    [ anon ]
08358000    388K rw---    [ anon ]
b7ff5000      4K rw-s-  /proc/xen/privcmd
b7ff6000      4K rw-s-  /proc/xen/privcmd
b7ff7000      4K rw-s-  /proc/xen/privcmd
b7ff8000      8K rw---    [ anon ]
b7ffa000      4K rw-s-  /proc/xen/privcmd
b7ffb000      4K rw-s-  /proc/xen/privcmd
b7ffc000      4K rw-s-  /proc/xen/privcmd
b7ffd000      4K rw-s-  /proc/xen/privcmd
b7ffe000      4K rw-s-  /proc/xen/xsd_kva
b7fff000      4K rw---    [ anon ]
bfc4f000     84K rw---    [ stack ]
 total     2392K


# w
 21:24:39 up 31 days, 34 min,  1 user,  load average: 0.00, 0.00, 0.00
USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU WHAT
root     pts/7    192.168.0.1      21:24    0.00s  0.04s  0.02s w
# ps u -C xenstored
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      3163  0.0  0.1   2428  1196 ?        S    Nov02   0:07 xenstored --pid
# pmap 3163
3163:   xenstored --pid-file /var/run/xenstore.pid
00139000     88K r-x--  /lib/i686/nosegneg/libpthread-2.5.so
0014f000      4K r-x--  /lib/i686/nosegneg/libpthread-2.5.so
00150000      4K rwx--  /lib/i686/nosegneg/libpthread-2.5.so
00151000      8K rwx--    [ anon ]
006d1000     80K r-x--  /usr/lib/libxenctrl.so.3.0.0.#prelink#.2P7fpe (deleted)
006e5000      4K rwx--  /usr/lib/libxenctrl.so.3.0.0.#prelink#.2P7fpe (deleted)
006e6000     88K rwx--    [ anon ]
00958000   1368K r-x--  /lib/i686/nosegneg/libc-2.5.so
00aae000      8K r-x--  /lib/i686/nosegneg/libc-2.5.so
00ab0000      4K rwx--  /lib/i686/nosegneg/libc-2.5.so
00ab1000     12K rwx--    [ anon ]
00ced000    108K r-x--  /lib/ld-2.5.so
00d08000      4K r-x--  /lib/ld-2.5.so
00d09000      4K rwx--  /lib/ld-2.5.so
00ef3000      4K r-x--    [ anon ]
08047000     72K r-x--  /usr/sbin/xenstored
08059000      4K rw---  /usr/sbin/xenstored
0805a000     12K rw---    [ anon ]
08358000    452K rw---    [ anon ]
b7ff5000      4K rw-s-  /proc/xen/privcmd
b7ff6000      4K rw-s-  /proc/xen/privcmd
b7ff7000      4K rw-s-  /proc/xen/privcmd
b7ff8000      8K rw---    [ anon ]
b7ffa000      4K rw-s-  /proc/xen/privcmd
b7ffb000      4K rw-s-  /proc/xen/privcmd
b7ffc000      4K rw-s-  /proc/xen/privcmd
b7ffd000      4K rw-s-  /proc/xen/privcmd
b7ffe000      4K rw-s-  /proc/xen/xsd_kva
b7fff000      4K rw---    [ anon ]
bfc4f000     84K rw---    [ stack ]
 total     2456K


# w
 22:00:42 up 40 days,  1:10,  1 user,  load average: 0.00, 0.00, 0.00
USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU WHAT
root     pts/7    192.168.0.1      22:00    0.00s  0.04s  0.02s w
# ps u -C xenstored
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      3163  0.0  0.1   2428  1196 ?        S    Nov02   0:07 xenstored --pid
# pmap 3163
3163:   xenstored --pid-file /var/run/xenstore.pid
00139000     88K r-x--  /lib/i686/nosegneg/libpthread-2.5.so
0014f000      4K r-x--  /lib/i686/nosegneg/libpthread-2.5.so
00150000      4K rwx--  /lib/i686/nosegneg/libpthread-2.5.so
00151000      8K rwx--    [ anon ]
006d1000     80K r-x--  /usr/lib/libxenctrl.so.3.0.0.#prelink#.2P7fpe (deleted)
006e5000      4K rwx--  /usr/lib/libxenctrl.so.3.0.0.#prelink#.2P7fpe (deleted)
006e6000     88K rwx--    [ anon ]
00958000   1368K r-x--  /lib/i686/nosegneg/libc-2.5.so
00aae000      8K r-x--  /lib/i686/nosegneg/libc-2.5.so
00ab0000      4K rwx--  /lib/i686/nosegneg/libc-2.5.so
00ab1000     12K rwx--    [ anon ]
00ced000    108K r-x--  /lib/ld-2.5.so
00d08000      4K r-x--  /lib/ld-2.5.so
00d09000      4K rwx--  /lib/ld-2.5.so
00ef3000      4K r-x--    [ anon ]
08047000     72K r-x--  /usr/sbin/xenstored
08059000      4K rw---  /usr/sbin/xenstored
0805a000     12K rw---    [ anon ]
08358000    452K rw---    [ anon ]
b7ff5000      4K rw-s-  /proc/xen/privcmd
b7ff6000      4K rw-s-  /proc/xen/privcmd
b7ff7000      4K rw-s-  /proc/xen/privcmd
b7ff8000      8K rw---    [ anon ]
b7ffa000      4K rw-s-  /proc/xen/privcmd
b7ffb000      4K rw-s-  /proc/xen/privcmd
b7ffc000      4K rw-s-  /proc/xen/privcmd
b7ffd000      4K rw-s-  /proc/xen/privcmd
b7ffe000      4K rw-s-  /proc/xen/xsd_kva
b7fff000      4K rw---    [ anon ]
bfc4f000     84K rw---    [ stack ]
 total     2456K

Comment 64 Michal Novotny 2010-12-14 09:19:24 UTC
(In reply to comment #63)
> Here is the output so far. Im now checking a few times per day to spot any
> difference. Anything more to look for ? Uptime is now 41 days. If I remember
> correctly the pid of xenstore disappears on my system just like R P Herrold
> above describes.
> 

That's right, the pid disappears but not of xenstore rather than console daemon. This bug is about console daemon dying, right? It *may* be connected to xenstore since it's accessing xenstore using the xs library but I don't see anything bad in your comment #63. Could you please try to monitor both xenstored and xenconsoled (no need to start over now, you could just start doing the same for xenconsole too).

Thanks,
Michal

Comment 65 R P Herrold 2010-12-14 16:57:30 UTC
Comment 50 states in part: It could help us reveal whether it's the memory leak issue reproducible after some time or not.

Looking at the process table, there is little question that this is the case.  


[root@xen-n005 ~]# top -p25114 -p4477 -p4486
top - 11:50:44 up 42 days, 17:55,  2 users,  load average: 0.02, 0.03, 0.01
Tasks:   3 total,   0 running,   3 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.1%sy,  0.0%ni, 99.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   1048576k total,   889972k used,   158604k free,    32440k buffers
Swap:  8388600k total,   248892k used,  8139708k free,   155132k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
25114 root      16   0  656m 273m 2116 S  0.3 26.7   5:40.72 libvirtd
 4477 root      16   0  9024  980  488 S  0.0  0.1  11:05.22 xenstored
 4486 root      15   0 88324  420  416 S  0.0  0.0   0:00.40 xenconsoled

and it leaves a shambles when it get too big, and the OOM killer gets invoked -- race file of one minute of such when it locked up on a box last time in an attachment in a moment

To provoke more quickly, simply set up polls from several xm processes in repeating loops ... we do all ours across the lan, and this may be exercising some networking code, compared to testing just on a single localhost

-- Russ herrold

Comment 66 R P Herrold 2010-12-14 16:59:05 UTC
Created attachment 468645 [details]
/v/l/messages when the OOM is invokde through a memory leaking libvirtd

Comment 67 Alexander Lindqvist 2010-12-14 20:26:29 UTC
My bad. I meant xenconsoled. Not xenstore. Below is new stats at the moment. Im now also running gdb against libvirtd in a separate screen session. Also noticing that libvirtd is only at 60M at the moment compared to Russ at 656M.
# top -p3163 -p2646 -p10504

top - 21:02:10 up 42 days, 11 min,  1 user,  load average: 0.00, 0.00, 0.00
Tasks:   3 total,   0 running,   3 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.0%sy,  0.0%ni, 99.8%id,  0.1%wa,  0.0%hi,  0.0%si,  0.1%st
Mem:   1048576k total,   649188k used,   399388k free,   223728k buffers
Swap:   746936k total,        0k used,   746936k free,   248424k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
 3163 root      23   0  2460 1196  600 S  0.0  0.1   0:08.11 xenstored          
 2646 root      15   0 60392 2892 2384 S  0.0  0.3   0:00.15 libvirtd           
10504 root      16   0 12400  704  500 S  0.0  0.1   0:00.28 xenconsoled

# pmap 10504
10504:   xenconsoled
00110000     44K r-x--  /lib/libgcc_s-4.1.2-20080825.so.1
0011b000      4K rwx--  /lib/libgcc_s-4.1.2-20080825.so.1
00aee000    108K r-x--  /lib/ld-2.5.so
00b09000      4K r-x--  /lib/ld-2.5.so
00b0a000      4K rwx--  /lib/ld-2.5.so
00b0d000   1368K r-x--  /lib/i686/nosegneg/libc-2.5.so
00c63000      8K r-x--  /lib/i686/nosegneg/libc-2.5.so
00c65000      4K rwx--  /lib/i686/nosegneg/libc-2.5.so
00c66000     12K rwx--    [ anon ]
00c72000     88K r-x--  /lib/i686/nosegneg/libpthread-2.5.so
00c88000      4K r-x--  /lib/i686/nosegneg/libpthread-2.5.so
00c89000      4K rwx--  /lib/i686/nosegneg/libpthread-2.5.so
00c8a000      8K rwx--    [ anon ]
00cb9000     80K r-x--  /usr/lib/libxenctrl.so.3.0.0
00ccd000      4K rwx--  /usr/lib/libxenctrl.so.3.0.0
00cce000     88K rwx--    [ anon ]
00d3b000      4K r-x--    [ anon ]
00dc4000     16K r-x--  /usr/lib/libxenstore.so.3.0.0
00dc8000      4K rwx--  /usr/lib/libxenstore.so.3.0.0
00dc9000     12K rwx--    [ anon ]
08048000     20K r-x--  /usr/sbin/xenconsoled
0804d000      8K rw---  /usr/sbin/xenconsoled
09f51000    132K rw---    [ anon ]
b75bf000      4K rw-s-  /proc/xen/privcmd
b75c0000      4K rw-s-  /proc/xen/privcmd
b75c1000      4K -----    [ anon ]
b75c2000  10248K rw---    [ anon ]
b7fc4000      4K rw-s-  /proc/xen/privcmd
b7fc5000      4K rw-s-  /proc/xen/privcmd
b7fc6000      4K rw-s-  /proc/xen/privcmd
b7fc7000      4K rw-s-  /proc/xen/privcmd
b7fc8000      4K rw-s-  /proc/xen/privcmd
b7fc9000      4K rw---    [ anon ]
bf9bc000     84K rw---    [ stack ]
 total    12396K

pmap 2646
2646:   libvirtd --daemon
00110000     72K r-x--  /usr/lib/libz.so.1.2.3
00122000      4K rwx--  /usr/lib/libz.so.1.2.3
00123000     16K r-x--  /usr/lib/libnuma.so.1
00127000      4K rwx--  /usr/lib/libnuma.so.1
00128000     96K r-x--  /usr/lib/libsasl2.so.2.0.22
00140000      4K rwx--  /usr/lib/libsasl2.so.2.0.22
00141000     16K r-x--  /usr/lib/libxenstore.so.3.0.0.#prelink#.5Ow8hp (deleted)
00145000      4K rwx--  /usr/lib/libxenstore.so.3.0.0.#prelink#.5Ow8hp (deleted)
00146000     12K rwx--    [ anon ]
00149000     44K r-x--  /usr/lib/libavahi-common.so.3.4.3.#prelink#.i277u8 (deleted)
00154000      4K rwx--  /usr/lib/libavahi-common.so.3.4.3.#prelink#.i277u8 (deleted)
00155000     60K r-x--  /usr/lib/libavahi-client.so.3.2.1.#prelink#.Quy449 (deleted)
00164000      4K rwx--  /usr/lib/libavahi-client.so.3.2.1.#prelink#.Quy449 (deleted)
00165000     88K r-x--  /lib/i686/nosegneg/libpthread-2.5.so
0017b000      4K r-x--  /lib/i686/nosegneg/libpthread-2.5.so
0017c000      4K rwx--  /lib/i686/nosegneg/libpthread-2.5.so
0017d000      8K rwx--    [ anon ]
0017f000     12K r-x--  /lib/libcap.so.1.10
00182000      4K rwx--  /lib/libcap.so.1.10
00183000    236K r-x--  /lib/libsepol.so.1
001be000      4K rwx--  /lib/libsepol.so.1
001bf000     40K rwx--    [ anon ]
001c9000     44K r-x--  /lib/libgcc_s-4.1.2-20080825.so.1.#prelink#.0yZASk (deleted)
001d4000      4K rwx--  /lib/libgcc_s-4.1.2-20080825.so.1.#prelink#.0yZASk (deleted)
001d5000     16K r-x--  /usr/lib/sasl2/libplain.so.2.0.22
001d9000      4K rwx--  /usr/lib/sasl2/libplain.so.2.0.22
002e3000     16K r-x--  /usr/lib/sasl2/libanonymous.so.2.0.22
002e7000      4K rwx--  /usr/lib/sasl2/libanonymous.so.2.0.22
003a5000      4K r-x--    [ anon ]
003da000    876K r-x--  /usr/lib/sasl2/libsasldb.so.2.0.22
004b5000      8K rwx--  /usr/lib/sasl2/libsasldb.so.2.0.22
00543000     16K r-x--  /usr/lib/sasl2/libcrammd5.so.2.0.22
00547000      4K rwx--  /usr/lib/sasl2/libcrammd5.so.2.0.22
005e5000     12K r-x--  /lib/libdl-2.5.so
005e8000      4K r-x--  /lib/libdl-2.5.so
005e9000      4K rwx--  /lib/libdl-2.5.so
006d1000     88K r-x--  /lib/libselinux.so.1
006e7000      8K rwx--  /lib/libselinux.so.1
006ff000   1200K r-x--  /usr/lib/libxml2.so.2.6.26
0082b000     20K rwx--  /usr/lib/libxml2.so.2.6.26
00830000      4K rwx--    [ anon ]
0083b000    244K r-x--  /lib/libdbus-1.so.3.4.0
00878000      8K rwx--  /lib/libdbus-1.so.3.4.0
0087a000    156K r-x--  /lib/i686/nosegneg/libm-2.5.so
008a1000      4K r-x--  /lib/i686/nosegneg/libm-2.5.so
008a2000      4K rwx--  /lib/i686/nosegneg/libm-2.5.so
008d8000     40K r-x--  /usr/lib/libhal.so.1.0.0
008e2000      4K rwx--  /usr/lib/libhal.so.1.0.0
00972000     12K r-x--  /usr/lib/libgpg-error.so.0.3.0.#prelink#.nouCri (deleted)
00975000      4K rwx--  /usr/lib/libgpg-error.so.0.3.0.#prelink#.nouCri (deleted)
00978000    504K r-x--  /usr/lib/libgcrypt.so.11.5.2.#prelink#.BArMt5 (deleted)
009f6000     12K rwx--  /usr/lib/libgcrypt.so.11.5.2.#prelink#.BArMt5 (deleted)
009fc000    108K r-x--  /lib/ld-2.5.so
00a17000      4K r-x--  /lib/ld-2.5.so
00a18000      4K rwx--  /lib/ld-2.5.so
00a55000     44K r-x--  /usr/lib/sasl2/libdigestmd5.so.2.0.22
00a60000      4K rwx--  /usr/lib/sasl2/libdigestmd5.so.2.0.22
00aef000     64K r-x--  /lib/libresolv-2.5.so
00aff000      4K r-x--  /lib/libresolv-2.5.so
00b00000      4K rwx--  /lib/libresolv-2.5.so
00b01000      8K rwx--    [ anon ]
00b2f000    656K r-x--  /usr/lib/libvirt.so.0.6.3
00bd3000     12K rwx--  /usr/lib/libvirt.so.0.6.3
00bd6000     60K rwx--    [ anon ]
00c32000    484K r-x--  /usr/lib/libgnutls.so.13.0.6.#prelink#.SZhLzv (deleted)
00cab000     24K rwx--  /usr/lib/libgnutls.so.13.0.6.#prelink#.SZhLzv (deleted)
00cf1000     36K r-x--  /lib/libcrypt-2.5.so
00cfa000      4K r-x--  /lib/libcrypt-2.5.so
00cfb000      4K rwx--  /lib/libcrypt-2.5.so
00cfc000    156K rwx--    [ anon ]
00def000   1368K r-x--  /lib/i686/nosegneg/libc-2.5.so
00f45000      8K r-x--  /lib/i686/nosegneg/libc-2.5.so
00f47000      4K rwx--  /lib/i686/nosegneg/libc-2.5.so
00f48000     12K rwx--    [ anon ]
00f5b000     16K r-x--  /usr/lib/sasl2/liblogin.so.2.0.22
00f5f000      4K rwx--  /usr/lib/sasl2/liblogin.so.2.0.22
00f60000   1192K r-x--  /lib/libcrypto.so.0.9.8e.#prelink#.e0gSuf (deleted)
0108a000     76K rwx--  /lib/libcrypto.so.0.9.8e.#prelink#.e0gSuf (deleted)
0109d000     16K rwx--    [ anon ]
08048000    360K r-x--  /usr/sbin/libvirtd
080a2000     20K rw---  /usr/sbin/libvirtd
09c66000    232K rw---    [ anon ]
b4d4a000      4K -----    [ anon ]
b4d4b000  10240K rw---    [ anon ]
b574b000      4K -----    [ anon ]
b574c000  10240K rw---    [ anon ]
b614c000      4K -----    [ anon ]
b614d000  10240K rw---    [ anon ]
b6b4d000      4K -----    [ anon ]
b6b4e000  10240K rw---    [ anon ]
b754e000      4K -----    [ anon ]
b754f000  10264K rw---    [ anon ]
b7f5a000      4K rw---    [ anon ]
bf855000     84K rw---    [ stack ]
 total    60388K

Comment 68 Paolo Bonzini 2010-12-14 21:17:09 UTC
Thanks Russ, that's indeed the likely cause.  We had a memory leak in xenstore (bug 606919) and one in libvirtd (bug 590073).

Both will be fixed on 5.6.0 and 5.5.z.  Russ and Alexander, what versions do you have of the xen and libvirt packages?

Comment 69 R P Herrold 2010-12-14 21:57:15 UTC
For x86_64 units ...

The control head unit has:

[herrold@centos-5 pmman]$ ssh -l root xyz.pmman.com rpm -qa  libvirt\\*
libvirt-python-0.6.3-33.el5_5.3
libvirt-0.6.3-33.el5_5.3
libvirt-0.6.3-33.el5_5.3
[herrold@centos-5 pmman]$ ssh -l root xyz.pmman.com rpm -qa  \\*xen\\*
kernel-xen-2.6.18-128.1.10.el5
xen-libs-3.0.3-105.el5_5.5
xen-3.0.3-105.el5_5.5
kernel-xen-2.6.18-128.el5
xen-libs-3.0.3-105.el5_5.5
kernel-xen-2.6.18-194.17.1.el5
kernel-xen-2.6.18-128.2.1.el5
kernel-xen-2.6.18-128.7.1.el5
[herrold@centos-5 pmman]$   

A representative dom0 has: 

[herrold@centos-5 pmman]$ ssh -l root xen-xyz.pmman.com rpm -qa  libvirt\\*
libvirt-0.6.3-33.el5_5.3
libvirt-0.6.3-33.el5_5.3
libvirt-python-0.6.3-33.el5_5.3
[herrold@centos-5 pmman]$ ssh -l root xen-xyz.pmman.com rpm -qa  \\*xen\\*
kernel-xen-2.6.18-128.el5
kernel-xen-2.6.18-128.7.1.el5
xen-3.0.3-105.el5_5.5
xen-libs-3.0.3-105.el5_5.5
xen-libs-3.0.3-105.el5_5.5
kernel-xen-2.6.18-194.17.1.el5
[herrold@centos-5 pmman]$   


for i386 units acting as dom0

[herrold@centos-5 pmman]$ ssh -l root xen-zyx.pmman.com rpm -qa  libvirt\\*
libvirt-0.6.3-33.el5_5.3
libvirt-python-0.6.3-33.el5_5.3
[herrold@centos-5 pmman]$ ssh -l root xen-zyx.pmman.com rpm -qa   \\*xen\\*
xen-libs-3.0.3-105.el5_5.5
kernel-xen-2.6.18-194.17.1.el5
kernel-xen-2.6.18-194.17.4.el5
xen-3.0.3-105.el5_5.5
kernel-xen-2.6.18-194.26.1.el5
[herrold@centos-5 pmman]$     

I'll rebuild the 5.6 SRPMs and deploy to a test unit -- I 'cc'd' into the two referenced bugs already

Comment 70 Alexander Lindqvist 2010-12-15 07:33:55 UTC
#rpm -qa | grep -E "xen|libvirt"
xen-3.0.3-105.el5_5.5
libvirt-0.6.3-33.el5_5.3
kernel-xen-2.6.18-194.17.4.el5
xen-debuginfo-3.0.3-105.el5_5.5
xen-libs-3.0.3-105.el5_5.5
libvirt-python-0.6.3-33.el5_5.3
kernel-xen-2.6.18-194.11.1.el5
kernel-xen-2.6.18-194.11.3.el5

Comment 71 Michal Novotny 2010-12-15 19:15:51 UTC
(In reply to comment #65)
> Comment 50 states in part: It could help us reveal whether it's the memory leak
> issue reproducible after some time or not.
> 
> Looking at the process table, there is little question that this is the case.  
> 
> 
> [root@xen-n005 ~]# top -p25114 -p4477 -p4486
> top - 11:50:44 up 42 days, 17:55,  2 users,  load average: 0.02, 0.03, 0.01
> Tasks:   3 total,   0 running,   3 sleeping,   0 stopped,   0 zombie
> Cpu(s):  0.0%us,  0.1%sy,  0.0%ni, 99.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Mem:   1048576k total,   889972k used,   158604k free,    32440k buffers
> Swap:  8388600k total,   248892k used,  8139708k free,   155132k cached
> 
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 25114 root      16   0  656m 273m 2116 S  0.3 26.7   5:40.72 libvirtd
>  4477 root      16   0  9024  980  488 S  0.0  0.1  11:05.22 xenstored
>  4486 root      15   0 88324  420  416 S  0.0  0.0   0:00.40 xenconsoled
> 
> and it leaves a shambles when it get too big, and the OOM killer gets invoked
> -- race file of one minute of such when it locked up on a box last time in an
> attachment in a moment
> 
> To provoke more quickly, simply set up polls from several xm processes in
> repeating loops ... we do all ours across the lan, and this may be exercising
> some networking code, compared to testing just on a single localhost
> 
> -- Russ herrold

We need some data grabbed on the same system before that to compare there's really a memory leak. I can't find the older statistics of libvirtd/xenstored and xenconsoled in those comments. Are you having some results of 10 days uptime, 20 days uptime, 30 days uptime etc. ? Can you post them in a comment ? No need to attach a file. Just a simple form of:

After 10 days:
...

After 20 days:
...

etc.

Thanks,
Michal

Comment 72 R P Herrold 2010-12-15 20:25:06 UTC
as to Comment 71 , I have sysstat running I imagine ... checking

Comment 73 R P Herrold 2010-12-15 20:36:11 UTC
It looks as though the default is only seven days 

I have rolled a tarball (and it may also contain confidential data I would rather not have in the public bugzilla) As it is over 3M, I will place it at a private location, and and advice where Michal where it might be retrieved by private email

For those following this bug, I have added to our setup outline a Deployment Doco fix

on a dom0:
   When installing sysstat, edit the configs to keep 30 day's sysstat logs (this is similar to the logrotate fixup)

[root@xen-n005 log]# cat /etc/sysconfig/sysstat
# How long to keep log files (days), maximum is a month
HISTORY=7
[root@xen-n005 log]#

to be able to look at system load history for debugging  -- see RH bug 593339

- Russ herrold

Comment 74 R P Herrold 2010-12-15 20:54:20 UTC
I have send a private email sent to minovotn with the URL for the sar data tarball


A couple days ago at the time of my post the stats were:

25114 root      16   0  656m 273m 2116 S  0.3 26.7   5:40.72 libvirtd

Today they are:

25114 root      15   0  793m 220m 1528 S  0.0 21.5   7:23.83 libvirtd

Then I run:

/sbin/service libvirtd restart

and the stats drop to:

26682 root      17   0  126m 4856 3104 S  0.0  0.5   0:00.57 libvirtd

which is 'normal'

Is there any residual doubt here that a leak in the libvirtd is in play?

======================

What is happening is that the lost (leaked) memory is being moved out to the swap file, and never called back in (not surprising, it is lost, and would not properly be used any more)

Finally the swap partition fills up and so lost memory has to remain in RAM

Then some other process calls for a malloc, and cannot be satisfied, and the OOM killer kicks in

-- Russ herrold

Comment 75 Alexander Lindqvist 2010-12-15 21:47:20 UTC
I rechecked the memory consumption again today (roughly 26 hours since last time) to compare with what Russ is seeing and it has not increased at all on any of the processes below on my dom0. And xenconsoled is still running strong.

# top -p3163 -p2646 -p10504

top - 22:41:12 up 43 days,  1:39,  1 user,  load average: 0.00, 0.00, 0.00
Tasks:   3 total,   0 running,   3 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   1048576k total,   657436k used,   391140k free,   223764k buffers
Swap:   746936k total,        0k used,   746936k free,   248592k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
 3163 root      15   0  2460 1196  600 S  0.0  0.1   0:08.25 xenstored          
 2646 root      18   0 60392 2892 2372 S  0.0  0.3   0:00.15 libvirtd           
10504 root      16   0 12400  704  500 S  0.0  0.1   0:00.30 xenconsoled

Comment 76 Alexander Lindqvist 2010-12-21 12:00:31 UTC
Xenconsoled has now stopped working. Here is the output. The symptom is the same every time. SSh'ed in to dom0. xm console into domU. Console output hangs when trying to input password for login. Xenconsoled still appears to be running but using xm console does not produce any more output. All domU's working fine so the dom0 is left in this state for now if you want me to check anything. gdb against libvirtd is still running fine.

# pmap 2646
2646:   libvirtd --daemon
00110000     72K r-x--  /usr/lib/libz.so.1.2.3
00122000      4K rwx--  /usr/lib/libz.so.1.2.3
00123000     16K r-x--  /usr/lib/libnuma.so.1
00127000      4K rwx--  /usr/lib/libnuma.so.1
00128000     96K r-x--  /usr/lib/libsasl2.so.2.0.22
00140000      4K rwx--  /usr/lib/libsasl2.so.2.0.22
00141000     16K r-x--  /usr/lib/libxenstore.so.3.0.0.#prelink#.5Ow8hp (deleted)
00145000      4K rwx--  /usr/lib/libxenstore.so.3.0.0.#prelink#.5Ow8hp (deleted)
00146000     12K rwx--    [ anon ]
00149000     44K r-x--  /usr/lib/libavahi-common.so.3.4.3.#prelink#.i277u8 (deleted)
00154000      4K rwx--  /usr/lib/libavahi-common.so.3.4.3.#prelink#.i277u8 (deleted)
00155000     60K r-x--  /usr/lib/libavahi-client.so.3.2.1.#prelink#.Quy449 (deleted)
00164000      4K rwx--  /usr/lib/libavahi-client.so.3.2.1.#prelink#.Quy449 (deleted)
00165000     88K r-x--  /lib/i686/nosegneg/libpthread-2.5.so
0017b000      4K r-x--  /lib/i686/nosegneg/libpthread-2.5.so
0017c000      4K rwx--  /lib/i686/nosegneg/libpthread-2.5.so
0017d000      8K rwx--    [ anon ]
0017f000     12K r-x--  /lib/libcap.so.1.10
00182000      4K rwx--  /lib/libcap.so.1.10
00183000    236K r-x--  /lib/libsepol.so.1
001be000      4K rwx--  /lib/libsepol.so.1
001bf000     40K rwx--    [ anon ]
001c9000     44K r-x--  /lib/libgcc_s-4.1.2-20080825.so.1.#prelink#.0yZASk (deleted)
001d4000      4K rwx--  /lib/libgcc_s-4.1.2-20080825.so.1.#prelink#.0yZASk (deleted)
001d5000     16K r-x--  /usr/lib/sasl2/libplain.so.2.0.22
001d9000      4K rwx--  /usr/lib/sasl2/libplain.so.2.0.22
002e3000     16K r-x--  /usr/lib/sasl2/libanonymous.so.2.0.22
002e7000      4K rwx--  /usr/lib/sasl2/libanonymous.so.2.0.22
003a5000      4K r-x--    [ anon ]
003da000    876K r-x--  /usr/lib/sasl2/libsasldb.so.2.0.22
004b5000      8K rwx--  /usr/lib/sasl2/libsasldb.so.2.0.22
00543000     16K r-x--  /usr/lib/sasl2/libcrammd5.so.2.0.22
00547000      4K rwx--  /usr/lib/sasl2/libcrammd5.so.2.0.22
005e5000     12K r-x--  /lib/libdl-2.5.so
005e8000      4K r-x--  /lib/libdl-2.5.so
005e9000      4K rwx--  /lib/libdl-2.5.so
006d1000     88K r-x--  /lib/libselinux.so.1
006e7000      8K rwx--  /lib/libselinux.so.1
006ff000   1200K r-x--  /usr/lib/libxml2.so.2.6.26
0082b000     20K rwx--  /usr/lib/libxml2.so.2.6.26
00830000      4K rwx--    [ anon ]
0083b000    244K r-x--  /lib/libdbus-1.so.3.4.0
00878000      8K rwx--  /lib/libdbus-1.so.3.4.0
0087a000    156K r-x--  /lib/i686/nosegneg/libm-2.5.so
008a1000      4K r-x--  /lib/i686/nosegneg/libm-2.5.so
008a2000      4K rwx--  /lib/i686/nosegneg/libm-2.5.so
008d8000     40K r-x--  /usr/lib/libhal.so.1.0.0
008e2000      4K rwx--  /usr/lib/libhal.so.1.0.0
00972000     12K r-x--  /usr/lib/libgpg-error.so.0.3.0.#prelink#.nouCri (deleted)
00975000      4K rwx--  /usr/lib/libgpg-error.so.0.3.0.#prelink#.nouCri (deleted)
00978000    504K r-x--  /usr/lib/libgcrypt.so.11.5.2.#prelink#.BArMt5 (deleted)
009f6000     12K rwx--  /usr/lib/libgcrypt.so.11.5.2.#prelink#.BArMt5 (deleted)
009fc000    108K r-x--  /lib/ld-2.5.so
00a17000      4K r-x--  /lib/ld-2.5.so
00a18000      4K rwx--  /lib/ld-2.5.so
00a55000     44K r-x--  /usr/lib/sasl2/libdigestmd5.so.2.0.22
00a60000      4K rwx--  /usr/lib/sasl2/libdigestmd5.so.2.0.22
00aef000     64K r-x--  /lib/libresolv-2.5.so
00aff000      4K r-x--  /lib/libresolv-2.5.so
00b00000      4K rwx--  /lib/libresolv-2.5.so
00b01000      8K rwx--    [ anon ]
00b2f000    656K r-x--  /usr/lib/libvirt.so.0.6.3
00bd3000     12K rwx--  /usr/lib/libvirt.so.0.6.3
00bd6000     60K rwx--    [ anon ]
00c32000    484K r-x--  /usr/lib/libgnutls.so.13.0.6.#prelink#.SZhLzv (deleted)
00cab000     24K rwx--  /usr/lib/libgnutls.so.13.0.6.#prelink#.SZhLzv (deleted)
00cf1000     36K r-x--  /lib/libcrypt-2.5.so
00cfa000      4K r-x--  /lib/libcrypt-2.5.so
00cfb000      4K rwx--  /lib/libcrypt-2.5.so
00cfc000    156K rwx--    [ anon ]
00def000   1368K r-x--  /lib/i686/nosegneg/libc-2.5.so
00f45000      8K r-x--  /lib/i686/nosegneg/libc-2.5.so
00f47000      4K rwx--  /lib/i686/nosegneg/libc-2.5.so
00f48000     12K rwx--    [ anon ]
00f5b000     16K r-x--  /usr/lib/sasl2/liblogin.so.2.0.22
00f5f000      4K rwx--  /usr/lib/sasl2/liblogin.so.2.0.22
00f60000   1192K r-x--  /lib/libcrypto.so.0.9.8e.#prelink#.e0gSuf (deleted)
0108a000     76K rwx--  /lib/libcrypto.so.0.9.8e.#prelink#.e0gSuf (deleted)
0109d000     16K rwx--    [ anon ]
08048000    360K r-x--  /usr/sbin/libvirtd
080a2000     20K rw---  /usr/sbin/libvirtd
09c66000    232K rw---    [ anon ]
b4d4a000      4K -----    [ anon ]
b4d4b000  10240K rw---    [ anon ]
b574b000      4K -----    [ anon ]
b574c000  10240K rw---    [ anon ]
b614c000      4K -----    [ anon ]
b614d000  10240K rw---    [ anon ]
b6b4d000      4K -----    [ anon ]
b6b4e000  10240K rw---    [ anon ]
b754e000      4K -----    [ anon ]
b754f000  10264K rw---    [ anon ]
b7f5a000      4K rw---    [ anon ]
bf855000     84K rw---    [ stack ]
 total    60388K

# pmap 3163
3163:   xenstored --pid-file /var/run/xenstore.pid
00139000     88K r-x--  /lib/i686/nosegneg/libpthread-2.5.so
0014f000      4K r-x--  /lib/i686/nosegneg/libpthread-2.5.so
00150000      4K rwx--  /lib/i686/nosegneg/libpthread-2.5.so
00151000      8K rwx--    [ anon ]
006d1000     80K r-x--  /usr/lib/libxenctrl.so.3.0.0.#prelink#.2P7fpe (deleted)
006e5000      4K rwx--  /usr/lib/libxenctrl.so.3.0.0.#prelink#.2P7fpe (deleted)
006e6000     88K rwx--    [ anon ]
00958000   1368K r-x--  /lib/i686/nosegneg/libc-2.5.so
00aae000      8K r-x--  /lib/i686/nosegneg/libc-2.5.so
00ab0000      4K rwx--  /lib/i686/nosegneg/libc-2.5.so
00ab1000     12K rwx--    [ anon ]
00ced000    108K r-x--  /lib/ld-2.5.so
00d08000      4K r-x--  /lib/ld-2.5.so
00d09000      4K rwx--  /lib/ld-2.5.so
00ef3000      4K r-x--    [ anon ]
08047000     72K r-x--  /usr/sbin/xenstored
08059000      4K rw---  /usr/sbin/xenstored
0805a000     12K rw---    [ anon ]
08358000    452K rw---    [ anon ]
b7ff5000      4K rw-s-  /proc/xen/privcmd
b7ff6000      4K rw-s-  /proc/xen/privcmd
b7ff7000      4K rw-s-  /proc/xen/privcmd
b7ff8000      8K rw---    [ anon ]
b7ffa000      4K rw-s-  /proc/xen/privcmd
b7ffb000      4K rw-s-  /proc/xen/privcmd
b7ffc000      4K rw-s-  /proc/xen/privcmd
b7ffd000      4K rw-s-  /proc/xen/privcmd
b7ffe000      4K rw-s-  /proc/xen/xsd_kva
b7fff000      4K rw---    [ anon ]
bfc4f000     84K rw---    [ stack ]
 total     2456K

# pmap 10504
10504:   xenconsoled
00110000     44K r-x--  /lib/libgcc_s-4.1.2-20080825.so.1
0011b000      4K rwx--  /lib/libgcc_s-4.1.2-20080825.so.1
00aee000    108K r-x--  /lib/ld-2.5.so
00b09000      4K r-x--  /lib/ld-2.5.so
00b0a000      4K rwx--  /lib/ld-2.5.so
00b0d000   1368K r-x--  /lib/i686/nosegneg/libc-2.5.so
00c63000      8K r-x--  /lib/i686/nosegneg/libc-2.5.so
00c65000      4K rwx--  /lib/i686/nosegneg/libc-2.5.so
00c66000     12K rwx--    [ anon ]
00c72000     88K r-x--  /lib/i686/nosegneg/libpthread-2.5.so
00c88000      4K r-x--  /lib/i686/nosegneg/libpthread-2.5.so
00c89000      4K rwx--  /lib/i686/nosegneg/libpthread-2.5.so
00c8a000      8K rwx--    [ anon ]
00cb9000     80K r-x--  /usr/lib/libxenctrl.so.3.0.0
00ccd000      4K rwx--  /usr/lib/libxenctrl.so.3.0.0
00cce000     88K rwx--    [ anon ]
00d3b000      4K r-x--    [ anon ]
00dc4000     16K r-x--  /usr/lib/libxenstore.so.3.0.0
00dc8000      4K rwx--  /usr/lib/libxenstore.so.3.0.0
00dc9000     12K rwx--    [ anon ]
08048000     20K r-x--  /usr/sbin/xenconsoled
0804d000      8K rw---  /usr/sbin/xenconsoled
09f51000    132K rw---    [ anon ]
b75bf000      4K rw-s-  /proc/xen/privcmd
b75c0000      4K rw-s-  /proc/xen/privcmd
b75c1000      4K -----    [ anon ]
b75c2000  10248K rw---    [ anon ]
b7fc4000      4K rw-s-  /proc/xen/privcmd
b7fc5000      4K rw-s-  /proc/xen/privcmd
b7fc6000      4K rw-s-  /proc/xen/privcmd
b7fc7000      4K rw-s-  /proc/xen/privcmd
b7fc8000      4K rw-s-  /proc/xen/privcmd
b7fc9000      4K rw---    [ anon ]
bf9bc000     84K rw---    [ stack ]
 total    12396K

top - 12:48:28 up 48 days, 15:57,  1 user,  load average: 0.00, 0.00, 0.00
Tasks:   3 total,   0 running,   2 sleeping,   1 stopped,   0 zombie
Cpu(s):  0.0%us,  0.0%sy,  0.0%ni, 99.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   1048576k total,   657124k used,   391452k free,   223956k buffers
Swap:   746936k total,        0k used,   746936k free,   248596k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                
 3163 root      15   0  2460 1196  600 S  0.0  0.1   0:08.53 xenstored                                              
 2646 root      18   0 60392 2892 2372 S  0.0  0.3   0:00.15 libvirtd                                               
10504 root      15   0 12400  740  528 T  0.0  0.1   0:00.33 xenconsoled 

# gdb -p 10504
GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-23.el5_5.2)
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "i386-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Attaching to process 10504
Reading symbols from /usr/sbin/xenconsoled...Reading symbols from /usr/lib/debug/usr/sbin/xenconsoled.debug...
warning: section .gnu.liblist not found in /usr/lib/debug/usr/sbin/xenconsoled.debug

warning: section .gnu.conflict not found in /usr/lib/debug/usr/sbin/xenconsoled.debug

warning: section .dynbss not found in /usr/lib/debug/usr/sbin/xenconsoled.debug
done.
done.

warning: .dynamic section for "/lib/libgcc_s.so.1" is not at the expected address

warning: difference appears to be caused by prelink, adjusting expectations
Reading symbols from /usr/lib/libxenctrl.so.3.0...Reading symbols from /usr/lib/debug/usr/lib/libxenctrl.so.3.0.0.debug...done.
done.
Loaded symbols for /usr/lib/libxenctrl.so.3.0
Reading symbols from /usr/lib/libxenstore.so.3.0...Reading symbols from /usr/lib/debug/usr/lib/libxenstore.so.3.0.0.debug...done.
done.
Loaded symbols for /usr/lib/libxenstore.so.3.0
Reading symbols from /lib/i686/nosegneg/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/i686/nosegneg/libc.so.6
Reading symbols from /lib/i686/nosegneg/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
[New Thread 0xb7fc1b90 (LWP 10505)]
Loaded symbols for /lib/i686/nosegneg/libpthread.so.0
Reading symbols from /lib/ld-linux.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/ld-linux.so.2
Reading symbols from /lib/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/libgcc_s.so.1

warning: section .gnu.liblist not found in /usr/lib/debug/usr/sbin/xenconsoled.debug

warning: section .gnu.conflict not found in /usr/lib/debug/usr/sbin/xenconsoled.debug

warning: section .dynbss not found in /usr/lib/debug/usr/sbin/xenconsoled.debug
0x00d3b402 in __kernel_vsyscall ()
(gdb) b exit
Breakpoint 1 at 0xb39016
(gdb) c
Continuing.

Breakpoint 1, 0x00b39016 in exit () from /lib/i686/nosegneg/libc.so.6
(gdb)

Comment 77 R P Herrold 2010-12-21 18:20:36 UTC
... a extract from a private email

I have written a custom script to probe and record the memort and related process table data for specified process names, at per minute intervals in a form susceptible to visualization by gnuplot or such

As to attaching stuff to the bugzilla, I had privacy concerns on that previously provided tarball -- thus, another reason to write a custom data gathering script so I can control what is present in the results, and so I might attach future raw detail load and memory use tarballs

[root@elided memleak]# tail -f load-20101221.csv
1292954881, xenstored, 4477, 4477, root, 16, 0, 9020, 1020,
        496, S, 0.0, 0.1, 13:08.05, xenstored,
1292954881, xenconsoled, 4486, 4486, root, 15, 0, 88320, 460,
        448, S, 0.0, 0.0, 0:00.51, xenconsoled,
1292954881, libvirtd, 26682, 26682, root, 16, 0, 956000000,
        237000000, 1928, S, 0.0, 23.2, 8:14.83, libvirtd,

I shall attach the script to the bugzilla momentarily.  It will take a few days to gather useful data

-- Russ herrold

Comment 78 R P Herrold 2010-12-21 18:22:44 UTC
Created attachment 470044 [details]
script: /root/bin/load-snap.sh

Comment 79 R P Herrold 2010-12-27 20:32:28 UTC
I had my snapshot code running and will attach a tarball with it in a moment

The persistent leak is quite pronounced

Comment 80 R P Herrold 2010-12-27 20:34:10 UTC
Created attachment 470880 [details]
tar ball of last couple of days showing steadily increasing mem usage the daemon process

Comment 81 Michal Novotny 2011-01-03 11:59:16 UTC
(In reply to comment #75)
> I rechecked the memory consumption again today (roughly 26 hours since last
> time) to compare with what Russ is seeing and it has not increased at all on
> any of the processes below on my dom0. And xenconsoled is still running strong.
> 
> # top -p3163 -p2646 -p10504
> 
> top - 22:41:12 up 43 days,  1:39,  1 user,  load average: 0.00, 0.00, 0.00
> Tasks:   3 total,   0 running,   3 sleeping,   0 stopped,   0 zombie
> Cpu(s):  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Mem:   1048576k total,   657436k used,   391140k free,   223764k buffers
> Swap:   746936k total,        0k used,   746936k free,   248592k cached
> 
>PID USER    PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
>3163 root   15   0  2460 1196  600 S  0.0  0.1   0:08.25 xenstored          
>2646 root   18   0 60392 2892 2372 S  0.0  0.3   0:00.15 libvirtd           
>10504 root  16   0 12400  704  500 S  0.0  0.1   0:00.30 xenconsoled

In comment 67 you wrote the values were:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
 3163 root      23   0  2460 1196  600 S  0.0  0.1   0:08.11 xenstored          
 2646 root      15   0 60392 2892 2384 S  0.0  0.3   0:00.15 libvirtd           
10504 root      16   0 12400  704  500 S  0.0  0.1   0:00.28 xenconsoled

So there's really no change in the memory consumption. Just the priority of xenstored changed for some reason (PR column in the comment 67 and comment 75. Also, xenstored shared memory size decreased by 12 but no other changes were there.

When comparing this to the information in comment 76 I can see the xenconsoled shared memory usage has increased by 28 to 528M but this shouldn't be the issue there.

The gdb running on xenconsoled is having many warning messages like:

warning: section .gnu.liblist not found in
/usr/lib/debug/usr/sbin/xenconsoled.debug

so I'm afraid the symbols may not be correct but you can try to use it. In the comment you had a breakpoint on exit function and then you continued as can be seen in this excerpt:

warning: section .dynbss not found in /usr/lib/debug/usr/sbin/xenconsoled.debug
0x00d3b402 in __kernel_vsyscall ()
(gdb) b exit
Breakpoint 1 at 0xb39016
(gdb) c
Continuing.

After continuing, did the xenconsoled die and did gdb detach from the process because of that? If not and the gdb is still running there, could you please try issuing "bt" and provide me the backtrace output to this bugzilla? It may help to see what path did it go before the exit call.

Thanks,
Michal

Comment 82 Paolo Bonzini 2011-01-03 12:20:12 UTC
Michal,

I suggest we continue with Russ's data since it confirms a leak in libvirtd.

Russ, could you tell us what your typical usage of it is?  In particular:

1) are you using virt-manager or virsh? (or something else?)

2) via which URL are you connecting?

3) which kind of command do you do the most?

CCing Laine since he worked on the others libvirtd leaks.

Comment 83 Michal Novotny 2011-01-03 12:23:15 UTC
(In reply to comment #80)
> Created attachment 470880 [details]
> tar ball of last couple of days showing steadily increasing mem usage the
> daemon process

Ok, according too the logs there's:

1293387001, xenconsoled, 4646, 4646, root, 15, 0, 88896, 472, 364, S, 0.0, 0.0, 0:00.77, xenconsoled
1293425941, xenconsoled, 4646, 4646, root, 15, 0, 88896, 476, 364, S, 0.0, 0.0, 0:00.77, xenconsoled
1293433621, xenconsoled, 4646, 4646, root, 15, 0, 88896, 464, 352, S, 0.0, 0.0, 0:00.77, xenconsoled,
1293468901, xenconsoled, 4646, 4646, root, 15, 0, 88896, 472, 364, S, 0.0, 0.0, 0:00.77, xenconsoled,

so it doesn't look like it's having the memory leak since the memory is being freed as can be seen on the third line (RES 476M dropped to RES 464M and SHR 364M dropped to SHR 352M there but VIRT is the same, i.e. 88896M).

I did check the xenstore daemon too:

1293387001, xenstored, 4638, 4638, root, 16, 0, 9388, 1016, 468, S, 0.0, 0.1, 12:20.98, xenstored,
1293425941, xenstored, 4638, 4638, root, 16, 0, 9388, 1016, 468, S, 0.0, 0.1, 12:26.63, xenstored,
1293433621, xenstored, 4638, 4638, root, 15, 0, 9388, 1016, 468, S, 0.0, 0.1, 12:27.83, xenstored
1293468901, xenstored, 4638, 4638, root, 17, 0, 9388, 1016, 468, S, 0.0, 0.1, 12:33.61, xenstored,

As you can see, the priority changed from 16 to 15 but the virtual and shared memory are the same in all of the cases.

For the libvirt the results are as follows:

1293387001, libvirtd, 29280, 29280, root, 15, 0, 2066000000, 169000000, 1760, S, 0.0, 16.5, 16:23.90, libvirtd
1293425941, libvirtd, 29280, 29280, root, 15, 0, 2119000000, 222000000, 1760, S, 0.0, 21.7, 16:51.12, libvirtd,
1293433621, libvirtd, 29280, 29280, root, 16, 0, 2129000000, 179000000, 1752, S, 0.0, 17.5, 16:56.05, libvirtd,
1293468901, libvirtd, 7246, 7246, root, 15, 0, 119000000, 4752, 2788, S, 0.0, 0.5, 0:00.79, libvirtd,

so it seems there's a memory leak in the libvirt (2066000000 is VIRT, 169000000 is RES and 1760 is SHR).

Russ, based on that it seems that bug you're hitting is rather in libvirt than in xen itself. Just to be sure, what command are you using? Virsh or xm?

Thanks,
Michal

Comment 84 Michal Novotny 2011-01-03 12:27:28 UTC
(In reply to comment #82)
> Michal,
> 
> I suggest we continue with Russ's data since it confirms a leak in libvirtd.
> 
> Russ, could you tell us what your typical usage of it is?  In particular:
> 
> 1) are you using virt-manager or virsh? (or something else?)
> 
> 2) via which URL are you connecting?
> 
> 3) which kind of command do you do the most?
> 
> CCing Laine since he worked on the others libvirtd leaks.

Good Paolo. It may be in libvirt and that's why I was asking Russ whether he's using virsh (libvirt) or xm. Since we can't reproduce it locally and although I did try to run my machine for some longer time some time ago, I was unable to reproduce it because of the power outage in the office (my machine was off when I went to the work that day but I don't know when exactly it happened) so it's hard to reproduce in our environment. Therefore we can't confirm locally whether it's issue in libvirt or xen so we need those information too.

I think that it may be 2 different bugs that Alex and Russ are hitting, therefore I need info from both of them.

Michal

Comment 85 Paolo Bonzini 2011-01-03 13:37:23 UTC
In comment #46 Alex states that he fully updated dom0, so it may be that he didn't have the fix before.  Russ, just to be sure can you do an "ls -l /proc/<pid-of-libvirtd>"?

Comment 87 Alexander Lindqvist 2011-01-03 20:13:25 UTC
Xenconsoled was running when I logged in to Dom0 and the memory consumption was exactly the same as before. Upon running xm console domU it stopped accepting input. Back in Dom0 Xenconsoled had silently died and was not running anymore. Jumping into gdb revealed the input I inserted to this bugzilla earlier and was back at the gdb prompt. So running bt at that stage is not an option I guess?

I have since then restarted Xenconsoled.

Comment 88 R P Herrold 2011-01-04 01:40:11 UTC
Responding to comment 84, I feel the issue is in the dom0 libvirtd s

We connect to the libvirtds (there are several hardware boxes running in a prodcution farm cluster of dom0) usually under the control across the internal control LAN, either across a virsh mediated network connection or by sshing to a given do given operations.  We rarely use the xm series commands

I do not think my issue is in the xenconsold, and frankly think that a dead xenconsoled or problems with the xenstore is a symptom of a dead libvirtd ---

I said early only that were were only seeing this problem expressed on 32 bit hardware - this is not the case as 64 bit dom0 also express the issue with a runaway memory libvirtd -- I had the issue over the weekend on one box

trace traball of mem use over time in a moment

Comment 89 R P Herrold 2011-01-04 01:41:56 UTC
Created attachment 471566 [details]
unit 5 eating all ram in the libbvirtd -- drop is after a manual restart

Comment 90 Paolo Bonzini 2011-01-04 08:53:29 UTC
> frankly think that a dead
> xenconsoled or problems with the xenstore is a symptom of a dead libvirtd

Not really, libvirt is a client of all the xen packages.  The dead xen daemons are (as already guessed) most likely a byproduct of the OOM killer.

We really need to see where the leaks are.  Any chance you can run libvirtd through valgrind, or mtrace?  We can provide debug packages for the latter.

Comment 92 Michal Novotny 2011-01-04 10:42:26 UTC
(In reply to comment #90)
> > frankly think that a dead
> > xenconsoled or problems with the xenstore is a symptom of a dead libvirtd
> 
> Not really, libvirt is a client of all the xen packages.  The dead xen daemons
> are (as already guessed) most likely a byproduct of the OOM killer.
>

Agreed Paolo. If there's some OOM killing in progress, the xen daemons could be killed in this matter so the first step we should do about this is to reassign to libvirt and have libvirt guys have a look at it (already done).

> 
> We really need to see where the leaks are.  Any chance you can run libvirtd
> through valgrind, or mtrace?  We can provide debug packages for the latter.

That's right. Since it's almost irreproducible in our environment, we have to ask Russ for running it under those tools himself. I don't know about mtrace but for discovering memory leaks the valgrind is good I think. I recommend running libvirtd through valgrind for about 10 days at least and then terminating using SIGINT to get the output. 

Running libvirt using some syntax like:

valgrind --show-reachable=yes --leak-check=full libvirtd

could be good AFAIK.

Michal

Comment 93 Shad L. Lords 2011-01-04 14:52:42 UTC
(In reply to comment #88)
> Responding to comment 84, I feel the issue is in the dom0 libvirtd s

Your issue might be related to a memory leak in libvirtd but many of us aren't experiencing the huge memory footprint like you are.  I believe the memory issue and the xenconsoled dying are two different issues.  I think it would be best to take the discussion of the memory issue to a different bug and if they end up being related the bugs can always be marked as dup.

> I do not think my issue is in the xenconsold, and frankly think that a dead
> xenconsoled or problems with the xenstore is a symptom of a dead libvirtd ---

What evidence do you have to support this?
 
> I said early only that were were only seeing this problem expressed on 32 bit
> hardware - this is not the case as 64 bit dom0 also express the issue with a
> runaway memory libvirtd -- I had the issue over the weekend on one box

As a different issue it very possibly could occur on both 32 and 64 bit systems.

Comment 94 R P Herrold 2011-01-04 15:21:04 UTC
as to comment 92 the component this assigned to as I read the ticket is already libvirt   If you are saying there is a sub-team that needs to be focussed on this but within Red Hat's side that watches libvitrd specific matters that seems useful


As to running  production units under valgrind, this will impose huge load, and I will need to take care to isolate customer units off such a test unit.  What that would leave are less active developmental and largely idle standby units behind, and may or may not show the problem

As to comment 93, 2G is not wanted for the libvirtd -- that just happens to be a use limit based on physical and swap ram allocated to the dom0   Setting the boot time param for max mem for the dom0 should permit those running smaller boxes to see the issue

As to me ruling out the xenconsoled and xenstored, I was describing our use pattern, where we are not using such as management tools -- ymmv

similarly as I reread my comments, they could have been taken as implying 32 bit only, which as I systematically kept records of failures,  was not the case and I was extending my report with later data

Comment 95 Laine Stump 2011-01-04 19:58:54 UTC
According to the comments, the machine in question is running libvirt-0.6.3-33. There is at least one known xen-related libvirt memory leak that was fixed in 0.6.3-37 (see Bug 590073), and likely many others fixed in the rebase to libvirt-0.8.2 (I know there have been a lot of leaks fixed in upstream libvirt lately, although many of those were in qemu code, so wouldn't be applicable here).

If it's possible to run libvirtd under valgrind, as Paolo suggests, the output could possibly confirm whether or not the leak you're seeing is the same as the leak in Bug 590073 (it looks like this leak could be triggered by constantly opening and closing connections to Xen from libvirtd, possibly as part of some sort of stats-gathering script?).

If running libvirtd under valgrind is problematic, I would suggest upgrading libvirt to either 0.6.3-37, or to 0.8.2-x if possible. After all, there are known issues with the version running now that potentially cause exactly the problem you're seeing.

Comment 96 Paolo Bonzini 2011-01-05 08:13:03 UTC
Laine, note that 0.6.3-33.el5_5.3 is the z-Stream package and has the fix for libvirt (bug 619711 comment 5).

The idea that this could be caused by some kind of stats script is a good one---Russ, do you have anything that we could use to reproduce?  Alternatively, libvirtd has relatively few things to do when running under Xen (it acts mostly as an RPC broker), so running it under valgrind should not impose a huge load.  Doing so for 3-4 hours should be enough.

(btw you _are_ using xenstored, since all PV drivers and libvirtd are relying on it.  libvirtd is at the top of the "food chain", if it dies nobody will notice except you.  The problems arise when/if the OOM killer does something strange).

Comment 97 R P Herrold 2011-01-06 03:35:57 UTC
I need to strip down some code to get a clean minimum reproducer.  I have cleaned off a spare dom0 that is safe to test in as well

Comment 99 Pavel Kankovsky 2011-02-24 19:22:32 UTC
I have experienced many xenconsoled crashes similar to those described by Pasi Karkkainen. I was monitoring the processes with gdb and strace for several months to catch that elusive bug.

Gdb was not much useful. Everything I have found out was that xenconsoled always returned from handle_io() & main() and terminated.

Strace provided the missing piece of the puzzle. xenconsoled returned from handle_io() because it called select() with a bad timeout parameter and got EINVAL. Here is one of the traces I have collected (others are similar):

[pid  8118] gettimeofday({1298035287, 656354}, NULL) = 0
[pid  8118] select(9, [6 7 8], [], NULL, NULL) = 1 (in [6])
[pid  8118] read(6, "\33\0\0\0", 4)     = 4
[pid  8118] ioctl(6, EVIOCGKEYCODE, 0xbff3c070) = 0
[pid  8118] gettimeofday({1298035287, 803152}, NULL) = 0
[pid  8118] select(9, [7 8], [7], NULL, {4293022154, 4294093296}) = -1 EINVAL (Invalid argument)
[pid  8118] write(2, "Failure in select: 22 (Invalid a"..., 41) = 41
[pid  8118] exit_group(0)               = ?

(The error message ("Failure in select...") would have been very helpful--if it was not sent to /dev/null.)

This piece of code in xen-console-ratelimit.patch appears to be the place where the failure occurs:

              if (next_timeout) {
                      long long duration = (next_timeout - now);
                      tv.tv_sec = duration / 1000;
                      tv.tv_usec = (duration - (tv.tv_sec * 1000)) * 1000;
              }

              ret = select(max_fd + 1, &readfds, &writefds, 0, next_timeout ? &tv : NULL);

Quite obviously, the value of duration is not always calculated correctly and the result can be negative, leading to bad value of tv and select() returning EINVAL.

The upstream has already "fixed" it:
<http://xenbits.xen.org/xen-3.4-testing.hg/diff/955ee4fa1345/tools/console/daemon/io.c#l1.124>

Comment 100 Paolo Bonzini 2011-02-25 11:27:18 UTC
Thanks Pavel!

There are several overflow problems in that patch, mostly arising from timeval computations.  time_t is a "signed long", which might explain why the problem wasn't seen on 64-bit.

Comment 102 Michal Novotny 2011-02-25 15:01:17 UTC
Paolo, yI've been looking to this and we're having:

                /* If any domain has been rate limited, we need to work
                   out what timeout to supply to select */
                if (next_timeout) {
                        long long duration = (next_timeout - now);
                        tv.tv_sec = duration / 1000;
                        tv.tv_usec = (duration - (tv.tv_sec * 1000)) * 1000;
                }

		ret = select(max_fd + 1, &readfds, &writefds, 0,
			     next_timeout ? &timeout : NULL);


Xen-3.4-testing tree is having this instead:

		/* If any domain has been rate limited, we need to work
		   out what timeout to supply to select */
		if (next_timeout) {
			long long duration = (next_timeout - now);
			if (duration <= 0) /* sanity check */
				duration = 1;
			timeout.tv_sec = duration / 1000;
			timeout.tv_usec = ((duration - (timeout.tv_sec * 1000))
					   * 1000);
		}

		ret = select(max_fd + 1, &readfds, &writefds, 0,
			     next_timeout ? &timeout : NULL);

I'm just thinking... What if we change it to unsigned long long instead of long ? As I was reading Pavel's comment it seems that the issue is caused by invalid value coming to the select() function so I guess if we make it always positive and implement the sanity check, we should be no longer running to this issue AFAIK.

Also, I've been looking to the distcvs and the code is the same like I wrote above and actually it makes sense to me that the failure is caused by the negative value coming in timeval elements to the select() function so we should implement the sanity check and that should fix the issue.

Michal

Comment 104 Pavel Kankovsky 2011-02-25 15:34:07 UTC
Do not change type of any of variables to unsigned l.l. (in fact, all relevant variables are already signed l.l.). It would only turn into absurdly large positive values and, in the case of duration, prevent the sanity check from sanitizing them.

It could probably help if at least one of multipliers in all expressions like (tv.tv_sec * 1000) was cast to long long (or unsigned l.l., it does not matter much) before the multiplication.

Comment 105 Michal Novotny 2011-02-28 11:08:13 UTC
(In reply to comment #104)
> Do not change type of any of variables to unsigned l.l. (in fact, all relevant
> variables are already signed l.l.). It would only turn into absurdly large
> positive values and, in the case of duration, prevent the sanity check from
> sanitizing them.
> 
> It could probably help if at least one of multipliers in all expressions like
> (tv.tv_sec * 1000) was cast to long long (or unsigned l.l., it does not matter
> much) before the multiplication.

Do you think defining it as:

tv.tv_usec = (duration - (long long)(tv.tv_sec * 1000)) * (long long)1000;

or just one of those? I don't know whether the sanity check isn't better since the failure is caused by invalid timeval value coming to the tv.tv_usec so if we enforce tv_usec is always positive or zero it could work AFAIK.

The definition of timeval (according to /usr/include/bits/time.h) is:

struct timeval
  {
    __time_t tv_sec;            /* Seconds.  */
    __suseconds_t tv_usec;      /* Microseconds.  */
  };

According to comment #99 the timeval value coming to select() is having pretty big values:

[pid  8118] select(9, [7 8], [7], NULL, {4293022154, 4294093296}) = -1 EINVAL
(Invalid argument)

But on i386 systems the minimum value of (signed) long (and also time_t as Paolo mentioned) is -2147483648 (-0x80000000) which means that there is overflow which results into -EINVAL.

I was looking to the code for xen-3.4-testing and honestly I fail to understand why following likes make no such issue there:

timeout.tv_usec = ((duration - (timeout.tv_sec * 1000)) * 1000);

If the duration is negative or zero, the value of duration is set to 1 by the sanity check and let's say the timeout value is tv.sec = 0 and tv.usec = 186. Now if we calculate it using the formula above, we're getting:

timeout.tv_usec = ((1 - (0 * 1000)) * 1000);

which is 1000. That's fine for now but if the tv.sec is e.g. 2 and tv.usec is 289 then it's:

timeout.tv_usec = ((1 - (2 * 1000)) * 1000);

which is -1999000 (the negative value) coming to tv_usec so this should cause issues in some cases IMHO.

Also, when you write a simple test app like:

#include <stdio.h>
#include <sys/time.h>

int main()
{
   struct timeval timeout;
   long long next_timeout = (1298035287 * 1000) + (804154 / 100);
   long long now = (1298035287 * 1000) + (803152 / 1000);

   long long duration = next_timeout - now;

   timeout.tv_sec = duration / 1000;
   timeout.tv_usec = ((duration - (timeout.tv_sec * 1000)) * 1000);
   printf("tv_sec = %ld, tv_usec = %ld\n", timeout.tv_sec, timeout.tv_usec);
   return 0;
}

then you'll be getting overflow warnings on the compilation so I don't know if it won't be better to implement it like:

   unsigned long long next_timeout = (1298035287 * (unsigned long long)1000) + (804154 / 100);
   unsigned long long now = (1298035287 * (unsigned long long)1000) + (803152 / 1000);

since this won't be showing any warnings (the values for next_timeout and now are just test values for the test app but modification to the real tools/console/daemon/io.c code would be done to the very same variable names).

Since there's an overflow that's why the sanity check is implemented for xen-3.4-testing version IMHO.

Michal

Comment 106 Paolo Bonzini 2011-02-28 14:13:12 UTC
Michal,

just go for the minimal change from RHEL5's version to upstream.

Comment 107 Paolo Bonzini 2011-02-28 14:18:06 UTC
> According to comment #99 the timeval value coming to select() is having pretty
> big values:
> 
> [pid  8118] select(9, [7 8], [7], NULL, {4293022154, 4294093296}) = -1 EINVAL
> (Invalid argument)
> 
> But on i386 systems the minimum value of (signed) long (and also time_t as
> Paolo mentioned) is -2147483648 (-0x80000000) which means that there is
> overflow which results into -EINVAL.

Those are not really overflows.  They're conversions to unsigned of small(ish) negative values, in the example {-1945142,-874000}.

> If the duration is negative or zero, the value of duration is set to 1 by the
> sanity check and let's say the timeout value is tv.sec = 0 and tv.usec = 186.
                                                  ^^^^^^         ^^^^^^^

(You mean of course timeout.tv_sec and timeout.tv_usec).

tv is computed from the duration.  If the sanity check hits, timeout.tv_sec will always be zero and timeout.tv_usec will always be 1000.

Comment 108 Michal Novotny 2011-02-28 15:56:47 UTC
(In reply to comment #107)
> > According to comment #99 the timeval value coming to select() is having pretty
> > big values:
> > 
> > [pid  8118] select(9, [7 8], [7], NULL, {4293022154, 4294093296}) = -1 EINVAL
> > (Invalid argument)
> > 
> > But on i386 systems the minimum value of (signed) long (and also time_t as
> > Paolo mentioned) is -2147483648 (-0x80000000) which means that there is
> > overflow which results into -EINVAL.
> 
> Those are not really overflows.  They're conversions to unsigned of small(ish)
> negative values, in the example {-1945142,-874000}.


Right however according to my investigation this is what makes select() fail with -EINVAL.


> 
> > If the duration is negative or zero, the value of duration is set to 1 by the
> > sanity check and let's say the timeout value is tv.sec = 0 and tv.usec = 186.
>                                                   ^^^^^^         ^^^^^^^
> 
> (You mean of course timeout.tv_sec and timeout.tv_usec).
> 
> tv is computed from the duration.  If the sanity check hits, timeout.tv_sec
> will always be zero and timeout.tv_usec will always be 1000.


Yeah, I didn't mention it but it's timeout.tv_secand timeout.tv_usec. I know it's computed from the duration and if we would like to implement minimal changes from upstream I suggest adding the sanity check is the code we have to add.

Michal

Comment 112 Pavel Kankovsky 2011-03-02 11:00:55 UTC
I have spent some time eyeballing the code and I think I understand what is wrong with it now. Let us look at the loop in handle_io(), there are several interesting steps there:


1. the current time is obtained from gettimeofday() and converted to the number of milliseconds since the beginning of the epoch:

now = (tv.tv_sec * 1000) + (tv.tv_usec / 1000);

tv.tv_sec * 1000 overflows if time_t is 32 bits long. What's worse, it wraps around from large positive values to large negative values every 2^32 ms == approx. every 50 days.

(A similar assignment to next_timeslice appears before the loop and it can overflow too.)


2. next_timeslice is used to unblock ratelimited domains and reset event counters every 200 ms

if (now >= next_timeslice) {
  next_timeslice = now + RATE_LIMIT_TIMESLICE;
  ...
}

This means next_timeslice is always > now. When now has wrapped around, next_timeslice stays large for a long time and event counters keep growing.


3. next_timeout is set to next_timeslice when there are any ratelimited domain (that has exceeded its event allowance):

if (d->event_count >= RATE_LIMIT_ALLOWANCE) {
  if (next_timeout == 0 ||
      next_timeslice < next_timeout)
    next_timeout = next_timeslice;
  } ...
  ... 
}

When now has wrapped around, event counters keep growing and an active domain will exceed its allowance quickly. next_timeout will be set to the same large positive value as next_timeslice.

(There is also a rare case when the value of next_timeslice is exactly zero. When that happens, select() is called without timeout and xenconsole stalls, leaving ratelimited domains blocked for an indefinite period of time.)


4. next_timeout (if it has been set in the previous step) is used to compute timeout for select():

if (next_timeout) {
  long long duration = (next_timeout - now);
  tv.tv_sec = duration / 1000;
  tv.tv_usec = (duration - (tv.tv_sec * 1000)) * 1000;
} 

This is where all hell breaks loose when now has wrapped around. The value of now is large and negative while the value of next_timeout (== next_timeslice) is large and positive. The value of duration (their difference) is VERY LARGE and *POSITIVE* and COMPLETELY BOGUS! The assignment to tv_sec is likely to overflow but it does not matter much because we are already in GIGO mode.


The sanity check from upstream CANNOT fix the problem (contrary to what I thought) because it sanitizes duration only when it is negative or zero.

I think it is necessary to avoid overflows (and wrap-arounds) when timevals are converted to milliseconds, e.g. (note "LL" after 1000):

  now = (tv.tv_sec * 1000LL) + (tv.tv_usec / 1000);

It might be a good idea to do a more comprehensive sanity check when the timeout is computed, namely not to allow its value to exceed RATE_LIMIT_TIMESLICE but it would not really fix the problem. If implemented alone it would make xenconsoled stall instead of crashing. It might help if the condition was transient (e.g. due to the system clock doing funny things).


The code in upstream (3.4) is more complex because they have got individual "next_timeslice" for every domain (they call it next_period) but I think it can fail the same way as described above too.

Comment 113 Michal Novotny 2011-03-02 11:43:25 UTC
Thanks a lot for your investigation Pavel. Basically the main issue is the overflow there then.

I don't know whether just little change to:

  now = (tv.tv_sec * 1000LL) + (tv.tv_usec / 1000);

can change that much and unfortunately the issue is very hard reproducible so we may try but it still appears to be a really long run :(

Also, can you confirm that it's still an issue on upstream version?

Michal

Comment 114 Paolo Bonzini 2011-03-02 12:09:41 UTC
Indeed, these are the "other overflow problems" I was referring to in comment 100.  I didn't mention them further because I was not really sure about the analysis of the timing.

In particular, I thought the wraparound from positive to negative would happen every 25 days (2^31 ms) but the reporters mentioned that the bug was showing every 30 to 45 days on average.

However, what really happens is that the initial value of tv.tv_sec*1000 (truncated to 32-bit) is in practice random, so that the bug may happen after any number of days from 0 to 49.7.  In particular, the next time it should happen is around March 28, when the number of milliseconds from the epoch will be 0x12F00000000.

I agree that fixing the multiplication to use long longs will avoid the overflow (which is still present upstream---and given they often use 32-bit dom0 on 64-bit hypervisor, I'm surprised they haven't encountered the bug!) and fix the bug correctly.

The change was done in c/s 19814 upstream.  Michal, can you backport that one (and we might desire having the sanity check anyway)?

Comment 115 Pavel Kankovsky 2011-03-02 13:33:46 UTC
(In reply to comment #113)

> can change that much and unfortunately the issue is very hard reproducible
> so we may try but it still appears to be a really long run :(

You can change the system clock to make it happen quickly.

> Also, can you confirm that it's still an issue on upstream version?

I was wrong, the overflow has already been fixed in the upstream, namely in xen-unstable changeset 19814 mentioned by Paolo (incorporated as 19660 to xen-3.4-testing). [I should learn to always double-check whether I am looking at the latest version. I am sorry.]

Comment 116 Michal Novotny 2011-03-02 14:59:49 UTC
Basically this could be a simple backport 19814 just with renamed variable and constant names since we have different variable names.

But to make sure it's working fine and improving the situation could you please try to download and install patched RPMs from: http://people.redhat.com/minovotn/xen/bz593339 and provide us the test results?

Thanks,
Michal

Comment 117 Paolo Bonzini 2011-03-02 15:17:11 UTC
Michal, you can test it yourself by flipping the date of your system between say Mar 20 and Mar 30 2011.

Comment 118 Michal Novotny 2011-03-02 15:27:39 UTC
(In reply to comment #117)
> Michal, you can test it yourself by flipping the date of your system between
> say Mar 20 and Mar 30 2011.

Oh, ok. Then I need to reserve i386 dom0 since I don't have it because of insufficient disk space. Also I was having some disk issues today so I rather won't like to mess with my current setup.

Michal

Comment 119 Alexander Lindqvist 2011-03-02 19:29:24 UTC
Michal, I can test this for you and confirm whether it works if you like.

Comment 120 Michal Novotny 2011-03-03 15:53:09 UTC
(In reply to comment #119)
> Michal, I can test this for you and confirm whether it works if you like.

That would be great Alex.

Thanks a lot!
Michal

Comment 121 Alexander Lindqvist 2011-03-03 19:58:37 UTC
Ok. Will test this weekend on one of our dom0's in production who has this problem. It should be enough installing these two rpm's and flipping the date?

xen-3.0.3-124.el5virttest01.g75f6462.i386.rpm
xen-libs-3.0.3-124.el5virttest01.g75f6462.i386.rpm

I can confirm flipping the date hangs xenconsoled and I had to kill it. It doesn't dissappear as it usally does but the effect was the same.

Comment 122 Michal Novotny 2011-03-04 10:02:23 UTC
(In reply to comment #121)
> Ok. Will test this weekend on one of our dom0's in production who has this
> problem. It should be enough installing these two rpm's and flipping the date?
> 
> xen-3.0.3-124.el5virttest01.g75f6462.i386.rpm
> xen-libs-3.0.3-124.el5virttest01.g75f6462.i386.rpm
> 

Yes, this should be enough. Please provide us the test results when done.

Thanks a lot!
Michal

Comment 123 Alexander Lindqvist 2011-03-07 21:45:59 UTC
I now had a chance to install xen-3.0.3-124.el5virttest01.g75f6462.i386.rpm
and xen-libs-3.0.3-124.el5virttest01.g75f6462.i386.rpm. Restarted the server and startet up one domu. After that I flipped the date 2 months forward and tried to "xm console" into domu again. It works with this patch. Before it hanged the moment U flipped the date. Nice work!

Note: This was tested on a fully updated 5.5 dom0 (not tested on 5.6)

/ Alex

Comment 124 Paolo Bonzini 2011-03-07 22:19:54 UTC
> Note: This was tested on a fully updated 5.5 dom0 (not tested on 5.6)

Good enough, thanks!

Comment 125 Michal Novotny 2011-03-08 14:34:50 UTC
Created attachment 482909 [details]
xenconsoled: Fix rate-limit calculation overflow leading to console freezeTha'

Based on the testing this is the patch to fix the calculations of next
timeout value in xen console daemon.

It's been tested by Alexander Lindqvist <alexander> on i386 RHEL-5.5 dom0 and according to comment #123 everything is working fine and this patch fixed the issue.

Michal

Comment 126 Michal Novotny 2011-03-08 14:37:45 UTC
(In reply to comment #123)
> I now had a chance to install xen-3.0.3-124.el5virttest01.g75f6462.i386.rpm
> and xen-libs-3.0.3-124.el5virttest01.g75f6462.i386.rpm. Restarted the server
> and startet up one domu. After that I flipped the date 2 months forward and
> tried to "xm console" into domu again. It works with this patch. Before it
> hanged the moment U flipped the date. Nice work!
> 
> Note: This was tested on a fully updated 5.5 dom0 (not tested on 5.6)
> 
> / Alex

Thanks a lot for your testing, Alex. The patch has been sent to the list so it will be a part of the new package.

Michal

Comment 128 Miroslav Rezanina 2011-03-17 09:54:50 UTC
Fix built into xen-3.0.3-126.el5

Comment 132 Yufang Zhang 2011-03-30 07:38:35 UTC
QA verified this bug with xen-3.0.3-126.el5:

1. Reproduce this bug with xen-3.0.3-125.el5 on a RHEL5.6 i386 host. 
   1) create a PV guest
   2) flip the date of domain0 for two months
   3) try to get console of the PV guest

At step 3), we cannot get console of the PV guest and xenconsoled disappeares.

2. Verify this bug with xen-3.0.3-126.el5 with the same host.
Replaying the same steps, we can get the console of the PV guest, and xenconsoled still works well.

So change this bug to VERIFIED.

Comment 134 errata-xmlrpc 2011-07-21 09:16:21 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-1070.html

Comment 135 errata-xmlrpc 2011-07-21 11:57:44 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-1070.html