Bug 906610

Summary: Cluster lockups caused by xmlNodeDump() being too slow
Product: Red Hat Enterprise Linux 6 Reporter: Andrew Beekhof <abeekhof>
Component: libxml2Assignee: Daniel Veillard <veillard>
Status: NEW --- QA Contact: qe-baseos-tools
Severity: high Docs Contact:
Priority: unspecified    
Version: 6.4CC: fdinitto, ohudlick, riel
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Attachments:
Description Flags
zipped trace of the test run none

Description Andrew Beekhof 2013-01-31 21:24:40 EST
Description of problem:

Cluster nodes (physical hardware) are periodically getting backed up causing the cluster to generally fall apart.

The cause for this has been traced to xmlNodeDump() which (apparently at random) decides to be _really_ slow (4s or more for a tree of 20kb) at converting xmlNode's to text.

For comparison, bzlib compresses the 20k result in about 20ms.

The amount of time taken appears quite variable, logs have show that running xmlNodeDump() twice on the exact same input (the same xmlNode pointer) can either return instantaneously or quite a while later. 

While investigating the issue, we turned up the following thread which appears to be essentially the same problem but much easier to reproduce:
   http://www.mail-archive.com/xml@gnome.org/msg06481.html

There was apparently no conclusion, but running his test case on modern hardware still takes 12-13s for a 9Mb xml tree.  Our custom version[1] of xmlNodeDump() does it in ~1s.

[1] We wrote a custom xml->text function as a work-around but this is not an ideal long-term solution.

Version-Release number of selected component (if applicable):

libxml2-2.7.6-4.el6_2.4 and libxml2-2.9.0-3.el7

How reproducible:

Every time

Steps to Reproduce:
1. Build the test indicated at: 
     http://www.mail-archive.com/xml@gnome.org/msg06484.html

2. Create the input using:
     http://www.mail-archive.com/xml@gnome.org/msg06485.html
   (Fix the script by removing the ';' characters on lines 5-8)

3. Run the program and observe how long it takes
  
Actual results:

Time taken is in the order of 10s

Expected results:

Time taken is in the order of 1s

Additional info:
Comment 1 Andrew Beekhof 2013-01-31 21:25:59 EST
CC'ing David who has also been working on this.
Comment 3 Daniel Veillard 2013-02-05 23:17:18 EST
Hum, on Fedora 17 with git head of libxml2, running

while true ; do time ./tst ; done

with the data generated as suggested i get

real	0m0.550s
user	0m0.491s
sys	0m0.051s

real	0m0.552s
user	0m0.500s
sys	0m0.047s

real	0m0.555s
user	0m0.497s
sys	0m0.050s

real	0m0.556s
user	0m0.494s
sys	0m0.057s

  i.e. fairly consistent output ...
thinkpad:~/XML -> ls -l journal.xml 
-rw-rw-r--. 1 veillard veillard 9390736 Feb  6 11:48 journal.xml
thinkpad:~/XML ->

on an older machine running 6.3 and a local build of libxml2-2.7.6-11.el6.x86_64
I get :

[root@server tmp]# time ./tst

real	0m8.758s
user	0m5.693s
sys	0m3.045s
[root@server tmp]# time ./tst

real	0m8.780s
user	0m5.697s
sys	0m3.063s
[root@server tmp]# time ./tst

real	0m9.028s
user	0m5.592s
sys	0m3.413s
[root@server tmp]# time ./tst

real	0m9.007s
user	0m5.682s
sys	0m3.303s
[root@server tmp]# time ./tst

real	0m8.793s
user	0m5.580s
sys	0m3.194s
[root@server tmp]# 

   that doesn't seems to deviate too much but is indeed quite slow ...
my bet would be that we are realloc() 'ating all the time on growing the
in-memory output buffer.

  However I think the problem lies in the lack of reproduceability, i.e.
that sometime serializing a 20k buffer should be instantaneous and instead
takes 4s. I would suspect that at some point memory becomes short on one
of the realloc() leading to some need to free memory somehwere, and
all bets are off. Can you confirm your application program is multithreaded
which i guess won't help in this case ? 

  I can try to chase the buffer memory allocation to see if the problem
is not exacerbated by the fact we are constantly reallocating. if I strace
tst on the RHEL box i see a flow of

mmap(NULL, 9367552, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fc3cf45d000
munmap(0x7fc3cfd4c000, 9367552)         = 0
mmap(NULL, 9371648, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fc3d0639000
munmap(0x7fc3cf45d000, 9367552)         = 0
mmap(NULL, 9375744, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fc3cfd48000
munmap(0x7fc3d0639000, 9371648)         = 0
mmap(NULL, 9379840, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fc3cf456000
munmap(0x7fc3cfd48000, 9375744)         = 0
mmap(NULL, 9383936, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fc3d0636000
munmap(0x7fc3cf456000, 9379840)         = 0
mmap(NULL, 9388032, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fc3cfd42000
munmap(0x7fc3d0636000, 9383936)         = 0
mmap(NULL, 9392128, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fc3cf44d000
munmap(0x7fc3cfd42000, 9388032)         = 0

while on the F17 box I see

read(3, "", 5488)                       = 0
brk(0)                                  = 0x75b1000
brk(0x75d2000)                          = 0x75d2000
close(3)                                = 0
-> end of the read of the input
mmap(NULL, 135168, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f008ef58000
mremap(0x7f008ef58000, 135168, 266240, MREMAP_MAYMOVE) = 0x7f008ef09000
mremap(0x7f008ef09000, 266240, 528384, MREMAP_MAYMOVE) = 0x7f008ee88000
mremap(0x7f008ee88000, 528384, 1052672, MREMAP_MAYMOVE) = 0x7f008ed87000
mremap(0x7f008ed87000, 1052672, 2101248, MREMAP_MAYMOVE) = 0x7f008eb86000
mremap(0x7f008eb86000, 2101248, 4198400, MREMAP_MAYMOVE) = 0x7f008e785000
mremap(0x7f008e785000, 4198400, 8392704, MREMAP_MAYMOVE) = 0x7f008df84000
mremap(0x7f008df84000, 8392704, 16781312, MREMAP_MAYMOVE) = 0x7f008cf83000
exit_group(0)                           = ?
+++ exited with 0 +++
thinkpad:~/XML -> 

so it seems we only remap 7 times by doing a doubled allocation stategy.
If i just change the buffer allocation stategy by adding

xmlBufferSetAllocationScheme(buf, XML_BUFFER_ALLOC_DOUBLEIT)

after the buf = xmlBufferCreate();

then I'm back to something normal:

[root@server tmp]# time ./tst

real	0m0.459s
user	0m0.370s
sys	0m0.088s
[root@server tmp]# time ./tst

real	0m0.463s
user	0m0.375s
sys	0m0.086s
[root@server tmp]# time ./tst

real	0m0.463s
user	0m0.382s
sys	0m0.080s
[root@server tmp]# time ./tst

real	0m0.459s
user	0m0.378s
sys	0m0.080s
[root@server tmp]# 

  so while there is something a bit wrong in libxml2 default allocation,
the test case did allocate the buffer itself and libxml2 has no hint what
it might be used to.

  The immediate solution on RHEL-6 might simply be to make sure you set
the allocation buffer stategy before calling xmlNodeDump()

  Still to me this pinpoint that realloc() can suddenly randomly take like
4s when it should return instantly (realloc to 20k !!!) and that's a serious
flaw of our memory allocator (again guessing multithreading doesn't help),
so we may fix most occurences of that problem with the change of allocation
stategy, but the bug still lies somewhere and that could still happen on
the single or couple reallocs which may be needed to grow the buffer to 20k.

  Another thing is that you can give an initial memory size to the buffer
if you have an idea of the final size by using

xmlBufferPtr xmlBufferCreateSize(size_t size);

Daniel
Comment 4 Andrew Beekhof 2013-02-06 23:55:15 EST
Agreed that the reproducibility is a PITA.
I was hoping that figuring out why this guy's test took too long might fix ours "for free".  It does at least appear to have given us a lead to investigate :)

The application isn't multi threaded.

I can try using XML_BUFFER_ALLOC_DOUBLEIT to see if that helps.
Thats certainly one thing that my function would be doing differently (mine grows the buffer by 4k each time) and might help explain the difference.

I also have the program print out the xml being dumped when it takes too long.
IIRC, when I did last and ran it through something like the original test - it was instantaneous :(

I guess this means that memory allocation is almost certainly the core issue here.

As another datapoint, I took a typically large xml file from a cluster here and put it through a few thousand read/dump cycles hoping that one might take too long - none did.
Comment 5 Daniel Veillard 2013-02-07 02:35:43 EST
We can't fix the realloc() random problem (at least here), but we can try
to avoid it.
I really suggest you change the code by allocating upfront a buffer
you know might be sufficient, and changing the allocation strategy to
limit the realloc() counts. With those two hopefully you will avoid the
issue !

Daniel
Comment 6 Andrew Beekhof 2013-02-07 04:03:58 EST
Here's the results with the default allocator, I'll try double next:


Feb 07 04:00:34	BadNews: Feb  7 03:50:04 east-03 pengine[15598]:    error: dump_xml: xmlNodeDump() -> 284343bytes took 3s
Feb 07 04:00:34	BadNews: Feb  7 03:50:33 east-04 cib[24697]:    error: dump_xml: xmlNodeDump() -> 286379bytes took 6s
Feb 07 04:00:34	BadNews: Feb  7 03:51:28 east-04 pengine[24702]:    error: dump_xml: xmlNodeDump() -> 285169bytes took 3s
Feb 07 04:00:35	BadNews: Feb  7 03:51:44 east-07 cib[25766]:    error: dump_xml: xmlNodeDump() -> 287632bytes took 3s
Feb 07 04:00:35	BadNews: Feb  7 03:52:35 east-07 pengine[25771]:    error: dump_xml: xmlNodeDump() -> 285975bytes took 2s
Feb 07 04:00:35	BadNews: Feb  7 03:53:17 east-07 pengine[25771]:    error: dump_xml: xmlNodeDump() -> 288613bytes took 6s
Feb 07 04:00:35	BadNews: Feb  7 03:54:44 east-08 cib[12525]:    error: dump_xml: xmlNodeDump() -> 291458bytes took 7s
Feb 07 04:00:35	BadNews: Feb  7 03:54:58 east-08 cib[12525]:    error: dump_xml: xmlNodeDump() -> 288536bytes took 2s
Feb 07 04:00:35	BadNews: Feb  7 03:55:07 east-08 cib[12525]:    error: dump_xml: xmlNodeDump() -> 288536bytes took 2s
Feb 07 04:00:35	BadNews: Feb  7 03:55:10 east-08 cib[12525]:    error: dump_xml: xmlNodeDump() -> 288536bytes took 2s
Feb 07 04:00:35	BadNews: Feb  7 03:55:32 east-08 pengine[12530]:    error: dump_xml: xmlNodeDump() -> 288428bytes took 2s
Feb 07 04:00:36	BadNews: Feb  7 03:56:16 east-09 pengine[14400]:    error: dump_xml: xmlNodeDump() -> 293090bytes took 4s
Feb 07 04:00:36	BadNews: Feb  7 03:56:19 east-09 pengine[14400]:    error: dump_xml: xmlNodeDump() -> 293028bytes took 2s
Feb 07 04:00:36	BadNews: Feb  7 03:56:59 east-10 pengine[15879]:    error: dump_xml: xmlNodeDump() -> 283719bytes took 2s
Feb 07 04:00:36	BadNews: Feb  7 03:57:04 east-11 cib[15646]:    error: dump_xml: xmlNodeDump() -> 284082bytes took 3s
Feb 07 04:00:36	BadNews: Feb  7 03:57:18 east-11 cib[15646]:    error: dump_xml: xmlNodeDump() -> 283876bytes took 2s
Feb 07 04:00:36	BadNews: Feb  7 03:57:36 east-11 cib[15646]:    error: dump_xml: xmlNodeDump() -> 283945bytes took 2s
Feb 07 04:00:36	BadNews: Feb  7 03:57:45 east-12 cib[15395]:    error: dump_xml: xmlNodeDump() -> 283945bytes took 3s
Feb 07 04:00:36	BadNews: Feb  7 03:57:49 east-12 cib[15395]:    error: dump_xml: xmlNodeDump() -> 283945bytes took 4s
Feb 07 04:00:36	BadNews: Feb  7 03:57:53 east-12 cib[15395]:    error: dump_xml: xmlNodeDump() -> 283945bytes took 4s
Feb 07 04:00:36	BadNews: Feb  7 03:57:57 east-12 cib[15395]:    error: dump_xml: xmlNodeDump() -> 283945bytes took 3s
Feb 07 04:00:36	BadNews: Feb  7 03:58:01 east-12 cib[15395]:    error: dump_xml: xmlNodeDump() -> 283945bytes took 3s
Feb 07 04:00:36	BadNews: Feb  7 03:58:05 east-12 cib[15395]:    error: dump_xml: xmlNodeDump() -> 283945bytes took 4s
Feb 07 04:00:36	BadNews: Feb  7 03:58:22 east-12 cib[15395]:    error: dump_xml: xmlNodeDump() -> 284014bytes took 2s
Feb 07 04:00:37	BadNews: Feb  7 03:58:29 east-13 cib[15069]:    error: dump_xml: xmlNodeDump() -> 284220bytes took 4s
Feb 07 04:00:37	BadNews: Feb  7 03:59:23 east-14 pengine[14826]:    error: dump_xml: xmlNodeDump() -> 283995bytes took 2s
Feb 07 04:00:37	BadNews: Feb  7 03:59:25 east-15 cib[14471]:    error: dump_xml: xmlNodeDump() -> 284388bytes took 2s
Feb 07 04:00:37	BadNews: Feb  7 03:59:57 east-16 cib[13944]:    error: dump_xml: xmlNodeDump() -> 284427bytes took 12s
Feb 07 04:00:37	BadNews: Feb  7 04:00:15 east-16 pengine[13949]:    error: dump_xml: xmlNodeDump() -> 284133bytes took 2s
Comment 7 Andrew Beekhof 2013-02-07 04:18:53 EST
This one came in while the next one was compiling... ouch!

Feb 07 04:10:21	BadNews: Feb  7 04:08:44 east-15 crmd[23459]:    error: dump_xml: xmlNodeDump() -> 271882bytes took 18s
Comment 8 Andrew Beekhof 2013-02-07 05:03:42 EST
Without XML_BUFFER_ALLOC_DOUBLEIT, xmlNodeDump() took more than 1s on more than 50 occasions in 20 minutes.

With it, the cluster has been running for over half an hour and not hit it once.

Perhaps XML_BUFFER_ALLOC_DOUBLEIT should be the default?
Comment 9 Daniel Veillard 2013-02-10 22:33:57 EST
It actually is the default in the current tree.
But what are you running on that machine for it to take 12s to allocate 200KB ?
Is it swapping ?
We are still handling OOM condition extremely poorly :-\

Daniel
Comment 10 Daniel Veillard 2013-02-10 23:05:34 EST
Hum ... no it wasn't the default, but a double allocation scheme was used
in the current tree compared to RHEL. So it became the default for this use case
however since recent versions changed the buffer implementation and the
new buffer have the new semantic.
However the proper solution I think is to switch the buffer to DOUBLEIT for the
time it is being used for xmlNodeDump() then restore to previous allocation
strategy on the way out. I have commited a patch to do this upstream:

http://git.gnome.org/browse/libxml2/commit/?id=23922c536c97e0eaac4bd7fbe68cbf9fceb66b36

but again the code base for RHEL is different especially in that area, so
if we were to fix the RHEL behaviour, a different patch should be applied.
I'm not sure we need to pursue an errata in RHEL though, as there is a way
though the API to change the behaviour,

Daniel
Comment 12 Andrew Beekhof 2013-02-11 05:56:44 EST
(In reply to comment #10)
> Hum ... no it wasn't the default, but a double allocation scheme was used
> in the current tree compared to RHEL. So it became the default for this use
> case
> however since recent versions changed the buffer implementation and the
> new buffer have the new semantic.
> However the proper solution I think is to switch the buffer to DOUBLEIT for
> the
> time it is being used for xmlNodeDump() then restore to previous allocation
> strategy on the way out. I have commited a patch to do this upstream:
> 
> http://git.gnome.org/browse/libxml2/commit/
> ?id=23922c536c97e0eaac4bd7fbe68cbf9fceb66b36

Patch makes sense to me.  Thanks :)

> 
> but again the code base for RHEL is different especially in that area, so
> if we were to fix the RHEL behaviour, a different patch should be applied.
> I'm not sure we need to pursue an errata in RHEL though, as there is a way
> though the API to change the behaviour,

Agreed.
Comment 13 Andrew Beekhof 2013-02-11 06:05:10 EST
(In reply to comment #9)
> It actually is the default in the current tree.
> But what are you running on that machine for it to take 12s to allocate
> 200KB ?
> Is it swapping ?
> We are still handling OOM condition extremely poorly :-\

I'm almost certain the machine isn't swapping, I will do some tests in the morning to be sure.  The nodes are Dell Power Edge SC1435 1U servers with 8GB of RAM, two Dual-Core AMD Opertron 2212 HE CPUs.  So not exactly under-spec'd.

As for what is running, pretty much just the cluster.  Although that can be more than you'd think when 16 nodes start hammering the config system with status updates.
Comment 15 Daniel Veillard 2013-02-11 10:49:45 EST
So at this point my suggestion is simply to close this bug or reassign it
to the cluster component, the fix associated being the patch to
make the buffer use the doubleit allocation mode. For libxml2
there isn't much we can do reasonably except say it should be  fixed in
upstream version.

   Right ?

Daniel
Comment 16 Rik van Riel 2013-02-11 11:52:17 EST
Daniel, Andrew,

Are the mmap calls in the trace taking a really long time to complete?

Is the full trace available somewhere?  It would be very strange if realloc is not be calling mremap on RHEL6, that is something the C library should have been doing for years...
Comment 17 Daniel Veillard 2013-02-11 21:56:37 EST
Created attachment 696347 [details]
zipped trace of the test run

That's actually a Centos box, but that should not fundamentally change
the output:

[root@server ~]# uname -a
Linux server.veillard.com 2.6.32-279.22.1.el6.x86_64 #1 SMP Wed Feb 6 03:10:46 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
[root@server tmp]# gcc -o tst -I/usr/include/libxml2 tst.c -lxml2
[root@server tmp]# strace -o tst.trace ./tst
[root@server tmp]# ls -l tst.trace
-rw-r--r--. 1 root root 497758 Feb 12 10:52 tst.trace
[root@server tmp]# gzip tst.trace
[root@server tmp]# ls -l tst.trace.gz 
-rw-r--r--. 1 root root 45826 Feb 12 10:52 tst.trace.gz
[root@server tmp]#
Comment 19 Andrew Beekhof 2013-02-26 21:12:10 EST
(In reply to comment #15)
> So at this point my suggestion is simply to close this bug or reassign it
> to the cluster component, the fix associated being the patch to
> make the buffer use the doubleit allocation mode. For libxml2
> there isn't much we can do reasonably except say it should be  fixed in
> upstream version.
> 
>    Right ?

Agreed, unless you want to see if mmap is causing problems too.
Comment 20 RHEL Product and Program Management 2013-10-14 00:19:50 EDT
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.