Description of problem: I'm trying the following systemtap script: probe syscall.* { printf ("call to %s(%s) from %s(%d)\n", name, argstr, execname(), pid()) } probe syscall.*.return { printf ("return from %s into %s(%d)\n", name, execname(), pid()) } i.e. with two wildcarded probes on all syscalls. There is a long delay when I start the script, every time I start it, not just the first time. Running with -v -p4 , I see that pass 1 completes very quickly, as do passes 3 and 4, using cached results, but pass 2 seems to take over 10 seconds each time. Pass 1: parsed user script and 49 library script(s) in 360usr/10sys/385real ms. Pass 2: analyzed script: 630 probe(s), 102 function(s), 14 embed(s), 1 global(s) in 12570usr/410sys/13318real ms. /root/.systemtap/cache/83/stap_83b9d8fdd3f2b82db13e9f298d732c13_327705.ko Pass 3: using cached /root/.systemtap/cache/83/stap_83b9d8fdd3f2b82db13e9f298d732c13_327705.c Pass 4: using cached /root/.systemtap/cache/83/stap_83b9d8fdd3f2b82db13e9f298d732c13_327705.ko Version-Release number of selected component (if applicable): systemtap-0.9.5-1.fc10.i386 systemtap-runtime-0.9.5-1.fc10.i386 kernel-2.6.27.12-170.2.5.fc10.i686 kernel-devel-2.6.27.12-170.2.5.fc10.i686 kernel-debuginfo-2.6.27.12-170.2.5.fc10.i686 kernel-headers-2.6.27.12-170.2.5.fc10.i386 kernel-debuginfo-common-2.6.27.12-170.2.5.fc10.i686 How reproducible: 100% with the given script, appears to affect other kinds of wildcarding. Steps to Reproduce: 1. "stap -v -p 4 syscalls.stp" with syscalls.stp as the script above Actual results: 10-15 second delay starting the script each time, occurring in pass 2. Expected results: Repeated invocations of the script should hit caches, and take under a second to start. Additional info: I'm trying to write a UI for systemtap (just a hobby/sideproject at this stage), and this 10-15 second startup each time is a nuisance.
The pass-2 time is unfortunately needed to identify the "cache line" in the compiled module cache, so it doesn't "defeat" the cache -- it's a part of using it. But clearly this pass needs to be optimized further.
Thanks. (I was being sloppy in my use of the word "cache"; taking an educated guess it looks like there's a cache at pass 3, caching from parsed stp scripts on their md5sum and another value to generated .c files, and another one at pass 4, caching from .c files on their md5sums and another value to .ko files. Can some kind of caching be added to pass 2? e.g. look at timestamps of the relevant input files, or perhaps the tapset directory, and if nothing's changed, reuse the processed stp script with tapsets already resolved? for a shortcut through on repeated invocations. I may be misunderstanding the innards of stap though...)
This bug appears to have been reported against 'rawhide' during the Fedora 11 development cycle. Changing version to '11'. More information and reason for this action is here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Since Josh is improving this area, reassigning to him.
Resolving and hashing all of the inputs to pass-2 is not all that trivial -- it requires all input scripts, tapsets, kernel binaries w/ debuginfo, and user binaries w/ debuginfo. I think simply optimizing this path is the better route. For the systemtap 0.9.8 release we made some improvements in this area. Your example script on 0.9.5 takes me 8010usr/430sys/8451real ms in pass-2, and on 0.9.8 that's down to 3520usr/240sys/3761real ms. I hope that is a noticeable improvement in your scenario, but we'll continue to look for more fat to trim.
Dave, would you be able to try building git systemtap to try this? A recent change went in (and will go into the next release, whenever) that should dramatically speed this up.
Thanks. I'm seeing similar improvements to the ones reported in comment 5, reducing the "warm cache" case from about 13 seconds to about 5 seconds; real time for pass 2 has reduced from about 13.3 seconds to about 4.8 seconds. Details: Old version: [david@brick stap]$ rpm -q systemtap systemtap-0.9.5-1.fc10.i386 [david@brick stap]$ time stap -v -p 4 syscalls.stp Pass 1: parsed user script and 49 library script(s) in 380usr/10sys/418real ms. Pass 2: analyzed script: 630 probe(s), 102 function(s), 14 embed(s), 1 global(s) in 12750usr/430sys/13307real ms. /home/david/.systemtap/cache/db/stap_db2b4881ad1f4745e5d7ed03183bd403_327705.ko Pass 3: using cached /home/david/.systemtap/cache/db/stap_db2b4881ad1f4745e5d7ed03183bd403_327705.c Pass 4: using cached /home/david/.systemtap/cache/db/stap_db2b4881ad1f4745e5d7ed03183bd403_327705.ko real 0m13.860s user 0m13.163s sys 0m0.519s Local build from git (83ca3872dadc89c5a8155150bf76d5c09890dcc9) "cold cache" case: [david@brick stap]$ time ../systemtap-0.9.8-4935/bin/stap -v -p 4 syscalls.stp Pass 1: parsed user script and 58 library script(s) in 350usr/30sys/3642real ms. Pass 2: analyzed script: 630 probe(s), 102 function(s), 15 embed(s), 1 global(s) in 4720usr/50sys/4896real ms. Pass 3: translated to C into "/tmp/stapfAW31w/stap_5fe74f3ef6bbe22ae79f5553c3600695_328134.c" in 780usr/50sys/1000real ms. /home/david/.systemtap/cache/5f/stap_5fe74f3ef6bbe22ae79f5553c3600695_328134.ko Pass 4: compiled C into "stap_5fe74f3ef6bbe22ae79f5553c3600695_328134.ko" in 16410usr/500sys/20268real ms. real 0m31.242s user 0m22.302s sys 0m0.685s "warm cache" case: [david@brick stap]$ time ../systemtap-0.9.8-4935/bin/stap -v -p 4 syscalls.stp Pass 1: parsed user script and 58 library script(s) in 220usr/10sys/245real ms. Pass 2: analyzed script: 630 probe(s), 102 function(s), 15 embed(s), 1 global(s) in 4690usr/60sys/4801real ms. /home/david/.systemtap/cache/5f/stap_5fe74f3ef6bbe22ae79f5553c3600695_328134.ko Pass 3: using cached /home/david/.systemtap/cache/5f/stap_5fe74f3ef6bbe22ae79f5553c3600695_328134.c Pass 4: using cached /home/david/.systemtap/cache/5f/stap_5fe74f3ef6bbe22ae79f5553c3600695_328134.ko real 0m5.136s user 0m4.946s sys 0m0.093s
With stap 1.0, I'm now seeing pass-2 at 1340usr/50sys/1392real ms.
I'm seeing very short times now in pass2 also, and proper caching in pass 3/4. David, are you still seeing anomalous times?
Pass 2 is now taking 2-2.5 real seconds for the "hot" case, so this has improved; thanks! [david@brick stap]$ stap -v -p4 syscalls.stp Pass 1: parsed user script and 65 library script(s) using 20136virt/12264res/2056shr kb, in 220usr/10sys/232real ms. Pass 2: analyzed script: 630 probe(s), 110 function(s), 16 embed(s), 1 global(s) using 180820virt/114452res/73688shr kb, in 2130usr/200sys/2559real ms. /home/david/.systemtap/cache/93/stap_937796f3b632292c5a1c856192c48729_342846.ko Pass 3: using cached /home/david/.systemtap/cache/93/stap_937796f3b632292c5a1c856192c48729_342846.c Pass 4: using cached /home/david/.systemtap/cache/93/stap_937796f3b632292c5a1c856192c48729_342846.ko [david@brick stap]$ rpm -qa systemtap\* kernel\* | sort kernel-debuginfo-common-i686-2.6.31.12-174.2.3.fc12.i686 kernel-firmware-2.6.31.12-174.2.3.fc12.noarch kernel-headers-2.6.31.12-174.2.3.fc12.i686 kernel-PAE-2.6.31.12-174.2.3.fc12.i686 kernel-PAE-2.6.31.6-145.fc12.i686 kernel-PAE-2.6.31.9-174.fc12.i686 kernel-PAE-debuginfo-2.6.31.12-174.2.3.fc12.i686 kernel-PAE-devel-2.6.31.12-174.2.3.fc12.i686 kernel-PAE-devel-2.6.31.6-145.fc12.i686 kernel-PAE-devel-2.6.31.9-174.fc12.i686 systemtap-1.1-2.fc12.i686 systemtap-runtime-1.1-2.fc12.i686 systemtap-sdt-devel-1.1-1.fc12.i686
thanks for confirming!