Bug 1595389
Description
Dave Wysochanski
2018-06-26 19:49:36 UTC
(In reply to Dave Wysochanski from comment #0)
> ...
> Probably focus on refactoring the tools.c:do_list() function for now.
>
> It's also possible the hashing itself could be improved for better
> performance; see the HQ_INDEX() definition, for example.
> ...
> Additional info:
> We have sample vmcores I'll put in a private note. While it may seem
> pathological, we have increasing vmcore sizes, and it would be very
> useful to have more optimized list enumeration for faster vmcore
> diagnosis.
>
> I intend to take a crack at this unless someone else beats me to it.
> Right now I have a few other items on my plate, so it may be a couple
> of weeks for me.

Thanks Dave, I appreciate it.

If I may suggest a few ground rules first.

In addition to the user "list" command, the do_list() and hq_xxxx() facilities are used internally throughout the crash utility proper for various purposes, and are exported for use by extension modules, so making wholesale changes to them (i.e., refactoring the do_list() function) would be extremely risky.

Since this issue is (for now) specific to the user "list" command, I would prefer that you generate an alternative function to do_list() that is used only by the list command. That way you could avoid working around the hq_open/hq_enter/hq_close issues, and direct your efforts to making the function do exactly what you come up with as an alternative to hashing for loop detection. Once that alternative do_list()-type function is proven, then perhaps other users of do_list() could make the transition.

With respect to tweaking the current hash queue code, that's something that could certainly be entertained. For example, as I recall, the HQ_SHIFT value was originally selected because the prime user of hash queues is/was task management, which used to be based upon the task_struct, which was located in the kernel stack page. And even now, the task_struct is almost a page in length. So I guess it's a compromise as to what the best shift value should be. And with respect to the hash entry memory usage, it looks like it would be easy enough to dial back the ht->memptr size in restore_sanity().
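For readers following along: HQ_INDEX() amounts to shifting the entered address right and taking the result modulo the number of queue heads, so the best HQ_SHIFT depends on the typical spacing between hashed entries. A minimal sketch of the idea, with placeholder constants rather than the actual values from the crash source:

    /* Illustrative only: the constants here are hypothetical, not crash's. */
    #define HQ_SHIFT        12      /* historically ~page shift, sized for task_structs */
    #define NR_HASH_QUEUES  4096    /* number of queue heads */
    #define HQ_INDEX(addr)  (((unsigned long)(addr) >> HQ_SHIFT) % NR_HASH_QUEUES)

With a page-sized shift, every entry inside the same (1 << HQ_SHIFT) window lands on the same queue head, so a list of small, densely packed structures can concentrate into relatively few buckets.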
(In reply to Dave Anderson from comment #3)
> Since this issue is (for now) specific to the user "list" command, I would
> prefer that you generate an alternative function to do_list() that is
> used only by the list command. ...

Understood, and what you explain here is exactly along the lines I was thinking. I want to provide an alternative loop detection algorithm for the 'list' command that does not use memory, then prove it out and see where else (if anywhere) it might be useful.

> With respect to tweaking the current hash queue code, that's something
> that could certainly be entertained. ...

I have the crash-debuginfo installed on optimus now, and I want to run that vmcore and the list command again and look at the data structures under gdb. I suspect the hash may not be working well, but I need to see what is going on.

(In reply to Dave Wysochanski from comment #4)
> I suspect the hash may not be working well, but I need to see what is
> going on.

FWIW, an initial check of hash_table->queue_heads[*].qcnt after a few minutes revealed a mostly even distribution across the queue heads. However, I didn't hit the problem until after a few hours, so I'm going to let it run, look at the rate of additions to the file, then stop gdb and look again to see whether there's anything interesting going on or the list is just really large.

Created attachment 1455126 [details]
POC v1 patch (untested) for a tortoise-and-hare loop detection algorithm for the 'list' command
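For reference, the tortoise-and-hare approach in the POC boils down to the following, shown here as a minimal sketch over a plain in-memory singly linked list rather than the actual patch code; next_node() stands in for the readmem() of each list_head.next:

    struct node { struct node *next; };

    static struct node *next_node(struct node *n) { return n->next; }

    /* Floyd cycle detection: advance one pointer by 1 and one by 2 each
     * step; they meet if and only if the list loops.  Returns the first
     * node of the loop, or NULL for a loop-free list. */
    static struct node *floyd_find_loop_start(struct node *head)
    {
            struct node *tortoise = head, *hare = head;

            while (hare && hare->next) {
                    tortoise = next_node(tortoise);         /* 1 step  */
                    hare = next_node(next_node(hare));      /* 2 steps */
                    if (tortoise == hare) {
                            /* Find the start of the loop: reset the tortoise
                             * to the head and move both at the same speed;
                             * they meet again at the first node of the loop. */
                            tortoise = head;
                            while (tortoise != hare) {
                                    tortoise = next_node(tortoise);
                                    hare = next_node(hare);
                            }
                            return tortoise;
                    }
            }
            return NULL;    /* NULL-terminated: no loop */
    }

Note that the detection phase reads two list positions per step even when there is no loop; that cost is what the Brent variant discussed later in this bug avoids.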
Hey Dave A - do you have a vmcore with a list loop in it? What do you think of my POC v1 approach in comment #6? (Sorry, ignore the makefile changes; they are not necessary.)

Created attachment 1455134 [details]
POC v2 patch (lightly tested) for a tortoise-and-hare loop detection algorithm for the 'list' command
FWIW, I'm running the patch in comment #8 as 'new-crash' and timing a crashrc file with the following command:

    list -T -H 0xffff8ac03c81fc28 > /dev/null

Not that it means much given that I'm not controlling the workload on the machine, but it's a sanity test. The original command with hashing turned off took around 10 hours.

One more note - there's a bug in my second patch. This is obviously wrong (it clears every flag except LIST_ALLOCATE):

    ld->flags &= LIST_ALLOCATE;

It should be this, which clears only the LIST_ALLOCATE bit:

    ld->flags &= ~LIST_ALLOCATE;

Created attachment 1455139 [details]
strace of crash process doing 'list -T -H 0xffff8ac03c81fc28 > /dev/null' - shows larger reads, and it's not obvious why
Comment on attachment 1455139 [details]
strace of crash process doing 'list -T -H 0xffff8ac03c81fc28 > /dev/null' - shows larger reads, and it's not obvious why
There are a lot of ~1k reads from the disk going on here; it looks like crash is reading outside of the list_head, possibly trying to cache part of the structure? I can understand the 24-byte read and the lseek, but I'm not sure about the larger read.
write(19, "ffff8a978a7abc40\n", 17) = 17
write(19, "ffff8a978a7abdc0\n", 17) = 17
lseek(3, 2122052384, SEEK_SET) = 2122052384
read(3, "\324\247\20*\30\0\0\0\301\4\0\0\2\0\0\0\0\0\0\0\0\0\0\0", 24) = 24
lseek(3, 103784949716, SEEK_SET) = 103784949716
read(3, "\0 \200\4\10\0\2\0\0\0\310\206\335\211\230\212\377\377HK\211\243\222\212\377\377\200|\3 \276\212"..., 1217) = 1217 <---- NOT OBVIOUS WHY???
write(19, "ffff8a978a7abf40\n", 17) = 17
write(19, "ffff8a978a7ac140\n", 17) = 17
write(19, "ffff8a978a7ac2c0\n", 17) = 17
write(19, "ffff8a978a7ac440\n", 17) = 17
write(19, "ffff8a978a7ac5c0\n", 17) = 17
write(19, "ffff8a978a7ac740\n", 17) = 17
write(19, "ffff8a978a7ac8c0\n", 17) = 17
write(19, "ffff8a978a7aca40\n", 17) = 17
write(19, "ffff8a978a7acbc0\n", 17) = 17
write(19, "ffff8a978a7acd40\n", 17) = 17
lseek(3, 2122052408, SEEK_SET) = 2122052408
read(3, "\225\254\20*\30\0\0\0\236\4\0\0\2\0\0\0\0\0\0\0\0\0\0\0", 24) = 24
lseek(3, 103784950933, SEEK_SET) = 103784950933
read(3, "\0\0016_dOeSnotExist_.db\0+\0\0\r\300c\245\300\377\377\377"..., 1182) = 1182 <---- NOT OBVIOUS WHY???
write(19, "ffff8a978a7acec0\n", 17) = 17
write(19, "ffff8a978a7ad040\n", 17) = 17
write(19, "ffff8a978a7ad1c0\n", 17) = 17
OK, that is normal - it's reading the dump file. I guess we cannot read a small amount because of the dump format, so we read around 1k at a time. Confirmed in gdb.
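In hindsight the pattern fits a compressed dump format: each page access reads a small fixed-size page descriptor, then seeks to and reads the stored (possibly compressed) page data, which is how a 24-byte list_head lookup becomes a ~1K file read. Roughly, the descriptor looks like the following, reconstructed from memory of the compressed kdump/diskdump format, so the field names are approximate:

    #include <stdint.h>

    /* Page descriptor, 24 bytes, matching the read(3, ..., 24) calls in
     * the strace above; the subsequent larger read fetches "size" bytes
     * of stored page data at "offset". */
    typedef struct page_desc {
            uint64_t offset;        /* file offset of the stored page data */
            uint32_t size;          /* stored size: the source of the ~1K reads */
            uint32_t flags;         /* compression type, etc. */
            uint64_t page_flags;    /* page flags recorded at dump time */
    } page_desc_t;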
Created attachment 1455200 [details]
RFC patch: Brent's algorithm for 'list' command loop detection - more efficient than Floyd's tortoise and hare
I realize these initial patches do not take the approach of a separate do_list() function from comment #3, so they probably won't get accepted. These were the minimal changes I could think of to get a test working and to help me understand things a little better.

I think I understand a bit more of what is going on now. I knew do_list() was used by many callers, but I didn't understand it well, and I didn't understand the hq_* functions. I see now that the hash is sometimes used for memory allocation at the end (callers setting the LIST_ALLOCATE flag) as well as for existence checks (callers of hq_entry_exists), so the hash is used for more than cycle detection in some places (see the sketch after the next attachment for the typical LIST_ALLOCATE caller pattern). The LIST_ALLOCATE flag is not set for the list command, though, and there are no calls to hq_entry_exists, so it looks like the hq_* functions are used only for cycle detection there. So far I have not found a reason that a new non-storage cycle detector would be a non-starter for the list command, and I cannot see any downside so far.

Created attachment 1455222 [details]
RFC patch (under testing) with do_list() replacement using Brent's algorithm for cycle detection
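For context, the LIST_ALLOCATE pattern mentioned above looks roughly like this in a do_list() caller. This is a sketch reconstructed from memory of the crash source and extension-module examples, so the details are approximate; it assumes crash's defs.h for list_data, do_list(), BZERO(), OFFSET(), FREEBUF(), and the global output stream fp, and dump_task_list() is a hypothetical caller:

    #include "defs.h"

    static void dump_task_list(ulong first_task)
    {
            struct list_data list_data, *ld;
            int i, cnt;

            ld = &list_data;
            BZERO(ld, sizeof(struct list_data));
            ld->flags |= LIST_ALLOCATE;     /* collect entries in ld->list_ptr */
            ld->start = first_task;         /* address of the first element */
            ld->member_offset = OFFSET(task_struct_tasks);  /* embedded list_head */

            cnt = do_list(ld);              /* traversal uses the hq_* machinery */

            for (i = 0; i < cnt; i++)
                    fprintf(fp, "%lx\n", ld->list_ptr[i]);

            FREEBUF(ld->list_ptr);          /* caller frees the allocated array */
    }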
There is a lot more duplication in the patch in comment #17, but that is my understanding of the approach asked for in comment #3. I am not sure it is correct, though. It's also less clear what is changing there, and it's possible more can be cut out, or we could refactor at least some of the duplication (for example, the 'if (CRASHDEBUG(1)) {' part at the top).

(In reply to Dave Wysochanski from comment #7)
> Hey Dave A - do you have a vmcore with a list loop in it?

I don't. But it would be simple to inject an error into the readmem() function. Generate a long list from any vmcore, and dump the output into a file. Then take an address near the very beginning (other than the first entry), and find an address near the end of the list. Then at the top of readmem() do something like:

    if ((addr == <address_near_end>) &&
        STREQ(pc->curcmd, "list") &&
        STREQ(type, "list entry")) {
            ulong *listptr;
            listptr = (ulong *)buffer;
            *listptr = <address_near_beginning>;
            return TRUE;
    }

(In reply to Dave Anderson from comment #19)
> I don't. But it would be simple to inject an error into the readmem()
> function.

Great suggestion, thanks Dave!

One difference with this simplified non-hash-table Brent algorithm is that patch #14 won't give you the precise location of the first duplicate, but it will tell you there is a loop in the list. Is that a show stopper?

OK, forget my question - there are other portions of these algorithms that give the start of the cycle - duh. Let me see how complicated that is and whether it is worthwhile.

OK, so for Floyd, if a loop is detected, we basically have to reset the tortoise to the start, make the hare the same speed, and step them until they meet. This may mean a worst case of running the list twice to find the start of the loop. It is probably OK. I will have to check Brent and see how much it improves on this. Seems doable, so we should be able to remove the hash table from the list function with a bit more work.

Just playing with your comment #17 Brent patch, and yeah, I see that the "duplicate list entry" value is all over the place, such that it's more confusing than helpful to show it. In any case, it's pretty nifty how it works -- although honestly I haven't really made sense of *how* it actually works. Anyway, if we keep that algorithm in place, it would make more sense to make the call to the new function an alternative "list" option that specifically serves to check whether a given list loops around.

(In reply to Dave Anderson from comment #25)
> ... it would make more sense to make the call to the new function an
> alternative "list" option ...

Yeah, that patch from comment #17 will not fly. I have a v2 patch I'm working on to identify the start of the loop - i.e., find the first duplicate and then stop. I'm seeing how close to identical output I can get, but I'm not sure. I think it will likely print a few more list entries along the loop before it stops.
Basically this is the tradeoff of not using a hash table but a more mathematical algorithm. I'm not sure if that is enough. If it prints the same line as the hash table version, i.e.:

    list: duplicate list entry: ffff8abdf3bb5d00

but the list output before that contains a few more items (some in the loop), would that be acceptable or not? You get rid of the overhead of the hash table, but the output is slightly longer than before (though you still get the exact loop location, as with the hash table).

The Brent algorithm is better than Floyd as far as I can tell, in that you don't have to read from two points in the list (file) at the same time unless you actually have a loop. With Brent, the second, faster pointer just follows the slow one around, doubling every so often. Then the slow pointer eventually hits this second pointer, and you know you're in the loop. Once you are in the loop, with Brent you pay the penalty of reading two places in the file, but with Floyd you pay that penalty even when there is no loop, so I don't like it as much.

I would like to at least get to a stopping point and then make a call whether this is worthwhile or not. Here is a concrete example of what I have now.

crash-7.3.2 (with error injection):

    Working directory /cores/retrace/tasks/795082132/misc.
    crash-debug> list -H 0xffff8ac03c81fc28
    ffff8abdf38b7d00
    ffff8abdf38b7ac0
    ffff8abdf38b7f40
    ffff8abe0edb4b00
    ffff8abdf3bb5d00
    ffff8abdf3bb5b80
    ffff8abdf3bb5640
    ffff8abe0e92b100
    ffff8abe0e92ad40
    DEBUG brent - injecting error - next = ffff8abe0ead2200 points back to ffff8abdf3bb5d00
    ffff8abdf3bb5d00
    list: duplicate list entry: ffff8abdf3bb5d00

Example output of Brent v2:

    new-crash-v2-debug> list -H 0xffff8ac03c81fc28
    ffff8abdf38b7d00
    ffff8abdf38b7ac0
    ffff8abdf38b7f40
    ffff8abe0edb4b00
    ffff8abdf3bb5d00
    ffff8abdf3bb5b80
    ffff8abdf3bb5640
    ffff8abe0e92b100
    ffff8abe0e92ad40
    DEBUG brent - injecting error - next = ffff8abe0ead2200 points back to ffff8abdf3bb5d00
    ffff8abdf3bb5d00
    ffff8abdf3bb5b80                  <---------- NEW OUTPUT
    DEBUG brent - injecting error - next = ffff8abe0ead2200 points back to ffff8abdf3bb5d00
    list: duplicate list entry: ffff8abdf3bb5d00

OK, I did a Floyd part (b) and got identical output now:

    new-crash-floyd-b> list -H 0xffff8ac03c81fc28
    ffff8abdf38b7d00
    ffff8abdf38b7ac0
    ffff8abdf38b7f40
    ffff8abe0edb4b00
    ffff8abdf3bb5d00
    ffff8abdf3bb5b80
    ffff8abdf3bb5640
    ffff8abe0e92b100
    ffff8abe0e92ad40
    ffff8abdf3bb5d00
    list: duplicate list entry: ffff8abdf3bb5d00

Will need to test some more, and I think I figured out what was wrong with Brent. I'd like to get a Brent part (b) in there - that would be awesome.

OK, I have Brent fully working now. Here is the Brent output:

    new-crash-brent-v1-debug> list -H 0xffff8ac03c81fc28
    ffff8abdf38b7d00
    ffff8abdf38b7ac0
    ffff8abdf38b7f40
    ffff8abe0edb4b00
    ffff8abdf3bb5d00
    ffff8abdf3bb5b80
    ffff8abdf3bb5640
    ffff8abe0e92b100
    ffff8abe0e92ad40
    ffff8abdf3bb5d00
    ffff8abdf3bb5b80
    ffff8abdf3bb5640
    list: loop detected, loop length: 5
    list: length from start to loop: 4
    list: duplicate list entry: ffff8abdf3bb5d00

While different, I think this may actually be superior output to what is currently given. The reason is that you see part of the repeat, and you get the loop length as well as the length from the start to the loop. Brent adds almost no overhead in the no-loop traversal case. If there is a loop, it adds the overhead of having to re-traverse the list (similar to Floyd), but I think this is a good tradeoff, since the overhead of a list advance is fixed, versus a hash table whose overhead keeps growing and is paid even for plain detection.

I am working on a really good patch header to explain how it works (it's not as difficult and obtuse as it may seem), and I can clean up the code to be an exact duplicate of do_list() but with this different algorithm. I can also add whatever documentation is necessary. I still need a better test plan and error injection, but I'll work on that soon.
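For reference, here is a minimal sketch of Brent's algorithm over a plain in-memory singly linked list. It is illustrative only, not the code from the patch; next_node() stands in for the readmem() of each list_head.next. The key property described above is visible here: in phase 1 only the hare performs a read per step, while the tortoise merely copies the hare's position at each power of two:

    struct node { struct node *next; };

    static struct node *next_node(struct node *n) { return n->next; }

    /* Returns the loop length lam (0 if the list terminates), and stores
     * in *mu the number of steps from the head to the first loop node. */
    static unsigned long brent(struct node *head, unsigned long *mu)
    {
            unsigned long power = 1, lam = 1, i;
            struct node *tortoise = head;
            struct node *hare;

            if (!head)
                    return 0;
            hare = next_node(head);

            /* Phase 1: find the loop length. */
            while (hare && tortoise != hare) {
                    if (power == lam) {     /* start a new power of two */
                            tortoise = hare;
                            power *= 2;
                            lam = 0;
                    }
                    hare = next_node(hare);
                    lam++;
            }
            if (!hare)
                    return 0;               /* NULL-terminated: no loop */

            /* Phase 2: find the distance to the loop by walking two
             * pointers, lam apart, from the head until they meet. */
            tortoise = hare = head;
            for (i = 0; i < lam; i++)
                    hare = next_node(hare);
            for (*mu = 0; tortoise != hare; (*mu)++) {
                    tortoise = next_node(tortoise);
                    hare = next_node(hare);
            }
            return lam;
    }

Against the injected loop in the example above, this would report lam = 5 and mu = 4, matching the "loop length: 5" and "length from start to loop: 4" lines in the Brent v1 output.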
I don't think with either Floyd or Brent we will be able to guarantee output identical to existing crash with the hash on in the loop case. I'm not 100% sure it's impossible with Floyd, but I'm certain it's impossible with Brent.

So if output identical to existing crash with the hash on is a requirement, I'm not sure either algorithm will work.

(In reply to Dave Wysochanski from comment #33)
> So if output identical to existing crash with the hash on is a
> requirement, I'm not sure either algorithm will work.

The output in comment #30 seems OK/harmless/useful, although it may argue for a unique "list" command option whose sole purpose is to check for loops.

Posted the Brent implementation to the upstream crash-utility list with a '-B' option for the list command, retaining the existing list function untouched.

Patch applied upstream:
https://github.com/crash-utility/crash/commit/6596f1121b89162f96d1e1825c2905b83b59bec1

    The existing "list" command uses a hash table to detect duplicate
    items as it traverses the list.  The hash table approach has worked
    well for many years.  However, with increasing memory sizes and list
    sizes, the overhead of the hash table can be substantial, often
    leading to commands running for a very long time.  For large lists,
    we have found that the existing hash-based approach may slow the
    system to a crawl and possibly never complete.  You can turn off the
    hash with "set hash off", but then there is no loop detection; in
    that case, loop detection must be done manually after dumping the
    list to disk, or by some other method.

    This patch is an implementation of the cycle detection algorithm
    from R. P. Brent as an alternative algorithm for the "list" command.
    The algorithm both avoids the overhead of the hash table and is
    still able to detect a loop.  In addition, further loop
    characteristics are printed, such as the distance to the start of
    the loop as well as the loop length.  An excellent description of
    the algorithm can be found here on the crash-utility mailing list:
    https://www.redhat.com/archives/crash-utility/2018-July/msg00019.html

    A new "list -B" option has been added to the "list" command to
    invoke this new algorithm rather than using the hash table.  In
    addition to low memory usage, the output of the list command is
    slightly different when a loop is detected: besides the first
    duplicate entry, the length of the loop and the distance to the
    loop are printed.
    (dwysocha)
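With the patch applied, the new option is invoked alongside the usual list arguments, for example (assuming -B combines with -H the way the other list flags do):

    crash> list -B -H 0xffff8ac03c81fc28

On a looped list this reports the loop length, the distance from the start to the loop, and the first duplicate entry, as in the Brent v1 output shown earlier, while leaving the default hash-based traversal untouched.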
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2071