Bug 1463002
Summary: | ksh function calls hang in loop | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 6 | Reporter: | Greg Chandler <greg.chandler> |
Component: | ksh | Assignee: | Siteshwar Vashisht <svashisht> |
Status: | CLOSED WONTFIX | QA Contact: | BaseOS QE - Apps <qe-baseos-apps> |
Severity: | medium | Docs Contact: | |
Priority: | unspecified | ||
Version: | 6.9 | CC: | greg.chandler, mkyral |
Target Milestone: | rc | Keywords: | Reopened |
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2017-12-06 11:02:27 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Greg Chandler
2017-06-19 21:29:06 UTC
It does not wait on my system. Here is the output from my test system : # rpm -q ksh ksh-20120801-33.el6.x86_64 # cat test.sh #!/bin/ksh _urlencode () { ####For every line in the string encode it properly. ####UNIX cut will take the same cut section from every input line, so you have to walk one line at a time. ####Note, unfortunately this method will always dump an extra line feed at the end, which must be cleaned up #### outside of this function, due to the way it is called. string="$1" echo "$string" | while IFS= read -r line do x=1 length=${#line} while [ $x -le $length ] do CHAR="`echo \"$line\" |cut -c$x`" if [ -z "`echo \"$CHAR\" |egrep \"[0-9]|[a-z]|[A-Z]|[._/-]\"`" ] ; then DECIMAL="`echo \"$CHAR\" |od -t d1 |awk '{printf \"%s\",\$2}'`" HEX="`echo \"obase=16; $DECIMAL\" |bc`" printf "%%%s" "$HEX" else printf "%s" "$CHAR" fi x=`expr $x + 1` done printf "%%%s" "0A" done } ENCODED_STRING="`_urlencode \" Some ra^ndo#m string \"`" echo "$ENCODED_STRING" # ksh test.sh %20Some%20ra%5Endo%23m%20string%20%0A I am able to reproduce the problem 100% of the time. It's been tried on 30+ servers, all of which fail, and all of which are RHEL 6.9. I haven't had a chance to test on other versions of RHEL yet, but it has worked on our AIX and Solaris systems. What other info do we need to figure out what the problem might be? Ok, so I went to check again this morning, and the script above, as just a snippet does indeed work. However the code it is being called from does not. The function is copied verbatim from the larger script, and the ENCODED_STRING set only differs from the string being passed. ENCODED_STRING="`_urlencode \" Some ra^ndo#m string \"`" vs ENCODED_STRING="`_urlencode \"$2\"`" I am starting to wonder now if a previously patched ksh problem is now to blame. There was a bug I found in Solaris' ksh where the limit size for a job had been capped. It seems that they decided that the script size + env + memory would be capped at an arbitrary number rather than allow for an unlimited size like ksh was designed specifically to do. The script itself is 410,821 bytes, 8106 lines long. ------------------------------------------------------------------------------ Here is the top info for it once it is hung: top - 11:13:04 up 29 days, 8:28, 3 users, load average: 0.67, 0.71, 0.46 Tasks: 1 total, 0 running, 1 sleeping, 0 stopped, 0 zombie Cpu(s): 10.4%us, 2.8%sy, 0.0%ni, 86.6%id, 0.2%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 32752628k total, 28189840k used, 4562788k free, 399044k buffers Swap: 3145724k total, 0k used, 3145724k free, 16630996k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 18369 root 20 0 108m 4004 1528 S 0.0 0.0 0:01.79 ksh ------------------------------------------------------------------------------ A second run: top - 11:17:59 up 29 days, 8:33, 3 users, load average: 0.67, 0.70, 0.51 Tasks: 1 total, 0 running, 1 sleeping, 0 stopped, 0 zombie Cpu(s): 0.5%us, 0.2%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 32752628k total, 28177092k used, 4575536k free, 399064k buffers Swap: 3145724k total, 0k used, 3145724k free, 16633400k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 30161 root 20 0 108m 4000 1528 S 0.0 0.0 0:01.74 ksh ------------------------------------------------------------------------------ [root@ipvrw00a0006 ~]# lsof |grep 30161 ksh 30161 root cwd DIR 253,0 4096 524315 /var/tmp ksh 30161 root rtd DIR 253,0 4096 2 / ksh 30161 root txt REG 253,0 1507840 917663 /bin/ksh93 ksh 30161 root mem REG 253,0 161704 786454 /lib64/ld-2.12.so ksh 30161 root mem REG 253,0 1930416 786472 /lib64/libc-2.12.so ksh 30161 root mem REG 253,0 23088 786584 /lib64/libdl-2.12.so ksh 30161 root mem REG 253,0 600048 786497 /lib64/libm-2.12.so ksh 30161 root mem REG 253,0 18056 787004 /lib64/libutil-2.12.so ksh 30161 root mem REG 253,0 99170352 1068621 /usr/lib/locale/locale-archive ksh 30161 root 0u unix 0xffff880119625780 0t0 41259139 socket ksh 30161 root 1u REG 253,0 2 1182388 /tmp/sf1c.j3a (deleted) ksh 30161 root 2u CHR 136,3 0t0 6 /dev/pts/3 ksh 30161 root 10r DIR 253,0 4096 524315 /var/tmp ksh 30161 root 11r REG 253,0 410821 1180654 /tmp/hcm.sh ksh 30161 root 12r DIR 253,0 4096 524315 /var/tmp ksh 30161 root 13u CHR 136,3 0t0 6 /dev/pts/3 ksh 30161 root 14u CHR 136,3 0t0 6 /dev/pts/3 ksh 30161 root 15r DIR 253,0 4096 524315 /var/tmp ksh 30161 root 16u REG 253,0 0 1182380 /tmp/sf12.iih (deleted) ------------------------------------------------------------------------------ The exact code from the script: _urlencode () { ####For every line in the string encode it properly. ####UNIX cut will take the same cut section from every input line, so you have to walk one line at a time. ####Note, unfortunately this method will always dump an extra line feed at the end, which must be cleaned up #### outside of this function, due to the way it is called. string="$1" echo "$string" | while IFS= read -r line do x=1 length=${#line} while [ $x -le $length ] do CHAR="`echo \"$line\" |cut -c$x`" if [ -z "`echo \"$CHAR\" |egrep \"[0-9]|[a-z]|[A-Z]|[._/-]\"`" ] ; then DECIMAL="`echo \"$CHAR\" |od -t d1 |awk '{printf \"%s\",\$2}'`" HEX="`echo \"obase=16; $DECIMAL\" |bc`" printf "%%%s" "$HEX" else printf "%s" "$CHAR" fi x=`expr $x + 1` done printf "%%%s" "0A" done } append_CA_super_field () { echo "$1 = $2" |$TEE $CONLOG ENCODED_STRING="`_urlencode \"$2\"`" CLEANED_ENCODED_STRING="`echo \"$ENCODED_STRING\" |sed -e 's/%0A\$//g'`" CA_SUPER_FIELD_MAIN_VALUES="$CA_SUPER_FIELD_MAIN_VALUES^$1:$CLEANED_ENCODED_STRING" } set_attributes () { ####For a complete list of all attributes being set, do not forget to check all the calls to file_to_CA, as they reference this function output_header "Setting HPSA Custom Attributes" append_CA_super_field "ADDM_Keys" "$OR_ADDM_KEYS" append_CA_super_field "ADDM_User" "$OR_ADDM_USER" #### #### About 200 more lines of the same kind of calls removed #### } set_attributes is called near the end of the script. Prior to it's call, OR_ADDM_KEYS="No" , OR_ADDM_USER="" I modified the function to include the 3 debug statements below: The echos must be redirected to a file since the function takes the output and sends it to a variable rather than stdout _urlencode () { ####For every line in the string encode it properly. ####UNIX cut will take the same cut section from every input line, so you have to walk one line at a time. ####Note, unfortunately this method will always dump an extra line feed at the end, which must be cleaned up #### outside of this function, due to the way it is called. string="$1" echo "$string" | while IFS= read -r line do x=1 length=${#line} while [ $x -le $length ] do CHAR="`echo \"$line\" |cut -c$x`" echo "$x: $CHAR" >>/tmp/lines if [ -z "`echo \"$CHAR\" |egrep \"[0-9]|[a-z]|[A-Z]|[._/-]\"`" ] ; then DECIMAL="`echo \"$CHAR\" |od -t d1 |awk '{printf \"%s\",\$2}'`" HEX="`echo \"obase=16; $DECIMAL\" |bc`" echo "$x: HEX: $HEX" >>/tmp/lines printf "%%%s" "$HEX" else echo "$x: CHAR: $CHAR" >>/tmp/lines printf "%s" "$CHAR" fi x=`expr $x + 1` done printf "%%%s" "0A" done } [root@ipvrw00a0006 tmp]# ls -al /tmp/lines ls: cannot access /tmp/lines: No such file or directory [root@ipvrw00a0006 tmp]# ksh hcm.sh -pre {standard output} ADDM_Keys = No {hang} {ctrl-z} ^Z [1]+ Stopped ksh hcm.sh -pre [root@ipvrw00a0006 tmp]# ls -al /tmp/lines -rw-r--r-- 1 root root 5 Jun 20 11:29 /tmp/lines [root@ipvrw00a0006 tmp]# cat /tmp/lines 1: N [root@ipvrw00a0006 tmp]# This is the behavior I am seeing, and able to replicate. For whatever reason which I am still having trouble figuring out, it gets 1 character into the encode loop and hangs. it seemingly hangs in the if statement, but I am not 100% certain that the if statement actually is the hang point. To show there are no aliases for echo/egrep: [root@ipvrw00a0006 tmp]# alias alias cp='cp -i' alias l.='ls -d .* --color=auto' alias ll='ls -l --color=auto' alias ls='ls --color=auto' alias mv='mv -i' alias rm='rm -i' alias which='alias | /usr/bin/which --tty-only --read-alias --show-dot --show-tilde' [root@ipvrw00a0006 tmp]# [root@ipvrw00a0006 tmp]# ulimit unlimited [root@ipvrw00a0006 tmp]# I also wanted to drop a reminder, when this hangs at that point, ctrl-c does not work to stop the script. The only way I can get out of it is to pause the process and then kill it: crtl-z kill %1 I removed the code for AIX, to see if we were hitting a size limit like with the Oracle bug: [root@ipvrw00a0006 tmp]# ls -al hcm.sh -rw-r--r-- 1 x819713 usasa 370705 Jun 20 11:37 hcm.sh But it didn't make that much of a difference, and the script still hung in the exact same place. I then removed the AIX and HP/UX code as well: [root@ipvrw00a0006 tmp]# ls -al hcm.sh -rw-r--r-- 1 x819713 usasa 265511 Jun 20 11:46 hcm.sh [root@ipvrw00a0006 tmp]# cat /tmp/lines 1: N [root@ipvrw00a0006 tmp]# top -p 24611 top - 11:48:25 up 29 days, 9:03, 4 users, load average: 0.35, 0.23, 0.21 Tasks: 1 total, 0 running, 1 sleeping, 0 stopped, 0 zombie Cpu(s): 0.8%us, 0.4%sy, 0.0%ni, 98.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 32752628k total, 28297756k used, 4454872k free, 399084k buffers Swap: 3145724k total, 0k used, 3145724k free, 16640852k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 24611 root 20 0 108m 3256 1528 S 0.0 0.0 0:01.32 ksh And again it hung in the exact same place, the resident size of ram has reduced, and in the case of the Oracle bug, the script would have continued to a point where it hit the limit, and then died there. Each of the global variables is combined after encoding into a single string to be sent to another application, so the reduced size of the environment in the script load itself, freed more ram for the final string. This however is NOT happening here, so I have at least ruled that bug out. While testing... RHEL 7 has the same issue, but much farther up in the script. In this case, there is a function, that executes all the lines of the function, but does not properly return. message_parser () { #### $1 input file #### Sort -u? #### Pass/Fail #### $ERROR_STRINGS is a PIPE demlimited list if [ -f "$1" ] ; then for CHECKS in `echo "$ERROR_STRINGS" |tr "|" " "` do output_header "$1 $CHECKS logs" LINES="`cat \"$1\" 2>/dev/null |grep -i \"$CHECKS\" |wc -l |sed -e 's/ //g'`" echo "$1 has $LINES lines with $CHECKS within it." |$TEE $CONLOG LINES="`cat \"$1\" 2>/dev/null |grep -i \"$CHECKS\" |sort -uk 6 |wc -l |sed -e 's/ //g'`" #### Create a ratio of found lines to the max lines, and print that many only. Add one for the case that it returns 0 but should have been rounded upwards. GET_LINES="`echo \"scale=4; $LINES/$MAX_LINES*100+1\" |bc |awk -F\".\" '{print \$1}'`" echo "The unique lines, {up to $GET_LINES lines}, of type \"$CHECKS\" in $1 are:" |$TEE $CONLOG grep -i "$CHECKS" "$1" |sort -uk 6 |$TAIL -"$GET_LINES" |$TEE $CONLOG done fi echo "exiting" } mesg_dmesg_check () { #### This is meant for dmesg... AIX needs something like alog -o -t console which looks completely different if [ ! "$OS" = "AIX" ] ; then dmesg >$DMESG_LOG message_parser "$DMESG_LOG" retain_messages "$DMESG_LOG" "dmesg_Log" "$CA_SUPER_FIELD_LOGS" fi message_parser "$MESSAGE_LOG" echo "before" retain_messages "$MESSAGE_LOG" "Error_Log" "$CA_SUPER_FIELD_LOGS" echo "after" \rm "$DMESG_LOG" 2>/dev/null } I see "exiting", but not "before" top - 12:43:48 up 5 days, 14:50, 3 users, load average: 0.04, 0.15, 0.18 Tasks: 1 total, 0 running, 0 sleeping, 1 stopped, 0 zombie %Cpu(s): 0.4 us, 0.3 sy, 0.0 ni, 99.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st KiB Mem : 16268312 total, 8696812 free, 1429144 used, 6142356 buff/cache KiB Swap: 16777212 total, 16777212 free, 0 used. 14466576 avail Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 21753 root 20 0 119096 4120 1636 T 0.0 0.0 0:00.25 ksh I am noticing that the RAM usage on RHEL 7 is also higher, and I am starting to wonder again if there is indeed a limit being placed somewhere. However, this thing should be crashing/exiting or seg-faulting somewhere if it hit a hard limit. There are no core files anywhere on the systems. {RHEL 6 or 7} You should install debuginfo symbols for ksh (debuginfo-install ksh), reproduce the problem and execute 'pstack <ksh process id>" to check where it hangs in the code. These servers are all in a secure environment. I neither have access to, or have the legal ability to install those packages. I have not been able to reproduce this bug and it will not be possible for me to make any progress with current data. Please reopen if you are able to provide a reproducer. I don't appreciate the ticket being closed while we are still actively looking for a solution. I was able to find a system in a non-secure environment, to run this on, and here is the stack trace: [root@dev1a ~]# pstack 8736 #0 0x0000003a63aacbd0 in __pause_nocancel () from /lib64/libc.so.6 #1 0x000000000047265c in sleep () #2 0x000000000047274f in sh_delay () #3 0x0000000000472904 in b_sleep () #4 0x000000000045e0f6 in sh_exec () #5 0x0000000000407660 in exfile () #6 0x0000000000407e70 in sh_main () #7 0x0000003a63a1ed1d in __libc_start_main () from /lib64/libc.so.6 #8 0x0000000000406b39 in _start () The debuginfo symbols are missing from your system, but the stacktrace shows that sleep builtin is being executed. I would suggest you to look at any calls to 'sleep' builtin in your scripts. My apologies... I ran a trace on the tripwire fork and not the main execution. Here is the proper trace: [root@dev1a ~]# pstack 8698 #0 0x0000003a63aac82e in waitpid () from /lib64/libc.so.6 #1 0x0000000000428fca in job_reap () #2 0x0000000000429fc6 in job_wait () #3 0x000000000045cd64 in sh_exec () #4 0x0000000000454448 in sh_subshell () #5 0x00000000004377cf in comsubst () #6 0x00000000004371c3 in copyto () #7 0x000000000043874d in sh_macexpand () #8 0x000000000047a3bb in sh_argbuild () #9 0x0000000000459637 in sh_exec () #10 0x000000000045b153 in sh_exec () #11 0x000000000045992b in sh_exec () #12 0x000000000045c91b in sh_exec () #13 0x000000000045992b in sh_exec () #14 0x000000000045c91b in sh_exec () #15 0x000000000045a2fa in sh_exec () #16 0x000000000045b3b5 in sh_exec () #17 0x000000000045994a in sh_exec () #18 0x000000000046daed in b_dot_cmd () #19 0x000000000045ed75 in sh_funct () #20 0x000000000045aca2 in sh_exec () #21 0x0000000000454448 in sh_subshell () #22 0x00000000004377cf in comsubst () #23 0x00000000004371c3 in copyto () #24 0x00000000004374ea in sh_mactrim () #25 0x000000000044048f in nv_setlist () #26 0x0000000000459ed3 in sh_exec () #27 0x000000000045992b in sh_exec () #28 0x000000000046daed in b_dot_cmd () #29 0x000000000045ed75 in sh_funct () #30 0x000000000045aca2 in sh_exec () #31 0x000000000045992b in sh_exec () #32 0x000000000046daed in b_dot_cmd () #33 0x000000000045ed75 in sh_funct () #34 0x000000000045aca2 in sh_exec () #35 0x000000000045992b in sh_exec () #36 0x0000000000459afb in sh_exec () #37 0x000000000045992b in sh_exec () #38 0x000000000045b17a in sh_exec () #39 0x0000000000407660 in exfile () #40 0x0000000000407e70 in sh_main () #41 0x0000003a63a1ed1d in __libc_start_main () from /lib64/libc.so.6 #42 0x0000000000406b39 in _start () debuginfo symbols are still missing from the stacktrace. Do you have a Red Hat support ticket for this bug ? If not, I would suggest you to start by opening a ticket with Red Hat Customer Support. They will help you in narrowing down this issue. [root@dev1a ~]# rpm -qa |grep debuginfo ksh-debuginfo-20120801-33.el6.x86_64 [root@dev1a ~]# ps -aux |grep hcm.sh Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.8/FAQ root 8698 0.0 0.0 110792 3248 pts/3 S+ 13:07 0:01 ksh hcm.sh -pre root 28125 0.0 0.0 103328 888 pts/5 S+ 14:27 0:00 grep hcm.sh [root@dev1a ~]# pstack 8698 #0 0x0000003a63aac82e in waitpid () from /lib64/libc.so.6 #1 0x0000000000428fca in job_reap () #2 0x0000000000429fc6 in job_wait () #3 0x000000000045cd64 in sh_exec () #4 0x0000000000454448 in sh_subshell () #5 0x00000000004377cf in comsubst () #6 0x00000000004371c3 in copyto () #7 0x000000000043874d in sh_macexpand () #8 0x000000000047a3bb in sh_argbuild () #9 0x0000000000459637 in sh_exec () #10 0x000000000045b153 in sh_exec () #11 0x000000000045992b in sh_exec () #12 0x000000000045c91b in sh_exec () #13 0x000000000045992b in sh_exec () #14 0x000000000045c91b in sh_exec () #15 0x000000000045a2fa in sh_exec () #16 0x000000000045b3b5 in sh_exec () #17 0x000000000045994a in sh_exec () #18 0x000000000046daed in b_dot_cmd () #19 0x000000000045ed75 in sh_funct () #20 0x000000000045aca2 in sh_exec () #21 0x0000000000454448 in sh_subshell () #22 0x00000000004377cf in comsubst () #23 0x00000000004371c3 in copyto () #24 0x00000000004374ea in sh_mactrim () #25 0x000000000044048f in nv_setlist () #26 0x0000000000459ed3 in sh_exec () #27 0x000000000045992b in sh_exec () #28 0x000000000046daed in b_dot_cmd () #29 0x000000000045ed75 in sh_funct () #30 0x000000000045aca2 in sh_exec () #31 0x000000000045992b in sh_exec () #32 0x000000000046daed in b_dot_cmd () #33 0x000000000045ed75 in sh_funct () #34 0x000000000045aca2 in sh_exec () #35 0x000000000045992b in sh_exec () #36 0x0000000000459afb in sh_exec () #37 0x000000000045992b in sh_exec () #38 0x000000000045b17a in sh_exec () #39 0x0000000000407660 in exfile () #40 0x0000000000407e70 in sh_main () #41 0x0000003a63a1ed1d in __libc_start_main () from /lib64/libc.so.6 #42 0x0000000000406b39 in _start () I'm not sure what else to do, the debuginfo package was installed prior to the execution that is currently hanging {I've left it hung so that it stays the same process ID, currently hung 81 minutes}. Was there something else I missed? Ok, so the code above is still breaking as-is. I have found another method which works, which leads me to believe this is a stream/pipe bug, possibly in libreadline, or the shell's use of it. old_IFS=$IFS ####IFS should be set to $'\n' but it does not properly translate on some OSes, so it is a hard CR here IFS=$' ' for line in $1 do x=1 length=${#line} while [ $x -le $length ] do CHAR="`echo \"$line\" |cut -c$x`" if [ -z "`echo \"$CHAR\" |egrep \"[0-9]|[a-z]|[A-Z]|[._/-]\"`" ] ; then DECIMAL="`echo \"$CHAR\" |od -t d1 |awk '{printf \"%s\",\$2}'`" HEX="`echo \"obase=16; $DECIMAL\" |bc`" printf "%%%s" "$HEX" else printf "%s" "$CHAR" fi x=`expr $x + 1` done printf "%s" "%0A" done IFS=$old_IFS } The IFS set has to be done with an actual carriage return, as this script needs to work on about 32 variants of Unix, and the \n is not interpreted properly on all of them. For some reason, modifying the IFS statement and running a for loop, works perfectly. The while loop above takes an echo in as a pipe, and from what I can tell, it hangs in the background waiting on input from read. the IFS= does not properly set in the while loop like it is supposed to, and I assume that is the reason it seems to be waiting on read. My best guess is that read is waiting on user input inside of a forked process, and that process lost it's way to send that back to the invoking shell. Maybe not, but at this point it's all I've got. Since this is 100% repeatable for me on all my RHEL 6 boxes, and the for loop does work, this is definitely a bug somewhere in ksh or read. Red Hat Enterprise Linux 6 is in the Production 3 Phase. During the Production 3 Phase, Critical impact Security Advisories (RHSAs) and selected Urgent Priority Bug Fix Advisories (RHBAs) may be released as they become available. The official life cycle policy can be reviewed here: http://redhat.com/rhel/lifecycle This issue does not meet the inclusion criteria for the Production 3 Phase and will be marked as CLOSED/WONTFIX. If this remains a critical requirement, please contact Red Hat Customer Support to request a re-evaluation of the issue, citing a clear business justification. Note that a strong business justification will be required for re-evaluation. Red Hat Customer Support can be contacted via the Red Hat Customer Portal at the following URL: https://access.redhat.com/ |