+++ This bug was initially created as a clone of Bug #494404 +++

Escalated to Bugzilla from IssueTracker

--- Additional comment from tao on 2009-04-06 14:10:01 EDT ---

(1) Category
Defect Report

(2) Abstract
Even after a process has received data, schedule() in select() does not return.

(3) Symptom
In the application product, the client process does not wake up even when the server process transmits data while the client process is waiting for data in select().

  Server process           Client process
  readv()                  select()
  writev() --------------> Not return from select()

(4) Environment
RHEL4.5 2.6.9-55.0.12.ELsmp(EM64T)

(5) Recreation Steps
The problem occurs when the application repeats local data delivery many times. We made a simple reproducer and are trying to reproduce the problem with it. However, the phenomenon has not been reproduced yet.

(6) Investigation
We investigated the system on which the phenomenon occurred. We found that the process waiting in select() was linked on the wait queue, and that the received data was stored in the receive queue of the process. Details are as follows.

* server process: pdfes
* client process: pdbes
* The client process of PID 16812 did not return from select().
[Backtrace of PID16812]
crash> bt 16812
PID: 16812  TASK: 1020cbd97f0  CPU: 2  COMMAND: "pdbes"
 #0 [1001e38dca8] schedule at ffffffff8030c89e
 #1 [1001e38dd80] schedule_timeout at ffffffff8030d331
 #2 [1001e38dde0] do_select at ffffffff8018cabf
 #3 [1001e38ded0] sys_select at ffffffff8018ce3e
 #4 [1001e38df80] system_call at ffffffff8011026a
    RIP: 0000003df2ec0176  RSP: 0000002b1ec27000  RFLAGS: 00010246
    RAX: 0000000000000017  RBX: ffffffff8011026a  RCX: 0000002b0aec9570
    RDX: 0000000000000000  RSI: 00000000005588b8  RDI: 0000000000000007
    RBP: 0000000000000000   R8: 0000000000000000   R9: 000000000000000b
    R10: 0000000000000000  R11: 0000000000000202  R12: 0000000000000000
    R13: 0000007fbffffc10  R14: 0000000000406c70  R15: 0000007fbfffd0c0
    ORIG_RAX: 0000000000000017  CS: 0033  SS: 002b

[Status of WAIT queue]
crash> net -s 16812
PID: 16812  TASK: 1020cbd97f0  CPU: 2  COMMAND: "pdbes"
FD   SOCKET        SOCK          FAMILY:TYPE   SOURCE-PORT   DESTINATION-PORT
 3   1016c9118c0   10110e6a0c0   INET:STREAM   0.0.0.0-0     0.0.0.0-0
 4   10145904680   100253e4040   UNIX:STREAM
 6   10066d22400   1016f4d8700   INET:STREAM   0.0.0.0-0     0.0.0.0-2768

crash> struct sock 0x100253e4040 | grep sk_sleep
  sk_sleep = 0x101459046b0,
crash> waitq 0x101459046b0
PID: 16812  TASK: 1020cbd97f0  CPU: 2  COMMAND: "pdbes"

[Result of netstat]
# netstat -anp |grep 16812
-------------------------------------------------------------------
tcp        0      0 0.0.0.0:57192          0.0.0.0:*              LISTEN      16812/pdbes
tcp    13572      0 10.208.131.224:54096   10.208.131.227:57147   ESTABLISHED 16812/pdbes
unix  2      [ ACC ]     STREAM     LISTENING     2464034988 16812/pdbes   /dev/HiRDB/pth/tk26847
-------------------------------------------------------------------
* There are 13572 bytes of data in the receive queue of PID 16812.

[Collection of the system info by systemtap]
Based on the investigation results above, we collected further information with systemtap and found that the server process did not call try_to_wake_up().
The wait queue and the netstat output show the same situation as in the investigation of PID 16812 above.

* The client process of PID 17519 did not return from select().

----------------------------------------------------------------------
…
pdbes : do_select(pid:17519)
pdbes : add_wait_queue(pid:17519)
pdbes : add_wait_queue(pid:17519)
pdbes : add_wait_queue(pid:17519)
pdfes : sock_def_readable(sock:0x101CEEAF840) //pdbes : PID 17519
pdfes : try_to_wake_up(17519)
pdbes : do_select(pid:17519)
pdbes : add_wait_queue(pid:17519)
pdbes : add_wait_queue(pid:17519)
pdbes : add_wait_queue(pid:17519)
pdfes : sock_def_readable(sock:0x101CEEAF840) //pdbes : PID 17519
----------------------------------------------------------------------
=> The trace for the client process (PID 17519) is shown above. It seems try_to_wake_up() was not called after the last sock_def_readable().

We can state the following points from the investigation.

- try_to_wake_up() was not called. The task had not been added to the wait queue when pdfes tried to wake it (when calling sock_def_readable()).
- The process was added to the wait queue after the phenomenon occurred.
- After the phenomenon occurred, tp->rcv_nxt was updated and the data was stored in the receive queue.
  * "netstat -anp" calculates the size of the receive queue by using tp->rcv_nxt.

We think the cause of this problem might be that try_to_wake_up() was not called when data was received, because the local delivery procedure of the server process conflicted with the select() procedure of the client process.
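
To make the suspected conflict concrete, here is a toy model of one way a wakeup can be lost between a waker and a sleeper. It is only a sketch of the hypothesis in this comment, not kernel code: the four step names are hypothetical labels loosely based on the add_wait_queue() and sock_def_readable() calls in the trace, and the model assumes each side first publishes its own state and then checks the other side's state.

```python
from itertools import permutations

# Hypothetical step names; each tuple lists one side's steps in program order.
WAKER = ("waker_queue_data", "waker_check_queue")    # server side (pdfes)
SLEEPER = ("sleeper_enqueue", "sleeper_check_data")  # client side (pdbes)


def lost_wakeup(order):
    """Replay one ordering of the four steps.

    Returns True when the sleeper neither sees the data nor gets woken,
    i.e. it would sleep with data pending.
    """
    data_queued = on_wait_queue = False
    waker_saw_sleeper = sleeper_saw_data = False
    for step in order:
        if step == "waker_queue_data":
            data_queued = True
        elif step == "waker_check_queue":
            waker_saw_sleeper = on_wait_queue  # would call try_to_wake_up()
        elif step == "sleeper_enqueue":
            on_wait_queue = True
        elif step == "sleeper_check_data":
            sleeper_saw_data = data_queued     # would skip schedule()
    return not (waker_saw_sleeper or sleeper_saw_data)


def program_order_interleavings():
    """Yield every interleaving that keeps each side's own step order."""
    for order in permutations(WAKER + SLEEPER):
        if (order.index(WAKER[0]) < order.index(WAKER[1])
                and order.index(SLEEPER[0]) < order.index(SLEEPER[1])):
            yield order


# An ordering in which both checks effectively happen before both publishes.
REORDERED = ("waker_check_queue", "sleeper_check_data",
             "waker_queue_data", "sleeper_enqueue")
```

In this model, no interleaving that keeps each side's program order loses the wakeup. Only an ordering such as REORDERED, where both checks observe stale state, leaves the sleeper on the wait queue with data pending, which matches the state seen in the crash and netstat output. Whether this is what actually happened on the affected system is an assumption.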
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
in kernel-2.6.18-176.el5

You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However feel free to provide a comment indicating that this fix has been verified.
~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~ RHEL 5.5 Beta has been released! There should be a fix present in this release that addresses your request. Please test and report back results here, by March 3rd 2010 (2010-03-03) or sooner. Upon successful verification of this request, post your results and update the Verified field in Bugzilla with the appropriate value. If you encounter any issues while testing, please describe them and set this bug into NEED_INFO. If you encounter new defects or have additional patch(es) to request for inclusion, please clone this bug per each request and escalate through your support representative.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
When data was transmitted from a server process to a client process while the client process was waiting for data in the select() function, the client process might not have returned from the select() function. With this update, the client process returns from the select() function.
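
The behaviour described in the technical note can be exercised from user space with a small stress loop. This is a minimal sketch, not the reporter's reproducer (which is not included in this report). The function name `count_missed_wakeups`, the use of a local `socketpair()` instead of the application's TCP connection, and the round count and timeout are all assumptions chosen for illustration.

```python
import select
import socket
import threading


def count_missed_wakeups(rounds=200, timeout=2.0):
    """Count select() calls that time out although a byte was sent.

    On a kernel without the problem the expected result is 0.
    """
    client, server = socket.socketpair()  # local stream pair (assumption)
    go = threading.Semaphore(0)

    def sender():
        # Server side: write one byte each time the client starts a round.
        for _ in range(rounds):
            go.acquire()
            server.sendall(b"x")

    thread = threading.Thread(target=sender, daemon=True)
    thread.start()

    missed = 0
    try:
        for _ in range(rounds):
            go.release()  # the write may land before or during select()
            readable, _, _ = select.select([client], [], [], timeout)
            if not readable:
                missed += 1   # data was sent, but select() timed out
            client.recv(1)    # drain the byte so the next round starts empty
    finally:
        thread.join()
        client.close()
        server.close()
    return missed


if __name__ == "__main__":
    print("missed wakeups:", count_missed_wakeups())
```

A nonzero count would indicate that select() slept through a delivery. A count of zero does not prove the problem is absent, since the original report notes that the phenomenon was hard to reproduce.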