Bug 509866

Summary: [RHEL5.3] Even if a process have received data but schedule() in select() cannot return
Product: Red Hat Enterprise Linux 5 Reporter: Flavio Leitner <fleitner>
Component: kernel Assignee: Cong Wang <amwang>
Status: CLOSED ERRATA QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: medium Docs Contact:
Priority: medium    
Version: 5.2 CC: cward, dhoward, dzickus, fleitner, jolsa, kzhang, nmurray, phan, rkhan, tao
Target Milestone: rc Keywords: ZStream
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
When data was transmitted from a server process to a client process while the client process was waiting for data in the select() function, the client process might not have returned from the select() function. With this update, the client process returns from the select() function.
Story Points: ---
Clone Of: 494404 Environment:
Last Closed: 2010-03-30 07:29:24 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 533192, 596383, 648823    

Description Flavio Leitner 2009-07-06 15:16:19 UTC
+++ This bug was initially created as a clone of Bug #494404 +++



Escalated to Bugzilla from IssueTracker

--- Additional comment from tao on 2009-04-06 14:10:01 EDT ---

(1) Category
	Defect Report

(2) Abstract
	Even though a process has received data, schedule() in select() does not return.

(3) Symptom
	In the application product, the client process does not wake up
	even when the server process transmits data while the client
	process is waiting for data in select().

        Server process               Client process
                                         readv()
                                        select()
            writev()   --------------> Not return from select()

(4) Environment
	RHEL4.5
	2.6.9-55.0.12.ELsmp(EM64T)

(5) Recreation Steps
	The problem occurs when the application repeats local data delivery
	many times.
	We made a simple reproducer and are trying to reproduce the problem.
	However, the phenomenon has not been reproduced yet.
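
A reproducer of this kind could look like the following minimal sketch. It is not the reporter's reproducer, which is not included in this report; the function name run_rounds, the use of a local socket pair, the round count and the timeout are all assumptions chosen for illustration. A writer thread sends one byte per round while the client waits in select() with a timeout, so a missed wakeup would show up as a counted timeout instead of a hang.

```python
import select
import socket
import threading


def run_rounds(rounds=1000, timeout=2.0):
    """Count rounds in which select() timed out on the client side."""
    server, client = socket.socketpair()   # connected local stream sockets
    go = threading.Semaphore(0)            # paces the writer: one write per round

    def serve():
        for _ in range(rounds):
            go.acquire()
            server.sendall(b"x")           # stands in for the server's writev()

    writer = threading.Thread(target=serve, daemon=True)
    writer.start()
    missed = 0
    try:
        for _ in range(rounds):
            go.release()                   # writer may now run concurrently
            readable, _, _ = select.select([client], [], [], timeout)
            if not readable:
                missed += 1                # no wakeup within the timeout
            client.recv(1)                 # drain so the next round starts empty
        writer.join()
    finally:
        client.close()
        server.close()
    return missed


if __name__ == "__main__":
    print("rounds without wakeup:", run_rounds())
```

On a kernel without the problem the count is expected to stay at zero. The semaphore is released immediately before select() so that the write lands as close as possible to the moment the client enters the wait.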

(6) Investigation
	We investigated a system on which the phenomenon occurred.
	We found that the process waiting in select() was linked into
	the wait queue, and that the received data was stored in the
	reception queue of the process.

	Details are as follows.
	* server process: pdfes
	* client process: pdbes
	* The client process (PID 16812) did not return from select().

	[Backtrace of PID16812]
	crash> bt 16812
	PID: 16812  TASK: 1020cbd97f0       CPU: 2   COMMAND: "pdbes"
	 #0 [1001e38dca8] schedule at ffffffff8030c89e
	 #1 [1001e38dd80] schedule_timeout at ffffffff8030d331
	 #2 [1001e38dde0] do_select at ffffffff8018cabf
	 #3 [1001e38ded0] sys_select at ffffffff8018ce3e
	 #4 [1001e38df80] system_call at ffffffff8011026a
	    RIP: 0000003df2ec0176  RSP: 0000002b1ec27000  RFLAGS: 00010246
	    RAX: 0000000000000017  RBX: ffffffff8011026a  RCX: 0000002b0aec9570
	    RDX: 0000000000000000  RSI: 00000000005588b8  RDI: 0000000000000007
	    RBP: 0000000000000000   R8: 0000000000000000   R9: 000000000000000b
	    R10: 0000000000000000  R11: 0000000000000202  R12: 0000000000000000
	    R13: 0000007fbffffc10  R14: 0000000000406c70  R15: 0000007fbfffd0c0
	    ORIG_RAX: 0000000000000017  CS: 0033  SS: 002b

	[Status of WAIT queue]
	crash> net -s 16812
	PID: 16812  TASK: 1020cbd97f0       CPU: 2   COMMAND: "pdbes"
	FD      SOCKET            SOCK       FAMILY:TYPE SOURCE-PORT DESTINATION-PORT
	 3      1016c9118c0      10110e6a0c0 INET:STREAM  0.0.0.0-0 0.0.0.0-0
	 4      10145904680      100253e4040 UNIX:STREAM
	 6      10066d22400      1016f4d8700 INET:STREAM  0.0.0.0-0 0.0.0.0-2768

	crash> struct sock 0x100253e4040 | grep sk_sleep
 	 sk_sleep = 0x101459046b0,

	crash> waitq 0x101459046b0
	PID: 16812  TASK: 1020cbd97f0       CPU: 2   COMMAND: "pdbes"

	[Result of netstat]
	# netstat -anp |grep 16812
	-------------------------------------------------------------------
	tcp         0      0 0.0.0.0:57192               0.0.0.0:*                   LISTEN      16812/pdbes
	tcp     13572      0 10.208.131.224:54096        10.208.131.227:57147        ESTABLISHED 16812/pdbes
	unix  2      [ ACC ]     STREAM     LISTENING     2464034988 16812/pdbes         /dev/HiRDB/pth/tk26847
	-------------------------------------------------------------------
	* There are 13572 bytes of data in the reception queue of PID 16812.

	[Collection of the system info by systemtap]
	Based on the findings above, we collected further information
	with systemtap and found that the server process did not call
	try_to_wake_up().
	The state of the WAIT queue and the netstat output were the same
	as in the investigation of PID 16812 above.

	* The client process (PID 17519) did not return from select().
	----------------------------------------------------------------------
	…
	pdbes : do_select(pid:17519)
	pdbes : add_wait_queue(pid:17519)
	pdbes : add_wait_queue(pid:17519)
	pdbes : add_wait_queue(pid:17519)
	pdfes : sock_def_readable(sock:0x101CEEAF840)  //pdbes : PID 17519
	pdfes : try_to_wake_up(17519)
	pdbes : do_select(pid:17519)
	pdbes : add_wait_queue(pid:17519)
	pdbes : add_wait_queue(pid:17519)
	pdbes : add_wait_queue(pid:17519)
	pdfes : sock_def_readable(sock:0x101CEEAF840)  //pdbes : PID 17519
	----------------------------------------------------------------------
	=> The trace for the client process (PID 17519) is shown above.
	   It seems that try_to_wake_up() was not called after the last
	   sock_def_readable().
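
The pattern in this trace can be checked mechanically. The sketch below is a hypothetical helper (unanswered_wakeups is not part of systemtap or crash) that scans log lines in the format shown above and reports every sock_def_readable() event that is not directly followed by try_to_wake_up(). It assumes one event per line and that a wakeup, when it happens, is logged on the very next line; the elided part of the log is not included.

```python
import re

# "<command> : <function>(<arguments>)", optionally followed by a comment
EVENT = re.compile(r"^\s*(\w+)\s*:\s*(\w+)\(([^)]*)\)")

TRACE = """\
pdbes : do_select(pid:17519)
pdbes : add_wait_queue(pid:17519)
pdbes : add_wait_queue(pid:17519)
pdbes : add_wait_queue(pid:17519)
pdfes : sock_def_readable(sock:0x101CEEAF840)  //pdbes : PID 17519
pdfes : try_to_wake_up(17519)
pdbes : do_select(pid:17519)
pdbes : add_wait_queue(pid:17519)
pdbes : add_wait_queue(pid:17519)
pdbes : add_wait_queue(pid:17519)
pdfes : sock_def_readable(sock:0x101CEEAF840)  //pdbes : PID 17519
"""


def unanswered_wakeups(trace):
    """Return the event indexes of sock_def_readable() calls that are
    not directly followed by a try_to_wake_up() event."""
    events = []
    for line in trace.splitlines():
        match = EVENT.match(line)
        if match:
            events.append(match.group(2))   # keep only the function name
    missing = []
    for i, name in enumerate(events):
        if name == "sock_def_readable":
            following = events[i + 1] if i + 1 < len(events) else None
            if following != "try_to_wake_up":
                missing.append(i)
    return missing


print(unanswered_wakeups(TRACE))  # [10]
```

Event 10 is the final sock_def_readable() in the excerpt. A log that simply ends early would produce the same result, so the output is only a hint to look at that point in the trace.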

	The investigation shows the following points.

	- try_to_wake_up() was not called. The task is not yet on the WAIT queue
	  when pdfes tries to wake it (when calling sock_def_readable()).
	- The process is added to the WAIT queue after the phenomenon occurs.
	- After the phenomenon occurs, tp->rcv_nxt is updated and the data is stored in the reception queue.
	  * The size of the reception queue is calculated by using tp->rcv_nxt in "netstat -anp"
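
To illustrate the last point: the sketch below assumes that the queue size shown by netstat is the difference between tp->rcv_nxt and the sequence number already copied to user space (called copied_seq here). That second field and the 32-bit wraparound handling are assumptions about the kernel that are not stated in this report, and the sequence values are made up.

```python
def recv_queue_bytes(rcv_nxt, copied_seq):
    """Bytes received but not yet read by the application, using
    32-bit sequence-number arithmetic (wraps modulo 2**32)."""
    return (rcv_nxt - copied_seq) & 0xFFFFFFFF


# Only the difference matters; it grows as rcv_nxt advances while
# the sleeping reader leaves copied_seq unchanged.
print(recv_queue_bytes(1_000_013_572, 1_000_000_000))  # 13572
print(recv_queue_bytes(100, 0xFFFFFFF0))               # 116 (sequence wrapped)
```

Under this assumption the counter keeps growing while the reader sleeps, which matches a non-zero queue size for a process that is still blocked in select().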

	We think the cause of this problem might be that try_to_wake_up() was not
	called when data was received, because the local delivery procedure of the
	server process conflicted with the select() procedure of the client process.
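
The kind of conflict suspected here can be illustrated with a toy model. This is a sketch of a generic lost-wakeup interleaving, not the kernel code and not a confirmed diagnosis; the Sock class, its fields and the two waiter functions are hypothetical names, and the interleaving is forced by calling the writer at a fixed point instead of using real threads.

```python
class Sock:
    """Toy stand-in for a socket with a receive queue and a wait queue."""

    def __init__(self):
        self.rx_queue = []   # received data not yet read
        self.waiters = []    # tasks currently registered on the wait queue
        self.woken = []      # tasks for which a wakeup was issued

    def deliver(self, data):
        """Writer side: queue the data, then wake whoever is waiting now."""
        self.rx_queue.append(data)
        for task in self.waiters:
            self.woken.append(task)   # stands in for try_to_wake_up()


def waiter_checks_then_registers(sock, task, writer):
    """Problematic order: check for data, writer runs, then register."""
    ready = bool(sock.rx_queue)       # 1. nothing queued yet
    writer()                          # 2. writer delivers and sees no waiter
    sock.waiters.append(task)         # 3. registration comes too late
    return ready or task in sock.woken


def waiter_registers_then_checks(sock, task, writer):
    """Safe order: register first, then check for data."""
    sock.waiters.append(task)
    writer()
    ready = bool(sock.rx_queue)
    return ready or task in sock.woken


racy = Sock()
print(waiter_checks_then_registers(racy, 17519, lambda: racy.deliver(b"x")))  # False
print(racy.rx_queue, racy.waiters, racy.woken)  # [b'x'] [17519] []

safe = Sock()
print(waiter_registers_then_checks(safe, 17519, lambda: safe.deliver(b"x")))  # True
```

In the first case the model ends in the same state as the observations above: data is queued, the task is on the wait queue, and no wakeup was issued. Whether the kernel actually reached that state this way is exactly what the report leaves open.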

Comment 2 RHEL Program Management 2009-11-25 23:10:30 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 3 Don Zickus 2009-12-02 21:02:27 UTC
in kernel-2.6.18-176.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 4 Don Zickus 2009-12-02 21:13:54 UTC
in kernel-2.6.18-176.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 7 Chris Ward 2010-02-11 10:10:42 UTC
~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~

RHEL 5.5 Beta has been released! There should be a fix present in this 
release that addresses your request. Please test and report back results 
here, by March 3rd 2010 (2010-03-03) or sooner.

Upon successful verification of this request, post your results and update 
the Verified field in Bugzilla with the appropriate value.

If you encounter any issues while testing, please describe them and set 
this bug into NEED_INFO. If you encounter new defects or have additional 
patch(es) to request for inclusion, please clone this bug per each request
and escalate through your support representative.

Comment 9 errata-xmlrpc 2010-03-30 07:29:24 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html

Comment 16 Jaromir Hradilek 2010-12-14 14:27:12 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
When data was transmitted from a server process to a client process while the client process was waiting for data in the select() function, the client process might not have returned from the select() function. With this update, the client process returns from the select() function.