Bug 1136072

Summary:

Calling clnt_call() with a timeout of 0 results in the recv-q filling up and eventual connection failure

Product:

Red Hat Enterprise Linux 6

Reporter:

noah davids <ndavids>

Component:

kernel

Assignee:

Rashid Khan <rkhan>

kernel sub component:

Networking

QA Contact:

Network QE <network-qe>

Status:

CLOSED NOTABUG

Docs Contact:

Severity:

unspecified

Priority:

unspecified

CC:

jmaxwell, jpirko, kzhang, steved, tbowling, tgraf

Version:

6.5

Target Milestone:

Target Release:

---

Hardware:

Unspecified

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2014-09-03 15:29:12 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
RHEL system is 10.3.233.120, the issue starts at frame 1398 when RHEL ACKs 10109	none

Description noah davids 2014-09-01 15:07:02 UTC

Created attachment 933467
RHEL system is 10.3.233.120, the issue starts at frame 1398 when RHEL ACKs 10109

Description of problem:
The application is calling clnt_call() with a timeout of 0. The recv-q is not being drained, and at some point the system stops ACKing data. The system is not advertising a zero window because of an interaction between window scaling and the small size of the packets being received. The missing zero-window advertisement is not the issue, however; the issue is that the recv-q is not being drained.


Version-Release number of selected component (if applicable):


How reproducible:
This appears to be very reproducible.


Steps to Reproduce:
1. Create an RPC client application.
2. Call clnt_call() with a timeout of 0.

Actual results:
The receive queue is not being drained and eventually reaches capacity.

Expected results:
Drain the receive queue.

Additional info:

Comment 1 Steve Dickson 2014-09-03 15:29:12 UTC

The application is not reading the replies sent by the server,
which causes the read queue holding those replies to fill up and
ultimately resets the connection.

Having the application read the replies by setting the timeout = 1
fixes the problem.
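
As a rough illustration of the point above, the following sketch shows unread data piling up in a socket's receive queue until the receiver reads it. It is not RPC code: it assumes a local AF_UNIX socketpair as a stand-in for the TCP connection, and the helper names unread_bytes() and queue_unread_replies() are hypothetical, chosen only for this example.

```c
#include <assert.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: number of bytes waiting unread on fd, or -1 on error. */
int unread_bytes(int fd)
{
    int n = 0;

    if (ioctl(fd, FIONREAD, &n) < 0)
        return -1;
    return n;
}

/*
 * Hypothetical helper: one end of a socketpair sends `count` small 4-byte
 * "replies". If `drain` is zero the other end never reads them, which mimics
 * a caller that never collects its replies. Returns the number of bytes left
 * in the receive queue, or -1 on error.
 */
int queue_unread_replies(int count, int drain)
{
    int sv[2];
    int i, queued;
    const char reply[4] = { 'r', 'p', 'l', 'y' };
    char buf[256];

    assert(count >= 0);
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    for (i = 0; i < count; i++) {
        if (write(sv[0], reply, sizeof reply) != (ssize_t)sizeof reply) {
            close(sv[0]);
            close(sv[1]);
            return -1;
        }
    }

    if (drain) {
        /* Reading is the only thing that empties the receive queue. */
        while (unread_bytes(sv[1]) > 0) {
            if (read(sv[1], buf, sizeof buf) <= 0)
                break;
        }
    }

    queued = unread_bytes(sv[1]);
    close(sv[0]);
    close(sv[1]);
    return queued;
}
```

Calling queue_unread_replies(100, 0) leaves every byte that was sent sitting in the queue, while queue_unread_replies(100, 1) leaves none. On a real TCP connection the same backlog appears as a growing Recv-Q value.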

Comment 2 Terry Bowling 2014-09-03 18:29:56 UTC

Made bug publicly visible for customer visibility.

A separate email thread explained that RPC is a bidirectional protocol. Thus, when clnt_call() is executed, a response is expected by definition of the RPC protocol, even if the timeout is set to 0.

Due to the significant use of this in legacy applications, it would not be safe to suddenly change this behavior, even if the change is meant to protect developers from this issue.

If it is critical to have timeout=0 for performance reasons, then a suggested workaround is to add a counter mechanism to the code so that for every 1000th execution of clnt_call(), the timeout is changed temporarily to 1. This would allow for periodic draining of the replies from the recv-q.

This workaround would need to be tested and tuned for the particular application (counter interval of 100, 1,000, 10,000, etc.), as there is no way to know in advance the impact on a particular application.