Bug 119304

Summary: mysql query hangs at _dl_sysinfo_int80 () from /lib/ld-linux.so.2
Product: Red Hat Enterprise Linux 3
Reporter: Alois Treindl <alois>
Component: mysql
Assignee: Tom Lane <tgl>
Status: CLOSED NOTABUG
Severity: medium
Priority: medium
Version: 3.0
CC: hhorak
Hardware: i386
OS: Linux
Last Closed: 2004-03-30 14:34:09 UTC

Description Alois Treindl 2004-03-29 09:14:20 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030922

Description of problem:
An application uses the C interface to mysql and calls
a remote mysql database.
Occasionally, the mysql calls hang forever. It happens in mysql_query
or in mysql_ping(), which I have now inserted before each mysql_query.

In gdb, I see this stack trace for a hanging application:
(gdb) bt
#0  0xb75ebc32 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0xb7583943 in __read_nocancel () from /lib/tls/libpthread.so.0
#2  0x08218b1a in vio_read ()
#3  0x08219650 in my_real_read ()
#4  0x08219802 in my_net_read ()
#5  0x0820d57b in net_safe_read ()
#6  0x0820d93b in simple_command ()
#7  0x08210b4a in mysql_ping ()
#8  0x080c9cba in awd_query (
    query=0xbfff9380 "SELECT uid FROM osq WHERE nosq=40860011",
    serr=0xbfff9520 "") at awd.c:46
....

It seems that after about 15 minutes a timeout happens, and the
hanging call returns, but I am not sure whether this happens only
under gdb debugging.
If the timeout is genuine, it would be enough to learn where I can
shorten the timeout.


Version-Release number of selected component (if applicable):
kernel 2.4.21-9.0.1.ELsmp, mysql 4.0.17

How reproducible:
Sometimes

Steps to Reproduce:
1. start the application, which opens a mysql connection, and do some calls
2. keep the application running but do not make any mysql calls for
about 20 minutes or longer
3. now, have the application make another mysql call, and see whether
it hangs.
    

Actual Results:  So far, we have not been able to reproduce it in a
controlled situation.
The problem happens with end users running an accounting application
which occasionally makes mysql calls to a remote mysql database.
These remote calls are rare, often with an hour's pause in between.

The problem was diagnosed by attaching a gdb session to a hanging
process of an end user.


Additional info:

Comment 1 Tom Lane 2004-03-29 15:47:34 UTC
Hm, does the connection to the remote database pass through a
firewall?  Some firewalls will drop connections after a certain period
of inactivity.  If you can't adjust the firewall's behavior, it may be
that the only solution is not to assume that you can hold the remote
connection open indefinitely.

Comment 2 Alois Treindl 2004-03-29 17:01:14 UTC
Yes, the connection passes through a firewall, actually ipvsadm
in a load balancer. But before we moved the application to RHEL 3,
it had been running for several years on HPUX, using exactly the same
connection to the remote mysql database. We never had this problem there.

By the way, the remote mysql server is a 3.23.37 release, while the
local mysql client libraries are from the 4.0 tree. I am not sure 
whether that has anything to do with the problem, and I understand that
Red Hat does not support mysql 4.x for RHEL 3.

Comment 3 Tom Lane 2004-03-29 18:33:31 UTC
Indeed, there's only a limited amount that I can do with mysql 4
problem reports. (I do hope they fix the licensing problem soon,
because it's pretty silly that we can't ship mysql 4, but I'm just an
engineer...)

It seems like you have quite a lot of variables here: you have a
platform move and possibly a mysql version move (were you running
4.0.17 or something older when you were on HPUX?).  This makes it
really hard to tell whether the issue is an RHEL bug or a mysql bug. 
If it is in mysql 4 then there's not much I can do except send you to
MySQL AB.  If it is RHEL then I will have to reassign it to a kernel
or glibc hacker, but I'm not going to bother those guys until I have
enough evidence to at least tell whose problem it's likely to be.

In short then we need more data.  One thing I am wondering about is
whether the command that's supposed to be sent to the remote database
is actually getting flushed out completely.  Obviously if it doesn't
get sent fully then the remote wouldn't respond.  One way to look into
that is to watch the remote connection with a packet sniffer such as
tcpdump or ethereal --- are you in a position to do that?

Comment 4 Alois Treindl 2004-03-29 19:20:36 UTC
Yes, I can watch the network. 

I have not used a sniffer ever, and would appreciate
some advice.

My setup is like this
Client Machine A ------ LVS load balancer + NAT ---- Mysql host B
The mysql host listens on host B port 3306, to which I send packets
from A, any random port.
Then A receives reply packets from B, port 3306.

Where should I listen, and how do I set up the sniffer properly?
Host B receives a lot of other mysql traffic from within the
load-balanced cluster, but these connections are easy to differentiate
because they come from different IP addresses.
The packets from A will appear at B under the IP address of the
load balancer, which sends no other requests.

There are several simultaneous users on machine A with occasional
mysql queries, but they all log their traffic in log files I have
set up so that I should be able to sort out what came from whom
later, via the timestamps. 

Comment 5 Alois Treindl 2004-03-29 19:24:19 UTC
PS: before the move from HPUX to RHEL 3, yes, the mysql client
was 3.23.37 too.
But as we were moving to RHEL, I thought it a good idea to update,
even at the risk of losing Red Hat support.
The load balancer runs Red Hat 7.0, and the mysql host Red Hat 7.1.
I assume this is not relevant to the problem.

Comment 6 Tom Lane 2004-03-29 19:34:43 UTC
I think you'd be best off sniffing the single connection at the client
machine, so as to minimize the amount of data you have to look
through.  I'm not much of a tcpdump expert myself, but I know you can
filter by port number to watch just one connection.  So it would go
something like: start up client, determine what port is being used for
the remote database connection (lsof should work if nothing else),
start tcpdump, do something (to have a comparison point for what a
working ping looks like), wait 20 minutes, do something again, repeat
till failure, compare traces.

BTW, if you use tcpdump, be aware that the default output format omits
most of the non-header content of packets; you'll need to select
options that show you the whole content of each packet.  Not sure what
ethereal's defaults are like, I've hardly used it.
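The procedure above might look something like this on the client machine (the interface name, the server hostname, and the PID are hypothetical placeholders; exact flags vary by tcpdump version):

```shell
# Find the local port of the established mysql connection
# (1234 stands in for the client application's PID)
lsof -p 1234 -a -i tcp | grep 3306

# Capture the full content of every packet on that connection:
#   -s 0  capture whole packets, not just the default header snaplen
#   -A    print packet payload as ASCII
# (or use -w trace.pcap to save raw packets for later analysis)
tcpdump -i eth0 -s 0 -A 'tcp port 3306 and host mysqlhost'
```

Running the same capture on both ends, as the reporter does below, shows whether packets leave the client but never arrive at the server.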

Comment 7 Alois Treindl 2004-03-30 09:35:46 UTC
I have set up tcpdump sniffers on both sides, at the client and
at the server.

After running the sniffers for a while, I found this:

if the client does not send packets for some time (a few minutes),
and then sends mysql packets again (via mysql_ping), these packets
never arrive at the mysql server.

The client keeps trying, but never gets a reply.

I do not yet know who is responsible for terminating the connection.
Because it used to work before, with the HPUX client and the same
firewall/load balancer, I assume it is not the load balancer.

I have decided to program a workaround into our application: the
client terminates each mysql connection voluntarily after making
a short sequence of queries, and opens a new connection when needed.
It does not try to keep a permanent connection.

That way, the problem is solved for me. The bugzilla case can be
closed, as solved by workaround.


Comment 8 Tom Lane 2004-03-30 14:34:09 UTC
It sure sounds like the firewall is at fault.  That's an unusually
short timeout though.

Is it possible that the firewall was configured to not drop
connections from your old client machine, and that that preference
didn't get changed over to point to your new client's IP address? 
This is a bizarre theory but I have no others at the moment.

Anyway, I'm going to close this bug since it's evidently not either
mysql's or RHEL's fault.  Feel free to reopen it if you get new
evidence pointing back at us.

Comment 9 Alois Treindl 2004-03-30 14:55:12 UTC
The firewall rules were not touched, and there is no rule that
should create different behaviour. Actually, the new client inherited
the old HPUX client's IP address.

Anyway, my workaround works.

I implemented a last_time_talked flag which remembers when the
mysql connection was last used.
If a new request sees that last_time_talked is too far in the
past, it closes the connection and opens a new one. Apparently it can
close the connection without sending a single packet, so there is no
danger of a hangup.

That way, we avoid using old connections which might have gone stale.