From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030922

Description of problem:
An application uses the C interface to mysql and calls a remote mysql database. Occasionally, the mysql calls hang forever. It happens in mysql_query() or in mysql_ping(), which I have now inserted before each mysql_query(). In gdb, I see this stack trace for a hanging application:

(gdb) bt
#0  0xb75ebc32 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0xb7583943 in __read_nocancel () from /lib/tls/libpthread.so.0
#2  0x08218b1a in vio_read ()
#3  0x08219650 in my_real_read ()
#4  0x08219802 in my_net_read ()
#5  0x0820d57b in net_safe_read ()
#6  0x0820d93b in simple_command ()
#7  0x08210b4a in mysql_ping ()
#8  0x080c9cba in awd_query (
    query=0xbfff9380 "SELECT uid FROM osq WHERE nosq=40860011",
    serr=0xbfff9520 "") at awd.c:46
....

It seems that after about 15 minutes a timeout happens and the hanging call returns, but I am not sure whether this happens only under gdb debugging. If the timeout is genuine, it would be enough to learn where I can shorten it.

Version-Release number of selected component (if applicable):
kernel 2.4.21-9.0.1.ELsmp, mysql 4.0.17

How reproducible:
Sometimes

Steps to Reproduce:
1. Start the application, which opens a mysql connection, and make some calls.
2. Keep the application running but do not make any mysql calls for about 20 minutes or longer.
3. Now have the application make another mysql call, and see whether it hangs.

Actual Results:
So far, we have not been able to reproduce it in a controlled situation. The problem happens with end users running an accounting application which occasionally makes mysql calls to a remote mysql database. These remote calls are rare, often with an hour's pause in between. The problem was diagnosed by attaching a gdb session to a hanging process of an end user.

Additional info:
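If the timeout cannot be tuned anywhere, one stopgap that comes to mind is to bound each call with alarm(), so that the blocking read() is interrupted with EINTR instead of hanging forever. A rough, untested sketch (the 30-second limit and the wrapper name are arbitrary, not our actual code):

#include <signal.h>
#include <unistd.h>
#include <mysql.h>

static void on_alarm(int sig)
{
    (void) sig;   /* nothing to do: the point is just to interrupt read() */
}

int query_with_timeout(MYSQL *conn, const char *sql)
{
    struct sigaction sa, old;
    int rc;

    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;              /* no SA_RESTART, so read() fails with EINTR */
    sigaction(SIGALRM, &sa, &old);

    alarm(30);                    /* give the query at most 30 seconds */
    rc = mysql_query(conn, sql);  /* nonzero if the interrupted read failed */
    alarm(0);

    sigaction(SIGALRM, &old, NULL);
    return rc;                    /* after a timeout the connection is dead
                                     and must be reopened */
}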
Hm, does the connection to the remote database pass through a firewall? Some firewalls will drop connections after a certain period of inactivity. If you can't adjust the firewall's behavior, it may be that the only solution is not to assume that you can hold the remote connection open indefinitely.
Yes, the connection passes through a firewall, actually ipvsadm in a load balancer. But before we moved the application to RHEL 3, it had been running for several years on HPUX, using exactly the same connection to the remote mysql database, and we never had this problem there. By the way, the remote mysql server is a 3.23.37 release, while the local mysql client libraries are from the 4.0 tree. I am not sure whether that has anything to do with the problem, and I understand that Red Hat does not support mysql 4.x for RHEL 3.
Indeed, there's only a limited amount that I can do with mysql 4 problem reports. (I do hope they fix the licensing problem soon, because it's pretty silly that we can't ship mysql 4, but I'm just an engineer...)

It seems like you have quite a lot of variables here: you have a platform move and possibly a mysql version move (were you running 4.0.17 or something older when you were on HPUX?). This makes it really hard to tell whether the issue is an RHEL bug or a mysql bug. If it is in mysql 4 then there's not much I can do except send you to MySQL AB. If it is RHEL then I will have to reassign it to a kernel or glibc hacker, but I'm not going to bother those guys until I have enough evidence to at least tell whose problem it's likely to be. In short, then, we need more data.

One thing I am wondering about is whether the command that's supposed to be sent to the remote database is actually getting flushed out completely. Obviously, if it doesn't get sent fully then the remote wouldn't respond. One way to look into that is to watch the remote connection with a packet sniffer such as tcpdump or ethereal --- are you in a position to do that?
Yes, I can watch the network. I have never used a sniffer, though, and would appreciate some advice. My setup is like this:

Client Machine A ------ LVS load balancer + NAT ------ Mysql host B

The mysql host listens on host B, port 3306, to which I send packets from A, from some random port. Then A receives reply packets from B, port 3306. Where should I listen, and how do I set up the sniffer properly? Host B receives a lot of other mysql traffic from within the load-balanced cluster, but that is easy to differentiate because it comes from different IP addresses. The packets from A will appear at B under the IP address of the load balancer, which sends no other requests. There are several simultaneous users on machine A with occasional mysql queries, but they all log their traffic in log files I have set up, so I should be able to sort out later, via the timestamps, what came from whom.
PS: before the move from HPUX to RHEL 3, yes, the mysql client was 3.23.37 too. But as we were moving to RHEL, I thought it a good idea to update, even at the risk of losing Red Hat support. The load balancer runs Red Hat 7.0, and the mysql host Red Hat 7.1. I assume this is not relevant to the problem.
I think you'd be best off sniffing the single connection at the client machine, so as to minimize the amount of data you have to look through. I'm not much of a tcpdump expert myself, but I know you can filter by port number to watch just one connection. So it would go something like: start up client, determine what port is being used for the remote database connection (lsof should work if nothing else), start tcpdump, do something (to have a comparison point for what a working ping looks like), wait 20 minutes, do something again, repeat till failure, compare traces. BTW, if you use tcpdump, be aware that the default output format omits most of the non-header content of packets; you'll need to select options that show you the whole content of each packet. Not sure what ethereal's defaults are like, I've hardly used it.
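For instance, I believe something along these lines would capture the full contents of everything on the database connection (the interface name is a guess, and you'd substitute host B's real address):

    tcpdump -i eth0 -s 1500 -X 'host <B's address> and tcp port 3306'

-s 1500 raises the capture size from the default 68 bytes to a full Ethernet frame, and -X prints each packet in hex and ASCII. Alternatively, -w file.pcap saves a raw capture that ethereal can load later.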
I have set up tcpdump sniffers on both sides, at the client and at the server. After running the sniffers for a while, I found this: if the client does not send packets for some time (a few minutes) and then sends mysql packets again (via mysql_ping), these packets never arrive at the mysql server. The client keeps trying, but never gets a reply. I do not know yet who is responsible for terminating the connection. Because it used to work before, with the HPUX client and the same firewall/load balancer, I assume it is not the load balancer. I have decided to program a workaround into our application: the client terminates each mysql connection voluntarily after making a short sequence of queries, and opens a new connection when needed. It does not try to keep a permanent connection. That way, the problem is solved for me. The bugzilla case can be closed, as solved by workaround.
It sure sounds like the firewall is at fault. That's an unusually short timeout though. Is it possible that the firewall was configured to not drop connections from your old client machine, and that that preference didn't get changed over to point to your new client's IP address? This is a bizarre theory but I have no others at the moment. Anyway, I'm going to close this bug since it's evidently not either mysql's or RHEL's fault. Feel free to reopen it if you get new evidence pointing back at us.
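BTW, if it does turn out to be the LVS box expiring idle connections, I believe ipvsadm can show and raise its connection-table timeouts, something like this (syntax from memory, assuming your ipvsadm is recent enough; values are in seconds for tcp/tcpfin/udp and the numbers here are just examples):

    ipvsadm -L --timeout          # show the current timeout values
    ipvsadm --set 3600 120 300    # e.g. keep established TCP entries for an hour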
The firewall rules were not touched, and there is no rule there which should create different behaviour. Actually, the new client inherited the old HPUX client's IP address. Anyway, my workaround works. I implemented a last_time_talked flag which remembers when the mysql connection was last used. If a new request sees that last_time_talked is too far in the past, it closes the connection and opens a new one. Apparently, closing does not require waiting for any reply from the server, so there is no danger of a hangup. That way, we avoid using old connections which might have gone stale.
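For the record, the reconnect logic looks roughly like this (a simplified sketch, not our actual code; the host name, credentials, database name, and the 5-minute limit are placeholders):

#include <stddef.h>
#include <time.h>
#include <mysql.h>

#define IDLE_LIMIT 300                 /* seconds; placeholder value */

static MYSQL  *conn = NULL;
static time_t  last_time_talked = 0;

int db_query(const char *sql)
{
    /* If the connection has been idle too long, assume it has gone
       stale and drop it; closing never waits for the server to answer. */
    if (conn != NULL && time(NULL) - last_time_talked > IDLE_LIMIT) {
        mysql_close(conn);
        conn = NULL;
    }

    if (conn == NULL) {
        conn = mysql_init(NULL);
        if (conn == NULL)
            return -1;
        if (mysql_real_connect(conn, "db.example.com", "user", "password",
                               "mydb", 3306, NULL, 0) == NULL) {
            mysql_close(conn);         /* also frees the handle from mysql_init */
            conn = NULL;
            return -1;
        }
    }

    last_time_talked = time(NULL);
    return mysql_query(conn, sql);     /* nonzero on error */
}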