Bug 235517
| Field | Value |
|---|---|
| Summary | php stopping half way, closer look reveals null terminators |
| Product | Red Hat Enterprise Linux 4 |
| Component | kernel |
| Version | 4.4 |
| Hardware | All |
| OS | Linux |
| Status | CLOSED NOTABUG |
| Severity | medium |
| Priority | medium |
| Reporter | James Jhurani <jjhurani> |
| Assignee | Kernel Maintainer List <kernel-maint> |
| QA Contact | Martin Jenner <mjenner> |
| CC | jjhurani, jorton, jwest |
| Doc Type | Bug Fix |
| Last Closed | 2010-06-07 04:59:55 UTC |

Description
James Jhurani
2007-04-06 16:21:34 UTC
Ronnie K's Update:

I've installed lighttpd-1.4.13 on 209.85.xx.180, one of James D's boxes. It's having the same issues as Apache. I can't view http://209.85.xx.180/phpinfo.php on port 80, but 81 or anything else is no problem. On this server, you can view the .php pages with HTTP 1.0; you just get bad checksum errors in tcpdump. However, if you do this:

    rwkendrick@10-33-3-139 ~ $ telnet 209.85.xx.180 80
    Trying 209.85.xx.180...
    Connected to 209.85.xx.180.
    Escape character is '^]'.
    GET /phpinfo.php HTTP/1.0

the whole file comes through with no checksum errors. Not sure why that is. However, James is working on another server (209.85.(xx+33).99/gabetest.php) and HTTP 1.0 doesn't work at all. So, that throws out the HTTP 1.1 theory.

These files work without any problems locally. Not using the loopback address, but the actual IP. However, I doubt it's the network because we have a test box the customer hasn't touched and it serves phpinfo no problem.

I was going to rebuild the kernel modules, but James already installed and tested new kernels without any success. Read the rest of the internal posts if you want to see all the methods we've exhausted. If you have any other ideas, let us know.

NOTE: his references to James refer to me (jjhurani), not the customers James D or James H., unless he added the last initial.

Created attachment 152131 [details]
sysreport from server exhibiting this bug
Note regarding the sysreport: I used the sysreport from a server that is exhibiting the problem, but with the least amount of troubleshooting done. I did this with the intention of providing you with a clean sysreport with the fewest configuration changes made by us.

The php files we have been testing all contain:

    [root@rpweb1 public_html]# cat test.php
    <?php phpinfo(); ?>
    [root@rpweb1 public_html]#

Some php files with small output will work. With test.php, while the code is small, the phpinfo output is fairly large. The size of the output seems to be what matters in triggering the problem. When using a php file that simply counts line by line, the result would vary: in some cases it would reach line 129, in other cases 157. The amount that is successfully output is seemingly random.

Please note, we have tried the default configuration and several different versions of apache/php. I seriously doubt the sysreport will give you enough to reproduce this issue. The only thing we have noticed in common between the servers is that they are all RHEL4.4.

Thanks for the report. So in summary the symptoms of failure you have are:

1) IP checksum/corruption failures shown in tcpdump
2) HTTP response corruption
3) occurs only on traffic from port 80
4) occurs with both httpd and lighttpd
5) occurs on multiple machines

("curl" is better at detecting HTTP chunked response corruption than "wget", FWIW.)

I would suggest the following possible causes:

a) TCP filtering gone wrong at kernel level (is iptables in use?)
b) kernel bug, or more likely, as you suggested: kernel compromised
c) TCP filtering done on port 80 somewhere external to the box

You should be able to eliminate (a) and (b) by doing fresh-from-CD RHEL installs on the boxes. You can eliminate (c) by plugging something directly into the NIC on the box and testing directly without passing packets through the LAN.

I'd strongly recommend you contact Red Hat Support to help with a successful diagnosis of a complex issue like this.

I apologize for the delay in my response. I originally used curl, but simply received the error you mentioned. I believe it was "chunky parse error". Another tech mentioned wget would keep downloading. After some investigation with wget, we came to the aforementioned results.

In response to your suggestions:

a) iptables was stopped during the investigation.
b) I do not believe this is the issue. This type of rootkit generally installs an LKM, or modifies the address of an existing one. Since we installed a new kernel, this would have a different set of LKMs (I believe). The rootkit might be smart enough to install into all existing kernels on the system, but I doubt that it would be smart enough to actively watch for new kernel installations.
c) This has occurred with servers both on the floor (not on racks) and on private racks. The original was a private rack customer; we put a test server running RHEL4.4 on the very same rack (plugged into the same switch and all) to try to recreate the issue. We were unable to get the test server to exhibit the same problems. As for connecting to the server directly, this idea was tossed around, but seeing the problem on boxes both on and off the racks, while a test server placed on a rack containing an affected server showed no sign of the issue, was fairly convincing.

To be absolutely sure, I will see if I can get another test box to plug directly into the NIC of one of the buggy servers in the next few days.

Right now we are looking into the possibility of a bad PXE image; however, this is highly unlikely as they have not been changed in quite some time. But at this point we are simply out of ideas.

To my understanding Kevin L. has contacted Red Hat Support. Unfortunately I do not believe that Red Hat Support will be able to recreate the issue from the sysreport.
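
The difference between curl and wget noted in this thread can be illustrated with a minimal sketch of HTTP/1.1 chunked decoding. It assumes the response body is already in memory as bytes; the name `decode_chunked` and the sample payloads are hypothetical and are not taken from this report. Chunk extensions are skipped and trailers are not parsed. A strict client gives up as soon as a chunk-size line is not valid hexadecimal, which is what would happen once part of the stream has been replaced by NUL bytes.

```python
# Hypothetical helper for illustration only; not part of any tool named in this report.
HEX_DIGITS = frozenset(b"0123456789abcdefABCDEF")


def decode_chunked(body: bytes) -> bytes:
    """Decode an HTTP/1.1 chunked body, raising ValueError on malformed framing."""
    out = bytearray()
    pos = 0
    while True:
        eol = body.find(b"\r\n", pos)
        if eol == -1:
            raise ValueError(f"missing chunk-size line at offset {pos}")
        # Anything after ';' is a chunk extension, ignored in this sketch.
        size_field = body[pos:eol].split(b";", 1)[0].strip()
        if not size_field or any(c not in HEX_DIGITS for c in size_field):
            raise ValueError(f"bad chunk size {size_field!r} at offset {pos}")
        size = int(size_field, 16)
        pos = eol + 2
        if size == 0:
            return bytes(out)  # last chunk reached; trailers are not parsed
        chunk = body[pos:pos + size]
        if len(chunk) != size:
            raise ValueError(f"truncated chunk at offset {pos}")
        if body[pos + size:pos + size + 2] != b"\r\n":
            raise ValueError(f"missing CRLF after chunk at offset {pos}")
        out += chunk
        pos += size + 2
```

A well-formed body such as `b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n"` decodes cleanly, while the same prefix followed by zero bytes raises `ValueError`. A client that just keeps reading until the connection closes would not notice the damage, which fits the observation that wget kept downloading.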
I noticed that in the config file from the sysreport you have mod_dosevasive20 activated. Can you reproduce the issue without that module?

    [root@plesk httpd]# grep -iR dos conf*
    conf/magic:#0 leshort 0x76FF squeezed data (CP/M, DOS)
    conf/magic:#0 leshort 0x76FE crunched data (CP/M, DOS)
    [root@plesk httpd]#

    webtech@10-33-3-139 ~ > wget http://schoolserver.eduss[snip].com/status/phpinfo.php
    --08:54:52-- http://schoolserver.eduss[snip].com/status/phpinfo.php
    => `phpinfo.php'
    Resolving schoolserver.eduss[snip].com... 67.15.15x.10
    Connecting to schoolserver.eduss[snip].com|67.15.15x.10|:80... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: unspecified [text/html]
    [ <=> ] 4,162,371 2.44M/s
    webtech@10-33-3-139 ~ >

As you can see the problem still exists without mod_dosevasive20. Also, I believe using lighttpd we ruled out any problems caused by an apache module.

To further narrow the source of the corruption can you try two things:

1) Capture a tcpdump running on the server box, simultaneously with one running remotely, to see whether the packet corruption is evident in the outgoing packets.
2) Check output from the httpd process itself:

    # service httpd stop
    # strace -o /tmp/httpd.trace /usr/sbin/httpd -X

then reproduce the problem, and attach the httpd.trace to this bug report.

Note: please use:

    # strace -s8192 -o /tmp/httpd.trace /usr/sbin/httpd -X

and

    # tcpdump -s0 -w /tmp/tcp.dump ...

to ensure the entire packet and system call information is captured.

We also have a ticket open regarding this issue. In the ticket a different tcpdump and strace syntax was used. In order to respond to both the ticket and this bugzilla, I have done both.

While doing this I noticed something odd, but it makes sense. When running apache under strace with -s2048 (string size), the page loads halfway and the rest comes through as null terminators. But if I use -s8192, as you suggested, the page actually completes.

Since everyone has their own tcpdump and strace preferences, I tried to keep the parameters as close as possible to the requested syntax.

209.85.xx.181 == the server
216.12.193.x == me

If you actually want the full IPs, let us know, and we will provide them privately.

Created attachment 153196 [details]
strace string size 2048
client: tcpdump -vvv host 209.85.xx.181 -w tcpdump.2048.client
server: tcpdump -vvv host 216.12.193.x and port 80 -s0 -w tcpdump.2048.server
strace: strace -fxvto http.trace.2048 -s2048 /usr/sbin/httpd -X &
Created attachment 153197 [details]
strace string size 8192
s8192.tar
client: tcpdump -s0 -w tcpdump.8192.client src or dst host 209.85.xx.181 and src or dst port 80
server: tcpdump -s0 -w tcpdump.8192.server src or dst host 216.12.193.x
strace: strace -s8192 -o http.trace.8192 /usr/sbin/httpd -X &
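
Once the HTTP payload has been extracted from both the server-side and the client-side capture (by whatever means; the extraction step is not shown here), a short script can report where the two first differ and where the first long run of zero bytes starts. This is only a sketch: the helper names `first_divergence` and `first_nul_run` and the 16-byte threshold are assumptions made for illustration, not something used in this report.

```python
from typing import Optional


def first_divergence(sent: bytes, received: bytes) -> Optional[int]:
    """Return the offset of the first differing byte, or None if identical."""
    for i, (a, b) in enumerate(zip(sent, received)):
        if a != b:
            return i
    if len(sent) != len(received):
        # One payload is a strict prefix of the other.
        return min(len(sent), len(received))
    return None


def first_nul_run(data: bytes, min_len: int = 16) -> Optional[int]:
    """Return the offset of the first run of at least min_len NUL bytes."""
    idx = data.find(b"\x00" * min_len)
    return idx if idx != -1 else None


if __name__ == "__main__":
    # Synthetic data standing in for the two extracted payloads.
    sent = b"<html>" + b"A" * 100
    received = sent[:40] + b"\x00" * 66
    print("first differing offset:", first_divergence(sent, received))
    print("first NUL run offset:", first_nul_run(received))
```

If both offsets agree, the corruption is a clean switch from valid data to zeros at that point, which is easier to line up against individual frames and system calls than a byte count from wget alone.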
Thanks. The fact that the strace options used make a difference to the behaviour implies this might be somehow timing related. The behaviour seen in the -s2048 strace is as follows:

1) first response block is sent with writev()
2) second response block is sent with writev(), returns short
3) httpd does a "poll/writev" loop five times attempting to continue sending the second block - one byte gets sent each time until the final attempt, which completes
4) subsequent blocks of response are sent without timing out

The packets in the tcpdump look correct exactly up until (4), where only zero bytes appear. The checksum mismatch issue may be simply due to TCP checksum offloading. I can't see any misbehaviour by php/httpd here.

You could try disabling the TCP offload features for the NIC in question, to see if that makes a difference; e.g.:

    # ethtool -K ethN tx off sg off tso off

Before ethtool:

    webtech@10-33-3-139 ~ > wget http://10.34.1.185/phptest.php
    --07:09:53-- http://10.34.1.185/phptest.php
    => `phptest.php.1'
    Connecting to 10.34.1.185:80... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: unspecified [text/html]
    [ <=> ] 1,167,427 866.31K/s
    webtech@10-33-3-139 ~ >

Ran the aforementioned command:

    [root@PR-03-actionol html]# ethtool -K eth0 tx off sg off tso off
    [root@PR-03-actionol html]#

    webtech@10-33-3-139 ~ > wget http://10.34.1.185/phptest.php
    --07:10:33-- http://10.34.1.185/phptest.php
    => `phptest.php.2'
    Connecting to 10.34.1.185:80... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: unspecified [text/html]
    [ <=> ] 4,268,517 2.12M/s
    webtech@10-33-3-139 ~ >

Unfortunately the problem still exists.

From the s2048.tar attachment, httpd's system call behaviour is correct, so this needs triage from the kernel side; it could be some NIC issue. The writev() call in http.trace.2048:

    28245 11:59:04 writev(9, [{"0</td>.../root/bin", 1104}, {"\r\n", 2}], 2) = 1106

corresponds to the first half of frame 23 in tcpdump.2048.server. Subsequent writev() calls correspond only to zero bytes in the tcpdump.

We have already tried adding a new (Intel) NIC and disabling the onboard one. This did not resolve the situation. As for the kernel being the problem, this kernel was installed via up2date from the RHN Satellite. The majority of the servers were using 2.6.9-42.ELsmp. On one of the 2.6.9-42smp servers, I tried updating the kernel to not only a newer version, but a uniprocessor (non-SMP) kernel, 2.6.9-42.0.10.EL. The problem still persisted. This was also done previously to rule out an LKM trojan.

    Linux PR-03-actionol 2.6.9-42.0.10.EL #1 Fri Feb 16 17:06:10 EST 2007 i686 i686 i386 GNU/Linux

Are there any tests I can do to give you more information on the kernel? sysrq?

Are there any updates regarding this issue?

We also just tried reinstalling from CD rather than the PXE drive, as requested earlier in the bug report. After migrating the data over to the new server (fresh install from CD) the php issue still occurred.

We are still waiting on a resolution for this issue.

The issue still occurs with Kernel 2.6.9-55.0.2.
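
For reference on the checksum mismatches discussed in this thread: IP and TCP both use the 16-bit one's-complement Internet checksum described in RFC 1071. With checksum offloading the NIC fills the value in after the point where a capture on the sending host sees the packet, so such a capture can show a wrong checksum for packets that are correct on the wire. Below is a minimal sketch of the computation over a raw byte string; `internet_checksum` is an illustrative name, and the TCP pseudo-header that a real TCP checksum also covers is not constructed here.

```python
def internet_checksum(data: bytes) -> int:
    """Compute the RFC 1071 16-bit one's-complement checksum of data."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # big-endian 16-bit words
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


if __name__ == "__main__":
    # Example bytes from the worked example in RFC 1071.
    sample = bytes.fromhex("0001f203f4f5f6f7")
    print(hex(internet_checksum(sample)))
```

A receiver verifies by running the same sum over the data including the transmitted checksum field; an intact packet yields zero.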