Description of problem:
When trying to start httpd, the process hangs indefinitely.

Version-Release number of selected component (if applicable):
- Apache: httpd-2.2.17-1.fc14.i686
- Fedora: Fedora release 14 (Laughlin)
- Kernel: 2.6.35.6-48.fc14.i686.PAE
- Amazon EC2 AMI: ami-ac281dd8 (fedora-images-eu-west-1/fedora-14-i386-S3.ec2.manifest.xml)
- SELinux is disabled by default on the Fedora 14 AMIs. (SELinux status: disabled)

How reproducible:
Install an Amazon EC2 AMI, log in, install httpd and try to start it.

Steps to Reproduce:
1. Install a provided Fedora 14 AMI. I used ami-ac281dd8. (EU West)
2. Log in, run "yum update" and install httpd. (yum -y install httpd)
3. Try to run "apachectl configtest". It will hang.

Actual results:
A hanging httpd.

Expected results:
A running httpd.

Additional info:
I've tried to debug this problem with strace; the tail of the output is:

# strace -f httpd
<skipped many lines>
mmap2(NULL, 24656, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 4, 0) = 0xc14000
mmap2(0xc19000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 4, 0x4) = 0xc19000
close(4) = 0
mprotect(0xc19000, 4096, PROT_READ) = 0
open("/etc/httpd/modules/mod_version.so", O_RDONLY) = 4
read(4, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0p\6\0\0004\0\0\0"..., 512) = 512
fstat64(4, {st_mode=S_IFREG|0755, st_size=9540, ...}) = 0
mmap2(NULL, 12368, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 4, 0) = 0x713000
mprotect(0x714000, 4096, PROT_NONE) = 0
mmap2(0x715000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 4, 0x1) = 0x715000
close(4) = 0
mprotect(0x715000, 4096, PROT_READ) = 0
read(3, "cern_meta.so\n#LoadModule cgid_mo"..., 4096) = 4096
stat64("/etc/httpd/conf.d", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
open("/etc/httpd/conf.d", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY|O_CLOEXEC) = 4
getdents64(4, /* 6 entries */, 32768) = 184
getdents64(4, /* 0 entries */, 32768) = 0
<here httpd is hanging>
Can you install the debuginfo and run it under gdb? # debuginfo-install httpd # gdb --args /usr/sbin/httpd -X ... (gdb) run ...hang... <CTRL-C> (gdb) bt
Okay:

# yum -y install yum-utils gdb
# debuginfo-install httpd
# debuginfo-install cyrus-sasl-lib libuuid nspr nss nss-softokn-freebl nss-util
# gdb --args /usr/sbin/httpd -X
GNU gdb (GDB) Fedora (7.2-23.fc14)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details.
This GDB was configured as "i686-redhat-linux-gnu".
For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/httpd...Reading symbols from /usr/lib/debug/usr/sbin/httpd.debug...done.
done.
(gdb) run
Starting program: /usr/sbin/httpd -X
[Thread debugging using libthread_db enabled]
^C
Program received signal SIGINT, Interrupt.
0xb7c146d0 in __readdir64_r (dirp=0x1c7360, entry=0x1c5188, result=0xbfffdfe8) at ../sysdeps/unix/readdir_r.c:132
132 return dp != NULL ? 0 : reclen ? errno : 0;
(gdb) bt
#0 0xb7c146d0 in __readdir64_r (dirp=0x1c7360, entry=0x1c5188, result=0xbfffdfe8) at ../sysdeps/unix/readdir_r.c:132
#1 0xb7d37865 in apr_dir_read (finfo=<value optimized out>, wanted=<value optimized out>, thedir=<value optimized out>) at file_io/unix/dir.c:157
#2 0x001395d8 in ap_process_resource_config (s=0x172ad8, fname=0x1c5120 "/etc/httpd/conf.d/*.conf", conftree=0xbffff18c, p=0x16d0b8, ptemp=0x19d178) at /usr/src/debug/httpd-2.2.17/server/config.c:1712
#3 0x0012aa8b in include_config (cmd=0xbffff46c, dummy=0xbffff344, name=0x1c5088 "conf.d/*.conf") at /usr/src/debug/httpd-2.2.17/server/core.c:2605
#4 0x0013673c in invoke_cmd (cmd=0x1624d8, parms=0xbffff46c, mconfig=0xbffff344, args=0x1a41a5 "") at /usr/src/debug/httpd-2.2.17/server/config.c:895
#5 0x0013822b in execute_now (p=<value optimized out>, temp_pool=<value optimized out>, l=0x1a4190 "Include conf.d/*.conf", parms=0xbffff46c, current=0xbffff3ac, curr_parent=0xbffff3a8, conftree=0x163d54) at /usr/src/debug/httpd-2.2.17/server/config.c:1441
#6 ap_build_config_sub (p=<value optimized out>, temp_pool=<value optimized out>, l=0x1a4190 "Include conf.d/*.conf", parms=0xbffff46c, current=0xbffff3ac, curr_parent=0xbffff3a8, conftree=0x163d54) at /usr/src/debug/httpd-2.2.17/server/config.c:1012
#7 0x0013884d in ap_build_config (parms=0xbffff46c, p=0x16d0b8, temp_pool=0x19d178, conftree=0x163d54) at /usr/src/debug/httpd-2.2.17/server/config.c:1224
#8 0x00138cec in process_resource_config_nofnmatch (s=0x172ad8, fname=<value optimized out>, conftree=0x163d54, p=0x16d0b8, ptemp=0x19d178, depth=0) at /usr/src/debug/httpd-2.2.17/server/config.c:1634
#9 0x00139534 in ap_process_resource_config (s=0x172ad8, fname=0x19fc58 "/etc/httpd/conf/httpd.conf", conftree=0x163d54, p=0x16d0b8, ptemp=0x19d178) at /usr/src/debug/httpd-2.2.17/server/config.c:1666
#10 0x0013a1e1 in ap_read_config (process=0x16b140, ptemp=0x19d178, filename=0x14f1cc "conf/httpd.conf", conftree=0x163d54) at /usr/src/debug/httpd-2.2.17/server/config.c:2026
#11 0x00120688 in main (argc=2, argv=0xbffff844) at /usr/src/debug/httpd-2.2.17/server/main.c:632

Also interesting to know: httpd consumes 99.something% of the CPU while it is hanging. Regards, Robert de Bock.
1) What are the contents of: /etc/httpd/conf.d/ 2) Can you step a few times in gdb to see where the hang occurs? (enter "s" in gdb and see if it steps or simply hangs again)
Hi, The contents of the /etc/httpd/conf.d directory: --- README welcome.conf --- (This is just the package httpd, no configuration changes have been done.) I have stepped through gdb. Hitting CTRL+C every +- 10 seconds followed by "s" and RETURN. (If this is not correct, please let me know; I am not very familiar with gdb.) Here is the output: --- # gdb --args httpd -X GNU gdb (GDB) Fedora (7.2-23.fc14) Copyright (C) 2010 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "i686-redhat-linux-gnu". For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>... Reading symbols from /usr/sbin/httpd...Reading symbols from /usr/lib/debug/usr/sbin/httpd.debug...done. done. (gdb) run Starting program: /usr/sbin/httpd -X [Thread debugging using libthread_db enabled] ^C Program received signal SIGINT, Interrupt. 0xb7c146d0 in __readdir64_r (dirp=0x1c7360, entry=0x1c5188, result=0xbfffdfd8) at ../sysdeps/unix/readdir_r.c:132 132 return dp != NULL ? 0 : reclen ? errno : 0; (gdb) s ^C Program received signal SIGINT, Interrupt. 0xb7c146d0 in __readdir64_r (dirp=0x1c7360, entry=0x1c5188, result=0xbfffdfd8) at ../sysdeps/unix/readdir_r.c:132 132 return dp != NULL ? 0 : reclen ? errno : 0; (gdb) s ^C Program received signal SIGINT, Interrupt. 0xb7c146d0 in __readdir64_r (dirp=0x1c7360, entry=0x1c5188, result=0xbfffdfd8) at ../sysdeps/unix/readdir_r.c:132 132 return dp != NULL ? 0 : reclen ? errno : 0; (gdb) s ^C Program received signal SIGINT, Interrupt. 0xb7c146d0 in __readdir64_r (dirp=0x1c7360, entry=0x1c5188, result=0xbfffdfd8) at ../sysdeps/unix/readdir_r.c:132 132 return dp != NULL ? 0 : reclen ? 
errno : 0; (gdb) bt #0 0xb7c146d0 in __readdir64_r (dirp=0x1c7360, entry=0x1c5188, result=0xbfffdfd8) at ../sysdeps/unix/readdir_r.c:132 #1 0xb7d37865 in apr_dir_read (finfo=<value optimized out>, wanted=<value optimized out>, thedir=<value optimized out>) at file_io/unix/dir.c:157 #2 0x001395d8 in ap_process_resource_config (s=0x172ad8, fname=0x1c5120 "/etc/httpd/conf.d/*.conf", conftree=0xbffff17c, p=0x16d0b8, ptemp=0x19d178) at /usr/src/debug/httpd-2.2.17/server/config.c:1712 #3 0x0012aa8b in include_config (cmd=0xbffff45c, dummy=0xbffff334, name=0x1c5088 "conf.d/*.conf") at /usr/src/debug/httpd-2.2.17/server/core.c:2605 #4 0x0013673c in invoke_cmd (cmd=0x1624d8, parms=0xbffff45c, mconfig=0xbffff334, args=0x1a41a5 "") at /usr/src/debug/httpd-2.2.17/server/config.c:895 #5 0x0013822b in execute_now (p=<value optimized out>, temp_pool=<value optimized out>, l=0x1a4190 "Include conf.d/*.conf", parms=0xbffff45c, current=0xbffff39c, curr_parent=0xbffff398, conftree=0x163d54) at /usr/src/debug/httpd-2.2.17/server/config.c:1441 #6 ap_build_config_sub (p=<value optimized out>, temp_pool=<value optimized out>, l=0x1a4190 "Include conf.d/*.conf", parms=0xbffff45c, current=0xbffff39c, curr_parent=0xbffff398, conftree=0x163d54) at /usr/src/debug/httpd-2.2.17/server/config.c:1012 #7 0x0013884d in ap_build_config (parms=0xbffff45c, p=0x16d0b8, temp_pool=0x19d178, conftree=0x163d54) at /usr/src/debug/httpd-2.2.17/server/config.c:1224 #8 0x00138cec in process_resource_config_nofnmatch (s=0x172ad8, fname=<value optimized out>, conftree=0x163d54, p=0x16d0b8, ptemp=0x19d178, depth=0) at /usr/src/debug/httpd-2.2.17/server/config.c:1634 #9 0x00139534 in ap_process_resource_config (s=0x172ad8, fname=0x19fc58 "/etc/httpd/conf/httpd.conf", conftree=0x163d54, p=0x16d0b8, ptemp=0x19d178) at /usr/src/debug/httpd-2.2.17/server/config.c:1666 #10 0x0013a1e1 in ap_read_config (process=0x16b140, ptemp=0x19d178, filename=0x14f1cc "conf/httpd.conf", conftree=0x163d54) at 
/usr/src/debug/httpd-2.2.17/server/config.c:2026 #11 0x00120688 in main (argc=2, argv=0xbffff834) at /usr/src/debug/httpd-2.2.17/server/main.c:632 (gdb) --- Regards, Robert de Bock.
I have the same issue. If you comment out this line in /etc/httpd/conf/httpd.conf: Include conf.d/*.conf then Apache will at least start. Unfortunately it still hangs on any request and is therefore unusable. Strace: https://gist.github.com/664565 I tried to find the cause with the #fedora-devel guys, but we failed to track it down :(
Marek, are you seeing this in EC2 or on plain i686? Please give uname -a output.
Joe, I saw this on EC2 *only*. On my local boxes (both: bare metal and KVM/VMware VM's) – it runs great. I used ami-669f680f listed here: https://fedoraproject.org/wiki/Cloud_SIG/EC2_Images#Fedora_14 [ec2-user@ip-10-112-19-176 ~]$ uname -a Linux ip-10-112-19-176 2.6.35.6-48.fc14.i686.PAE #1 SMP Fri Oct 22 15:27:53 UTC 2010 i686 i686 i386 GNU/Linux --Marek
OK, thanks. Is there anything special about the affected installs other than that they run in EC2? Is it an extN filesystem?
Whether this is reproducible with the x86_64 F14 AMI would be a useful data point.
I tried 64 bit appliance: ami-e291668b. [ec2-user@ip-10-204-33-212 ~]$ uname -a Linux ip-10-204-33-212 2.6.35.6-48.fc14.x86_64 #1 SMP Fri Oct 22 15:36:08 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux And httpd *starts normally*. Both AMIs have ext3 filesystem.
Forgot to add: I tried to create an AMI with an ext4 filesystem (with BoxGrinder), but it had the same issue; everything else worked great.
Confirmed; httpd starts normally on 64 bit: "Linux ip-10-227-158-79 2.6.35.6-48.fc14.x86_64 #1 SMP Fri Oct 22 15:36:08 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux" (AMI: ami-a8281ddc)
I see similar behavior when installing a perl module with "perl Makefile.PL" (the example strace output below is from Bloom-Filter-1.0, but I see it with other perl modules too). It hangs after getdents64:

getcwd("/root/Bloom-Filter-1.0", 4095) = 23
lstat64(".", {st_mode=S_IFDIR|0777, st_size=4096, ...}) = 0
stat64(".", {st_mode=S_IFDIR|0777, st_size=4096, ...}) = 0
open(".", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY|O_CLOEXEC) = 3
getdents64(3, /* 9 entries */, 32768) = 264
getdents64(3, /* 0 entries */, 32768) = 0

When I run the same command on my Rackspace Fedora 14 instance, which happens to be 64-bit, it uses getdents, NOT getdents64. (Knowing absolutely nothing about getdents, it seems strange that the 64-bit instance uses getdents and the 32-bit instance uses getdents64, but maybe that's normal.) If I have some time I'll try running the perl command under the debugger, as requested for httpd.
sorry, here is the uname info that strace was from: [ec2-user@ip-10-212-118-242 Bloom-Filter-1.0]$ uname -a Linux ip-10-212-118-242 2.6.35.6-48.fc14.i686.PAE #1 SMP Fri Oct 22 15:27:53 UTC 2010 i686 i686 i386 GNU/Linux The ami is: ami-669f680f, m1.small I installed the following software: yum -y install wget yum -y install perl yum -y install perl-ExtUtils-MakeMaker and then after the problem occurred, yum -y install strace
Created attachment 460093: minimal readdir64_r test

Can those seeing this issue try this minimal test case for readdir64_r?

gcc -Wall -O2 -Werror readdir64.c -o readdir64
./readdir64

should list current directory contents. In case this is specific to some directory try

cd /etc/httpd/conf.d
/path/to/readdir64

to see whether that makes a difference.
(In reply to comment #15) > should list current directory contents. In case this is specific to some > directory try Here is my output: [ec2-user@ip-10-244-190-103 ~]$ ls readdir64.c [ec2-user@ip-10-244-190-103 ~]$ gcc -Wall -O2 -Werror readdir64.c -o readdir64 [ec2-user@ip-10-244-190-103 ~]$ ls readdir64 readdir64.c [ec2-user@ip-10-244-190-103 ~]$ ./readdir64 entry: .bash_logout entry: .bash_profile entry: .ssh entry: .. entry: .bashrc entry: readdir64 entry: . entry: readdir64.c ^C [ec2-user@ip-10-244-190-103 ~]$ strace ./readdir64 execve("./readdir64", ["./readdir64"], [/* 20 vars */]) = 0 brk(0) = 0x869c000 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7835000 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory) open("/etc/ld.so.cache", O_RDONLY) = 3 fstat64(3, {st_mode=S_IFREG|0644, st_size=15248, ...}) = 0 mmap2(NULL, 15248, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7831000 close(3) = 0 open("/lib/libc.so.6", O_RDONLY) = 3 read(3, "\177ELF\1\1\1\3\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0po\1\0004\0\0\0"..., 512) = 512 fstat64(3, {st_mode=S_IFREG|0755, st_size=1886052, ...}) = 0 mmap2(NULL, 1649160, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x9b0000 mmap2(0xb3d000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x18c) = 0xb3d000 mmap2(0xb40000, 10760, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xb40000 close(3) = 0 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7830000 set_thread_area({entry_number:-1 -> 6, base_addr:0xb78306c0, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}) = 0 mprotect(0xb3d000, 8192, PROT_READ) = 0 mprotect(0xea3000, 4096, PROT_READ) = 0 munmap(0xb7831000, 15248) = 0 open(".", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY|O_CLOEXEC) = 3 brk(0) = 0x869c000 brk(0x86c5000) = 0x86c5000 brk(0) = 0x86c5000 getdents64(3, /* 8 entries */, 32768) = 240 fstat64(1, 
{st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7834000 write(1, "entry: .bash_logout\n", 20entry: .bash_logout ) = 20 write(1, "entry: .bash_profile\n", 21entry: .bash_profile ) = 21 write(1, "entry: .ssh\n", 12entry: .ssh ) = 12 write(1, "entry: ..\n", 10entry: .. ) = 10 write(1, "entry: .bashrc\n", 15entry: .bashrc ) = 15 write(1, "entry: readdir64\n", 17entry: readdir64 ) = 17 write(1, "entry: .\n", 9entry: . ) = 9 write(1, "entry: readdir64.c\n", 19entry: readdir64.c ) = 19 getdents64(3, /* 0 entries */, 32768) = 0 ^C--- SIGINT (Interrupt) @ 0 (0) ---
Reassigning to glibc since it appears the hang is there (though it may be a kernel bug.
... or even something caused by the Amazon environment)...
from the EC2 32-bit instance, the ami is: ami-669f680f, m1.small uname -a Linux ip-10-212-118-242 2.6.35.6-48.fc14.i686.PAE #1 SMP Fri Oct 22 15:27:53 UTC 2010 i686 i686 i386 GNU/Linux Here is the output of the readdir64 test you asked for: [root@ip-10-212-118-242 tmp]# strace ./readdir64 execve("./readdir64", ["./readdir64"], [/* 18 vars */]) = 0 brk(0) = 0x93a0000 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7705000 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory) open("/etc/ld.so.cache", O_RDONLY) = 3 fstat64(3, {st_mode=S_IFREG|0644, st_size=15248, ...}) = 0 mmap2(NULL, 15248, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7701000 close(3) = 0 open("/lib/libc.so.6", O_RDONLY) = 3 read(3, "\177ELF\1\1\1\3\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0po\1\0004\0\0\0"..., 512) = 512 fstat64(3, {st_mode=S_IFREG|0755, st_size=1886052, ...}) = 0 mmap2(NULL, 1649160, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xb41000 mmap2(0xcce000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x18c) = 0xcce000 mmap2(0xcd1000, 10760, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xcd1000 close(3) = 0 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7700000 set_thread_area({entry_number:-1 -> 6, base_addr:0xb77006c0, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}) = 0 mprotect(0xcce000, 8192, PROT_READ) = 0 mprotect(0xd7e000, 4096, PROT_READ) = 0 munmap(0xb7701000, 15248) = 0 open(".", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY|O_CLOEXEC) = 3 brk(0) = 0x93a0000 brk(0x93c9000) = 0x93c9000 brk(0) = 0x93c9000 getdents64(3, /* 7 entries */, 32768) = 232 fstat64(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7704000 write(1, "entry: Bloom-Filter-1.0\n", 24entry: Bloom-Filter-1.0 ) = 24 write(1, "entry: .ICE-unix\n", 
17entry: .ICE-unix ) = 17 write(1, "entry: Bloom-Filter-1.0.tar.gz\n", 31entry: Bloom-Filter-1.0.tar.gz ) = 31 write(1, "entry: ..\n", 10entry: .. ) = 10 write(1, "entry: readdir64\n", 17entry: readdir64 ) = 17 write(1, "entry: .\n", 9entry: . ) = 9 write(1, "entry: readdir64.c\n", 19entry: readdir64.c ) = 19 getdents64(3, /* 0 entries */, 32768) = 0
I experienced the same hang on getdents64 when trying to install the latest Oracle/Sun JRE RPM on Fedora 14 EC2 32-bit AMI. It invokes the java runtime and that's the one that stalls out. Seems like plenty of strace logs have been posted but let me know if you'd like me to generate another one for this specific circumstance.
What are the contents of *dirp and *entry? What is the returned value?
(In reply to comment #21) > What are the contents of *dirp and *entry? What is the returned value? I'm not familiar with C. Could you please give step-by-step instructions for what I should do? Or better, an attachment with instructions? Thanks!
Just print them.
Andreas: What do you specifically want to see about *d and *entry? The names? The full struct contents? It may be easier if you modify readdir64.c (https://bugzilla.redhat.com/attachment.cgi?id=460093) to printf what you would like to see - that would save some back and forth.
Just print it in the debugger.
Here is my gdb session. Let me know if there is anything else you want me to try. [ec2-user@ip-10-243-15-194 ~]$ gcc -Wall -g -Werror readdir64.c -o readdir64 [ec2-user@ip-10-243-15-194 ~]$ gdb readdir64 GNU gdb (GDB) Fedora (7.2-23.fc14) Copyright (C) 2010 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "i686-redhat-linux-gnu". For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>... Reading symbols from /home/ec2-user/readdir64...done. (gdb) run Starting program: /home/ec2-user/readdir64 entry: .bash_logout result: .bash_logout entry: .bash_profile result: .bash_profile entry: .bash_history result: .bash_history entry: .ssh result: .ssh entry: .readdir64.c.swp result: .readdir64.c.swp entry: typescript result: typescript entry: .. result: .. entry: .bashrc result: .bashrc entry: readdir64 result: readdir64 entry: . result: . entry: readdir64.c result: readdir64.c ^C Program received signal SIGINT, Interrupt. 0x001cf6d0 in __readdir64_r (dirp=0x804a008, entry=0xbffff578, result=0xbffff574) at ../sysdeps/unix/readdir_r.c:132 132 return dp != NULL ? 0 : reclen ? errno : 0; (gdb) list 127 else 128 *result = NULL; 129 130 __libc_lock_unlock (dirp->lock); 131 132 return dp != NULL ? 0 : reclen ? errno : 0; 133 } 134 135 #ifdef __READDIR_R_ALIAS 136 weak_alias (__readdir_r, readdir_r) (gdb) c Continuing. ^C Program received signal SIGINT, Interrupt. 0x001cf6d0 in __readdir64_r (dirp=0x804a008, entry=0xbffff578, result=0xbffff574) at ../sysdeps/unix/readdir_r.c:132 132 return dp != NULL ? 0 : reclen ? errno : 0; (gdb) s ^C Program received signal SIGINT, Interrupt. 
0x001cf6d0 in __readdir64_r (dirp=0x804a008, entry=0xbffff578, result=0xbffff574) at ../sysdeps/unix/readdir_r.c:132 132 return dp != NULL ? 0 : reclen ? errno : 0; (gdb) bt #0 0x001cf6d0 in __readdir64_r (dirp=0x804a008, entry=0xbffff578, result=0xbffff574) at ../sysdeps/unix/readdir_r.c:132 #1 0x08048532 in main (argc=1, argv=0xbffff744) at readdir64.c:20 (gdb) up #1 0x08048532 in main (argc=1, argv=0xbffff744) at readdir64.c:20 20 while (readdir64_r(d, &entry, &result) == 0 (gdb) print *d $1 = {fd = 7, lock = 0, allocation = 32768, size = 352, offset = 352, filepos = 2147483647, data = 0x804a008 "\a"} (gdb) print entry $2 = {d_ino = 524299, d_off = 2147483647, d_reclen = 32, d_type = 8 '\b', d_name = "readdir64.c\000\000swp\000\000\000\000\000\364d\033\000\344\367\023\000q\352\261\a\363\003\000\000\b\000\000\000.N=\366\000\371\377\267\003\000\000\000\000\031\023\000\000\000\000\000\000\000\000\000\001\000\000\000\237\b\000\000\060\371\377\267P\366\377\267\220\202\004\b\004\000\024\000́\004\b\001\000\000\000\274\017\023\000\360\366\377\277\270\032\023\000\300\366\377\277\241\247\021\000\260\366\377\277́\004\b\244\366\377\277\\\032\023\000\000\000\000\000\060\371\377\267\001\000\000\000\000\000\000\000\001\000\000\000\000\031\023\000\000\000`\000\000\b\004\002\000\000`\000\001\000\000\000\000\200\000\000\364\037,\000\344\001,\000D\367\377\277X\366\377\277\000\000\000\000\360\366\377\277ȗ\004\bh\366\377\277d\203\004\b\225/\026\000ȗ\004\b\230\366\377\277y\205\004\b\220\202\004\b\240,,\000\240,,\000\364\037,\000`\205\004\b\360\203\004\bk\205\004"} (gdb) list 15 if (d == NULL) { 16 perror("opendir64"); 17 return 1; 18 } 19 20 while (readdir64_r(d, &entry, &result) == 0 21 && result != NULL) { 22 printf("entry: %s\n", entry.d_name); 23 printf("result: %s\n", result->d_name); 24 } (gdb) quit
Just realized I pasted the version with the prints in main, instead of in __readdir64_r. Here are the *dirp and *entry values at that point in the call stack: [ec2-user@ip-10-243-15-194 ~]$ gdb readdir64 GNU gdb (GDB) Fedora (7.2-23.fc14) Copyright (C) 2010 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "i686-redhat-linux-gnu". For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>... Reading symbols from /home/ec2-user/readdir64...done. (gdb) run Starting program: /home/ec2-user/readdir64 entry: .bash_logout result: .bash_logout entry: .bash_profile result: .bash_profile entry: .bash_history result: .bash_history entry: .ssh result: .ssh entry: .readdir64.c.swp result: .readdir64.c.swp entry: typescript result: typescript entry: .. result: .. entry: .bashrc result: .bashrc entry: readdir64 result: readdir64 entry: . result: . entry: readdir64.c result: readdir64.c ^C Program received signal SIGINT, Interrupt. 0x001cf6d0 in __readdir64_r (dirp=0x804a008, entry=0xbffff578, result=0xbffff574) at ../sysdeps/unix/readdir_r.c:132 132 return dp != NULL ? 0 : reclen ? 
errno : 0; (gdb) print dirp $1 = (DIR *) 0x804a008 (gdb) print *dirp $2 = {fd = 7, lock = 0, allocation = 32768, size = 352, offset = 352, filepos = 2147483647, data = 0x804a008 "\a"} (gdb) print entry $3 = (struct dirent64 *) 0xbffff578 (gdb) print *entry $4 = {d_ino = 524299, d_off = 2147483647, d_reclen = 32, d_type = 8 '\b', d_name = "readdir64.c\000\000swp\000\000\000\000\000\364d\033\000\344\367\023\000q\352\261\a\363\003\000\000\b\000\000\000.N=\366\000\371\377\267\003\000\000\000\000\031\023\000\000\000\000\000\000\000\000\000\001\000\000\000\237\b\000\000\060\371\377\267P\366\377\267\220\202\004\b\004\000\024\000́\004\b\001\000\000\000\274\017\023\000\360\366\377\277\270\032\023\000\300\366\377\277\241\247\021\000\260\366\377\277́\004\b\244\366\377\277\\\032\023\000\000\000\000\000\060\371\377\267\001\000\000\000\000\000\000\000\001\000\000\000\000\031\023\000\000\000`\000\000\b\004\001\000\000`\000\001\000\000\000\000\200\000\000\364\037,\000\344\001,\000D\367\377\277X\366\377\277\000\000\000\000\360\366\377\277ȗ\004\bh\366\377\277d\203\004\b\225/\026\000ȗ\004\b\230\366\377\277y\205\004\b\220\202\004\b\240,,\000\240,,\000\364\037,\000`\205\004\b\360\203\004\bk\205\004"} (gdb) bt #0 0x001cf6d0 in __readdir64_r (dirp=0x804a008, entry=0xbffff578, result=0xbffff574) at ../sysdeps/unix/readdir_r.c:132 #1 0x08048532 in main (argc=1, argv=0xbffff744) at readdir64.c:20
Where does it hang?
It appears to hang at ../sysdeps/unix/readdir_r.c:132. In my previous two comments I issued the interrupt once it had hung. If you need access to an EC2 instance to debug, I can provide that. Just let me know.
Where does it hang _exactly_?
Everything I know about where it is hanging is in the two gdb sessions above. If you have any tips on gathering more information, I would be happy to follow them - I am not a C developer. I would also be happy to give you access to an instance for you to debug on, or share an AMI that you can launch if you would prefer.
disp/i $pc si
It hangs on the si call, requiring an interrupt. I also printed the registers. Could it be related to this issue? http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=1432 The xensource issue is a segfault instead of a hang, but is on the same operation. xen-detect on the instance reports: Running in PV context on Xen v3.0. [ec2-user@ip-10-243-15-194 ~]$ gdb readdir64 GNU gdb (GDB) Fedora (7.2-23.fc14) Copyright (C) 2010 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "i686-redhat-linux-gnu". For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>... Reading symbols from /home/ec2-user/readdir64...done. (gdb) disp/i $pc (gdb) run Starting program: /home/ec2-user/readdir64 entry: .bash_logout result: .bash_logout entry: .bash_profile result: .bash_profile entry: .bash_history result: .bash_history entry: .ssh result: .ssh entry: .readdir64.c.swp result: .readdir64.c.swp entry: typescript result: typescript entry: .. result: .. entry: .bashrc result: .bashrc entry: readdir64 result: readdir64 entry: . result: . entry: readdir64.c result: readdir64.c ^C Program received signal SIGINT, Interrupt. 0x001cf6d0 in __readdir64_r (dirp=0x804a008, entry=0xbffff578, result=0xbffff574) at ../sysdeps/unix/readdir_r.c:132 132 return dp != NULL ? 0 : reclen ? errno : 0; 1: x/i $pc => 0x1cf6d0 <__readdir64_r+240>: cmovne %gs:(%edx),%eax (gdb) si ^C Program received signal SIGINT, Interrupt. 0x001cf6d0 in __readdir64_r (dirp=0x804a008, entry=0xbffff578, result=0xbffff574) at ../sysdeps/unix/readdir_r.c:132 132 return dp != NULL ? 0 : reclen ? errno : 0; 1: x/i $pc => 0x1cf6d0 <__readdir64_r+240>: cmovne %gs:(%edx),%eax (gdb) si ^C Program received signal SIGINT, Interrupt. 
0x001cf6d0 in __readdir64_r (dirp=0x804a008, entry=0xbffff578, result=0xbffff574) at ../sysdeps/unix/readdir_r.c:132 132 return dp != NULL ? 0 : reclen ? errno : 0; 1: x/i $pc => 0x1cf6d0 <__readdir64_r+240>: cmovne %gs:(%edx),%eax (gdb) info registers eax 0x0 0 ecx 0x0 0 edx 0xffffffc8 -56 ebx 0x2c1ff4 2891764 esp 0xbffff530 0xbffff530 ebp 0xbffff558 0xbffff558 esi 0x804a008 134520840 edi 0x0 0 eip 0x1cf6d0 0x1cf6d0 <__readdir64_r+240> eflags 0x10246 [ PF ZF IF RF ] cs 0x73 115 ss 0x7b 123 ds 0x7b 123 es 0x7b 123 fs 0x0 0 gs 0x33 51 (gdb)
Tell amazon.
Andreas, why was this marked as CLOSED NOTABUG? So many apps failing to run on the 32-bit F14 AMI definitely seems like a bug. Please describe why you believe it is not a bug and what you think we need to tell Amazon. Thanks!
Regardless of the reason, what do you recommend we tell Amazon in the bug report we send them? I'm afraid it's too low-level for most of us to know how to interpret the info you found.
I started a thread on AWS forums: https://forums.aws.amazon.com/thread.jspa?threadID=55419
Amazon is listening; what is it that we should tell them? Closing in on 3 weeks now since this was reported; apache not starting certainly seems to qualify as a bug in many people's opinion...?
I investigated this a little. First I noted that changing register values so that the cmovne instruction would do something different didn't help, but pounding $eip to skip that instruction did. This led me to believe that the problem did indeed have something to do with Xen trapping this instruction, which led me to several references (e.g. [1] [2] [3]) about segfaults and other problems with negative gs offsets - which are being used in this case. They also had some references to a "nosegneg" hardware capability in the dynamic-loading machinery to use library versions which avoid such offsets. I'm very far from being an expert in any of these areas, but I'm a curious kind of guy so I started experimenting. Sure enough, the following commands allowed the previously failing test to run: echo "hwcap 1 nosegneg" > /etc/ld.so.conf.d/libc6-xen.conf ldconfig I don't know whether this will hold up as a more general fix/workaround, or whether this hwcap is supposed to be set as part of how we build the images. Maybe someone with more relevant knowledge will chip in, but it seems like an important data point. [1] http://xen.1045712.n5.nabble.com/quot-hwcap-0-nosegneg-quot-doesnt-work-with-paravirt-ops-xen-as-of-2-6-23-9-td2513579.html [2] http://web.archiveorange.com/archive/v/tXSRNylI9Pu2xPy0T9k5 [3] http://www.mail-archive.com/blag-devel@lists.blagblagblag.org/msg00008.html
Good catch; was going to jump in myself and try to do this, but being at work I couldn't really do it until I had a free moment or so. From everything I'm reading, that should be the default setting anyway:

testhost$ pwd
/root/linux-2.6.35/arch/x86/xen
testhost$ cat vdso.h
/* Bit used for the pseudo-hwcap for non-negative segments. We use bit 1 to avoid bugs in some versions of glibc when bit 0 is used; the choice is otherwise arbitrary. */
#define VDSO_NOTE_NONEGSEG_BIT 1

That's the source from "yumdownloader --source kernel" - why is it 0, when it should be 1? It's set to 1 on Ubuntu images...though, it seems to have been 0 on older fedora images? http://web.archiveorange.com/archive/v/bUZ6Keh9JWAEGccFUebP shows this fun line:

NOTE_KERNELCAP(1, "nosegneg") /* Change 1 back to 0 when glibc is fixed! */

At http://wiki.xensource.com/xenwiki/XenSpecificGlibc we find a suggestion to compile glibc with a particular flag. But, looking in the glibc source file, we find these lines:

%ifarch i686
BuildFlags="-march=i686 -mtune=generic"
%endif
%ifarch i386 i486 i586
BuildFlags="$BuildFlags -mno-tls-direct-seg-refs"

So, here are my questions:
1) which should it be? if it's truly otherwise "arbitrary" like the kernel source inline documentation suggests, and only matters when it's needed (and when it's needed, it should be 1) then...why is it even an option?
2) was glibc "fixed" after the above info, making 0 the right choice, and then "broken" again recently?
3) shouldn't "-mno-tls-direct-seg-refs" be a build option for i686 too, not just i[345]86?
err...note, that's the glibc source spec file; from "yumdownloader --source glibc", and then ~/rpmbuild/SPECS/glibc.spec
Reassigned to Fedora/kernel
Having talked to some folks who know a lot more about this stuff than I do, I think the picture has become clearer. Apparently this issue of negative segment offsets is (or was once) fairly well known, which is why the whole "nosegneg" thing exists. This was apparently set by default in RHEL/Fedora until fairly recently, but it does have a fairly serious performance impact for bare metal so it is not set in RHEL6/F14. However, *some* of the machines at Amazon - apparently not all - are running a version of Xen that's buggy with respect to emulation of the cmovxx instruction with negative offsets. Even without such bugs, using nosegneg in the guest performs better than relying on emulation in the host, so for EC2 and other Xen-based infrastructures this option should still be set. To answer Brian's questions as best I can, then: (1) We *should* enable nosegneg for EC2 to avoid the faulty emulation. (2) I think the fix was to Xen, but we should set the option for performance reasons even where the fix is applicable. We should *not* set it for bare metal, though. (3) No idea. We'd probably need to investigate more to know whether it's even relevant. Nick, does "kernel" in this case include ld.so, or (considering that the bug might not apply to current Fedora kernels with KVM) should that go to glibc instead?
(In reply to comment #43)
> (2) I think the fix was to Xen,
Hi Jeff,
Can you expand on this a bit? Did you come across any particular threads with respect to fixing it in xen that you can point me to?
Thanks,
Drew
I'm afraid I don't have a more specific reference, Andrew, but I think the fix would have to be in Xen. All of the references I can find refer to negative offsets having a big performance impact, but this bug report exists because the emulation was not working *at all* on EC2. Maybe it's more a case of Amazon's proprietary patches breaking it than of anyone fixing it, but when stepping over a single instruction in the guest hangs it's hard to reach any conclusion other than that the bug is in the hypervisor emulation of that instruction.
Hmm, this sounds like something I should look into then. I think I'll clone this bug over to kernel-xen to see if the reproducer for it also reproduces on RHEL as well as on EC2.
(In reply to comment #44)
> (In reply to comment #43)
> > (2) I think the fix was to Xen,
>
> Hi Jeff,
>
> Can you expand on this a bit? Did you come across any particular threads with
> respect to fixing it in xen that you can point me to?
I'm the one who talked to Jeff about this. To be clear, I didn't mean to imply that the bug had definitely been fixed, just that I know between 5.0 and now we have fixed several emulation bugs in the Xen hypervisor. That being said, the performance impact of *not* using nosegneg on Xen is pretty horrendous, so we should definitely turn nosegneg on for F-14 guests running under Xen (irrespective of the emulation bug).
Chris Lalancette
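Since the setting is wanted under Xen but not on bare metal, an image script could guard it with a check like the sketch below. This is an illustration only: `is_xen_guest` is a hypothetical helper name, and it assumes the guest kernel exposes `/sys/hypervisor/type` (present when the Xen sysfs support is built in), which has not been verified on the F-14 EC2 images discussed here.

```shell
#!/bin/sh
# Sketch only: succeed when the kernel reports a Xen hypervisor.
# $1 optionally overrides the sysfs path (an assumption, used for dry runs).
is_xen_guest() {
    type_file="${1:-/sys/hypervisor/type}"
    # The file is absent on bare metal; on Xen it contains the word "xen".
    [ -r "$type_file" ] && [ "$(cat "$type_file")" = "xen" ]
}
```

A caller would then write the hwcap line only when `is_xen_guest` succeeds and leave bare-metal installs untouched.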
This is a problem in Amazon's hypervisor. Some EC2 machines are running a very old version of Xen (based on RHEL5.0). This particular bug has been fixed in RHEL5.2. In order to backport the fix, Amazon would only need the Xensource patch linked above; other relevant hypervisor fixes appeared in the RHEL kernels 2.6.18-133 and 2.6.18-170. Until Amazon fixes the hypervisor, no fix is possible.
glibc is fine since it provides the nosegneg files; there is no need to compile everything with -mno-tls-direct-seg-refs:
    %define xenarches i686 athlon
    %ifarch %{xenarches}
    %define buildxen 1
    %define xenpackage 0
    ...
    %endif
    ...
    %if %{buildxen}
    build_nptl linuxnptl-nosegneg -mno-tls-direct-seg-refs
    %endif
etc.
The kernel package is "guilty" since its /etc/ld.so.conf.d/*.conf file should include the hwcap. From a quick look at the spec, an ldconfig-kernel.conf file is missing.
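One way to check that the nosegneg libraries are actually registered with the loader is to look for them in the cache listing printed by `ldconfig -p`. The sketch below is hedged: `count_nosegneg` is a hypothetical helper name, and it assumes the libraries live under a directory named `nosegneg` (for example `/lib/i686/nosegneg/`), which should be confirmed against the installed glibc package.

```shell
#!/bin/sh
# Sketch only: count loader-cache entries whose path contains /nosegneg/.
# Reads "ldconfig -p" style output on stdin and prints the number of matches.
count_nosegneg() {
    # grep -c prints the match count but exits 1 when it is zero,
    # so "|| true" keeps the function's exit status at 0 either way.
    grep -c '/nosegneg/' || true
}
```

Typical use would be `ldconfig -p | count_nosegneg`; a result of 0 after running ldconfig would suggest either that the hwcap line is missing or that the libraries are not installed.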
So the least invasive fix is to add
    echo "hwcap 1 nosegneg" > /etc/ld.so.conf.d/libc6-xen.conf
in the 32-bit image post. The next images will have this done.
For what it's worth, I concur: Amazon needs to patch the old servers that host the 32-bit machines, but until that is done, the fix is for the 32-bit AMI to carry that setting.
Just a note that Amazon is still carrying a similar workaround in their Amazon Linux images as of January 2013.