Bug 1173946
Summary: | Squid fails to run. Gives Illegal instruction (core dumped) | |
---|---|---|---
Product: | [Fedora] Fedora | Reporter: | otheos <bugzilla>
Component: | squid | Assignee: | Pavel Šimerda (pavlix) <psimerda>
Status: | CLOSED ERRATA | QA Contact: | Fedora Extras Quality Assurance <extras-qa>
Severity: | urgent | Docs Contact: |
Priority: | unspecified | |
Version: | 21 | CC: | henrik, jonathansteffan, marcosfrm, mluscon, psimerda, red, sylvain, thozza, trevor
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | i686 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | squid-3.4.12-2.fc22 | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2015-03-31 21:56:20 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Attachments: |
More testing: The issue is reproducible on a clean install. The issue is limited to the Athlon 64 3200+ CPU. I have now tested the Athlon II X3, X4 and Phenom II X4 and had no issues. So this is a problem in the compilation of the binary, which has specific optimisations for CPUs that the Athlon 64 3200+ clearly cannot cope with. Can we please have an i386 version of this (or a new compile of the i686 that also works on the Athlon 64)? Thanks.

Tested on Athlon x64 x2 (not II) 4800+ with good results. This issue is now (from my end) limited to the original Athlon 64 single core (Venice). Thank you.

Final test: Fedora 21 x64 (not 32 bit) on Athlon 64 3200+. Squid works as it should. So this problem is now confined to:
32 bit
Fedora 21
Athlon 64 3200+ single core
Thank you.

P.S. I sideloaded squid-3.3.13-2.fc20.i686.rpm (from F20) on to the system (F21 32bit) and it works fine.

*** Bug 1175895 has been marked as a duplicate of this bug. ***

It is interesting that it works correctly if you just recompile from source:
    $ mock squid-3.4.7-2.fc21.src.rpm
    # yum reinstall ./squid-3.4.7-2.fc21.src.rpm
...and then it runs correctly.

I just saw this bug. I have 4 boxes, here are the results:
    # rpm -q squid ; uname --kernel-release ; cat /proc/cpuinfo | grep 'model name' | head -1

WORKS:
    squid-3.4.7-2.fc21.i686
    3.18.7-200.fc21.i686
    model name : Intel(R) Core(TM)2 Quad CPU Q9450 @ 2.66GHz

WORKS:
    squid-3.4.7-2.fc21.i686
    3.18.7-200.fc21.i686+PAE
    model name : Intel(R) Xeon(R) CPU E3-1230 V2 @ 3.30GHz

WORKS:
    squid-3.4.7-2.fc21.x86_64
    3.18.5-201.fc21.x86_64
    model name : Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz

FAILS (as per this bug):
    squid-3.4.7-2.fc21.i686
    3.18.6-200.fc21.i686
    model name : Intel(R) Pentium(R) 4 CPU 2.80GHz

So only my P4 fails to run squid. 2 32-bit machines work. 1 64-bit machine works. My guess is this is 32-bit + older CPU. I will have more data points over the next few weeks as I upgrade other boxes to newest squid.
All these boxes have run squid problem free for years. So this is a new problem.

Yes, looks like some sort of build problem or perhaps even a compiler bug?

Here's the end of the strace when it crashes on the P4 machine:
    time(NULL) = 1424408732
    time(NULL) = 1424408732
    mmap2(NULL, 266240, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb6734000
    gettimeofday({1424408732, 656346}, NULL) = 0
    open("/etc/localtime", O_RDONLY|O_CLOEXEC) = 3
    fstat64(3, {st_mode=S_IFREG|0644, st_size=2891, ...}) = 0
    fstat64(3, {st_mode=S_IFREG|0644, st_size=2891, ...}) = 0
    mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb71f1000
    read(3, "TZif2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\7\0\0\0\7\0\0\0\0"..., 4096) = 2891
    _llseek(3, -24, [2867], SEEK_CUR) = 0
    read(3, "\nCST6CDT,M3.2.0,M11.1.0\n", 4096) = 24
    close(3) = 0
    munmap(0xb71f1000, 4096) = 0
    time(NULL) = 1424408732
    gettimeofday({1424408732, 658173}, NULL) = 0
    socket(PF_INET6, SOCK_STREAM, IPPROTO_IP) = 3
    setsockopt(3, SOL_IPV6, IPV6_V6ONLY, [0], 4) = 0
    close(3) = 0
    --- SIGILL {si_signo=SIGILL, si_code=ILL_ILLOPN, si_addr=0xb730b343} ---
    +++ killed by SIGILL (core dumped) +++

Here's the gdb result with debuginfos:
    (gdb) run -f /etc/squid/squid.conf
    Starting program: /usr/sbin/squid -f /etc/squid/squid.conf
    [Thread debugging using libthread_db enabled]
    Using host libthread_db library "/lib/libthread_db.so.1".

    Program received signal SIGILL, Illegal instruction.
    0x800f1343 in parseTimeLine (tptr=tptr@entry=0xbffff6d0, allowMsec=allowMsec@entry=false, units=0x80481829 "second") at cache_cf.cc:1053
    1053 *tptr = static_cast<time_msec_t>(m * d);

More:
    (gdb) where
    #0 0x800f1343 in parseTimeLine (tptr=tptr@entry=0xbffff6d0, allowMsec=allowMsec@entry=false, units=0x80481829 "second") at cache_cf.cc:1053
    #1 0x800f51ac in parse_time_t (var=0x8068db78 <Config+440>) at cache_cf.cc:3080
    #2 0x80106773 in parse_line (buff=<optimized out>) at cf_parser.cci:819
    #3 0x8010898f in default_line (s=s@entry=0x804854f8 "authenticate_cache_garbage_interval 1 hour") at cf_parser.cci:15
    #4 0x80109914 in default_all () at cf_parser.cci:24
    #5 parseConfigFile (file_name=0x8069ed78 "/etc/squid/squid.conf") at cache_cf.cc:613
    #6 0x8021681c in SquidMain (argc=3, argv=0xbffffaf4) at main.cc:1406
    #7 0x800e1bca in SquidMainSafe (argv=0xbffffaf4, argc=3) at main.cc:1260
    #8 main (argc=3, argv=0xbffffaf4) at main.cc:1252
    (gdb) print m
    $1 = 3600000
    (gdb) print d
    $2 = <optimized out>

The source code of the failing function, failing line indicated with *****

    /* Parse a time specification from the config file. Store the
     * result in 'tptr', after converting it to 'units' */
    static void
    parseTimeLine(time_msec_t * tptr, const char *units, bool allowMsec)
    {
        char *token;
        double d;
        time_msec_t m;
        time_msec_t u;

        if ((u = parseTimeUnits(units, allowMsec)) == 0)
            self_destruct();

        if ((token = strtok(NULL, w_space)) == NULL)
            self_destruct();

        d = xatof(token);

        m = u; /* default to 'units' if none specified */

        if (0 == d)
            (void) 0;
        else if ((token = strtok(NULL, w_space)) == NULL)
            debugs(3, DBG_CRITICAL, "WARNING: No units on '" <<
                   config_input_line << "', assuming " <<
                   d << " " << units );
        else if ((m = parseTimeUnits(token, allowMsec)) == 0)
            self_destruct();

        *tptr = static_cast<time_msec_t>(m * d); /************FAIL**************/

        if (static_cast<double>(*tptr) * 2 != m * d * 2) {
            debugs(3, DBG_CRITICAL, "ERROR: Invalid value '" <<
                   d << " " << token << ": integer overflow (time_msec_t).");
            self_destruct();
        }
    }

Perhaps odd that d is "optimized out" at the failing point? I don't know enough C/gdb to know if the problem is on the LHS or RHS of the *tptr cast line. Perhaps someone can recognize these structures / code and any relevant recent changes.

Squid is compiled with compiler optimizations enabled and it is normal that local variables are optimized out by gcc to the level that gdb can't track them. Also line indications are not 100% reliable.

Please include cat /proc/cpuinfo of the failing machine and a source annotated gdb disassembly of the failing function
    (gdb) disass /mr
after gdb trapped the SIGILL.

You can also try to rebuild the squid rpm on your P4 server.
    yumdownloader --source squid-3.4.7-2.fc21
    yum-builddep squid-3.4.7-2.fc21.src.rpm
    rpmbuild --rebuild squid-3.4.7-2.fc21.src.rpm
    yum remove squid
    yum localinstall rpmbuild/RPMS/i686/squid-3.4.7-2.fc21.i686.rpm

Alternatively, if you are comfortable with editing spec files, the rpmbuild and later steps can be replaced by
    rpm -i squid-3.4.7-2.fc21.src.rpm
then edit rpmbuild/SPECS/squid.spec and add your own tag on the Release line to set the package apart from the original Fedora build.
    rpmbuild -bb rpmbuild/SPECS/squid.spec
    yum localinstall rpmbuild/RPMS/i686/squid-3.4.7-2.fc21.yourtag.i686.rpm

Failing machine:
    processor : 0
    vendor_id : GenuineIntel
    cpu family : 15
    model : 2
    model name : Intel(R) Pentium(R) 4 CPU 2.80GHz
    stepping : 9
    microcode : 0x2e
    cpu MHz : 2806.350
    cache size : 512 KB
    physical id : 0
    siblings : 1
    core id : 0
    cpu cores : 1
    apicid : 0
    initial apicid : 0
    fdiv_bug : no
    f00f_bug : no
    coma_bug : no
    fpu : yes
    fpu_exception : yes
    cpuid level : 2
    wp : yes
    flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe pebs bts cid xtpr
    bugs :
    bogomips : 5612.70
    clflush size : 64
    cache_alignment : 128
    address sizes : 36 bits physical, 32 bits virtual
    power management:

Here's the first bit; if you need the whole (multi-page) output it gave me, I can attach it later.
    (gdb) disass /mr
    Dump of assembler code for function parseTimeLine(time_msec_t*, bool, char const*):
    273 LegacyParser.destruct();
    0x800f150e <+622>: 8d 83 84 63 00 00 lea 0x6384(%ebx),%eax
    0x800f1514 <+628>: 89 04 24 mov %eax,(%esp)
    0x800f1517 <+631>: e8 f4 e9 06 00 call 0x8015ff10 <ConfigParser::destruct()>
    0x800f151c <+636>: eb 04 jmp 0x800f1522 <parseTimeLine(time_msec_t*, bool, char const*)+642>
    0x800f151e <+638>: 66 90 xchg %ax,%ax
    0x800f1520 <+640>: dd d8 fstp %st(0)
    0x800f1591 <+753>: 8d 83 84 63 00 00 lea 0x6384(%ebx),%eax
    0x800f1597 <+759>: 89 04 24 mov %eax,(%esp)
    0x800f159a <+762>: e8 71 e9 06 00 call 0x8015ff10 <ConfigParser::destruct()>
    0x800f159f <+767>: dd 44 24 28 fldl 0x28(%esp)
    0x800f15a3 <+771>: e9 6d fd ff ff jmp 0x800f1315 <parseTimeLine(time_msec_t*, bool, char const*)+117>
    0x800f15a8 <+776>: 8d 83 84 63 00 00 lea 0x6384(%ebx),%eax
    0x800f15ae <+782>: 89 04 24 mov %eax,(%esp)
    0x800f15b1 <+785>: e8 5a e9 06 00 call 0x8015ff10 <ConfigParser::destruct()>
    0x800f15b6 <+790>: e9 1e fd ff ff jmp 0x800f12d9 <parseTimeLine(time_msec_t*, bool, char const*)+57>
    0x800f15bb <+795>: 90 nop
    0x800f15bc <+796>: 8d 74 26 00 lea 0x0(%esi,%eiz,1),%esi
    0x800f1788 <+1256>: 8d 83 84 63 00 00 lea 0x6384(%ebx),%eax
    0x800f178e <+1262>: 89 04 24 mov %eax,(%esp)
    0x800f1791 <+1265>: e8 7a e7 06 00 call 0x8015ff10 <ConfigParser::destruct()>
    0x800f1796 <+1270>: e9 5e fb ff ff jmp 0x800f12f9 <parseTimeLine(time_msec_t*, bool, char const*)+89>
    0x800f179b <+1275>: 90 nop
    0x800f179c <+1276>: 8d 74 26 00 lea 0x0(%esi,%eiz,1),%esi

Strange, while compiling (it's taking a while) the gcc progress output looks like:
    g++ -DHAVE_CONFIG_H -DDEFAULT_CONFIG_FILE=\"/etc/squid/squid.conf\" -DDEFAULT_SQUID_DATA_DIR=\"/usr/share/squid\" -DDEFAULT_SQUID_CONFIG_DIR=\"/etc/squid\" -I.. -I../include -I../lib -I../src -I../include -I../src -I/usr/include/libxml2 -I/usr/include/libxml2 -Wall -Wpointer-arith -Wwrite-strings -Wcomments -Wshadow -Werror -pipe -D_REENTRANT -m32 -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m32 -march=i686 -mtune=atom -fasynchronous-unwind-tables -fpie -march=native -std=c++11 -c -o pconn.o pconn.cc

Why is -mtune=atom being set? Isn't that completely bogus, or am I missing something here?

I rebuilt squid on the P4 and it now runs ok. I set a break at the offending function, stepped through all calls to it and bt'd, and never once saw a bt similar to the one shown in comment #9: it never had the *default* functions show up in it. Weird.

Here's the disass of said fn in the new binary:
    Dump of assembler code for function parseTimeLine(time_msec_t*, bool, char const*):
    273 LegacyParser.destruct();
    0x800f16c6 <+646>: 8d 83 84 63 00 00 lea 0x6384(%ebx),%eax
    0x800f16cc <+652>: 89 04 24 mov %eax,(%esp)
    0x800f16cf <+655>: e8 fc fe 06 00 call 0x801615d0 <ConfigParser::destruct()>
    0x800f16d4 <+660>: eb 04 jmp 0x800f16da <parseTimeLine(time_msec_t*, bool, char const*)+666>
    0x800f16d6 <+662>: 66 90 xchg %ax,%ax
    0x800f16d8 <+664>: dd d8 fstp %st(0)
    0x800f1761 <+801>: 8d 83 84 63 00 00 lea 0x6384(%ebx),%eax
    0x800f1767 <+807>: 89 04 24 mov %eax,(%esp)
    0x800f176a <+810>: e8 61 fe 06 00 call 0x801615d0 <ConfigParser::destruct()>
    0x800f176f <+815>: dd 44 24 20 fldl 0x20(%esp)
    0x800f1773 <+819>: e9 3d fd ff ff jmp 0x800f14b5 <parseTimeLine(time_msec_t*, bool, char const*)+117>
    0x800f1778 <+824>: 8d 83 84 63 00 00 lea 0x6384(%ebx),%eax
    0x800f177e <+830>: 89 04 24 mov %eax,(%esp)
    0x800f1781 <+833>: e8 4a fe 06 00 call 0x801615d0 <ConfigParser::destruct()>
    0x800f1786 <+838>: e9 ee fc ff ff jmp 0x800f1479 <parseTimeLine(time_msec_t*, bool, char const*)+57>
    0x800f178b <+843>: 90 nop
    0x800f178c <+844>: 8d 74 26 00 lea 0x0(%esi,%eiz,1),%esi
    0x800f1958 <+1304>: 8d 83 84 63 00 00 lea 0x6384(%ebx),%eax
    0x800f195e <+1310>: 89 04 24 mov %eax,(%esp)
    0x800f1961 <+1313>: e8 6a fc 06 00 call 0x801615d0 <ConfigParser::destruct()>
    0x800f1966 <+1318>: e9 2e fb ff ff jmp 0x800f1499 <parseTimeLine(time_msec_t*, bool, char const*)+89>
    0x800f196b <+1323>: 90 nop
    0x800f196c <+1324>: 8d 74 26 00 lea 0x0(%esi,%eiz,1),%esi

Very similar to the faulting asm posted before. I don't know enough asm to figure out why one faults and not the other.

New data point, new machine was upgraded:

WORKS:
    squid-3.4.7-2.fc21.i686
    3.18.7-200.fc21.i686+PAE
    model name : Intel(R) Pentium(R) D CPU 2.80GHz

Confirms again my newer-cpu-works theory (so far). Cutoff seems to be between P4 and P-D in Intel-land, and AMD as per the original commenter, which I'd bet is right around the same era.

We need --disable-arch-native on ./configure:
http://bazaar.launchpad.net/~squid/squid/3.4/view/head:/configure.ac#L38

As per my P4 build, which was NOT virtualized in any way, it detected arch ok but (until someone informs me the opposite) it detected tune wrong (atom). But even the wrong tune shouldn't produce SIGILL, right? So the --disable-arch-native is really a hack to work around something seriously wrong in the build chain. There's no way a compile should produce illegal instructions on the correct arch, right? Someone working upstream should mention this.

Actually, looking again at my comment #13, see that the squid build is specifying arch twice:
    -march=i686 -mtune=atom -fasynchronous-unwind-tables -fpie -march=native
Is that even legal? Doesn't the 2nd just override the 1st? Yes, that could explain problems on virtual machine cross-compile builds. Is there a way to see what options it is compiling with on Fedora's build systems? Perhaps someone needs to find a better way for it to detect the correct arch (like force it in the specfile?) and avoid native/not-native altogether. And fix that tune problem too.
In fact, I just looked in man gcc for "atom" as a tune target and I couldn't even find it!
    man gcc | grep -P '\batom\b' # empty.
Is atom even a valid tune?

(In reply to Trevor Cordes from comment #18)
> Actually, looking again at my comment #13, see that the squid build is
> specifying arch twice:
>
> -march=i686 -mtune=atom -fasynchronous-unwind-tables -fpie -march=native
>
> Is that even legal? Doesn't the 2nd just override the 1st?

Yes, it does override. -march affects the instruction set, and the squid configure script should really not be doing that by default. Short term solution: --disable-arch-native. Long term: try to convince upstream to stop doing this by default. It has to be inverted. If a user wants a highly optimized build for his/her CPU, pass --enable-arch-native.

(In reply to Trevor Cordes from comment #18)
> that tune problem too. In fact, I just looked in man gcc for "atom" as a
> tune target and I couldn't even find it! man gcc | grep -P '\batom\b' #
> empty. Is atom even a valid tune?

Atom is valid but it is an alias to 'bonnell' on current versions:
https://gcc.gnu.org/onlinedocs/gcc-4.9.2/gcc/i386-and-x86-64-Options.html

Since F12, the x86-32 Fedora toolchain configures GCC this way:
https://fedoraproject.org/wiki/Features/F12X86Support

So what -march is the Fedora build env + squid configure picking for the rpm? I'm just curious. Is there any way to tell? Must be pretty close to i686 to give these results. Maybe guessing nocona (from the man gcc list)? That seems to be the cutoff we're seeing in the real world.

Aside: The mtune=atom tidbit is interesting, I see that on the Fedora link you listed, but it doesn't say why. Strange that we tune for a CPU that probably only 1% of Fedora systems are using!! Must be something I am missing. Maybe it makes a good lowest-common-denominator? Actually, looking at man gcc, atom doesn't even seem to be a tune option?? Weird.
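The "second -march overrides the first" point can be illustrated with a small sketch. It only models the rule stated in the reply above, that the last occurrence of an option on the command line is the one that takes effect, applied to the flag excerpt quoted from comment #18. The helper name effective_flag is hypothetical, and the sketch does not model how the gcc driver later expands "native".

```python
import shlex


def effective_flag(command, name):
    """Return the value of the last `name=value` option in a command line, or None.

    Hypothetical helper for illustration: it models only "last occurrence wins".
    """
    prefix = name + "="
    value = None
    for token in shlex.split(command):
        if token.startswith(prefix):
            value = token[len(prefix):]  # a later occurrence replaces an earlier one
    return value


# Flag excerpt quoted in the thread (comment #18).
flags = "-march=i686 -mtune=atom -fasynchronous-unwind-tables -fpie -march=native"

print(effective_flag(flags, "-march"))  # native
print(effective_flag(flags, "-mtune"))  # atom
```

Read this way, the distribution's -march=i686 is no longer in effect once the configure script appends -march=native, so the instruction set is whatever the build host's CPU offers.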
(In reply to Trevor Cordes from comment #21)
> So what -march is the Fedora build env + squid configure picking for the
> rpm? I'm just curious. Is there any way to tell? Must be pretty close to
> i686 to give these results. Maybe guessing nocona (from the man gcc list)?
> That seems to be the cutoff we're seeing in the real world.

    gcc -march=native -E -v - </dev/null 2>&1 | sed -n 's/.* -v - //p'

It has to run on the build servers for us to see what is being used.

>
> Aside: The mtune=atom tidbit is interesting, I see that on the Fedora link
> you listed, but it doesn't say why. Strange that we tune for a CPU that
> probably only 1% of Fedora systems are using!! Must be something I am
> missing. Maybe it makes a good lowest-common-denominator? Actually,
> looking at man gcc, atom doesn't even seem to be a tune option?? Weird.

Atom was documented until GCC 4.8 (it was introduced in 4.5). In 4.9, Intel developers renamed lots of options (but kept aliases for older ones).

The reasoning back then:
http://www.mail-archive.com/fedora-devel-list@redhat.com/msg02889.html

Thanks for the detailed analysis.

A new build has been scheduled with --disable-arch-native, and the question about -march=native being added by default has been brought up with the other Squid developers.

Regards
Henrik

squid-3.4.12-1.fc21 has been submitted as an update for Fedora 21.
https://admin.fedoraproject.org/updates/squid-3.4.12-1.fc21

squid-3.4.12-1.fc22 has been submitted as an update for Fedora 22.
https://admin.fedoraproject.org/updates/squid-3.4.12-1.fc22

Package squid-3.4.12-1.fc22:
* should fix your issue,
* was pushed to the Fedora 22 testing repository,
* should be available at your local mirror within two days.
Update it with:
    # su -c 'yum update --enablerepo=updates-testing squid-3.4.12-1.fc22'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2015-2630/squid-3.4.12-1.fc22
then log in and leave karma (feedback).
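Once the build host's expansion of -march=native is known, it can be compared with what the failing machine supports. The sketch below is only an illustration of that comparison: it takes the flags line from the Pentium 4 /proc/cpuinfo posted earlier in this bug and reports which entries of a candidate list are absent. The candidate list and the helper name missing_extensions are assumptions chosen for the example; the thread does not establish which instruction actually faulted. "pni" is the /proc/cpuinfo name for SSE3.

```python
def missing_extensions(cpuinfo_flags, wanted):
    """Return the entries of `wanted` that are not in a /proc/cpuinfo flags string."""
    present = set(cpuinfo_flags.split())  # whole tokens, so "pse" never matches "pse36"
    return [flag for flag in wanted if flag not in present]


# flags line of the failing Pentium 4, copied from the cpuinfo posted in this bug.
P4_FLAGS = (
    "fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 "
    "clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe pebs bts cid xtpr"
)

# Assumption: an illustrative list of SIMD extensions, not taken from the build logs.
CANDIDATES = ["sse", "sse2", "pni", "ssse3", "sse4_1", "sse4_2", "avx"]

print(missing_extensions(P4_FLAGS, CANDIDATES))
# ['pni', 'ssse3', 'sse4_1', 'sse4_2', 'avx']
```

On a live system the same function can be fed the flags line read from /proc/cpuinfo instead of the pasted string.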
*** Bug 1177376 has been marked as a duplicate of this bug. ***

squid-3.4.12-2.fc21 has been submitted as an update for Fedora 21.
https://admin.fedoraproject.org/updates/squid-3.4.12-2.fc21

I can confirm the issue: the problem I had was caused by running squid on a P4. The successful fix in my case was to rebuild from the Fedora 21 release SRPM as is, with mock --rebuild, on the local machine where the problem occurred.

squid-3.4.12-2.fc22 has been submitted as an update for Fedora 22.
https://admin.fedoraproject.org/updates/squid-3.4.12-2.fc22

squid-3.4.12-2.fc21 has been pushed to the Fedora 21 stable repository. If problems still persist, please make note of it in this bug report.

squid-3.4.12-2.fc22 has been pushed to the Fedora 22 stable repository. If problems still persist, please make note of it in this bug report.
Created attachment 968444
strace output of squid.

Description of problem:
Squid will not start at all, giving: Illegal instruction (core dumped). This is F21 32bit on an Athlon 64 3200+ with 3GB RAM. It seems this rpm was compiled for the wrong architecture (not i386). squid runs fine on x64 systems.

Version-Release number of selected component (if applicable):
squid.3.4.7.2.fc21.i686

How reproducible:
Every time

Steps to Reproduce:
1. install squid.3.4.7.2.fc21.i686 on 32bit system
2. run squid.
3.

Actual results:
Illegal instruction (core dumped)

Expected results:
squid runs normally.

Additional info:
see attached strace output.