Description of problem:
Condor was upgraded from condor-7.6.5-0.14.el6.i686 to condor-7.6.5-0.22.el6.i686. After the condor restart, both the condor_master and condor_collector daemons crashed.

The affected machine has a custom hostname (rhel-6-i386.virtualdomain) with a proper entry in /etc/hosts:
~~~
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
# local pool
192.168.122.7 rhel-5-i386.virtualdomain rhel-5-i386
192.168.122.198 rhel-5-x86_64.virtualdomain rhel-5-x86_64
192.168.122.231 rhel-6-i386.virtualdomain rhel-6-i386
192.168.122.111 rhel-6-x86_64.virtualdomain rhel-5-x86_64
192.168.122.187 rhel-5-x86_64_duo.virtualdomain rhel-5-x86_64_duo
~~~
The default personal condor configuration was used. The master log shows (using 'ALL_DEBUG=D_FULLDEBUG'):
~~~
09/25/12 13:29:02 IPVERIFY: checking rhel-6-i386.virtualdomain against e77aa8c0
09/25/12 13:29:02 IPVERIFY: matched e77aa8c0 to e77aa8c0
09/25/12 13:29:02 IPVERIFY: ip found is 1
09/25/12 13:29:02 IPVERIFY: checking rhel-6-i386 against e77aa8c0
09/25/12 13:29:02 IPVERIFY: comparing 100007f to e77aa8c0
09/25/12 13:29:02 IPVERIFY: ip found is 0
09/25/12 13:29:02 WARNING: forward resolution of localhost4 doesn't match e77aa8c0!
Stack dump for process 5237 at timestamp 1348572542 (24 frames)
condor_master(dprintf_dump_stack+0x44)[0x810cfb4]
condor_master[0x8144a87]
[0x743400]
/lib/libc.so.6(_IO_vfprintf+0x38fe)[0xe7835e]
/lib/libc.so.6(__vsnprintf_chk+0xd4)[0xf2e104]
condor_master(vprintf_length+0x38)[0x81295f8]
condor_master(vsprintf_realloc+0x4b)[0x812964b]
condor_master[0x810da2d]
condor_master(_condor_dprintf_va+0x318)[0x810ef18]
condor_master(dprintf+0x20)[0x81380d0]
condor_master(_Z18verify_name_has_ipPc7in_addr+0x2c)[0x80cc03c]
condor_master(_ZN8IpVerify6VerifyE12DCpermissionPK11sockaddr_inPKcP8MyStringS7_+0x6bd)[0x80ce69d]
condor_master(_ZN6SecMan6VerifyE12DCpermissionPK11sockaddr_inPKcP8MyStringS7_+0x3d)[0x80e174d]
condor_master(_ZN10DaemonCore6VerifyEPKc12DCpermissionPK11sockaddr_inS1_+0x71)[0x80a3871]
condor_master(_ZN10DaemonCore9HandleReqEP6StreamS1_+0xcb7)[0x80b1c87]
condor_master(_ZN10DaemonCore22HandleReqSocketHandlerEP6Stream+0x5f)[0x80b47bf]
condor_master(_ZN10DaemonCore24CallSocketHandler_workerEibP6Stream+0x5af)[0x80b4f6f]
condor_master(_ZN10DaemonCore35CallSocketHandler_worker_demarshallEPv+0x2d)[0x80b503d]
condor_master(_ZN13CondorThreads8pool_addEPFvPvES0_PiPKc+0x57)[0x81429f7]
condor_master(_ZN10DaemonCore17CallSocketHandlerERib+0x107)[0x80aae17]
condor_master(_ZN10DaemonCore6DriverEv+0x1f6d)[0x80af99d]
condor_master(main+0x1432)[0x809e4b2]
/lib/libc.so.6(__libc_start_main+0xe6)[0xe4ace6]
condor_master[0x8092761]
~~~
The condor_collector log shows the same problem:
~~~
09/25/12 13:29:02 IPVERIFY: checking rhel-6-i386.virtualdomain against e77aa8c0
09/25/12 13:29:02 IPVERIFY: matched e77aa8c0 to e77aa8c0
09/25/12 13:29:02 IPVERIFY: ip found is 1
09/25/12 13:29:02 IPVERIFY: checking rhel-6-i386 against e77aa8c0
09/25/12 13:29:02 IPVERIFY: comparing 100007f to e77aa8c0
09/25/12 13:29:02 IPVERIFY: ip found is 0
09/25/12 13:29:02 WARNING: forward resolution of localhost4 doesn't match e77aa8c0!
Stack dump for process 5239 at timestamp 1348572542 (24 frames)
condor_collector(dprintf_dump_stack+0x44)[0x8125834]
condor_collector[0x8164a07]
[0xb25400]
/lib/libc.so.6(_IO_vfprintf+0x38fe)[0x58a35e]
/lib/libc.so.6(__vsnprintf_chk+0xd4)[0x640104]
condor_collector(vprintf_length+0x38)[0x8141b78]
condor_collector(vsprintf_realloc+0x4b)[0x8141bcb]
condor_collector[0x81262ad]
condor_collector(_condor_dprintf_va+0x318)[0x8127798]
condor_collector(dprintf+0x20)[0x8156520]
condor_collector(_Z18verify_name_has_ipPc7in_addr+0x2c)[0x80de9cc]
condor_collector(_ZN8IpVerify6VerifyE12DCpermissionPK11sockaddr_inPKcP8MyStringS7_+0x6bd)[0x80e102d]
condor_collector(_ZN6SecMan6VerifyE12DCpermissionPK11sockaddr_inPKcP8MyStringS7_+0x3d)[0x80f3ccd]
condor_collector(_ZN10DaemonCore6VerifyEPKc12DCpermissionPK11sockaddr_inS1_+0x71)[0x80b7891]
condor_collector(_ZN10DaemonCore9HandleReqEP6StreamS1_+0xcb7)[0x80c5ca7]
condor_collector(_ZN10DaemonCore22HandleReqSocketHandlerEP6Stream+0x5f)[0x80c87df]
condor_collector(_ZN10DaemonCore24CallSocketHandler_workerEibP6Stream+0x5af)[0x80c8f8f]
condor_collector(_ZN10DaemonCore35CallSocketHandler_worker_demarshallEPv+0x2d)[0x80c905d]
condor_collector(_ZN13CondorThreads8pool_addEPFvPvES0_PiPKc+0x57)[0x8162977]
condor_collector(_ZN10DaemonCore17CallSocketHandlerERib+0x107)[0x80bee37]
condor_collector(_ZN10DaemonCore6DriverEv+0x1f6d)[0x80c39bd]
condor_collector(main+0x1432)[0x80b24d2]
/lib/libc.so.6(__libc_start_main+0xe6)[0x55cce6]
condor_collector[0x809c7c1]
~~~

Version-Release number of selected component (if applicable):
[root@rhel-6-i386 condor]# rpm -qa | grep condor
condor-classads-7.6.5-0.22.el6.i686
python-condorutils-1.5-4.el6.noarch
condor-7.6.5-0.22.el6.i686

How reproducible:
I failed to reproduce the problem on a freshly installed machine.

Steps to Reproduce:
Don't know.

Actual results:
Some condor daemons crash.

Expected results:
Condor should not crash.
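For readers of the IPVERIFY lines above, the hexadecimal values can be turned back into dotted-quad addresses. The following is a minimal sketch, assuming the daemon prints the raw 32-bit address as a host-order hex word on a little-endian (x86) machine; the helper name `decode_logged_addr` is hypothetical and not part of condor.

```python
import socket
import struct


def decode_logged_addr(hex_word: str) -> str:
    """Decode an address logged as a hex word on a little-endian host.

    Assumption: the logged value is the in-memory 32-bit address read as
    a native integer, so re-packing it little-endian recovers the
    original network-order bytes.
    """
    value = int(hex_word, 16)
    return socket.inet_ntoa(struct.pack("<I", value))


print(decode_logged_addr("e77aa8c0"))  # 192.168.122.231
print(decode_logged_addr("100007f"))   # 127.0.0.1
```

Under that assumption, e77aa8c0 is the machine's own entry in /etc/hosts (192.168.122.231) and 100007f is the loopback address.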
Additional info:
Although I don't know how to reproduce the issue on a freshly installed machine, I was able to reproduce it on my other, older virtual machines: the rhel-6-x86_64 one as well as the rhel 5 nodes (both i386 and x86_64) when using the /etc/hosts file from rhel 6. All of these machines were installed 2 months ago and I use them for testing purposes.

Quick fix:
When you replace the following lines in /etc/hosts (the default on rhel 6):
~~~
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
~~~
with these:
~~~
127.0.0.1 localhost
::1 localhost
~~~
the problem doesn't occur.

Related problems:
https://bugzilla.redhat.com/show_bug.cgi?id=853945
https://lists.cs.wisc.edu/archive/condor-users/2012-August/msg00086.shtml
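To make the effect of the quick fix concrete, here is a small sketch that lists which 127.0.0.1 aliases disappear when the default lines are replaced. The helper `names_for` and the two string constants are hypothetical; the host lines are copied from this report. It only compares the file contents and does not claim to explain why removing the aliases avoids the crash.

```python
DEFAULT_RHEL6 = """\
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.122.231 rhel-6-i386.virtualdomain rhel-6-i386
"""

REDUCED = """\
127.0.0.1 localhost
::1 localhost
192.168.122.231 rhel-6-i386.virtualdomain rhel-6-i386
"""


def names_for(hosts_text, address):
    """Return every name listed for `address` in hosts-file style text."""
    names = []
    for line in hosts_text.splitlines():
        # Drop comments, then split into address followed by names.
        fields = line.split("#", 1)[0].split()
        if fields and fields[0] == address:
            names.extend(fields[1:])
    return names


removed = set(names_for(DEFAULT_RHEL6, "127.0.0.1")) - set(
    names_for(REDUCED, "127.0.0.1")
)
print(sorted(removed))
# ['localhost.localdomain', 'localhost4', 'localhost4.localdomain4']
```

The name from the WARNING line, localhost4, is among the aliases that no longer exist after the change.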
Interesting note:
When I don't use the ALL_DEBUG=D_FULLDEBUG option, I get a slightly different stack trace in the condor master log:
~~~
09/25/12 16:30:55 WARNING: forward resolution of localhost4 doesn't match e77aa8c0!
Stack dump for process 6657 at timestamp 1348583455 (19 frames)
condor_master(dprintf_dump_stack+0x44)[0x810cfb4]
condor_master[0x8144a87]
[0x2c8400]
/lib/libc.so.6(__nss_hostname_digits_dots+0x39)[0x549239]
/lib/libc.so.6(gethostbyname+0x9a)[0x54e3ba]
condor_master(_Z18verify_name_has_ipPc7in_addr+0x34)[0x80cc044]
condor_master(_ZN8IpVerify6VerifyE12DCpermissionPK11sockaddr_inPKcP8MyStringS7_+0x6bd)[0x80ce69d]
condor_master(_ZN6SecMan6VerifyE12DCpermissionPK11sockaddr_inPKcP8MyStringS7_+0x3d)[0x80e174d]
condor_master(_ZN10DaemonCore6VerifyEPKc12DCpermissionPK11sockaddr_inS1_+0x71)[0x80a3871]
condor_master(_ZN10DaemonCore9HandleReqEP6StreamS1_+0xcb7)[0x80b1c87]
condor_master(_ZN10DaemonCore22HandleReqSocketHandlerEP6Stream+0x5f)[0x80b47bf]
condor_master(_ZN10DaemonCore24CallSocketHandler_workerEibP6Stream+0x5af)[0x80b4f6f]
condor_master(_ZN10DaemonCore35CallSocketHandler_worker_demarshallEPv+0x2d)[0x80b503d]
condor_master(_ZN13CondorThreads8pool_addEPFvPvES0_PiPKc+0x57)[0x81429f7]
condor_master(_ZN10DaemonCore17CallSocketHandlerERib+0x107)[0x80aae17]
condor_master(_ZN10DaemonCore6DriverEv+0x1f6d)[0x80af99d]
condor_master(main+0x1432)[0x809e4b2]
/lib/libc.so.6(__libc_start_main+0xe6)[0x467ce6]
condor_master[0x8092761]
~~~
Note that this trace is the same as in the 2 other bug reports I linked to above.
When run without ALL_DEBUG=D_FULLDEBUG, it seems that condor crashes because of calling gethostbyname with wrong string - as can be seen in the following excerpt from ltrace log: ~~~ 6699 16:48:01.259591 write(10, "09/25/12 16:48:01 WARNING: forward resolution of localhost4 doesn't match e77aa8c0!\n ", 84) = 84 6699 16:48:01.259728 fflush(0x9e49510) = 0 6699 16:48:01.259827 fclose(0x9e49510) = 0 6699 16:48:01.259953 umask(022) = 022 6699 16:48:01.260061 sigprocmask(2, 0xbfb6d1dc, NULL) = 0 6699 16:48:01.260215 gethostbyname("\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\ 377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377 \377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\37 7\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\3 77\377\377\377\377\377\377\377\377\377\377"... <unfinished ...> 6699 16:48:01.260337 --- SIGSEGV (Segmentation fault) --- ~~~ Log was generated using: ltrace -tt -n 2 -f -s 120 -o condor_ltrace condor_master
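The `\377` sequences in the ltrace excerpt above are octal escapes, and octal 377 is the byte 0xFF. A short sketch to confirm this follows; the helper name `decode_octal_escapes` is hypothetical, and the sample is shortened to four escapes for readability.

```python
import re


def decode_octal_escapes(text: str) -> bytes:
    r"""Turn a run of \NNN octal escapes (as printed by ltrace) into bytes."""
    return bytes(int(digits, 8) for digits in re.findall(r"\\([0-7]{3})", text))


sample = r"\377\377\377\377"
raw = decode_octal_escapes(sample)
print(raw.hex())                    # ffffffff
print(all(b == 0xFF for b in raw))  # True
```

So the argument passed to gethostbyname starts with at least 120 bytes of 0xFF (the `-s 120` limit truncates the display) rather than a hostname. What the pointer actually referred to at that moment cannot be told from the trace alone.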
I was able to reproduce the problem on a fresh virtual machine using the following steps:

Steps to Reproduce:
1) install fresh rhel 6.3
2) change the hostname to 'rhel-6-x86_64.virtualdomain' by editing /etc/sysconfig/network
3) add the following lines to /etc/hosts:
# local pool
192.168.122.7 rhel-5-i386.virtualdomain rhel-5-i386
192.168.122.198 rhel-5-x86_64.virtualdomain rhel-5-x86_64
192.168.122.231 rhel-6-i386.virtualdomain rhel-6-i386
192.168.122.169 rhel-6-x86_64.virtualdomain rhel-6-x86_64
where 192.168.122.169 is the global ipv4 address of the machine
4) reboot the machine (for the hostname to be updated)
5) from mrg 2.1, install these packages:
condor-7.6.5-0.14.el6.x86_64.rpm
condor-classads-7.6.5-0.14.el6.x86_64.rpm
condor-debuginfo-7.6.5-0.14.el6.x86_64.rpm
6) start condor if it's not already running, run condor_status to see that it's working, stop condor
7) upgrade to mrg 2.2 (just do yum upgrade)
8) start condor, try condor_status, see logs
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2013-0564.html