condor-7.2.0-0.8, with HAD etc. configured on 2 machines (1 RHEL4 & 1 RHEL5).

The negotiator transfer from one machine to the other happens, but the associated accounting information is not transferred.

You can observe this by running some jobs when machineA is the negotiator, e.g.

# condor_status -negotiator
Name        Machine
machineA    machineA

Perform a condor_userprio to see priorities, e.g.

# condor_userprio
Last Priority Update: 12/6 12:45
                               Effective
User Name                      Priority
------------------------------ ---------
testmonkey@blah                1268.34
------------------------------ ---------
Number of users shown: 1

Then get the negotiator transferred to machineB, e.g.

condor_off -negotiator machineA

(wait a bit, a few min at most, see HAD_CONNECTION_TIMEOUT)

# condor_status -negotiator
Name        Machine
machineB    machineB

So far this demonstrates the HAD daemon is working. Now, to make sure the transfer was complete, check the user priorities, e.g.

# condor_userprio
Last Priority Update: 12/6 12:47
                               Effective
User Name                      Priority
------------------------------ ---------
------------------------------ ---------
Number of users shown: 0

That's a failure; the output should be similar to when the negotiator was running on machineA.

* * *

The condor_transferer is crashing on RHEL5 (from LOG/TransfererLog):

12/6 12:33:24 utilSafeGetFile .../Version.24007.down started
12/6 12:33:24 ERROR "Assertion ERROR on (s == __null)" at line 1883 in file stream.cpp
Stack dump for process 24007 at timestamp 1228588404 (16 frames)
/usr/bin/condor_transferer(dprintf_dump_stack+0xc0)[0x499b8f]
/usr/bin/condor_transferer[0x499e62]
/lib64/libc.so.6[0x33cf8301b0]
/lib64/libc.so.6(gsignal+0x35)[0x33cf830155]
/lib64/libc.so.6(abort+0x110)[0x33cf831bf0]
/usr/bin/condor_transferer(_EXCEPT_+0x1a5)[0x49837b]
/usr/bin/condor_transferer(Stream::get(char*&)+0x5a)[0x507dd6]
/usr/bin/condor_transferer(Stream::code(char*&)+0x50)[0x508c28]
/usr/bin/condor_transferer(utilSafeGetFile(ReliSock&, MyString const&)+0xab)[0x46745b]
/usr/bin/condor_transferer(DownloadReplicaTransferer::downloadFile(MyString&, MyString&)+0xa9)[0x46673d]
/usr/bin/condor_transferer(DownloadReplicaTransferer::download()+0x54)[0x4668ba]
/usr/bin/condor_transferer(DownloadReplicaTransferer::initialize()+0x3a)[0x466f1e]
/usr/bin/condor_transferer(main_init(int, char**)+0x3f3)[0x468137]
/usr/bin/condor_transferer(main+0x188c)[0x49376c]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x33cf81d8b4]
/usr/bin/condor_transferer[0x465589]

The condor_transferer is logging garbage on the machine with the negotiator (this example is RHEL4) (from LOG/TransfererLog):

12/6 13:32:30 utilSafePutFile .../Version.12311.up started
12/6 13:32:30 utilSafePutFile MAC created ^Z��[��AimqESCT&0 with actual length 16, total bytes read 6
12/6 13:32:30 put_file: going to send from filename /var/lib/condor/spool/Version.12311.up
12/6 13:32:30 put_file: Found file size 6
12/6 13:32:30 put_file: sending 6 bytes
12/6 13:32:30 ReliSock: put_file: sent 6 bytes
12/6 13:32:30 utilSafePutFile finished successfully
12/6 13:32:30 UploadReplicaTransferer::uploadFile /var/lib/condor/spool/Accountantnew.log.12311.up started
12/6 13:32:30 utilSafePutFile /var/lib/condor/spool/Accountantnew.log.12311.up started
12/6 13:32:30 utilSafePutFile MAC created 6^A�9�2y^]_�zKESC�� with actual length 16, total bytes read 2
12/6 13:32:30 put_file: going to send from filename /var/lib/condor/spool/Accountantnew.log.12311.up
12/6 13:32:30 put_file: Found file size 171302
12/6 13:32:30 condor_write(): Socket closed when trying to write 65536 bytes to unknown source, fd is 7, errno=104
12/6 13:32:30 ReliSock::put_bytes_nobuffer: Send failed.
12/6 13:32:30 ReliSock::put_file: failed to put 65536 bytes (put_bytes_nobuffer() returned -1)
12/6 13:32:30 utilSafePutFile unable to send file /var/lib/condor/spool/Accountantnew.log.12311.up, MAC or to code the end of the message
12/6 13:32:30 UploadReplicaTransferer::uploadFile failed, unlinking /var/lib/condor/spool/Accountantnew.log.12311.up
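For anyone re-verifying this, the before/after priority check from the reproduction steps could be scripted roughly as below. This is only a sketch: the helper names users_shown and accounting_lost are made up for illustration, and the sample text is the condor_userprio output quoted in this report (whitespace collapsed), not freshly captured output.

```python
import re

# condor_userprio output as quoted in this report, before and after failover.
BEFORE = """Last Priority Update: 12/6 12:45
Effective
User Name Priority
------------------------------ ---------
testmonkey@blah 1268.34
------------------------------ ---------
Number of users shown: 1
"""

AFTER = """Last Priority Update: 12/6 12:47
Effective
User Name Priority
------------------------------ ---------
------------------------------ ---------
Number of users shown: 0
"""


def users_shown(userprio_output: str) -> int:
    """Return N from the 'Number of users shown: N' footer line."""
    match = re.search(r"Number of users shown:\s*(\d+)", userprio_output)
    if match is None:
        raise ValueError("no 'Number of users shown' line found")
    return int(match.group(1))


def accounting_lost(before: str, after: str) -> bool:
    """True if users were listed before the failover but none afterwards."""
    return users_shown(before) > 0 and users_shown(after) == 0


if __name__ == "__main__":
    print("accounting lost:", accounting_lost(BEFORE, AFTER))
```

On a live pool the two strings would instead be captured by running condor_userprio on each side of the failover, e.g. with subprocess.run(["condor_userprio"], capture_output=True, text=True).stdout.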
The fix for this will be in condor-7.2.0-0.9.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-0036.html