Bug 475020

Summary: HA Negotiator not propagating accounting information
Product: Red Hat Enterprise MRG
Component: grid
Version: 1.0
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Matthew Farrellee <matt>
Assignee: Matthew Farrellee <matt>
QA Contact: Jeff Needle <jneedle>
CC: pmackinn
Target Milestone: 1.1
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2009-02-04 16:04:40 UTC

Description Matthew Farrellee 2008-12-06 19:21:35 UTC
condor-7.2.0-0.8, with NAD etc configured on 2 machines (1 RHEL4 & 1 RHEL5)

The negotiator is transferred from one machine to the other, but the associated accounting information is not.

You can observe this by running some jobs when machineA is the negotiator, e.g.

# condor_status -negotiator
Name                 Machine
machineA             machineA 

Perform a condor_userprio to see priorities, e.g.

# condor_userprio
Last Priority Update: 12/6  12:45
                                    Effective
User Name                           Priority 
------------------------------      ---------
testmonkey@blah                     1268.34
------------------------------      ---------
Number of users shown: 1

Then get the negotiator transferred to machineB, e.g. condor_off -negotiator machineA

(wait a bit, a few min at most, see HAD_CONNECTION_TIMEOUT)

# condor_status -negotiator
Name                 Machine             
machineB             machineB

So far this demonstrates the HAD daemon is working.

Now, to make sure the transfer was complete, check the user priorities, e.g.

# condor_userprio
Last Priority Update: 12/6  12:47
                                    Effective
User Name                           Priority 
------------------------------      ---------
------------------------------      ---------
Number of users shown: 0

That is a failure: the output should be similar to what it was when the negotiator was running on machineA.
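
The before/after comparison can be scripted roughly as follows. This is only a sketch: the helper names users_shown and accounting_survived are made up for illustration, and treating "the user count did not drop" as success is an assumption, a rough stand-in for "the output should be similar".

```shell
#!/bin/sh
# Hypothetical helpers, not part of Condor.

# users_shown: read condor_userprio output on stdin and print the number
# from the "Number of users shown:" trailer line (prints nothing if absent).
users_shown() {
    awk '/^Number of users shown:/ { print $NF; exit }'
}

# accounting_survived BEFORE AFTER: succeed only if both counts are present
# and the count after failover is not lower than the count before it.
accounting_survived() {
    before=$1
    after=$2
    [ -n "$before" ] && [ -n "$after" ] && [ "$after" -ge "$before" ]
}

# Intended use (commented out, needs a live pool):
#   before=$(condor_userprio | users_shown)    # machineA is the negotiator
#   condor_off -negotiator machineA            # then wait for the failover
#   after=$(condor_userprio | users_shown)     # machineB is the negotiator
#   accounting_survived "$before" "$after" || echo "accounting data lost"
```

With the outputs shown above, the counts are 1 before and 0 after, so the check fails.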


* * *


The condor_transferer is crashing on RHEL5

(from LOG/TransfererLog)
12/6 12:33:24 utilSafeGetFile .../Version.24007.down started
12/6 12:33:24 ERROR "Assertion ERROR on (s == __null)" at line 1883 in file stream.cpp
Stack dump for process 24007 at timestamp 1228588404 (16 frames)
/usr/bin/condor_transferer(dprintf_dump_stack+0xc0)[0x499b8f]
/usr/bin/condor_transferer[0x499e62]
/lib64/libc.so.6[0x33cf8301b0]
/lib64/libc.so.6(gsignal+0x35)[0x33cf830155]
/lib64/libc.so.6(abort+0x110)[0x33cf831bf0]
/usr/bin/condor_transferer(_EXCEPT_+0x1a5)[0x49837b]
/usr/bin/condor_transferer(Stream::get(char*&)+0x5a)[0x507dd6]
/usr/bin/condor_transferer(Stream::code(char*&)+0x50)[0x508c28]
/usr/bin/condor_transferer(utilSafeGetFile(ReliSock&, MyString const&)+0xab)[0x46745b]
/usr/bin/condor_transferer(DownloadReplicaTransferer::downloadFile(MyString&, MyString&)+0xa9)[0x46673d]
/usr/bin/condor_transferer(DownloadReplicaTransferer::download()+0x54)[0x4668ba]
/usr/bin/condor_transferer(DownloadReplicaTransferer::initialize()+0x3a)[0x466f1e]
/usr/bin/condor_transferer(main_init(int, char**)+0x3f3)[0x468137]
/usr/bin/condor_transferer(main+0x188c)[0x49376c]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x33cf81d8b4]
/usr/bin/condor_transferer[0x465589]


The condor_transferer is logging garbage on the machine with the negotiator (this example is RHEL4)

(from LOG/TransfererLog)
12/6 13:32:30 utilSafePutFile .../Version.12311.up started
12/6 13:32:30 utilSafePutFile MAC created ^Z��[��AimqESCT&0 with actual length 16, total bytes read 6
12/6 13:32:30 put_file: going to send from filename /var/lib/condor/spool/Version.12311.up
12/6 13:32:30 put_file: Found file size 6
12/6 13:32:30 put_file: sending 6 bytes
12/6 13:32:30 ReliSock: put_file: sent 6 bytes
12/6 13:32:30 utilSafePutFile finished successfully
12/6 13:32:30 UploadReplicaTransferer::uploadFile /var/lib/condor/spool/Accountantnew.log.12311.up started
12/6 13:32:30 utilSafePutFile /var/lib/condor/spool/Accountantnew.log.12311.up started
12/6 13:32:30 utilSafePutFile MAC created 6^A�9�2y^]_�zKESC�� with actual length16, total bytes read 2
12/6 13:32:30 put_file: going to send from filename /var/lib/condor/spool/Accountantnew.log.12311.up
12/6 13:32:30 put_file: Found file size 171302
12/6 13:32:30 condor_write(): Socket closed when trying to write 65536 bytes to unknown source, fd is 7, errno=104
12/6 13:32:30 ReliSock::put_bytes_nobuffer: Send failed.
12/6 13:32:30 ReliSock::put_file: failed to put 65536 bytes (put_bytes_nobuffer() returned -1)
12/6 13:32:30 utilSafePutFile unable to send file /var/lib/condor/spool/Accountantnew.log.12311.up, MAC or to code the end of the message
12/6 13:32:30 UploadReplicaTransferer::uploadFile failed, unlinking /var/lib/condor/spool/Accountantnew.log.12311.up
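
To spot these failures quickly on either machine, the TransfererLog can be filtered for the messages quoted above. A minimal sketch, assuming the log is fed on stdin; the function name transferer_failures is made up, and the pattern list only covers the messages seen in this report, not every way the transferer can fail.

```shell
#!/bin/sh
# Hypothetical helper, not part of Condor.

# transferer_failures: read a TransfererLog on stdin and print only the
# lines matching failure messages observed in this bug report.
transferer_failures() {
    grep -E 'ERROR "Assertion|Socket closed when trying to write|Send failed|unable to send file|uploadFile failed'
}

# Intended use (commented out; LOG is the pool's log directory):
#   transferer_failures < LOG/TransfererLog
```

grep exits non-zero when nothing matches, so the function can also be used directly as a condition in a script.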

Comment 1 Matthew Farrellee 2008-12-07 01:58:17 UTC
Fix for this will be in 7.2.0-0.9

Comment 4 errata-xmlrpc 2009-02-04 16:04:40 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-0036.html