Bug 475020 - HA Negotiator not propagating accounting information
Status: CLOSED ERRATA
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: grid
Version: 1.0
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: 1.1
Target Release: ---
Assigned To: Matthew Farrellee
QA Contact: Jeff Needle
Depends On:
Blocks:
Reported: 2008-12-06 14:21 EST by Matthew Farrellee
Modified: 2009-02-04 11:04 EST

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2009-02-04 11:04:40 EST
Attachments: None
Description Matthew Farrellee 2008-12-06 14:21:35 EST
condor-7.2.0-0.8, with HAD etc. configured on 2 machines (1 RHEL4 & 1 RHEL5)

The negotiator transfers from one machine to the other, but the associated accounting information is not transferred with it.

You can observe this by running some jobs when machineA is the negotiator, e.g.

# condor_status -negotiator
Name                 Machine
machineA             machineA 

Perform a condor_userprio to see priorities, e.g.

# condor_userprio
Last Priority Update: 12/6  12:45
                                    Effective
User Name                           Priority 
------------------------------      ---------
testmonkey@blah                     1268.34
------------------------------      ---------
Number of users shown: 1

Then get the negotiator transferred to machineB, e.g. condor_off -negotiator machineA

(wait a bit, a few min at most, see HAD_CONNECTION_TIMEOUT)

# condor_status -negotiator
Name                 Machine             
machineB             machineB

So far this demonstrates the HAD daemon is working.

Now to make sure the transfer was complete check the user priorities, e.g.

# condor_userprio
Last Priority Update: 12/6  12:47
                                    Effective
User Name                           Priority 
------------------------------      ---------
------------------------------      ---------
Number of users shown: 0

That's a failure: the output should be similar to what it was when the negotiator was running on machineA.
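The manual check above could be scripted roughly as follows. This is only a sketch, not part of the original report: the helper names users_shown and check_failover are made up for illustration, and comparing the "Number of users shown" footer is a crude stand-in for "the output should be similar" (it does not compare the priorities themselves).

```shell
#!/bin/sh
# Hypothetical helper: read condor_userprio output on stdin and print N
# from its "Number of users shown: N" footer.
users_shown() {
    sed -n 's/^Number of users shown: *\([0-9][0-9]*\).*$/\1/p'
}

# Hypothetical helper: compare the user counts captured before and after
# the negotiator moves. Prints PASS when the count is unchanged,
# otherwise prints FAIL and returns non-zero.
check_failover() {
    before=$1
    after=$2
    if [ -n "$before" ] && [ "$before" = "$after" ]; then
        echo "PASS: $after user(s) shown after failover"
    else
        echo "FAIL: ${before:-0} user(s) before failover, ${after:-0} after"
        return 1
    fi
}

# Intended use on a pool with HAD configured (commands as given in this
# report; not executed by this sketch):
#   before=$(condor_userprio | users_shown)
#   condor_off -negotiator machineA
#   ...wait until condor_status -negotiator reports machineB...
#   after=$(condor_userprio | users_shown)
#   check_failover "$before" "$after"
```

With the outputs shown in this report, the counts would be 1 before and 0 after, so check_failover would print a FAIL line.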


* * *


The condor_transferer is crashing on RHEL5

(from LOG/TransfererLog)
12/6 12:33:24 utilSafeGetFile .../Version.24007.down started
12/6 12:33:24 ERROR "Assertion ERROR on (s == __null)" at line 1883 in file stream.cpp
Stack dump for process 24007 at timestamp 1228588404 (16 frames)
/usr/bin/condor_transferer(dprintf_dump_stack+0xc0)[0x499b8f]
/usr/bin/condor_transferer[0x499e62]
/lib64/libc.so.6[0x33cf8301b0]
/lib64/libc.so.6(gsignal+0x35)[0x33cf830155]
/lib64/libc.so.6(abort+0x110)[0x33cf831bf0]
/usr/bin/condor_transferer(_EXCEPT_+0x1a5)[0x49837b]
/usr/bin/condor_transferer(Stream::get(char*&)+0x5a)[0x507dd6]
/usr/bin/condor_transferer(Stream::code(char*&)+0x50)[0x508c28]
/usr/bin/condor_transferer(utilSafeGetFile(ReliSock&, MyString const&)+0xab)[0x46745b]
/usr/bin/condor_transferer(DownloadReplicaTransferer::downloadFile(MyString&, MyString&)+0xa9)[0x46673d]
/usr/bin/condor_transferer(DownloadReplicaTransferer::download()+0x54)[0x4668ba]
/usr/bin/condor_transferer(DownloadReplicaTransferer::initialize()+0x3a)[0x466f1e]
/usr/bin/condor_transferer(main_init(int, char**)+0x3f3)[0x468137]
/usr/bin/condor_transferer(main+0x188c)[0x49376c]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x33cf81d8b4]
/usr/bin/condor_transferer[0x465589]


The condor_transferer is logging garbage on the machine with the negotiator (this example is RHEL4)

(from LOG/TransfererLog)
12/6 13:32:30 utilSafePutFile .../Version.12311.up started
12/6 13:32:30 utilSafePutFile MAC created ^Z��[��AimqESCT&0 with actual length 16, total bytes read 6
12/6 13:32:30 put_file: going to send from filename /var/lib/condor/spool/Version.12311.up
12/6 13:32:30 put_file: Found file size 6
12/6 13:32:30 put_file: sending 6 bytes
12/6 13:32:30 ReliSock: put_file: sent 6 bytes
12/6 13:32:30 utilSafePutFile finished successfully
12/6 13:32:30 UploadReplicaTransferer::uploadFile /var/lib/condor/spool/Accountantnew.log.12311.up started
12/6 13:32:30 utilSafePutFile /var/lib/condor/spool/Accountantnew.log.12311.up started
12/6 13:32:30 utilSafePutFile MAC created 6^A�9�2y^]_�zKESC�� with actual length16, total bytes read 2
12/6 13:32:30 put_file: going to send from filename /var/lib/condor/spool/Accountantnew.log.12311.up
12/6 13:32:30 put_file: Found file size 171302
12/6 13:32:30 condor_write(): Socket closed when trying to write 65536 bytes to unknown source, fd is 7, errno=104
12/6 13:32:30 ReliSock::put_bytes_nobuffer: Send failed.
12/6 13:32:30 ReliSock::put_file: failed to put 65536 bytes (put_bytes_nobuffer() returned -1)
12/6 13:32:30 utilSafePutFile unable to send file /var/lib/condor/spool/Accountantnew.log.12311.up, MAC or to code the end of the message
12/6 13:32:30 UploadReplicaTransferer::uploadFile failed, unlinking /var/lib/condor/spool/Accountantnew.log.12311.up
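To spot the same two failures on other machines, the TransfererLog can be filtered for the messages quoted above. A rough sketch, not part of the original report: the function name transferer_errors is made up, and the patterns are simply the three error strings seen in these excerpts, so other failure modes would not be caught.

```shell
#!/bin/sh
# Hypothetical helper: read a TransfererLog on stdin and print only the
# lines matching the failure messages quoted in this report (the download
# side assertion and the upload side send failure).
transferer_errors() {
    grep -E 'ERROR "Assertion ERROR|ReliSock::put_bytes_nobuffer: Send failed|UploadReplicaTransferer::uploadFile failed'
}

# Intended use (the log path depends on the local LOG setting):
#   transferer_errors < LOG/TransfererLog
```

grep exits non-zero when nothing matches, so the function can also be used as a quick yes/no check in a script.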
Comment 1 Matthew Farrellee 2008-12-06 20:58:17 EST
Fix for this will be in 7.2.0-0.9
Comment 4 errata-xmlrpc 2009-02-04 11:04:40 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-0036.html
