Bug 787846 - FS authentication broken with btrfs n_link semantics
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Fedora
Classification: Fedora
Component: condor
Version: 16
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Brian Bockelman
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2012-02-06 22:08 UTC by Don Moore
Modified: 2013-02-14 00:54 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-02-14 00:54:28 UTC
Type: ---


Attachments
this log for schedd (3.96 KB, application/octet-stream)
2012-02-06 22:08 UTC, Don Moore


Links
System ID Private Priority Status Summary Last Updated
Condor 2583 0 None None None Never

Description Don Moore 2012-02-06 22:08:20 UTC
Created attachment 559767
this log for schedd

Description of problem:
Simple jobs fail with an authentication error on a default install with no
configuration changes.


Version-Release number of selected component (if applicable):
rpm -qa|grep condor
condor-classads-7.7.3-0.2.fc16.x86_64
condor-procd-7.7.3-0.2.fc16.x86_64
condor-7.7.3-0.2.fc16.x86_64
lsb_release -a
LSB Version:	:core-4.0-amd64:core-4.0-noarch
Distributor ID:	Fedora
Description:	Fedora release 16 (Verne)
Release:	16
Codename:	Verne
uname -a
Linux gcr1 3.2.3-2.fc16.x86_64 #1 SMP Fri Feb 3 20:08:08 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux



How reproducible:

Create a simple job description file:
cat sleepv1.jbl 
cmd = /bin/sleep
args = 30
should_transfer_files = if_needed
when_to_transfer_output = on_exit
queue 8

Install condor, then run condor_submit.
Steps to Reproduce:
1. yum install condor condor-procd condor-classads
2. systemctl start condor.service
3. condor_submit $PWD/sleepv1.jbl 

---


Actual results:
 condor_submit ./sleepv1.jbl
Submitting job(s)
ERROR: Failed to connect to local queue manager
AUTHENTICATE:1003:Failed to authenticate with any method
AUTHENTICATE:1004:Failed to authenticate using GSI
GSI:5003:Failed to authenticate.  Globus is reporting error (851968:32).  There is probably a problem with your credentials.  (Did you run grid-proxy-init?)
AUTHENTICATE:1004:Failed to authenticate using KERBEROS
AUTHENTICATE:1004:Failed to authenticate using FS


Expected results:

yum erase  condor condor-procd condor-classads
yum install perl-XML-Simple libvirt
rpm -ivh /dist/condor-7.7.4-1.rhel6.1.x86_64.rpm
/etc/init.d/condor start

---
: donmoore@gcr1 jbl; condor_submit ./sleepv1.jbl
Submitting job(s)........
8 job(s) submitted to cluster 1.
: donmoore@gcr1 jbl; condor_q


-- Submitter: gcr1.utdallas.edu : <10.200.50.31:48736> : gcr1.utdallas.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   1.0   donmoore        2/6  15:48   0+00:00:13 R  0   0.0  sleep 30          
   1.1   donmoore        2/6  15:48   0+00:00:13 R  0   0.0  sleep 30          
   1.2   donmoore        2/6  15:48   0+00:00:13 R  0   0.0  sleep 30          
   1.3   donmoore        2/6  15:48   0+00:00:13 R  0   0.0  sleep 30          
   1.4   donmoore        2/6  15:48   0+00:00:00 I  0   0.0  sleep 30          
   1.5   donmoore        2/6  15:48   0+00:00:00 I  0   0.0  sleep 30          
   1.6   donmoore        2/6  15:48   0+00:00:00 I  0   0.0  sleep 30          
   1.7   donmoore        2/6  15:48   0+00:00:00 I  0   0.0  sleep 30          

8 jobs; 4 idle, 4 running, 0 held
: donmoore@gcr1 jbl; condor_status

Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

slot1 LINUX      X86_64 Claimed   Busy     0.000  1973  0+00:00:04
slot2 LINUX      X86_64 Claimed   Busy     0.000  1973  0+00:00:05
slot3 LINUX      X86_64 Claimed   Busy     0.380  1973  0+00:00:05
slot4 LINUX      X86_64 Claimed   Busy     0.000  1973  0+00:00:06
                     Total Owner Claimed Unclaimed Matched Preempting Backfill

        X86_64/LINUX     4     0       4         0       0          0        0

               Total     4     0       4         0       0          0        0


Additional info:

I also tried /rawhide/source/SRPMS/c/condor-7.7.3-0.3.fc17.1.src.rpm
with the same result.

I don't use any of the other condor packages (condor-cloud, condor-ec2-enhanced*, condor-wallaby*).

Comment 1 Brian Bockelman 2012-02-06 22:17:39 UTC
Hi Don,

Judging by your output, I *think* the upstream bug is here:
https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2583

btrfs has different semantics for the value of the st_nlink field of struct stat.  btrfs's use of this value appears to be valid POSIX, but unfortunately it differs from what Condor depended on.

The fix has been committed to the 7.7.5 series, and we are waiting for that release.

The workaround is to enable an alternate authentication method besides FS security (difficult, even for expert users - FS is used in many places) or to use a different filesystem for /tmp.

Brian
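
For illustration, the st_nlink difference described above can be observed directly. This is a minimal Python sketch, not Condor's actual check (which is not shown in this report): the helper name empty_dir_nlink is hypothetical, and it only reports the st_nlink value that stat returns for a freshly created empty directory. To my knowledge, ext4 reports 2 for such a directory while btrfs reports 1.

```python
import os
import tempfile


def empty_dir_nlink(parent):
    """Create a fresh empty directory under `parent` and return its st_nlink.

    Hypothetical helper for illustration only; it does not reproduce the
    check that Condor's FS authentication performs.
    """
    path = tempfile.mkdtemp(dir=parent)
    try:
        # st_nlink is the link-count field of struct stat.
        return os.stat(path).st_nlink
    finally:
        # Clean up the probe directory regardless of the outcome.
        os.rmdir(path)
```

Calling empty_dir_nlink("/tmp") and comparing the result with the value for a directory on another filesystem shows whether the two filesystems disagree on the link count of an empty directory.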

Comment 2 Don Moore 2012-02-06 22:43:38 UTC
That is a surprise. Yes, you're right: /tmp was on btrfs. I bind-mounted /tmp
from an ext4 filesystem and verified that I can now run a job.

Did you notice that the condor build
condor-7.7.4-1.rhel6.1.x86_64.rpm also works on F16 (as shown above in expected results)?
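
To confirm which filesystem backs /tmp before and after a bind mount like the one mentioned above, something along these lines may help. It is a sketch under stated assumptions: the function name fs_type_for is hypothetical, the input is text in the /proc/mounts layout (device, mount point, filesystem type, ...), and mount points containing escaped characters such as \040 are not handled.

```python
import os


def fs_type_for(path, mounts_text):
    """Return the filesystem type of the mount that contains `path`.

    `mounts_text` is text in /proc/mounts format. Returns None if no
    entry matches. Hypothetical helper for illustration only.
    """
    path = os.path.abspath(path)
    best_point, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        # Compare on whole path components so /tmp does not match /tmpfoo.
        prefix = mount_point.rstrip("/") + "/"
        if (path + "/").startswith(prefix):
            # Longest mount point wins; on a tie the later entry
            # shadows the earlier one, as a bind mount over /tmp would.
            if len(mount_point) >= len(best_point):
                best_point, best_type = mount_point, fs_type
    return best_type
```

On a live system the table can be read with open("/proc/mounts").read() and passed as mounts_text, for example fs_type_for("/tmp", open("/proc/mounts").read()).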

Comment 3 Brian Bockelman 2012-02-06 23:03:44 UTC
Don,

I'm not sure I understand your comment about the RHEL6 build.  For the dependencies we use, there's not a large difference between RHEL6 and F16.

Brian

Comment 4 Don Moore 2012-02-07 14:40:38 UTC
Sorry, to clarify:
substituting condor-7.7.4-1.rhel6.1.x86_64.rpm on F16 works.
Maybe the bug crept into the F16 condor packaging, or perhaps
it is already fixed in 7.7.4.

Comment 5 Brian Bockelman 2012-02-07 15:46:40 UTC
Ah, gotcha.  Looking through git, the patch is applied in 7.7.4, but is only documented as being fixed in the stable series (7.6.5).

Comment 6 Don Moore 2012-02-07 16:04:23 UTC
I tried enabling an alternate authentication method, but PASSWORD also fails:
: donmoore@gcr1 jbl; condor_submit ./sleepv1.jbl 
Submitting job(s)
ERROR: Failed to connect to local queue manager
AUTHENTICATE:1003:Failed to authenticate with any method
: donmoore@gcr1 jbl; cat /etc/condor/config.d/14sec 
SEC_PASSWORD_FILE = /var/lib/condor/condor_credential
SEC_DAEMON_AUTHENTICATION = REQUIRED
SEC_DAEMON_INTEGRITY = REQUIRED
SEC_DAEMON_AUTHENTICATION_METHODS = PASSWORD
SEC_NEGOTIATOR_AUTHENTICATION = REQUIRED
SEC_NEGOTIATOR_INTEGRITY = REQUIRED
SEC_NEGOTIATOR_AUTHENTICATION_METHODS = PASSWORD
SEC_CLIENT_AUTHENTICATION_METHODS = PASSWORD

ls -ld /var/lib/condor/condor_credential
-rw-r--r-- 1 root root 256 Feb  7 09:19 /var/lib/condor/condor_credential



Testing GSI is more difficult because I don't have certificates.
I will have to set up Dogtag, unless you have some suggestions
on where to get certs.
-/Don


--

Comment 7 Brian Bockelman 2012-02-07 16:19:18 UTC
Hi Don,

The Condor auth stuff gets complicated quickly.  At a minimum, you'll need to set:

SEC_WRITE_AUTHENTICATION_METHODS=PASSWORD

There may be others.  The condor logs typically have very good error messages about security failures.

You'll likely want to follow this section:

http://research.cs.wisc.edu/condor/manual/v7.5/3_6Security.html#SECTION00463400000000000000

Additionally, since condor daemons authenticate with each other as "condor_pool" instead of "condor", you'll need to have:

QUEUE_SUPER_USERS = condor, condor_pool
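
As a rough way to compare the fragment posted in comment 6 against the two settings named in this comment, a small Python sketch follows. It is not Condor's configuration parser: parse_settings and missing_settings are hypothetical helpers that handle only plain NAME = value lines, ignore macros and includes, and upper-case names purely for comparison.

```python
def parse_settings(text):
    """Parse simple `NAME = value` lines into a dict.

    Blank lines, comment lines, and lines without '=' are skipped.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        settings[name.strip().upper()] = value.strip()
    return settings


def missing_settings(settings, required):
    """Return the names in `required` that are absent from `settings`."""
    return [name for name in required if name not in settings]


# The fragment posted in comment 6, copied verbatim.
COMMENT_6_CONFIG = """\
SEC_PASSWORD_FILE = /var/lib/condor/condor_credential
SEC_DAEMON_AUTHENTICATION = REQUIRED
SEC_DAEMON_INTEGRITY = REQUIRED
SEC_DAEMON_AUTHENTICATION_METHODS = PASSWORD
SEC_NEGOTIATOR_AUTHENTICATION = REQUIRED
SEC_NEGOTIATOR_INTEGRITY = REQUIRED
SEC_NEGOTIATOR_AUTHENTICATION_METHODS = PASSWORD
SEC_CLIENT_AUTHENTICATION_METHODS = PASSWORD
"""

# The two names suggested in this comment.
SUGGESTED = ["SEC_WRITE_AUTHENTICATION_METHODS", "QUEUE_SUPER_USERS"]

print(missing_settings(parse_settings(COMMENT_6_CONFIG), SUGGESTED))
```

Run against the comment 6 fragment, the script lists both suggested names as absent. Whether adding them is sufficient is not established here; as noted above, there may be others.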

Comment 8 Don Moore 2012-02-07 22:54:36 UTC
I have had my share of frustrations with Condor, but my config works
as is for PASSWORD on F15, where the CONDOR_HOST is on F16 with / (root) on ext4.
Maybe the st_nlink bug also applies when
'SEC_PASSWORD_FILE = /var/lib/condor/condor_credential' is on btrfs.

I tried several other tests (FS_REMOTE, FS_REMOTE_DIR, and setting SEC_PASSWORD_FILE to a path on ext4), but reporting them all here might get confusing.

-/Don

Comment 9 Fedora End Of Life 2013-01-16 22:22:22 UTC
This message is a reminder that Fedora 16 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 16. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '16'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 16's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 16 is end of life. If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora, you are encouraged to click on 
"Clone This Bug" and open it against that version of Fedora.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 10 Fedora End Of Life 2013-02-14 00:54:32 UTC
Fedora 16 changed to end-of-life (EOL) status on 2013-02-12. Fedora 16 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.

