| Summary: | FS authentication broken with btrfs n_link semantics |
|---|---|
| Product: | [Fedora] Fedora |
| Component: | condor |
| Version: | 16 |
| Hardware: | x86_64 |
| OS: | Linux |
| Status: | CLOSED WONTFIX |
| Severity: | high |
| Priority: | unspecified |
| Reporter: | Don Moore <donmoore> |
| Assignee: | Brian Bockelman <bbockelm> |
| QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| CC: | bbockelm, matt, tomspur, tstclair |
| Doc Type: | Bug Fix |
| Last Closed: | 2013-02-14 00:54:28 UTC |

Hi Don,

Judging by your output, I *think* the upstream bug is here: https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2583

btrfs has different semantics for the value of the n_link (st_nlink) field of struct stat. The way btrfs uses this value appears to be valid POSIX, but unfortunately it differs from what Condor depended on. The fix has been committed to the 7.7.5 series, and we are waiting for that release.

The workaround is either to enable an authentication method other than FS security (difficult, even for expert users, because FS is used in many places) or to use a different filesystem for /tmp.

Brian

That is a surprise. Yes, you're right (btrfs). I did a bind mount for /tmp from an ext4 filesystem and verified that I can run a job. Did you notice that the condor build condor-7.7.4-1.rhel6.1.x86_64.rpm also works on f16 (as shown above in expected results)?

Don, I'm not sure I understand your comment about the RHEL6 build. For the dependencies we use, there's not a large difference between RHEL6 and F16.

Brian

Sorry. Substituting condor-7.7.4-1.rhel6.1.x86_64.rpm on F16 works. Maybe the bug crept into the F16 condor packaging, or perhaps the bug is already fixed in 7.7.4.

Ah, gotcha. Looking through git, the patch is applied in 7.7.4, but it is only documented as being fixed in the stable series (7.6.5).

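To check whether a host is exposed to the n_link difference described above, one can look at the filesystem type backing /tmp and at the link count reported for a freshly created directory there. The following is a minimal sketch, assuming GNU coreutils `stat` and `mktemp`; the function names are made up for illustration, and the exact check Condor performs during FS authentication is not shown in this report.

```shell
#!/bin/sh
# Print the filesystem type backing a path (for example "btrfs" or "tmpfs").
# Assumes GNU coreutils stat: -f queries the filesystem, %T is its type name.
fs_type() {
    stat -f -c %T "$1"
}

# Create a scratch directory under the given parent, print the hard-link
# count (st_nlink) that stat reports for it, then remove the directory.
fresh_dir_nlink() {
    d=$(mktemp -d "$1/nlink-check.XXXXXX") || return 1
    stat -c %h "$d"
    rmdir "$d"
}

# Usage:
#   fs_type /tmp
#   fresh_dir_nlink /tmp
```

An empty directory on ext4 typically reports a link count of 2, whereas btrfs reports 1 for directories, which is the kind of difference Brian describes. The workaround Don applied was a bind mount, along the lines of `mount --bind /some/ext4/dir /tmp` run as root (the source path here is hypothetical).
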
Regarding "enable an alternate authentication": I find that using PASSWORD fails:

    : donmoore@gcr1 jbl; condor_submit ./sleepv1.jbl
    Submitting job(s)
    ERROR: Failed to connect to local queue manager
    AUTHENTICATE:1003:Failed to authenticate with any method

    : donmoore@gcr1 jbl; cat /etc/condor/config.d/14sec
    SEC_PASSWORD_FILE = /var/lib/condor/condor_credential
    SEC_DAEMON_AUTHENTICATION = REQUIRED
    SEC_DAEMON_INTEGRITY = REQUIRED
    SEC_DAEMON_AUTHENTICATION_METHODS = PASSWORD
    SEC_NEGOTIATOR_AUTHENTICATION = REQUIRED
    SEC_NEGOTIATOR_INTEGRITY = REQUIRED
    SEC_NEGOTIATOR_AUTHENTICATION_METHODS = PASSWORD
    SEC_CLIENT_AUTHENTICATION_METHODS = PASSWORD

    ls -ld /var/lib/condor/condor_credential
    -rw-r--r-- 1 root root 256 Feb 7 09:19 /var/lib/condor/condor_credential

Testing GSI is more difficult because I don't have certificates. I will have to set up dogtag, unless you have some suggestions on where to get certs.

-/Don

Hi Don,

The Condor auth stuff gets complicated quickly. At the least, you'll need to set:

    SEC_WRITE_AUTHENTICATION_METHODS=PASSWORD

There may be others. The condor logs typically have very good error messages about security failures. You'll likely want to follow this section: http://research.cs.wisc.edu/condor/manual/v7.5/3_6Security.html#SECTION00463400000000000000

Additionally, since condor daemons authenticate with each other as "condor_pool" instead of "condor", you'll need to have:

    QUEUE_SUPER_USERS = condor, condor_pool

I have had my share of frustrations with condor, but my config works as is for PASSWORD on f15, where the CONDOR_HOST is on f16 with / (root) on ext4. Maybe the st_nlink bug also applies when 'SEC_PASSWORD_FILE = /var/lib/condor/condor_credential' is on btrfs. I tried several other tests (FS_REMOTE, FS_REMOTE_DIR, and setting SEC_PASSWORD_FILE to a path on ext4), but reporting this might get confusing.

-/Don

This message is a reminder that Fedora 16 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining and issuing updates for Fedora 16. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '16'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 16's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 16 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to click on "Clone This Bug" and open it against that version of Fedora.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.

The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Fedora 16 changed to end-of-life (EOL) status on 2013-02-12. Fedora 16 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.

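One detail from the PASSWORD attempt earlier in the thread may be worth a second look: the `ls -ld` output shows the pool password file with mode `-rw-r--r--`, so it is readable by every user. It is an assumption here, not something confirmed in this thread, that Condor's password authentication expects that file to be accessible only by its owner. A minimal sketch for checking the mode, assuming GNU coreutils `stat` and a made-up function name:

```shell
#!/bin/sh
# Report whether a file grants any permissions to group or others.
# Prints "ok" when the octal mode ends in 00 (for example 600), otherwise
# prints "too-open" followed by the octal mode.
# Assumes GNU coreutils stat, where -c %a prints the octal permission bits.
check_secret_mode() {
    mode=$(stat -c %a "$1") || return 1
    case "$mode" in
        0|*00) echo "ok" ;;
        *)     echo "too-open $mode" ;;
    esac
}

# Usage:
#   check_secret_mode /var/lib/condor/condor_credential
```

If this reports `too-open 644` for the credential file, tightening it with `chmod 600` as root before retesting would be a reasonable thing to try, although whether that resolves the failure Don saw is not established here.
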
Created attachment 559767 [details]
this log for schedd

Description of problem:
Simple jobs fail with an authentication error on a default install, with no modifications.

Version-Release number of selected component (if applicable):

    rpm -qa|grep condor
    condor-classads-7.7.3-0.2.fc16.x86_64
    condor-procd-7.7.3-0.2.fc16.x86_64
    condor-7.7.3-0.2.fc16.x86_64

    lsb_release -a
    LSB Version: :core-4.0-amd64:core-4.0-noarch
    Distributor ID: Fedora
    Description: Fedora release 16 (Verne)
    Release: 16
    Codename: Verne

    uname -a
    Linux gcr1 3.2.3-2.fc16.x86_64 #1 SMP Fri Feb 3 20:08:08 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

How reproducible:
Create a simple job, install condor, and run condor_submit.

    cat sleepv1.jbl
    cmd = /bin/sleep
    args = 30
    should_transfer_files = if_needed
    when_to_transfer_output = on_exit
    queue 8

Steps to Reproduce:
1. yum install condor condor-procd condor-classads
2. systemctl start condor.service
3. condor_submit $PWD/sleepv1.jbl

Actual results:

    condor_submit ./sleepv1.jbl
    Submitting job(s)
    ERROR: Failed to connect to local queue manager
    AUTHENTICATE:1003:Failed to authenticate with any method
    AUTHENTICATE:1004:Failed to authenticate using GSI
    GSI:5003:Failed to authenticate. Globus is reporting error (851968:32). There is probably a problem with your credentials. (Did you run grid-proxy-init?)
    AUTHENTICATE:1004:Failed to authenticate using KERBEROS
    AUTHENTICATE:1004:Failed to authenticate using FS

Expected results:

    yum erase condor condor-procd condor-classads
    yum install perl-XML-Simple libvirt
    rpm -ivh /dist/condor-7.7.4-1.rhel6.1.x86_64.rpm
    /etc/init.d/condor start

    : donmoore@gcr1 jbl; condor_submit ./sleepv1.jbl
    Submitting job(s)........
    8 job(s) submitted to cluster 1.

    : donmoore@gcr1 jbl; condor_q
    -- Submitter: gcr1.utdallas.edu : <10.200.50.31:48736> : gcr1.utdallas.edu
     ID    OWNER     SUBMITTED   RUN_TIME    ST PRI SIZE CMD
     1.0   donmoore  2/6 15:48   0+00:00:13  R  0   0.0  sleep 30
     1.1   donmoore  2/6 15:48   0+00:00:13  R  0   0.0  sleep 30
     1.2   donmoore  2/6 15:48   0+00:00:13  R  0   0.0  sleep 30
     1.3   donmoore  2/6 15:48   0+00:00:13  R  0   0.0  sleep 30
     1.4   donmoore  2/6 15:48   0+00:00:00  I  0   0.0  sleep 30
     1.5   donmoore  2/6 15:48   0+00:00:00  I  0   0.0  sleep 30
     1.6   donmoore  2/6 15:48   0+00:00:00  I  0   0.0  sleep 30
     1.7   donmoore  2/6 15:48   0+00:00:00  I  0   0.0  sleep 30
    8 jobs; 4 idle, 4 running, 0 held

    : donmoore@gcr1 jbl; condor_status
    Name   OpSys  Arch    State    Activity  LoadAv  Mem   ActvtyTime
    slot1  LINUX  X86_64  Claimed  Busy      0.000   1973  0+00:00:04
    slot2  LINUX  X86_64  Claimed  Busy      0.000   1973  0+00:00:05
    slot3  LINUX  X86_64  Claimed  Busy      0.380   1973  0+00:00:05
    slot4  LINUX  X86_64  Claimed  Busy      0.000   1973  0+00:00:06

                  Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill
    X86_64/LINUX      4      0        4          0        0           0         0
           Total      4      0        4          0        0           0         0

Additional info:
I also tried /rawhide/source/SRPMS/c/condor-7.7.3-0.3.fc17.1.src.rpm with the same result. I don't use any of the other condor packages (condor-cloud, condor-ec2-enhanced*, condor-wallaby*).