Hide Forgot
Created attachment 574607 [details] katello-debug-20120402193318.tar.gz Description of problem: Getting a pulp request timeout when attempting to promote a changeset that includes RHEL 5Server, 6Server and CloudForms. See attached katello-debug log file for details. It appears that a createrepo process is still running, even after the pulp timeout was received. Version-Release number of selected component (if applicable): * candlepin-0.5.26-1.el6.src.rpm * katello-0.1.307-1.el6.src.rpm * katello-candlepin-cert-key-pair-1.0-1.src.rpm * katello-certs-tools-1.0.4-1.el6.src.rpm * katello-cli-0.1.107-1.el6.src.rpm * katello-configure-0.1.107-1.el6.src.rpm * katello-qpid-broker-key-pair-1.0-1.src.rpm * katello-qpid-client-key-pair-1.0-1.src.rpm * katello-selinux-0.1.10-1.el6.src.rpm * pulp-1.0.2-1.el6.src.rpm How reproducible: * I did *not* see this when running the *same* script on a system with 5G memory * I did see the problem when running the *same* script on a system with 8G memory * I did *not* see this when running the *same* script on a system with 24G memory Steps to Reproduce: 1. katello -u admin -p admin provider synchronize --name "Red Hat" Provider [ Red Hat ] synchronized 2. katello -u admin -p admin changeset create --name "02 - rht repos" Successfully created changeset [ 02 - rht repos ] for environment [ Dev ] 3. katello -u admin -p admin changeset update --name "02 - rht repos" --add_product "Red Hat Enterprise Linux Server" Successfully updated changeset [ 02 - rht repos ] 4. katello -u admin -p admin changeset update --name "02 - rht repos" --add_product "Red Hat CloudForms" Successfully updated changeset [ 02 - rht repos ] 5. katello -u admin -p admin changeset promote --name "02 - rht repos" Actual results: Changeset [ 02 - rht repos ] promotion failed: Pulp::Repository: Request Timeout (POST /pulp/api/repositories/redhat-Red_Hat_Enterprise_Linux_Server-Red_Hat_Enterprise_Linux_5_Server_RPMs_x86_64_5Server/clone/) Error (65): command failed. Expected results: Changeset [ 02 - rht repos ] promotion successfully Additional info: katello> changeset info --name "02 - rht repos" > --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- > Changeset Info > --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- > > Id: 1 > Name: 02 - rht repos > Description: > None > Updated At: 2012/04/02 19:27:46 > State: failed > Environment Id: 3 > Environment Name: Dev > Errata: > > Products: > Red Hat Enterprise Linux Server > Red Hat CloudForms > Packages: > > Repositories: > > System Templates: > > Distributions: katello> repo list --org redhat --environment Library > --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- > Repo List For Org redhat Environment Library > > Id Name Package Count > --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- > 656 Red Hat Enterprise Linux 5 Server RPMs x86_64 5Server 13437 > 677 Red Hat CloudForms Tools for RHEL 6 Beta RPMs x86_64 6Server 53 > 760 Red Hat Enterprise Linux 6 Server RPMs x86_64 6Server 7061 > 978 Red Hat CloudForms Tools for RHEL 6 RPMs x86_64 6Server 0 > 1050 Red Hat CloudForms Tools for RHEL 5 RPMs x86_64 5Server 0 > 1087 Red Hat CloudForms Tools for RHEL 5 Beta RPMs x86_64 5Server 54 > 654 Red Hat Enterprise Linux 5 Server RPMs i386 5Server 10298 > 1048 Red Hat CloudForms Tools for RHEL 5 RPMs i386 5Server 0 > 1085 Red Hat CloudForms Tools for RHEL 5 Beta RPMs i386 5Server 54 > 762 Red Hat Enterprise Linux 6 Server RPMs i386 6Server 5434 > 980 Red Hat CloudForms Tools for RHEL 6 RPMs i386 6Server 0 > 680 Red Hat CloudForms Tools for RHEL 6 Beta RPMs i386 6Server 50 > 505 Red Hat CloudForms System Engine Beta RPMs x86_64 6Server 432 > 601 Red Hat CloudForms Cloud Engine Beta RPMs x86_64 6Server 474 > 592 Red Hat CloudForms Cloud Engine RPMs x86_64 6Server 0 > 550 Red Hat CloudForms System Engine RPMs x86_64 6Server 0
Attempting to retry the failed promotion also fails ... # katello -u admin -p admin changeset promote --name "02 - rht repos" Changeset [ 02 - rht repos ] promotion failed: undefined method `sync' for nil:NilClass
The re-promotion failure in comment#1 looks similar to bug#804610. However, this changeset does not include repositories with non-ASCII characters in the name.
So this is problem in our default setting - our Katello REST client has default timeout set to 30 seconds, and I guess in very overloaded scenarios this is not enough. I suggest you to set rest_client_timeout: 150 in the katello.yml file and try to reproduce it.
So this is what we have in our katello codebase: vim /usr/share/katello/app/models/glue/pulp/repo.rb +145 if self.is_cloned_in?(to_env) #repo is already cloned, so lets just re-sync it from its parent return self.get_clone(to_env).sync else #repo is not in the next environment yet, we have to clone it there ... end While is_cloned_in? returns true, get_clone method returns nil. So Pulp reports a clone exists, but when Katello asks for it, it does not find it. In this case its: redhat-Dev-Red_Hat_Enterprise_Linux_Server-Red_Hat_Enterprise_Linux_5_Server_RPMs_x86_64_5Server So for some reason Pulp has inconsitency - when repo was cloned its id was stored in cloned_ids array, but it was not created. I would like to ask pulp guys to confirm if this is done in one "transaction" so there will not be inconsistency like this.
cc'ing Pradeep. Pradeep, can you answer the question from comment#4?
Which cloned_ids array are you referring to? the one on the parent repo? if so, pulp only stores the clone id in the cloned array after the clone repo id is created, so dont see a clone id exists in that errata and not exists. Unless when we delete the cloned repo we're not cleaning up the clone id of the parent repo's clone list? just a guess..
So the issue is Pulp did not acceptour clone repository request and our REST client timed out after 150 seconds. Pulp::Repository: Request Timeout (POST /pulp/api/repositories/redhat-Red_Hat_Enterprise_Linux_Server-Red_Hat_Enterprise_Linux_5_Server_RPMs_x86_64_5Server/clone/) [ERROR: 2012-04-02 19:27:46 #9421] /usr/lib/ruby/gems/1.8/gems/rest-client-1.6.1/lib/restclient/request.rb:174:in `transmit' Please note from the comment above I can see Pulp actually cloned the repository. So we had a timeout, but it cloned it. The issue is this should be asynchronous in Pulp, we are expecting immediate response with a job id. What is the reason it timed out? There is no error log in pulp logs. It could be a deadlock, or overload. So the solution would be: Pulp team: Investigate why Pulp locked for 150 seconds on an asynchronous REST request. I mean from the code perpective, is there anything that could block the request? Is it possible to return valid response instead letting it to timeout when this scenario is reached (e.g. mongo write block is catched)? Katello team: Write a script that would compare pulp and katello repositories called "katello-check". It's too difficuld to compare backend engines and katello database states.
Maybe this one is relevant: https://bugzilla.redhat.com/show_bug.cgi?id=807720
Created attachment 575524 [details] katello-debug-20120405183708.tar.gz Retesting on a fresh installed system using: * pulp-1.0.4-1.el6.noarch * katello-0.1.309-1.el6.noarch I am still seeing: > katello -u admin -p admin provider synchronize --name "Red Hat" > Provider [ Red Hat ] synchronized > > katello -u admin -p admin changeset create --name "02 - rht repos" > Successfully created changeset [ 02 - rht repos ] for environment [ Dev ] > > katello -u admin -p admin changeset update --name "02 - rht repos" --add_product "Red Hat Enterprise Linux Server" > Successfully updated changeset [ 02 - rht repos ] > > katello -u admin -p admin changeset update --name "02 - rht repos" --add_product "Red Hat CloudForms" > Successfully updated changeset [ 02 - rht repos ] > > katello -u admin -p admin changeset promote --name "02 - rht repos" > Changeset [ 02 - rht repos ] promotion failed: Pulp::Repository: Request Timeout (POST /pulp/api/repositories/redhat-Red_Hat_Enterprise_Linux_Server-Red_Hat_Enterprise_Linux_5_Server_RPMs_x86_64_5Server/clone/) I see a genpkgmetadata.py (createrepo) is still inprogress, even after the failure. > root 31299 1 0 17:01 ? 00:00:00 /usr/sbin/httpd > apache 31755 31299 44 17:03 ? 00:42:19 \_ (wsgi:pulp) > apache 5125 31755 70 18:31 ? 00:04:40 | \_ /usr/bin/python -t /usr/share/createrepo/genpkgmetadata.py --database --checksum sha -g /var/lib/pulp//repos/redhat/Dev/content/dist/rhel/server/5/5Server/x86_64/os/repodata/comps.xml --update /var/lib/pulp//repos/redhat/Dev/content/dist/rhel/server/5/5Server/x86_64/os The system is available for remote debugging: flatline-katello.usersys.redhat.com (root/redhat)
So the root cause is the same as the existing bug. I am marking this as a duplicate and reopening the original. *** This bug has been marked as a duplicate of bug 773088 ***
Clearing the clone flag.
I thought it's a clone of a different (similar) bug, but it is not. SUMMARY: After the content synchronization, new promotion is started. It seems for some reason something gets wrong in Pulp. - 19:15 - the sync is finished: - 19:25 - repodata generation starting (this is the last pulp log) - 19:27 - katello is trying to start promotion and it fails Unable to reproduce with smaller repos, it seems Pulp gets "overloaded". Must be reproduced with more content as James recommends.
My first finding is Katello timed out AFTER httpd reported the response was returned. My wild guess would be Apache httpd, for some reason, kept the connection alive. Pulp does not set HTTP header field Content-Length, so Katello REST client must be waiting until the connection is closed. I prepared a patch that adds this content-length header to all pulp controller responses. We will see if this helps. @James - Please apply the patch and try to reproduce on your setup, thanks.
Created attachment 576756 [details] Patch that adds Content-Length HTTP header to all Pulp controllers
A few comments ... 1) I'd like input from Lukas on whether this bug qualifies as an RC blocker. I'm able to consistently hit this race condition on my setup. Unfortunately, I don't know the parameters of the race condition, or how common it may be. That makes me think this is worth reviewing in a broader context. 2) Regarding the pulp patch in comment#19 ... After installing the system, installing katello-all, applying the patch, starting my test (begins with katello-configure) ... I'm still seeing the failure. > :: [ FAIL ] :: Running 'katello -u admin -p admin changeset promote --name "Promote Red Hat provider"' (Expected 0, got 65) > Changeset [ Promote Red Hat provider ] promotion failed: Pulp::Repository: Request Timeout (POST /pulp/api/repositories/redhat-Red_Hat_Enterprise_Linux_Server-Red_Hat_Enterprise_Linux_5_Server_RPMs_x86_64_5Server/clone/) > Error (65): command failed. Retry (r), Continue (c), stop (s)? s > [root@flatline-katello ~]# date > Wed Apr 11 17:12:56 EDT 2012 3) Is "CloudForms Common" "pymongo" the proper place for this bug? Should it be assigned to pulp?
I'm rerunning my test on several previously affected systems ...
On flatline-katello.usersys.redhat.com ... > :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: > :: [ LOG ] :: Sync and promote 'Red Hat' provider > :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: > > :: [ PASS ] :: Running 'katello -u admin -p admin provider synchronize --name "Red Hat"' > :: [ PASS ] :: Running 'katello -u admin -p admin changeset create --name "Promote Red Hat provider"' > :: [ PASS ] :: Running 'katello -u admin -p admin changeset update --name "Promote Red Hat provider" --add_product "Red Hat Enterprise Linux Server"' > :: [ PASS ] :: Running 'katello -u admin -p admin changeset update --name "Promote Red Hat provider" --add_product "Red Hat CloudForms"' > :: [ FAIL ] :: Running 'katello -u admin -p admin changeset promote --name "Promote Red Hat provider"' (Expected 0, got 65) Changeset [ Promote Red Hat provider ] promotion failed: Pulp::Repository: Request Timeout (POST /pulp/api/repositories/redhat-Red_Hat_Enterprise_Linux_Server-Red_Hat_Enterprise_Linux_5_Server_RPMs_x86_64_5Server/clone/) > :: [ LOG ] :: Duration: 1h 9m 45s > :: [ LOG ] :: Assertions: 4 good, 1 bad > :: [ FAIL ] :: RESULT: Sync and promote 'Red Hat' provider > :: [ LOG ] :: JOURNAL XML: /tmp/beakerlib-4dqVhGV/journal.xml > :: [ LOG ] :: JOURNAL TXT: /tmp/beakerlib-4dqVhGV/journal.txt
I think this is a blocker, major feature is not working and you are able to reproduce it anytime we want. Correcting component, setting to "pulp". Thanks for retesting.
I am not sure if this is related, but we are suffering from the same bug as here on the testing box: [root@flatline-katello ~]# service httpd restart Stopping httpd: [FAILED] Starting httpd: Syntax error on line 10 of /etc/httpd/conf.d/katello.conf: SSLCertificateFile: file '/etc/candlepin/certs/candlepin-ca.crt' does not exist or is empty [FAILED] [root@flatline-katello ~]# ll /etc/candlepin/certs/candlepin-ca.crt ls: cannot access /etc/candlepin/certs/candlepin-ca.crt: No such file or directory https://bugzilla.redhat.com/show_bug.cgi?id=813812
Another finding, during the long script run I noticed there are some errors in the pulp.log: PulpException: Architecture must be one of [noarch, i386, i686, ppc64, ppc, s390x, x86_64, ia64]
Tried to sync this manually using pulp-admin and it worked. We should do the same with REST API. Sync for repository test_lzap2 started Sync: Finished 0/13514 new items downloaded 13514/13514 existing items processed Item Details: Tree Files: 15/15 RPMs: 13499/13499
(In reply to comment #29) > James will be updating the BZ later tonight after a test run of his completes > but most likely this will get moved to 1.1 Turns out the cause of this bug was using a cached /var/lib/pulp directory structure. In my testing, since I'm downloading >100G of data from the CDN, I keep /var/lib/pulp as a separate disk partition. During re-installs, I leave /var/lib/pulp unchanged. This helps reduce the network lag downloading the content. Unfortunately, this also introduces this bug. The presence of non-Library environment files on disk was causing problems. Thankfully, I'm able to continue to cache content to reduce network download delays. However, I need to remove all non-Library environment files before sync and promotion. The following commands will remove the offending previously cached content... > shopt -s extglob > for DIR in $(echo /var/lib/pulp/repos/redhat/!(Library) /var/lib/pulp/published/repos/redhat/!(Library)) > do > test -d "$DIR" && rm -rf "$DIR" > done This problem should not present on a fresh installed system with a fully formatted disk. I'm adjusting the target release accordingly. Pradeep suggested there was a difference in how pulp and katello were responding to the promotion request. I'm leaving this bug open to further review the difference in case some additional polling is needed by katello to handle this situation.
Uh. James please report all deviations from the standard installation in advance, because I really thought this is "clean" instance (without any re-sharing or something). ;-) I guess this worth sharing on list, because more folks may be doing this. Thanks!
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: When installing and preparing a CloudForms System Engine environment, re-using any previously cached non-Library pulp data may introduce failures during content promotion. Administrators are advised to start with a fresh file-system and not re-use any previously synced pulp data. Prior to installation, the directory /var/lib/pulp should not exist.
Release Note added. Link: http://documentation-stage.bne.redhat.com/docs/en-US/CloudForms/1.0/html-single/Release_Notes/index.html#sect-Release_Notes-System_Engine-System_Engine_Considerations-known_issues_09 Regards, Shikha
I did not see this in the training setup. Moving to ON_QA to verify.
This was delivered with Satellite 6.0 which was released on 10 September 2014.