Bug 2006307
| Summary: | Pulp2-Pulp3 migration failed due to mongodb oom errors | | |
|---|---|---|---|
| Product: | Red Hat Satellite | Reporter: | Stephen Wadeley <swadeley> |
| Component: | Pulp | Assignee: | satellite6-bugs <satellite6-bugs> |
| Status: | CLOSED NOTABUG | QA Contact: | Stephen Wadeley <swadeley> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 6.9.0 | CC: | ggainey, jjeffers, jsherril, mdepaulo, osousa, ttereshc |
| Target Milestone: | 6.9.7 | Keywords: | Regression, Triaged |
| Target Release: | Unused | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-10-07 10:24:09 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Stephen Wadeley
2021-09-21 13:20:42 UTC
I stood up a pulpcore-3.7.8 / pulp-2to3-0.11.5 / pulp-2.21.5 box on a VM with 15 GB of memory, issuing the following steps (shell history numbers included):

```
711  pulp-admin login -u admin -p admin
713  ./rhel8_setup.bsh
716  pulp migration plan create --plan '{"plugins": [{"type": "rpm"}]}'
729  pulp migration plan run --href /pulp/api/v3/migration-plans/9299eeba-81e2-4863-bfee-963e5a2b2ab6/
```

The script "rhel8_setup" syncs RHEL8 BaseOS and Kickstart into pulp2:

```bash
#!/bin/bash -v
BASE='rhel8-baseos'
STREAM='rhel8-appstream'
KS='rhel8-ks'
REMOTE1='http://cdn.redhat.com/content/dist/rhel8/8/x86_64/baseos/os/'
REMOTE2='https://cdn.redhat.com/content/dist/rhel8/8/x86_64/baseos/kickstart/'
DEST='destination'

pulp-admin rpm repo create --repo-id=$BASE --relative-url=$BASE --feed=$REMOTE1 \
  --download-policy on_demand \
  --generate-sqlite false --repoview false \
  --feed-ca-cert /home/vagrant/devel/pulp_startup/CDN_cert/redhat-uep.pem \
  --feed-cert /home/vagrant/devel/pulp_startup/CDN_cert/cdn.crt \
  --feed-key /home/vagrant/devel/pulp_startup/CDN_cert/cdn.key

pulp-admin rpm repo sync run --repo-id=$BASE

pulp-admin rpm repo create --repo-id=$KS --relative-url=$KS --feed=$REMOTE2 \
  --download-policy on_demand \
  --generate-sqlite false --repoview false \
  --feed-ca-cert /home/vagrant/devel/pulp_startup/CDN_cert/redhat-uep.pem \
  --feed-cert /home/vagrant/devel/pulp_startup/CDN_cert/cdn.crt \
  --feed-key /home/vagrant/devel/pulp_startup/CDN_cert/cdn.key

pulp-admin rpm repo sync run --repo-id=$KS

pulp-admin repo list
```

After syncing into Pulp2, the machine was at 7.5 GB used. Across the migration, that rose to ~10.1 GB used. The migration completed successfully.

Questions:

* What versions of pulpcore/pulp-2to3/pulp2 were running?
* When "free -h" was run pre-migration, what were the results?
* When you say "oom failures between each step" - does that mean steps 2, 4, and 6 all failed with OOM?
  (If so - something is def Not Right on that machine, since pulp2 hasn't changed, and has been able to sync RHEL8/RHEL8-KS just fine in that environment.)
* What are the journalctl entries from the minute before to the minute after the OOM report?
* Why do we think this was a pulp problem? If you run a pulp2 system out of memory, for **any** reason, OOMKiller is probably going to choose Mongo to shoot, because it's the largest single-process memory use.

The foreman log doesn't seem to have anything useful in this case - I can't even tell from it when the problem was encountered :(

(In reply to Grant Gainey from comment #3)
> Questions:
> * What versions of pulpcore/pulp-2to3/pulp2 were running?

```
python3-pulpcore-3.7.8-1.el7pc.noarch
tfm-rubygem-pulpcore_client-3.7.1-1.el7sat.noarch
python3-pulp-2to3-migration-0.11.4-1.el7pc.noarch
pulp-client-1.0-2.noarch
pulp-katello-1.0.3-1.el7sat.noarch
pulp-maintenance-2.21.5.2-1.el7sat.noarch
pulp-rpm-plugins-2.21.5.1-1.el7sat.noarch
pulp-server-2.21.5.2-1.el7sat.noarch
```

> * When "free -h" was run pre-migration, what were the results?

```
~]# free -h
              total        used        free      shared  buff/cache   available
Mem:            19G         11G        6.9G         83M        927M        7.4G
Swap:            0B          0B          0B
```

> * When you say "oom failures between each step" - does that mean steps 2, 4,
> and 6 all failed with OOM? (If so - something is def Not Right on that
> machine, since pulp2 hasn't changed, and has been able to sync
> RHEL8/RHEL8-KS just fine in that environment)

There were oom failures after step 6 in my testing, but in comment 0 I am putting in extra checks so we can track any leaks.

> * What are the journalctl entries from the minute before to the minute after
> the OOM report?

I no longer have that system, I will retest to get you that info.

> * Why do we think this was a pulp problem? If you run a pulp2 system out of
> memory, for **any** reason, OOMKiller is probably going to choose Mongo to
> shoot, because it's the largest single-process memory use.
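The "extra checks so we can track any leaks" between steps could take the shape of a small wrapper that records available memory before and after each pulp-admin step. This is a hypothetical sketch, not from the bug report: the script name, log path, and function names are invented for illustration; it reads MemAvailable from /proc/meminfo so numbers are comparable across steps.

```shell
#!/bin/bash
# mem_check.sh - hypothetical helper to track memory across migration steps.

# Current available memory in kB, as estimated by the kernel.
mem_available_kb() {
    awk '/^MemAvailable:/ {print $2}' /proc/meminfo
}

# Run a step and append a before/after memory snapshot to a log.
log_step() {
    local label="$1"; shift
    local before after
    before=$(mem_available_kb)
    "$@"                                    # the actual step, e.g. a pulp-admin command
    after=$(mem_available_kb)
    echo "${label}: MemAvailable ${before} kB -> ${after} kB" >> /tmp/mem.log
}

# Example usage (repo id taken from the reproducer script above):
# log_step "sync-baseos" pulp-admin rpm repo sync run --repo-id=rhel8-baseos
```

A steadily shrinking "after" value across steps would point at a leak rather than a one-off spike.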
interesting, let me reproduce on latest snap to get more logs

thank you

The log does help, yes thanks!

The machine from the log has no swap, and still only 20 GB of memory:

```
[root@host ~]# free
              total        used        free      shared  buff/cache   available
Mem:       20379856    16112344      185672      209452     4081840     3703988
Swap:             0           0           0
```

I see a services-restart happening at 08:22:07, followed by kernel-thread1 invoking OOMKiller at 08:22:28:

```
Sep 29 08:22:28 host kernel: thread1 invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
```

with Mongo being chosen and killed. I don't see a single pulp task started yet - the workers came up, but mongo died before I see any "task started" messages.

At the point OOMKiller is invoked, it looks like there are only ~151 MB of memory free (assuming page size is 4K; the log shows free:37818). Also, we're still running without swap, it looks like - is that intended?

```
Sep 29 08:22:28 host kernel: 0 pages in swap cache
Sep 29 08:22:28 host kernel: Swap cache stats: add 0, delete 0, find 0/0
Sep 29 08:22:28 host kernel: Free swap  = 0kB
Sep 29 08:22:28 host kernel: Total swap = 0kB
```

MikeDep333 has added swap and restarted services, we'll see how it goes. So far, I don't see anything Pulp can be active on in this report.

(In reply to Grant Gainey from comment #6)
> The log does help, yes thanks!

good

> The machine from the log has no swap, and still only 20Gb of memory:

that is the default SatLab test system, and RAM is as per spec in docs; see bottom of comment 0

> Also, we're still running without swap, it looks like - is that intended?

I only discovered this SatLab VM shortcoming last week when working on this; the SatLab team will fix that in the coming sprint.

> MikeDep333 has added swap and restarted services, we'll see how it goes. So
> far, I don't see anything Pulp can be active on in this report.
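The "~151 MB free" estimate above comes from the free-page count in the kernel OOM report. As a quick sketch of the arithmetic (assuming the usual 4 kB x86_64 page size and the free:37818 value quoted from the log):

```shell
# Convert the OOM report's free-page count into kB and MB.
free_pages=37818     # "free:37818" from the kernel OOM report
page_kb=4            # default x86_64 page size in kB (the assumption in the comment)
free_kb=$((free_pages * page_kb))
echo "free: ${free_kb} kB (~$((free_kb / 1000)) MB)"
```

which prints `free: 151272 kB (~151 MB)` - far below what mongod needs to stay resident on a box this size.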
OK, and if swap proves to be crucial, we should file a PR against the docs to make this line stronger: "In addition, a minimum of 4 GB RAM of swap space is also recommended." Note it only says "recommended".

thank you
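For reference, adding the recommended swap on a VM like this one is a one-time setup along these lines. This is a sketch of the standard swap-file procedure, not commands taken from the bug report; the /swapfile path and the 4 GB size are assumptions matching the docs line quoted above. Run as root:

```shell
# Create and enable a 4 GB swap file.
fallocate -l 4G /swapfile      # or: dd if=/dev/zero of=/swapfile bs=1M count=4096
chmod 600 /swapfile            # swap files must not be readable by other users
mkswap /swapfile               # format the file as swap space
swapon /swapfile               # enable it immediately
echo '/swapfile none swap defaults 0 0' >> /etc/fstab   # persist across reboots
free -h                        # the Swap line should now show 4.0G total
```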