Bug 1812634
| Summary: | noobaa-operator invoked oom-killer seen on Baremetal setup | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Container Storage | Reporter: | Tiffany Nguyen <tunguyen> |
| Component: | Multi-Cloud Object Gateway | Assignee: | Nimrod Becker <nbecker> |
| Status: | CLOSED DUPLICATE | QA Contact: | Raz Tamir <ratamir> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.3 | CC: | assingh, etamir, kramdoss, madam, ocs-bugs, sostapov, vakulkar |
| Target Milestone: | --- | Keywords: | Automation |
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-03-12 08:36:11 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
The following trace is seen during execution of the workload:

inactive_anon:0KB active_anon:487360KB inactive_file:0KB active_file:0KB unevictable:0KB
[28294.970404] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[28294.973044] [785928] 0 785928 25557 873 61440 0 -998 pod
[28294.975423] [786041] 0 786041 35709 639 172032 0 -999 conmon
[28294.977995] [786054] 1000540000 786054 140639 128173 1097728 0 -998 noobaa-operator
[28294.980931] Memory cgroup out of memory: Killed process 785928 (pod) total-vm:102228kB, anon-rss:2864kB, file-rss:628kB, shmem-rss:0kB
[28295.422414] oom_reaper: reaped process 785928 (pod), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[28351.348606] noobaa-operator invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-998
[28351.352656] noobaa-operator cpuset=crio-e9522d04a14dd034a744cdb71362666bd1b49ec0a5095f59645c92f5550278e0.scope mems_allowed=0
[28351.354586] CPU: 7 PID: 786142 Comm: noobaa-operator Not tainted 4.18.0-147.5.1.el8_1.x86_64 #1
[28351.356102] Hardware name: Red Hat RHEV Hypervisor, BIOS 1.11.0-2.el7 04/01/2014
[28351.357365] Call Trace:
[28351.357760] dump_stack+0x5c/0x80
[28351.358330] dump_header+0x6e/0x27a
[28351.358873] oom_kill_process.cold.29+0xb/0x10
[28351.359596] out_of_memory+0x1ba/0x490
[28351.360343] mem_cgroup_out_of_memory+0x49/0x80
[28351.361015] try_charge+0x6fa/0x780
[28351.361604] ? __alloc_pages_nodemask+0xef/0x280
[28351.362409] mem_cgroup_try_charge+0x8b/0x1a0
[28351.363179] mem_cgroup_try_charge_delay+0x1c/0x40
[28351.364082] do_anonymous_page+0xb5/0x370
[28351.364742] __handle_mm_fault+0x66e/0x6b0
[28351.365573] handle_mm_fault+0xda/0x200
[28351.366219] __do_page_fault+0x22b/0x4e0
[28351.366866] do_page_fault+0x32/0x110
[28351.367522] ? async_page_fault+0x8/0x30
[28351.368136] async_page_fault+0x1e/0x30
[28351.368839] RIP: 0033:0x45cdb1
[28351.369419] Code: fc 89 07 89 4c 1f fc c3 48 8b 06 48 89 07 c3 48 8b 06 48 8b 4c 1e f8 48 89 07 48 89 4c 1f f8 c3 f3 0f 6f 06 f3 0f 6f 4c 1e f0 <f3> 0f 7f 07 f3 0f 7f 4c 1f f0 c3 f3 0f 6f 06 f3 0f 6f 4e 10 f3 0f

Changing the summary to reflect the actual issue. We already have an OOM issue on the operator, so duping into that one.

*** This bug has been marked as a duplicate of bug 1799077 ***
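The task table in the OOM trace reports `total_vm` and `rss` in pages, not kilobytes, which makes the noobaa-operator footprint easy to misread. A minimal Python sketch for reading those rows, assuming 4 KiB pages (the x86_64 default; the node runs an x86_64 kernel). `parse_oom_tasks` and `OomTask` are hypothetical helpers, not part of any tool used in this bug:

```python
import re
from dataclasses import dataclass

# Assumption: 4 KiB pages (x86_64 default). The OOM task table prints
# total_vm and rss in pages; pgtables_bytes is already in bytes.
PAGE_SIZE_KIB = 4

# One row of the task table, e.g.
# [28294.977995] [786054] 1000540000 786054 140639 128173 1097728 0 -998 noobaa-operator
# The "[ pid ]" header row does not match because it has no numeric PID.
ROW_RE = re.compile(
    r"\[\s*[\d.]+\]\s+\[\s*(?P<pid>\d+)\]\s+(?P<uid>\d+)\s+(?P<tgid>\d+)\s+"
    r"(?P<total_vm>\d+)\s+(?P<rss>\d+)\s+(?P<pgtables_bytes>\d+)\s+"
    r"(?P<swapents>\d+)\s+(?P<oom_score_adj>-?\d+)\s+(?P<name>\S+)"
)


@dataclass
class OomTask:
    pid: int
    name: str
    total_vm_kib: int
    rss_kib: int
    oom_score_adj: int


def parse_oom_tasks(log: str) -> list[OomTask]:
    """Hypothetical helper: extract OOM task-table rows, converting pages to KiB."""
    return [
        OomTask(
            pid=int(m["pid"]),
            name=m["name"],
            total_vm_kib=int(m["total_vm"]) * PAGE_SIZE_KIB,
            rss_kib=int(m["rss"]) * PAGE_SIZE_KIB,
            oom_score_adj=int(m["oom_score_adj"]),
        )
        for m in ROW_RE.finditer(log)
    ]


if __name__ == "__main__":
    # Rows copied from the trace above.
    sample = (
        "[28294.970404] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name\n"
        "[28294.973044] [785928] 0 785928 25557 873 61440 0 -998 pod\n"
        "[28294.975423] [786041] 0 786041 35709 639 172032 0 -999 conmon\n"
        "[28294.977995] [786054] 1000540000 786054 140639 128173 1097728 0 -998 noobaa-operator\n"
    )
    for t in parse_oom_tasks(sample):
        print(f"{t.name}: rss={t.rss_kib} KiB ({t.rss_kib / 1024:.1f} MiB), "
              f"oom_score_adj={t.oom_score_adj}")
```

Applied to the rows above, noobaa-operator's 128173 pages come to 512692 KiB (about 500.7 MiB). The 4 KiB assumption is consistent with the trace itself: the pod row's 25557 pages × 4 = 102228 KiB, matching `total-vm:102228kB` in the "Killed process" line.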
Description of problem (please be as detailed as possible and provide log snippets):
The SmallFile workload fails due to a system crash. Partial logs:

[ 0.000000] Command line: BOOT_IMAGE=(hd0,gpt1)/ostree/rhcos-b982129bf5832cda2831ddb43e12456238cfa2dab888fbda92ab4d5898d6819d/vmlinuz-4.18.0-147.5.1.el8_1.x86_64 rhcos.root=crypt_rootfs console=tty0 console=ttyS0,115200n8 ignition.platform.id=metal rd.luks.options=discard ostree=/ostree/boot.0/rhcos/b982129bf5832cda2831ddb43e12456238cfa2dab888fbda92ab4d5898d6819d/0
[ 0.000000] x86/fpu: x87 FPU will use FXSAVE
[ 0.000000] BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
[ 0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000bffdbfff] usable
[ 0.000000] BIOS-e820: [mem 0x00000000bffdc000-0x00000000bfffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000023fffffff] usable
[ 0.000000] NX (Execute Disable) protection: active
[ 0.000000] SMBIOS 2.8 present.
[ 0.000000] DMI: Red Hat RHEV Hypervisor, BIOS 1.11.0-2.el7 04/01/2014
[ 0.000000] Hypervisor detected: KVM
[ 0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
[ 0.000000] kvm-clock: cpu 0, msr 7ba01001, primary cpu clock
[ 0.000000] kvm-clock: using sched offset of 165951736759 cycles
[ 0.000000] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
[ 0.000000] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
[ 0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable
[ 0.000000] last_pfn = 0x240000 max_arch_pfn = 0x400000000
[ 0.000000] MTRR default type: write-back
[ 0.000000] MTRR fixed ranges enabled:
[ 0.000000] 00000-9FFFF write-back
[ 0.000000] A0000-BFFFF uncachable
[ 0.000000] C0000-FFFFF write-protect
[ 0.000000] MTRR variable ranges enabled:
[ 0.000000] 0 base 0000C0000000 mask 3FFFC0000000 uncachable
[ 0.000000] 1 disabled
[ 0.000000] 2 disabled
[ 0.000000] 3 disabled
[ 0.000000] 4 disabled
[ 0.000000] 5 disabled
[ 0.000000] 6 disabled
[ 0.000000] 7 disabled
[ 0.000000] x86/PAT: Configuration [0-7]: WB WC UC- UC WB WP UC- WT
[ 0.000000] last_pfn = 0xbffdc max_arch_pfn = 0x400000000
[ 0.000000] found SMP MP-table at [mem 0x000f6260-0x000f626f] mapped at [(____ptrval____)]
[ 0.000000] Base memory trampoline at [(____ptrval____)] 99000 size 24576

Version of all relevant components (if applicable):
* Ceph Version 14.2.4-125.el8cp (db63624068590e593c47150c7574d08c1ec0d3e4) nautilus (stable)
* Cluster Version 4.3.0-0.nightly-2020-03-09-200240
* OCS operator v4.3.0-369.ci

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?
Seen with all SmallFile test cases.

Steps to Reproduce:
1. Deploy an OCS cluster on bare metal.
2. Run the SmallFile workload from run-ci on the bare metal environment.

Actual results:
All SmallFile workload test cases failed with:
assert not get_logs_with_errors()
E AssertionError: assert not {'dhcp-1-11-248.dsal.lab.eng.rdu2.redhat.com': '[ 0.000000] Linux version 4.18.0-147.5.1.el8_1.x86_64 (mockbuild@x8...different security settings for (dev mqueue, type mqueue)\n[50183.653702] device vethebcbec6c left promiscuous mode\n'}
E + where {'dhcp-1-11-248.dsal.lab.eng.rdu2.redhat.com': '[ 0.000000] Linux version 4.18.0-147.5.1.el8_1.x86_64 (mockbuild@x8...different security settings for (dev mqueue, type mqueue)\n[50183.653702] device vethebcbec6c left promiscuous mode\n'} = get_logs_with_errors()

Expected results:

Additional info:
* Full log: http://magna012.ceph.redhat.com/ocs-ci-logs-1583871218/tests/e2e/workloads/test_small_file_workload.py/TestSmallFileWorkload/test_smallfile_workload-4-50000-4-1-CephBlockPool/logs
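The failing assertion only shows the beginning and end of each node's kernel log (the value is truncated with `...`), which hides the lines that actually made it fail. As a hypothetical triage aid, and not the implementation of the test suite's `get_logs_with_errors()`, here is a minimal Python sketch that pulls out just the OOM-related lines, assuming the node logs have already been collected into a `{node: log_text}` dict. The node names and the abridged sample lines are illustrative only:

```python
# Hypothetical triage helper, NOT the test suite's get_logs_with_errors().
# The marker strings are copied from the kernel trace in this bug.
OOM_MARKERS = (
    "invoked oom-killer",
    "Memory cgroup out of memory",
    "oom_reaper: reaped process",
)


def find_oom_lines(node_logs: dict[str, str]) -> dict[str, list[str]]:
    """Return {node: [matching lines]} for each node whose log contains an OOM marker."""
    hits: dict[str, list[str]] = {}
    for node, log in node_logs.items():
        matched = [line for line in log.splitlines()
                   if any(marker in line for marker in OOM_MARKERS)]
        if matched:
            hits[node] = matched
    return hits


if __name__ == "__main__":
    # Hypothetical node names; abridged excerpts from this bug's logs.
    logs = {
        "node-a": (
            "[28294.980931] Memory cgroup out of memory: Killed process 785928 (pod)\n"
            "[50183.653702] device vethebcbec6c left promiscuous mode\n"
        ),
        "node-b": "[ 0.000000] Linux version 4.18.0-147.5.1.el8_1.x86_64\n",
    }
    for node, lines in find_oom_lines(logs).items():
        print(node)
        for line in lines:
            print("  " + line)
```

Matching on fixed substrings keeps the sketch simple; a real check would likely need a broader, agreed-upon list of error patterns.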