Bug 1333667
| Summary: | 2606 salt-minion processes (fork bomb) after ceph package upgrade to Red Hat Ceph Storage 1.3.2 | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Vikhyat Umrao <vumrao> |
| Component: | Calamari | Assignee: | Christina Meno <gmeno> |
| Calamari sub component: | Minions | QA Contact: | Tejas <tchandra> |
| Status: | CLOSED ERRATA | Docs Contact: | Bara Ancincova <bancinco> |
| Severity: | high | | |
| Priority: | high | CC: | bsingh, ceph-eng-bugs, gmeno, hnallurv, kdreyer, linuxkidd, mhackett, vumrao |
| Version: | 1.3.2 | | |
| Target Milestone: | rc | | |
| Target Release: | 2.2 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | RHEL: calamari-server-1.5.0-1.el7cp; Ubuntu: calamari_1.5.0-2redhat1xenial | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-03-14 15:43:59 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Vikhyat Umrao, 2016-05-06 06:45:26 UTC
- We have found a similar discussion upstream: https://github.com/saltstack/salt/issues/8435
- That issue was closed with the upstream commits in https://github.com/saltstack/salt/pull/8222:

      -    'minutes': opts['mine_interval']
      +    'minutes': opts['mine_interval'],
      +    'jid_include' : True,
      +    'maxrunning' : 2

  These commits add 'maxrunning' : 2, but we are not sure whether that helps here.
- As we verified, the latest salt-minion package we ship is salt-minion-2014.1.5-3.el7cp.noarch.
- This package carries the commits from https://github.com/saltstack/salt/pull/8222, though not on the same lines of code; the downstream salt-minion-2014.1.5-3.el7cp.noarch code base may differ slightly from the upstream tree that contains this fix.
- When these Ceph systems were upgraded, all Ceph and RHEL packages were updated, but salt-minion was not, since salt-minion-2014.1.5-3.el7cp.noarch is already the latest version we have.
- In the sos report of one of the systems, the ceph package is from Fri Apr 29 18:23:29 2016:

      $ cat installed-rpms | grep ceph
      ceph-0.94.5-9.el7cp.x86_64    Fri Apr 29 18:23:29 2016

  But salt-minion is from Tue Jun 16 19:57:28 2015, as it was not upgraded:

      $ cat installed-rpms | grep salt-minion
      salt-minion-2014.1.5-3.el7cp.noarch    Tue Jun 16 19:57:28 2015

Seems like the culprit is https://github.com/saltstack/salt/pull/32373, so we'll need to investigate bumping the version of salt to 2015.8 or later.

(In reply to Gregory Meno from comment #9)
> seems like the culprit is https://github.com/saltstack/salt/pull/32373
>
> so we'll need to investigate bumping version of salt to 2015.8 or later

Thank you, Gregory. I have verified that in 2.0 we ship salt packages at version 2015.5.5-1:

    # rpm -qa | grep salt
    salt-2015.5.5-1.el7.noarch
    salt-minion-2015.5.5-1.el7.noarch

I am moving this bug to the 2.2 release, and we can rebase salt to 2015.8 or later in the 2.2 downstream.

Hello Gregory,

We have another customer experiencing this issue after adding new monitors to their cluster running RHCS 1.3.3. A day after adding the new monitors, the cluster started experiencing very slow or unresponsive MONs. On some of the MONs the customer was able to log in and briefly run top, which showed many salt-minion processes in 'D' state, so they could not be killed. The machines became so slow that they required a reboot, after which the problem cleared. This happened on several, but not all, MONs, and it was not isolated to the newly added MONs. The provided journalctl log shows large numbers of salt-minion processes, with the rgw and mon services becoming OOM-kill victims.

    # rpm -qa | grep salt
    salt-2014.1.5-3.el7cp
    salt-minion-2014.1.5-3.el7cp

The minion logs show large numbers of the following message:

    2016-11-16 14:30:59,161 [salt.minion ][ERROR ] Exception [Errno 32] Broken pipe occurred in scheduled job

I am raising this bug's priority to HIGH: this customer runs co-located MONs and RGWs, and having to reboot the nodes periodically impacts their RGW production traffic.

journalctl log file location: https://api.access.redhat.com/rs/cases/01742412/attachments/ed91a041-6dad-46ab-9d9c-5e2553a8ed08

Mike, are they actively using Calamari? If not, we could disable it. That would be, on each monitor:

    systemctl disable salt-minion
    systemctl stop salt-minion

Adding a monitor seems to have caused it? Interesting... I don't understand why that is the case. It seems Saltstack doesn't either: https://github.com/saltstack/salt/issues/32349

I'll try adding a monitor in my test setup and see if I can reproduce.
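A minimal shell sketch of that stop-gap, assuming Calamari is not actively needed on the affected monitors; the pgrep process count is an illustrative addition, not taken from the case notes:

```sh
# Count the piled-up salt-minion processes on an affected MON (illustrative check),
# then stop and disable the service as suggested above.
pgrep -cf salt-minion            # an affected MON shows an unusually high count
sudo systemctl stop salt-minion
sudo systemctl disable salt-minion
pgrep -cf salt-minion            # re-check; processes already stuck in 'D' state
                                 # may not exit until the node is rebooted
```

As noted in the customer report above, processes in uninterruptible sleep cannot be killed, so a reboot may still be required to fully clear an already affected node.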
Here is what you can do to un-schedule that heartbeat task from the admin node:

    sudo su -c 'echo "" > /opt/calamari/salt/pillar/top.sls'
    sudo salt '*' pillar.items | grep ceph.heartbeat   # verify it is gone

Here is the data I'd love to have if you see this again, run from the admin node:

    sudo salt-run jobs.list_jobs | grep heartbeat -C2
    sudo salt '*' ceph.get_heartbeats

Re: Mike, no luck reproducing, and I noticed that I missed a step in c22. This is what you should run if it happens again:

    sudo su -c 'echo "" > /opt/calamari/salt/pillar/top.sls'
    sudo salt '*' saltutil.sync_all
    sudo salt '*' pillar.items | grep ceph.heartbeat   # verify it is gone

Harish, you're probably right that saltstack won't be installed in 2.2. That being said, I don't think I'm taking any action to remove it. I probably need to test with an upgraded system, so I won't close or retarget this just yet.

@Gregory,
a) What is the decision on this bug?
b) Please share the steps to verify the fix if it's going to be fixed in 2.2.

A. Plan to fix it.
B. Steps to reproduce: take a cluster on RHCS 1.3, upgrade it to 2.2, and then add a monitor. Monitors should not be killed as out of memory. Check the "free" command before and after the addition of the monitor; the memory-used figures should be similar after 1, 5, and 15 minutes. (A sketch of this check appears at the end of this report.)

Steps followed:
1. Upgraded a Ceph 1.3.3 cluster to 2.2.
2. Added a MON after the upgrade.
3. Checked the memory usage after the addition of the MON.

The memory usage does not change much with time, and the MON does not get killed due to OOM, so I am moving this bug to verified.

Thanks,
Tejas

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0514.html
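As a footnote to the verification steps above, here is a minimal shell sketch of the memory check, assuming it is run on the node hosting the monitors; the temporary file name, the sleep intervals, and the pgrep check are illustrative assumptions rather than part of the recorded test:

```sh
# Baseline memory use before adding the new MON
# (/tmp/free-before.txt is an illustrative file name).
free -m | tee /tmp/free-before.txt

# ... add the new monitor to the cluster here ...

# Re-check at roughly 1, 5, and 15 minutes after the addition: the used-memory
# figures should stay close to the baseline, the MON should not be OOM-killed,
# and the salt-minion process count should not climb.
sleep 60;  echo "== 1 min ==";  free -m; pgrep -cf salt-minion
sleep 240; echo "== 5 min ==";  free -m; pgrep -cf salt-minion
sleep 600; echo "== 15 min =="; free -m; pgrep -cf salt-minion
```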