Bug 1552699

Summary: [ceph-ansible] [ceph-container] : newly added OSDs restarting continuously receiving signal Interrupt from PID: 0
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Vasishta <vashastr>
Component: Ceph-Ansible
Assignee: Sébastien Han <shan>
Status: CLOSED NOTABUG
QA Contact: Vasishta <vashastr>
Severity: high
Docs Contact: Aron Gunn <agunn>
Priority: unspecified
Version: 3.0
CC: adeza, agunn, aschoen, ceph-eng-bugs, gmeno, hnallurv, kdreyer, nthomas, sankarshan
Target Milestone: z2
Target Release: 3.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Known Issue
Doc Text:
.For Red Hat Ceph Storage deployments running within containers, adding a new OSD will cause the new OSD daemon to continuously restart
Adding a new OSD to an existing Ceph Storage Cluster running within a container will restart the new OSD daemon every 5 minutes. As a result, the storage cluster will not achieve a `HEALTH_OK` state. Currently, there is no workaround for this issue. This does not affect already running OSD daemons.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-03-08 17:51:34 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1494421, 1544643
Attachments:
File contains contents of OSD journald log (Flags: none)
File contains contents of ansible-playbook log (Flags: none)

Description Vasishta 2018-03-07 14:53:00 UTC
Created attachment 1405395 [details]
File contains contents of OSD journald log

Description of problem:
Tried adding a new OSD to a containerized cluster that was upgraded from 3.0 live to RC 3.0.z1. The new OSDs are getting shut down continuously after receiving signal Interrupt from PID: 0.

Version-Release number of selected component (if applicable):
Container - rhceph:3-3

How reproducible:
Always (2/2)

Steps to Reproduce:
1. Upgrade a containerized cluster from 3.0 live to rhceph:3-3
2. Add a new OSD node with the --limit option (see the command sketch below)
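
For reference, a minimal sketch of the add-OSD run, assuming the stock ceph-ansible location (/usr/share/ceph-ansible) and an illustrative inventory host name osd5 (both are assumptions, not taken from the report):

# containerized cluster, so the docker variant of the site playbook is used
cd /usr/share/ceph-ansible
ansible-playbook site-docker.yml --limit osd5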

Actual results:
OSDs (dmcrypt enabled + collocated journal) are getting restarted -
ceph-osd-run.sh[22803]: 2018-03-07 14:30:12.947659 7ff421d11d00 -1 osd.9 2164 log_to_monitors {default=true}
ceph-osd-run.sh[22803]: 2018-03-07 14:35:07.200026 7ff3faa6a700 -1 Fail to open '/proc/0/cmdline' error = (2) No such file or directory
ceph-osd-run.sh[22803]: 2018-03-07 14:35:07.200044 7ff3faa6a700 -1 received  signal: Interrupt from  PID: 0 task name: <unknown> UID: 0
ceph-osd-run.sh[22803]: 2018-03-07 14:35:07.200048 7ff3faa6a700 -1 osd.9 2229 *** Got signal Interrupt ***
ceph-osd-run.sh[22803]: 2018-03-07 14:35:07.200062 7ff3faa6a700 -1 osd.9 2229 shutdown



Expected results:
OSD services must be up and running

Additional info:

When this issue was tried on another cluster, the services stabilized after two shutdown/restart cycles.
The --limit option was used when the playbook was initiated to add the new OSD node.

The playbook eventually failed while trying to restart the OSDs, reporting:
"PGs were not reported as active+clean", "It is possible that the cluster has less OSDs than the replica configuration", "Will refuse to continue"

Comment 3 Vasishta 2018-03-07 14:59:32 UTC
Created attachment 1405398 [details]
File contains contents of ansible-playbook log

Comment 4 Sébastien Han 2018-03-07 16:51:13 UTC
I'm looking into the machine at the moment

Comment 6 Sébastien Han 2018-03-08 10:22:07 UTC
Thanks for letting me know Ken. I'm still unsure of what the real error is, still investigating.

Comment 7 Sébastien Han 2018-03-08 17:51:34 UTC
After a day of hard debugging, it turns out that the machine where these new OSDs are running has bad firewall rules. That was sneaky because the OSDs appeared up when they were started and down when they were stopped, so it was hard to believe the firewall was involved.

After disabling firewalld, everything works.
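
For the record, instead of disabling firewalld entirely, opening the Ceph ports on the OSD node should also work; a minimal sketch, assuming the default Ceph OSD port range (6800-7300/tcp) and the default public zone:

# open the OSD port range permanently, then reload the rules
firewall-cmd --zone=public --add-port=6800-7300/tcp --permanent
firewall-cmd --reload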

Please check your hardware setup before running a deployment. We can't afford to waste time on issues like this. This is not the first time either, so please be more careful from now on.
Thanks.