Bug 1552699 - [ceph-ansible] [ceph-container] : newly added OSDs restarting continuously receiving signal Interrupt from PID: 0
Summary: [ceph-ansible] [ceph-container] : newly added OSDs restarting continuously re...
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Ceph-Ansible
Version: 3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: z2
Target Release: 3.0
Assignee: Sébastien Han
QA Contact: Vasishta
Docs Contact: Aron Gunn
URL:
Whiteboard:
Depends On:
Blocks: 1494421 1544643
 
Reported: 2018-03-07 14:53 UTC by Vasishta
Modified: 2018-03-08 17:51 UTC
CC List: 9 users

Fixed In Version:
Doc Type: Known Issue
Doc Text:
.For Red Hat Ceph Storage deployments running within containers, adding a new OSD causes the new OSD daemon to continuously restart
Adding a new OSD to an existing Ceph Storage Cluster running within a container restarts the new OSD daemon every 5 minutes. As a result, the storage cluster will not achieve a `HEALTH_OK` state. Currently, there is no workaround for this issue. This issue does not affect already running OSD daemons.
Clone Of:
Environment:
Last Closed: 2018-03-08 17:51:34 UTC
Embargoed:


Attachments (Terms of Use)
File contains contents of OSD journald log (457.42 KB, text/x-vhdl), 2018-03-07 14:53 UTC, Vasishta
File contains contents of ansible-playbook log (2.22 MB, text/plain), 2018-03-07 14:59 UTC, Vasishta

Description Vasishta 2018-03-07 14:53:00 UTC
Created attachment 1405395 [details]
File contains contents of OSD journald log

Description of problem:
Tried adding a new OSD to a containerized cluster upgraded from 3.0 live to RC 3.0.z1. The new OSDs are getting shut down continuously after receiving signal Interrupt from PID: 0.

Version-Release number of selected component (if applicable):
Container - rhceph:3-3

How reproducible:
Always (2/2)

Steps to Reproduce:
1. Upgrade a containerized cluster from 3.0 live to rhceph:3-3
2. Add a new OSD node using the --limit option (a command sketch follows below)
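
For reference, a rough sketch of step 2; the inventory path, playbook name, and node name here are assumptions, not taken from this report:

# Run the containerized site playbook against the new OSD node only.
# Inventory path, playbook name, and host name are placeholders.
ansible-playbook -i /etc/ansible/hosts site-docker.yml --limit <new-osd-node>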

Actual results:
The new OSDs (dmcrypt enabled + collocated journal) are getting restarted:
ceph-osd-run.sh[22803]: 2018-03-07 14:30:12.947659 7ff421d11d00 -1 osd.9 2164 log_to_monitors {default=true}
ceph-osd-run.sh[22803]: 2018-03-07 14:35:07.200026 7ff3faa6a700 -1 Fail to open '/proc/0/cmdline' error = (2) No such file or directory
ceph-osd-run.sh[22803]: 2018-03-07 14:35:07.200044 7ff3faa6a700 -1 received  signal: Interrupt from  PID: 0 task name: <unknown> UID: 0
ceph-osd-run.sh[22803]: 2018-03-07 14:35:07.200048 7ff3faa6a700 -1 osd.9 2229 *** Got signal Interrupt ***
ceph-osd-run.sh[22803]: 2018-03-07 14:35:07.200062 7ff3faa6a700 -1 osd.9 2229 shutdown
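
The messages above come from the OSD's systemd journal; a minimal sketch of following that unit on the OSD node (the unit instance name depends on the ceph-ansible scenario, so treat it as an assumption):

# Follow the containerized OSD's journal; the instance name
# (<id-or-device>) is a placeholder and varies by scenario.
journalctl -fu ceph-osd@<id-or-device>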



Expected results:
OSD services must be up and running

Additional info:

When this issue was tried on another cluster, the services stabilized after two shutdowns and restarts.
The --limit option was used when the playbook was initiated to add the new OSD node.

The playbook eventually failed while trying to restart the OSDs, reporting:
"PGs were not reported as active+clean", "It is possible that the cluster has less OSDs than the replica configuration", "Will refuse to continue"

Comment 3 Vasishta 2018-03-07 14:59:32 UTC
Created attachment 1405398 [details]
File contains contents of ansible-playbook log

Comment 4 Sébastien Han 2018-03-07 16:51:13 UTC
I'm looking into the machine at the moment

Comment 6 Sébastien Han 2018-03-08 10:22:07 UTC
Thanks for letting me know, Ken. I'm still unsure what the real error is; still investigating.

Comment 7 Sébastien Han 2018-03-08 17:51:34 UTC
After a day of hard debugging, it turns out that the machine where these new OSDs are running has bad firewall rules. That was sneaky because the OSDs appeared up when they were started and down when stopped, so it was hard to believe the firewall was involved.

After disabling firewalld, everything works.
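
Rather than disabling firewalld outright, the OSD traffic can be allowed explicitly; a minimal sketch using the default Ceph port ranges (adjust the zone to the one actually in use):

# Permit the default OSD port range (and the monitor port if a mon is
# collocated on the node), then reload firewalld.
firewall-cmd --zone=public --permanent --add-port=6800-7300/tcp
firewall-cmd --zone=public --permanent --add-port=6789/tcp
firewall-cmd --reload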

Please check your hardware setup before running a deployment. We can't afford to waste time on issues like this. Also, this is not the first time, so please be more careful from now on.
Thanks.

