Bug 1552699

Summary: [ceph-ansible] [ceph-container] : newly added OSDs restarting continuously receiving signal Interrupt from PID: 0
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Vasishta <vashastr>
Component: Ceph-Ansible
Assignee: Sébastien Han <shan>
Status: CLOSED NOTABUG
QA Contact: Vasishta <vashastr>
Severity: high
Docs Contact: Aron Gunn <agunn>
Priority: unspecified
Version: 3.0
CC: adeza, agunn, aschoen, ceph-eng-bugs, gmeno, hnallurv, kdreyer, nthomas, sankarshan
Target Milestone: z2
Target Release: 3.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Known Issue
Doc Text:
.For Red Hat Ceph Storage deployments running within containers, adding a new OSD will cause the new OSD daemon to continuously restart
Adding a new OSD to an existing Ceph Storage Cluster running within a container will restart the new OSD daemon every 5 minutes. As a result, the storage cluster will not achieve a `HEALTH_OK` state. Currently, there is no workaround for this issue. This does not affect already running OSD daemons.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-03-08 17:51:34 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1494421, 1544643
Attachments:
File contains contents of OSD journald log (Flags: none)
File contains contents of ansible-playbook log (Flags: none)

Description Vasishta 2018-03-07 14:53:00 UTC
Created attachment 1405395 [details]
File contains contents of OSD journald log

Description of problem:
Tried adding a new OSD to a containerized cluster that was upgraded from 3.0 live to RC 3.0.z1. The new OSDs are getting shut down continuously after receiving signal Interrupt from PID: 0.

Version-Release number of selected component (if applicable):
Container - rhceph:3-3

How reproducible:
Always (2/2)

Steps to Reproduce:
1. Upgrade a containerized cluster from 3.0 live to rhceph:3-3
2. Add a new OSD node with the --limit option (see the command sketch below)
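
For reference, a minimal sketch of the add-OSD run, assuming the stock ceph-ansible location (/usr/share/ceph-ansible) and an illustrative inventory host name osd5 (both are assumptions, not taken from the report):

# containerized cluster, so the docker variant of the site playbook is used
cd /usr/share/ceph-ansible
ansible-playbook site-docker.yml --limit osd5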

Actual results:
OSDs (dmcrypt enabled + collocated journal) are getting restarted -
ceph-osd-run.sh[22803]: 2018-03-07 14:30:12.947659 7ff421d11d00 -1 osd.9 2164 log_to_monitors {default=true}
ceph-osd-run.sh[22803]: 2018-03-07 14:35:07.200026 7ff3faa6a700 -1 Fail to open '/proc/0/cmdline' error = (2) No such file or directory
ceph-osd-run.sh[22803]: 2018-03-07 14:35:07.200044 7ff3faa6a700 -1 received  signal: Interrupt from  PID: 0 task name: <unknown> UID: 0
ceph-osd-run.sh[22803]: 2018-03-07 14:35:07.200048 7ff3faa6a700 -1 osd.9 2229 *** Got signal Interrupt ***
ceph-osd-run.sh[22803]: 2018-03-07 14:35:07.200062 7ff3faa6a700 -1 osd.9 2229 shutdown



Expected results:
OSD services must be up and running

Additional info:

When this issue was tried on another cluster, the services stabilized after two shutdown/restart cycles.
The --limit option was used when the playbook was initiated to add the new OSD node.

The playbook eventually failed while trying to restart the OSDs, reporting:
"PGs were not reported as active+clean", "It is possible that the cluster has less OSDs than the replica configuration", "Will refuse to continue"

Comment 3 Vasishta 2018-03-07 14:59:32 UTC
Created attachment 1405398 [details]
File contains contents of ansible-playbook log

Comment 4 Sébastien Han 2018-03-07 16:51:13 UTC
I'm looking into the machine at the moment

Comment 6 Sébastien Han 2018-03-08 10:22:07 UTC
Thanks for letting me know Ken. I'm still unsure of what the real error is, still investigating.

Comment 7 Sébastien Han 2018-03-08 17:51:34 UTC
After a day of hard debugging, it turns out that the machine where these new OSDs are running has bad firewall rules. That was sneaky because the OSDs appeared up when they were started and down when they were stopped, so it was hard to believe the firewall was involved.

After disabling firewalld, everything works.
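
For the record, instead of disabling firewalld entirely, opening the Ceph ports on the OSD node should also work; a minimal sketch, assuming the default Ceph OSD port range (6800-7300/tcp) and the default public zone:

# open the OSD port range permanently, then reload the rules
firewall-cmd --zone=public --add-port=6800-7300/tcp --permanent
firewall-cmd --reload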

Please check your hardware setup before running a deployment. We can't afford to waste time on issues like this. This is not the first time either, so please be more careful from now on.
Thanks.