Bug 1503411

Summary: [iSCSI] Incorrect number of tcmu-runner daemons reported after GWs go down and come back up
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Tejas <tchandra>
Component: iSCSI
Assignee: Jason Dillaman <jdillama>
Status: CLOSED WONTFIX
QA Contact: Tejas <tchandra>
Severity: medium
Docs Contact: Erin Donnelly <edonnell>
Priority: unspecified
Version: 3.0
CC: ceph-eng-bugs, ceph-qe-bugs, edonnell, jdillama, tchandra
Target Milestone: rc
Target Release: 3.*
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Known Issue
Doc Text:
.Incorrect number of `tcmu-runner` daemons reported after iSCSI target LUNs fail and recover
After iSCSI target Logical Unit Numbers (LUNs) recover from a failure, the `ceph -s` command in certain cases outputs an incorrect number of `tcmu-runner` daemons.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-02-26 16:14:54 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1494421

Description Tejas 2017-10-18 04:42:19 UTC
Description of problem:


Version-Release number of selected component (if applicable):
ceph version 12.2.1-14.el7cp
libtcmu-1.3.0-0.4.el7cp.x86_64

The 'ceph -s' command output reports the number of active tcmu-runner daemons. I am disabling the network interface on the GW nodes and, after a while, bringing it back up.
Command used:
ifdown <eth>
ifup <eth>

Total LUNs: 122
Expected tcmu-runner daemons: 488 (122 LUNs x 4 gateways)
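A minimal reproduction sketch of the failure injection described above (the gateway hostnames, interface name, and outage duration are placeholders, not taken from this setup):

  # Sketch only: bounce the network on each iSCSI gateway, then recheck the service map.
  GWS="gw1 gw2 gw3 gw4"   # placeholder gateway hostnames
  IFACE="eth0"            # placeholder interface name

  for gw in $GWS; do
      ssh "$gw" "ifdown $IFACE"    # take the gateway off the network
  done

  sleep 300                        # leave the gateways unreachable for a while

  for gw in $GWS; do
      ssh "$gw" "ifup $IFACE"      # bring the network back up
  done

  ceph -s | grep tcmu-runner       # compare the reported count with the expected 488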

After the network on 1 GW has gone down:
 ceph -s
  cluster:
    id:     2057393b-ce5e-4821-9eb0-96519e801921
    health: HEALTH_OK
 
  services:
    mon:         3 daemons, quorum havoc,mustang,skytrain
    mgr:         mustang(active)
    osd:         20 osds: 20 up, 20 in
    rgw:         1 daemon active
    tcmu-runner: 257 daemons active   <----------------
 
  data:
    pools:   13 pools, 842 pgs
    objects: 1140k objects, 3320 GB
    usage:   9960 GB used, 12284 GB / 22245 GB avail
    pgs:     842 active+clean




After all 4 GWs have gone down and come back up:
~]# ceph -s
  cluster:
    id:     2057393b-ce5e-4821-9eb0-96519e801921
    health: HEALTH_OK
 
  services:
    mon:         3 daemons, quorum havoc,mustang,skytrain
    mgr:         mustang(active)
    osd:         20 osds: 20 up, 20 in
    rgw:         1 daemon active
    tcmu-runner: 31 daemons active    <---------------
 
  data:
    pools:   13 pools, 842 pgs
    objects: 1140k objects, 3320 GB
    usage:   9961 GB used, 12284 GB / 22245 GB avail
    pgs:     842 active+clean
 
  io:
    client:   10743 B/s rd, 111 MB/s wr, 10 op/s rd, 511 op/s wr
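A quick way to compare the reported count against the expected one by parsing the plain 'ceph -s' text shown above (sketch only; it relies on the "tcmu-runner: N daemons active" line format):

  expected=488                                           # 122 LUNs x 4 gateways
  reported=$(ceph -s | awk '/tcmu-runner:/ {print $2}')
  if [ "$reported" -ne "$expected" ]; then
      echo "tcmu-runner daemon count mismatch: reported=$reported expected=$expected"
  fi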

Comment 3 Jason Dillaman 2017-10-18 13:16:44 UTC
@Tejas: the service daemons have a 60-second grace period (by default). Did you check the daemon state after 60 seconds had passed?
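For reference, a small polling sketch that waits out the default 60-second grace period before trusting the reported count (the interval and number of iterations are arbitrary):

  # Sketch only: re-check the service map for ~2 minutes after the gateways recover.
  for i in $(seq 1 12); do
      sleep 10
      ceph -s | grep tcmu-runner
  done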