
Bug 1503411

Summary: [iSCSI]: Incorrect number of tcmu-runner daemons reported after GWs go down and come back up
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Tejas <tchandra>
Component: iSCSI
Assignee: Jason Dillaman <jdillama>
Status: CLOSED WONTFIX
QA Contact: Tejas <tchandra>
Severity: medium
Docs Contact: Erin Donnelly <edonnell>
Priority: unspecified
Version: 3.0
CC: ceph-eng-bugs, ceph-qe-bugs, edonnell, jdillama, tchandra
Target Milestone: rc
Target Release: 3.*
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Known Issue
Doc Text:
.Incorrect number of `tcmu-runner` daemons reported after iSCSI target LUNs fail and recover
After iSCSI target Logical Unit Numbers (LUNs) recover from a failure, the `ceph -s` command in certain cases outputs an incorrect number of `tcmu-runner` daemons.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-02-26 16:14:54 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1494421

Description Tejas 2017-10-18 04:42:19 UTC
Description of problem:


Version-Release number of selected component (if applicable):
ceph version 12.2.1-14.el7cp
libtcmu-1.3.0-0.4.el7cp.x86_64

The 'ceph -s' command output reports the number of active tcmu-runner daemons. I am disabling the network interface on the GW nodes and, after a while, bringing it back up.
Commands used:
ifdown <eth>
ifup <eth>

Total LUNs: 122
Expected tcmu-runner daemons: 488
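
For context, the expected figure appears to assume one reported tcmu-runner instance per LUN per gateway (122 LUNs x 4 GWs = 488). Below is a minimal reproduction sketch of the steps above; the interface name eth0 and the wait times are placeholders, not taken from this report:

  # on each iSCSI gateway node
  ifdown eth0          # take the GW network interface down
  sleep 600            # leave it down for a while
  ifup eth0            # bring it back up
  # then, from an admin node, compare the reported count against the expected 488
  ceph -s | grep tcmu-runner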

After 1 GW network down:
 ceph -s
  cluster:
    id:     2057393b-ce5e-4821-9eb0-96519e801921
    health: HEALTH_OK
 
  services:
    mon:         3 daemons, quorum havoc,mustang,skytrain
    mgr:         mustang(active)
    osd:         20 osds: 20 up, 20 in
    rgw:         1 daemon active
    tcmu-runner: 257 daemons active   <----------------
 
  data:
    pools:   13 pools, 842 pgs
    objects: 1140k objects, 3320 GB
    usage:   9960 GB used, 12284 GB / 22245 GB avail
    pgs:     842 active+clean




After all 4 GWs have gone down and come back up:
~]# ceph -s
  cluster:
    id:     2057393b-ce5e-4821-9eb0-96519e801921
    health: HEALTH_OK
 
  services:
    mon:         3 daemons, quorum havoc,mustang,skytrain
    mgr:         mustang(active)
    osd:         20 osds: 20 up, 20 in
    rgw:         1 daemon active
    tcmu-runner: 31 daemons active    <---------------
 
  data:
    pools:   13 pools, 842 pgs
    objects: 1140k objects, 3320 GB
    usage:   9961 GB used, 12284 GB / 22245 GB avail
    pgs:     842 active+clean
 
  io:
    client:   10743 B/s rd, 111 MB/s wr, 10 op/s rd, 511 op/s wr

Comment 3 Jason Dillaman 2017-10-18 13:16:44 UTC
@Tejas: the service daemons have a 60-second grace period (by default). Did you check the daemon state after 60 seconds had passed?
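
For anyone re-checking this, a minimal follow-up along the lines of the comment above; the 90-second wait is an arbitrary value chosen to be safely past the default 60-second grace period:

  sleep 90
  ceph -s | grep tcmu-runner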