Description of problem:
error: wildcard resolved to multiple address
2017-02-08 06:37:49,547 - ERROR - calamari Uncaught exception
Traceback (most recent call last):
File "/opt/calamari/venv/lib/python2.7/site-packages/gevent/greenlet.py", line 534, in run
result = self._run(*self.args, **self.kwargs)
File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/cluster_monitor.py", line 251, in _run
File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_common-0.1-py2.7.egg/calamari_common/remote/mon_remote.py", line 987, in listen
File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/cluster_monitor.py", line 238, in on_job_complete
File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/cluster_monitor.py", line 344, in on_sync_object
new_object = self.inject_sync_object(minion_id, data['type'], data['version'], sync_object)
File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/cluster_monitor.py", line 324, in inject_sync_object
File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/server_monitor.py", line 259, in on_osd_map
hostname_to_osds = self.get_hostname_to_osds(osd_map)
File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/server_monitor.py", line 207, in get_hostname_to_osds
name_info = get_name_info('', osd['cluster_addr'])
File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/server_monitor.py", line 186, in get_name_info
hostname = socket.gethostbyaddr(osd_addr)
File "/opt/calamari/venv/lib/python2.7/site-packages/gevent/_socketcommon.py", line 280, in gethostbyaddr
File "/opt/calamari/venv/lib/python2.7/site-packages/gevent/resolver_thread.py", line 67, in gethostbyaddr
return self.pool.apply(_socket.gethostbyaddr, args, kwargs)
File "/opt/calamari/venv/lib/python2.7/site-packages/gevent/pool.py", line 300, in apply
return self.spawn(func, *args, **kwds).get()
File "/opt/calamari/venv/lib/python2.7/site-packages/gevent/event.py", line 373, in get
File "/opt/calamari/venv/lib/python2.7/site-packages/gevent/event.py", line 363, in get
File "/opt/calamari/venv/lib/python2.7/site-packages/gevent/event.py", line 343, in _raise_exception
File "/opt/calamari/venv/lib/python2.7/site-packages/gevent/threadpool.py", line 207, in _worker
value = func(*args, **kwargs)
error: wildcard resolved to multiple address
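The traceback above boils down to Python's socket layer being handed an empty host string. A minimal sketch (my assumption of what calamari ends up doing when the OSD address is malformed; `reverse_lookup` is a hypothetical helper, not calamari code): an empty argument is treated as a wildcard by the socket module, which fails with an error such as "wildcard resolved to multiple address" on dual-stack hosts, or a plain reverse-resolution failure otherwise.

```python
import socket

def reverse_lookup(addr):
    """Reverse-resolve addr, returning None on any resolution failure."""
    try:
        return socket.gethostbyaddr(addr)[0]
    except OSError:
        # covers socket.herror and socket.gaierror (both OSError
        # subclasses); an empty addr always ends up here
        return None

# reverse_lookup("") fails inside gethostbyaddr and returns None
```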
Version-Release number of selected component (if applicable):
Steps to Reproduce:
Are we able to get the osd nodes at the server endpoint? This functionality is required for storage console to do cluster imports and did work in many cases in 2.1.
Would you please help me understand the state of this? Is the NotImplementedError reproducible? How?
@Gregory: We are able to get the osd nodes but not on the first service start (a calamari restart is sufficient to make it all work). All the subsequent starts work fine.
It is only ever reproducible on the first calamari service start, and only on some systems. That is what happened in this case once the hostname regression was fixed (this bugzilla currently has nothing to do with the hostname resolving code).
I was looking into this further. My notes:
In the Manager class, we initiate a RequestCollection and a Ticker. The RequestCollection tick goes through all the (salt jid, minion id) pairs and tries to emit saltutil.running via MonRemote.get_running to ping the jobs. This fails because MonRemote.get_running is not implemented.
There is apparently more than one code path doing this, and the other one succeeds, so it works from the second start on.
This would probably work if we just commented out the lines that call the unimplemented function, but that looks like a hack to me. Alternatively, we could try to implement the signal. I think this should implement it (the patch was not tested yet due to the nature of the reproducer):
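For illustration, the defensive alternative could look roughly like the sketch below. The class and method names mirror the description above, but this is not the actual calamari code or the proposed patch; it only shows the shape of guarding the tick loop so one unimplemented job-ping does not kill the whole greenlet.

```python
class MonRemote(object):
    """Stand-in for the real remote; get_running is the unimplemented call."""
    def get_running(self, minion_ids):
        raise NotImplementedError

class RequestCollection(object):
    def __init__(self, remote):
        self._remote = remote
        self.tick_errors = 0

    def tick(self, jid_minions):
        for jid, minion_id in jid_minions:
            try:
                self._remote.get_running([minion_id])
            except NotImplementedError:
                # swallow instead of crashing; the second discovery
                # path described above fills in the data later
                self.tick_errors += 1

rc = RequestCollection(MonRemote())
rc.tick([("jid-1", "mon0"), ("jid-2", "osd1")])
```

This is essentially the "comment out the call" hack in structured form; implementing the signal properly, as proposed above, is the cleaner option.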
Any better ideas/comments?
@Harish: Could you re-run the test with the old stable versions of calamari to confirm/verify whether it is a regression or not?
@John: Unless this is really a regression (which I don't believe it is), I would not consider this a blocker. That being said, I have already proposed a fix (although it was not tested), so the ETA on this bugzilla should hopefully not be that far away.
Building the upstream PR for some testing now.
@Boris, as mentioned to you over IRC, this issue is not seen on Magna machines for the same calamari version; it is seen only on Beaker machines.
Ken, would you please make a build for https://github.com/ceph/calamari/releases/tag/v1.5.3
v1.5.3 is in the latest RHEL and Ubuntu composes to QE.
The issue is not resolved in calamari-server-1.5.3-1.el7cp.x86_64.
Followed these steps:
1) Using ceph-ansible, brought up a single-mon, 3-osd cluster using machines from the Beaker lab
2) Added a user 'tester' on the mon
3) As 'tester', issued "/api/v2/cluster/<fs-id>/server". It listed as follows:
4) The above output is incorrect because:
a) it lists osd details under the MON node.
b) it still does not list all osd nodes.
c) the hostname is an incomplete form of the FQDN instead of the short name.
d) the frontend and backend addresses are incorrect.
I will be updating the setup details in the next update.
Note: As mentioned earlier, when this issue is hit, restart of calamari service fixes the problem. Could we suggest it as the workaround for 2.2?
I examined the server and it all looks like you are running the calamari daemon before the network setup is working properly (that is why a restart helps):
- socket.getfqdn() returns null for the address, as was visible from the osd listing
- gethostbyaddr fails to get the hostname and throws an exception
- the only resolvable address is the localhost one...
Calamari can't work properly if the underlying network is not ready yet.
All in all, this looks like a test (environment) problem, not a calamari issue. How/when do you run 'calamari-ctl initialize'? Is the networking working properly at that moment? Are the addresses resolvable at that moment?
I think a more suitable behavior would be for calamari to tolerate this condition. Please prepare a patch where we just don't add nodes that are unresolvable; IIRC this code is triggered by mon and osd maps, and we'll be getting more opportunities to add correct data.
I don't think restarting the service to avoid this is a workaround that we can live with.
Thanks to how python works, if we are not able to resolve an address once, we will not be able to resolve it until python is restarted (the "bad" resolving code will already be in memory after the first attempt to resolve and won't get removed until the restart). To test that out, you can e.g. start python with all dns servers masked out, try to resolve -> it fails, unmask the dns servers -> it still fails, restart python -> it resolves fine.
We kind of did that not-adding before, when we failed on the unimplemented function. At least that is what Harish was originally complaining about. Currently, the data gets populated with the things that can be known, and the rest of the data gets fixed on the next proper run, when resolving finally works.
Also, we are already trying to skip empty name_info with the condition
name_info != ('', '')
we could maybe change that to
name_info and len(name_info) == 2 and name_info[0] and name_info[1]
to avoid other forms of empty name_info (although I don't see a path that would lead to those). Anyway, it looks like the new addresses are now populated by the code that was not previously implemented -- by the emitted signal. Maybe we should revert that or even mask that code altogether -- it never worked (it was never implemented), so if we didn't need it by now (all it caused was trouble) we might never miss it.
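The stricter emptiness check discussed above could be factored out roughly as follows (a sketch; `is_usable` is a hypothetical helper name, and name_info is assumed to be a (hostname, fqdn-or-address) tuple as in the calamari code):

```python
def is_usable(name_info):
    """Reject None, wrong-sized tuples, and tuples with any empty field."""
    return bool(
        name_info
        and len(name_info) == 2
        and name_info[0]
        and name_info[1]
    )
```

Unlike the current `name_info != ('', '')` test, this also rejects partially empty tuples such as `('host', '')`.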
I was under the impression that python was just using the OS name resolver and will eventually come up with the correct name once the TTL expires, but I'm willing to accept that is still not something we can work with.
Here is what I propose: if we fail to resolve a name, then both the fqdn and hostname fields should contain the IP that we take as input. That should allow the programmatic users of the API to do the right thing, and the humans will just have to deal with it until we can resolve names.
What do you think about that?
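The proposal above could be sketched like this (an illustration, not the calamari implementation; `get_name_info` here takes an injectable resolver so the fallback path can be exercised without DNS, and `fake_resolver` is a made-up stand-in):

```python
import socket

def get_name_info(addr, resolve=socket.gethostbyaddr):
    """Return (hostname, fqdn); on resolution failure, fall back to the
    raw IP in both fields, as proposed above."""
    try:
        fqdn = resolve(addr)[0]
    except OSError:  # herror/gaierror are OSError subclasses
        return (addr, addr)
    # for a real FQDN, the short hostname is the part before the first dot
    return (fqdn.split('.')[0], fqdn)

def fake_resolver(addr):
    """Stub resolver: knows one host, fails for everything else."""
    if addr == "10.0.0.1":
        return ("osd1.example.com", [], [addr])
    raise socket.herror("unknown host")
```

Programmatic consumers can then detect the fallback case simply by checking whether the hostname field parses as an IP.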
I was thinking we are actually doing that -- getfqdn(IP) returns the IP if it cannot find a record, and the hostname would then be close to a subnet because we strip everything from the last dot onwards. The view, however, showed a null fqdn and no hostname (maybe it was None?). Therefore I suspect that either the records are modified in on_osd_map (the code actually looks like it modifies them) or the table got populated by a different mechanism (this seems just as likely).
It would really help if I could get a better reproducer, because now, until the actual package gets built, I can usually only guess at what is going on. For starters, it would be nice to test with a build with debugging turned on (do we maybe have support for that in ceph-ansible?).
This is not a new behaviour and it fails in the cases where the network was broken/lacking. I also suspect that this might end up requiring a larger code re-design to make it work (we would probably have to touch the on_*_map functions). As such, I don't think there is much sense postponing the current release because of this and I believe we should move this bugzilla to the next release.
I was also looking at other ways to do the resolving (there are two python dns libraries), but neither of them seemed to be of much help here.
Seeing the same issue on a newly added mon via ceph-ansible. Restarting the calamari service fixes the issue.
Isn't ceph-ansible restarting the calamari service after it installs the calamari-server packages to avoid this issue?
Re-targetting for 2.3.
@Harish: It is likely that even if ceph-ansible did restart the service, it would not help, as this seems to be caused by missing dns records at the time calamari is run (if you restart the service at roughly the same time you tried to start it, they are still likely not to be sorted out). It is the manual restart, once they are sorted out, that helps.
Can you elaborate on the previous comment? Did you add a mon and only then deployed calamari or was calamari running fine (after a previous restart), you deployed the mon node and that resulted in broken output? If the latter, did calamari get reinstalled in the meantime? If so in what way? (I tried reproducing by reinstalling just the calamari packages, dumping its db, etc but with no luck)
(In reply to Boris Ranto from comment #31)
> Can you elaborate on the previous comment? Did you add a mon and only then
> deployed calamari or was calamari running fine (after a previous restart),
> you deployed the mon node and that resulted in broken output? If the latter,
> did calamari get reinstalled in the meantime? If so in what way? (I tried
> reproducing by reinstalling just the calamari packages, dumping its db, etc
> but with no luck)
@Boris, here are the steps I followed:
1) Installed the ceph cluster [3 mons, 3 osds] using ceph-ansible (with "calamari" set to "true" in the mons.yml file).
2) /api/v2/cluster/<fsid>/server did not work on any of the mon nodes.
3) Restarted the calamari service on all the mon nodes. server API worked from all mon nodes.
4) Added another MON node using ceph-ansible with "calamari" set as "true" in the mons.yml file
5) /api/v2/cluster/<fsid>/server did not work on the newly added MON. It was working on all the other mon nodes.
6) Restarted the calamari service on the newly added Mon. server API worked.
> @Harish: It is likely that even if ceph-ansible did restart the service it
> would not help as this seems to be caused by missing dns records at the time
> the calamari is run (if you restart the service at the ~same time you tried
> to start it they are still likely to not be sorted out). It is the manual
> restart once they are sorted out that helps.
How about suggesting that users restart calamari after the installation?
We may want to add a Note saying something like:
If you are having trouble discovering all your nodes in calamari, please consider restarting the calamari service, as the cached name resolution state might have been stale when calamari tried to discover the new nodes.
I was finally able to reproduce and I believe I have a fix (well, at least it helped on my test cluster):
It turned out that the null fqdn was not caused by us being unable to resolve through getfqdn/gethostbyaddr, but by (some) osd services not being registered properly, so the Eventer was not able to resolve them -- the _get_fqdn function checks whether the service was registered and returns None (null) if it was not.
This was also related to the way ceph-ansible deploys ceph and calamari. It first installs the mon nodes, then installs and deploys calamari, and only then adds the osd nodes. Therefore, calamari will use a different discovery method that is racy, because some osd services do not get registered.
Given all of this, it would be nice if this was still able to make it into the 2.2 release (otherwise, I would probably nominate this for an early z-stream).
The patchset also contains a proper fix for the wildcard traceback. Previously, if we received a (hostname, osd_addr) tuple like ('', ':/0'), we did not strip the CIDR notation before checking whether the osd_addr is actually non-null. Therefore we ran socket.gethostbyaddr(''), which will always fail with the wildcard traceback.
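That part of the fix could be sketched roughly as follows (an illustration of the idea, not the actual patch; the helper names are made up). A Ceph-style address such as '10.0.0.1:6800/0' is normalized first, and the reverse lookup is skipped entirely when nothing is left:

```python
import socket

def normalize_osd_addr(osd_addr):
    """Strip the '/nonce' suffix and the ':port' part from a Ceph address."""
    addr = osd_addr.split('/')[0]      # '10.0.0.1:6800/0' -> '10.0.0.1:6800'
    addr = addr.rsplit(':', 1)[0]      # '10.0.0.1:6800'   -> '10.0.0.1'
    return addr

def safe_reverse_lookup(osd_addr):
    addr = normalize_osd_addr(osd_addr)
    if not addr:
        # ':/0' normalizes to '' -- calling gethostbyaddr('') would
        # raise the "wildcard resolved to multiple address" error
        return None
    try:
        return socket.gethostbyaddr(addr)[0]
    except OSError:
        return None
```

The key point is that the emptiness check now runs on the stripped address, not the raw map value.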
I'm mostly ok with the patch, though I'd like to talk a little more about the "calamari will use a different discovery method that is racy because some osd services do not get registered" part.
Let's talk tomorrow morning to decide which release we should pursue with this fix.
Without more information here, re-targeting to 2.3.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.