You have a "problem" device. This may manifest in different ways; here are just a few:
- Cannot see device main page in GUI - KeyError messages.
- Cannot re-add the device - it says it is already there
- POSKeyError messages in yellow flashes at the top of the GUI
- Toolbox tools throw errors when run
- Attempts to install a ZenPack produce errors
- Error messages that often include primaryAq towards the end
You cannot select the device in the GUI and delete it - it errors in some way. The device is "half there".
You should always start by running the toolbox tools. These come as part of the standard build in later versions of Zenoss 5.x; for earlier versions of 5 and for previous releases, follow the link https://support.zenoss.com/hc/en-us/articles/203117595-How-To-Install-And-Use-the-zenoss-toolbox to get and install the tools. This suite has grown and improved over the years, so you may not have all the utilities if you have an older version installed. Instructions for running the toolbox tools are now (October 2017) in appendix A of the Upgrade Guide.
There are four main tools, each of which can be run in either check mode or fix mode. Always run in check mode first; this may take a long time but should not actually DO anything to the databases. The tools should be run one after another, in a set order.
The "-f" flag for each tool requests the problems to be fixed. The commands can also take a -v10 flag for verbose logging. If you are working with Zenoss 5.x, these should be run from within the zope container. For many problems, the toolbox tools can now fix the issue.
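As a rough sketch of the check-then-fix sequence, the snippet below only builds the command lines and does not run anything. The tool names and their order are an assumption recalled from the zenoss.toolbox documentation, not something this article states, so verify them against your installed version; a given tool may also not accept every flag, so check its --help first.

```python
# ASSUMPTION: tool names and order are recalled from the zenoss.toolbox
# documentation; they are not taken from this article. Verify before use.
TOOLBOX_TOOLS = ["zodbscan", "findposkeyerror", "zenrelationscan", "zencatalogscan"]


def toolbox_commands(fix=False, verbose=False):
    """Return one argument list per tool, in the order they would be run."""
    commands = []
    for tool in TOOLBOX_TOOLS:
        argv = [tool]
        if fix:
            argv.append("-f")    # fix mode, as described in the text
        if verbose:
            argv.append("-v10")  # verbose logging, as described in the text
        commands.append(argv)
    return commands


# Print the check-mode commands; nothing is executed here.
for argv in toolbox_commands():
    print(" ".join(argv))
```

Printing the commands rather than executing them keeps the sketch harmless; on Zenoss 5.x the real commands would be typed inside the zope container.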
Another trick that sometimes works is to re-add the device. You may well get the "device already exists" message but it is worth trying and shouldn't cause any further harm.
Using the Zope Management Interface (ZMI)
Zenoss is built using the Zope application development environment. The ZMI, strictly, is part of Zope rather than Zenoss, and lets sufficiently authorised users of the GUI explore the Zope database (ZODB). In practice, this means you need the Manager role for your Zenoss GUI user (not just ZenManager). Point your browser at the normal Zenoss GUI, with /zport/dmd/manage at the end of the URL, for example https://zen42.class.example.org/zport/dmd/manage
By navigating down the device class path hierarchy at the left-hand side you can inspect the instances of a particular class. Note that each Zenoss device class has a "devices" (lower-case d) link to follow before you see the actual device instances. Once you have navigated to the device instance, use the Properties tab at the far right to see attributes of this device.
If you have got this far, then the issue is more likely to be with a component of the device so go back to the Contents tab for the device and try to explore some of the relationships; most of the components are found under the os relationship or the hw relationship. Fundamentally you are looking for clues as to where it breaks.
The ZMI is most unlikely to cause further breakage unless you actively select to save any changes. Use it as an investigative tool.
Hacking using zendmd
Occasionally, the internal Zope Database (ZODB) has got messed up such that the toolbox tools can't fix it. Once you are in this scenario, you must make sure that your system is backed up and, preferably, you are in a maintenance window for your organisation.
zendmd is a tool that allows you to access and manipulate the ZODB database (and potentially other databases). It should be run as the zenoss user, in the zope container (for Zenoss 5 folk). It has the power to do bad things as well as good, but it is a good investigative tool and can sometimes work magic as a fix.
Assume that the problem device is zen42.class.example.org and its Zenoss device class is /Devices/Server/Linux. First see if you can get to the device using find:
    >>> d=find('zen42.class.example.org')
    >>> d
    <Device at /zport/dmd/Devices/Server/Linux/devices/zen42.class.example.org>
This shows that the device can be found in ZODB and that it is of object class Device (that's a Python object class, not a Zenoss device class, though they may look the same). Note the similarity in the path to what you saw in the ZMI - Devices - Server - Linux - devices - device instance.
    >>> print d.id, d.title, d.manageIp
    zen42.class.example.org zen42.class.example.org 192.168.10.42
We can access various attributes of this object - id, title and manageIp - so at least some of it is there. You might try the deleteDevice method for the errant device - unlikely to work but worth a try:
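In zendmd the attempt is just d.deleteDevice() on the object returned by find. The sketch below wraps that call so that a failure is reported rather than dumped as a traceback. The helper name try_delete and the BrokenDevice stand-in are hypothetical and exist only so the example runs outside zendmd.

```python
def try_delete(device):
    """Call deleteDevice() and report the outcome instead of raising."""
    try:
        device.deleteDevice()
    except Exception as exc:  # a half-there device can fail in many ways
        return "deleteDevice failed: %s: %s" % (type(exc).__name__, exc)
    return "deleteDevice succeeded; commit() is still needed to persist it"


class BrokenDevice(object):
    """Hypothetical stand-in for a damaged device, for use outside zendmd."""
    def deleteDevice(self):
        raise KeyError("zen42.class.example.org")


print(try_delete(BrokenDevice()))
```

Inside zendmd you would call try_delete(d) with the real device; the exception type in the message is often the most useful clue.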
Any object in the ZODB database has to be connected into this hierarchical data structure so, typically, a component (like FileSystem) has a single (toOne) relationship with its owning device, whereas a device can have a toMany relationship with FileSystem (there can be multiple instances of the FileSystem object, one for each file system). Basically, these are links that connect the objects together. Often, the issue is that one half of one of these links has got broken somehow. There are also two-way links that connect a device into the device class hierarchy and there are several different methods for checking these:
    >>> for i in dmd.Devices.Server.Linux.getSubDevices():
    ...     print i
    ...
    <LogMatchDevice at zenny1.class.example.org>
    <Device at group-100-serv1.class.example.org>
    <Device at zenny.skills-1st.co.uk>
    <UserGroupDevice at pi>
    <UserGroupDevice at taplow-30.skills-1st.co.uk>
    <DirFileDevice at taplow-11.skills-1st.co.uk>
    <RedisDevice at i-93c30ddc>
    <DirFileDevice at zenny2.class.example.org>
If this barfs, then the issue is probably links between the device and its device class hierarchy. Another way of asking the same question that relies on fewer links is:
    >>> for i in dmd.Devices.Server.Linux.devices():
    ...     print i
This may work where the previous request didn't. If so, repeat and get some more info:
    >>> for i in dmd.Devices.Server.Linux.devices():
    ...     print i, i.id, i.title, i.manageIp
Other ways to test the device to device class hierarchy links:
    >>> d=find('zen42.class.example.org')
    >>> d.getPrimaryParent()
    <ToManyContRelationship at /zport/dmd/Devices/Server/Linux/devices>
    >>> d.primaryAq()
    <Device at /zport/dmd/Devices/Server/Linux/devices/zen42.class.example.org>
Failure on either of these indicates broken links in the hierarchy.
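If you want both checks in one pass, a small helper can run each call separately and record which one fails. This is a sketch only: probe_links and the HalfThereDevice stand-in are hypothetical names, and the stand-in simply imitates a device whose primaryAq() raises.

```python
def probe_links(device):
    """Try each hierarchy check on its own and record the outcome."""
    results = {}
    for name in ("getPrimaryParent", "primaryAq"):
        try:
            results[name] = repr(getattr(device, name)())
        except Exception as exc:  # broad on purpose: any failure is a clue
            results[name] = "FAILED: %s" % type(exc).__name__
    return results


class HalfThereDevice(object):
    """Hypothetical stand-in so the sketch runs outside zendmd."""
    def getPrimaryParent(self):
        return "devices"

    def primaryAq(self):
        raise AttributeError("__primary_parent__")


for check, outcome in sorted(probe_links(HalfThereDevice()).items()):
    print("%s -> %s" % (check, outcome))
```

In zendmd, probe_links(d) on the real device shows at a glance which side of the link is broken.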
There is a method to directly delete an object from within the class hierarchy; you need the device id as a parameter and use it as a string:
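The call itself is missing from this text. A likely candidate, and an assumption here, is _delObject, the usual Zope container method that takes an object id as a string; in zendmd that would be called on dmd.Devices.Server.Linux.devices with 'zen42.class.example.org' as the argument. The sketch below shows the shape of the call against a hypothetical FakeRelationship stand-in, with a guard against passing the device object instead of its id.

```python
def delete_by_id(relationship, device_id):
    """Remove the entry named device_id from a devices relationship."""
    if not isinstance(device_id, str):
        raise TypeError("pass the device id as a string, not the device object")
    # ASSUMPTION: _delObject is the method the article refers to.
    relationship._delObject(device_id)


class FakeRelationship(object):
    """Hypothetical stand-in for dmd.Devices.Server.Linux.devices."""
    def __init__(self, ids):
        self._objects = dict((i, object()) for i in ids)

    def _delObject(self, id):
        del self._objects[id]


rel = FakeRelationship(["zen42.class.example.org", "zenny.skills-1st.co.uk"])
delete_by_id(rel, "zen42.class.example.org")
print(sorted(rel._objects))
```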
This often works. If not, Devices.Server.Linux.devices also has a built-in _objects attribute which may have been left with a reference to a deleted device:
    >>> for i in dmd.Devices.Server.Linux.devices._objects:
    ...     print i
    ...
    10.0.0.37
    deodar.skills-1st.co.uk
    elk.ourshack.com
    group-100-serv1.class.example.org
    i-0e1a1878
    i-8e14f9c4
    kea.ourshack.com
    pig.ourshack.com
    zen31.class.example.org
    zen42.class.example.org
    zenny.skills-1st.co.uk
If the offending device is shown here and nothing else has shown it up, temporarily move other devices to another device class so that the offending device is the only remaining device in that class (you almost certainly won't be able to move the offending device as it will error). Then use:
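The command that followed is missing from this text. Since the advice is to leave the offending device as the only entry, one plausible reading is that the whole _objects mapping is then emptied. The sketch below assumes exactly that, and assumes _objects behaves like a mapping with a clear() method; treat both as guesses to check on a backed-up system. The helper clear_if_only and the FakeRelationship stand-in are hypothetical, and the helper refuses to act if anything other than the offending id is still present.

```python
def clear_if_only(relationship, device_id):
    """Empty _objects, but only if device_id is the sole remaining entry."""
    remaining = sorted(relationship._objects)
    if remaining != [device_id]:
        raise ValueError("refusing to clear, still present: %r" % (remaining,))
    # ASSUMPTION: clearing the mapping is the step the article intended.
    relationship._objects.clear()


class FakeRelationship(object):
    """Hypothetical stand-in for dmd.Devices.Server.Linux.devices."""
    def __init__(self, ids):
        self._objects = dict((i, None) for i in ids)


only_bad = FakeRelationship(["zen42.class.example.org"])
clear_if_only(only_bad, "zen42.class.example.org")
print(len(only_bad._objects))
```

The guard is the point of the sketch: if other devices have not yet been moved out, the helper raises instead of removing their references too.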
Don't forget to move any devices back to their correct class eventually.
If any of this works, then you have really only made temporary changes so far in zendmd. To persist it, you need to commit:
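In zendmd the call is simply commit(), typed at the prompt. Because a commit makes the change permanent, a cautious variation is to ask for confirmation first. The sketch below assumes nothing about Zenoss beyond the existence of commit(); commit_if_confirmed is a hypothetical helper, and a lambda that records the call stands in for the real commit so the example runs anywhere.

```python
def commit_if_confirmed(commit, answer):
    """Call commit() only when the operator typed 'yes'."""
    if answer.strip().lower() != "yes":
        return False
    commit()
    return True


calls = []
# Stand-in for zendmd's commit(); it only records that it was called.
print(commit_if_confirmed(lambda: calls.append("commit"), "no"))
print(commit_if_confirmed(lambda: calls.append("commit"), "yes"))
```

If you exit zendmd without committing, the changes made in that session are not persisted.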
Once you have exited zendmd, run all the toolbox commands again - hopefully they will run clean.
Other references you might chase up are:
- https://gist.github.com/cluther/1076520 which removes non-existent devices from performance monitors
- http://zenpackers.readthedocs.io/en/latest/bad_relations.html helps with bad relationships