Restarting Stopped Processes or Services

From Zenoss Wiki

Unlike some other monitoring applications, Zenoss Core has no built-in method to restart processes or IP services that it detects as down. In version 4.x, however, a set of ssh keys combined with a trigger and a notification makes this fairly straightforward to implement on Linux servers. With a user who has remote access, the same can be done on Windows servers.

Linux

The first thing you will need to do is generate a new ssh key to use for remote system access. You can create this by running ssh-keygen as the zenoss user as follows:

$ ssh-keygen -t dsa -f ~/.ssh/id_dsa_restart

You will then need to create an authorized_keys file for the root user on each of your servers, or add the public key to the existing file if there is one. To secure the access, configure the key so that it is only accepted from your zenoss server and only permits the zenoss user to run /sbin/service to restart services. The entry you add to /root/.ssh/authorized_keys should look like the following:

 #
 # Zenoss key, limited to /sbin/service for restarting services
 #
 from="zenoss.example.com",command="/sbin/service $SSH_ORIGINAL_COMMAND",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty <contents of ~zenoss/.ssh/id_dsa_restart.pub>

(Replace zenoss.example.com with your zenoss server; this can be a comma-separated list of hosts.)
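Because the entry is long and easy to mistype, it can help to generate it. The following is a minimal sketch, not part of Zenoss: the function name make_restart_entry and the /tmp path in the usage comment are assumptions made for illustration. It prints the same entry as shown above for a given host pattern and public key.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Zenoss): print an authorized_keys
# entry that limits the key to /sbin/service, as described above.
make_restart_entry() {
    zenoss_host=$1   # host(s) allowed to connect, e.g. zenoss.example.com
    pubkey=$2        # contents of ~zenoss/.ssh/id_dsa_restart.pub
    # Single quotes keep $SSH_ORIGINAL_COMMAND literal in the output.
    printf 'from="%s",command="/sbin/service $SSH_ORIGINAL_COMMAND",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty %s\n' \
        "$zenoss_host" "$pubkey"
}

# Example, run as root on the monitored server after copying the public
# key over (the /tmp path is an assumption):
# make_restart_entry zenoss.example.com "$(cat /tmp/id_dsa_restart.pub)" >> /root/.ssh/authorized_keys
```

Review the resulting line in /root/.ssh/authorized_keys before relying on it, since a malformed options list can cause the restrictions to be ignored or the key to be rejected.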

Once the key is in place, you will need to create a trigger and a notification. The trigger should match the service you wish to have restarted, for example:

 Name: restart_splunkd
 Enabled: checked
 Rule: all of the following rules
   Event Class contains /Status/OSProcess
   Count equals 1
   Severity is greater than or equal to Warning
   Component (Sub-Element) contains splunkd

Then create a notification to act on that trigger. The notification will need to have a Command action, and the command should look like:

/usr/bin/ssh -oStrictHostKeyChecking=no -i ~zenoss/.ssh/id_dsa_restart root@${evt/device} splunk restart

Since the ssh key entry above already contains the "/sbin/service" part of the command, you should only pass the arguments to that command in the notification. The "splunk restart" above will cause "/sbin/service splunk restart" to be executed as root on the remote system.
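The way the two halves of the command combine can be illustrated locally. This is only a sketch of the expansion, assuming the authorized_keys entry shown earlier; the function name expand_forced_command is hypothetical, and nothing here contacts a remote host or runs /sbin/service.

```shell
#!/bin/sh
# Local illustration only: mimics how the forced command from
# authorized_keys is combined with the command sent by the ssh client.
expand_forced_command() {
    # sshd places the client's requested command in this variable.
    SSH_ORIGINAL_COMMAND=$1
    # Left unquoted on purpose so the value is split into separate
    # arguments, as happens when the forced command is expanded.
    set -- /sbin/service $SSH_ORIGINAL_COMMAND
    printf '%s\n' "$*"
}

expand_forced_command "splunk restart"
# prints: /sbin/service splunk restart
```

A harmless way to check the real setup from the zenoss server is to send a status request instead of a restart, for example by replacing "splunk restart" with "splunk status" in the ssh command above, assuming the service's init script supports a status action.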

If you wish to be notified when the service restart fails, create a second trigger like the one above and define it with a count of 2. That way it will only fire on the second failure, which will only occur if the restart fails.
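
As an illustration, a second trigger for the splunkd example might look like the following. The name restart_splunkd_failed is hypothetical; the rules simply repeat the first trigger with the count changed to 2:

 Name: restart_splunkd_failed
 Enabled: checked
 Rule: all of the following rules
   Event Class contains /Status/OSProcess
   Count equals 2
   Severity is greater than or equal to Warning
   Component (Sub-Element) contains splunkd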

Windows

The process is almost identical for Windows; the main difference is the command that you run. This uses the "net" command from the samba-common package (on RHEL).

For a generic service trigger you might use:

 Name: restart_win_service
 Enabled: checked
 Rule: all of the following rules
   Device Class contains /Server/Windows/WMI
   Event Class contains /Status/WinService
   Count equals 1
   Severity is greater than or equal to Error

Your notification should look something like:

/usr/bin/net rpc service start '${evt/component}' -I ${dev/id} -U '${dev/zWinUser}%${dev/zWinPassword}'

The above will start whatever service component has failed, as long as the user/password combination in zWinUser/zWinPassword has the rights to remotely start services on your server.
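
To see what the notification command looks like once Zenoss has filled in the ${...} expressions, a small dry-run helper can be useful. This is a sketch only: the function name win_service_cmd and all sample values (service name, address, credentials) are assumptions, and the helper just prints the command line rather than running it.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints the "net rpc service" command line
# that the notification would run, with sample values substituted for
# ${evt/component}, ${dev/id}, ${dev/zWinUser} and ${dev/zWinPassword}.
win_service_cmd() {
    action=$1     # e.g. start
    service=$2    # stands in for ${evt/component}
    host=$3       # stands in for ${dev/id}
    user=$4       # stands in for ${dev/zWinUser}
    password=$5   # stands in for ${dev/zWinPassword}
    # %% produces the literal % that separates user and password.
    printf "/usr/bin/net rpc service %s '%s' -I %s -U '%s%%%s'\n" \
        "$action" "$service" "$host" "$user" "$password"
}

win_service_cmd start Spooler 192.0.2.10 zenoss secret
# prints: /usr/bin/net rpc service start 'Spooler' -I 192.0.2.10 -U 'zenoss%secret'
```

Running the printed command by hand from the zenoss server, with real values, is a quick way to confirm that the account has the required rights before wiring it into a notification.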