
Version 7 (modified by jesseeichar, 13 years ago)

--

Health and Monitoring of Server

Date 2012/03/26
Contact(s) Jesse Eichar
Last edited
Status draft
Assigned to release 2.8
Resources R&D Camptocamp

Overview

Provide a system for monitoring the health of a GeoNetwork instance, as well as metrics for some important functions.

Proposal Type

  • Type: New Module
  • App: GeoNetwork
  • Module:

  • Email discussions:
  • IRC discussions:

Voting History

  • None as yet

Motivations

At the moment one must make several calls to a GeoNetwork instance to check that the important functions are running, and even that cannot detect spurious or hard-to-detect instabilities. It would be useful to have a consistent way to both register and view important characteristics such as the database connection, errors encountered, a corrupt index, failed logins, etc.

Proposal

The Metrics library (http://metrics.codahale.com/) by Yammer has excellent support for monitoring the performance and health of a system. It provides a consistent API for developers to register vital statistics of an application. For example, in GeoNetwork we might want a monitoring system (such as Nagios or collectd) to check the health of the system, which would include checking the database connection, the ability to open files, the index, etc. In addition, we might want to attach a Metrics appender to the logging to track the number of errors being logged, so that the monitoring system can warn of a potentially unstable system based on that number.
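To make the error-tracking idea concrete, here is a minimal sketch that counts logged errors. It deliberately uses only java.util.logging from the JDK as a stand-in; it is not the Metrics appender itself, and the class names and the logger name are hypothetical.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

/** Hypothetical handler that counts log records at SEVERE level or above. */
class ErrorCountingHandler extends Handler {
    private final AtomicLong errors = new AtomicLong();

    @Override
    public void publish(LogRecord record) {
        if (record.getLevel().intValue() >= Level.SEVERE.intValue()) {
            errors.incrementAndGet();
        }
    }

    @Override public void flush() { /* nothing is buffered */ }
    @Override public void close() { /* no resources to release */ }

    /** Number of errors seen so far; a monitoring system could poll this value. */
    long errorCount() { return errors.get(); }
}

class ErrorCountingDemo {
    /** Logs two errors and one info message, then returns the number of errors counted. */
    static long run() {
        Logger logger = Logger.getLogger("geonetwork.demo"); // hypothetical logger name
        logger.setUseParentHandlers(false);                  // keep the demo off the console
        ErrorCountingHandler handler = new ErrorCountingHandler();
        logger.addHandler(handler);
        try {
            logger.severe("database unreachable");
            logger.info("harvest finished");
            logger.severe("index corrupt");
            return handler.errorCount();
        } finally {
            logger.removeHandler(handler);
        }
    }
}
```

A monitoring system would read the counter periodically and raise a warning when it grows faster than some chosen threshold.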

Metrics has 2 APIs: one for configuring the health checks and another for performing the configured health checks. The 'out' APIs include JMX and JSON. For this proposal, new servlet mappings will be defined for accessing the monitor information:

  • /metrics?pretty=(true|false) - returns a JSON response with all of the registered metrics
  • /threads - returns a text representation of the stack dump at the moment of the call
  • /healthcheck - returns 200 if all checks pass, or 500 Internal Server Error if one fails (with a human-readable description of the failures)
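The /healthcheck behaviour described above can be sketched in plain Java as follows. This is only an illustration of the 200/500 rule, not the Metrics servlet or its API; HealthCheckRunner, HealthReport and the convention that a check returns null when healthy are all assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Outcome of running every registered check: an HTTP status plus a readable body. */
class HealthReport {
    final int status;
    final String body;

    HealthReport(int status, String body) {
        this.status = status;
        this.body = body;
    }
}

/** Hypothetical registry; a check returns null when healthy, or a failure message. */
class HealthCheckRunner {
    private final Map<String, Supplier<String>> checks = new LinkedHashMap<>();

    void register(String name, Supplier<String> check) {
        checks.put(name, check);
    }

    /** Runs all checks in registration order; the status is 500 if any check fails. */
    HealthReport runAll() {
        StringBuilder body = new StringBuilder();
        boolean allPassed = true;
        for (Map.Entry<String, Supplier<String>> entry : checks.entrySet()) {
            String failure;
            try {
                failure = entry.getValue().get();
            } catch (RuntimeException ex) {
                failure = ex.toString(); // a check that throws counts as failed
            }
            if (failure == null) {
                body.append("OK   ").append(entry.getKey()).append('\n');
            } else {
                allPassed = false;
                body.append("FAIL ").append(entry.getKey())
                    .append(": ").append(failure).append('\n');
            }
        }
        return new HealthReport(allPassed ? 200 : 500, body.toString());
    }
}
```

A servlet mapped to /healthcheck would call runAll(), set the response status from the report and write the body as plain text, so that a tool such as Nagios only needs to look at the status code.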

A link to these servlets will be added to the Admin/config.info page so that an administrator can easily access this data. In a future implementation we could possibly add a more attractive UI for viewing the information.

It is important to realize that metrics are not exactly the same as statistics in my use case. While Metrics could be used in some capacity for statistics, in this proposal it will be used as a standard API and set of utilities for creating a monitoring subsystem that is flexible, extensible and able to interoperate with many existing monitoring systems.

Some monitors I propose to make are:

  • Database Gauge - checks that the database is accessible
  • Index Gauge - checks that the Lucene index is searchable
  • Error Meter - monitors how frequently errors are logged
  • Request Meter - monitors the number of requests that are made, to potentially detect DoS attacks
  • Pending Request Counter - tracks the number of requests currently being processed
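As a rough illustration of the two request monitors in the list above, the following sketch keeps a pending-request counter and a running request total using only JDK classes. It is not the Metrics Counter or Meter API; RequestTracker and its methods are hypothetical names, and a real meter would also maintain moving averages rather than a single mean rate.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

/** Hypothetical tracker: a counter of in-flight requests and a running total. */
class RequestTracker {
    private final AtomicInteger pending = new AtomicInteger(); // Pending Request Counter
    private final LongAdder total = new LongAdder();           // feeds the Request Meter

    /** Wraps the handling of one request so the pending count stays balanced on failure. */
    <T> T track(Supplier<T> handler) {
        pending.incrementAndGet();
        total.increment();
        try {
            return handler.get();
        } finally {
            pending.decrementAndGet();
        }
    }

    int pendingRequests() { return pending.get(); }

    long totalRequests() { return total.sum(); }

    /** Mean request rate over a window that the caller measures. */
    static double ratePerSecond(long requests, long elapsedMillis) {
        return elapsedMillis <= 0 ? 0.0 : requests * 1000.0 / elapsedMillis;
    }
}
```

A sudden jump in the request rate, or a pending count that keeps growing, is the kind of signal a monitoring system could alert on.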

Backwards Compatibility Issues

A new dependency and new ma

Risks

Nothing notable

Participants

  • As above