wiki:HealthMonitoring

Version 1 (modified by jesseeichar, 13 years ago)

--

The Metrics library (http://metrics.codahale.com/) by Yammer has excellent support for monitoring the performance and health of a system. It provides a consistent API through which developers register an application's vital statistics. In GeoNetwork, for example, we might want a monitoring system (such as Nagios or StatsD) to check the health of the system, including the database connection, the ability to open files, the state of the index, and so on. In addition, we might want to attach a Metrics appender to the logging system to track the number of errors being logged, so that the monitoring system can warn of a potentially unstable system based on that number.

Metrics has two APIs: one for configuring the health checks and another for performing the configured health checks. The 'out' APIs include JMX and JSON, as well as HTML pages that could be integrated into the admin user interface.
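The split between configuring checks and performing them could be sketched roughly as follows. This is plain Java and deliberately does not use the Metrics API itself; the names HealthCheckSketch, Result, register and runAll are hypothetical and chosen only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch, NOT the Metrics API: a registry that separates
// configuring health checks (register) from performing them (runAll).
public class HealthCheckSketch {

    /** Outcome of a single check (hypothetical type). */
    public static final class Result {
        public final boolean healthy;
        public final String message;

        private Result(boolean healthy, String message) {
            this.healthy = healthy;
            this.message = message;
        }

        public static Result ok() {
            return new Result(true, null);
        }

        public static Result fail(String message) {
            return new Result(false, message);
        }
    }

    // Insertion-ordered so results are reported in registration order.
    private final Map<String, Supplier<Result>> checks = new LinkedHashMap<>();

    /** Configuration side: add a named check. */
    public void register(String name, Supplier<Result> check) {
        checks.put(name, check);
    }

    /** Execution side: run every check; an exception counts as unhealthy. */
    public Map<String, Result> runAll() {
        Map<String, Result> results = new LinkedHashMap<>();
        for (Map.Entry<String, Supplier<Result>> entry : checks.entrySet()) {
            Result result;
            try {
                result = entry.getValue().get();
            } catch (RuntimeException ex) {
                result = Result.fail(ex.getMessage());
            }
            results.put(entry.getKey(), result);
        }
        return results;
    }

    public static void main(String[] args) {
        HealthCheckSketch registry = new HealthCheckSketch();
        // Placeholder checks; real ones would query the database or index.
        registry.register("database", Result::ok);
        registry.register("index", () -> {
            throw new IllegalStateException("index not readable");
        });
        registry.runAll().forEach((name, r) ->
            System.out.println(name + ": " + (r.healthy ? "OK" : "FAIL " + r.message)));
    }
}
```

An output adapter (JMX, JSON or an HTML page) would then only need to call runAll and render the returned map.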

It is important to realize that metrics are not exactly the same as statistics in my use case. While Metrics could be used in some capacity for statistics, in this proposal it will be used as a standard API and set of utilities for creating a monitoring subsystem that is flexible and extensible and can interoperate with many existing monitoring systems.

Some monitors I propose to make are:

  • Database Gauge - checks that the database is accessible
  • Index Gauge - checks that the Lucene index is searchable
  • Error Meter - monitors the frequency with which errors are logged
  • Request Meter - monitors the number of requests made, potentially to detect DoS attacks
  • Pending Request Counter - tracks the current number of requests being processed
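
To make the difference between a counter and a meter concrete, here is a minimal plain-Java sketch. It does not use the Metrics classes; MonitorSketch, Counter, Meter and their methods are hypothetical names, and the meter only reports a mean rate since creation, whereas the real library also offers moving averages.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Hypothetical sketch, NOT the Metrics API: the two monitor kinds
// proposed above, written with the standard library only.
public class MonitorSketch {

    /** Counter: a value that goes up and down, e.g. requests in flight. */
    public static final class Counter {
        private final AtomicLong value = new AtomicLong();

        public void inc() { value.incrementAndGet(); }
        public void dec() { value.decrementAndGet(); }
        public long count() { return value.get(); }
    }

    /** Meter: counts events and reports their mean rate since creation. */
    public static final class Meter {
        private final AtomicLong events = new AtomicLong();
        private final LongSupplier nanoClock; // injectable so it can be tested
        private final long startNanos;

        public Meter(LongSupplier nanoClock) {
            this.nanoClock = nanoClock;
            this.startNanos = nanoClock.getAsLong();
        }

        public void mark() { events.incrementAndGet(); }
        public long count() { return events.get(); }

        public double meanRatePerSecond() {
            long elapsedNanos = nanoClock.getAsLong() - startNanos;
            if (elapsedNanos <= 0) {
                return 0.0; // no time has passed yet, so no rate is defined
            }
            return events.get() / (elapsedNanos / 1e9);
        }
    }

    public static void main(String[] args) {
        Counter pendingRequests = new Counter();
        Meter errors = new Meter(System::nanoTime);

        pendingRequests.inc(); // a request starts
        errors.mark();         // an error is logged while handling it
        pendingRequests.dec(); // the request ends

        System.out.println("pending=" + pendingRequests.count()
            + " errors=" + errors.count());
    }
}
```

In this sketch the Pending Request Counter would be incremented and decremented around each request, while the Error Meter and Request Meter would each call mark once per event; a monitoring system could then alert when a rate crosses a threshold.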