Health and Monitoring of Server

Date 2012/03/26
Contact(s) Jesse Eichar
Last edited
Status committed
Assigned to release 2.8
Resources R&D Camptocamp
Code https://github.com/jesseeichar/geonetwork/tree/monitoring
Ticket #846

Overview

Provide a system for monitoring the health of a GeoNetwork instance, along with metrics for some important functions. Metrics will be made available via HTTP/JSON and JMX. A common usage would be to configure Nagios or collectd to collect data from the GeoNetwork service and warn administrators when the system is becoming unstable.

Proposal Type

  • Type: New Module
  • App: GeoNetwork
  • Module:

  • Email discussions:
  • IRC discussions:

Voting History

  • None as yet

Motivations

At the moment one must make several calls to a GeoNetwork instance to verify that the important functions are running, and even that cannot detect sporadic or hard-to-detect instabilities. It would be useful to have a consistent way to both register and view important characteristics such as the database connection, errors encountered, a corrupt index, failed logins, etc.

Proposal

The Metrics library (http://metrics.codahale.com/) by Yammer has excellent support for monitoring the performance and health of a system. It provides a consistent API for developers to register vital statistics of an application. For example, in GeoNetwork we might want a monitoring system (like Nagios or collectd) to check the health of the system, which would include checking the database connection, the ability to open files, the state of the index, etc. In addition, we might want to attach a Metrics appender to the logging system to track the number of errors being logged, so that the monitoring system can warn of a potentially unstable system based on that number.

Metrics has 2 APIs: one for configuring the health checks and another for performing the configured health checks. The output APIs include JMX and JSON. For this proposal, 4 new servlet mappings will be defined for accessing the monitoring information:

  • /monitor/metrics?[pretty=(true|false)][class=metric.name] - returns a JSON response with all of the registered metrics
  • /monitor/threads - returns a text representation of the stack dump at the moment of the call
  • /monitor/healthcheck - returns 200 if all checks pass or 500 Internal Server Error if one fails (with a human-readable response describing the failures)
  • /monitor - provides links to the pages listed above
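As an illustration of how a monitoring system could consume /monitor/healthcheck, the following is a minimal sketch of a Nagios-style check written in Python. It relies only on the status codes described above (200 and 500). The function names, the base URL argument, and the choice to report any other status as UNKNOWN are assumptions made for this example, and authentication as an administrator is not handled here.

```python
import urllib.error
import urllib.request

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def nagios_state(http_status):
    """Map the HTTP status of /monitor/healthcheck to a Nagios exit code."""
    if http_status == 200:
        return OK        # all health checks passed
    if http_status == 500:
        return CRITICAL  # at least one health check failed
    return UNKNOWN       # anything else, e.g. access denied or not found


def check_health(base_url, timeout=10):
    """Call the health check endpoint and return (exit_code, message)."""
    url = base_url.rstrip("/") + "/monitor/healthcheck"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status, body = response.status, response.read()
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx; the body still describes the failures.
        status, body = err.code, err.read()
    except (urllib.error.URLError, OSError) as err:
        return UNKNOWN, "cannot reach %s: %s" % (url, err)
    return nagios_state(status), body.decode("utf-8", "replace").strip()
```

A wrapper script would call check_health with the address of the GeoNetwork web application, print the message, and exit with the returned code so that Nagios can interpret the result.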

A link will be made from the Admin/config.info page to these servlets so an administrator can easily access this data. In a future implementation we can possibly add a more attractive UI for viewing the information. All /monitor/* URLs will be restricted by a servlet filter so that only administrators can access the information.

It is important to realize that metrics is not exactly the same as statistics in my use case. While it could be used in some capacity for statistics, in this proposal metrics will be used as a standard API and utilities for creating a monitoring subsystem that is flexible, extensible and can interoperate with many existing monitoring systems.

Some monitors I propose to make are:

  • Database Health Monitor - checks that the database is accessible
  • Index Health Monitor - checks that the Lucene index is searchable
  • Index Error Health Monitor - checks that there are no index errors in the index (documents with _indexError field == 1)
  • CSW GetRecords Health Monitor - Checks that GetRecords does not return an error for a basic hits search
  • CSW GetCapabilities Health Monitor - Checks that the GetCapabilities is returned and is not an error document
  • Database Access Timer - Time taken to access a DBMS instance. This gives an idea of the level of contention over the database connections
  • Database Open Timer - Tracks the length of time a Database access is kept open
  • Database Connection Counter - Counts the number of open Database connections
  • Harvester Error Counter - Tracks errors that are raised during harvesting
  • Service timer - Track the time of service execution
  • Gui Services timer - Track the time spent executing Gui services
  • XSL output timer - Track the time of the output XSL transform
  • Log4j integration - monitors how frequently log statements are made at each log level so that (for example) the rate at which errors are logged can be monitored. See http://metrics.codahale.com/manual/log4j
  • Webapp integration - monitors the number of active requests, the error codes returned and the length of time requests take. See http://metrics.codahale.com/manual/webapps/
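The way a set of health monitors combines into the single /monitor/healthcheck result can be modelled in a few lines. The sketch below is a language-neutral illustration in Python, not the Metrics API and not the GeoNetwork implementation: the registry is a plain dictionary, and the two check functions are hypothetical stand-ins for the database and index monitors listed above.

```python
def run_health_checks(checks):
    """Run every registered check and return (http_status, report_text).

    `checks` maps a check name to a zero-argument callable that returns
    normally when healthy and raises an exception when unhealthy.
    """
    failures = []
    for name in sorted(checks):
        try:
            checks[name]()
        except Exception as err:  # one failing check must not stop the others
            failures.append("%s: %s" % (name, err))
    if failures:
        return 500, "\n".join(failures)
    return 200, "all %d checks passed" % len(checks)


def database_check():
    # Hypothetical stand-in for "open a connection and run a trivial query".
    pass


def index_check():
    # Hypothetical stand-in for "run a basic search against the Lucene index".
    raise RuntimeError("index is not searchable")


status, report = run_health_checks({
    "DatabaseHealthCheck": database_check,
    "IndexHealthCheck": index_check,
})
print(status)  # 500, because one of the two checks failed
print(report)  # IndexHealthCheck: index is not searchable
```

Collecting every failure before responding, rather than stopping at the first one, is what allows the response body to list all unhealthy subsystems in a single call.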

The Metrics and HealthService Monitors will be registered in the ServletContext so multiple GeoNetwork instances can exist in the same web application without interfering with each other.

See below for an example of the JSON data accessible for the metrics

Backwards Compatibility Issues

This proposal adds a new dependency and new servlet and filter definitions in web.xml. A Monitor Manager is added to ServiceContext, ResourceManager and ServiceManager.

Risks

Nothing notable

Participants

  • As above

Sample JSON response

{
  "jvm" : {
    "vm" : {
      "name" : "Java HotSpot(TM) 64-Bit Server VM",
      "version" : "1.6.0_29-b11-402-10M3527"
    },
    "memory" : {
      "totalInit" : 1.54341376E8,
      "totalUsed" : 2.32464064E8,
      "totalMax" : 1.273233408E9,
      "totalCommitted" : 6.4022528E8,
      "heapInit" : 1.30023424E8,
      "heapUsed" : 1.35404392E8,
      "heapMax" : 9.54466304E8,
      "heapCommitted" : 4.80313344E8,
      "heap_usage" : 0.1418639834979444,
      "non_heap_usage" : 0.30448637510600846,
      "memory_pool_usages" : {
        "Code Cache" : 0.18108113606770834,
        "PS Eden Space" : 0.0603823138059648,
        "PS Old Gen" : 0.14183750028609357,
        "PS Perm Gen" : 0.32762518525123596,
        "PS Survivor Space" : 0.453468281809034
      }
    },
    "daemon_thread_count" : 26,
    "thread_count" : 39,
    "current_time" : 1333479351258,
    "uptime" : 132,
    "fd_usage" : 0.03349609375,
    "thread-states" : {
      "runnable" : 0.1282051282051282,
      "waiting" : 0.2564102564102564,
      "new" : 0.0,
      "terminated" : 0.0,
      "blocked" : 0.0,
      "timed_waiting" : 0.6153846153846154
    },
    "garbage-collectors" : {
      "PS MarkSweep" : {
        "runs" : 1,
        "time" : 531
      },
      "PS Scavenge" : {
        "runs" : 16,
        "time" : 778
      }
    }
  },
  "jeeves.server.resources.ResourceManager" : {
    "Open_Resources" : {
      "type" : "counter",
      "count" : 0
    }
  },
  "org.apache.log4j.Appender" : {
    "all" : {
      "type" : "meter",
      "event_type" : "statements",
      "unit" : "seconds",
      "count" : 440,
      "mean" : 3.9474616538640612,
      "m1" : 5.649809351884407,
      "m5" : 9.426720754062472,
      "m15" : 10.905040893662465
    },
    "debug" : {
      "type" : "meter",
      "event_type" : "statements",
      "unit" : "seconds",
      "count" : 159,
      "mean" : 1.4264722458562058,
      "m1" : 1.524991253053399,
      "m5" : 0.46578148789059415,
      "m15" : 0.1690323413723673
    },
    "error" : {
      "type" : "meter",
      "event_type" : "statements",
      "unit" : "seconds",
      "count" : 35,
      "mean" : 0.3140032963886624,
      "m1" : 0.9673308340453434,
      "m5" : 3.552784585198555,
      "m15" : 4.4600438960559705
    },
    "fatal" : {
      "type" : "meter",
      "event_type" : "statements",
      "unit" : "seconds",
      "count" : 0,
      "mean" : 0.0,
      "m1" : 0.0,
      "m5" : 0.0,
      "m15" : 0.0
    },
    "info" : {
      "type" : "meter",
      "event_type" : "statements",
      "unit" : "seconds",
      "count" : 240,
      "mean" : 2.153160612351342,
      "m1" : 2.9804305103867503,
      "m5" : 4.70105813886709,
      "m15" : 5.385085852367142
    },
    "trace" : {
      "type" : "meter",
      "event_type" : "statements",
      "unit" : "seconds",
      "count" : 0,
      "mean" : 0.0,
      "m1" : 0.0,
      "m5" : 0.0,
      "m15" : 0.0
    },
    "warn" : {
      "type" : "meter",
      "event_type" : "statements",
      "unit" : "seconds",
      "count" : 6,
      "mean" : 0.05382894335273466,
      "m1" : 0.1770567543989148,
      "m5" : 0.7070965421062343,
      "m15" : 0.8908788038669864
    }
  },
  "org.fao.geonet.kernel.harvest.harvester.AbstractHarvester" : {
    "HarvestingErrors" : {
      "type" : "counter",
      "count" : 0
    }
  }
}
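A monitoring script could turn a response like the one above into warnings. The following Python sketch reads the "error" meter of the Log4j appender and the HarvestingErrors counter, using key names taken from the sample. The function name, the threshold value, and the decision to alert on the one-minute rate ("m1") are assumptions chosen for illustration; the embedded document is a trimmed copy of the sample, not real monitoring output.

```python
import json

LOG4J_KEY = "org.apache.log4j.Appender"
HARVESTER_KEY = "org.fao.geonet.kernel.harvest.harvester.AbstractHarvester"


def find_alerts(metrics, max_error_rate=1.0):
    """Return a list of warning strings for a parsed /monitor/metrics document."""
    alerts = []
    # "m1" is read here as the one-minute rate, in statements per second.
    error_meter = metrics.get(LOG4J_KEY, {}).get("error", {})
    rate = error_meter.get("m1", 0.0)
    if rate > max_error_rate:
        alerts.append("error log rate %.2f/s exceeds %.2f/s" % (rate, max_error_rate))
    harvester = metrics.get(HARVESTER_KEY, {})
    harvest_errors = harvester.get("HarvestingErrors", {}).get("count", 0)
    if harvest_errors > 0:
        alerts.append("%d harvesting error(s) recorded" % harvest_errors)
    return alerts


# Trimmed copy of the sample response, keeping only the fields used here.
SAMPLE = json.loads("""
{
  "org.apache.log4j.Appender": {
    "error": {"type": "meter", "unit": "seconds", "count": 35, "m1": 0.9673308340453434}
  },
  "org.fao.geonet.kernel.harvest.harvester.AbstractHarvester": {
    "HarvestingErrors": {"type": "counter", "count": 0}
  }
}
""")

print(find_alerts(SAMPLE))                      # no alert at the default threshold
print(find_alerts(SAMPLE, max_error_rate=0.5))  # one alert: error rate too high
```

Missing keys are treated as zero so that the script keeps working if a given meter or counter has not been registered on a particular instance.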
Last modified on Apr 26, 2012, 11:57:58 PM