Health and Monitoring of Server

Date	2012/03/26
Contact(s)	Jesse Eichar
Last edited
Status	committed
Assigned to release	2.8
Resources	R&D Camptocamp
Code	https://github.com/jesseeichar/geonetwork/tree/monitoring
Ticket	#846

Overview

Provide a system for monitoring the health of a GeoNetwork instance, as well as metrics for some important functions. Metrics will be made available via HTTP/JSON and JMX. A common usage would be to configure Nagios or collectd to collect data from the GeoNetwork service and warn administrators when the system is becoming unstable.

Proposal Type

Type: New Module
App: GeoNetwork
Module:

Voting History

None as yet

Motivations

At the moment one must make several calls to a GeoNetwork instance to ensure that the important functions are running, and even that cannot detect spurious or hard-to-detect instabilities. It would be useful to have a consistent way to both register and view important characteristics such as the database connection, errors encountered, a corrupt index, failed logins, etc.

Proposal

The Metrics library (http://metrics.codahale.com/) by Yammer has excellent support for monitoring the performance and health of a system. It provides a consistent API for developers to register vital statistics of an application. For example, in GeoNetwork we might want a monitoring system (like Nagios or collectd) to check the health of the system, which would include checking the database connection, the ability to open files, the index, etc. In addition, we might want to attach a Metrics appender to the logging system to track the number of errors being logged, so that the monitoring system can warn of a potentially unstable system based on that number.

Metrics has two APIs: one for configuring the health checks and another for performing the configured health checks. The 'out' APIs include JMX and JSON. For this proposal, 4 new servlet mappings will be defined for accessing the monitoring information:

/monitor/metrics?[pretty=(true|false)][class=metric.name] - returns a JSON response with all of the registered metrics
/monitor/threads - returns a text representation of the stack dump at the moment of the call
/monitor/healthcheck - returns 200 if all checks pass or 500 Internal Server Error if one fails (with a human-readable description of the failures)
/monitor - provides links to the pages listed above
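
As an illustration of how an external monitor could consume /monitor/healthcheck, here is a minimal sketch of a Nagios-style probe. It only relies on the status codes described above (200 or 500) and the standard Nagios plugin exit codes (0 OK, 2 CRITICAL, 3 UNKNOWN). The class name, the default base URL and the timeouts are assumptions made for the example, and the administrator authentication that the endpoint requires is left out.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical probe, not part of the proposal's code.
class HealthcheckProbe {
    // Standard Nagios plugin exit codes.
    static final int OK = 0;
    static final int CRITICAL = 2;
    static final int UNKNOWN = 3;

    // Maps the HTTP status of /monitor/healthcheck to a Nagios exit code.
    static int toNagiosStatus(int httpStatus) {
        if (httpStatus == 200) return OK;        // all health checks passed
        if (httpStatus == 500) return CRITICAL;  // at least one check failed
        return UNKNOWN;                          // e.g. 401/403/404: cannot tell
    }

    // Calls the healthcheck URL below baseUrl and returns a Nagios exit code.
    static int probe(String baseUrl) {
        try {
            HttpURLConnection conn =
                (HttpURLConnection) new URL(baseUrl + "/monitor/healthcheck").openConnection();
            conn.setConnectTimeout(5000); // assumed timeouts, in milliseconds
            conn.setReadTimeout(5000);
            try {
                return toNagiosStatus(conn.getResponseCode());
            } finally {
                conn.disconnect();
            }
        } catch (IOException e) {
            return CRITICAL; // server unreachable or not answering in time
        }
    }

    public static void main(String[] args) {
        // Assumed default location of a local instance.
        String baseUrl = args.length > 0 ? args[0] : "http://localhost:8080/geonetwork";
        System.exit(probe(baseUrl));
    }
}
```

Treating any status other than 200 or 500 as UNKNOWN keeps an authentication or routing problem from being reported as a healthy or failed instance.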

A link will be added from the Admin/config.info page to these servlets so an administrator can easily access this data. In a future implementation we can possibly add a more attractive UI for viewing the information. All /monitor/* URLs will be restricted by a servlet filter so that only administrators can access the information.

It is important to realize that metrics is not exactly the same as statistics in my use case. While it could be used in some capacity for statistics, in this proposal metrics will be used as a standard API and utilities for creating a monitoring subsystem that is flexible, extensible and can interoperate with many existing monitoring systems.

Some monitors I propose to make are:

Database Health Monitor - checks that the database is accessible
Index Health Monitor - checks that the Lucene index is searchable
Index Error Health Monitor - checks that there are no index errors in the index (documents with _indexError field == 1)
CSW GetRecords Health Monitor - Checks that GetRecords does not return an error for a basic hits search
CSW GetCapabilities Health Monitor - Checks that the GetCapabilities is returned and is not an error document
Database Access timer - Time taken to access a DBMS instance. This gives an idea of the level of contention over the database connections
Database Open Timer - Tracks the length of time a Database access is kept open
Database Connection Counter - Counts the number of open Database connections
Harvester Error Counter - Tracks errors that are raised during harvesting
Service timer - Track the time of service execution
Gui Services timer - Track the time spent executing Gui services
XSL output timer - Track the time of output xsl transform
Log4j integration - monitors how frequently log statements are made at each log level so that (for example) the rate at which errors are logged can be monitored. See http://metrics.codahale.com/manual/log4j
Webapp integration - monitors number of active requests, error codes returned and length of time requests take. See http://metrics.codahale.com/manual/webapps/
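
To make the health monitor idea concrete, here is a minimal plain-Java sketch of a registry that runs named checks and derives the 200/500 outcome described for /monitor/healthcheck. It does not use the Metrics library API. The class, the method names and the check names are hypothetical and only illustrate the pattern.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;

// Hypothetical registry, a stand-in for a real health check registry.
class HealthCheckSketch {
    // Checks are kept in registration order; each returns true when healthy.
    private final Map<String, Callable<Boolean>> checks = new LinkedHashMap<>();

    void register(String name, Callable<Boolean> check) {
        checks.put(name, check);
    }

    // Runs every check and returns a failure message per failing check.
    // An empty map means that all checks passed.
    Map<String, String> runAll() {
        Map<String, String> failures = new LinkedHashMap<>();
        for (Map.Entry<String, Callable<Boolean>> entry : checks.entrySet()) {
            try {
                if (!entry.getValue().call()) {
                    failures.put(entry.getKey(), "check reported unhealthy");
                }
            } catch (Exception e) {
                // A check that throws counts as a failure with its message.
                failures.put(entry.getKey(), String.valueOf(e.getMessage()));
            }
        }
        return failures;
    }

    // HTTP status a healthcheck endpoint would answer with.
    static int httpStatus(Map<String, String> failures) {
        return failures.isEmpty() ? 200 : 500;
    }
}
```

A database monitor could, for example, be registered as a check that opens a connection and runs a trivial query, while an index monitor would run a basic search. The failure map gives the endpoint the human-readable text to return alongside the 500 status.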

The Metrics and HealthService monitors will be registered in the ServletContext so multiple GeoNetwork instances can exist in the same web application without interfering with each other.

See below for an example of the JSON data accessible for the metrics

Backwards Compatibility Issues

A new dependency and new servlet and filter definitions in web.xml. Monitor Manager is added to ServiceContext, ResourceManager and ServiceManager.

Risks

Nothing notable

Participants

As above

Sample JSON response

{
  "jvm" : {
    "vm" : {
      "name" : "Java HotSpot(TM) 64-Bit Server VM",
      "version" : "1.6.0_29-b11-402-10M3527"
    },
    "memory" : {
      "totalInit" : 1.54341376E8,
      "totalUsed" : 2.32464064E8,
      "totalMax" : 1.273233408E9,
      "totalCommitted" : 6.4022528E8,
      "heapInit" : 1.30023424E8,
      "heapUsed" : 1.35404392E8,
      "heapMax" : 9.54466304E8,
      "heapCommitted" : 4.80313344E8,
      "heap_usage" : 0.1418639834979444,
      "non_heap_usage" : 0.30448637510600846,
      "memory_pool_usages" : {
        "Code Cache" : 0.18108113606770834,
        "PS Eden Space" : 0.0603823138059648,
        "PS Old Gen" : 0.14183750028609357,
        "PS Perm Gen" : 0.32762518525123596,
        "PS Survivor Space" : 0.453468281809034
      }
    },
    "daemon_thread_count" : 26,
    "thread_count" : 39,
    "current_time" : 1333479351258,
    "uptime" : 132,
    "fd_usage" : 0.03349609375,
    "thread-states" : {
      "runnable" : 0.1282051282051282,
      "waiting" : 0.2564102564102564,
      "new" : 0.0,
      "terminated" : 0.0,
      "blocked" : 0.0,
      "timed_waiting" : 0.6153846153846154
    },
    "garbage-collectors" : {
      "PS MarkSweep" : {
        "runs" : 1,
        "time" : 531
      },
      "PS Scavenge" : {
        "runs" : 16,
        "time" : 778
      }
    }
  },
  "jeeves.server.resources.ResourceManager" : {
    "Open_Resources" : {
      "type" : "counter",
      "count" : 0
    }
  },
  "org.apache.log4j.Appender" : {
    "all" : {
      "type" : "meter",
      "event_type" : "statements",
      "unit" : "seconds",
      "count" : 440,
      "mean" : 3.9474616538640612,
      "m1" : 5.649809351884407,
      "m5" : 9.426720754062472,
      "m15" : 10.905040893662465
    },
    "debug" : {
      "type" : "meter",
      "event_type" : "statements",
      "unit" : "seconds",
      "count" : 159,
      "mean" : 1.4264722458562058,
      "m1" : 1.524991253053399,
      "m5" : 0.46578148789059415,
      "m15" : 0.1690323413723673
    },
    "error" : {
      "type" : "meter",
      "event_type" : "statements",
      "unit" : "seconds",
      "count" : 35,
      "mean" : 0.3140032963886624,
      "m1" : 0.9673308340453434,
      "m5" : 3.552784585198555,
      "m15" : 4.4600438960559705
    },
    "fatal" : {
      "type" : "meter",
      "event_type" : "statements",
      "unit" : "seconds",
      "count" : 0,
      "mean" : 0.0,
      "m1" : 0.0,
      "m5" : 0.0,
      "m15" : 0.0
    },
    "info" : {
      "type" : "meter",
      "event_type" : "statements",
      "unit" : "seconds",
      "count" : 240,
      "mean" : 2.153160612351342,
      "m1" : 2.9804305103867503,
      "m5" : 4.70105813886709,
      "m15" : 5.385085852367142
    },
    "trace" : {
      "type" : "meter",
      "event_type" : "statements",
      "unit" : "seconds",
      "count" : 0,
      "mean" : 0.0,
      "m1" : 0.0,
      "m5" : 0.0,
      "m15" : 0.0
    },
    "warn" : {
      "type" : "meter",
      "event_type" : "statements",
      "unit" : "seconds",
      "count" : 6,
      "mean" : 0.05382894335273466,
      "m1" : 0.1770567543989148,
      "m5" : 0.7070965421062343,
      "m15" : 0.8908788038669864
    }
  },
  "org.fao.geonet.kernel.harvest.harvester.AbstractHarvester" : {
    "HarvestingErrors" : {
      "type" : "counter",
      "count" : 0
    }
  }
}
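
In the meters above, "mean" is the average rate since the meter was created, and "m1", "m5" and "m15" are the one-, five- and fifteen-minute moving average rates, expressed per "unit" (seconds here). As a rough sketch of how a monitoring client might use this, the following code pulls the one-minute rate of a named meter out of the response and compares it with a threshold. The class name and the threshold are assumptions, and the regular expression only works for flat meter objects like the ones shown. A real client would use a proper JSON parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical client-side helper, not part of the proposal's code.
class ErrorRateCheck {
    // Returns the one-minute rate ("m1") of the named meter, or NaN if absent.
    static double oneMinuteRate(String json, String meterName) {
        // Match "<name>" : { ... "m1" : <number> without crossing any braces.
        Pattern pattern = Pattern.compile(
            "\"" + Pattern.quote(meterName) + "\"\\s*:\\s*\\{[^{}]*?\"m1\"\\s*:\\s*([-+0-9.eE]+)");
        Matcher matcher = pattern.matcher(json);
        return matcher.find() ? Double.parseDouble(matcher.group(1)) : Double.NaN;
    }

    // True when the "error" meter's one-minute rate exceeds the threshold.
    static boolean shouldWarn(String json, double maxErrorsPerSecond) {
        double rate = oneMinuteRate(json, "error");
        return !Double.isNaN(rate) && rate > maxErrorsPerSecond;
    }
}
```

With the sample response, the "error" meter has an "m1" of roughly 0.97 statements per second, so an assumed threshold of 0.5 would trigger a warning and a threshold of 1.0 would not.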

Last modified on 04/26/12 23:57:58
