Tuesday, October 3, 2017

Scaling Observium horizontally

EXPERIMENTAL FEATURE! THERE ARE NO KNOWN LARGE INSTALLATIONS USING THIS SETUP IN PRODUCTION YET. IF YOU BUILD THIS KIND OF SETUP IN PRODUCTION, PLEASE CONTACT THE OBSERVIUM DEVELOPERS WITH YOUR PERFORMANCE EXPERIENCE AND ANY BUGS THAT YOU MAY RUN INTO.

For very large installations (1000+ devices) it can be hard to make a single server fast enough to poll all your devices within the 5-minute polling interval. Luckily, you can split the separate functions of Observium across different servers to make it scale horizontally. You can then simply add more servers to your pool of poller-servers when you need more polling power. This quick guide shows you how to set it up.
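To get a rough feel for when a single poller runs out of time, the arithmetic can be sketched as below. This is a back-of-the-envelope model and not something Observium computes itself: the function name, the average per-device poll time and the thread count are all hypothetical inputs that you would replace with measurements from your own installation.

```python
import math

POLL_INTERVAL_SECONDS = 300  # the 5-minute polling window


def pollers_needed(devices, avg_poll_seconds, threads_per_poller,
                   interval=POLL_INTERVAL_SECONDS):
    """Estimate how many poller-servers are needed to finish within one interval.

    Simplified model: each thread polls one device at a time and every
    device takes avg_poll_seconds. Real poll times vary a lot per device.
    """
    if devices <= 0:
        return 0
    # Devices one poller-server can finish per interval with all threads busy.
    per_poller = math.floor(interval / avg_poll_seconds) * threads_per_poller
    if per_poller < 1:
        raise ValueError("a single device takes longer than the interval")
    return math.ceil(devices / per_poller)


if __name__ == "__main__":
    # Hypothetical numbers: 3000 devices, 20 s per device, 16 threads per poller.
    print(pollers_needed(3000, 20, 16))
```

Treat the result as a lower bound only. Database and RRD write speed are not part of this model, and they may well become the limit before the pollers do.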

Requirements

  • Observium Pro Edition
  • rrdtool 1.6.0 or greater
  • php 7.x
  • a fast dedicated MySQL-server
  • Very fast storage for RRD-files, SSDs highly recommended



MySQL-Server

Start by installing MySQL on a dedicated server. If your installation will be very large, it's recommended to tune the settings after the database has run for 48 hours, as there is a lot of tuning that can be done. You could also use MariaDB or Percona Server if you prefer, or even an Amazon Aurora database if you are building this in AWS.

After installing the database, make it listen for queries on the network interface. This is done by adding the following line to the [mysqld] section of your MySQL config file:
bind-address=<ip-of-your-server>
Or use the corresponding setting for whichever database server you chose.
Make sure the firewall allows connections to port 3306 for MySQL. You now have a database-server ready to serve over the network.
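Before moving on, it can save time to confirm from one of the other machines that the port really is reachable through the firewall. A minimal sketch of such a check follows. The helper name port_open is my own and not part of Observium or MySQL, and it only tests that a TCP connection can be opened, not that the database login works.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError on refusal, timeout or no route.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

You would call it as port_open("<ip-of-your-server>", 3306) from the machine that needs database access. The same helper should work for any other TCP port you open between the servers, such as the rrdcached port (42217).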

RRD-Server

It's now time to install the main Observium-server. This server will serve as RRD-storage and will receive RRD-data from all your pollers, which it needs to write to disk. This puts the storage of this server under heavy load, and the bottleneck of your entire installation will probably come down to how fast this server can write to disk.
Dedicated SSD-drives for this server are therefore highly recommended. Avoid consumer-grade SSDs, as the amount of data this server writes will probably wear out a consumer SSD in a couple of months.
Choose a pair of heavy-duty enterprise-class SSDs and preferably put them in RAID 1 for resilience.

Then proceed to install this server as a standard Observium server. Just follow the installation instructions on www.observium.org, except for two details.
First, skip the part where you install a MySQL-server. This server will not run MySQL at all. When it's time to create the MySQL user, do this on your dedicated MySQL-server instead, and make sure to change the database settings in config.php before you run discovery.php for the first time:
$config['db_host']      = '<your-mysql-server-ip>';
Second, skip the part where you add discovery.php and poller-wrapper.py to cron. The only thing this server should run from cron is housekeeping.php.

rrdcached

When your main Observium server is installed, it's time to install rrdcached. rrdcached serves as the interface for RRD-writes from all your pollers. It receives RRD-data over the network from all the poller machines, caches the pending data in memory and then writes it to disk in larger batches. This spares your storage the worst I/O bursts and at the same time gives your pollers a simple way of sending the data over the network.
Follow my other guide on how to set up rrdcached here: http://blog.best-practice.se/2014/10/using-rrdcached-with-observium.html
After the installation is done you need to add a few more flags to the rrdcached config. First make sure that you have the flags:
-BRO
These flags make sure it's only possible to write to RRD-files in the directory you assigned, and that the create command cannot overwrite existing RRD-files.
Next add the flag:
-L
This makes rrdcached listen on all network interfaces on the default port (42217).
Make sure the firewall allows connections on this port, and your RRD-server is ready to go.
Also make sure this machine itself is configured to use rrdcached in config.php.

Poller-Servers

The poller-servers are the machines that actually poll your SNMP-devices. You can have as many poller-servers as you need, and you can add more later when you need to scale up.
On every poller-server, do a standard Observium installation but without the Apache and MySQL parts. Then, just as on the RRD-server, change the database setting in config.php to:
$config['db_host']      = '<your-mysql-server-ip>';
Then add the following two lines to config.php:
$config['rrdcached']    = "<your-rrd-server-ip>";
$config['rrd']['no_local'] = TRUE;
This tells Observium that there are no local RRDs on this installation and where to find the rrdcached-server that all RRDs should be written to.
Next, edit the cronjobs. Delete all the housekeeping.php jobs, as housekeeping is done by the RRD-server itself, and then add the two flags -i and -n to the remaining jobs. The -i flag tells Observium how many poller-servers you are running, and the -n flag tells it which of them this server is (note that this number starts from 0).
So for example if you run 3 poller-servers your first server will have this cronjob:
*/5 *     * * *   root    /opt/observium/poller-wrapper.py -i 3 -n 0  >> /dev/null 2>&1
33  */6   * * *   root    /opt/observium/discovery.php -h all -i 3 -n 0 >> /dev/null 2>&1 
The next poller-server has exactly the same entries but with -n 1, and the third with -n 2.
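If you manage more than a handful of pollers, it is easy to get one of the -n values wrong by hand. The cron entries above could be generated with a small helper like the sketch below. The function name poller_cron_lines is hypothetical, and the paths and schedules are simply copied from the example above, so adjust them to your own installation.

```python
def poller_cron_lines(total_pollers, poller_index, base="/opt/observium"):
    """Build the two cron entries for one poller-server.

    total_pollers maps to -i, poller_index maps to -n (counted from 0).
    """
    if not 0 <= poller_index < total_pollers:
        raise ValueError("poller_index must be between 0 and total_pollers - 1")
    flags = f"-i {total_pollers} -n {poller_index}"
    return [
        f"*/5 *     * * *   root    {base}/poller-wrapper.py {flags} >> /dev/null 2>&1",
        f"33  */6   * * *   root    {base}/discovery.php -h all {flags} >> /dev/null 2>&1",
    ]


if __name__ == "__main__":
    # Print the entries for a pool of 3 poller-servers.
    for n in range(3):
        print(f"# poller-server {n}")
        print("\n".join(poller_cron_lines(3, n)))
```

Remember that when you add a poller later, the -i value has to change on every existing poller-server as well, not only on the new one.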

The last discovery job, the one that only discovers new devices (discovery.php -h new), cannot be split across multiple pollers. It is a very small job that finishes fast, so just keep it on the first of your poller-servers and remove it from the others.
That's it! Your poller-servers should now start fetching the device list from the database, poll their respective part of it and then feed the results and RRD-data over the network back to your database- and RRD-servers.
If you visit the "Polling Information" tab in Observium you should now see a number of separate Wrapper Processes in the graph.
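To picture what "their respective part of the device list" can look like, here is a toy split. It assumes the devices are divided by device ID modulo the number of pollers. That rule is my assumption for illustration only, as this post does not describe Observium's real selection logic, and the function name is hypothetical.

```python
def devices_for_poller(device_ids, total_pollers, poller_index):
    """Return the device IDs one poller would handle under a modulo split."""
    return [d for d in device_ids if d % total_pollers == poller_index]


if __name__ == "__main__":
    # Hypothetical device IDs 1..10 spread over 3 poller-servers.
    for n in range(3):
        print(n, devices_for_poller(range(1, 11), 3, n))
```

Under a split like this, every device lands on exactly one poller as long as all pollers agree on the total count, which is why the -i value must be identical everywhere.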

Updating

Be very careful when you update your installation. With many different processes all writing to the database at the same time, it's very important that all of them run the same version.
Make sure that you update Observium on all your machines at the same time and that only one of them runs ./discovery.php -u directly afterwards, so that the database schema is updated correctly.
It might even be a good idea to stop all cronjobs before updating, to be on the safe side.
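The order of operations matters here, so it can help to write it down before touching anything. The sketch below only builds and prints a plan and does not execute any commands. The host names, the "systemctl stop cron" and "systemctl start cron" commands and the update-command placeholder are assumptions of mine. Only ./discovery.php -u comes from the steps above.

```python
def update_plan(hosts, update_cmd="<your-update-command>"):
    """Return ordered (host, command) steps for updating every server together."""
    if not hosts:
        return []
    steps = []
    # 1. Pause cron everywhere so nothing writes while versions differ.
    steps += [(h, "systemctl stop cron") for h in hosts]
    # 2. Update the Observium code on every machine.
    steps += [(h, update_cmd) for h in hosts]
    # 3. Run the database update on exactly one machine.
    steps.append((hosts[0], "./discovery.php -u"))
    # 4. Resume cron everywhere.
    steps += [(h, "systemctl start cron") for h in hosts]
    return steps


if __name__ == "__main__":
    # Hypothetical host names for one RRD-server and two poller-servers.
    for host, cmd in update_plan(["rrd-server", "poller-0", "poller-1"]):
        print(f"{host}: {cmd}")
```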

Optimizing

If your install is large enough to need this setup, do not forget to check out all the performance tuning that can be done in Observium: http://docs.observium.org/tuning/
PHP 7 is a requirement as it gives a huge performance boost. Also make sure that opcode caching is enabled for the CLI.
As the database grows with a lot of devices and ports, the web interface will soon become pretty slow, so also make sure that you enable the fast userspace caching.
If your install is used by a lot of users, it might be worth switching from the Apache webserver to nginx and enabling HTTP/2 support, as this loads resources in the web interface much faster.
You could also experiment with the -t flag on rrdcached. It sets the number of write threads that rrdcached uses (the default is 4). Increasing it might improve disk write performance.

Installscript

As installing a lot of Observium instances can become tiresome, I decided to write a small shell script that automates the process for you.
It works well with Ubuntu 16 and Observium Pro or CE. Just download the script to the server, make it executable and run it.

4 comments:

  1. Awesome, I think the community has been asking for this for awhile. Especially if you don't have a cutting edge single server(s), but have a large network to monitor.

  2. Also how does this change alerts? If you have alert checkers in the GUI, are there files to replicate? Do email based alerts originate from the individual pollers?

  3. I have this error

    ERROR: rrdcached: RRD Error: creating '/opt/observium/rrd/poller-wrapper.rrd': File exists
    ERROR: rrdcached: RRD Error: creating '/opt/observium/rrd/poller-wrapper_count.rrd': File exists

  4. Is there any way to limit one poller to a specific subnet, so that it won't poll all the devices in Mysql DB.

    https://docs.observium.org/partitioning/

    In this article they say "A partitioned poller is a poller server that has a distinct identity and can have devices assigned to it, and will automatically have any devices added or discovered by it assigned to it"

    But nowhere in the documentation specifies how to do it.

    Can anyone help ?
