Страница 10 из 37

IT Cloud

Shtoltc Eugeny

Monitoring with Prometheus

Monitoring – maintaining the continuity of work, tracking the current situation (identifying, localizing and sending about the incident, for example, in SaaS PagerDuty), predicting possible situations, visualization, building models for the normal operation of IAOps (Artificial Intelligence For It Operations, https: //www.gartner .com / en / information-technology / glossary / aiops-artificial-intelligence-operations).

Monitoring contains the following steps:

* identification of the incident;

* notification of the incident;

* localization;

* decision.

Monitoring can be classified by level into the following types:

* infrastructure (operating system, servers, Kubernetes, DBMS),;

* applied (application logs, traces, application events),;

* business processes (points in transactions, traces of transactions).

Monitoring can be classified according to the principle:

* distributed (traces),;

* synthetic (availability),;

* IAOps (forecasting, anomalies).

Monitoring is divided into two parts according to the degree of analysis: logging systems and incident investigation systems. An example of logging

serves as ELK stack, and incident investigation – Sentry (SaaS). For micro-services, a tracing system is also added.

requests such as Jeger or Zipkin. The logging system simply writes all the logs that are available.

The incident investigation system writes much more information, but writes it only in case of errors in the application, for example,

environment parameters, versions of installed packages, stack trace and so on, which allows you to get maximum information when viewing

by mistake, rather than collecting it piece by piece from the server and the GIT repository. But the set and format of information depends on the environment, therefore

the incident system needs to be integrated with various language platforms, and even better with specific frameworks. So Sentry

poisons environment variables, a piece of code and an indication of where the error occurred, parameters of the program and platform

environments, method calls.

Ecosystem monitoring can be divided into:

* Built into Cloud Cloud: Azure Monitoring, Amazon CloudWatch, Google Cloud Monitoring

* Provided as a service with support for various SaaS integrations: DataDog, NewRelic

* CloudNative: Prometheus

* For dedicated servers OnPremis: Zabbix

Zabbix was developed in 1998 and released to OpenSource under the GPL in 2001. At that time, the traditional interface:

without any design, with a lot of tabs, selectors and the like. Since it was developed for

own needs, it contains specific solutions. He is oriented

monitoring devices and their components such as disks, networks, printers, routers and the like. For

interactions can be used:

Agents – installed on servers, collecting many metrics and poisoning Zabbix server

HTTP – Zabbix makes requests over http, for example printers

SNMP – a network protocol for communicating with network devices

IPMI is a protocol for communicating with server devices such as routers

In 2019, Gratner presented a rating of monitoring systems in its square:

** Dynatrace;

** Cisco (AppDynamics);

** New Relic;

** Broadcom (CA Technologies);

** Riverbed and Microsoft;

** IBM;

** Oracle;

** SolarWinds;

** Micro Focus;

** ManageEngine and Tingyun.

Not included in the square:

** Correlsense;

** Datadog;

** Elastic;

** Honeycomb;

** Instant;

** Je

** Light Step;

** Nastel Technologies;

** SignalFx;

** Splunk;

** Sysdig.

When we run an application in a Docker container, all the standard output (what is displayed in the console) of the ru

An important criterion for ensuring smooth operation is the control of free space. So, if there is no space left, then the database will not be able to write data, with other components the situation can be more dire than the loss of new data. Docker has limit settings not only for individual containers, at least 10%. During imaging or container startup, an error may be thrown that the specified limits have been exceeded. To change the default settings, you need to tell the Dockerd server the settings, after stopping it with service docker stop (all containers will be stopped) and after resuming it with service docker start (the containers will be resumed). Settings can be set as options / bin / dockerd –storange-opt dm.basesize = 50G –stirange-opt

In Container, we have authorization, control over our containers, with the ability to create them for testing and see graphs on the processor and memory. More will require a monitoring system. There are quite a few monitoring systems, for example, Zabbix, Graphite, Prometheus, Nagios, InfluxData, OkMeter, DataDog, Bosum, Sensu and others, of which Zabbix and Prometheus are the most popular. The first is traditionally used, since it is the leading deployment tool, which admins love for its ease of use (all you need to do is to have SSH access to the server), low-level, which allows you to work not only with servers, but also with other hardware, such as routers. The second is the opposite of the first: it is focused exclusively on collecting metrics and monitoring, focused as a ready-made solution, and not a framework and fell in love with programmers, set it according to the principle, chose metrics and received graphs. The key feature between Zabbix and Prometheus is not in the preferences of some to customize in detail for themselves and others to spend much less time, but in the scope. Zabbix is focused on setting up work with a specific hardware, which can be anything, and often very exotic in a corporate environment, and for this entity, a manual collection of metrics is written, a schedule is manually configured. For a dynamically changing environment of cloud solutions, even if it is just a Docker container, and even more so if it is Kubernetes, in which a huge number of entities are constantly created, and the entities themselves, apart from the general environment, are not of particular interest, it is not suitable for this in Prometheus Service Discovery is built-in and navigation is supported for Kubernetes through the namespace, the balancer (service) and the group of containers (POD), which can be configured in Grafana in the form of tables. In Kubernetes, according to The News Stack 2017, Kubernetes User and Experience is used in 63% of cases, in the rest there are more rare cloud monitoring tools.

Metrics can be system (for example, CRU, RAM, ROM) and application (service and application metrics). System metrics are core metrics that are used by Kubernetes for scaling and the like and non-core metrics that are not used by Kubernetes. Here is an example of bundles for collecting metrics:

* cAdvisor + Heapster + InfluxDB

* cAdvisor + collectd + Heapster

* cAdvisor + Prometheus

* snapd + Heapster

* snapd + SNAP cluster-level agent

* Sysdig

There are many monitoring systems and services on the market. We will consider exactly OpenSource, which can be installed in your cluster. They can be divided according to the model of obtaining metrics: into those who collect logs by polling, and those who expect that metrics will be poisoned in them. The latter are simpler both in structure and in use on a small scale. An example would be InfluxDB, which is a database that you can write to. The downside of this solution is the difficulty of scaling both in terms of support and load. If all services write at the same time, then they can overload the monitoring system, especially since it is difficult to scale, since the endpoint is registered in each service. The first group to practice a pull model of interaction is Prometheus. It is also a database with a daemon that polls services based on their registrations in the configuration file and pulls labels in a specific format, for example: