Monitors for Masakari:
masakari-hostmonitor
Monitor Overview
The masakari-hostmonitor provides compute node High Availability for OpenStack clouds by automatically
detecting compute node failures via a monitor driver.
How does it work based on pacemaker & corosync?
• Pacemaker or pacemaker-remote must be installed on the compute nodes to form a pacemaker cluster.
• The status of a compute node depends on the heartbeat between that node and the cluster. Once a node
loses the heartbeat, masakari-hostmonitor on the other nodes detects the failure and sends
notifications to masakari-api.
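As a rough illustration of the notification step, the following Python sketch builds a host-failure
notification body of the kind a monitor could send to masakari-api. The field names are modeled on the
Masakari notifications API, but the helper name build_host_notification and the exact payload values
are assumptions made for this example, not the hostmonitor's actual code.

```python
from datetime import datetime, timezone


def build_host_notification(hostname, generated_time=None):
    """Build a COMPUTE_HOST notification body (hypothetical helper)."""
    if generated_time is None:
        generated_time = datetime.now(timezone.utc)
    return {
        "notification": {
            "type": "COMPUTE_HOST",
            "hostname": hostname,
            "generated_time": generated_time.strftime("%Y-%m-%dT%H:%M:%S.%f"),
            "payload": {
                # Assumed values for a node that lost its cluster heartbeat.
                "event": "STOPPED",
                "cluster_status": "OFFLINE",
                "host_status": "NORMAL",
            },
        }
    }
```

A monitor would serialize this dictionary as JSON and POST it to the notifications endpoint of
masakari-api; authentication and transport are left out of the sketch.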
How does it work based on consul?
• If the nodes in the cloud have multiple interfaces connecting to the management, tenant, or storage
network, the monitor driver based on consul is another choice. Consul agents must be installed on
all nodes, which together make up multiple consul clusters.
Here is an example showing how to set up one consul cluster.
Consul Usage
Consul overview
Consul is a service mesh solution providing a full featured control plane with service discovery,
configuration, and segmentation functionality. Each of these features can be used individually as
needed, or they can be used together to build a full service mesh.
The Consul agent is the core process of Consul. The Consul agent maintains membership information,
registers services, runs checks, responds to queries, and more.
Consul clients can provide any number of health checks, either associated with a given service or with
the local node. This information can be used by an operator to monitor cluster health.
Please refer to Consul Agent Overview.
Test Environment
There are three controller nodes and two compute nodes in the test environment. Every node has three
network interfaces. The first interface is used for management, with an IP address such as
‘192.168.101.*’. The second interface is used to connect to storage, with an IP address such as
‘192.168.102.*’. The third interface is used for tenant traffic, with an IP address such as
‘192.168.103.*’.
Download Consul
Download the Consul package for CentOS. For other operating systems, please refer to Download Consul.
sudo yum install -y yum-utils
sudo yum-config-manager --add-repo https://rpm.releases.hashicorp.com/RHEL/hashicorp.repo
sudo yum -y install consul
Configure Consul agent
A Consul agent must run on every node. Consul server agents run on the controller nodes, while Consul
client agents run on the compute nodes. Together they make up one Consul cluster.
The following is an example of a config file for a Consul server agent that binds to the management
interface of the host.
management.json
{
"bind_addr": "192.168.101.1",
"datacenter": "management",
"data_dir": "/tmp/consul_m",
"log_level": "INFO",
"server": true,
"bootstrap_expect": 3,
"node_name": "node01",
"addresses": {
"http": "192.168.101.1"
},
"ports": {
"http": 8500,
"serf_lan": 8501
},
"retry_join": ["192.168.101.1:8501", "192.168.101.2:8501", "192.168.101.3:8501"]
}
The following is an example of a config file for a Consul client agent that binds to the management
interface of the host.
management.json
{
"bind_addr": "192.168.101.4",
"datacenter": "management",
"data_dir": "/tmp/consul_m",
"log_level": "INFO",
"node_name": "node04",
"addresses": {
"http": "192.168.101.4"
},
"ports": {
"http": 8500,
"serf_lan": 8501
},
"retry_join": ["192.168.101.1:8501", "192.168.101.2:8501", "192.168.101.3:8501"]
}
When configuring an agent in the tenant or storage datacenter, use the IP address and ports of the
tenant or storage interface instead.
Please refer to Consul Agent Configuration.
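To illustrate how the same template could be reused for each datacenter, the following Python sketch
renders an agent configuration from a small table of per-datacenter settings. Only the management
values come from the example above. The datacenter names 'tenant' and 'storage', their port numbers,
and the helper name agent_config are assumptions made for this sketch.

```python
# Per-datacenter settings. Only "management" is taken from the example
# above; the tenant and storage ports are hypothetical placeholders.
DATACENTERS = {
    "management": {"subnet": "192.168.101", "http": 8500, "serf_lan": 8501},
    "storage": {"subnet": "192.168.102", "http": 8510, "serf_lan": 8511},
    "tenant": {"subnet": "192.168.103", "http": 8520, "serf_lan": 8521},
}


def agent_config(datacenter, host_index, node_name, server,
                 server_indexes=(1, 2, 3)):
    """Return a Consul agent config dict for one node in one datacenter."""
    dc = DATACENTERS[datacenter]
    addr = f"{dc['subnet']}.{host_index}"
    config = {
        "bind_addr": addr,
        "datacenter": datacenter,
        "data_dir": f"/tmp/consul_{datacenter[0]}",
        "log_level": "INFO",
        "node_name": node_name,
        "addresses": {"http": addr},
        "ports": {"http": dc["http"], "serf_lan": dc["serf_lan"]},
        # Every agent joins through the server agents on the controllers.
        "retry_join": [
            f"{dc['subnet']}.{i}:{dc['serf_lan']}" for i in server_indexes
        ],
    }
    if server:
        config["server"] = True
        config["bootstrap_expect"] = len(server_indexes)
    return config
```

Calling agent_config("management", 1, "node01", server=True) and dumping the result with json.dumps
yields the same keys and values as the server example, and passing "tenant" or "storage" swaps in the
other interface.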
Start Consul agent
The Consul agent is started with the following command.
# consul agent -config-file management.json
Test Consul installation
After all Consul agents are installed and started, you can list all nodes in the cluster with the
following command.
# consul members -http-addr=192.168.101.1:8500
Node    Address             Status  Type    Build   Protocol  DC
node01  192.168.101.1:8501  alive   server  1.10.2  2         management
node02  192.168.101.2:8501  alive   server  1.10.2  2         management
node03  192.168.101.3:8501  alive   server  1.10.2  2         management
node04  192.168.101.4:8501  alive   client  1.10.2  2         management
node05  192.168.101.5:8501  alive   client  1.10.2  2         management
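For a quick scripted check of this output, a minimal Python sketch could parse the table and list the
nodes that are not reported as alive. It assumes whitespace-separated columns with no spaces inside
values; the function names are hypothetical and this is not how the hostmonitor itself queries Consul.

```python
def parse_members(output):
    """Parse `consul members` table text into a list of dicts keyed by header."""
    lines = [line for line in output.strip().splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def nodes_not_alive(members):
    """Return the names of nodes whose Status column is not 'alive'."""
    return [m["Node"] for m in members if m["Status"] != "alive"]
```

The text to parse can be captured, for example, with subprocess.run and the same command line as
above.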
• The status of a compute node is determined by combining the connectivity status of its multiple
interfaces, which is retrieved from the multiple consul clusters. The hostmonitor then sends a
notification to trigger host failure recovery according to the defined HA strategy, that is, the host
states and their corresponding actions.
Related configurations
This section in masakarimonitors.conf shows an example of how to configure the hostmonitor if you choose
the monitor driver based on pacemaker.
[host]
# Driver that hostmonitor uses for monitoring hosts.
monitoring_driver = default
# Monitoring interval(in seconds) of node status.
monitoring_interval = 60
# Do not check whether the host is completely down.
# Possible values:
# * True: Do not check whether the host is completely down.
# * False: Do check whether the host is completely down.
# If ipmi RA is not set in pacemaker, this value should be set True.
disable_ipmi_check = False
# Timeout value(in seconds) of the ipmitool command.
ipmi_timeout = 5
# Number of ipmitool command retries.
ipmi_retry_max = 3
# Retry interval(in seconds) of the ipmitool command.
ipmi_retry_interval = 10
# Only monitor pacemaker-remotes, ignore the status of full cluster
# members.
restrict_to_remotes = False
# Standby time(in seconds) until activate STONITH.
stonith_wait = 30
# Timeout value(in seconds) of the tcpdump command when monitors
# the corosync communication.
tcpdump_timeout = 5
# The name of interface that corosync is using for mutual communication
# between hosts.
# If there are multiple interfaces, specify them in comma-separated
# like 'enp0s3,enp0s8'.
# The number of interfaces you specify must be equal to the number of
# corosync_multicast_ports values and must be in correct order with
# relevant ports in corosync_multicast_ports.
corosync_multicast_interfaces = enp0s3,enp0s8
# The port numbers that corosync is using for mutual communication
# between hosts.
# If there are multiple port numbers, specify them in comma-separated
# like '5405,5406'.
# The number of port numbers you specify must be equal to the number of
# corosync_multicast_interfaces values and must be in correct order with
# relevant interfaces in corosync_multicast_interfaces.
corosync_multicast_ports = 5405,5406
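The pairing rule between corosync_multicast_interfaces and corosync_multicast_ports can be checked
before starting the monitor. The following is a minimal sketch that uses Python's standard configparser
purely for illustration (masakari-monitors does not necessarily read its options this way); the
function name corosync_pairs is hypothetical.

```python
import configparser


def corosync_pairs(conf_text):
    """Return (interface, port) pairs from the [host] section.

    Raises ValueError when the two comma-separated lists differ in length.
    """
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    host = parser["host"]
    interfaces = [i.strip() for i in host["corosync_multicast_interfaces"].split(",")]
    ports = [int(p) for p in host["corosync_multicast_ports"].split(",")]
    if len(interfaces) != len(ports):
        raise ValueError(
            f"{len(interfaces)} interfaces but {len(ports)} ports configured"
        )
    # Order matters: the n-th interface is paired with the n-th port.
    return list(zip(interfaces, ports))
```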
If you want to use or test the monitor driver based on consul, modify the configuration as follows.
[host]
# Driver that hostmonitor uses for monitoring hosts.
monitoring_driver = consul
[consul]
# Addr for local consul agent in management datacenter.
# The addr is made up of the agent's bind_addr and http port,
# such as '192.168.101.1:8500'.
agent_manage = $(CONSUL_MANAGEMENT_ADDR)
# Addr for local consul agent in tenant datacenter.
agent_tenant = $(CONSUL_TENANT_ADDR)
# Addr for local consul agent in storage datacenter.
agent_storage = $(CONSUL_STORAGE_ADDR)
# Config file for consul health action matrix.
matrix_config_file = /etc/masakarimonitors/matrix.yaml
The matrix_config_file defines the HA strategy. The matrix combines host health states with actions. The
‘health: [x, x, x]’ entry represents the combined status of the clusters listed in ‘sequence’, in that
order. The ‘action’ entry lists the actions that are triggered when the host health turns into that
state, where ‘recovery’ means that one host failure recovery workflow is triggered. Users can define the
HA strategy according to the physical environment. For example, if there is just one cluster, which
monitors management network connectivity, the user only needs to configure $(CONSUL_MANAGEMENT_ADDR) in
the consul section of the hostmonitor's configuration file and change the HA strategy in
/etc/masakarimonitors/matrix.yaml as follows:
sequence: ['manage']
matrix:
  - health: ['up']
    action: []
  - health: ['down']
    action: ['recovery']
The consul-based hostmonitor then works the same way as the pacemaker-based hostmonitor.
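The matrix lookup can be pictured with a short Python sketch. The three-cluster strategy below is a
made-up example, not the default shipped with masakari-monitors, and the function name actions_for and
the fallback for unlisted combinations are assumptions.

```python
# Hypothetical strategy: recover only when management and storage are both down.
MATRIX_CONFIG = {
    "sequence": ["manage", "tenant", "storage"],
    "matrix": [
        {"health": ["up", "up", "up"], "action": []},
        {"health": ["down", "up", "up"], "action": []},
        {"health": ["down", "up", "down"], "action": ["recovery"]},
        {"health": ["down", "down", "down"], "action": ["recovery"]},
    ],
}


def actions_for(matrix_config, cluster_health):
    """Return the actions for a host given its health in each cluster.

    cluster_health maps a cluster name from 'sequence' to 'up' or 'down'.
    """
    # Arrange the observed states in the order given by 'sequence'.
    observed = [cluster_health[name] for name in matrix_config["sequence"]]
    for row in matrix_config["matrix"]:
        if row["health"] == observed:
            return row["action"]
    # Assumption: combinations not listed in the matrix trigger nothing.
    return []
```

With this strategy, a host that is down only on the management network yields an empty action list,
while a host that is down on all three networks yields ['recovery'].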
masakari-instancemonitor
Monitor Overview
The masakari-instancemonitor provides Virtual Machine High Availability for OpenStack clouds by
automatically detecting VM domain events via libvirt. If it detects specific libvirt events, it sends
notifications to the masakari-api.
How does it work?
• It runs the libvirt event loop in a background thread.
• Invoking libvirt.virEventRegisterDefaultImpl() registers libvirt’s default event loop
implementation.
• Invoking libvirt.virEventRunDefaultImpl() performs one iteration of the libvirt default event
loop.
• Invoking conn.domainEventRegisterAny() registers event callbacks against libvirt connection
instances. The registered callbacks are triggered from the execution context of
libvirt.virEventRunDefaultImpl() and send notifications to the masakari-api.
• It will reconnect to libvirt and resume processing if the connection is lost.
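The steps above can be sketched in Python roughly as follows. This is a simplified illustration and not
the instancemonitor's code: it needs the libvirt-python package at run time, it only collects lifecycle
events in a list instead of notifying masakari-api, and the names lifecycle_callback, run_event_loop
and start_monitor are hypothetical.

```python
import threading


def lifecycle_callback(conn, dom, event, detail, opaque):
    """Lifecycle event callback; 'opaque' is the list given at registration."""
    opaque.append(
        {"instance_uuid": dom.UUIDString(), "event": event, "detail": detail}
    )


def run_event_loop(libvirt_module, stop_event):
    """Run iterations of the default event loop until asked to stop."""
    while not stop_event.is_set():
        libvirt_module.virEventRunDefaultImpl()


def start_monitor(uri="qemu:///system"):
    import libvirt  # provided by libvirt-python; imported lazily on purpose

    # The default event loop must be registered before opening a connection.
    libvirt.virEventRegisterDefaultImpl()
    stop_event = threading.Event()
    thread = threading.Thread(
        target=run_event_loop, args=(libvirt, stop_event), daemon=True
    )
    thread.start()

    conn = libvirt.openReadOnly(uri)
    events = []
    # None means "all domains"; callbacks fire inside virEventRunDefaultImpl().
    conn.domainEventRegisterAny(
        None, libvirt.VIR_DOMAIN_EVENT_ID_LIFECYCLE, lifecycle_callback, events
    )
    return conn, events, stop_event
```

Reconnection handling is omitted here; a fuller version would detect a closed connection and call
start_monitor again.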
Related configurations
This section in masakarimonitors.conf shows an example of how to configure the monitor.
[libvirt]
# Override the default libvirt URI.
connection_uri = qemu:///system
masakari-introspectiveinstancemonitor
Monitor Overview
The masakari-introspectiveinstancemonitor provides Virtual Machine HA for OpenStack clouds by
automatically detecting system-level failure events via the QEMU Guest Agent. If it detects VM
heartbeat failure events, it sends notifications to the masakari-api.
How does it work?
• libvirt and the QEMU Guest Agent are used as the underlying protocol for messaging to and from the VM.
• The host-side qemu-agent sockets are used to determine whether VMs are configured with the QEMU Guest
Agent.
• qemu-guest-ping is used as the monitoring heartbeat.
• In a future release, arbitrary guest agent commands could be passed through to check the health of
the applications inside a VM.
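As an illustration of the socket check, the following Python sketch filters a list of paths with the
same regular expression that this document uses for qemu_guest_agent_sock_path. The function name
guest_agent_sockets and the sample file names are hypothetical.

```python
import re

# Same expression as the qemu_guest_agent_sock_path option in this document.
SOCK_PATTERN = re.compile(
    r"/var/lib/libvirt/qemu/org\.qemu\.guest_agent\..*\.instance-.*\.sock"
)


def guest_agent_sockets(paths):
    """Return only the paths that look like QEMU Guest Agent sockets."""
    return [path for path in paths if SOCK_PATTERN.fullmatch(path)]
```

A VM whose socket appears in the result would be treated as configured with the guest agent; VMs
without a matching socket would be skipped.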
Related configurations
This section in masakarimonitors.conf shows an example of how to configure the monitor.
[libvirt]
# Override the default libvirt URI.
connection_uri = qemu:///system
[introspectiveinstancemonitor]
# Guest monitoring interval of VM status (in seconds).
# * The value should not be too low, to avoid false negatives when
#   reporting QEMU_GUEST_AGENT failures.
# * A VM needs time to power off, so guest_monitoring_interval should be
#   greater than the time needed to shut down the VM gracefully.
guest_monitoring_interval = 10
# Guest monitoring timeout (in seconds).
guest_monitoring_timeout = 2
# Failure threshold before sending notification.
guest_monitoring_failure_threshold = 3
# The file path of qemu guest agent sock.
qemu_guest_agent_sock_path = \
/var/lib/libvirt/qemu/org\.qemu\.guest_agent\..*\.instance-.*\.sock
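How guest_monitoring_failure_threshold might gate notifications can be sketched as a small counter.
This assumes that failures are counted consecutively and that one successful heartbeat resets the
count, which the configuration comments do not state explicitly; the class name HeartbeatTracker is
hypothetical.

```python
class HeartbeatTracker:
    """Count consecutive heartbeat failures for one VM."""

    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def record(self, ping_ok):
        """Record one heartbeat result.

        Returns True exactly once, when the failure count reaches the threshold.
        """
        if ping_ok:
            # Assumption: a successful ping clears earlier failures.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures == self.failure_threshold
```

With the sample values above, a monitor would call record() every guest_monitoring_interval seconds,
treating a ping that takes longer than guest_monitoring_timeout as a failure.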
masakari-processmonitor
Monitor Overview
The masakari-processmonitor provides key process High Availability for OpenStack clouds by automatically
detecting process failures. If it detects a process failure, it sends notifications to masakari-api.
If your OpenStack services run in containers (pods), the processmonitor will not work as expected, and
deploying it is not recommended.
How does it work?
• Processes to be monitored should be pre-configured in the process_list.yaml file.
Define one process to be monitored as follows:
process_name: [Name of the process as it appears in 'ps -ef'.]
start_command: [Start command of the process.]
pre_start_command: [Command which is executed before start_command.]
post_start_command: [Command which is executed after start_command.]
restart_command: [Restart command of the process.]
pre_restart_command: [Command which is executed before restart_command.]
post_restart_command: [Command which is executed after restart_command.]
run_as_root: [Boolean value indicating whether to execute the commands with root authority.]
A sample definition is shown below:
# nova-compute
process_name: /usr/local/bin/nova-compute
start_command: systemctl start nova-compute
pre_start_command:
post_start_command:
restart_command: systemctl restart nova-compute
pre_restart_command:
post_restart_command:
run_as_root: True
• If masakari-processmonitor detects a process failure, it first tries to restart the process. If the
process still cannot be restarted after several retries, it sends a notification to masakari-api.
Related configurations
This section in masakarimonitors.conf shows an example of how to configure the monitor.
[process]
# Interval in seconds for checking a process.
check_interval = 5
# Number of retries when restarting a process fails.
restart_retries = 3
# Interval in seconds for restarting a process.
restart_interval = 5
# The file path of process list.
process_list_path = /etc/masakarimonitors/process_list.yaml
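The restart-then-notify behaviour and the options above can be sketched in Python as follows. The
checks and commands are injected as callables so the logic stays independent of 'ps -ef' and systemctl.
Whether the first restart attempt counts toward restart_retries is an assumption, and the function
names are hypothetical.

```python
import time


def is_running(process_name, ps_output):
    """Return True if any line of `ps -ef` output mentions process_name."""
    return any(process_name in line for line in ps_output.splitlines())


def recover_process(check, restart, notify,
                    restart_retries=3, restart_interval=5, sleep=time.sleep):
    """Try to restart a failed process; notify if every attempt fails.

    check() reports whether the process is up, restart() runs the restart
    command, notify() reports the failure. Returns True if recovered.
    """
    for _ in range(restart_retries):
        restart()
        sleep(restart_interval)  # give the process time to come up
        if check():
            return True
    notify()
    return False
```

A monitor loop would evaluate is_running every check_interval seconds and call recover_process only
when the process is missing.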