Healthcare Blocks uses a combination of tools, including a custom metrics collection agent, AWS CloudWatch, PagerDuty, and Slack, to monitor the uptime and overall health of your virtual machines and to escalate significant events to our DevOps team.
We monitor the following metrics:
- CPU usage
- Disk capacity (for root and data volumes)
- Memory usage
- System availability
We automatically receive an alert when one of the following conditions is met:
- CPU usage exceeds 90% for at least 15 minutes
- Free disk capacity falls below 10%
- Server is down
If you'd like to receive a copy of these alerts, please create a ticket and provide the desired email address or alias.
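To make the alert conditions above concrete, here is a minimal sketch of how such rules could be evaluated. It is an illustration only, not our actual monitoring agent: the `Sample` type, the `evaluate_alerts` function, the alert names, and the one-sample-per-minute cadence are all hypothetical, and it assumes that "disk capacity" refers to free space remaining.

```python
from dataclasses import dataclass
from typing import List

# Thresholds taken from the alert conditions listed above.
CPU_THRESHOLD_PERCENT = 90.0
CPU_SUSTAINED_MINUTES = 15
DISK_FREE_THRESHOLD_PERCENT = 10.0  # assumption: "capacity" means free space


@dataclass
class Sample:
    """One hypothetical per-minute reading from a virtual machine."""
    cpu_percent: float
    disk_free_percent: float
    reachable: bool


def evaluate_alerts(samples: List[Sample]) -> List[str]:
    """Return the alert conditions met by the samples (oldest first)."""
    alerts: List[str] = []
    if not samples:
        return alerts

    # CPU alert: every sample in the last 15 minutes must exceed 90%.
    window = samples[-CPU_SUSTAINED_MINUTES:]
    if len(window) == CPU_SUSTAINED_MINUTES and all(
        s.cpu_percent > CPU_THRESHOLD_PERCENT for s in window
    ):
        alerts.append("cpu_high")

    # Disk and availability alerts depend only on the latest reading.
    latest = samples[-1]
    if latest.disk_free_percent < DISK_FREE_THRESHOLD_PERCENT:
        alerts.append("disk_low")
    if not latest.reachable:
        alerts.append("server_down")
    return alerts


if __name__ == "__main__":
    busy = [Sample(cpu_percent=95.0, disk_free_percent=50.0, reachable=True)] * 15
    print(evaluate_alerts(busy))  # prints ['cpu_high']
```

The CPU rule checks a full 15-sample window so that a short spike does not trigger an alert, while the disk and availability rules fire on the most recent reading alone.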
We also monitor a variety of conditions that often lead to degraded or failed systems, including:
- Expiring SSL/TLS certificates (both internal and external)
- Uptime of core services (e.g. DNS, intrusion detection system, log shipping)
- Docker containers in an unstable state (e.g. stuck in "restarting")
- Database replication status (for high availability configurations)
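As an example of the first condition in the list above, the sketch below shows one way to check how soon a TLS certificate expires using only the Python standard library. It is not the check we run internally: the function names and the 30-day warning window are assumptions chosen for illustration.

```python
import socket
import ssl

SECONDS_PER_DAY = 86400
DEFAULT_WARN_DAYS = 30  # hypothetical warning window


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the certificate's 'notAfter' string, e.g. 'Jan 31 00:00:00 2024 GMT'.

    Note: with the default context, an already expired or untrusted
    certificate fails the handshake and raises ssl.SSLError instead.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: float) -> float:
    """Days between 'now' (seconds since the epoch) and the expiry time."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    return (expires_at - now) / SECONDS_PER_DAY


def is_expiring_soon(
    not_after: str, now: float, warn_days: int = DEFAULT_WARN_DAYS
) -> bool:
    """True when the certificate expires within 'warn_days' days."""
    return days_until_expiry(not_after, now) <= warn_days
```

A caller would typically pass `time.time()` as `now` together with the result of `fetch_not_after("example.com")`. Keeping the date arithmetic separate from the network call makes the expiry logic easy to test with fixed timestamps.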