Backup is often considered the black sheep of the IT department, typically relegated to junior engineers or generalists. "Backup guys" get very little thanks for their expertise in what is an incredibly technical, challenging, and time-consuming role, yet an absolutely vital cog in a very big wheel.
When backup is doing its job, the function is almost completely ignored unless there is a need for recovery. What many people don’t realize is that backups are the one component of IT that touches every part of your delivery, silently ensuring that bumps in the night can be recovered from.
Monitoring backups is just one part of that quiet but critical function and is a key ingredient of a managed service. What makes our job exciting (yes, I said backup is exciting!) is the constant development and improvement of the tools of our craft. Logging and monitoring tools are evolving, giving our engineers a greater depth of field when it comes to understanding complex environments.
Leveraging these data lakes to proactively inform our engineers about changes in backup state and behavior has shifted us from reactive problem solving to proactive problem prevention, not only for backups but for all points within an IT infrastructure.
The Path to Proactive Monitoring
At Assured DP, we started by collecting performance and up/down statistics across our environment using a distributed, enterprise-grade monitoring solution. Our engineers reviewed what we needed to know against the data we could collect from individual metrics. We soon discovered that watching everything we wanted to would require more monitoring points than a reasonably sized environment could handle.
Additionally, the data we were collecting stayed isolated in the monitoring system and couldn't be compared across multiple environments. We started collecting additional data via the same scripting tools we used with our monitoring tool, but fed the data into Elasticsearch so we could run unstructured queries across it.
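To make that concrete, here is a minimal sketch of the kind of document a collection script might shape for Elasticsearch, along with an example query body. The index name `backup-metrics` and the field names are illustrative assumptions, not our actual schema; any consistent schema supports this style of ad hoc querying.

```python
from datetime import datetime, timezone


def build_metric_doc(customer, job_name, duration_s, bytes_moved, status):
    """Shape one backup-job result as a JSON document for Elasticsearch.

    The field names here (and the 'backup-metrics' index the document
    would land in) are hypothetical examples of a collection schema.
    """
    return {
        "@timestamp": datetime.now(timezone.utc).isoformat(),
        "customer": customer,
        "job": job_name,
        "duration_s": duration_s,
        "bytes_moved": bytes_moved,
        "status": status,
    }


# An example query body: all failed jobs for one customer in the last
# day. This would be posted to the index's _search endpoint.
failed_jobs_query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"customer": "acme"}},
                {"term": {"status": "failed"}},
                {"range": {"@timestamp": {"gte": "now-1d"}}},
            ]
        }
    }
}
```

Because every document carries the customer name, the same query can be filtered or aggregated across any subset of environments, which is exactly what the per-environment monitoring silos couldn't do.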
Making that shift allowed us to plug in other data evaluation and visualization tools to get a different perspective into how our environment operates.
Initially, we leveraged the data to find anomalies in how a customer utilizes the environment, looking specifically for sudden changes in consumption or shifts in gross trends. We then started encountering situations where the data could help plan for changes in the environment.
A customer reached out wanting to schedule some intensive indexing jobs and asked when the backup quiet periods were so they could schedule around them. We were able to build a heatmap showing when backups were running and when they weren't, which gave the customer definitive windows for their indexing.
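The aggregation behind a heatmap like that can be sketched in a few lines: bucket each job's running time into (weekday, hour) cells, then read off the empty cells as quiet periods. The `(start, end)` job tuples are a simplified stand-in for whatever the job logs actually record.

```python
from collections import Counter
from datetime import datetime, timedelta


def activity_heatmap(jobs):
    """Count how many backup jobs touched each (weekday, hour) cell.

    `jobs` is a list of (start, end) datetime pairs, a simplified
    stand-in for real job-log records.
    """
    heat = Counter()
    for start, end in jobs:
        # Walk hour by hour from the job's starting hour to its end.
        t = start.replace(minute=0, second=0, microsecond=0)
        while t < end:
            heat[(t.weekday(), t.hour)] += 1
            t += timedelta(hours=1)
    return heat


def quiet_hours(heat, weekday):
    """Hours of the given weekday with no recorded backup activity."""
    return [h for h in range(24) if heat[(weekday, h)] == 0]


# Example: one job running 01:30-03:15 on a Monday.
jobs = [(datetime(2024, 1, 1, 1, 30), datetime(2024, 1, 1, 3, 15))]
heat = activity_heatmap(jobs)
```

Rendered as a grid of cell counts, this is the heatmap; the `quiet_hours` list is the answer the customer was after.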
Having control over what data we collect and how we view it also lets us consolidate data for alerting on situations.
For example, we collect the seconds offset for all replication jobs across the cluster and surface the min, max, and median replication times by retention group. Surfacing all three values lets us determine whether an individual retention group, or retention groups as a whole, is experiencing a replication slow-down and drifting further behind. The min/max gives us a view into an individual object; the median tells us about the entire environment.
By collecting these numbers outside of conventional monitoring platforms, we take the load off our key-value polling system and shift it into our data lake, which is much better suited to those kinds of transactions.
Continuing to find improved and innovative ways to help our customers is what makes the technical people at Assured DP 'tick'. Our shift to proactive monitoring has helped us convert backups from being valuable only when needed to providing valuable insight into our customers' production platforms.