
App Health Dashboard

Prompt
Product teams at the company had built hundreds of apps, but there was no standard way for those teams to monitor the health and efficiency of their apps. Every team had come up with its own process, and while it was great that teams were taking initiative, the result was a huge lack of consistency. My team, Reliability Engineering, was responsible for creating one central location where the hundreds of product teams across the company could monitor their performance metrics.
Research
We had requirements from the business side, but of course we still needed to talk to our users, Reliability Engineers, and understand what they needed and wanted from this app. A Reliability Engineer works to identify and manage reliability risks that could adversely affect business operations.
I set up interviews with 4 Reliability Engineers and 6 Sr. Reliability Engineers who worked out of our Atlanta, GA and Austin, TX offices.
Research Findings
There was a lot of overlap between what the REs wanted and what the business wanted.
Volume, latency, number of open tickets, and number of orders (when available) were the most mentioned metrics on both sides.
There was some concern about having to trust a brand-new system. What if the information was inaccurate?
Teams were tracking metrics daily but typically reported to leadership on a weekly basis.
5 of the 10 REs were working on teams that had already built their own tracking service. The other 5 were tracking manually or relying on reports from users to identify issues.
“As a Reliability Engineer, I need the process of monitoring the health of my app to be efficient and easy to comprehend.”
Solutioning and Usability Testing
With the research complete and organized, it was time to begin designing mockups for the Service Level Objectives Dashboard that I would later test with 10 future users. I knew the design needed to address some of the key findings from our research. Our solution needed to:
Give REs quick access to information that helps them to identify potential risks prior to a service going down.
Track metrics around volume, availability, latency, errors, tickets, and orders.
Streamline reporting for teams to leadership.
Allow teams to manually edit information when needed until trust is gained in the service.*
*The business was not particularly excited about allowing users to edit the information, so for now the design would not include this feature.
Usability Findings
I facilitated user testing sessions with 10 future users of the product. This time I recruited 5 Reliability Engineers and 5 Sr. Reliability Engineers out of the Atlanta and Austin offices. Users seemed to be satisfied with the process of selecting the information they wanted to see, but based on the testing, we had a lot of changes to make to the dashboard itself. The findings from testing indicated that we needed to:
Move the table from the bottom of the page to the top of the page.
The REs felt that viewing the table first would help them identify problems more quickly than viewing the graphs first.
REs liked the graphs in general but felt the graphs would be more helpful to leadership.
Remove the red highlight on “problem” cells and opt for red font instead.
The red highlight creates a sense of panic, while the font simply calls attention to potential problem areas.
If displaying more than 7 dates on the table, the number of days needs to be clearly indicated.
Add an option that allows users to edit the data.
Users felt very strongly that they should have an option to edit the data until the system proved that it would always provide accurate results to them.
Every user I tested with stated that without this option they would not use the SLO Dashboard and they would continue to rely on the process they already had in place for testing.
Make the language for the option to register a new service on the site clearer.
Make the option for registering a new service more visible.

Mockups designed using Sketch. These designs were created in 2017; please see below for an updated design.
What’s Next?
With the business eager to get this into the hands of users, the engineers began building the dashboard. I left the team to begin working on a different product, so I don’t know if the product was iterated on. However, I decided it was time for the dashboard to get a little facelift. Since working on this product, my design skills have evolved, and I wanted to show that here.
In addition to updating the look and feel of the interface, I wanted to focus on simplifying the table view and consolidating the visual data. I removed the table grid along with 2 of the table rows (SLOs and VALET Metrics at the top) because this was information the user was already familiar with. Rather than provide users with unnecessary info, I wanted them to be able to focus on the actual numbers, which helps them identify potential problem areas more efficiently.
I consolidated the visual data because it was never that important to the users (they preferred the table view), yet it took up a very large amount of real estate on the screen.

Design created using Sketch.
Software Used
Sketch
Marvel
Appear.in
QuickTime