
New Sheriffing feature and significant updates to KPI reporting queries

A year ago I shared how a Mozilla Performance Sheriff catches performance regressions, the entire workflow they go through, and the upcoming improvements. Since I joined the Performance Tools Team (formerly Performance Test) almost five years ago, many improvements have been made and many features have been added.

In this article, I want to focus on a special set of features that give the Performance Sheriffs more control over the Sheriffing Workflow (from when an alert is triggered and triaged to when the regression bug is filed and linked to the alert). We call them time-to-triage (from alert to triage) and time-to-bug (from alert to bug). They are the subject of our Sheriffing Team’s KPIs, the KPIs that measure the performance of the Performance Sheriffs team (I like puns).

The time-to-triage KPI measures the time from when an alert is triggered by a performance change to when it is triaged (basically, the first-time analysis). The target is at most 3 days, and at least 80% of the sheriffed alerts have to meet this deadline (so up to 20% may miss it). However, our team does not work weekends, so weekends have to be excluded. For example, if an alert is created on a Friday, three calendar days end on Monday, whereas the three business days actually expire on Wednesday. Counted in calendar days, we effectively get a single working day to triage it. Every time this happened, we had to manually exclude those alerts from the old KPI report queries, which do not exclude weekends from the measured times. The new queries do this exclusion automatically.

 

Triage Response Times (time-to-triage), Year To Date

Triage Response Times (New Query), Year To Date

Alerts Exceeding Triage Target, Year To Date

The same is true for an alert created on a weekend, where part of the alert-to-triage time falls on the weekend. In fact, the only alerts whose three-day triage window cannot overlap a weekend are the ones created on Monday and Tuesday.
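The weekend exclusion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual backend code: the function name `triage_due_date` is hypothetical, and the choice to start the clock at Monday 00:00 for alerts created on a weekend is an assumption made for this example.

```python
from datetime import datetime, timedelta


def triage_due_date(created: datetime, business_days: int = 3) -> datetime:
    """Return a due date `business_days` working days after `created`.

    Hypothetical helper: Saturdays and Sundays are never counted.
    """
    start = created
    if start.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        # Assumption: an alert created on a weekend starts its clock
        # on the following Monday at 00:00.
        days_to_monday = 7 - start.weekday()
        start = (start + timedelta(days=days_to_monday)).replace(
            hour=0, minute=0, second=0, microsecond=0
        )

    due = start
    remaining = business_days
    while remaining > 0:
        due += timedelta(days=1)
        if due.weekday() < 5:  # only Monday-Friday consume the budget
            remaining -= 1
    return due


# An alert created on Friday 2023-06-02 at 10:00 is due on Wednesday,
# not on Monday as plain calendar arithmetic would suggest.
print(triage_due_date(datetime(2023, 6, 2, 10, 0)))  # 2023-06-07 10:00:00
```

With this rule, alerts created on Monday or Tuesday get the same due date as with calendar days, while every other creation day is pushed past the weekend.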

The time-to-bug KPI measures the time from when an alert is triggered by a performance change to when a bug is linked to the alert. The target is at most 5 days, and at least 80% of the valid regression alerts must meet this deadline (so up to 20% may miss it). The only alerts whose time-to-bug window cannot overlap a weekend are the ones created in the first hour of Monday morning, whose window ends in the last hour of Friday.

Regression Bug Response Times, Year To Date

Regression Bug Response Times (New Query), Year To Date

Regressions Exceeding Bug Target, Year To Date

In the images above, you can see the difference in the percentages for time-to-triage (86.9% with the old query vs. 97.9% with the new query) and time-to-bug (75.7% with the old query vs. 97% with the new query). This is not because the Sheriffing Team is suddenly doing a better job; they were performing at this level the whole time. It is because the feature we developed measures the percentages accurately by excluding the weekends from the calculated times. Going strictly by the percentages, the impact of this feature is significant, taking us from an average (maybe struggling) performance to a really good one. Of course, we had known for a while that the KPI report included weekends, but having the bigger picture and concrete metrics is more revealing.
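To make the effect on the percentages concrete, here is a minimal Python sketch that compares a calendar-time measurement with a weekday-only measurement. It is not the real KPI report query: the function names are hypothetical, and the four alerts are invented sample data chosen only to show how weekends inflate the calendar-based figure.

```python
from datetime import datetime, timedelta


def calendar_elapsed(start: datetime, end: datetime) -> timedelta:
    """Plain wall-clock difference, weekends included."""
    return end - start


def business_elapsed(start: datetime, end: datetime) -> timedelta:
    """Elapsed time counting only the portions that fall on Monday-Friday."""
    total = timedelta()
    cursor = start
    while cursor < end:
        # Walk day by day, stopping at each midnight or at `end`.
        next_midnight = datetime.combine(
            cursor.date() + timedelta(days=1), datetime.min.time()
        )
        chunk_end = min(next_midnight, end)
        if cursor.weekday() < 5:  # skip Saturday (5) and Sunday (6)
            total += chunk_end - cursor
        cursor = chunk_end
    return total


def percent_within_target(alerts, target, elapsed):
    """Share of (created, handled) pairs whose elapsed time is within target."""
    met = sum(1 for created, handled in alerts if elapsed(created, handled) <= target)
    return 100.0 * met / len(alerts)


# Invented sample alerts: (created, triaged). Not real sheriffing data.
alerts = [
    (datetime(2023, 6, 2, 10, 0), datetime(2023, 6, 6, 9, 0)),   # Fri -> Tue
    (datetime(2023, 6, 5, 9, 0), datetime(2023, 6, 6, 9, 0)),    # Mon -> Tue
    (datetime(2023, 6, 5, 9, 0), datetime(2023, 6, 9, 9, 0)),    # Mon -> Fri
    (datetime(2023, 6, 8, 12, 0), datetime(2023, 6, 12, 11, 0)), # Thu -> Mon
]
target = timedelta(days=3)

print(percent_within_target(alerts, target, calendar_elapsed))  # 25.0
print(percent_within_target(alerts, target, business_elapsed))  # 75.0
```

The two alerts that span a weekend miss the target only under the calendar measurement; the Monday-to-Friday alert misses it under both, so the weekday-only measurement does not hide genuinely late triage.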

The development of these time-to-triage/time-to-bug features was full-stack work and involved:

  • Helping our manager’s Sheriffing report calculate the times more accurately (I am grateful to our manager for supporting this initiative);
  • Modifying the performance_alert_summary database table to store due dates;
  • Implementing the accurate calculation in the backend as described above;
  • Showing a countdown in the UI until the alert goes overdue, which gives the Performance Sheriffs more control and the ability to better organize themselves throughout the Sheriffing Workflow.

I haven’t mentioned the countdown feature yet. It is shown in the image below, right next to the status dropdown of the alert summary (top-right corner). It displays:

  • The type of due date that is in effect (Triage in this case);
  • The amount of time left. When it drops under 24 hours, the timer switches to showing the hours left.

The alert becomes triaged, and the counter switches from triage to bug, when the first-time analysis is performed on it (star, assign, add tag, add note).

Alert with Triage due date status


Below is an example of a time-to-bug timer (the time left before linking the alert to a bug becomes overdue). By default the timer counter is green, but when the timer goes under 24 hours, it turns orange.

Alert with Bug due date status

When the timer goes overdue, as the image below shows, the counter icon becomes red and the “Overdue” status is shown.

Alert with Overdue status (this is for demo purposes only, the alert wasn’t overdue for real)

Lastly, after the alert is finally linked to a bug, the counter will turn into a green checkmark and the countdown status will be “Ready for acknowledge”.

Alert with Ready for acknowledge status
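The countdown states described above can be summarized as a small decision function. This is a Python sketch for illustration only, not the actual UI code: the name `countdown_status`, the returned `(label, color)` tuple, and the exact label wording are assumptions, and so is applying the orange under-24-hours rule to both the triage and the bug timers.

```python
from datetime import datetime, timedelta


def countdown_status(now: datetime, due: datetime, kind: str, bug_linked: bool = False):
    """Return a hypothetical (label, color) pair for the countdown widget.

    `kind` is the due date type in effect, e.g. "Triage" or "Bug".
    """
    if bug_linked:
        # Once a bug is linked, the countdown is replaced by a green checkmark.
        return ("Ready for acknowledge", "green")

    remaining = due - now
    if remaining <= timedelta(0):
        return ("Overdue", "red")
    if remaining < timedelta(hours=24):
        # Under 24 hours the timer switches to hours and turns orange.
        hours = int(remaining.total_seconds() // 3600)
        return (f"{kind}: {hours} hours left", "orange")
    return (f"{kind}: {remaining.days} days left", "green")


due = datetime(2023, 6, 8, 0, 0)
print(countdown_status(datetime(2023, 6, 5, 12, 0), due, "Triage"))  # ('Triage: 2 days left', 'green')
print(countdown_status(datetime(2023, 6, 7, 6, 0), due, "Triage"))   # ('Triage: 18 hours left', 'orange')
print(countdown_status(datetime(2023, 6, 8, 1, 0), due, "Bug"))      # ('Overdue', 'red')
```

Keeping the state logic in one pure function like this makes each transition (green, orange, red, checkmark) easy to test in isolation from the rendering code.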

Now, instead of manually excluding the times inflated by the weekends, we have an automated feature to closely control the alert lifecycle and report the KPI percentages more accurately.

The development of this feature was a personal initiative, encouraged by our manager and by the whole team (without their support I couldn’t have done this). It is part of a wider initiative I support: improving the Performance Sheriffing Workflow. That initiative improves the developer experience of working with performance regressions and helps the Performance Sheriffs be more efficient by improving their tools and automating as much of their workflow as possible.
