Working with IT: Bug submissions

cshields

During a recent Mozilla all-hands event, Laura Thomson gave a short presentation titled “Working with IT”.  Laura was the right person to give it, and the feedback we gathered afterwards is that we need to help people understand how to work with IT, and help you all understand how our infrastructure works.  Expect more brownbags and posts around this topic.

So, let’s start by talking about bugs.

Whether or not Bugzilla is the right tool to track and manage IT projects and requests is up for debate. The benefit of using Bugzilla is that it integrates with the rest of the project, since it is used for everything else at Mozilla.  That said, let’s talk about how IT works with bugs and how you can help us when you file them:

Please do not assume tribal knowledge in bugs.

In the past 30 days, the IT Systems and Ops teams have added 5 new sysadmins.  This is great, and they are all ramping up quickly. While we are throwing out numbers, we have seen 119 new bugs added to the Server Operations component in the past 7 days.  We want our new sysadmins to help out in these bugs as quickly as they can.  When submitting a bug, please assume as little tribal knowledge as possible on the other side.  For instance, asking for a setting change in a production site without telling us which site you work on delays the bug while someone either asks you for clarification or asks the rest of the team what you mean.  These are minor delays of course, but when they happen multiple times a day they become very inefficient.  If you have a doc that gives background on the request you are making, please link to it.  If you know which system you are asking us to change, please note it in the bug.

Where does my bug go?

The IT team is growing quickly, and so is the need to sort our bugs into components, lest we spin our wheels all working from one component.  Here is the layout of our components for bugs coming from you as it stands today (note the change in Web Operations):

  • Server Operations: Web Operations – this is where all web-related bugs should go.  This is new, and is modified from the old “web content push” component to encompass web server problems, new web projects, and any general request regarding the serving of our websites.
  • Server Operations: Desktop Issues – this is where the desktop team currently works. Laptop issues, software license requests, and help with the office environment should all go here.
  • Server Operations: RelEng – Any issues regarding the release engineering build systems (aka “the build network”) should go here.
  • Server Operations: Netops – Network requests and issues should be filed here.
  • Server Operations: Labs – Mozilla Labs IT requests go in here.
  • Server Operations: ACL Request – Firewall requests for Netops.
  • Server Operations – Everything else that did not fall into one of the above.
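
As a rough illustration of the layout above, the choice of component can be thought of as a lookup with a catch-all. This is only a sketch: the component names come from the list, but the short category keys (`web`, `desktop`, and so on) and the `component_for` helper are hypothetical and not part of Bugzilla.

```python
# Component names are taken from the list above; the category keys are
# hypothetical labels chosen for this sketch.
COMPONENTS = {
    "web": "Server Operations: Web Operations",
    "desktop": "Server Operations: Desktop Issues",
    "releng": "Server Operations: RelEng",
    "network": "Server Operations: Netops",
    "labs": "Server Operations: Labs",
    "firewall": "Server Operations: ACL Request",
}

# Everything that does not fall into one of the above.
DEFAULT_COMPONENT = "Server Operations"


def component_for(category: str) -> str:
    """Return the component for a request category, or the catch-all."""
    return COMPONENTS.get(category.strip().lower(), DEFAULT_COMPONENT)


if __name__ == "__main__":
    print(component_for("firewall"))
    print(component_for("something else"))
```

If you are unsure which category your request belongs to, the catch-all component is the safe choice.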

Priority and escalation

The default priority for our bugs (the severity field in Bugzilla) is “normal”.  We will get to these as soon as we can, and by nature of your request we assume that you want them done as soon as possible.  If your request does not fall under that assumption and belongs in the “nice to have someday” category, mark it as an enhancement. Anything higher than normal demands attention soon.  Our SLA for addressing bugs higher than normal is as follows:

  • Major – 24 hours
  • Critical – 8 hours
  • Blocker – immediately

These timers run around the clock, and if a bug sits unaddressed beyond those times, our oncall is paged. Blocker IT bugs will page oncall immediately.  We cannot guarantee that the request will be resolved within this time (e.g., if you file a critical bug for a new cluster of servers, it will take us time to procure them first), but admins will be aware of it and will start working on it.  In addition, we have our own internal prioritization of issues that come in.  If a critical bug in a dev site comes in, it may have to wait for work that we are doing on a production site.
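
To make the timers concrete, here is a minimal sketch of the paging rule described above. It is not our actual tooling: the `SLA_WINDOWS` table mirrors the list above, while the function names and the `addressed` flag are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Windows from the SLA list above. Severities without an entry
# ("normal", "enhancement", ...) never page oncall in this sketch.
SLA_WINDOWS = {
    "blocker": timedelta(0),        # pages immediately
    "critical": timedelta(hours=8),
    "major": timedelta(hours=24),
}


def page_deadline(severity: str, filed_at: datetime) -> Optional[datetime]:
    """Return when oncall would be paged, or None if no SLA applies."""
    window = SLA_WINDOWS.get(severity.lower())
    if window is None:
        return None
    # The timers run around the clock, so no business-hours adjustment.
    return filed_at + window


def should_page(severity: str, filed_at: datetime, now: datetime,
                addressed: bool) -> bool:
    """True if the bug is still unaddressed at or past its deadline."""
    deadline = page_deadline(severity, filed_at)
    if deadline is None or addressed:
        return False
    return now >= deadline


if __name__ == "__main__":
    filed = datetime(2011, 6, 1, 22, 0, tzinfo=timezone.utc)
    print(page_deadline("critical", filed))
    print(should_page("major", filed, filed + timedelta(hours=30), False))
```

Note that "addressed" here means someone has picked the bug up, not that it has been resolved.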

That was a lot to read..

And if you are still with me, thanks for taking the time to understand how we work in Bugzilla. By getting bugs filed more efficiently, we can spend less of our time refining the bugs and more time fixing them.

9 responses

  1. David Ascher wrote on :

    For things beyond the 24h window, do you have timers that actually enforce the SLA and bump up things that will take time to implement, so that they get addressed before it becomes urgent? I’ve seen some bugs filed early to give plenty of time for y’all to plan, but because they never reach Major status, keep getting deferred, until people get annoyed enough that they get bumped up to Major just to try and get attention.

    1. cshields wrote on :

      No, nothing automated like that.. What you are referring to is a lack of human resources to get our entire load done, which is another post altogether. I won’t go into how many bugs we are sitting on right now, but we realize that we are far behind what is expected of us.

      One of the points made in this post is that it hurts to have to go back and forth a few times for details, tribal knowledge, etc. rather than letting our new team mates just run with bugs and get stuff done. We are looking at little ways to improve our efficiency like that.

  2. David Ascher wrote on :

    @cshields: Actually, I think it’s really important to, somehow, make the SLAs you advertise (e.g. “next week”) operational — right now you’re incenting people to put everything in the urgent bucket, which we know is wrong. You’ll never have enough resources to get the entire job done, but there’s no reason not to have clarity (and maybe even transparency) on workload.

  3. cshields wrote on :

    I’m a bit confused.. Resources are finite and would be spent by the same time whether or not a bug is filed initially with a certain severity.

    The SLA is that the bug will be addressed, not necessarily resolved. In some cases this means “assigned” as it could be a 2-week project that we need to fit into our schedule. But you are right, if I am confusing people into thinking that the higher severity will lead to a quicker resolution I might need to fix that. :) The hope was to educate people for the opposite result, we’ve had a lot of stuff come in as blockers and critical bugs when they really aren’t, or the submitter did not understand what that meant in the end.

  4. Dave Dash wrote on :

    Maybe it’s because I work on a lot of sites, but I usually try to prefix bugs with the project’s name, e.g.:

    [input][input-stage] Update SITE_URL in settings/initial_local.py

  5. Jake Maul wrote on :

    @davedash: that type of thing is very helpful to us. It makes it immediately obvious which servers/sites are going to be involved, without even having to open the bug. Heck, you don’t even have to read the bugmail, just see the subject line. :)

    I <3 bugs that have a subject like your example.

  6. Axel Hecht wrote on :

    I’ve come across a few IT bugs where “some guy in IT” asked me what “some other guy in IT” did. Which makes me comment on your quest to not rely on “tribal knowledge”. Totally, we shouldn’t require tribal knowledge, but please, document what you’re doing instead of relying on tribal knowledge outside of IT.

    Concrete examples: On one hand, my tribal knowledge has confused you before, like whether LDAP issues are LDAP issues or mail issues or mail-talking-to-LDAP issues.
    Or, IT asking me if I’m using some VM just because it happens to have the four magic letters “l10n” somewhere in its name. I had no idea. I also got questions about which cluster or colo particular stuff would be in. Awesomebar failed me on bug links, otherwise I would have added them.

    Also, “documentation” doesn’t need to be a concise wiki page. Just marking up bugs so that you can find the bug responsible for the existence of a resource would do. Say, you’re wondering what elmo4.stage.seamicro.phx1.mozilla.com or even node86.seamicro.phx1.mozilla.com is doing. I’d rather have you set things up so that you easily find bug 652792 than have to hunt me or Laura down and ask. Do you have something like that?

    1. cshields wrote on :

      Axel: Totally agree.. In the past our documentation has been subpar. We’ve been working on it in the past year. It is a big improvement in my opinion. We have the combination of an inventory system, internal wiki, and other resources to serve as documentation, e.g., searching our wiki space for elmo4 should pull up the elmo page. As for the questions you get, keep in mind that only 5 out of 18 of the people in the Systems and Ops teams alone were here a year ago. So, while we continue to count on the veterans to fill in the gaps, we ask those of you who have been here longer than the rest of us to continue to help. Thanks!

      The tribal knowledge I allude to in the post is not always related to the setup of the service, but could be related to the workflow and culture of the day. If someone emails a request to “push prod”, a few of us might have that knowledge of which site the requestor works on from day to day, but with close to a couple hundred web properties it would help to specify which one on the bug. This connection of requestor-to-request is not something I wish to document fully, especially given that some people move around and teams swap services.

  7. Axel Hecht wrote on :

    Good to hear, and makes sense.