Notes from Nagios World Conference 2013

Ashish Vijayaram

Nagios World Conference 2013 was held between Sep 30th and Oct 3rd at St. Paul, MN. I represented Mozilla IT/SRE along with Sheeri Cabral, who spoke about MySQL plugins. I wanted to share some observations and my best takeaways from the conference. I attended about 10 talks in all and spent more time discussing setups and best practices.

The biggest draw at the conf this year was Nagios 4.0 that was announced at last year’s keynote. 4.0 brings in some long awaited and much needed rocket power to Nagios. The changelog has detailed information about the big features but the ones that interested me the most were:

  • Core Workers – I have been researching how to scale up service check execution on some of our bigger instances. mod-gearman has been the tool of choice until now, but with Core Workers, Nagios natively steps up to the task. The legacy fork-for-each-check model was unsurprisingly hitting limits in some places, and 4.x replaces it with worker processes that get check execution delegated to them. There is a massive performance gain and I’m looking to leverage that instead of integrating with mod-gearman.
  • Query Handlers – This feels like baked-in MK Livestatus, made available via a socket. Unlike Livestatus, it doesn’t yet have many advanced features and is mostly basic at the moment. I’d expect it to get a lot of attention in future versions.
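
To make the Core Workers idea concrete, here is a rough sketch of the pattern only, not of Nagios internals (Nagios workers are separate processes written in C). It assumes a fixed pool of workers that receive check commands, rather than the scheduler spawning a helper for every check. The names `run_check`, `run_checks`, and `EXAMPLE_CHECKS` are made up for this illustration; the exit codes follow the usual plugin convention (0 = OK, 2 = CRITICAL).

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_check(command):
    """Run one check command; return (exit code, first line of its output)."""
    result = subprocess.run(command, capture_output=True, text=True, timeout=10)
    lines = result.stdout.strip().splitlines()
    return result.returncode, lines[0] if lines else ""


def run_checks(commands, workers=4):
    """Delegate every check to a fixed-size pool of workers.

    The pool size stays constant no matter how many checks are queued,
    which is the point of the worker model. Results come back in the
    same order as the commands.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_check, commands))


# Hypothetical stand-ins for real plugins, so the sketch runs anywhere.
EXAMPLE_CHECKS = [
    [sys.executable, "-c", "print('OK - fine')"],
    [sys.executable, "-c", "import sys; print('CRITICAL - down'); sys.exit(2)"],
]

if __name__ == "__main__":
    for code, text in run_checks(EXAMPLE_CHECKS, workers=2):
        print(code, text)
```

The scheduler only hands out work and collects results; how many checks run at once is bounded by the `workers` argument.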

Among other things, I’m looking forward to integrating Multisite in our infrastructure. We have close to a dozen Nagios instances here at Mozilla and our primary interface to each is via IRC bots. As one would imagine, that doesn’t scale well and isn’t ideal for dealing with mass changes. This is where Multisite comes in very handy. Along with Livestatus, Multisite provides a supercharged way to deal with multiple instances and the many services and hosts within each. Do try out the demo because it’s hard to put awesome in words :)
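
For a feel of what Livestatus offers underneath Multisite, here is a minimal sketch of asking several instances for their CRITICAL services. It assumes each instance exposes a Livestatus Unix socket at a path you supply; the paths and the helper names (`build_query`, `query_livestatus`, `critical_services`) are hypothetical, and this is not how Multisite itself is implemented. The query text uses the Livestatus query language (`GET`, `Columns:`, `Filter:`, `OutputFormat:`), terminated by a blank line.

```python
import json
import socket


def build_query(table, columns, filters=()):
    """Build a Livestatus query string; a blank line ends the query."""
    lines = ["GET " + table, "Columns: " + " ".join(columns)]
    lines += ["Filter: " + f for f in filters]
    lines.append("OutputFormat: json")
    return "\n".join(lines) + "\n\n"


def query_livestatus(socket_path, query):
    """Send one query to a Livestatus Unix socket and decode the JSON rows."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(query.encode("utf-8"))
        sock.shutdown(socket.SHUT_WR)  # signal that the query is complete
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return json.loads(b"".join(chunks).decode("utf-8"))


def critical_services(socket_paths):
    """Ask every instance for services in state 2 (CRITICAL)."""
    query = build_query(
        "services", ["host_name", "description", "state"], ["state = 2"]
    )
    return {path: query_livestatus(path, query) for path in socket_paths}
```

Calling `critical_services` with a list of socket paths (assumed, one per instance) returns one list of rows per instance, which is roughly the kind of cross-instance view that gets awkward over IRC bots.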

Some nice talks that stood out:

  • Effective monitoring by Rodrigue Chakode, where he spoke about filtering out false alerts in favor of actionable ones and using business processes to monitor the most effective elements of a system.
  • Nagios at Etsy by Avleen Vig, who had an eventful road trip to the conference and discussed some cool things Etsy has done, particularly measuring alert fatigue by correlating alerts with sleep data from the Fitbits worn by on-call engineers. He also spoke at length about “monitoring hygiene” and how Etsy went from 300 alerts/day to 45 alerts over the course of two years.

In all, it was a great conference, like last year. Looking forward to a year of 4.x and trying to get the in-house Puppet module out on GitHub ;)

MySQL Workbench “Clean up SQL” Feature

Sheeri

I was playing around with MySQL Workbench earlier in the week, and ran across the “clean up SQL” feature, which I thought was neat. Here’s a picture-based demonstration – you can click on the pictures to make them bigger, so they are more readable.

Here is a typical complex query that looks pretty good formatted in the results from a performance schema query:
[Image: query from performance schema]

Simply click the “broom” icon and watch as your SQL is cleaned up, with one field in the SELECT per line and the JOINs indented and formatted prettily:
[Image: nicer, cleaned up SQL]

Pretty cool, for just the click of a button!

IT goings-on

phrawzty

For those curious what IT has been up to lately, wonder no longer – here’s a quick status update from the past week, highlighting just some of the great stuff the Mozilla IT team has been working on recently.

First up, our team is growing:

  • Welcome Chris Knowles, who will be officially helping us out as a Storage and Virtualisation admin (unofficially, he’s going to help us with an even greater challenge: safely landing our Kerbals on the Mun).
  • A big hello to the entire OpSec team, who are now part of the IT team proper.  As separate entities we already had a close and fruitful working relationship – now that we’re all together, things are only going to get better!  Expect more updates about all the interesting stuff we’ll be working on together – as long as it’s not classified top secret. ;)

Our fabulous SRE team had an epic bug squashing session during which they reduced their open bugs by over 40%.  Given how diverse their queue is at any given moment, this is quite the accomplishment – great work guys!

Speaking of SREs, our very own Dumitru Gherman, along with developer Emma Irwin, made the long trip from Mountain View out to London to host a session on “Hacking your online safety” at MozFest.

Local Windows machine deployments at our San Francisco office used to take hours, but thanks to the efforts of Mike Poessy at the SFO service desk, via a combination of templating and new imaging techniques, this has been reduced to as little as 20 minutes.  This newly streamlined process will almost certainly find its way to our other offices as well.

And last, but most certainly not least, the Release Operations team has been very busy:

  • Jake Watkins wrote a module for Windows called “metric-collective” that polls system stats and forwards the results to graphite/statsd.  This is going to be rolled out to all the Windows build and test systems during the next release cycle.  Combined with a related initiative to roll out collectd across all of the OS X and Linux machines, we’re going to have a whole new level of graphing and trending available across the entire release infrastructure.
  • Mark Cornmesser and Q Fortier have been hard at work on a brand new imaging and management mechanism for our Windows 2008 build hosts.  The biggest change is a move away from our old, manually maintained monolithic image, to a modular image complete with proper change and configuration management.  This new platform is currently being tested on a number of project branches and is expected to be rolled out to mozilla-central, try, and inbound in the next few weeks.
  • Dustin Mitchell worked with the Auto Tools team to set up an independent Puppet instance, itself destined to automate the management of the entire QA Mozmill CI infrastructure.  He also set up local Python and NPM mirrors for use by Mozillians, providing a compelling model and implementation for other use-cases going forward.
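
For readers unfamiliar with the "poll stats, forward to statsd" pattern mentioned above, here is a minimal sketch. It is not the metric-collective code, which isn't shown here; it only assumes a StatsD daemon listening on UDP (port 8125 is the customary default) and uses the plain StatsD gauge line format `name:value|g`. The function names, the `hosts.example` prefix, and the sample stats are made up for illustration.

```python
import socket


def format_gauge(name, value):
    """Format one StatsD gauge line, e.g. 'hosts.example.cpu.load:0.5|g'."""
    return "{}:{}|g".format(name, value)


def send_gauges(stats, host="127.0.0.1", port=8125, prefix="hosts.example"):
    """Send each stat as a separate UDP datagram to a StatsD daemon.

    `stats` maps a metric name to a numeric value. The host, port, and
    prefix defaults are assumptions for this sketch.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        for name, value in sorted(stats.items()):
            line = format_gauge(prefix + "." + name, value)
            sock.sendto(line.encode("ascii"), (host, port))
    finally:
        sock.close()


if __name__ == "__main__":
    # Hypothetical sample values; a real collector would poll the system.
    send_gauges({"cpu.load": 0.5, "disk.free_gb": 120})
```

UDP keeps the collector fire-and-forget: a build machine is never slowed down by a metrics server that is unreachable.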

As always, if you have any questions or want to learn more, feel free to comment below or hop on to #it on irc.mozilla.org.  See you next time!

Learn MySQL for Free with MySQL Marinate, Season 3!

Sheeri

The 3rd season of MySQL Marinate begins October 1st, 2013*. Join the meetup group and RSVP at season 3 to join! You can do the work on MySQL, or if you prefer, MariaDB or Percona.

If you do not have the book yet, you can still do the first week by using the online material from “Browse Contents” on the O’Reilly book page for Learning MySQL. There is homework for week 1; see the master list for all the information.

If you would like to learn MySQL from the ground up, consider joining us. This is for beginners – if you have no experience with MySQL, or if you are a developer who wants to learn how to administer MySQL, or an administrator who wants to learn how to query MySQL, this course is what you want.

If you are not a beginner, you are welcome to join too – maybe you need a refresher, or maybe you just want to test your knowledge or earn badges. That’s OK too!

The format of a virtual self-study group is as follows:

Each participant acquires the same textbook (Learning MySQL, the “butterfly O’Reilly book”, published 2007). You can acquire the textbook however you want (e.g. from the library or from a friend) but if you buy the book, we ask that you buy it from our Amazon Store, to help pay for meetup fees.

Each participant commits to read one chapter per week, complete the exercises and post a link to the completed work. Tweet using the hashtag #mysqlmarinate.

Each participant obtains assistance by posting questions to a discussion area set up on the Virtual Tech Self Study Message Board for each chapter.

Each participant receives a badge upon finishing each chapter and all assignments.

Note: There is no classroom or video instruction.

How do I get started?

Become a member of the Virtual Tech Self Study Meetup Group.

Register for MySQL Marinate. RSVP to this event: Yes

Acquire the book (the only item that may cost money). Get your hands on Learning MySQL – see if your local library has it, if someone is selling their copy, or buy it from our Amazon Store (this helps pay for meetup fees).

When your book arrives, start your virtual learning by reading one chapter per week. Complete the exercises; if you have any questions, comments or want to learn more in-depth, that’s what the forums are for!

Learning MySQL

FAQs:

Q: How long will the course last?

A: We will cover 12 lessons (chapters) in the book, so 12 (twelve) weeks starting October 1st, though we will have one week that is a break so that you can catch up if you need to or you have a week off if you need it. Refer to the MySQL Marinate Season 3 Master Discussion List for specific dates.

By January 1st, 2014, you will know MySQL!!

Q: Can I get ahead?

A: Sure! This is go-at-your-own-pace. To prevent spoilers, please put comments in the appropriate chapter threads.

Q: Does this cover the Percona patch set or MariaDB forks?

A: This covers the basics of MySQL, which are immediately transferable to Percona’s patched MySQL or MariaDB builds.

Q: What do I need in order to start the course?

A: All you need is the book and access to a computer, preferably one that you have control over. Installing MySQL is chapter 2, so really, all you need is the book and a computer to start; you don’t have to worry about any prerequisites. If you do not have the book yet, you can still do the first week by using the online material from “Browse Contents” at the

Mozilla IT at Agile 2013

bburton

I had the privilege of attending and speaking at the Agile 2013 conference from August 6th to 9th in Nashville, TN. Agile 2013 was the 12th annual conference of the Agile Alliance. It was a huge conference, with over 2000 people in attendance. That is quite a bit bigger than the conferences I’m used to attending or presenting at, and it was my first time visiting Nashville, so that was also a treat.

DevOps Track:

This year they added a DevOps track, to which I submitted a talk about Mozilla IT WebOps’ work on enabling a devops culture through self-service tools, in particular our work building an in-house PaaS offering based on ActiveState’s Stackato product.

The DevOps track was excellent and attracted a great crowd of speakers and attendees interested in what DevOps is about. I attended almost all the DevOps track talks and also got some great hallway track time with folks like Gene Kim, John Willis, Andrew Clay Shafer, Karthik Gaekwad, Gareth Bowles, Dominica DeGrandis, and many more. I thoroughly enjoyed all the talks, but I wanted to highlight a few in particular.

Talk: How DevOps changed Everything

Karthik Gaekwad, a web engineer at Mentor Graphics Embedded, shared their devops transformation in a talk entitled “How DevOps changed Everything”. He explained how they started with an environment where Waterfall development was being done and the Dev, Ops, QA, etc. teams were all silos, and this wasn’t working. To try to improve this they started by adopting Scrum with two-week iterations. This helped improve things on the dev side and brought some improvements with IT as well, but IT was still mostly thinking “how can we be agile when it takes weeks just to order servers?”. At the same time, management started asking about this “Cloud” stuff, so now IT was trying to figure out how to be more “agile” and also how to do “Cloud”. Enter DevOps! They focused on two things: improving culture and automation.

They worked on culture by committing to do Scrum and, even though they knew it’d be difficult, sticking with it. “Our 1st sprint(s) sucked!” To improve automation they focused on building a platform for their developers, one that would have automation, APIs, instrumentation, easy deploys, etc. baked right in. They found that “Devs and ops understand good architecture” and a platform helped everyone speak the same language.

Finally, Karthik offered some thoughts on what DevOps looks like as you make progress with it.

DevOps 101

  • Config Management – find easy wins, config files
  • Monitoring – use tools that make it easy, CloudKick, New Relic, StatsD
  • Log Aggregation – make logs easy to view, search, chart, Logstash, Graylog2, Splunk, Sumologic

DevOps 201

  • Culture – every team is different, find what works for you
  • Stuff breaks – no blame, fix issues, do postmortems soon and often
  • Testing – test everything you can
  • Continuous – try to do more things continuously, testing, delivery, monitoring

DevOps 301

  • Get Security in the loop
  • Find other teams to include
  • Don’t say “WE’RE 100% DEVOPS TODAY”

DevOps 401

He closed by re-iterating that focusing on creating small wins and getting people working together/speaking the same language was the key to their devops transformation. “Let’s work together, and solve the problems that our business wants us to!”

Talk: DevOps Transformation at Salesforce.com

I am a huge fan of DevOps success stories in more traditional or enterprisey environments, so the talk by Dave Mangot and Karthik Rajan of Salesforce.com TechOps, entitled “Effecting a DevOps Transformation at Salesforce.com”, was a treat for me.

They started with some history of Salesforce.com, which in 2000 was three people and did four major releases per year. Fast forward through seven years of rapid growth on all fronts, and delivering on releases got harder: 2006 saw one major release. In 2007, they embarked on an agile transformation which they coined Adaptive Development Methodology (ADM), essentially their take on agile development. This turned things around for development, but what about infrastructure? ADM didn’t have the same impact for their TechOps teams as it had for development, because making your infrastructure agile poses different challenges from writing code. At first they tried scaling through hiring, but hiring takes time, and meanwhile the rapid success and growth continued.

So how do they innovate at scale? It’s now 2012 and there is this DevOps thing going on. So how can they blend ADM and DevOps? Since DevOps encompasses agile development, automation and “infrastructure as code”, a lot of code needs to be written, so they secured time from developers to support Operations and build the tools Ops needed. This helped, but they quickly found that competing priorities slowed down the development Ops needed. So they started creating cross-functional teams, with skills ranging from data center and hardware to infrastructure to dedicated developers, and unleashed them on the most needed improvements.

Their approach to DevOps was inspired by Gene Kim’s “Three Ways”:

  • The First Way emphasizes the performance of the entire system
  • The Second Way is about creating the right-to-left feedback loops
  • The Third Way is about creating a culture that fosters two things: continual experimentation, taking risks and learning from failure; and understanding that repetition is the prerequisite to mastery

They realized all four key parts of John Willis and Damon Edwards’s CAMS ideas through the following actions/principles:

  • Culture: Agile (through ADM) is the core of their engineering culture
  • Automation: Automation is key, dedicated developers and cross-functional teams ensure the right automation is being made and improved
  • Metrics: “You can’t change what you can’t measure” and “Measure everything”. Graphite is a big part of their metrics strategy
  • Sharing: Salesforce Chatter is at the center of the company’s social atmosphere. Essentially an internal social network site that people enjoy using.

So why are they succeeding? First, they found their “DevOps Kata”, which consists of:

  • Daily Standup – “Encourage effective two way communication and other means to drive out fear throughout the organization so that everybody may work effectively and more productively for the company.”
  • Sprint Retrospective – “Institute a vigorous program of education and self-improvement”
  • Sprint Demo – “Break down barriers between departments. People in research, design, sales, and production must work as a team, in order to foresee problems of production and usage that may be encountered with the product or service.”

Two key things for getting people interested in the “devops culture” and spreading it were holding regular Hack Days and hosting an internal DevOps mini-conference. Some other key points were:

  • Building an Infrastructure Development Lifecycle – infrastructure code is developed with the same process and rigor as application code
  • Virtualization – They invested in the tooling and infra to make it so folks can develop for both infrastructure and application code locally, using Vagrant
  • Scrum Master Training – “If someone is interested, send them! Yesterday!”
  • Accepting that we’ll still “fail”

They’re extremely happy with their progress and DevOps transformation so far, but still recognize that their present and future challenges include:

  • Bringing Agile into traditional IT Ops
  • Bringing IT Ops in with Infra Eng and R&D
  • Re-educating workforce
  • Recruiting
  • Scaling Securely

Andrew Clay Shafer moderated a panel on Tuesday evening which he called the “DevOps AMA” (Ask Me Anything), which included John Willis, Gene Kim, and Mandi Walls. Andrew did a fantastic job of including the audience and the conversation ranged from “what is a good definition of devops?” to “doing devops in the enterprise” to “how can I do continuous delivery with mainframes?”.

Talk: Keynote – Why Everyone Needs DevOps Now

One of the biggest highlights of Agile 2013 for me was Gene Kim’s closing keynote “Why Everyone Needs DevOps Now: A Fourteen Year Study Of High Performing IT Organizations”.

Gene wanted to answer the question “Where Do High Performers Come From?” and so began studying this. He began working with the IT Process Institute, which has been studying high performing IT organizations since 1999, and the first result of this work was The Visible Ops Handbook, which is described as “a methodology designed to jumpstart implementation of controls and process improvement in IT organizations needing to increase service levels, security, and auditability while managing costs”.

IT Operations has traditionally had the role of “fixing fragile artifacts” and is extremely interrupt driven, so it’s no surprise that terms like “special snowflake server” are widespread. Unfortunately, we’ve also seen a world where “IT Ops and Devs are at War!”. Surely there must be a better way.

It’s 2009, John Allspaw and Paul Hammond give a presentation at the O’Reilly Velocity conference on how they’re doing 10 deploys a day at Flickr. Change is afoot.

So who is doing DevOps?

  • Google, Amazon, Netflix, Etsy, Twitter, Facebook, Pinterest, …
  • BNY Mellon, Bank of America, World Bank, Paychex, Intuit, …
  • The Gap, Nordstrom, REI, GameStop, …
  • Portland State University, Seton Hill University, Kansas State University, …

What sets high-performing DevOps teams apart?

  • They’re more agile
      • 30x more frequent deployments
      • 8,000x faster cycle time than their peers
  • They’re more reliable
      • 2x the change success rate
      • 12x faster MTTR

So what’s the outcome of all this studying Gene and others have been doing? One outcome is the book Gene co-wrote, “The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win”. It is heavily inspired by Goldratt’s novel The Goal.

At the core of this book is the idea of the three ways.

The first way is Flow. Seek to understand your flow of work. You need to define your work and make it visible. Create a one-step environment creation process; this improves development, testing, and QA. Change the Agile sprint policy: “At the end of each sprint, we must have working code and the environment it runs in.” Deploy Smaller Changes, More Frequently.

The second way is Feedback. Seek to understand and respond to the needs of all customers, internal and external. Developers are IT Ops customers; help them get quicker feedback on their code. Google requires that developers maintain their services themselves for the first six months and has a very thorough acceptance process before Ops takes over maintaining a service. Metrics and monitoring improve feedback and situational awareness. With continuous delivery, failures must result in automated tests in the continuous deployment pipeline (Release, Config, Change).

The third way is Continual Experimentation And Learning. Foster a culture that rewards experimentation (taking risks) and learning from failure. Repetition is the prerequisite to mastery. You need a culture that keeps pushing into the danger zone and has the habits that enable you to survive in the danger zone. You Don’t Choose Chaos Monkey … Chaos Monkey Chooses You! Allocate 20% Of Cycles To Technical Debt Reduction.

In closing, if Gene could wave a magic wand, everyone would:

  • See the dead bodies in IT, and have confidence that your intuitions have been right all along…
  • Become conversant with DevOps and recognize the practices when you see them
  • Be energized about how practitioners can contribute in this organizational journey
  • Leave with some concrete steps to get some great outcomes
  • Help create a team that starts putting DevOps practices into place

Gene’s keynote ended the conference on an excellent high note, really drove home the importance of DevOps to the whole business, and demonstrated that adding a DevOps track to the Agile conference was relevant and very smart.

Parting thoughts

My talk, entitled “Enabling DevOps: The road to a better culture”, started by providing some context and history around Mozilla IT’s growth, the formation of the Web Operations team, and the evolution of our technology and tools.  I then covered our current efforts to build self-service tools, spent a fair amount of time discussing why and how we’re building an in-house platform as a service offering, and talked about the bigger picture of the devops culture we’re building.

My talk was also well received. People really enjoyed hearing about how Mozilla IT is approaching building a devops culture, the tools we’re building and deploying, and our challenges and successes along the way. In particular, I got a few comments that it seems like we’re taking a very pragmatic approach and finding small wins that really help, which I definitely think is how WebOps approaches all this and I was glad this came through in my presentation and conversations.

Agile 2013 was a great conference. I enjoyed presenting to a different audience, the conversations I had with folks who aren’t in the Operations world, and the opportunity to share what Mozilla IT is up to.

If you enjoyed what I’ve written here, follow me on Twitter!

Indexing Talk Online

Sheeri

I am doing a quick blog post to announce that I have put an indexing talk online*. Most recently, I delivered this indexing talk at Confoo and Scale 11x.

The talk is on YouTube at Are You Getting the Best Out of Your MySQL Indexes? There are also PDF slides.
From the official conference description, if you want to know more:
MySQL indexes are often used to make performance better. However, they can make performance suffer if you are not using them properly. Oracle ACE Director Sheeri Cabral explains the pitfalls to avoid with indexes and how to utilize compound indexes to maximize index availability with the least amount of write overhead.

*I know I have not been posting blogs for a long time. This was a very busy year, and I took March through July off from conferences in order to buy a house and move.

Upcoming maintenance for lists.mozilla.org, 20130727 @ 1800 PDT/ 0400 UTC

Shyam Mani

Mozilla IT Operations Maintenance Notification:
———————————————————————-
(The following event has been scheduled.)

ISSUE STATUS: PLANNED
BUG IDS: 765289
DATE: 2013/07/27
START TIME: 1800 PDT (0400 UTC)
ESTIMATED END TIME: 1900 PDT (0500 UTC)
SITE: phx1
SERVICES: lists.mozilla.org, mail.mozilla.org
TYPE OF WORK: Mailman software Upgrade

IMPACT OF WORK: lists.mozilla.org and mail.mozilla.org will be
offline for the duration of the work.

We will be doing a minor version upgrade for the software that powers
our public email lists. Mail will queue up at the mail relays during
this period and there should be no loss of email.

FOR MORE INFORMATION : Feel free to comment on the post.