## Mozilla IT at Agile 2013

I had the privilege of attending and speaking at the Agile 2013 conference, held August 6th – 9th in Nashville, TN. Agile 2013 was the 12th annual conference of the Agile Alliance. It was a huge conference, with over 2,000 people in attendance, quite a bit bigger than the ones I’m used to attending and presenting at. It was also my first time visiting Nashville, so that was a treat.

DevOps Track:

This year they added a DevOps track, and I submitted a talk to it about Mozilla IT WebOps’ work on enabling a devops culture through self-service tools, in particular our work building an in-house PaaS offering based on ActiveState’s Stackato product.

The DevOps track was excellent and attracted a great crowd of speakers and attendees interested in what DevOps is about. I attended almost all the DevOps track talks and also got some great hallway track time with folks like Gene Kim, John Willis, Andrew Clay Shafer, Karthik Gaekwad, Gareth Bowles, Dominica DeGrandis, and many more. I thoroughly enjoyed all the talks, but I wanted to highlight a few in particular.

Talk: How DevOps changed Everything

Karthik Gaekwad, a web engineer at Mentor Graphics Embedded, shared their devops transformation in a talk entitled “How DevOps changed Everything”. He explained how they started with an environment where Waterfall development was the norm and the Dev, Ops, QA, and other teams were all silos, and this wasn’t working. To improve this they started by adopting Scrum with two-week iterations. This improved things on the dev side and brought some improvements for IT as well, but IT was still mostly thinking “how can we be agile when it takes weeks just to order servers?”. At the same time, management started asking about this “Cloud” stuff, so IT was now trying to figure out both how to be more “agile” and how to do “Cloud”. Enter DevOps! They focused on two things: improving culture and automation.

They worked on culture by committing to Scrum and sticking with it, even though they knew it’d be difficult. “Our 1st sprint(s) sucked!” To improve automation they focused on building a Platform for their developers, one that would have automation, APIs, instrumentation, easy deploys, etc. baked right in. They found that “Devs and ops understand good architecture” and a platform helped everyone speak the same language.

Finally, Karthik offered some thoughts on what DevOps looks like as you make progress with it.

DevOps 101

• Config Management – find easy wins, config files
• Monitoring – use tools that make it easy, CloudKick, New Relic, StatsD
• Log Aggregation – make logs easy to view, search, chart, Logstash, Graylog2, Splunk, Sumologic

DevOps 201

• Culture – every team is different, find what works for you
• Stuff breaks – no blame, fix issues, do postmortems soon and often
• Testing – test everything you can
• Continuous – try to do more things continuously, testing, delivery, monitoring

DevOps 301

• Get Security in the loop
• Find other teams to include
• Don’t say “WE’RE 100% DEVOPS TODAY”

DevOps 401

He closed by re-iterating that focusing on creating small wins and getting people working together/speaking the same language was the key to their devops transformation. “Let’s work together, and solve the problems that our business wants us to!”

Talk: DevOps Transformation at Salesforce.com

I am a huge fan of DevOps success stories in more traditional or enterprisey environments, so the talk by Dave Mangot and Karthik Rajan of Salesforce.com TechOps, entitled “Effecting a DevOps Transformation at Salesforce.com”, was a treat for me.

They started with some history of Salesforce.com, which in 2000 was three people and did four major releases per year. Fast forward through seven years of rapid growth on all fronts, and delivering releases got harder: 2006 saw one major release. In 2007, they embarked on an agile transformation which they coined Adaptive Development Methodology (ADM), essentially their take on agile development. This turned things around for development, but what about infrastructure? ADM didn’t have the same impact for their TechOps teams as it had for development; making your infrastructure agile poses different challenges from writing code. At first they tried scaling through hiring, but hiring takes time, and meanwhile the rapid success and growth continued.

So how do they innovate at scale? It’s now 2012 and there is this DevOps thing going on. So how can they blend ADM and DevOps? Since DevOps encompasses agile development, automation and “infrastructure as code”, a lot of code needs to be written, so they secured time from developers to support Operations and build the tools Ops needed. This helped, but they quickly found that competing priorities slowed down the development work Ops needed. So they started creating cross-functional teams, with skills ranging from data center and hardware to infrastructure to dedicated developers, and unleashed them on the most needed improvements.

Their approach to DevOps was inspired by Gene Kim’s “Three Ways”:

• The First Way emphasizes the performance of the entire system.
• The Second Way is about creating the right-to-left feedback loops.
• The Third Way is about creating a culture that fosters two things: continual experimentation, taking risks and learning from failure; and understanding that repetition and practice is the prerequisite to mastery.

They realized all four key parts of John Willis and Damon Edwards’s CAMS ideas through the following actions/principles:

• Culture: Agile (through ADM) is the core of their engineering culture
• Automation: Automation is key, dedicated developers and cross-functional teams ensure the right automation is being made and improved
• Metrics: “You can’t change what you can’t measure” and “Measure everything”. Graphite is a big part of their metrics strategy
• Sharing: Salesforce Chatter is at the center of the company’s social atmosphere. Essentially an internal social network site that people enjoy using.

So why are they succeeding? First, they found their “DevOps Kata”, which consists of:

• Daily Standup – “Encourage effective two way communication and other means to drive out fear throughout the organization so that everybody may work effectively and more productively for the company.”
• Sprint Retrospective – “Institute a vigorous program of education and self-improvement”
• Sprint Demo – “Break down barriers between departments. People in research, design, sales, and production must work as a team, in order to foresee problems of production and usage that may be encountered with the product or service.”

Two key things for getting people interested in and spreading the “devops culture” were holding regular Hack Days and an internal DevOps Mini-Conference. Some other key points were:

• Building an Infrastructure Development Lifecycle – infrastructure code is developed with the same process and rigor as application code
• Virtualization – They invested in the tooling and infra to make it so folks can develop for both infrastructure and application code locally, using Vagrant
• Scrum Master Training – “If someone is interested, send them! Yesterday!”
• Accepting that we’ll still “fail”

They’re extremely happy with their progress and DevOps transformation so far, but still recognize that their present and future challenges include:

• Bringing Agile into traditional IT Ops
• Bringing IT Ops in with Infra Eng and R&D
• Re-educating workforce
• Recruiting
• Scaling Securely

Andrew Clay Shafer moderated a panel on Tuesday evening, which he called the “DevOps AMA” (Ask Me Anything) and which included John Willis, Gene Kim, and Mandi Walls. Andrew did a fantastic job of including the audience, and the conversation ranged from “what is a good definition of devops?” to “doing devops in the enterprise” to “how can I do continuous delivery with mainframes?”.

Talk: Keynote – Why Everyone Needs DevOps Now

One of the biggest highlights of Agile 2013 for me was Gene Kim’s closing keynote “Why Everyone Needs DevOps Now: A Fourteen Year Study Of High Performing IT Organizations”.

Gene wanted to answer the question “Where Do High Performers Come From?” and so began studying this. He began working with the IT Process Institute, which has been studying high performing IT organizations since 1999, and the first result of this work was The Visible Ops Handbook, which is described as “a methodology designed to jumpstart implementation of controls and process improvement in IT organizations needing to increase service levels, security, and auditability while managing costs”.

IT Operations has traditionally had the role of “fixing fragile artifacts” and is extremely interrupt driven, so it’s no surprise that terms like “special snowflake server” are widespread. Unfortunately, we’ve also seen a world where “IT Ops and Devs are at War!”. Surely there must be a better way.

It’s 2009, John Allspaw and Paul Hammond give a presentation at the O’Reilly Velocity conference on how they’re doing 10 deploys a day at Flickr. Change is afoot.

So who is doing DevOps?

• BNY Mellon, Bank of America, World Bank, Paychex, Intuit, …
• The Gap, Nordstrom, REI, GameStop, …
• Portland State University, Seton Hill University, Kansas State University, …

What makes a high-performing DevOps team?

• They’re more agile: 30x more frequent deployments and 8,000x faster cycle time than their peers
• They’re more reliable: 2x the change success rate and 12x faster MTTR
So what’s the outcome of all this studying Gene and others have been doing? One outcome is the book Gene co-wrote, “The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win”. It is heavily inspired by Goldratt’s novel The Goal.

At the core of this book is the idea of the three ways.

The first way is Flow. Seek to understand your flow of work. You need to define your work and make it visible. Create a one-step environment creation process, which improves development, testing, and QA. Change the Agile sprint policy: “At the end of each sprint, we must have working code and the environment it runs in.” Deploy Smaller Changes, More Frequently.

The second way is Feedback. Seek to understand and respond to the needs of all customers, internal and external. Developers are IT Ops’ customers, so help them get quicker feedback on their code. Google requires that developers maintain their services themselves for the first six months and has a very thorough acceptance process before Ops takes over maintaining a service. Metrics and monitoring improve feedback and situational awareness. For continuous delivery, failures must result in automated tests in the continuous deployment pipeline (Release, Config, Change).

The third way is Continual Experimentation And Learning. Foster a culture that rewards experimentation (taking risks) and learning from failure. Repetition is the prerequisite to mastery. You need a culture that keeps pushing into the danger zone and has the habits that enable you to survive there. You Don’t Choose Chaos Monkey … Chaos Monkey Chooses You! Allocate 20% Of Cycles To Technical Debt Reduction.

In closing, if Gene could wave a magic wand, everyone would:

• See the dead bodies in IT, and have confidence that your intuitions have been right all along…
• Become conversant with DevOps and recognize the practices when you see them
• Be energized about how practitioners can contribute in this organizational journey
• Leave with some concrete steps to get some great outcomes
• Help create a team that starts putting DevOps practices into place

Gene’s keynote ended the conference on an excellent high note, really drove home the importance of DevOps to the whole business, and demonstrated that adding a DevOps track to the Agile conference was both relevant and very smart.

Parting thoughts

My talk, entitled “Enabling DevOps: The road to a better culture”, started by providing some context and history around Mozilla IT’s growth, the formation of the Web Operations team, and the evolution of our technology and tools. I then covered our current efforts to build self-service tools, spent a fair amount of time discussing why and how we’re building an in-house platform as a service offering, and talked about the bigger picture of the devops culture we’re building.

My talk was also well received. People really enjoyed hearing about how Mozilla IT is approaching building a devops culture, the tools we’re building and deploying, and our challenges and successes along the way. In particular, I got a few comments that it seems like we’re taking a very pragmatic approach and finding small wins that really help, which I definitely think is how WebOps approaches all this and I was glad this came through in my presentation and conversations.

Agile 2013 was a great conference. I enjoyed presenting to a different audience, the conversations I had with folks who aren’t in the Operations world, and the opportunity to share what Mozilla IT is up to.

## Indexing Talk Online

I am doing a quick blog post to announce that I have put an indexing talk online*. Most recently, I delivered this indexing talk at Confoo and Scale 11x.

The talk is on YouTube at Are You Getting the Best Out of Your MySQL Indexes? There are also PDF slides.
From the official conference description, if you want to know more:
MySQL indexes are often used to make performance better. However, they can make performance suffer if you are not using them properly. Oracle ACE Director Sheeri Cabral explains the pitfalls to avoid with indexes and how to utilize compound indexes to maximize index availability with the least amount of write overhead.

*I know I have not been posting blogs for a long time. This was a very busy year, and I took March through July off from conferences in order to buy a house and move.

## Upcoming maintenance for lists.mozilla.org, 20130727 @ 1800 PDT/ 0400 UTC

———————————————————————-
(The following event has been scheduled.)

ISSUE STATUS: PLANNED
BUG IDS: 765289
DATE: 2013/07/27
START TIME: 1800 PDT (0400 UTC)
ESTIMATED END TIME: 1900 PDT (0500 UTC)
SITE: phx1
SERVICES: lists.mozilla.org, mail.mozilla.org
TYPE OF WORK: Mailman software Upgrade

IMPACT OF WORK: lists.mozilla.org and mail.mozilla.org will be
offline for the duration of the work.

We will be doing a minor version upgrade for the software that powers
our public email lists. Mail will queue up at the mail relays during
this period and there should be no loss of email.

Some folks are reporting that some etherpads are not working after a routine database switchover. We have figured out a way to recover the last known working revision, and have already done so for a handful of etherpads.

We are working to proactively find these etherpads and fix them, but if you have an etherpad that is broken that you want to call attention to, please put it in bug 894913 – https://bugzilla.mozilla.org/show_bug.cgi?id=894913.

## MySQL Workbench with Unix Socket only servers.

Here at Mozilla we like to keep our database systems secure and no more open than they need to be. As on many database systems, privileged users are locked down to socket-only connections. This can be a problem if you want to use database admin tools, like MySQL Workbench, or maybe even some remote performance monitoring software.

One cool trick is forwarding that socket to a port temporarily while you do your work. Below I’ll show you how to forward a socket to a port for use with the MySQL Workbench.

First, establish an SSH connection to your database server using any regular shell connector, such as Terminal on a Mac, ssh on Linux, or PuTTY on Windows.

Once connected, assuming you’re using a socket of “/var/lib/mysql/mysql.sock” (the default), we’ll use a Linux tool called socat to forward that socket to an unused port. Note that socat is packaged for a wide variety of distributions and is available through both yum and apt-get on the major ones. For more info on socat, see its documentation.

The command for this is:

An example, using the default socket path and port 3308 would be:

Tip: If you’re consistently connecting from the same IP address, consider adding the “range” option to the socat parameters for extra security.

What does this command do? It forwards all traffic being received from the port provided to the socket provided. It’s basically the socket equivalent to port forwarding.
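The forwarding described above can be sketched with socat’s `TCP-LISTEN` and `UNIX-CONNECT` address types. This is a minimal sketch, not necessarily the exact command from the original post: the `fork` and `reuseaddr` options and the placeholder address `203.0.113.10` are assumptions added for illustration, and the script only prints the commands rather than running them.

```shell
# Assumed values for illustration; adjust to your environment.
SOCKET=/var/lib/mysql/mysql.sock   # default socket path mentioned above
PORT=3308                          # any unused port on the database server

# TCP-LISTEN accepts TCP connections on PORT, fork lets socat serve more
# than one connection, and UNIX-CONNECT relays each one to the socket.
FORWARD_CMD="socat TCP-LISTEN:${PORT},fork,reuseaddr UNIX-CONNECT:${SOCKET}"

# Variant using the "range" option so only one client address is accepted.
# 203.0.113.10 is a placeholder, not a real client.
RESTRICTED_CMD="socat TCP-LISTEN:${PORT},fork,reuseaddr,range=203.0.113.10/32 UNIX-CONNECT:${SOCKET}"

# Print the commands instead of running them.
echo "$FORWARD_CMD"
echo "$RESTRICTED_CMD"
```

Run whichever printed command you prefer inside the SSH session, and stop it with Ctrl-C when you are done so the port does not stay open longer than needed.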

Once you’ve created the socket forwarding, you can connect the MySQL Workbench to it by simply setting up a connection to the server using TCP/IP and your port, like the screenshot below (note that it will require your normal socket user’s password).

Once you’re configured like the above screenshot, voila! You can now connect to your socket only server.

## Brief Outage for Phoenix Data Center Chassis

One of the chassis in the PHX1 datacenter experienced issues that took many services offline, including those on the generic web cluster, and degraded others for approximately half an hour. Fixing the issue took approximately 15 minutes. Services should be back to normal.

For reference, the following web services were either degraded or unavailable:

generic cluster (contains many web apps)

bouncer
elasticsearch
graphite
hangprocessor
input
input-celery
openshift
plugins and plugins memcached
puppetmaster
rabbit
socorro memcache

If you have any questions or concerns please address them to helpdesk@mozilla.com.

## Bugzilla Feeling Slow?

We have been experiencing intermittent Bugzilla slowness since Wednesday, June 12th 2013 at 5 pm UTC (10 am US/Pacific time). We have been working throughout the weekend to pinpoint the cause of this irregular, but noticeable, issue. The problem is performance only; there have been no reports and no evidence of data or functionality loss. We will release additional information as we have it.

Update 18 Jun 2013 18:40 UTC: The Phoenix chassis outage was completely unrelated to this Bugzilla slowness. Bugzilla is in a different data center and neither caused nor was affected by the chassis problem; the only effect the chassis problem had was to pull resources away from figuring out and fixing the Bugzilla issue.

## A Different Spin On the max_allowed_packet Problem

Back in November, I filed MySQL bug 67448, talking about a different type of max_allowed_packet problem.

See, an application had put data into the database, but could not retrieve it without getting a max_allowed_packet error. With the help of some really smart community folks (Jesper Hansen, Brandon Johnson and Shane Bester), we determined that MySQL actually has 2 different max_allowed_packet settings: client and server.

When you change the max_allowed_packet variable, you are changing the server variable if it is in [mysqld] and the client variable if it is in [client] or [mysql] or whatever client you have. As far as we can tell, there’s no way to actually view what the client variable is, as looking at both the session and global max_allowed_packet variable shows you the server variable.

If max_allowed_packet is not set by the client, it defaults to 16M. The proposed solution is to allow it to be increased for non-interactive clients, and the bug has been verified as a “feature request”, though it has not been implemented yet.
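To make the client/server split concrete, here is a small sketch of an option file that sets the variable on both sides. It writes to a temporary file rather than a real my.cnf, the 64M value is an arbitrary assumption, and it uses the per-program `[mysql]` and `[mysqldump]` groups as examples of client groups.

```shell
# Write an illustrative option file (a temp file, not a real my.cnf).
CNF="$(mktemp)"
cat > "$CNF" <<'EOF'
[mysqld]
# server-side limit: the value SHOW VARIABLES reports
max_allowed_packet = 64M

[mysql]
# client-side limit for the mysql command-line client
max_allowed_packet = 64M

[mysqldump]
# client-side limit for mysqldump
max_allowed_packet = 64M
EOF

# List which option groups set the variable, joined on one line.
GROUPS_SET="$(awk '/^\[/ { grp = $0 } /^max_allowed_packet/ { print grp }' "$CNF" | paste -sd ' ' -)"
echo "$GROUPS_SET"
```

Raising only the `[mysqld]` value leaves the clients at their own default, which is the situation described in the bug.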

## ulimits and upgrading from Oracle MySQL 5.0 to Percona patched MySQL 5.1

After upgrading to Percona’s patched MySQL 5.1*, end users were having connectivity problems, and reporting errors such as:

OperationalError: (2003, "Can't connect to MySQL server on 'db-amo-ro' (110)")

TimeoutError: Request timed out after 5.000000 seconds

OperationalError: (1135, "Can't create a new thread (errno 11); if you are not out of available memory, you can consult the manual for a possible OS-dependent bug")

We had these same problems a while back, before increasing ulimit settings in /etc/sysconfig/mysqld. Oracle’s MySQL startup script specifically sources this file:

[ -e /etc/sysconfig/$prog ] && . /etc/sysconfig/$prog

However, we saw these errors again when we upgraded to Percona’s MySQL 5.1. At first we thought that it was because Oracle’s startup script is /etc/init.d/mysqld and Percona’s is named /etc/init.d/mysql (so we would put ulimits in /etc/sysconfig/mysql). However, when we looked, we saw that Percona’s startup script does NOT source anything in /etc/sysconfig.
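A quick way to check whether a given init script sources anything from /etc/sysconfig is to search it for that path. The sketch below runs against two throwaway files standing in for the real scripts; the helper name `sources_sysconfig` and the file contents are made up for illustration.

```shell
# Report whether an init script mentions /etc/sysconfig at all.
sources_sysconfig() {
  if grep -q '/etc/sysconfig/' "$1"; then
    echo "yes"
  else
    echo "no"
  fi
}

# Two throwaway stand-ins for the real init scripts (contents are illustrative).
WITH="$(mktemp)"
WITHOUT="$(mktemp)"
echo '[ -e /etc/sysconfig/$prog ] && . /etc/sysconfig/$prog' > "$WITH"
echo 'ulimit -n 8192' > "$WITHOUT"

sources_sysconfig "$WITH"      # prints: yes
sources_sysconfig "$WITHOUT"   # prints: no
```

On a real host you would point the function at the actual script, for example `sources_sysconfig /etc/init.d/mysql`.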

So then we put the following in /etc/security/limits.d/99-nproc-mysql.conf:
root soft nproc 32768
root hard nproc 65535

We restarted MySQL and all was good. Even though we are long past having this problem, I thought it was important enough to blog about.
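One way to confirm that new limits actually reached a running process is to read its entry under /proc, which is Linux-specific. The helper below is a sketch with a made-up name (`max_processes`); it is demonstrated on the current shell, since the PID of mysqld depends on your system.

```shell
# Print the soft and hard "Max processes" limits for a given PID.
# /proc/<pid>/limits is Linux-specific.
max_processes() {
  awk '/^Max processes/ { print "soft=" $3, "hard=" $4 }' "/proc/$1/limits"
}

# Demonstrated on the current shell. For MySQL you would pass the mysqld
# PID instead, e.g. max_processes "$(pidof mysqld)".
max_processes "$$"
```

If the values shown for mysqld do not match what you configured, the limits file was probably not applied to the user or session that started the server.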

* We finished upgrading all of our servers to MySQL 5.1 at the end of 2012. We ran into this interesting snag that I wanted to blog about, even though we’re in the middle of upgrading to MySQL 5.5 right now (and by the end of the year, we will upgrade to MySQL 5.6 – the performance schema stuff is definitely something we want to utilize).