
BrowserID outage last week

15-May-12

What happened?

Last week (9 May 2012 23:24 PDT until the following morning at 07:20), BrowserID had an outage affecting 50% of login requests.

Embarrassingly, we didn’t find out until a user filed bug 753728 with us.

Root Cause

To put it bluntly, human error.

Specifically, all nodes were accidentally drained from the load balancer pool that governs verification of BrowserID logins in our Northern California data center: a test run of a load balancer management script was executed against the production pool instead of the stage pool.

There are some sites using a local verification system instead of calling back to the browserid.org systems – they were not affected.

The Fix

The timing of this issue was painfully coincidental, in that it occurred the week before we planned to roll out cepmon, which includes robust monitoring of the load balancer pools. It will catch situations like this (among many others) and alert the SRE team and us immediately.

Additionally, we have a new VPN config for internal use that doesn’t allow cross-talk from stage to prod to reduce the incidence of ‘oopses’ of this nature.

Sync meltdown presentation @ Mozilla

27-Oct-11

Before ‘The Sync Platform Meltdown, Explained’ was posted, I did a “brown bag” presentation at Mozilla HQ explaining what happened.

Here is the recorded stream, if you’d prefer this instead of reading the long post about it:

The Sync Platform Meltdown, Explained

25-Oct-11

TL;DR: Recently, the Firefox Sync platform suffered serious performance degradation. We’d like to explain what happened, as well as the steps taken to deal with the issue from the perspective of Mozilla Services Operations (“Ops”).

The Sync platform melted down after the release of Firefox 7…

…to put it bluntly. This was the worst outage in Sync’s short (~1 year) post-Labs history, and it was caused by a confluence of factors including bugs, poor communication, and a lack of proper database maintenance. In the interest of being as open as Mozilla is with its products and source code, we want to share with you, our community, what happened, why, and what we’re going to do from now on to prevent issues like this from negatively affecting your experience with our products.

Build up

Obviously, everything we now know to be a contributing factor was identified in hindsight. Had we known about even one of these lurking issues, we might not have seen the level of service disruption that occurred. We will leverage this experience to help head off future incidents with this and other Services products.

Instant Sync

Instant Sync is a new feature in Firefox 7. It was first put into Beta sometime after the Firefox 5 release, which put it on track for the Firefox 7 release. Unfortunately, throughout the development and QA process, it wasn’t viewed as having the potential to negatively impact Sync’s performance, so neither Ops nor QA had any attention focused on it.

This meant that while the Ops team was aware of the existence of the Instant Sync feature, it wasn’t something that we were tracking or worried about. This lack of concern and impact assessment is why we were missing the metrics that would have properly modeled the behavior of Instant Sync “in the wild”.

Broken server-initiated back-off

Firefox’s Sync client code is designed so that Services Ops has the ability to force clients to back off from their scheduled Sync activity. What this means is that if, for some reason, the back-end Sync platform gets overloaded, we can reply with a certain HTTP header (either our custom X-Weave-Backoff or the HTTP-spec Retry-After) and instruct the client to delay its next check-in by whatever time we determine to be necessary. This functionality can give us breathing room to make repairs or take compensatory action.
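
For the curious, here is roughly what that looks like on the wire. This is purely illustrative: the hostname, username, and the 3600-second value are made-up examples rather than production settings; the actual mechanism is simply a header carrying the number of seconds the client should wait before its next check-in.

# Illustration only: hostname, username, and back-off value are examples.
curl -sI https://sync-node.example.com/1.1/jrandomuser/info/collections
#   HTTP/1.1 200 OK
#   X-Weave-Backoff: 3600      <- our custom header: back off for 3600 seconds
#   Retry-After: 3600          <- the HTTP-spec header, typically paired with a 503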

Our faith in the stability of the Sync product proved to be our undoing in this case. Because we’ve never needed to use the server-initiated back-off feature, we didn’t notice that it had been broken at some point during the last few months of Firefox’s development. Essentially, when we needed it most, it didn’t work.

Sync Maintenance

Among all the data Sync “syncs”, the “collection” (as it’s called in the code) that generates the most data is Browser History. Since your History is constantly changing, it constantly needs updating in Sync. Every time you use your browser, you are syncing more and more history data into our encrypted databases. To our detriment, and perhaps foolishly, we weren’t particularly rigid about our database cleanup of that data, and it backed way up. Until Firefox 7 (before Instant Sync existed), this old data was not a significant problem: we had more than enough storage overhead to deal with the existing backlog of old data, and our processing resources were fast enough that its presence had a negligible impact on our query load in MySQL.
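
To make that maintenance concrete, the cleanup in question boils down to periodically deleting expired rows in small batches. The sketch below is hypothetical: the table and column names are invented stand-ins, not the actual Sync storage schema.

# Hypothetical TTL-pruning pass; 'wbo', 'collection', and 'ttl' are illustrative names.
mysql -e "DELETE FROM wbo
          WHERE collection = 'history'
            AND ttl < UNIX_TIMESTAMP()
          LIMIT 10000;" sync_db    # small batches avoid blocking live queries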

Server Capacity

In Ops, we do not often discuss hard numbers when discussing capacity. Server capacity in any environment with a changing workload is difficult to quantify, so we use rough numbers. As of September 2011, we were reporting that we were using “about 40%” of our capacity in Northern California and “about 60%” in our Phoenix data center. Those numbers, while imprecise, were reasonably correct and were well within tolerances for perceived load.

The problem is that we were not being precise in differentiating between “horizontal” capacity and “vertical” capacity. If these aren’t familiar terms to you, scaling horizontally is when you add more servers to your existing pool to better spread the load among more resources. Vertical scaling is when you add local (in-machine) resources that will better handle an increased workload. These local resources might be an additional CPU, more RAM, faster or bigger hard drives, etc. To give a non-technical analogy, if you own a sedan, have two kids, and have twins on the way, you could either buy a minivan (vertical scaling) or another sedan and drive two cars when all four kids need to go somewhere (horizontal scaling).

Services Ops had (and still has) plenty of horizontal capacity. If we suddenly gained 50% more new users this month, we’d certainly be watching our metrics graphs carefully for bottlenecks, but we’re reasonably confident the platform would continue to deliver the same level of performance. However, if our existing users suddenly started sending 50% more traffic, we would be unable to handle that load, because we aren’t operating with enough overhead to scale vertically to that level, and performance would suffer. And, basically, that’s what happened. But before we get into those kinds of details, let’s go through the timeline of the event…

Meltdown

Firefox 7 released

On 27 September 2011, Firefox 7 came out on time. At this point, a Firefox release day is not a Red Alert, “all hands on deck” event like it has been in the past. We do check our metrics graphs a little more than on an average day, but other than that we’re business-as-usual. On this particular day, there was a small uptick in traffic to the Sync servers, but it wasn’t enough to generate any real concern and was attributed to new users trying out the new release.

Load creeps up

Within 48 hours, it was clear that this new traffic wasn’t just new users. Traffic was steadily climbing, but the odd part was that we weren’t seeing any correlating increase in new user sign-ups. This meant that the traffic we were seeing was from existing users. As luck would have it, at about the same time we started seeing an increase in drive failures in our Northern California data center. Equipment housed at that facility is a bit older than the equipment in use in Phoenix, so a certain amount of failure is expected, but the failure rate suddenly jumped to more than four times normal. What we now realize is that the increased disk I/O the database nodes were trying to handle was pushing old equipment right over the edge into failure.

By the 1st of October, we were starting to receive notifications from our monitoring system indicating that individual databases were overloading to the point where they were unresponsive. In order to repair the errors users would experience when their assigned database node was overloaded, we began initiating “migrations” of those users to other, less loaded database servers. The problem with this approach is that it involves forcing a “First Sync” of all the user’s data to a brand new node, inducing additional load on that server as well.

At this point, we’re sure there’s some sort of problem with Firefox 7 and how it’s interacting with the Sync servers, but we’re unable to find any correlation between failure rates and any functionality in Fx7. Outside of 503 errors occurring between our Zeus load-balancers and the back-end platform, we’re receiving nothing indicating what the root cause may be.

Firefox, both the project and the code, is an almost living thing. The nightly builds change nightly (hence the name) and it’s difficult to see what “Butterfly Effect” a change in some code will have on Aurora/Beta/Release in the following 6/12/18 weeks (respectively). Completely open communication between teams internally helps detect potential pitfalls, yes, but is not foolproof since some corners of the code are not as well-exercised as others (Example: if the Back button suddenly stopped working, everyone would notice. However, some bits of code are only triggered very rarely and without direct user input, making them harder to notice).

Keeping this in mind, when the release of FF7 was imminent, we asked our developers if anything “major” had changed in the Sync client portion of the Firefox code. From the perspective of the developers, the changes that went into this version of the browser should not have resulted in a significant change in behavior, so no red flags were raised. Since we were operating under the impression that there were no major behavior changes expected, Ops continued to dig through logs and trace connections, desperately trying to correlate behavior and events, yet continuously coming up empty.

Eng/Ops side-bar: Mozilla Services is unlike any organization I’ve worked for previously. It’s not just that Engineering and Operations are friendly with each other (which I’ve encountered at a few other places), but we work very closely on a daily basis and are very cooperative. The two teams sit intermingled (those who are based out of the main Mountain View office are, anyway), the bulk of our online team chats happen in a combined chat room. When there are all-hands events at the Mozilla offices, our team dinners include both Eng & Ops. It’s fantastic, and makes for an excellent working environment even when things aren’t going to plan.

I want to provide this perspective to help illustrate that we aren’t at odds with Engineering. So, while trying to pin down what’s going on, we’re not accusing Eng of anything; we’re investigating collaboratively and still coming up dry.

It was a long weekend of fire-fighting, and we were losing the battle as traffic was overloading more and more servers.

The culprit

On Monday, 3 Oct, after some trawling through the landed bugs listing for Fx7, the Ops Engineers noticed the Instant Sync feature bug. I asked in our intra-team IRC chat channel “Did Instant Sync land in Firefox 7?”

With the patch confirmed and after several pointed questions about specific expected Instant Sync behaviors, the investigation was on.

Knowing what to look for finally made it much easier to pin down Instant Sync as the primary cause of the load spike on the Sync storage (database) nodes. At this point, server load was as much as 50% higher than normal (which, for some nodes, meant load was pinned at 100%).

Mitigation

Warning: Technical details below.

We had already verified, via our personal Sync accounts and about:sync-log data, that Firefox 7 was doing the right thing and backing-off a little when it encountered timeouts connecting to the server. However, this wasn’t enough to quell the overall load increase and it also did nothing to prevent the Error Bar from showing up at the bottom of users’ Firefox browsers.

At this point we decided to start sending the back-off headers using X-Weave-Backoff, only to discover that neither Firefox 7 nor Firefox 6 would respect it. Sadly, the same was true of the Retry-After header.

Without the ability to force clients to back off, our only available course of action was to attempt to compensate for the increased network traffic.

Prior to the release of Firefox 7, we had begun the process of upgrading our older MySQL installation to a new version of Percona. Our original plan was that this would be an interruption-free, multi-week process; however, we could plainly see that servers running the older MySQL builds were unable to keep up and were completely loaded, while servers running the new Percona builds were humming along nicely. Realizing this, we dove in and started doing upgrades as fast as we could. From the perspective of the Sync client, a server that is down for a MySQL upgrade looks exactly the same as a server that is overloaded, since in either case no MySQL traffic can be served by that node. In that way, it’s no worse than the existing situation for the user, so we just went for it.

While this was going on, another Operations engineer whipped up a better method to log the aforementioned 503 errors on the load balancers. This allowed us to see what was failing where more efficiently.
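
The gist of that improvement was being able to answer “which requests are failing, and how often?” at a glance. A rough shell equivalent is below; the log path and field positions assume a combined-format access log and are illustrative, not the actual Zeus configuration.

# Count 503s by request path (field 9 = status, field 7 = path in combined log format).
awk '$9 == 503 {print $7}' /var/log/zeus/access.log | sort | uniq -c | sort -rn | head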

An additional problem-solving avenue that proved invaluable was Twitter. We asked users via @mozservices on Twitter to help by submitting links to pastebin’d copies of their about:sync-log files. This gave the Engineers the ability to home in on exactly what was wrong, as well as discover and fix several previously unknown client and server bugs.

We also added a previously decommissioned web node back into the load balancer pool in our Phoenix data center to help handle the load there, since First Syncs take up more CPU and this server could help handle that additional load.

We did some OS tuning within all the web nodes, upping the file descriptor limits from 1024 to 8192. This was done because the overload was causing nginx to hit the fd limit. We did some additional adjustment of the nginx config itself to allow for more concurrent connections to the Python back-ends (which run under gunicorn) due to the severe query latency. Finally, we changed some Linux kernel-level TCP parameters to allow processes to have a larger accept() queue, re-use TIME_WAIT sockets in a more timely manner, and generally take advantage of the fact that we have lots of RAM (adjusting buffer sizes, etc.).
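
For reference, that tuning looks roughly like the following. Aside from the 1024-to-8192 file descriptor change mentioned above, the specific parameters and values here are illustrative assumptions, not the exact production settings.

# Per-process fd limit (in practice set via limits.conf or the nginx init script, not ad hoc)
ulimit -n 8192
# Larger accept() backlog for nginx/gunicorn
sysctl -w net.core.somaxconn=4096
# Re-use TIME_WAIT sockets more quickly
sysctl -w net.ipv4.tcp_tw_reuse=1
# Bigger socket buffers, since RAM is plentiful
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"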

On the load balancers we dropped the LDAP-slaves VIP timeout to 0. LDAP could handle its own timeouts fine, and chopping them off short was just making Sync upset when it happened, causing more 503s to be sent to users’ Firefox browsers, which triggered the error bar even more.

On the load balancers in our Northern California data center (which run on older/slower hardware than in Phoenix), we upped the conntrack limits, allowing for more connections inbound to the cluster and to the load-balancer back-ends (web servers and LDAP servers). The increased load and longer-lived connections made for a higher number of concurrent connections, which bumped us into the nf_conntrack limit (a Linux iptables thing). We also added monitoring so we can tell when we’re approaching the newly set maximum. Then we lowered the pool_time for the Sync storage MySQL connection pool. There is a bug in the version of the Sync storage code we were running that would serve an error (“503”) to a user in the case of a dropped MySQL connection (instead of retrying), and we were seeing long-lived connections getting dropped (possibly by a firewall in the middle). A newer version of Sync storage that fixes this problem is slated to be live in production this week.
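
Sketching the conntrack side of that change (the limit is an example value, and on older kernels these live under the net.ipv4.netfilter.ip_conntrack_* names instead):

# Raise the connection-tracking table size (example value)
sysctl -w net.netfilter.nf_conntrack_max=262144
# A trivial check a monitoring probe can run to watch headroom
echo "conntrack: $(cat /proc/sys/net/netfilter/nf_conntrack_count) of $(cat /proc/sys/net/netfilter/nf_conntrack_max) entries in use"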

All of this tuning bought us a small amount of breathing room, but we were still smacking right into the true bottleneck: the databases simply could not churn through the increased query load that the additional traffic was generating. So we developed a plan to very rapidly and very aggressively prune out the old data, which was, as mentioned, primarily users’ old History data.

Bulking up the Infrastructure

While we hoped we could simply accomplish this “in place” on the databases, we knew it wouldn’t be too far-fetched to imagine that doing that maintenance work would both be blocked by the existing load and add its own load into the mix, as these were very intensive clean-up processes. To prepare for that potential scenario, we began to prep additional hardware that was being held in reserve in our Northern California data center. Some quick background on that: we had purchased a lot of gear in bulk from a bankruptcy sale that was handed over in situ. It was about double what we actually needed for Sync when we brought it live in March of 2011, so we had several cabinets full of servers powered off, waiting to be needed. This ended up being very fortuitous for us, since if we had needed to buy new servers, there was no way we could have gotten them selected, shipped, racked, configured, built, and online within a reasonable period. With this gear sitting ready, even with the work of updating BIOSes and getting the IPMI management configured (which has to be done one-by-one with a crash cart), we managed to get 60 servers prepped and kickstarted in less than 48 hours.

So, while some of the team got started prepping these servers, we tried to “live prune” the old data. On the least-loaded servers we saw some improvement, but it was very slow to finish. On the most severely overloaded systems, the process failed because it simply couldn’t complete. So, with our fears realized, we had to try Plan B.

It was theorized that if the problem was old data, we could just “nuke” all the data from the server so that the next time the client connected, Firefox would automatically initiate a “First Sync” and re-push all the data bundles (which would only include the user’s most recent 5000 history entries, rather than months of old junk data). We started gingerly, by nuking the data of 10% of users on a few selected nodes. It was immediately apparent that this was a terrible idea. First, the database loads were already quite high from the other 90% of users, and First Sync events push a LOT of data very quickly, so that just made it worse. Second, with Instant Sync in the mix, users were not exactly trickling in to attempt a First Sync. They were all trying to do it Right Now. And then fail, and then try again. And then fail…

Headway

By now (Thursday, 6 Oct) the newly built databases were ready to be brought into the production pools. This allowed us to launch Plan C: mass user migration. This process entails marking the database as “down” via an administrative process on our side, which blocks all incoming traffic to that server. We then delete the user’s Sync storage node assignment in the LDAP system and clear the user’s data from our caches. Once that’s complete, the next time the client tries to Sync, it will get a 401 response code and re-request the user’s Sync storage node. We configured the node-assignment server to only allocate users to these new nodes (plus a few non-overloaded nodes in Phoenix), and users were able to initiate their First Sync successfully. Having done this, Sync’s behavior returned to normal for them.

That was the theory, anyway.
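
From the client’s point of view, a migration looks roughly like the exchange below. The hostnames and username are examples, and the node-assignment URL follows the shape of the public Sync node-assignment API rather than any internal tooling.

# The old node assignment is gone, so the storage node answers 401...
curl -si https://old-node.example.com/1.1/jrandomuser/info/collections | head -1
#   HTTP/1.1 401 Unauthorized
# ...and the client asks for a fresh node assignment, then begins a First Sync there.
curl -s https://auth.services.mozilla.com/user/1.0/jrandomuser/node/weave
#   https://new-node.example.com/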

Unfortunately, it takes a long time to do the above for several million users (more than 3 hours per node; there are many dozens of nodes, and we could only do them in batches to avoid overloading the new servers). In spite of this, we felt confident that it would work based on our earlier, less successful attempts, so we started the user migrations in earnest, targeting the nodes in the worst condition first, and saw an immediate positive impact. As nodes were cleared of users, we truncated the data and returned them to the database pool to accept new users.

Emerging into the sunlight

By early Friday (7 Oct) we saw substantial, sustained improvement in the Sync platform, with a very sharp reduction in the signature “503” errors on the load balancer. After another 24 hours (36 hours in total) of migrations and cycling the cleaned databases back in, 503s were essentially eliminated, and our metrics showed transient error rates in line with our pre-Firefox 7 numbers (and in some areas even better).

By Monday (10 Oct) we had complete confidence that Sync was back to normal, as we had 48 hours of normal metrics. We announced as much via the @mozservices Twitter stream and asked any users having continued problems to contact us. While we have had ongoing conversations with some users experiencing Sync misbehavior and exposing new and interesting client bugs, we have had no recurrence of the 503 overloading problem.

From here

All of our mitigation efforts described above are permanent changes to the Sync platform. It’s stable, and we could pause to catch our breath. However, these changes only address the immediate problem and were implemented in a reactionary fashion. That is simply an unacceptable end point for this process, as it leaves us wondering, “What if there’s another feature around the corner that increases load in another portion of the platform?” We would be just as unprepared as we were 3 weeks ago.

Procedure & bureaucracy

Our primary concern from the outset was “How did this happen?” The confluence of factors contributing to the melt-down has been laid out above, but that’s not what the question is asking. How, with all of the coordinated inter-team status meetings and load testing and roll-outs, did such a huge change to the Sync client get “snuck” into production?

It didn’t sneak, really. The Engineering team followed protocol: they announced it in the weekly meeting, it went through peer review, and every other prescribed step needed to land this feature was taken. Following that, the client went through QA, Sync functionality was tested (successfully), and it rolled out. Only to have this be the result. What did they do wrong?

Nothing according to the procedure, and that is the problem: the procedure itself.

Most urgently, there was no formal Ops Review of upcoming features, either for flagging expected behavior changes or for anticipating misbehavior. There will now be regularly scheduled Ops reviews for all “major” features, as well as an overall review before major version releases. This goes not only for the client, which was the antagonist in this case, but for the server products as well.

Similarly, there was no explicit hand-off to QA to test these new features for functionality, much less a model-load test. This was exacerbated by the lack of testing of the X-Weave-Backoff header “fail-safe” feature. To remedy this, there will now be an explicit QA review of all feature changes, as well as coordinated test-case development for these features. Additionally, there will be a punch list of required-functionality tests that are absolute blockers to release (“respects the X-Weave-Backoff headers” being the item right after “Syncs stuff”).

First things last

Here is our internal punch list, including which team(s) the responsibility falls to. Much of this is already done, in progress, or ongoing, but it is taken directly from our internal wiki:

  • Build and document model of current behavior, expected traffic (Ops/Eng)
  • Bring back and enforce system-review requirement for all client behavior changes (Eng)
  • Identify required metrics for reliability, create dashboard to compare Nightly/Aurora/Beta/Release channels (Ops)
  • Enhanced User-Agent data required (Eng)
  • Standing review, one week before Rapid Release cut-over day, of all changes landed and all metrics from the above actions (Eng/Ops)
  • Fix backoff handling for Fx8 + add automated tests (Eng) – Done: Bug 691612 Bug 691663 Bug 691988
  • Add back-off handling to pre-Aurora-merge extended QA validation (Eng/QA)
  • Document and make available current system capacity numbers (Ops)
  • Determine and implement standard for required headroom on each server (Ops)
  • Coordinate metrics-ops observation of Aurora/Beta traffic patterns (Ops/Metrics)
  • zeus log analysis and reporting improvements (Ops)
  • deploy TTL pruning cron (Ops)
  • push 2.6.x server-core update (Eng/Ops)
  • improve understanding of platform structure outside of Ops (Ops)
  • automated stack trace processing and reporting, top crasher equivalent, server-focused (Ops/Eng)
  • Re-triage of all outstanding bugs to identify future lurking issues (Ops/Eng)
  • 90-second minimum sync interval (Eng) — Done: Bug 694149
  • Create and document engineering contact list + escalation path for all applications (Eng)
  • Update new user activation load test, identify optimal+maximum user uptake rate per node (Ops/Eng)
  • gradual user reassignment per node, not per db (Ops)
  • Ensure ongoing reliability work is accounted for in planning, and deferral of maintenance (Eng/Ops) – Ongoing

Conclusion

We (Mozilla Services Operations) hope that this helps you to understand why this Sync melt-down took as long as it did to resolve. We strive to prevent even minor blips for our users, and any failure at that goal that results in a visible outage is an embarrassment to us, but this one was particularly frustrating since hour after hour we couldn’t dig ourselves out using the tools and techniques we had previously relied on.

We know we are better for it, though we still wish this problem hadn’t happened.

If you have any questions about this outage, please, ask them here, on #sync on irc.mozilla.org, or on @mozservices for any Mozilla Services questions.

-jv

Even more Debugging of Sync

12-Oct-11

TL;DR: Helping debug certain kinds of errors requires increasing your log verbosity so we can see what’s going wrong. This explains how to adjust your Firefox config to do that.

In some cases, your default-verbosity Sync logs don’t have enough details.

Fortunately, Firefox has a lot of config options, and several of these options can help you (and, if you share your logs, us) see what’s going on when things behave badly.

In order to turn on logging for passwords and HTTP traffic, please enter “about:config” in your address bar and change these two config options to “Trace” (with a capital T) and restart Firefox:

services.sync.log.logger.engine.passwords
services.sync.log.logger.network.resources

Once that’s set up (and Firefox restarted), re-Sync (via “Sync Now” in the Tools menu) or do whatever it is that caused the error you’re trying to help us with, then share your logs with us following the previously posted instructions on submitting logs.

IMPORTANT: Note that personal information might appear in the log with these config options set to ‘Trace’; please be careful to change any sensitive data (for example, change your password to ‘xxx’ or ‘***’) so that the contents of the logs only show information you’re comfortable sharing. However, please don’t delete whole lines if you can help it.

Once you’re done reproducing & sharing your logs, you can (and should) change the config settings back to “Debug” and restart Firefox again.

Thanks for helping, and if you have any questions you can always fire up the old IRC client and visit us at irc://irc.mozilla.org/ in #sync.

-jv

Help Mozilla Services Operations debug what’s wrong with Sync

03-Oct-11

This post is for historical purposes only. Please do not send us any more log data at this time. Thank you!

 

TL;DR: Paste your most recent about:sync-log into mozservices.pastebin.mozilla.org and send a link to @mozservices for us to check out.

Sometimes when there’s a problem with Firefox Sync, Mozilla Services Operations (@mozservices) needs some info to figure out the root cause.

We strive to make sure Sync is an always-available service, but as with any complex infrastructure that is both growing rapidly and being actively developed, we encounter the occasional glitch, outage, or obscure performance problem that we have trouble solving through what we can reproduce in our testing environments.

That’s where you, as a user of Sync, can help us out with very little effort.

Let’s get to it

  1. Get the Sync log from the browser having problems
    • Firefox 7+
      1. Enter “about:sync-log” in your Firefox address bar.
      2. Find the last error report file in the list and click on it. It should be called something like error-1317675754221.txt
        • If the log file list is empty or only contains outdated logs (you can tell by the dates), go to about:config, find the services.sync.log.appender.file.logOnSuccess preference, set it to true, and sync again. Then repeat these steps.
        • If you have a Sync problem but there are no error reports, only logs (named something like 1317425079289.log) or success reports (named something like success-1317547966983.txt), click on the most recent log file.
    • Firefox 6 or older
      First…you should *really* upgrade your Firefox. Why are you running this old version? But if you insist…

      1. Enter “about:config” in your Firefox address bar.
      2. Find the services.sync.log.appender.debugLog.enabled preference and set it to true
      3. Restart the browser. The preference change will not take effect otherwise!
      4. Sync again to reproduce the problem.
      5. Enter “about:sync-log” in your Firefox address bar
  2. Copy the contents into your clipboard
    • Mac: Command-A Command-C
    • Linux/Windows: Ctrl-A Ctrl-C
  3. Paste the logs into a pastebin
    1. Go to mozservices.pastebin.mozilla.org (or your alternate pastebin provider of choice)
    2. Paste (Command-V or Ctrl-V) your clipboard into the text box
    3. Change the retention to “a month”
    4. Click “Send”
    5. Copy the resulting URL from your address bar into your clipboard
  4. Link us to your pastebin
    • Pick your media:
      1. Twitter: @mozservices
        DM or just public msg to us — your call.
      2. IRC: #sync on irc.mozilla.org
    • Paste the URL from pastebin with a brief description of what you’re experiencing
  5. Done. If we have follow-up questions or suggestions, we’ll contact you (hopefully shortly). Of course, if your problem goes away (either through direct action or inaction), please let us know.

Conclusion:

Thanks for your help. There have been instances where a user’s logs provided exactly the information we needed to fix a problem affecting many other Sync users that, for whatever reason, we couldn’t duplicate or that was missed by our testing.


-jv

Python Sync hotfix — Sync data not clearing correctly

12-Jul-11

Summary

Since the Python Sync deployment, we’ve been receiving user reports of issues with clearing Sync data in Account Portal. After a couple weeks of investigation, we determined the root cause this morning (bug 670975) and will be pushing a hotfix at 3pm PDT today (bug 670993).

The Problem

While reviewing logs a few days ago, we observed that DELETE requests to the Sync server were being sent in the form “DELETE //1.1/username/storage”, with one extra slash. This is typically considered a low-impact (harmless) problem, as webservers traditionally just consume the extra slash and continue to work as expected. The result of the observed DELETEs was a 404, which is an API-permissible response when the user has no data.

However, while researching logs in bulk today, we found that *all* DELETE requests today received an HTTP 404 status code, and that the body size returned was significantly larger than the normal single-integer Sync status reply. The search was expanded to include GET requests (for quota), and we determined that all GET requests were impacted as well.

Root Cause

We found that nginx/gunicorn react poorly to the extra slash, as opposed to Apache/PHP, which simply ignores it. nginx does not remove the extra ‘/’ and passes it through to Python, which considers it an invalid path and responds with an HTML 404 error page. This increases the returned data byte count that we see in the server logs and is the verified behavior in all staging and production Sync environments.

The Fix

Today’s hotfix is a two-character change to Account Portal, removing the extra slash from both the GET and DELETE URL constructions (“%s/%s” -> “%s%s”). This resolves all known issues with the affected actions (clear data and get quota).
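
To see why those two characters matter, here is the shape of the bug and the fix, illustrated with printf (the URL pieces are examples, not the actual Account Portal code): the path component already begins with a slash, so the format string should not add another.

# Before: the format string inserts a '/', but the path already starts with one.
printf '%s/%s\n' 'https://sync.example.com' '/1.1/jrandomuser/storage'
#   https://sync.example.com//1.1/jrandomuser/storage   <- doubled slash, HTML 404 from nginx/Python
# After: drop the joining slash.
printf '%s%s\n' 'https://sync.example.com' '/1.1/jrandomuser/storage'
#   https://sync.example.com/1.1/jrandomuser/storage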

Ops Improvements

We did not detect this issue because 404 status codes in response to GET and DELETE requests are API-compliant and are not a problem when they occur for any single user. The increased rate of 404s from the Sync server was invisible on all existing graphs, as Account Portal makes only a few requests per day out of millions total. In hindsight, a graph showing only account-portal HTTP status codes would have shown a dramatic change to 100% 404s when Python Sync was deployed. We are working on these graphs (bug 671017) as they provide critical visibility into Account Portal.

(content by Richard Soderberg, Sr. Operations Engineer in Mozilla Services)

Mozilla Services Operations is doing an IAmA on reddit

25-Mar-11

You can ask us anything here: http://redd.it/gbey4

Redirecting kickstart %pre/%post over serial console

31-Jan-11

TL;DR: To push output over the serial console, add exec < /dev/console > /dev/console at the top of your %pre and/or %post sections of your kickstart config.

Debugging %pre & %post scripts in kickstart is difficult if you’re working on remote systems

I was running into a problem with one part of my %post section, but since I was working on a remote system (over a redirected /dev/ttyS1, in this case) I was unable to see exactly what was wrong, since the only output would be:

Running post-install scripts

This was followed by a long period (over an hour for one failure scenario) of dead air until whatever was hanging had timed out. This both left the newly-kickstarted system in an indeterminate state and, almost as importantly, wasted a lot of time.

While I could let it fail and then investigate the logs post-install, that was far from ideal and did nothing to address the wasted time.  So, in order to see what’s going on (and what’s hanging), we need to push the %post output to the console.

Let’s get to it

One way:
In your kickstart config file, add exec < /dev/console > /dev/console at the top of the %pre and/or %post sections.  For example:

%post
exec < /dev/console > /dev/console
echo "# Running Post Configuration   #"
#....Do stuff....
echo "End of post-install steps"

Pro: no bothering to wrap + tee, it all streams regardless
Con: None, really.

Alternative way:
In your kickstart config file, wrap your %pre/%post script in (), bundle STDERR to STDOUT, then tee to /dev/console at the bottom. Example (including double-tee so it’s also dumping to a log file):

%post
(
echo "# Running Post Configuration   #"
#....Do stuff....
echo "End of post-install steps"
) 2>&1 | /usr/bin/tee /root/post_install.log | tee /dev/console

Pro: Possibly more robust for purposes of other redirects / tees (to logs, etc).
Con: more steps / less lazy

Now, during your %pre or %post script, anything that normally prints to the screen (STDOUT) or STDERR will stream to your console.  This allows you to see what is going on.  Add extra debug lines via echo if you’re still not able to zero in on your problem (e.g. echo "Now trying to Foo /dev/bar")

Conclusion:

The end-result differences are minimal, but I prefer to just put the exec < /dev/console > /dev/console at the top of the section and be done with it. Doing so doesn’t prevent also wrapping & tee’ing the script to dump it to a log file, which I take advantage of in the production kickstart configuration.

-jv

Working around missing –label option for RAID partitions in Kickstart

19-Jan-11

TL;DR – During kickstart, you can initialize “normal” partitions with a simple --label flag, so why not RAID partitions? There are hardware setups that cause the on-disk OS to “see” your drives differently than kickstart did (a situation UUIDs normally solve, but referencing the UUID can be non-ideal in certain cases). This work-around fixes some of those cases.

RAID partitions are commonplace, yet RedHat forces them to be treated differently

The --label flag has been available for the partition option in kickstart configs for a long time (since at least RedHat v7.2). RAID partitions are quite common (I imagine more popular than ever, though I don’t know for sure), and labeling RAID partitions is done on the command line just like any single-drive ext2/ext3/xfs partition, yet this functionality isn’t [yet] in anaconda (the kickstart tool).

This had until recently seemed (to me) to be just a mild curiosity, but then I had to work through a situation where kickstart was seeing the drives of these machines in one order, but the post-kickstart, locally-installed OS sometimes saw them in another order. So kickstart sees the on-board SATA controller first (thus containing sda, sdb, …), but on some systems (of the same hardware config) the locally-installed OS sees the PCI-X card first. While it’s [as far as we can tell] non-deterministic which servers will exhibit this logical drive-swap, at least it’s stable and not random during boot.

If this were just a server or two, it would certainly be annoying, but not a huge problem. However, when we’re talking about getting hundreds (or thousands) of systems online, it becomes unmanageable. Fortunately, this situation is easily dealt with by using a non-changing reference to the RAID partition. The options left are UUIDs & labels. Using the UUID is the default behavior these days, but for my particular use-case (PXE-initiated local drive booting…an unusual scenario I will lay out in another article) UUIDs are not manageable. Also, UUIDs, since they are just a string stored in the superblock, can and do change due to corruption [or corrective action thereof]. I have encountered a situation where a machine lost power, ran fsck on boot-up, and then the OS couldn’t see the partition because the UUID had changed. It was fixed by booting a rescue image and altering /etc/fstab to mount the new UUID, but still a pain.

Labels, on the other hand, can be configured consistently on all systems and are still deterministic since they are tied to the partition no matter where it shows up in the drive order (just like UUID).

Notably, I’ve filed a bug with RedHat to correct this long-standing oversight, but in the meantime there is a work-around.

Let’s get to it

  1. Set up your kickstart partition configs to use RAID (examples here are / & /boot)
    part raid.0 --ondisk=sda --asprimary --fstype ext2 --size=100
    part raid.1 --ondisk=sdb --asprimary --fstype ext2 --size=100
    raid /boot --level=1 --device=md0 --fstype=ext2 raid.0 raid.1
    
    part raid.10 --ondisk=sda --asprimary --fstype ext3 --size=10000 --grow
    part raid.11 --ondisk=sdb --asprimary --fstype ext3 --size=10000 --grow
    raid / --level=1 --device=md1 --fstype=ext3 raid.10 raid.11
  2. Add a %post script block to label the RAID partition(s) plus a hack to alter the /etc/fstab. (Example is for ext2/ext3, for XFS use xfs_admin -L <LABEL> /dev/mdX instead of the e2label commands shown.)
    %post
    e2label /dev/md0 BOOT
    perl -p -i -e 's!^UUID=[[:alnum:]-]+[[:blank:]]+/boot[[:blank:]]!LABEL=BOOT /boot !' /etc/fstab
    
    e2label /dev/md1 ROOT
    perl -p -i -e 's!^UUID=[[:alnum:]-]+[[:blank:]]+/[[:blank:]]!LABEL=ROOT /!' /etc/fstab
    %end
  3. Kickstart your system(s)

Notes:

In this example, I labeled these drives “BOOT” and “ROOT” so that it was painfully obvious what part was the label.  In real life, I recommend using something less garish, like “/boot” and “/” respectively.

-jv

Incrementing alpha strings with Vim

12-Jan-11

TL;DR: Need to increment a MAC address (for example) in Vim? Do this: :set nf=octal,hex,alpha
Now put the cursor over the 2nd character of the octet pair and press <CTRL-A> to increment. Add the ‘set’ line to your .vimrc for permanence.

What do you do when you need to increment a MAC address in a config file (and don’t care to write some fancy parser in Perl or something)?

I ran into this need because we have servers with two ethernet interfaces; however, we only have the MAC addresses for the first (eth0).  Knowing that they are numbered consecutively (eth1’s MAC is “one more” than eth0’s), we simply have to increment the last octet of the known (first) MAC.

The problem is that Vim defaults to increment/decrement only acting on numbers — MACs, being hex and therefore [potentially] containing alphabetical characters, don’t increment “correctly” without some tweaking of your Vim settings.

Let’s get to it

It’s quite simple: just instruct Vim to treat letters as part of a number format, and incrementing/decrementing will handle letters as well. This is accomplished by setting ‘nrformats’ (or ‘nf’ for short):

:set nf=octal,hex,alpha

Now pressing <CTRL-A> will increment ‘a’ to ‘b’, as well as continue to increment ‘5’ to ‘6’, as before.  <CTRL-X> decrements.
So, using the MAC incrementing as an example, if you have “00:30:48:34:b9:9e” (cursor over the ‘e’) and press <CTRL-A>, you will get “00:30:48:34:b9:9f”.

If you want to make this a permanent feature, just add a set nf=octal,hex,alpha line to your ~/.vimrc file.

Caveats

This does change the behavior of <CTRL-A> and <CTRL-X> slightly (beyond the obvious).  With the defaults (nf=octal,hex), if you have ‘foo123’ (cursor over the ‘f’) and press <CTRL-A>, you will end up with ‘foo124’.  If you change the nf setting as described, you will end up with ‘goo123’.  So, it’s important that you are more precise with your cursor placement.  This could be of particular concern when combining this with Vim macros.

Also, Vim is not as ‘smart’ as Perl (or any other language that you might like).  You can’t increment past ‘z’ or ‘Z’ (not a problem with MACs, but your use case may be different), nor decrement before ‘a’ or ‘A’.  If you expect either of those things to work, you’ll have to write a parser in a more powerful language to do what you expect.

-jv