Since the Python Sync deployment, we’ve been receiving user reports of issues with clearing Sync data in Account Portal. After a couple weeks of investigation, we determined the root cause this morning (bug 670975) and will be pushing a hotfix at 3pm PDT today (bug 670993).
While reviewing logs a few days ago, we observed that DELETE requests to the Sync server were being sent in the form “DELETE //1.1/username/storage”, with one extra slash. This is typically considered a low-impact (harmless) problem, as web servers traditionally consume the extra slash and continue to work as expected. The observed DELETE requests returned 404, which is an API-permissible response when the user has no data.
However, while researching logs in bulk today, we found that *all* DELETE requests today received a 404 HTTP status code, and that the returned body size was significantly larger than the normal single-integer Sync status reply. We expanded the search to include GET requests (for quota) and determined that all GET requests were impacted as well.
We found that nginx/gunicorn were reacting poorly to the extra slash, as opposed to Apache/PHP, which simply ignores it. nginx does not remove the extra ‘/’ and passes it to Python, which considers it an invalid pathname and responds with an HTML 404 page. This increases the returned byte count that we see in the server logs, and is the verified behavior in all staging and production Sync environments.
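As a rough illustration of the difference in behavior, the sketch below contrasts a strict path match with one that collapses repeated slashes first. It is not the actual Sync server routing code: the route pattern and the function names `strict_status` and `lenient_status` are hypothetical, chosen only to mirror the request path quoted above.

```python
import re

# Hypothetical route pattern for the storage endpoint (an assumption,
# modeled on the "/1.1/username/storage" path seen in the logs).
ROUTE = re.compile(r"^/1\.1/(?P<user>[^/]+)/storage$")


def strict_status(path):
    """Match the path exactly as received; a leading '//' does not match."""
    return 200 if ROUTE.match(path) else 404


def lenient_status(path):
    """Collapse runs of slashes before matching, so '//' behaves like '/'."""
    collapsed = re.sub(r"/{2,}", "/", path)
    return 200 if ROUTE.match(collapsed) else 404


print(strict_status("//1.1/username/storage"))   # strict match rejects the path
print(lenient_status("//1.1/username/storage"))  # lenient match accepts it
```

A server that behaves like `strict_status` returns 404 for every request carrying the extra slash, regardless of whether the user has data, which matches what we saw in the logs.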
Today’s hotfix is a two-character change to Account Portal, removing the extra slash from both the GET and DELETE URL constructions (“%s/%s” -> “%s%s”). This resolves all known issues with the affected actions (clear data and get quota).
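The effect of the format-string change can be sketched in a few lines. This is not the Account Portal source: the helper names, the `sync.example.com` base URL, and the assumption that the base already ends in a slash are all hypothetical, used only to reproduce the doubled slash.

```python
from urllib.parse import urlsplit


def build_url_buggy(base, path):
    # Inserts a "/" even though the (assumed) base already ends with one.
    return "%s/%s" % (base, path)


def build_url_fixed(base, path):
    # The hotfix pattern: concatenate without the extra separator.
    return "%s%s" % (base, path)


base = "https://sync.example.com/"  # hypothetical base URL with trailing slash
path = "1.1/username/storage"

print(urlsplit(build_url_buggy(base, path)).path)  # path with the doubled slash
print(urlsplit(build_url_fixed(base, path)).path)  # path with a single slash
```

Removing one character from each of the two URL constructions (GET and DELETE) accounts for the two-character change.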
We did not detect this issue because 404 status codes on GET and DELETE requests are API-compliant and are not a problem when they occur for any single user. The increased rate of 404s from the Sync server was invisible on all existing graphs, as Account Portal makes only a few requests per day out of millions total. In hindsight, a graph showing only account-portal HTTP status codes would have shown a dramatic change to 100% 404s when Python Sync was deployed. We are working on these graphs (bug 671017) as they provide critical visibility into Account Portal.
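To show why a per-client breakdown makes this kind of failure visible, here is a minimal sketch that computes status-code shares per client. The record format (a `(client, status)` tuple), the client labels, and the sample counts are illustrative assumptions, not our real log schema or traffic figures.

```python
from collections import Counter, defaultdict


def status_share_by_client(records):
    """Return {client: {status: fraction}} from (client, status) records."""
    counts = defaultdict(Counter)
    for client, status in records:
        counts[client][status] += 1
    shares = {}
    for client, per_status in counts.items():
        total = sum(per_status.values())
        shares[client] = {s: n / total for s, n in per_status.items()}
    return shares


# Made-up sample: a handful of portal requests among many client requests.
records = [("sync-client", 200)] * 997 + [("account-portal", 404)] * 3

overall_404 = sum(1 for _, status in records if status == 404) / len(records)
shares = status_share_by_client(records)

print(overall_404)                    # tiny fraction in the aggregate view
print(shares["account-portal"][404])  # every portal request is a 404
```

In the aggregate view the 404s are lost in the noise, while the account-portal slice alone shows every request failing.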
(content by Richard Soderberg, Sr. Operations Engineer in Mozilla Services)