On November 11, 2014, Mozilla announced the Polaris Privacy Initiative. One key part of the initiative is our support for the Tor network through the deployment of Tor middle relay nodes. On January 15, 2015, our first proof of concept (POC) went live.
TL;DR: here are our Tor relays: https://globe.torproject.org/#/search/query=mozilla
When we started this POC, the requirements we had were:
- the Tor nodes should run on dedicated hardware
- the nodes should be logically and physically separated from our production infrastructure
- the hardware should be low-cost and commoditized
- nodes should be operational within 3 weeks
Hardware and Infrastructure
- We chose to make use of our spare and decommissioned hardware. That included a pair of Juniper EX4200 switches and three HP SL170zG6 servers (48 GB RAM, 2 × Xeon L5640, 2 × 1 Gbps NICs).
- We dedicated one of our existing IP Transit providers to the project (2 × 10 Gbps).
The current design is fully redundant. This lets us perform maintenance or lose a node without taking down all of the traffic. The worst case scenario is a 50% loss of capacity.
The design also allows us to easily add more servers in the event we need more capacity, with no anticipated impact.
Building and Learning
There is a large body of knowledge available on building Tor nodes. I read mailing list archives, blog posts, and tutorials, and I had exchanges with people already running large relays. There are still data points Mozilla needs to understand before our experiment is complete. This section is a “quick rundown” of some of those data points.
- A single organization shouldn’t be running more than 10Gbps of traffic for a middle relay (and 5Gbps for an exit node).
This seems to be more of a gut feeling from existing operators than a proven value (let me know if I’m wrong), but it makes sense. We do have available transit and capacity. Understanding throughput and resource utilization is a key criterion for us.
Important Note: An operator running several relays must use the “MyFamily” option in torrc. This ensures that a user’s traffic doesn’t bounce through several of your servers.
- Slow ramp up
A new Tor instance (identified by its private/public key pair) will take time (up to 2 months) to use all its available bandwidth. This is explained in this blog post: The lifecycle of a new relay. We will be updating our blog posts and are curious how closely our nodes mirror the lifecycle.
- A Tor process (instance) can only push about 400Mbps.
This is based on mailing list discussions, as we haven’t reached that bandwidth yet. We run several instances per physical server.
- A single public IP can only be shared by 2 Tor instances
- Listen on well known ports like 80 or 443
This helps people behind strict firewalls to access Tor. Don’t worry about running the process as root (needed to listen on ports < 1024). As long as you have the “User” option in torrc, Tor will drop its privileges after binding to the ports.
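To make the data points above a bit more concrete, here is a rough Python sketch of how they could combine when planning one server: how many Tor instances it takes to fill the NICs, how many public IPs that requires, and what each instance’s torrc might contain. This is an illustration, not our actual Ansible role. The nickname, the `tor` user account, the data directory path, the 192.0.2.x addresses (a documentation range) and the fingerprints are all made-up placeholders; only the torrc option names are real.

```python
import math

# Figures quoted in the list above.
PER_INSTANCE_MBPS = 400   # rough ceiling of one Tor process (mailing list figure)
INSTANCES_PER_IP = 2      # Tor instances that can share one public IP
WELL_KNOWN_PORTS = (443, 80)


def plan_server(nic_mbps):
    """Return (instances, public_ips) needed to fill nic_mbps of capacity."""
    instances = math.ceil(nic_mbps / PER_INSTANCE_MBPS)
    public_ips = math.ceil(instances / INSTANCES_PER_IP)
    return instances, public_ips


def render_torrc(nickname, address, or_port, family, user="tor"):
    """Render a minimal middle-relay torrc as a string.

    `family` is the list of fingerprints of every relay we operate.
    The user name and data directory are hypothetical defaults.
    """
    lines = [
        f"Nickname {nickname}",
        f"DataDirectory /var/lib/tor/{nickname}",  # one directory per instance
        f"Address {address}",
        f"ORPort {address}:{or_port}",
        f"OutboundBindAddress {address}",
        "ExitPolicy reject *:*",                   # middle relay, never an exit
        f"User {user}",                            # drop root after binding
        "MyFamily " + ",".join("$" + fp for fp in family),
    ]
    return "\n".join(lines) + "\n"


def render_server(prefix, addresses, family):
    """Render one torrc per (address, port) pair, two instances per IP."""
    configs = {}
    for ip_index, address in enumerate(addresses):
        for port in WELL_KNOWN_PORTS[:INSTANCES_PER_IP]:
            nickname = f"{prefix}{ip_index}p{port}"
            configs[nickname] = render_torrc(nickname, address, port, family)
    return configs


if __name__ == "__main__":
    instances, ips = plan_server(2 * 1000)  # a server with 2 x 1 Gbps NICs
    print(f"{instances} instances on {ips} public IPs")
    placeholder_family = ["A" * 40, "B" * 40]  # not real fingerprints
    print(render_torrc("example0", "192.0.2.10", 443, placeholder_family))
```

With the figures above, a server with 2 × 1 Gbps of NIC capacity would need 5 instances spread over 3 public IPs before the per-process ceiling stops being the bottleneck, assuming the 400 Mbps estimate holds.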
We decided to use Ansible for configuration management. A few things motivated us to make that choice.
- There was an existing ansible-tor role very close to what we needed to accomplish (and here is our pull request with our fixes and additions).
- Some of our teams are using Ansible in production and we (Network Engineering) are considering it.
- Ansible does not require a heavy client/server infrastructure, which should make it more accessible to other operators.
And look! Mozilla’s Ansible configuration is available on GitHub!
The security team helped us a lot throughout this project. Together we drew up a list of requirements, such as:
- strict firewall filtering
- hardening the operating system (disable unneeded services, good SSH configuration, automatic updates)
- hardening the network devices management plane
- implementing edge filtering to make sure only authorized systems can connect to the “network management plane”
The infrastructure can only be administered from the jumphost. Systems don’t accept management connections from anywhere else.
It is important to note that many of the security requirements align nicely with what’s considered good practice in general system and network administration. Take enabling NTP or centralized syslog, for example: they are equally important for running some services smoothly, for troubleshooting, and for Incident Response. A similar idea applies to the principle “make sure the security of the network devices is at least as good as that of the systems”.
We’ve also implemented a periodic security check that runs against these systems. All of them are scanned from the inside for missing security updates and from the outside for open ports.
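The “outside” half of such a check can be sketched in a few lines of Python. This is not the tool we actually run, just a minimal illustration of the idea: try a TCP connection to a list of ports and flag anything that answers but isn’t on the allow-list. The host address, the port list and the allow-list in the example are hypothetical.

```python
import socket


def open_ports(host, ports, timeout=1.0):
    """Return the ports on host that accept a TCP connection."""
    found = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            # connect_ex returns 0 on success and an errno value otherwise.
            if sock.connect_ex((host, port)) == 0:
                found.append(port)
    return found


def unexpected_ports(found, allowed):
    """Return the open ports that are not on the allow-list, sorted."""
    return sorted(set(found) - set(allowed))


if __name__ == "__main__":
    # Hypothetical relay address (documentation range) and allow-list.
    host = "192.0.2.10"
    allowed = {80, 443}
    found = open_ports(host, [22, 25, 80, 443, 9001, 9030])
    print("unexpected open ports:", unexpected_ports(found, allowed))
```

A plain connect scan like this only covers TCP and only the ports you list, so it complements firewall reviews rather than replacing them.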
One of the things we’re wondering about is how to figure out whether we’re running an efficient relay (in terms of cost, participation in the Tor network, hardware efficiency, etc.). Which metrics should we use, and how should we use them?
Looking around, it seems like there is no “good answer”. We’re graphing everything we can about bandwidth and server utilization using Observium. The Tor network already has a project to collect relay statistics called Tor metrics. Thanks to it, tools like Globe and others can exist.
Note that we have only just started our relays, and they are far from running at their maximum bandwidth (for the reasons listed above). We will share more information down the road about performance and scaling.
Depending on the results of the POC, we may move the nodes to a managed part of our infrastructure. As long as their private keys stay the same, their reputation will follow them wherever they go, with no new ramp-up period.
On the technical side, there are a lot of possible things to do, like adding IPv6 connectivity. We’re reviewing opportunities to automate more parts of the deployment (like iptables, logs, etc.).
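If you want to follow the ramp-up yourself, the relay data behind Tor metrics is also exposed as JSON through the Onionoo service. The sketch below is one possible way to summarize it in Python, not something we run in production. It assumes the public Onionoo “details” endpoint is reachable and that the search term `mozilla` matches our relays. The field names (`nickname`, `running`, `observed_bandwidth`, `consensus_weight`) come from the Onionoo details documents, where `observed_bandwidth` is expressed in bytes per second.

```python
import json
from urllib.request import urlopen

# Assumed endpoint and search term; adjust to your own relays.
ONIONOO_URL = "https://onionoo.torproject.org/details?search=mozilla"


def fetch_details(url=ONIONOO_URL):
    """Download and parse an Onionoo details document."""
    with urlopen(url, timeout=30) as resp:
        return json.load(resp)


def summarize(details):
    """Return one small dict per relay, with bandwidth converted to Mbps."""
    rows = []
    for relay in details.get("relays", []):
        rows.append({
            "nickname": relay.get("nickname", "?"),
            "running": relay.get("running", False),
            # bytes per second -> megabits per second
            "observed_mbps": relay.get("observed_bandwidth", 0) * 8 / 1e6,
            "consensus_weight": relay.get("consensus_weight", 0),
        })
    return rows


if __name__ == "__main__":
    for row in summarize(fetch_details()):
        print(row)
```

Running it periodically and storing the rows would give a simple time series of observed bandwidth and consensus weight to compare against the expected relay lifecycle.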
Here are a few links that you might find interesting:
[blog] IPredator – building a Tor server
[mailing list] [tor-dev] Scaling tor for a global population
[mailing list] How to Run High Capacity Tor Relays
[wiki] tor – archwiki
[blog] Run A Tor-Relay On Ubuntu Trusty
[mailing list] [tor-relays] Someone broke the tor-relay speed record?
[tor website] Configuring a Tor relay on Debian/Ubuntu
[wiki] tor exit full setup
Of course, none of that would have been possible without the help of Van, Michal (who wrote the part about security) and Opsec, Javaun, James, Moritz and the people of #tor!