We recently had a situation where we needed to copy a lot of HBase data while migrating from our old datacenter to our new one. The old cluster was running Cloudera’s CDH2 with HBase 0.20.6 and the new one is running CDH3b3. Usually I would use Hadoop’s distcp utility for such a job. As it turned out we were unable to use distcp while HBase was still running on the source cluster. Part of the reason for this is that the HFTP will throw XML errors due to HBase modifying files (particularly the case if HBase removes a directory). And to transfer our entire dataset at the time was going to take well over a day. This presented a serious problem because we couldn’t accept that kind of downtime. We were also about 75% full in the source cluster so doing HBase export was out as well. Thus I created a utility called Backup.
Backup is designed to essentially do the same work as distcp with a few differences. The first being that Backup would be designed move beyond failures. Since we’re still running HBase on the source cluster we can actually expect quite a few failures as a matter of fact. So inside Backup’s MapReduce job will by design catch generic exceptions. This is probably a bit over-zealous, but I really needed it not to fail no matter what. Especially after a few hours in.
One of the other differences is that I designed Backup to always use relative paths. It does this by generating a common path between the source and destination via regular expression. Distcp on the other hand will do some really interesting things depending on what options you’ve enabled. If you use the -f flag for providing a file list, it will take all the files and write them directly to the target directory, rather than putting them in their respective sub-directories based on the source path. If you run with the -update flag it seems to put the source directory inside the destination rather than realizing that I want these two directories to look the same.
The last major difference is that Backup is designed to run in update mode always. This was found because our network connection could only push about 200 MB/s between datacenters. We later found that a firewall was the bottleneck, but we didn’t want to drop our pants to the world either. Distcp would take hours just to stat and compare the files. For context we had something on the order of 300K-400K files we were looking to transfer. This is because distcp currently does this in a single-thread before it runs its MapReduce job. This actually makes sense when considering that distcp is only a single MapReduce job and it wants to distribute the copy evenly. Since we needed to minimize downtime, the first thing I did was distribute the file stat comparisons. In exchange we currently take a hit on not being able to evenly distribute the copy work. Backup uses a hack to attempt to get better distribution, but it’s nowhere near ideal. Currently it looks at the top-level directories just under the main source directory. It then splits that list of directories into mapred.map.tasks number of files. Since the data is small (remember this is paths and not the actual data) you’re pretty much guaranteed MapReduce will take your suggestion for once. This splits up the copy pretty well especially for the first run. On subsequent runs however you’ll get bottlenecked by a few nodes doing all the work. You can always up the mapred.map.tasks even higher, but really I need to split it out into two MapReduce jobs. I also added a -f flag so that we could specify file lists. I’ll explain later on why this was really useful for us.
So back to our situation. I ran the first Backup job while HBase was running. This copied the bulk of our 28 TB dataset obviously with a bunch of a failures because HBase had deleted some directories. Now that we had most of the data we could do subsequent Backup’s within a smaller time window. We ingest about 300 GB/day so our skinny pipe between datacenters was able to make subsequent transfers in hours and not days. During scheduled downtime we would shutdown the source HBase. Then we copied the data to a secondary cluster in the new datacenter. As soon as the transfer was finished we would verify the source and destination matched. If so then we were all good to start up the source cluster again and resume normal production operation. Meanwhile we would copy the data from the secondary cluster to the new production cluster. The reason for doing this was because HBase 0.89+ would change the region directories, and we also needed to allow Socorro web developers to do their testing. So having the two separate clusters was a real blessing. It allowed us to keep a pristine backup at all times on secondary while testing against the new production cluster. So we did this a number of times the week before launch. Always trying to keep everything as up to date as we could before we threw the switch to cut over.
It was during this last week I added the -f flag which allowed giving Backup a source file list. We would run “hadoop fs -lsr /hbase” on both the source and the destination cluster. I wrote a simple python utility (lsr_diff) to compare these two files and figure out what needed to be copied and what needed to be deleted. The files to copy could be given to the Backup job while the deletes could be handled with a short shell script (Backup doesn’t have delete functionality). The process looked something like this:
RUN ON SOURCE CLUSTER:
hadoop fs -lsr /hbase > source_hbase.txt
RUN ON TARGET CLUSTER:
hadoop fs -lsr /hbase > target_hbase.txt
scp source_host:./source_hbase.txt .
python lsr_diff.py source_hbase.txt target_hbase.txt
sort copy-paths.txt -o copy-paths.sorted
sudo -u hdfs hadoop fs -put copy-paths.sorted copy-paths.sorted
nohup sudo -u hdfs hadoop jar akela-job.jar com.mozilla.hadoop.Backup -Dmapred.map.tasks=112 -f hdfs://target_host:8020/user/hdfs/copy-paths.sorted hftp://source_host:50070/hbase hdfs://target_host:8020/hbase
The number of map tasks I refined over time, but I started the initial run with (# of hosts * # of map task slots). On subsequent runs I ended up doubling that number. After the backup job completed each time we would run “hadoop fs -lsr” and diff again to make sure that everything copied over. I saw a lot of times that wasn’t the case when the source was HFTP from one datacenter to another. However when copying files from an HDFS source within our new datacenter I never saw an issue with copying.
Due to other issues (there always are right?) we had a pretty tight timeline and this system was pretty hacked together, but it worked for us. In the future I would love to see some modifications made to distcp. Here’s my wishlist based on our experiences:
1.) Distribute the file stat comparisons and then run a second MapReduce job to do the actual copying.
2.) Do proper relative path copies.
3.) Distribute deletes too.
To be honest though I found the existing distcp code a bit overly complex otherwise I might have made the modifications myself. Perhaps the best thing is that someone take a crack at a fresh rewrite of distcp altogether. I would love to hear people’s feedback.
7 Responses to “Migrating HBase: In the Trenches”
Leave a Reply
You must be logged in to post a comment.