S.E.A.N.I.C.U.S.

Monday, July 17, 2006

Ruby as Enterprise "Glue"

Any experienced Rubyist knows that the language can be used to create short, powerful shell scripts and integrate disparate systems, even if the broader community is only now realizing how powerful Ruby can be in the enterprise.

For example, today I ran into two problems. As a result of cleartext, unencrypted email addresses on the KCKCC website, many at the college are receiving lots of spam. We are in the process of obfuscating all of these addresses using a pretty standard JavaScript function that composes the address from its parts. However, we have around 8000 items in our web directory structure, only some of which contain email addresses and only some of which contain cleartext ones (we've already had one pass at obfuscating some). I thought immediately, "Ruby and regular expressions are the answer!" Around 30 lines later (including comments), I had a powerful Ruby script that could run from the shell and change all exposed addresses to their obfuscated counterparts.
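
Roughly, the idea looks like this in Ruby. The regex, the writeEmail JavaScript helper it emits, and the file glob below are illustrative placeholders rather than what is actually in the production script:

#!/usr/bin/env ruby
# Sketch: walk the web tree, find cleartext addresses, and replace each one
# with a call to a JavaScript helper that reassembles the address client-side.
EMAIL = /\b([\w.+-]+)@([\w-]+(?:\.[\w-]+)+)\b/

def obfuscated(user, domain)
  # Emit a script call that composes the address from its parts at render time.
  %(<script type="text/javascript">writeEmail('#{user}', '#{domain}');</script>)
end

Dir["./**/*.{html,htm,php}"].each do |path|
  text = File.read(path)
  next unless text =~ EMAIL
  File.open(path, "w") do |f|
    f.write(text.gsub(EMAIL) { obfuscated($1, $2) })
  end
end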

However, several tests revealed that not all of the cases were covered. Luckily we have our webserver in a nightly rsync backup configuration with another box, so I was able to test the script there without breaking the website. As Eric and I ran it and watched its output, we realized that even if the script works 90% of the time, there are occasionally going to be problems. We decided it was long overdue to back up the website into a Subversion repository.

So I ran an svn import on the root web folder into a fresh repository. Little did I know that Apache/mod_dav_svn has a 2GB limit on commits! In the middle of adding files, the import hung.

Ruby script to the rescue!

My first attempted solution was to add each top-level directory individually. Unfortunately, even some of those were over 2GB in size! So I modified the script to add and commit each file one by one, which quickly racked up 900+ commits without even finishing. Then I had an epiphany: I could keep a running total of the files' sizes and, once a batch had accrued a large enough commit size, commit several files at once, saving both the number of commits and the transfer time (commit operations are atomic, and therefore expensive).

The end of the day arrived before I noticed that I was committing the whole working directory rather than just the accrued files, so every commit had to recurse the entire directory structure to figure out what had been added. I left it running anyway, since it seemed to be doing fine, albeit with slow commits. One other caveat I found: svn will not return an error code when something can't be added, only a warning.

Here's the code, roughly, for the final script (that I left running at the end of the day):
#!/usr/bin/env ruby
# usage: svnimport.rb [directory] [commitsize]
# Walks the tree, svn-adds each entry, and commits whenever the accrued
# size of the added files reaches the given threshold.
Dir.chdir(ARGV.shift)
@maxsize = ARGV.shift.to_i
@size = 0

Dir["./**/*"].sort.each do |file|
  # Add the entry non-recursively; its children show up later in the sorted glob.
  fork do
    exec("svn add -N \"#{file}\"")
  end
  Process.wait
  @size += File.size(file) if $?.success?

  # Once enough has accrued, commit. Note that this commits the whole working
  # directory, not just the accrued files (hence the slow commits mentioned above).
  if @size >= @maxsize
    fork do
      exec("svn commit -m \"Initial commit\"")
    end
    Process.wait
    @size = 0
  end
end
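
For what it's worth, the obvious fix is to hand svn commit an explicit list of the accrued paths instead of committing from the top of the working copy, and to flush whatever is left over after the loop. Here's a rough, untested sketch of that variant (the commit helper is an illustration, not part of the script that actually ran):

#!/usr/bin/env ruby
# usage: svnimport.rb [directory] [commitsize]
# Variant sketch: commit only the files accrued in the current batch so svn
# does not have to recurse the whole tree on every commit.
Dir.chdir(ARGV.shift)
@maxsize = ARGV.shift.to_i
@size = 0
@batch = []

def commit(paths)
  return if paths.empty?
  targets = paths.map { |p| "\"#{p}\"" }.join(" ")
  fork { exec("svn commit -m \"Initial commit\" #{targets}") }
  Process.wait
end

Dir["./**/*"].sort.each do |file|
  fork { exec("svn add -N \"#{file}\"") }
  Process.wait
  next unless $?.success?
  @batch << file
  @size += File.size(file)
  if @size >= @maxsize
    commit(@batch)
    @batch.clear
    @size = 0
  end
end

# Flush anything left under the size threshold.
commit(@batch)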