
Cleaning house with Ruby

Tuesday, August 8th, 2006

I started consolidating all my old laptops today, so I copied my iTunes library over from the different Macs and noticed that I had all sorts of duplicate files with slightly different names on them.

As usual, the solution is a small script, in this case in Ruby. This tiny script recurses through a directory structure (I used my iTunes library) and SHA-1 hashes the contents of every file. It then moves any duplicate files to the trash for my perusal. I figure if I had written this in C++ or Cocoa it would have taken me much, much longer, even though I’m a total rookie Rubyist.

Scripting for the win :)

#!/usr/local/bin/ruby
require 'digest/sha1'
require 'fileutils'

map = Hash.new
dupes = Array.new

#parse the cmd line args, removing any trailing dir seps
abort "usage: #{$0} <directory>" if ARGV.empty?
root = ARGV[0].chomp("/")

#add the magic string to cause glob to recurse dirs
root += "/**/*"

totalSize = 0
start = Time.now

Dir.glob(root) do |path|
  unless File.directory?(path) #skip dirs, we just want files
    sz = File.size(path)
    if sz > 0 # skip zero length files
      print "processing #{path}:"
      $stdout.flush
      #hash the file in chunks, so no file handle is left open and big files aren't read into memory whole
      result = Digest::SHA1.file(path).hexdigest
      puts result

      totalSize += sz
      if map.has_key?(result)
        #it's a dupe, so make an array of the two paths and add it to the dupes array
        dupes << [path , map[result]]
      else
        map[result] = path #just add it to the hash
      end
    end
  end
end

elapsed = Time.now - start
puts "scan completed #{((totalSize/1024) / elapsed).to_i}K per sec"

dupes.each() do |a|
   puts "dupe: #{a[0]} and #{a[1]}"

   #our highly sophisticated algo..if we have a dupe..keep the one with the shorter name :)
   if a[0].length < a[1].length
     delme = a[1]
   else
     delme = a[0]
   end

   puts "moving #{delme} to the trash."
   bn = File.basename(delme)
   nn = File.expand_path("~/.Trash") + "/" + bn
   FileUtils.move(delme, nn) #move the file we picked above, not always a[0]
end

if dupes.empty?
  puts "**** No duplicates found ****"
end

Note that this script doesn’t prompt you for anything, so if you use it, be prepared and back up your stuff first!
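
If moving files without a prompt feels too risky, a dry run is one way to see what would happen first. The following is only a minimal sketch under a few assumptions: `find_dupes` is a hypothetical helper name (it is not part of the script above), it walks the tree with Ruby's standard `Find` module instead of `Dir.glob`, and it only reports pairs of identical files without moving or deleting anything.

```ruby
require 'digest/sha1'
require 'find'

# Hypothetical helper (not in the original script): returns an array of
# [first_seen_path, duplicate_path] pairs and never touches the files.
def find_dupes(root)
  seen  = {}  # SHA-1 hex digest => first path seen with that content
  dupes = []

  Find.find(root) do |path|
    next unless File.file?(path)  # skip directories and other non-files
    next if File.zero?(path)      # skip zero-length files, like the script does

    digest = Digest::SHA1.file(path).hexdigest  # reads the file in chunks
    if seen.key?(digest)
      dupes << [seen[digest], path]
    else
      seen[digest] = path
    end
  end

  dupes
end

# Example use (the path is just an illustration):
#   find_dupes(File.expand_path("~/Music")).each do |kept, dupe|
#     puts "would move #{dupe} (same content as #{kept})"
#   end
```

Once the reported pairs look right, the same list could be fed to whatever "which one do I keep" rule you like before anything gets moved to the trash.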