Groovin’ my iTunes Folder


(At the suggestion of others, I will start providing a link to the good stuff in my posts so you can bypass the commercial-like yammering in the middle. If you just wanna see a Groovy way to remove duplicate files from a folder, jump to the Remove Duplicate Files Groovy Example at the end of this post.)

I had this idea a long time ago. It started when I noticed my music collection starting to grow. Back then I had grown used to Amarok, which I consider to be the best music app on the planet, way better than iTunes, kills Winamp. I wanted to write an MP3 tagger in Java. The idea grew into maybe trying to recode all of Amarok’s nice features in Java. Of course I couldn’t do the whole thing, but I sure wanted to get the features I used most… music tagging and organizing.

Then after I started using a Mac I had another similar need. Maybe it’s cuz I’m new to the Mac and still bringing old Windows and Linux habits with me. Whatever my hangup, every time I import music into iTunes I end up with duplicates. Duplicates by artist name, duplicates by album, duplicates by genre, duplicates, duplicates, duplicates. Like any new citizen in Mac country who doesn’t know what they’re doing, I started managing these duplicates manually, occasionally digging through the many menus and options for an easier way to fix the problem.

Enter today’s Groovy snippet!

I couldn’t take it any more! This morning I broke out TextMate (I started using this over GroovyConsole to be more Mac-like) and went to work. The core idea was not too complex. I’d recurse my Music folder and perform an MD5 hash on each file in there, collecting the files in a map keyed by the hash. Let’s see… if my hash is already in the map then I have a duplicate. Sounds just about right, right? Now how to MD5 a file in a directory? A Google search turned up an old JavaLobby post I’d read from R.J. with a complete example in Java.

public static void main(String[] args) throws NoSuchAlgorithmException, FileNotFoundException {
	MessageDigest digest = MessageDigest.getInstance("MD5");
	File f = new File("c:\\myfile.txt");
	InputStream is = new FileInputStream(f);				
	byte[] buffer = new byte[8192];
	int read = 0;
	try {
		while( (read = is.read(buffer)) > 0) {
			digest.update(buffer, 0, read);
		}		
		byte[] md5sum = digest.digest();
		BigInteger bigInt = new BigInteger(1, md5sum);
		String output = bigInt.toString(16);
		System.out.println("MD5: " + output);
	}
	catch(IOException e) {
		throw new RuntimeException("Unable to process file for MD5", e);
	}
	finally {
		try {
			is.close();
		}
		catch(IOException e) {
			throw new RuntimeException("Unable to close input stream for MD5 calculation", e);
		}
	}		
}

Copy, paste into TextMate, remove the wrapping main method, change the example file path to an arbitrary file in my home dir, and Cmd+R! It works! No sweat! Now I throw it into an md5 method block, turn the file reference into a parameter, and add just a small touch of Groove to it. (It’s important at this stage not to go too far with Groovy syntax, cause you can overcook it and break the functionality.)

import java.security.MessageDigest

digest = MessageDigest.getInstance( "MD5" )
def md5(f) {
	byte[] buffer = new byte[8192]
	InputStream is = new FileInputStream(f)
	int read = 0
	try {
		while( (read = is.read(buffer)) > 0) {
			digest.update(buffer, 0, read)
		}		
		byte[] md5sum = digest.digest()
		BigInteger bigInt = new BigInteger(1, md5sum)
		String output = bigInt.toString(16)
		return output
	}
	catch(IOException e) {
		throw new RuntimeException("Unable to process file for MD5", e)
	}
	finally {
		try {
			is.close()
		}
		catch(IOException e) {
			throw new RuntimeException("Unable to close input stream for MD5 calculation", e)
		}
	}
}

Note how I externalized the digest object because I’ll need to reuse it. Finally, I go to work on the core algorithm. I need a filedb to hold my files as I iterate. I code this along with a cool closure thingy to recursively iterate files in the folder.

filedb = [:]
new File("/Users/cliftoncraig07/Music/iTunes").eachFileRecurse {
	if(!it.isDirectory() && it.exists() && it.canRead()) {
		def hash = md5(it)
		def match = filedb[hash]
		if(match) { match.dups << it; println "X" }
		else filedb[hash] = [dups:[], file:it]
	}
}

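To make the shape of filedb concrete, here is a minimal sketch that runs the same loop over a throwaway temp folder instead of the real iTunes directory. The file names, the temp folder, and the md5Hex closure (which hashes a whole small file in memory rather than streaming it) are assumptions made up for this illustration, not part of the script above.

```groovy
import java.nio.file.Files
import java.security.MessageDigest

// Hypothetical helper for this sketch only: hash a small file in one go.
def md5Hex = { File f ->
    MessageDigest.getInstance("MD5").digest(f.bytes).encodeHex().toString()
}

// Throwaway folder with two identical files and one different one.
def dir = Files.createTempDirectory("filedb-demo").toFile()
new File(dir, "a.mp3").text = "same song"
new File(dir, "b.mp3").text = "same song"
new File(dir, "c.mp3").text = "other song"

def filedb = [:]
dir.eachFileRecurse { f ->
    if (f.isFile()) {
        def hash = md5Hex(f)
        def match = filedb[hash]
        if (match) {
            match.dups << f                     // hash seen before: record a duplicate
        } else {
            filedb[hash] = [dups: [], file: f]  // first file with this hash
        }
    }
}

// One entry per distinct content; the "same song" entry carries one duplicate.
filedb.each { hash, match ->
    println "${hash}: ${match.file.name} dups=${match.dups*.name}"
}

dir.deleteDir()   // clean up the temp folder
```

Each value in filedb is a small map: file is the first file seen with a given hash, and dups collects every later file with the same hash. Which of the two identical files ends up as file depends on the order in which eachFileRecurse visits them.
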
I need some sort of progress indicator, but I'm no GUI guru, so text output will have to do for the prototype. Add a counter that prints an asterisk for every 20 files considered...

def count = 0
new File("/Users/cliftoncraig07/Music/iTunes").eachFileRecurse {
	if(!it.isDirectory() && it.exists() && it.canRead()) {
		def hash = md5(it)
		def match = filedb[hash]
		if(match) { match.dups << it; println "X" }
		else filedb[hash] = [dups:[], file:it]
		if(count++ % 20 == 0) print "*"
	}
}

Finally I have a structure that holds all of my files and duplicates. I'm ready to iterate and finally do something about all of these duplicates. How about printing them for now? Gotta filter the db first... it's not a SQL DB, but Groovy lets you sift, slice, and dice collections almost as well as any SQL engine!

filedb.findAll { it.value.dups.size() > 0}.each { hash, match ->
	println "File " << match.file << " has " << match.dups.size() << " duplicates."
}
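
Printing is as far as the snippet goes, but the same filtered map could also drive the actual cleanup. Below is a rough sketch of one way that might look. The removeDuplicates method and its dryRun flag are hypothetical names introduced here, and the demo map at the bottom is hand-built rather than produced by the hashing loop. It defaults to a dry run because File.delete() does not go through the Trash.

```groovy
import java.nio.file.Files

// Hypothetical helper, not part of the original script: deletes every file
// listed under dups and keeps match.file. With dryRun = true it only reports.
def removeDuplicates(Map filedb, boolean dryRun = true) {
    def removed = []
    filedb.findAll { it.value.dups }.each { hash, match ->
        match.dups.each { File dup ->
            if (dryRun) {
                println "Would delete ${dup} (same content as ${match.file})"
            } else if (dup.delete()) {
                removed << dup
            }
        }
    }
    return removed
}

// Demo on a temp folder with a hand-built map (the key is a placeholder).
def dir = Files.createTempDirectory("dedupe-demo").toFile()
def keep = new File(dir, "keep.mp3")
def dup = new File(dir, "copy.mp3")
keep.text = "x"
dup.text = "x"
def demoDb = [placeholderHash: [dups: [dup], file: keep]]

def dryRunResult = removeDuplicates(demoDb)        // reports only
def dupSurvivedDryRun = dup.exists()
def reallyRemoved = removeDuplicates(demoDb, false) // deletes copy.mp3
def keepSurvived = keep.exists()

dir.deleteDir()   // clean up the temp folder
```

Running it in dry-run mode first and eyeballing the list seems like the safer habit, since two files with the same hash are only as "duplicate" as the hash says they are.
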

Look deeply and you'll see that a possible SQL equivalent would involve a join table or two. You'd have to design your schema, break out into JDBC, and who knows what else. Keeping it simple avoids that complexity at the cost of a lot more memory, since every file and its hash stay in the map. The complete example is below. Over time this could grow into my complete vision of a music library manager that could be plugged into an open source Java media player. That media player could then be enhanced with a flashy UI, and if you're as good as Romain Guy it might even blow up big time.

<a name="example"><h2>Remove Duplicate Files Groovy Example</h2></a>


import java.security.MessageDigest

digest = MessageDigest.getInstance("MD5")
filedb = [:]

def count = 0
new File("/Users/cliftoncraig07/Music/iTunes").eachFileRecurse {
	if(!it.isDirectory() && it.exists() && it.canRead()) {
		def hash = md5(it)
		def match = filedb[hash]
		if(match) { match.dups << it; println "X" }
		else filedb[hash] = createMatch(it)
		if(count++ % 20 == 0) print "*"
	}
}

filedb.findAll { it.value.dups.size() > 0}.each { hash, match ->
	println "File " << match.file << " has " << match.dups.size() << " duplicates."
}

def createMatch(f) {
	return [dups:[], file:f]
}

def md5(f) {
	byte[] buffer = new byte[8192]
	InputStream is = new FileInputStream(f)
	int read = 0
	try {
		while( (read = is.read(buffer)) > 0) {
			digest.update(buffer, 0, read)
		}		
		byte[] md5sum = digest.digest()
		BigInteger bigInt = new BigInteger(1, md5sum)
		String output = bigInt.toString(16)
		return output
	}
	catch(IOException e) {
		throw new RuntimeException("Unable to process file for MD5", e)
	}
	finally {
		try {
			is.close()
		}
		catch(IOException e) {
			throw new RuntimeException("Unable to close input stream for MD5 calculation", e)
		}
	}
}
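
One detail of md5 above is worth knowing about: BigInteger.toString(16) drops leading zeros, so some hashes come out shorter than 32 characters. That is harmless for spotting duplicates, because distinct digests still produce distinct strings, but the output will not always match what a tool like md5 on the command line prints. The sketch below is one possible variant, assuming Groovy's encodeHex() and withInputStream() GDK methods are available. The name md5Padded and the temp file are mine, not part of the original script, and it creates a fresh digest per call instead of sharing one.

```groovy
import java.security.MessageDigest

// Hypothetical variant of md5(f): fresh digest per call, zero-padded hex output.
def md5Padded(File f) {
    def digest = MessageDigest.getInstance("MD5")
    // withInputStream closes the stream for us, even if reading fails.
    f.withInputStream { is ->
        byte[] buffer = new byte[8192]
        int read
        while ((read = is.read(buffer)) > 0) {
            digest.update(buffer, 0, read)
        }
    }
    // encodeHex always emits two hex characters per byte.
    return digest.digest().encodeHex().toString()
}

// The MD5 of the single letter "a" starts with a zero nibble.
def sample = File.createTempFile("md5demo", ".txt")
sample.text = "a"
def hex = md5Padded(sample)
def unpadded = new BigInteger(1, MessageDigest.getInstance("MD5").digest("a".bytes)).toString(16)
println hex        // 0cc175b9c0f1b6a831c399e269772661
println unpadded   // same digits without the leading 0
sample.delete()
```

Swapping it in would only mean calling md5Padded(it) in the loop; the global digest line could then go away.
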
