Comparing PDF Binary Data

If you generate the same PDF twice using iText5, you get different results: the creation date, modification date and ID differ each time. This makes it difficult to write a repeatable test. It’s also not ideal to mock out the whole PDF generation when the library’s sole purpose is to manipulate PDFs, as that gives no confidence that the code works.

I decided to read the PDF Reference, which documents the on-disk PDF format, to figure out how to write a binary comparison function that ignores those differences.

Comparing hex dumps of the two files shows where they differ:

diff -y <(xxd 1.pdf) <(xxd 2.pdf) | colordiff
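
The same inspection can be roughly sketched in Scala, for cases where xxd and colordiff are not available. This is only an illustration: `ByteDiff` and `differingOffsets` are names made up here, and the sketch compares bytes position by position, so it does not realign the way diff does when content shifts.

```scala
import java.nio.file.{Files, Paths}

object ByteDiff {
  // Hypothetical helper: offsets at which the two arrays hold different bytes.
  // Only the overlapping prefix is compared; extra trailing bytes are ignored.
  def differingOffsets(a: Array[Byte], b: Array[Byte]): Seq[Int] =
    (0 until math.min(a.length, b.length)).filter(i => a(i) != b(i))

  def main(args: Array[String]): Unit = {
    val a = Files.readAllBytes(Paths.get(args(0)))
    val b = Files.readAllBytes(Paths.get(args(1)))
    // Print each differing offset with the byte from each file, in hex
    differingOffsets(a, b).foreach { i =>
      println(f"0x$i%08x: ${a(i) & 0xff}%02x ${b(i) & 0xff}%02x")
    }
    if (a.length != b.length) println(s"lengths differ: ${a.length} vs ${b.length}")
  }
}
```

For two PDFs produced by the same code, the differing offsets should cluster around the dates and the ID.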

Compare PDF Binary data

import java.io.File
import java.nio.charset.StandardCharsets
import java.nio.file.Files

object Main {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      println("usage: scala comparePDFs.scala 1.pdf 2.pdf")
      sys.exit(1)
    }
    println(s"file 1: ${args(0)}")
    println(s"file 2: ${args(1)}")
    val file1 = fileToByteArray(args(0))
    val file2 = fileToByteArray(args(1))
    if (comparePDFs(file1, file2)) println("they match")
    else println("they don't match")
  }

  def fileToByteArray(filename: String): Array[Byte] =
    Files.readAllBytes(new File(filename).toPath())

  // Compare two PDFs, ignoring the fields that change on every generation.
  def comparePDFs(file1: Array[Byte], file2: Array[Byte]): Boolean = {
    // ISO-8859-1 maps every byte to exactly one char, so binary content
    // survives decoding (UTF-8 would collapse invalid bytes into U+FFFD).
    val file1str = new String(file1, StandardCharsets.ISO_8859_1)
    val file2str = new String(file2, StandardCharsets.ISO_8859_1)
    cleanStr(file1str) == cleanStr(file2str)
  }

  // Remove the first /ID, /CreationDate and /ModDate entries.
  def cleanStr(fileStr: String): String = {
    val idRegex = "/ID \\[.*\\]".r
    // Parentheses are escaped so they match the literal (D:...) date string
    // instead of swallowing the rest of the line.
    val creationDateRegex = "/CreationDate\\([^)]*\\)".r
    val modifiedDateRegex = "/ModDate\\([^)]*\\)".r
    val fileMinusID = idRegex.replaceFirstIn(fileStr, "")
    val fileMinusIDAndCreation = creationDateRegex.replaceFirstIn(fileMinusID, "")
    modifiedDateRegex.replaceFirstIn(fileMinusIDAndCreation, "")
  }
}
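
To see the idea in isolation, here is a minimal, self-contained sketch that applies the same kind of cleaning to two hand-written fragments. The fragments are invented for illustration and are not real iText output, and `CleanDemo`, `stripVolatile` and `sameIgnoringVolatile` are hypothetical names. Unlike the code above, this variant strips every occurrence rather than only the first, which is an assumption about what a test would want.

```scala
import java.nio.charset.StandardCharsets.ISO_8859_1

object CleanDemo {
  // Hypothetical helper: drop every /ID, /CreationDate and /ModDate entry.
  def stripVolatile(pdf: String): String =
    pdf
      .replaceAll("/ID \\[[^\\]]*\\]", "")
      .replaceAll("/CreationDate\\([^)]*\\)", "")
      .replaceAll("/ModDate\\([^)]*\\)", "")

  // Decode byte-for-byte, strip the volatile entries, then compare.
  def sameIgnoringVolatile(a: Array[Byte], b: Array[Byte]): Boolean =
    stripVolatile(new String(a, ISO_8859_1)) == stripVolatile(new String(b, ISO_8859_1))

  // Made-up fragments that differ only in dates and ID
  val first: String =
    "<</Producer(demo)/CreationDate(D:20130101120000Z)/ModDate(D:20130101120000Z)>>\n" +
      "trailer\n<</ID [<aa11><bb22>]/Size 7>>"
  val second: String =
    "<</Producer(demo)/CreationDate(D:20130101120500Z)/ModDate(D:20130101120500Z)>>\n" +
      "trailer\n<</ID [<cc33><dd44>]/Size 7>>"
  // Same as second, but with a real content change
  val third: String = second.replace("demo", "other")

  def main(args: Array[String]): Unit = {
    println(sameIgnoringVolatile(first.getBytes(ISO_8859_1), second.getBytes(ISO_8859_1))) // true
    println(sameIgnoringVolatile(first.getBytes(ISO_8859_1), third.getBytes(ISO_8859_1)))  // false
  }
}
```

In a unit test, the same check could compare freshly generated bytes against a stored reference PDF, so that only differences outside those three entries fail the test.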

Not the best code, but hopefully useful.