Comparing PDF Binary Data

If you generate the same PDF twice with iText5, you get different results: the creation date, the modification date and the ID all change. This makes it difficult to write a repeatable test. It’s also not ideal to mock out the whole PDF generation from a library whose sole purpose is to manipulate PDFs, as that gives no confidence that the code works.

I decided to read the PDF Reference, which documents the on-disk PDF format, to figure out how to write a binary comparison function that ignores these differences.

Comparing hex dumps of the two files shows the difference:

diff -y <(xxd 1.pdf) <(xxd 2.pdf) | colordiff
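The same kind of check can be done from Scala when no shell is at hand, for example to print a hint when a test fails. This is only a minimal sketch and not part of the original script: `diffOffsets` is a made-up helper name, the sample bytes are invented, and it reports positions only rather than a side-by-side hex dump.

```scala
object DiffOffsets {
  // Hypothetical helper: offsets at which the two arrays differ.
  // Offsets past the end of the shorter array count as differences.
  def diffOffsets(a: Array[Byte], b: Array[Byte]): Seq[Int] = {
    val common = math.min(a.length, b.length)
    val inCommon = (0 until common).filter(i => a(i) != b(i))
    val tail = common until math.max(a.length, b.length)
    inCommon ++ tail
  }

  def main(args: Array[String]): Unit = {
    // Invented sample data standing in for two generated files.
    val first = "/ModDate(D:2015)".getBytes("ISO-8859-1")
    val second = "/ModDate(D:2016)".getBytes("ISO-8859-1")
    // Print each differing offset in hex, like the left column of xxd.
    diffOffsets(first, second).foreach(i => println(f"0x$i%08x"))
  }
}
```

In a real test the two arrays would come from `Files.readAllBytes`, and the offsets can then be looked up in the `xxd` output.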

Compare PDF Binary data

import java.io._
import java.nio.file.Files
 
object Main {
 
  def main(args: Array[String]): Unit = {

    if (args.length < 2) {
      println("usage: scala comparePDFs.scala 1.pdf 2.pdf")
      sys.exit(2)
    }

    println(s"file 1: ${args(0)}")
    println(s"file 2: ${args(1)}")

    val file1 = fileToByteArray(args(0))
    val file2 = fileToByteArray(args(1))

    if (comparePDFs(file1, file2)) println("they match")
    else println("they don't match")
  }
 
  def fileToByteArray(filename :String) :Array[Byte] = Files.readAllBytes(new File(filename).toPath())
 
  def comparePDFs(file1: Array[Byte], file2: Array[Byte]): Boolean = {
    // Decode as ISO-8859-1 so every byte maps to exactly one char.
    // UTF-8 decoding replaces invalid byte sequences in binary streams
    // with U+FFFD, which could hide real differences.
    val file1str = new String(file1, "ISO-8859-1")
    val file2str = new String(file2, "ISO-8859-1")

    cleanStr(file1str) == cleanStr(file2str)
  }
 
  def cleanStr(fileStr: String): String = {
    // Non-greedy, so only the ID array itself is removed.
    val idRegex = "/ID\\s*\\[.*?\\]".r
    // Escaped parentheses match the literal PDF string delimiters, so only
    // the date value is removed rather than the rest of the line.
    val creationDateRegex = "/CreationDate\\s*\\(.*?\\)".r
    val modifiedDateRegex = "/ModDate\\s*\\(.*?\\)".r

    val fileMinusID = idRegex.replaceFirstIn(fileStr, "")
    val fileMinusIDAndCreation = creationDateRegex.replaceFirstIn(fileMinusID, "")
    modifiedDateRegex.replaceFirstIn(fileMinusIDAndCreation, "")
  }
}
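One detail worth checking when treating PDF bytes as text is the charset. Decoding arbitrary binary data as UTF-8 is lossy, because malformed sequences are replaced with U+FFFD, so two different byte arrays can decode to the same string. ISO-8859-1 maps each byte to exactly one character. Here is a small sketch of the effect; the byte values are invented and `CharsetCheck` is just an illustrative name.

```scala
import java.nio.charset.{Charset, StandardCharsets}

object CharsetCheck {
  // Two different "binary" payloads; 0x80 and 0x81 are not valid UTF-8 on their own.
  val a: Array[Byte] = Array(0x80.toByte)
  val b: Array[Byte] = Array(0x81.toByte)

  // True if both payloads decode to the same string under the given charset.
  def sameAs(charset: Charset): Boolean =
    new String(a, charset) == new String(b, charset)

  def main(args: Array[String]): Unit = {
    println(s"equal as UTF-8:      ${sameAs(StandardCharsets.UTF_8)}")
    println(s"equal as ISO-8859-1: ${sameAs(StandardCharsets.ISO_8859_1)}")
  }
}
```

For a comparison that should notice differences inside compressed streams as well, a one-byte-per-character charset such as ISO-8859-1 is the safer choice.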

Not the best code, but hopefully useful.