If you generate the same PDF twice with iText5, you get different results: the creation date, modification date and ID differ. This makes it difficult to write a repeatable test. It’s also not ideal to mock out the whole PDF generation of a library whose sole purpose is to manipulate PDFs, as that gives no confidence that the code works.
I decided to read the PDF Reference, which documents the PDF format on disk, to figure out how to write a binary comparison function that ignores these differences.
Comparing hex dumps of the two files shows where they differ:
diff -y <(xxd 1.pdf) <(xxd 2.pdf) | colordiff
Compare PDF Binary data
import java.io.File
import java.nio.charset.StandardCharsets
import java.nio.file.Files

object Main {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      println("usage: scala comparePDFs.scala 1.pdf 2.pdf")
      sys.exit(2)
    }
    println(s"file 1: ${args(0)}")
    println(s"file 2: ${args(1)}")
    val file1 = fileToByteArray(args(0))
    val file2 = fileToByteArray(args(1))
    if (comparePDFs(file1, file2)) println("they match")
    else println("they don't match")
  }

  def fileToByteArray(filename: String): Array[Byte] =
    Files.readAllBytes(new File(filename).toPath)

  // Compare two PDFs, ignoring the fields that change on every run.
  def comparePDFs(file1: Array[Byte], file2: Array[Byte]): Boolean = {
    // ISO-8859-1 maps each byte to exactly one character, so binary stream
    // data survives decoding. UTF-8 would replace invalid byte sequences
    // with U+FFFD and could hide real differences.
    val file1str = new String(file1, StandardCharsets.ISO_8859_1)
    val file2str = new String(file2, StandardCharsets.ISO_8859_1)
    cleanStr(file1str) == cleanStr(file2str)
  }

  // Remove the first /ID array, /CreationDate and /ModDate entries.
  def cleanStr(fileStr: String): String = {
    val idRegex = "/ID \\[.*\\]".r
    // The parentheses are escaped so they match the literal PDF string
    // delimiters instead of forming a capture group.
    val creationDateRegex = "/CreationDate\\([^)]*\\)".r
    val modifiedDateRegex = "/ModDate\\([^)]*\\)".r
    val fileMinusID = idRegex.replaceFirstIn(fileStr, "")
    val fileMinusIDAndCreation = creationDateRegex.replaceFirstIn(fileMinusID, "")
    modifiedDateRegex.replaceFirstIn(fileMinusIDAndCreation, "")
  }
}
Not the best code, but hopefully useful.
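To see the idea without needing iText5 on the classpath, here is a minimal sketch that runs the same kind of cleaning over made-up PDF fragments. The names `CleanDemo`, `fragment`, `run1`, `run2` and `changed` are hypothetical, and the fragments are invented for illustration, not real iText5 output. It decodes with ISO-8859-1 on the assumption that every byte should map to exactly one character, so that binary stream data cannot be altered by decoding.

```scala
import java.nio.charset.StandardCharsets

object CleanDemo {
  // Hypothetical helper: builds a tiny, made-up fragment that mimics the
  // entries of interest (Info dictionary dates and the trailer /ID).
  def fragment(producer: String, date: String, id: String): Array[Byte] =
    s"<</Producer($producer)/CreationDate(D:$date)/ModDate(D:$date)>>\ntrailer\n<</Size 1/ID [<$id><$id>]>>"
      .getBytes(StandardCharsets.ISO_8859_1)

  // Two "runs" that differ only in dates and ID, plus one with a real change.
  val run1: Array[Byte] = fragment("demo", "20140101120000Z", "0a1b")
  val run2: Array[Byte] = fragment("demo", "20140102130000Z", "2c3d")
  val changed: Array[Byte] = fragment("other", "20140101120000Z", "0a1b")

  // Strip the first occurrence of each field that varies between runs.
  def clean(bytes: Array[Byte]): String = {
    // ISO-8859-1 is a one-to-one byte-to-char mapping, so nothing is lost.
    val text = new String(bytes, StandardCharsets.ISO_8859_1)
    val volatileFields = Seq(
      """/ID \[.*?\]""".r,
      """/CreationDate\([^)]*\)""".r,
      """/ModDate\([^)]*\)""".r
    )
    volatileFields.foldLeft(text)((acc, regex) => regex.replaceFirstIn(acc, ""))
  }

  def main(args: Array[String]): Unit = {
    println(s"run1 vs run2 equal after cleaning: ${clean(run1) == clean(run2)}")
    println(s"run1 vs changed equal after cleaning: ${clean(run1) == clean(changed)}")
  }
}
```

In a real test you would replace the fragments with the byte arrays produced by two runs of your own PDF generation code and assert that the cleaned strings are equal.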