Removing "/Subtype /Watermark" images from a PDF using Linux

Problem: I’ve received a PDF which has a large “watermark” obscuring every page.

Investigating: Opening the PDF in LibreOffice Draw allowed me to see that the watermark was a separate image floating above the others.

Manual Solution: Hit page down, select image, delete, repeat 500 times. BORING!

Further Investigating: Using pdftk (https://linux.die.net/man/1/pdftk), it’s possible to decompress a PDF. That makes it easier to look through manually.

pdftk input.pdf output output.pdf uncompress

Hey presto! A PDF you can open in a text editor! Deep joy!

Searching: On a hunch, I searched for “watermark” and found several lines like this:

<< /Length 548

stream
/Figure <</MCID 0 >>BDC q 0 0 477 733.464 re W n q /GS0 gs 479.2799893 0 0 735.5999836 -1.0800002 -1.0559941 cm /Im0 Do Q EMC
/Figure <</MCID 1 >>BDC Q q 28.333 300.661 420.334 126.141 re W n q /GS0 gs 420.3339603 0 0 126.1418879 28.3330078 300.6610601 cm /Im1 Do Q EMC
/Figure <</MCID 2 >>BDC Q q 16.106 0 444.787 215.464 re W n q /GS0 gs 444.7874274 0 0 216.5921386 16.1062775 -1.1281493 cm /Im2 Do Q EMC
/Artifact <</Subtype /Watermark /Type /Pagination >>BDC Q q 0.7361145 0 0 0.7361145 113.3616638 240.8575745 cm /GS1 gs /Fm0 Do Q EMC
endstream
endobj
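The same search can be scripted rather than done by eye in a text editor. This is a minimal sketch, assuming the PDF has already been uncompressed so the content streams are readable bytes; the function name `count_watermark_tags` is made up for illustration.

```python
import re

# Matches the property that tags a marked-content block as a watermark,
# e.g. "<</Subtype /Watermark /Type /Pagination >>".
WATERMARK_TAG = re.compile(rb"/Subtype\s*/Watermark\b")


def count_watermark_tags(pdf_bytes: bytes) -> int:
    """Count /Subtype /Watermark tags in raw (uncompressed) PDF bytes."""
    return len(WATERMARK_TAG.findall(pdf_bytes))


# Tiny in-memory sample in the same shape as the stream shown above.
sample = b"/Artifact <</Subtype /Watermark /Type /Pagination >>BDC q /Fm0 Do Q EMC"
print(count_watermark_tags(sample))
```

For a real file, something like `count_watermark_tags(open("output.pdf", "rb").read())` would give a rough count to compare against the number of pages.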

Those are Marked Content Blocks (https://opensource.adobe.com/dc-acrobat-sdk-docs/library/pdfmark/pdfmark_Logical.html). In theory you can just chop out the line with /Subtype /Watermark but each stream has a /Length entry - so you’d also need to adjust that to account for what you’ve changed - otherwise the layout goes all screwy.
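To make the /Length point concrete, here is a rough sketch of what a hand edit would involve: cut the watermark block out of the stream body, then work out the new byte count that /Length would have to hold. The helper name `strip_and_measure` and the sample bytes are illustrative assumptions, not taken from a real file.

```python
import re

# An /Artifact block tagged /Subtype /Watermark, from its BDC operator
# to the next EMC, plus any trailing whitespace.
WATERMARK_BLOCK = re.compile(
    rb"/Artifact\s*<<[^>]*?/Subtype\s*/Watermark[^>]*?>>\s*BDC.*?EMC\s*",
    re.DOTALL,
)


def strip_and_measure(stream_body: bytes) -> tuple[bytes, int]:
    """Remove watermark blocks and return (new_body, new_length).

    new_length is the value the stream dictionary's /Length entry would
    need for the edited body.
    """
    new_body = WATERMARK_BLOCK.sub(b"", stream_body)
    return new_body, len(new_body)


# Hypothetical stream body: one image block and one watermark block.
body = (
    b"/Figure <</MCID 0 >>BDC q /Im0 Do Q EMC\n"
    b"/Artifact <</Subtype /Watermark /Type /Pagination >>BDC q /Fm0 Do Q EMC\n"
)
new_body, new_length = strip_and_measure(body)
print(f"/Length {len(body)} would become /Length {new_length}")
```

Doing this by hand for every page is tedious and easy to get wrong, which is a good reason to let a library rewrite the streams instead.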

That led me to PyMuPDF which claimed to solve the problem (https://github.com/pymupdf/PyMuPDF/discussions/1855). But running that code only removed some of the watermarks. It got stuck in an infinite loop on certain pages.

So, now that I had more detailed knowledge, I managed to get an LLM to construct something which mostly seems to work.

Does it work with every PDF? I don’t know. Does it contain subtle implementation bugs? Probably. Is there an easier way to do this? Not that I can find.

import re
import pymupdf  # PyMuPDF; older releases are imported as "fitz"

# Open the PDF
doc = pymupdf.open("output.pdf")

# Regex of the watermarks: an /Artifact block tagged /Subtype /Watermark,
# from its BDC operator up to the first EMC that follows
pattern = re.compile(
    rb"/Artifact\s*<<[^>]*?/Subtype\s*/Watermark[^>]*?>>BDC.*?EMC",
    re.DOTALL,
)

# Loop through the PDF's pages
for page_num, page in enumerate(doc, start=1):
    print(f"Processing page {page_num}")
    # Each page can have several content streams, identified by xref number
    xrefs = page.get_contents()
    for xref in xrefs:
        cont = doc.xref_stream(xref)
        new_cont, n = pattern.subn(b"", cont)
        if n > 0:
            print(f"  Removed {n} watermark block(s)")
            doc.update_stream(xref, new_cont)

doc.save("no-watermarks.pdf")
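One likely candidate for a subtle bug: the regex stops at the first EMC it meets, so a watermark block that contains another marked-content block inside it would only be partly removed. The following is a sketch of a depth-counting alternative, not a tested replacement. It assumes the BDC, BMC and EMC operators are delimited by whitespace (or by the `>>` that closes a property list) and never appear inside string literals or inline image data. The function name `remove_watermark_blocks` is hypothetical.

```python
import re

# Start of a watermark block, up to and including its BDC operator.
WATERMARK_START = re.compile(
    rb"/Artifact\s*<<[^>]*?/Subtype\s*/Watermark[^>]*?>>\s*BDC"
)

# Marked-content operators as standalone tokens (see assumptions above).
TOKEN = re.compile(rb"(?<![^\s>])(BDC|BMC|EMC)(?!\S)")


def remove_watermark_blocks(cont: bytes) -> tuple[bytes, int]:
    """Remove watermark blocks, honouring nested BDC/BMC ... EMC pairs."""
    kept = []
    removed = 0
    pos = 0
    while True:
        start = WATERMARK_START.search(cont, pos)
        if start is None:
            break
        # Walk forward from the opening BDC, tracking nesting depth.
        depth = 1
        end = None
        for tok in TOKEN.finditer(cont, start.end()):
            if tok.group(1) == b"EMC":
                depth -= 1
                if depth == 0:
                    end = tok.end()
                    break
            else:
                depth += 1
        if end is None:
            break  # unbalanced block: leave the remainder untouched
        kept.append(cont[pos:start.start()])
        pos = end
        removed += 1
    kept.append(cont[pos:])
    return b"".join(kept), removed


# Hypothetical stream with a block nested inside the watermark.
nested = (
    b"/Figure <</MCID 0 >>BDC q /Im0 Do Q EMC "
    b"/Artifact <</Subtype /Watermark /Type /Pagination >>BDC "
    b"q /Span <</MCID 1 >>BDC /Fm0 Do EMC Q EMC "
    b"/Figure <</MCID 2 >>BDC q /Im1 Do Q EMC"
)
cleaned, count = remove_watermark_blocks(nested)
print(count, cleaned)
```

In the script above, this function could stand in for `pattern.subn(b"", cont)`, since it returns the same `(new_bytes, count)` pair.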

One of the (many) problems with Vibe Coding is that trying to get an LLM to spit out something useful depends massively on how well you know the subject area. I’m proud to say I know vanishingly little about the baroque (https://shkspr.mobi/blog/2015/11/a-polite-way-to-say-ridiculously-complicated/) PDF specification - which meant that most of my attempts to use various “AI” tools consisted of me saying “No, that doesn’t work” and the accurs’d machine saying back “Golly-gee! You’re right! Let me fix that!” and then breaking something else.

I’m not sure this is the future we wanted, but it looks like the future we’ve got.
