Did you know that PDF content can be scraped (i.e. read and aggregated by machines)? Some people might think PDF content is somehow separate from the Web, but actually, it’s quite easy to gather the contents (especially the text) from any publicly hosted PDF file using freely available software. In fact, PDF contents have been indexed by search engines for well over twenty years.
If you want to expose your PDF data, you can do that, too! One example is providing shoppers with free or low-cost representative sample files to get them hooked on your premium download. Here’s another example: years ago, we helped a company that felt the bulk of their value was information “hidden” in PDF files. We strategically exposed that goldmine of information by letting the WP Search plugin scrape their website’s PDFs. Suddenly all their PDF content was indexed and searchable along with the rest of their WordPress blog posts and WooCommerce products. It seemed like magic! PDFs had become part of the web. 🕸️
But for other owners whose PDF files might end up scattered online, scraping and indexing could be a nightmare!
Below we’ll talk about one method, adding metadata, to ask well-behaved machines not to scrape your PDF data, no matter where your PDF ends up.
This method uses PDF Ink + SetaPDF-Stamper. This approach isn’t free, but as far as we are aware, unless you have an in-house developer with significant programming chops, there is no “free” solution to this issue. Sometimes we have to pay a little extra to protect our assets. Think of it this way: it costs less than car insurance. 😉
The Metadata Approach
Using PDF metadata to ask AI models not to scrape content is possible through standardized protocols. It will help, but it is not a silver bullet. Its effectiveness depends entirely on the cooperation of AI developers, so it works best when paired with stronger technical security measures. In other words, metadata is a “polite request” rather than a technical barrier, and probably shouldn’t be used alone.
Text and Data Mining Reservation Protocol: TDMRep
The primary method for communicating scraping preferences via metadata is the Text and Data Mining Reservation Protocol (TDMRep), developed by a W3C (World Wide Web Consortium) Community Group.
The TDMRep protocol allows content creators to embed machine-readable instructions within a document’s metadata (specifically XMP) that inform AI crawlers whether mining is permitted.
How to Implement using SetaPDF-Stamper + PDF Ink
Usually you might use PDF editing software (like Adobe Acrobat Pro or PDF-XChange Editor) to open the document properties, navigate to the advanced metadata section, and add the tdm-reservation and tdm-policy keys by hand. But what if you don’t want to edit every single PDF of yours and want this done automagically each time your customers download your files?
Here’s how to do that using PDF Ink plus the SetaPDF-Stamper add-on:
- Create and host your own policy JSON file.
- Install this snippet on your WP site by adding it to your child theme’s functions.php file, to a custom plugin, or via the Code Snippets plugin:
function custom_setapdf_core_document_info_tdm( $info, $document, $settings ) {
    // Define your namespace URI (must be a unique URL)
    $ns = 'http://www.example.com/tdm#';

    // Define the alias (the prefix you want to see in the XML, e.g., 'tdm')
    $alias = 'tdm';

    // Get the XMP object
    $xmp = $info->getXmp();

    // Register the alias
    // This tells the XMP object: "When you see 'tdm:', use the URI '$ns'"
    $xmp->xmlAliases[$ns] = $alias;

    // Update the XMP: 1 means rights are reserved
    $info->updateXmp( $ns, 'reservation', 1 );

    // Replace the URL below with the URL of your policy file
    $info->updateXmp( $ns, 'policy', 'https://www.example.com/policies/tdm-policy.json' );

    // Save changes
    $info->syncMetadata();

    // Hand the object back in case the hook is applied as a filter
    return $info;
}
add_filter( 'pdfink_filter_setapdf_core_document_info', 'custom_setapdf_core_document_info_tdm', 10, 3 );
Here’s what’s happening in the code above. Let’s suppose the user has a WordPress (or other PHP-based) site with PDF Ink and SetaPDF-Stamper installed. They’ve hooked a function to the PDF Ink hook, pdfink_filter_setapdf_core_document_info.
That function fetches the XMP, then updates the XMP, with the following:
- tdm-reservation (boolean; written to the XMP as tdm:reservation): A flag that indicates whether rights are reserved. The 1 means true: yes, rights are reserved.
- tdm-policy (URL; written to the XMP as tdm:policy): A link to a policy file that provides contact information for the rights holder and conditions for obtaining a license for TDM access (e.g., payment or explicit permission).
Finally, the altered XMP is stored.
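Want to double-check that the metadata actually landed in a downloaded file? Here is a minimal sketch, not part of PDF Ink or SetaPDF, that looks for the XMP packet in the raw PDF bytes and reads the two properties back. The function name example_read_tdm_from_pdf_bytes is hypothetical, the default namespace is simply the placeholder from the snippet above, and the sketch assumes the XMP packet is stored uncompressed and unencrypted and that PHP’s DOM extension is available.

```php
<?php
/**
 * Hypothetical helper: read the TDM properties back out of a PDF's XMP packet.
 * Assumes the packet is stored as plain (uncompressed, unencrypted) XML.
 *
 * @return array|null ['reservation' => ?string, 'policy' => ?string], or null if no XMP is found.
 */
function example_read_tdm_from_pdf_bytes( string $pdf_bytes, string $ns = 'http://www.example.com/tdm#' ): ?array {
    // Locate the first XMP packet in the raw bytes
    $start = strpos( $pdf_bytes, '<x:xmpmeta' );
    if ( $start === false ) {
        return null;
    }
    $close = '</x:xmpmeta>';
    $end   = strpos( $pdf_bytes, $close, $start );
    if ( $end === false ) {
        return null;
    }
    $packet = substr( $pdf_bytes, $start, $end - $start + strlen( $close ) );

    // Parse the packet as XML; give up quietly if it is malformed
    $dom = new DOMDocument();
    if ( ! @$dom->loadXML( $packet ) ) {
        return null;
    }

    // XMP properties may be serialized as child elements or as attributes, so check both
    $xpath = new DOMXPath( $dom );
    $xpath->registerNamespace( 't', $ns );

    $fields = array();
    foreach ( array( 'reservation', 'policy' ) as $name ) {
        $nodes = $xpath->query( "//t:{$name} | //@t:{$name}" );
        $fields[ $name ] = ( $nodes !== false && $nodes->length > 0 )
            ? trim( $nodes->item( 0 )->textContent )
            : null;
    }
    return $fields;
}
```

You could call it with file_get_contents() on a freshly downloaded copy of your PDF and confirm that reservation and policy come back with the values you set.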
Limitations of Metadata
It is crucial to understand that this method is a “social contract” for AI bots, not a technical firewall.
- Relies on Compliance: Legitimate AI companies like OpenAI and Google say they respect opt-out signals such as a website’s robots.txt file, and TDMRep metadata works on the same honor system. Non-compliant or malicious scrapers, or humans, can simply ignore the metadata and extract the data anyway.
- Not a Technical Barrier: The data in the PDF is still unencrypted and accessible to any program that can read the PDF structure. Metadata does not encrypt or hide content.
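To make the “social contract” point concrete, here is a rough sketch of the kind of check a cooperative crawler might run once it has read the two values. Everything in it is illustrative: the function name and the three return labels are hypothetical, and the sketch treats a missing value or 0 as “no reservation expressed”. Nothing forces a scraper to run a check like this, which is exactly the limitation.

```php
<?php
/**
 * Hypothetical decision helper for a well-behaved crawler.
 * Returns one of three illustrative labels (not defined by any standard).
 */
function example_tdm_decision( ?string $reservation, ?string $policy_url ): string {
    // Anything other than "1" is treated here as "rights not reserved"
    if ( $reservation !== '1' ) {
        return 'not-reserved';
    }

    // Rights are reserved; a policy URL tells the crawler where to look for licensing terms
    if ( $policy_url !== null && $policy_url !== '' ) {
        return 'reserved-see-policy';
    }

    return 'reserved';
}
```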
Effective Technical Defenses
Remember: the only 100% certain way to fully protect your PDF is to not share it at all. But if you must share files, it’s wise to use a tool like PDF Ink to add a protection layer.
If your primary goal is to technically prevent AI models and scrapers from accessing the content of your PDFs, you need a stronger, layered approach:
- Encryption and Password Protection: Encrypting a PDF with a password is the most effective defense. The contents cannot be read or indexed by any application without the correct decryption key. This robust technical barrier can be easily effected using PDF Ink security settings.
- Restrict Copying/Extraction Permissions: With PDF Ink, you can set permissions to disable the copying of text and images. While determined users can bypass this, it stops basic scraping tools.
- Place Content Behind a Paywall/Login: Putting your PDFs behind an authentication system is highly effective at stopping public AI web crawlers, as they typically do not have login credentials. If you’re reading this, you’ve probably already done this!
- Convert Text to Images: If the content is purely an image within the PDF, it requires Optical Character Recognition (OCR) to extract text, which introduces errors and complexity for scrapers, though AI is becoming relatively good at this.
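If you turn on encryption, it is worth confirming that the files your customers receive really are encrypted. The sketch below is a rough heuristic rather than a real PDF parser: an encrypted PDF points to its encryption dictionary with an /Encrypt key, so the hypothetical function example_pdf_looks_encrypted simply searches the raw bytes for that key. It can be fooled (for instance, if the same characters happen to appear in uncompressed page content), so treat a result as a hint, not proof.

```php
<?php
/**
 * Hypothetical heuristic: does this PDF appear to declare an encryption dictionary?
 * Looks for "/Encrypt" followed by either an indirect reference ("9 0 R")
 * or the start of an inline dictionary ("<<").
 */
function example_pdf_looks_encrypted( string $pdf_bytes ): bool {
    return preg_match( '/\/Encrypt(?:\s+\d+\s+\d+\s+R|\s*<<)/', $pdf_bytes ) === 1;
}
```

Running it against a test download, via file_get_contents(), before and after changing your security settings is a quick way to see whether the setting took effect.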
In summary, metadata is a useful legal and ethical signal for compliant AI agents, but it will not stop a determined scraper. Technical security measures like encryption offer much better protection, but sad to say, the only way to truly protect your data is probably not to upload it at all.