Block AI from Scraping PDF Content

Using PDF metadata to ask AI models not to scrape your content is possible through standardized protocols, but its effectiveness depends entirely on the cooperation of AI developers. Actually stopping a scraper requires stronger technical security measures.

Metadata provides a “polite request” rather than a technical barrier. 

The Metadata Approach: TDMRep

The primary method for communicating scraping preferences via metadata is the Text and Data Mining Reservation Protocol (TDMRep), developed by a W3C (World Wide Web Consortium) Community Group.

TDMRep allows content creators to embed machine-readable instructions within a document’s metadata (specifically XMP) that inform AI crawlers whether mining is permitted. It defines two properties:

  • tdm-reservation (0 or 1): A flag that indicates whether TDM rights are reserved. A value of 1 means they are reserved.
  • tdm-policy (URL): A link to a policy file that provides contact information for the rights holder and conditions for obtaining a license for TDM access (e.g., payment or explicit permission).
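
For illustration, a policy file might look roughly like the one below. This is a sketch loosely modeled on the ODRL-based JSON format described in the TDMRep report. The example.com URLs, the contact details, and the choice of an obtainConsent duty are all placeholders and assumptions, so check the field names against the current specification before publishing your own file.

```json
{
  "@context": [
    "http://www.w3.org/ns/odrl.jsonld",
    { "tdm": "http://www.w3.org/ns/tdmrep#" }
  ],
  "@type": "Offer",
  "profile": "http://www.w3.org/ns/tdmrep",
  "uid": "https://www.example.com/policies/tdm-policy.json",
  "assigner": {
    "uid": "https://www.example.com",
    "vcard:fn": "Example Publisher",
    "vcard:hasEmail": "mailto:licensing@example.com"
  },
  "permission": [
    {
      "target": "https://www.example.com/downloads",
      "action": "tdm:mine",
      "duty": [{ "action": "obtainConsent" }]
    }
  ]
}
```

Here the assigner block carries the contact information, and the duty says that mining is allowed only after consent has been obtained from the rights holder.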

How to Implement using SetaPDF-Stamper + PDF Ink

Normally you would open each file in PDF editing software (like Adobe Acrobat Pro or PDF-XChange Editor), navigate to the advanced metadata section of the document properties, and add the tdm-reservation and tdm-policy keys by hand. But what if you don’t want to edit every single one of your PDFs and want this done automagically each time your customers download your files?

Here’s how to do that using PDF Ink plus the SetaPDF-Stamper add-on:

  1. Create and host your own policy JSON file.
  2. Install this snippet on your WP site by adding it to your child theme’s functions.php file, to a custom plugin, or via the Code Snippets plugin:
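function custom_setapdf_core_document_info_tdm( $info, $document, $settings ) {

  // TDMRep XMP namespace and its conventional prefix
  $ns    = 'http://www.w3.org/ns/tdmrep/';
  $alias = 'tdm';

  // Register the prefix so the properties are written as tdm:reservation and tdm:policy
  // (assumes the info object exposes SetaPDF's public $xmlAliases map)
  $info->xmlAliases[$ns] = $alias;

  // 1 = TDM rights are reserved
  $info->updateXmp($ns, 'reservation', 1);
  // Replace the URL below with the URL of your policy file
  $info->updateXmp($ns, 'policy', 'https://www.example.com/policies/tdm-policy.json');
  $info->syncMetadata();

  // A filter callback must return the value it was given
  return $info;

}
add_filter( 'pdfink_filter_setapdf_core_document_info', 'custom_setapdf_core_document_info_tdm', 10, 3 );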
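
To confirm that a downloaded file actually carries the two properties, you could inspect its XMP packet with a small standalone script. The sketch below is only one way to do that. The function name tdm_read_xmp_properties is hypothetical, it uses only PHP’s bundled DOM extension, and it assumes the XMP packet is stored uncompressed, unencrypted, and in UTF-8. An encrypted PDF may hide its metadata from a check like this.

```php
<?php
/**
 * Hypothetical helper: read the TDMRep properties from a PDF's XMP packet.
 * Assumption: the packet is uncompressed, unencrypted, UTF-8 XML.
 * Only the first XMP packet found in the file is examined.
 */
function tdm_read_xmp_properties( string $pdf_bytes ): array {
    $result = array( 'reservation' => null, 'policy' => null );

    // Locate the first XMP packet by its wrapper element.
    $close = '</x:xmpmeta>';
    $start = strpos( $pdf_bytes, '<x:xmpmeta' );
    if ( false === $start ) {
        return $result;
    }
    $end = strpos( $pdf_bytes, $close, $start );
    if ( false === $end ) {
        return $result;
    }
    $packet = substr( $pdf_bytes, $start, $end - $start + strlen( $close ) );

    // Parse the packet as XML; give up quietly if it is malformed.
    $dom = new DOMDocument();
    if ( ! @$dom->loadXML( $packet ) ) {
        return $result;
    }

    $tdm_ns = 'http://www.w3.org/ns/tdmrep/';
    $rdf_ns = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#';

    foreach ( array_keys( $result ) as $name ) {
        // XMP may store a simple property as a child element...
        $nodes = $dom->getElementsByTagNameNS( $tdm_ns, $name );
        if ( $nodes->length > 0 ) {
            $result[ $name ] = trim( $nodes->item( 0 )->textContent );
            continue;
        }
        // ...or as an attribute on rdf:Description.
        foreach ( $dom->getElementsByTagNameNS( $rdf_ns, 'Description' ) as $desc ) {
            if ( $desc->hasAttributeNS( $tdm_ns, $name ) ) {
                $result[ $name ] = $desc->getAttributeNS( $tdm_ns, $name );
                break;
            }
        }
    }

    return $result;
}
```

Calling tdm_read_xmp_properties( file_get_contents( $path ) ) on a file processed by the snippet above should return the reservation value and your policy URL. If either entry comes back as null, the property was not found in the packet.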

Limitations of Metadata

It is crucial to understand that this method is a “social contract” for AI bots, not a technical firewall. 

  • Relies on Compliance: Reputable AI companies such as OpenAI and Google state that their crawlers honor opt-out signals like a website’s robots.txt file, though support for TDMRep metadata specifically varies by crawler. Non-compliant or malicious scrapers will simply ignore the metadata and extract the data anyway.
  • Not a Technical Barrier: The content of the PDF is still unencrypted and readable by any program that can parse the PDF structure. Metadata does not encrypt or hide content.

Effective Technical Defenses

If your primary goal is to technically prevent AI models and scrapers from accessing the content of your PDFs, you need a stronger, layered approach: 

  • Encryption and Password Protection: Encrypting a PDF so that a password is required to open it is the most effective defense. Without that password, no application can read or index the contents. You can apply this barrier easily using the PDF Ink security settings.
  • Restrict Copying/Extraction Permissions: With PDF Ink, you can set permissions to disable the copying of text and images. While determined users can bypass this, it stops basic scraping tools.
  • Place Content Behind a Paywall/Login: Putting your PDFs behind an authentication system is highly effective at stopping public AI web crawlers, as they typically do not have login credentials. If you’re reading this, you’ve probably already done this!
  • Convert to Images: If the content is purely an image within the PDF, it requires Optical Character Recognition (OCR) to extract text, which introduces errors and complexity for scrapers, though modern AI is good at this. 

In summary, metadata is a useful legal and ethical signal for compliant AI agents, but it will not stop a determined scraper. Only technical security measures like encryption offer true protection.