LOW / CODEBlueprints
operationsai

AI Document Data Extractor

Extract structured data from PDFs, invoices, or contracts using Claude AI. The workflow monitors a Gmail inbox for attachments, sends document content to Claude for extraction, and populates a Google Sheet with the structured results.

Intermediate~25 minn8nMake.com 561 downloads 1906 views

Setup Instructions

1. Set up a Gmail filter or label (e.g., "Documents to Process") for emails containing attachments you want to extract data from. 2. In your automation platform, create a workflow triggered by "New Email" in Gmail matching your label or filter. 3. Add a Gmail node to download the attachment. For PDFs, add a text extraction step (many platforms have built-in PDF-to-text, or use a dedicated service). 4. Add an HTTP Request node to call the Claude API (POST https://api.anthropic.com/v1/messages). Set headers: x-api-key to your Anthropic API key, anthropic-version to 2023-06-01, content-type to application/json. 5. Set the model to "claude-sonnet-4-20250514" and max_tokens to 1024. Write a prompt that describes the exact fields you want extracted (e.g., vendor_name, invoice_number, date, line_items, total_amount, due_date). Ask Claude to return a JSON object with these fields. Include an example of the expected output format in the prompt. 6. Add a Google Sheets node to append a new row with the extracted data. Map each JSON field to the corresponding column in your sheet. 7. Add a Slack notification to confirm the extraction was successful, including the document name and key extracted values. 8. Test with a sample invoice email. Verify the extracted data appears correctly in Google Sheets. Adjust the prompt if fields are missing or incorrectly parsed.
Troubleshooting
**Claude misidentifies fields in documents:** Be very specific in your prompt about field names and formats. Include an example of expected output. For invoices, specify date format (YYYY-MM-DD), currency format, and how to handle multiple line items. **PDF text extraction produces garbled output:** Some PDFs are image-based (scanned documents). These require OCR before Claude can process them. Add a Google Cloud Vision or Tesseract OCR step before the Claude API call. **Google Sheets row has wrong column mapping:** Double-check that the JSON field names match your column headers exactly. Use a Function node to explicitly map fields to columns in the correct order before the Sheets append step. **Token limit exceeded on large documents:** Invoices are usually short, but contracts can be very long. For large documents, extract only the relevant pages or sections before sending to Claude. Summarize lengthy clauses instead of sending raw text.

Need a custom version?

We can build a tailored automation workflow for your specific needs.

New blueprints weekly

Get notified when we publish new automation workflows.