Deduplication: keep Attio clean forever
Every CRM becomes a graveyard of duplicates eventually. Someone imports a list. Someone's assistant types a contact with a typo. A webhook fires twice. Three months later you have four records for the same person and nobody wants to merge them because nobody trusts which one is "right".
We fix this two ways. One-shot: a scripted migration that lands 100% duplicate-free on day one. Continuous: a dedup layer that catches duplicates the moment they're created, scores them on completeness, and merges the best record automatically - with a dry-run preview and a human-in-the-loop gate for the ambiguous ones.
Direction
Attio-native + CSV ingest
Stack
Python, Attio API, fuzzy matching, optional FastAPI service
The what
What this integration actually does
The how
How we build it
1. Inventory every source of contact data: existing CRM, Mailchimp, event lists, founder inboxes, Google contacts. Duplicates compound across sources.
2. Score completeness: a record with phone + email + company + last activity beats a record with just an email. Merges keep the strongest fields.
3. Fuzzy-match on name + email + company with configurable thresholds. "John Smith @ Acme" and "J. Smith @ Acme Inc" are the same person.
4. Company-name validation to prevent merging two people named John Smith at different companies.
5. Always --dry-run first. Every merge shows a diff before it touches production data.
6. Deploy the ongoing dedup job on a schedule (nightly) or wire it to the create webhook for real-time catching.
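To make steps 3 and 4 concrete, here is a minimal sketch of a name matcher with a company guard. It is an illustration under assumptions, not our production matcher: the record shape (`name`, `company`), the suffix list, and the 0.65 threshold are all hypothetical, it uses a plain-Python Levenshtein distance only (no token-set ratio, no email signal), and a threshold this loose would also pair "Jane Smith" with "John Smith" at the same company, which is exactly the kind of case that belongs in human review.

```python
import re

# Assumption: a small, illustrative set of legal suffixes to ignore.
COMPANY_SUFFIXES = {"inc", "llc", "ltd", "gmbh", "corp", "co"}


def normalize(text: str) -> str:
    """Lowercase, turn punctuation into spaces, collapse whitespace."""
    return " ".join(re.sub(r"\W+", " ", text.lower()).split())


def company_key(name: str) -> str:
    """Company name without legal suffixes, e.g. 'Acme Inc' -> 'acme'."""
    tokens = normalize(name).split()
    return " ".join(t for t in tokens if t not in COMPANY_SUFFIXES)


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def name_similarity(a: str, b: str) -> float:
    """Edit distance scaled to 0..1, where 1.0 means identical names."""
    a, b = normalize(a), normalize(b)
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def is_duplicate(a: dict, b: dict, name_threshold: float = 0.65) -> bool:
    """Company guard first, then fuzzy name match."""
    if company_key(a["company"]) != company_key(b["company"]):
        return False  # same name at a different company is a different person
    return name_similarity(a["name"], b["name"]) >= name_threshold
```

With these assumptions, `is_duplicate({"name": "John Smith", "company": "Acme"}, {"name": "J. Smith", "company": "Acme Inc"})` is true, while the same name at "Globex" is rejected by the company guard before the names are even compared.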
Under the hood
What lives inside the pipeline
- Fuzzy matching with Levenshtein + token-set ratio on names; exact match on email domain.
- Completeness score: missing fields penalized, recent activity boosts the record.
- Merge-safe field policy: phone numbers are unioned, notes are concatenated, and the company name is taken from the most complete record.
- CSV-driven batch mode for large one-off cleanups.
- Optional SaaS version: dedupe-csv-saas (our own product) for teams that want a self-serve tool.
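The completeness score and the merge-safe field policy above can be sketched roughly as follows. Everything specific here is an assumption made for illustration: the field names (`email`, `phones`, `company`, `last_activity`, `notes`), one point per populated field, a one-point bonus for activity in the last 90 days, and reading "most complete" as "the highest-scoring record that has a value".

```python
from datetime import date

# Assumption: the fields that count toward completeness.
SCORED_FIELDS = ("email", "phones", "company", "last_activity")


def completeness(record: dict, today: date, recent_days: int = 90) -> float:
    """One point per populated field, plus a bonus for recent activity."""
    score = sum(1.0 for field in SCORED_FIELDS if record.get(field))
    last = record.get("last_activity")
    if last and (today - last).days <= recent_days:
        score += 1.0
    return score


def merge(records: list[dict], today: date) -> dict:
    """Merge duplicates into one record, strongest record first."""
    ranked = sorted(records, key=lambda r: completeness(r, today), reverse=True)
    winner = dict(ranked[0])  # copy, so the inputs stay untouched

    # Scalar fields (including company): keep the winner's value and
    # fill gaps from the next most complete record that has one.
    for other in ranked[1:]:
        for key, value in other.items():
            if key not in ("phones", "notes") and not winner.get(key) and value:
                winner[key] = value

    # Phone numbers: union across all records, order preserved.
    phones = []
    for record in ranked:
        for phone in record.get("phones", []):
            if phone not in phones:
                phones.append(phone)
    winner["phones"] = phones

    # Notes: concatenate, strongest record first.
    winner["notes"] = "\n".join(r["notes"] for r in ranked if r.get("notes"))
    return winner
```

Because `merge` returns a new dict and never mutates its inputs, the same function can feed a dry-run diff (compare the result against each original) before anything is written back.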
Hard-earned lessons
What we learned the hard way
- Merging across companies is the #1 way to corrupt a CRM. The validation for "same person, different company" is not optional.
- Attio rate limits are strict on bulk ops - always pace merges with backoff.
- Never run a merge script on production without --dry-run first. Always.
- Scoring heuristics matter more than matching heuristics. Getting the "which one wins" logic right is where trust is built.
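The dry-run and backoff lessons combine naturally into one runner. The sketch below is hypothetical throughout: `merge_fn` stands in for whatever call actually performs the merge, `RateLimited` is a made-up exception for a rate-limit response, and the delays are illustrative rather than tuned to Attio's real limits. It deliberately makes no Attio API calls.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical error raised by merge_fn when the API says slow down."""


def run_merges(plan, merge_fn, dry_run=True, max_retries=5,
               base_delay=1.0, sleep=time.sleep):
    """Run (keep_id, drop_id) merges; dry-run is the default, by design."""
    log = []
    for keep_id, drop_id in plan:
        if dry_run:
            # Report only: nothing touches production data.
            log.append(f"DRY-RUN would merge {drop_id} into {keep_id}")
            continue
        for attempt in range(max_retries):
            try:
                merge_fn(keep_id, drop_id)
                log.append(f"merged {drop_id} into {keep_id}")
                break
            except RateLimited:
                # Exponential backoff with a little jitter: ~1s, 2s, 4s, ...
                sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
        else:
            # Every attempt was rate limited; record it and move on.
            log.append(f"FAILED {drop_id} into {keep_id} after {max_retries} retries")
    return log
```

Making `dry_run=True` the default means a forgotten flag produces a report instead of a merge; the caller has to opt in to writes explicitly.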
Case study
Anker Capital - VC
Problem
Anker had 1,200 investor records coming in from multiple broker lists with inconsistent naming, formatting, and heavy overlap across sources.
Solution
Python dedup scripts with fuzzy matching and completeness scoring. Consolidated 1,032 of 1,200 investors automatically; the remaining 168 went to a human review queue for ambiguous merges.
Outcome
Clean foundation for every downstream automation. No "wait which record is real?" threads. The ongoing job keeps it that way.
FAQ
Questions we get
Yes. We run in dry-run first, show you the proposed merges, and execute only what you sign off on.
Yes - dedupe-csv-saas. Free for up to 250 contacts, then €0.01 per contact. Useful for teams that want to dedupe before importing to any CRM.
Same pipeline, different matcher. Company dedup uses domain as the strongest signal; contact dedup uses email.
Attio keeps an audit trail, but practically a merge is a merge. That's why we dry-run everything first.
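On the company-versus-contact answer above (domain as the strongest signal for companies, email for contacts), a minimal sketch of the two match keys might look like this. The field names (`website`, `email`) and helper names are assumptions for illustration, and the sketch covers exact-key grouping only, not the fuzzy matching layered on top.

```python
from collections import defaultdict
from urllib.parse import urlparse


def domain_key(website: str) -> str:
    """Reduce a URL or bare domain to a comparable host, e.g. 'acme.com'."""
    website = website.strip()
    if "//" not in website:
        website = "//" + website  # lets urlparse treat it as a host
    host = urlparse(website).hostname or ""
    return host.removeprefix("www.")  # requires Python 3.9+


def email_key(email: str) -> str:
    """Contacts are keyed on the lowercased, trimmed email address."""
    return email.strip().lower()


def group_duplicates(records, key_fn):
    """Group records sharing a non-empty key; keep groups of two or more."""
    groups = defaultdict(list)
    for record in records:
        key = key_fn(record)
        if key:
            groups[key].append(record)
    return {key: group for key, group in groups.items() if len(group) > 1}
```

For example, `group_duplicates(companies, lambda r: domain_key(r["website"]))` would place "https://www.acme.com/about" and "acme.com" in the same group, and swapping in `lambda r: email_key(r["email"])` gives the contact variant of the same pipeline.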
Want this running on your Attio?
Book a free 30-min call. We'll map your use case to what we've already shipped and tell you whether this fits - honestly.