Data Readiness Checklist
Bad data does not stop an AI project — it lets it start and then kills it. Work through each dimension below before committing to any timeline or scope: a gap in any one area can invalidate your data foundation before you've built anything.
Can the data actually be extracted, moved, and used by the system you are building?

- Critical: Is there a documented path to extract this data from its source system?
  Confirm the extraction method works end to end, including all system boundaries and transformations.
- Critical: Have all approval and access permissions been mapped and confirmed?
  Approval chains, IT security reviews, and legal sign-offs can each add weeks to a timeline.
- Important: Is there a stable, automated extraction method — not a manual export?
  A manual spreadsheet export or one-time extract is not a reliable production data source.
- Critical: Can all required data sources be accessed from the target environment?
  Firewalls, on-premise systems, and cloud boundaries all require explicit connectivity planning.
- Important: Are join keys available and validated across all source systems?
  Cross-system joins without a reliable shared key often require significant data engineering work.
- Important: Has integration effort been estimated and included in the project timeline?
  Integration typically accounts for 40–60% of delivery effort on AI projects.
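The join-key check above does not have to wait for a data engineer: coverage between two extracts can be measured in a few lines before any integration work is scoped. A minimal sketch using pandas, with two hypothetical extracts sharing an assumed `customer_id` key (the tables and column names are illustrative, not from any real system):

```python
import pandas as pd

# Hypothetical extracts from two source systems that should share customer_id
crm = pd.DataFrame({"customer_id": ["C1", "C2", "C3", "C4"], "segment": ["a", "b", "a", "c"]})
billing = pd.DataFrame({"customer_id": ["C2", "C3", "C5"], "arr": [120, 340, 90]})

def join_key_report(left, right, key):
    """Report how well a candidate join key links two extracts."""
    left_keys = set(left[key].dropna())
    right_keys = set(right[key].dropna())
    return {
        "matched_keys": len(left_keys & right_keys),   # keys present in both systems
        "left_only": len(left_keys - right_keys),      # rows that would drop on an inner join
        "right_only": len(right_keys - left_keys),
        "left_dupes": int(left[key].duplicated().sum()),   # duplicates inflate joined row counts
        "right_dupes": int(right[key].duplicated().sum()),
    }

print(join_key_report(crm, billing, "customer_id"))
```

A high `left_only`/`right_only` count or any unexpected duplicates is exactly the "significant data engineering work" the checklist item warns about, surfaced before the timeline is committed.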
Is the data complete, accurate, consistent, and sufficient in volume for AI use?

- Critical: Has a formal data profile been run against all sources in scope?
  Null rates, format consistency, outliers, duplicates, and value distributions should all be documented.
- Important: Are known data quality issues documented with remediation plans — not just flagged?
  Parked issues become project risks. Each known problem needs an owner and a fix date.
- Critical: Is there sufficient historical volume to train, fine-tune, or evaluate a model?
  AI models require more historical examples than most teams assume; too little data means the model cannot generalise.
- Critical: Is the data accurate — does it reflect what it claims to represent?
  Data that has drifted from what it was supposed to represent will produce outputs that reflect that drift.
- Important: Are values consistent across the dataset — same things described the same way?
  Inconsistent labelling or classification within the same field creates unpredictable model behaviour.
- Important: Has a bias review been conducted for any use case that affects people?
  Historical data reflects historical decisions. If the use case affects people, systematic bias must be assessed.
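A first-pass profile of the kind the section above asks for can be sketched quickly before commissioning a formal profiling tool. This assumes pandas and a hypothetical orders extract seeded with typical problems (a null, a duplicate row, inconsistent casing); the column names are illustrative:

```python
import pandas as pd

# Hypothetical extract with typical quality problems baked in
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 4, 5],                       # duplicated id
    "amount":   [19.9, 21.5, 21.5, None, 9999.0],      # null and an outlier
    "country":  ["GB", "gb", "gb", "DE", None],        # inconsistent casing
})

def profile(df):
    """Per-column null rate and distinct count, plus whole-frame duplicate count."""
    report = {
        col: {
            "null_rate": round(df[col].isna().mean(), 3),
            "distinct": int(df[col].nunique()),
        }
        for col in df.columns
    }
    report["_duplicate_rows"] = int(df.duplicated().sum())
    return report

print(profile(orders))
```

Even this crude report surfaces two checklist items at once: the null rates and duplicates belong in the data profile, and `country` showing both "GB" and "gb" is the consistency problem described above.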
Does each data source have clear, durable, individual ownership?

- Critical: Is there a named individual — not a team — accountable for each data source?
  Team ownership distributes accountability so broadly that nobody acts on it. Each source needs a named person.
- Important: Does that person know the AI project will depend on their data?
  Owners who do not know AI depends on their data will not flag changes, delays, or quality degradation.
- Important: Is the data documented in an actively maintained data dictionary?
  An undocumented data source is a risk. If the owner leaves, institutional knowledge leaves with them.
- Watch: Will the data owner remain in role for the duration of the project?
  A data owner who changes mid-project creates a gap in accountability at the worst possible time.
- Watch: Is there a handover plan if the data owner changes during delivery?
  Without a documented handover plan, a departure creates an unowned data source in production.
Is using this data in an AI model legally and ethically sound?

- Critical: Have legal and data protection been involved from the start — not just for a final sign-off?
  Legal involvement at the end is sign-off. Legal involvement at the start is architecture guidance.
- Critical: Does the original consent or collection basis cover AI training and inference use?
  Consent for reporting or analytics does not automatically extend to AI training or inference.
- Important: Have data residency requirements been mapped before platform selection?
  Selecting a cloud platform before mapping residency requirements can force costly architectural changes.
- Important: For regulated industries, have applicable frameworks been identified and confirmed?
  GDPR, HIPAA, financial services, and public sector frameworks each impose specific constraints on AI use.
- Critical: Is PII handling in training data, inference calls, and stored outputs compliant?
  Passing personal data through a third-party API or storing AI-inferred data carries specific GDPR obligations.
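For the PII item above, a pre-flight scan before text leaves for a third-party inference API can catch the obvious cases. A minimal sketch with regular expressions; the two patterns and the `redact` helper are illustrative assumptions only, not a substitute for a proper DLP or anonymisation tool, and real coverage would need far more pattern types:

```python
import re

# Illustrative patterns only: real PII detection needs far broader coverage
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "uk_mobile": re.compile(r"\b(?:\+44|0)7\d{9}\b"),
}

def redact(text):
    """Replace matches of each pattern with a [TYPE] placeholder before the API call."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}]", text)
    return text

print(redact("Contact jane.doe@example.com or 07700900123 about the claim."))
```

The point is architectural rather than the patterns themselves: redaction sits between your data and the external API, which is the kind of design decision early legal involvement shapes.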
Is the data fresh enough today, and is there a plan to keep it that way after go-live?

- Important: Is the training data recent enough to represent current conditions?
  The age of training data relative to current conditions determines whether the model reflects reality.
- Important: Is there an automated refresh process — not a manual one — for ongoing data supply?
  Manual refresh processes get deprioritised, forgotten, or orphaned when the responsible person moves on.
- Important: Is drift monitoring planned so degradation is detected before users notice it?
  Model degradation is gradual and invisible without monitoring — users notice before teams do.
- Watch: Has a retraining trigger and cadence been agreed with the data owner?
  A retraining trigger without a cadence leaves the model drifting until someone remembers to update it.
- Important: Are there named operational owners for refresh, monitoring, and retraining after go-live?
  Operational ownership of refresh and monitoring must be defined before go-live, not after.
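The drift monitoring described above can start very small: a population stability index (PSI) comparing the training-time distribution of a feature against recent production data. A sketch with NumPy; the 0.2 alert threshold is a common rule of thumb rather than a universal standard, and the normal samples stand in for a real feature:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population stability index between a baseline sample and a recent sample."""
    # Bin edges come from the baseline so both samples are measured on the same scale
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e_frac = np.histogram(expected, edges)[0] / len(expected)
    a_frac = np.histogram(actual, edges)[0] / len(actual)
    # Clip so empty bins do not produce log(0)
    e_frac = np.clip(e_frac, 1e-6, None)
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 5_000)   # distribution the model was trained on
drifted = rng.normal(0.5, 1.0, 5_000)    # production data after a mean shift

print("PSI, no drift:", psi(baseline, baseline))
print("drift alert:", psi(baseline, drifted) > 0.2)
```

Wired into the refresh pipeline, a score crossing the agreed threshold becomes the retraining trigger the checklist asks to have pre-agreed with the data owner, rather than waiting for users to notice degraded output.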