Document Analysis
Every document you upload runs through a four-stage pipeline: type detection, OCR and structured extraction, validation, and red-flag detection. The structured result is then attached to any report that references the document.
Supported formats
| Format | Max size | Notes |
|---|---|---|
| PDF | 20 MB | Searchable and scanned PDFs both supported. |
| JPG / JPEG | 10 MB | Best for photos of physical documents. |
| PNG | 10 MB | Preferred for screenshots and digital exports. |
| HEIC | 10 MB | Automatically transcoded to JPEG before processing. |
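A client could mirror these limits before uploading, to fail fast on oversized or unsupported files. The following is a minimal sketch, not the service's own validation: `check_upload` and `MAX_BYTES` are hypothetical names, and it assumes that "MB" means 1024 × 1024 bytes and that the format is identified by file extension.

```python
from pathlib import Path

MB = 1024 * 1024  # assumption: binary megabytes; the limits may be decimal

# Size limits taken from the table above, keyed by lowercase extension.
MAX_BYTES = {
    ".pdf": 20 * MB,
    ".jpg": 10 * MB,
    ".jpeg": 10 * MB,
    ".png": 10 * MB,
    ".heic": 10 * MB,
}


def check_upload(filename: str, size_bytes: int) -> tuple[bool, str]:
    """Return (accepted, reason) for a candidate upload (hypothetical helper)."""
    ext = Path(filename).suffix.lower()
    limit = MAX_BYTES.get(ext)
    if limit is None:
        return False, f"unsupported format: {ext or '(none)'}"
    if size_bytes > limit:
        return False, f"{ext} exceeds the {limit // MB} MB limit"
    return True, "ok"
```

For example, `check_upload("scan.PDF", 15 * MB)` is accepted, while an 11 MB `.jpg` is rejected.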
The pipeline
1. Type detection. A lightweight classifier guesses the document type (passport, bank statement, employment letter, invitation, itinerary, etc.) so downstream extractors know which schema to apply.
2. OCR & structured extraction. Text is extracted with layout awareness; key fields (dates, IDs, amounts, names) are normalised into a typed schema per document type.
3. Validation. Each field is checked for plausibility (date ranges, currency consistency, name match across documents, passport MRZ check digits, etc.).
4. Red-flag detection. Pattern-level checks look for common reasons reviewers reject documents: passports that expire soon, inconsistent salaries, account balances that spike unnaturally days before submission, mismatched names, and low-resolution photos of seals or stamps.
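To make the validation stage concrete, the passport MRZ check digit mentioned in step 3 can be computed with the standard ICAO Doc 9303 rule: each character maps to a number (digits keep their value, `A`-`Z` map to 10-35, the filler `<` is 0), the values are multiplied by the repeating weights 7, 3, 1, and the sum is taken modulo 10. The sketch below shows that rule only; the function names are hypothetical and it is not necessarily how the pipeline implements the check.

```python
def mrz_char_value(ch: str) -> int:
    """Map one MRZ character to its numeric value per ICAO Doc 9303."""
    if "0" <= ch <= "9":
        return int(ch)
    if "A" <= ch <= "Z":
        return ord(ch) - ord("A") + 10
    if ch == "<":
        return 0
    raise ValueError(f"invalid MRZ character: {ch!r}")


def mrz_check_digit(field: str) -> int:
    """Weighted sum with repeating weights 7, 3, 1, reduced modulo 10."""
    weights = (7, 3, 1)
    total = sum(mrz_char_value(ch) * weights[i % 3] for i, ch in enumerate(field))
    return total % 10


def mrz_field_is_valid(field: str, check_digit: str) -> bool:
    """Compare the computed digit with the one printed in the MRZ."""
    return check_digit.isdigit() and mrz_check_digit(field) == int(check_digit)
```

A field passes when the computed digit equals the digit printed next to it in the MRZ, for example `mrz_field_is_valid("740812", "2")`.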
Per-document output
The result attached to each document is a stable JSON object with an extracted object and checks and red_flags arrays. Reports use these as inputs to the document signal score.
```json
{
  "id": "doc_01HXYZpassport",
  "type": "passport",
  "status": "analysed",
  "extracted": {
    "full_name": "AYŞE YILMAZ",
    "passport_number": "U12345678",
    "issued_at": "2022-03-04",
    "expires_at": "2032-03-03",
    "country": "TR"
  },
  "checks": [
    { "id": "expiry_buffer", "status": "pass" },
    { "id": "mrz_consistency", "status": "pass" },
    { "id": "photo_quality", "status": "pass" }
  ],
  "red_flags": []
}
```
Document types currently supported
- Passport (biographical page)
- National ID
- Bank statements (Turkish, EU, US, UK formats)
- Employment letter / pay slip
- Tax return summary
- Invitation letter (host or sponsor)
- Hotel reservation / flight itinerary
- Travel medical insurance
- Previous visas and entry/exit stamps
Signal categories
Each check and red_flag belongs to a signal category. The category determines how the result feeds the document signal score and which recommendation templates it can trigger.
| Category | Example checks | Effect when failed |
|---|---|---|
| validity | expiry_buffer, mrz_consistency, check_digit | Hard red flag: strongly lowers the document score. |
| consistency | name_match, dob_match, address_match across files | Cross-document mismatch flagged for reviewer attention. |
| finance_integrity | balance_stability, salary_consistency, deposit_spike | Raises a finances red flag and a proof-of-funds recommendation. |
| media_quality | photo_quality, seal_resolution, page_completeness | Soft flag: prompts a clearer re-upload recommendation. |
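A consumer of the per-document output could group failed checks by category along these lines. This is a sketch under assumptions: the check-to-category mapping is copied from the table above, `failed_by_category` is a hypothetical helper, and `"fail"` is an assumed status value (the sample output only shows `"pass"`).

```python
# Check IDs and categories as listed in the signal categories table.
CATEGORY_OF = {
    "expiry_buffer": "validity",
    "mrz_consistency": "validity",
    "check_digit": "validity",
    "name_match": "consistency",
    "dob_match": "consistency",
    "address_match": "consistency",
    "balance_stability": "finance_integrity",
    "salary_consistency": "finance_integrity",
    "deposit_spike": "finance_integrity",
    "photo_quality": "media_quality",
    "seal_resolution": "media_quality",
    "page_completeness": "media_quality",
}


def failed_by_category(result: dict) -> dict[str, list[str]]:
    """Group the IDs of failed checks by signal category."""
    failed: dict[str, list[str]] = {}
    for check in result.get("checks", []):
        # Assumption: a failed check is reported with status "fail".
        if check.get("status") == "fail":
            category = CATEGORY_OF.get(check["id"], "unknown")
            failed.setdefault(category, []).append(check["id"])
    return failed
```

Applied to the passport sample, where every check passes, the function returns an empty dictionary.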
Worked example
Suppose an applicant uploads a passport and a bank statement. Type detection labels each file, OCR extracts the fields, and validation and red-flag detection run the category checks. The passport passes expiry_buffer and mrz_consistency (as in the JSON above), but the statement trips finance_integrity.deposit_spike because a large deposit landed four days before submission. That single red flag lowers the document signal score and emits a finances recommendation asking the applicant to document the source of funds. This is the same event that the Recommendations page turns into rec_increase_funds_proof.
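One way to picture a deposit_spike check is to compare deposits made shortly before submission against the account's earlier deposits. The sketch below is illustrative only: the `Deposit` type, the 7-day window, and the 3× median threshold are all assumptions, not the service's actual rule.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from statistics import median


@dataclass
class Deposit:
    posted_on: date
    amount: float


def find_deposit_spikes(
    deposits: list[Deposit],
    submitted_on: date,
    window_days: int = 7,   # assumed look-back window before submission
    ratio: float = 3.0,     # assumed multiple of the typical deposit
) -> list[Deposit]:
    """Return recent deposits that are much larger than earlier ones."""
    cutoff = submitted_on - timedelta(days=window_days)
    earlier = [d.amount for d in deposits if d.posted_on < cutoff]
    if not earlier:
        return []  # no history to compare against
    baseline = median(earlier)
    return [
        d
        for d in deposits
        if cutoff <= d.posted_on <= submitted_on and d.amount > ratio * baseline
    ]
```

With monthly deposits of about 3,000 and a single 25,000 deposit four days before submission, only the late deposit is returned.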