
How to detect when a scraper silently breaks: the complete drift-detection guide

A practical, code-first guide to scraper drift detection — synthetic tests, schema validation rates, baselines, and alerting. Catch upstream UI changes before your buyers do.

A scraper silently breaks when its code keeps running, its job keeps succeeding, and its output keeps flowing — but the data is now subtly wrong. The agency name field is empty for 18% of records. The salary field is suddenly the company's name. Nothing crashes. No alert fires. Your buyer notices in three weeks when their CRM is full of garbage.

This post is the complete playbook for catching silent scraper failures before they reach a buyer. It's the system Directory Datasets uses on every Apify Actor we ship.

The four ways scrapers silently fail

Every silent scraper failure falls into one of these buckets:

| Failure mode | What changed | What you see in the data |
|--------------|--------------|--------------------------|
| Selector drift | Upstream renamed a CSS class or restructured the DOM | Field is empty/null for many records |
| Field semantics drift | Upstream changed what a field means | Field is populated but wrong (e.g., raw HTML in a "description" field) |
| Pagination drift | Upstream changed page-size or cursor behavior | Total record count is silently lower |
| Anti-bot drift | Upstream rolled out new fingerprinting or rate limits | Records 1-50 look fine, records 51+ are systematically blocked |

A monitoring system has to catch all four. Health-check pings and HTTP 200 responses catch zero of them.

The drift-detection stack in one diagram

┌──────────────────────────────────────────────────┐
│ 1. SCHEMA-VALIDATE EVERY RECORD AT WRITE TIME    │  → Zod / Pydantic gate
│    Drop + log records that fail validation       │
└──────────────────────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────────┐
│ 2. SYNTHETIC TESTS DAILY                         │  → fixed seed of known URLs
│    Assert known-good fields are non-empty        │     compared against a baseline
│    Assert known record counts                    │
└──────────────────────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────────┐
│ 3. STATISTICAL BASELINES PER RUN                 │  → schema-pass-rate, null-rate
│    Compare run-over-run, alert on drift          │     per field, distribution shifts
└──────────────────────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────────┐
│ 4. ALERTS THAT TELL YOU THE WHY                  │  → "field X null-rate jumped
│    Open an issue with diff + sample failing URL  │     12% → 78% — see issue #N"
└──────────────────────────────────────────────────┘

Layer 1: Schema-validate every record at write time

The single highest-leverage thing you can do is validate every record against a strict schema before it gets pushed to the dataset.

In Node.js, that's Zod:

import { z } from 'zod';
import { Dataset, log } from 'crawlee'; // Dataset.pushData and log ship with the Apify SDK / Crawlee
 
const AgencyRecord = z.object({
  recordId: z.string(),
  source: z.literal('agency_vista'),
  agencyName: z.string().min(1),
  services: z.array(z.string()),
  rating: z.object({
    value: z.number().nullable(),
    count: z.number().int().nullable(),
  }).nullable(),
  scrapedAt: z.string().datetime(),
});
 
// `metrics` stands in for whatever counter client you emit to (StatsD, Prometheus, etc.)
for (const candidate of extractedRecords) {
  const parsed = AgencyRecord.safeParse(candidate);
  if (!parsed.success) {
    // Drop the record, but make the failure loud: log it and count it
    log.warning('Validation failed', {
      url: candidate.sourceUrl,
      issues: parsed.error.issues,
    });
    metrics.increment('records.validation_failed');
    continue;
  }
  await Dataset.pushData(parsed.data);
  metrics.increment('records.validated');
}

In Python, the equivalent is Pydantic. The principle is identical: a record that doesn't pass the schema doesn't reach the buyer.

This single layer turns a class of silent failures (selector drift) into loud failures (records dropped, metric incremented, log emitted). Your buyer never sees the bad rows.

Layer 2: Synthetic tests against known-good fixtures

Schema validation catches malformed records. It does not catch records where every field is structurally valid but semantically wrong — e.g., the scraper now puts the company description into the agencyName field, and both are non-empty strings.

The fix is synthetic tests: a small fixed set of known URLs whose expected output you've manually verified once and committed to the repo.

A synthetic test for the Agency Vista scraper:

import assert from 'node:assert';
import { get } from 'lodash'; // any dotted-path getter works for 'location.city'-style keys

const FIXTURES = [
  {
    url: 'https://agencyvista.com/agency/page-1-media/marketing-agency-boca-raton-florida-us',
    expect: {
      agencyName: 'Page 1 Media',
      'location.city': 'Boca Raton',
      services_min_length: 1,
      verified: true,
    },
  },
  // 5-10 more fixtures covering different agency types
];

for (const fixture of FIXTURES) {
  // runActor is a stand-in: run the main Actor on one URL, return the record it produced
  const record = await runActor({ startUrls: [fixture.url] });
  for (const [path, expected] of Object.entries(fixture.expect)) {
    if (path.endsWith('_min_length')) {
      // 'services_min_length: 1' asserts record.services is an array with >= 1 items
      const actual = get(record, path.replace(/_min_length$/, ''));
      assert(Array.isArray(actual) && actual.length >= (expected as number),
        `${fixture.url}: ${path} length=${actual?.length} < ${expected}`);
    } else {
      const actual = get(record, path);
      assert(actual === expected,
        `${fixture.url}: ${path}=${JSON.stringify(actual)} !== ${JSON.stringify(expected)}`);
    }
  }
}

Run synthetic tests on a schedule, not just in CI. The whole point of drift detection is that the upstream site changed, not your code. CI passing on git push proves nothing about whether agencyvista.com shipped a UI change last Tuesday.

The Directory Datasets pattern: a separate Apify Actor that runs once a week, fires the main Actor against the fixtures, and opens a GitHub issue if any assertion fails. Total cost per week: a few cents.
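In sketch form, assuming the runner is triggered by an Apify schedule (the Actor ID, OWNER/REPO, and the assertFixtures wrapper around the assertion loop above are placeholders for your own setup):

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Run the main Actor against the fixture URLs and wait for it to finish
const run = await client.actor('your-org/agency-vista-scraper').call({
  startUrls: FIXTURES.map((f) => ({ url: f.url })),
});
const { items } = await client.dataset(run.defaultDatasetId).listItems();

try {
  assertFixtures(items); // the assertion loop from above, wrapped in a function
} catch (err) {
  // Open a GitHub issue so the failure lands somewhere with history and ownership
  await fetch('https://api.github.com/repos/OWNER/REPO/issues', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.GITHUB_TOKEN}`,
      Accept: 'application/vnd.github+json',
    },
    body: JSON.stringify({
      title: '[drift] Synthetic test failed: agency-vista',
      body: String(err),
    }),
  });
  throw err; // fail the runner too, so the Apify run history shows red
}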

Layer 3: Statistical baselines per run

Synthetic tests catch failures on the fixed seed. They do not catch a 12% null-rate increase on a 50,000-record full run.

Per-run baselines do. After every full Actor run, emit summary metrics to your storage of choice:

// pushedCount and failedCount come from the Layer 1 counters; `records` is the run's output
const countNulls = (rows: Record<string, unknown>[], field: string) =>
  rows.filter((row) => row[field] == null).length;

const summary = {
  actorId: 'agency-vista',
  runId,
  timestamp: new Date().toISOString(),
  records_pushed: pushedCount,
  records_validation_failed: failedCount,
  validation_pass_rate: pushedCount / (pushedCount + failedCount),
  field_null_rates: {
    description: countNulls(records, 'description') / records.length,
    website: countNulls(records, 'website') / records.length,
    teamSize: countNulls(records, 'teamSize') / records.length,
    rating: countNulls(records, 'rating') / records.length,
  },
};

Compare each run's summary against the rolling 30-run median. Alert when:

  • validation_pass_rate drops below 99%
  • Any field_null_rate jumps by more than 15 percentage points run-over-run
  • records_pushed drops by more than 25% on a deterministic input

These thresholds catch the failure modes synthetic tests miss: gradual upstream-side data quality degradation, new anti-bot rules that block half your traffic, and pagination drift that quietly halves your output.
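A minimal sketch of that comparison, assuming the last 30 summaries can be loaded from wherever you store them (loadRecentSummaries and sendAlert are stand-ins for your own storage and alerting; drift here is measured against the rolling median rather than strictly run-over-run):

// Stand-in: fetch the last 30 run summaries for this Actor from your metrics store
const history = await loadRecentSummaries('agency-vista', 30);

const median = (xs: number[]) =>
  [...xs].sort((a, b) => a - b)[Math.floor(xs.length / 2)];

const alerts: string[] = [];

if (summary.validation_pass_rate < 0.99) {
  alerts.push(`validation_pass_rate dropped to ${(summary.validation_pass_rate * 100).toFixed(1)}%`);
}

for (const [field, rate] of Object.entries(summary.field_null_rates)) {
  const baseline = median(history.map((h) => h.field_null_rates[field] ?? 0));
  if (rate - baseline > 0.15) {
    alerts.push(`${field} null-rate jumped ${(baseline * 100).toFixed(0)}% → ${(rate * 100).toFixed(0)}%`);
  }
}

const baselineCount = median(history.map((h) => h.records_pushed));
if (summary.records_pushed < baselineCount * 0.75) {
  alerts.push(`records_pushed dropped ${baselineCount} → ${summary.records_pushed}`);
}

if (alerts.length > 0) await sendAlert(alerts.join('\n'));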

Layer 4: Alerts that tell you why, not just that

The final piece: when an alert fires, it has to be actionable in under five minutes. That means the alert payload includes:

  1. The metric that drifted (field_null_rates.website)
  2. Before/after values (12% → 78%)
  3. A sample failing record URL (https://agencyvista.com/agency/...)
  4. A pre-filled GitHub issue link with all of the above embedded

If your alert is "scraper degraded — investigate", you've built a paging system, not a drift-detection system.
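A sketch of assembling that payload, assuming a drift descriptor produced by the Layer 3 comparison (GitHub's /issues/new page accepts title and body as query parameters; OWNER/REPO and sendAlert are placeholders):

// Hypothetical drift descriptor from the baseline comparison
const drift = {
  metric: 'field_null_rates.website',
  before: 0.12,
  after: 0.78,
  sampleUrl: 'https://agencyvista.com/agency/...',
};

const title = `[drift] ${drift.metric}: ${drift.before * 100}% → ${drift.after * 100}%`;
const body = [
  `Metric: ${drift.metric}`,
  `Before/after: ${drift.before * 100}% → ${drift.after * 100}%`,
  `Sample failing URL: ${drift.sampleUrl}`,
].join('\n');

// GitHub pre-fills the new-issue form from query parameters
const issueLink =
  'https://github.com/OWNER/REPO/issues/new?' +
  new URLSearchParams({ title, body }).toString();

await sendAlert(`${title}\n${issueLink}`);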

Why this matters more in 2026 than it did in 2022

Two changes make drift detection load-bearing in a way it wasn't a few years ago:

  • AI-driven UI changes upstream are constant. Sites are A/B-testing layouts, regenerating components from design tokens, and rolling out new structures weekly. Selector drift used to be quarterly; it's now bi-weekly on active sites.
  • LLM-extraction-based scrapers are rampant. Many AI scraping tools quietly hallucinate fields when they can't find them. The output looks right; the data is wrong. The validation gate is no longer optional — it's the only thing between you and a buyer's CRM.

A 4-question audit for your current scraper

If you can't answer "yes" to all four of these for every scraper you ship, you have silent failure exposure:

  1. Does every record pass a strict schema before it's persisted, with failures dropped and counted?
  2. Do you run a fixed-fixture synthetic test on a schedule (not just on git push)?
  3. Do you compare per-field null-rates run-over-run?
  4. Do your alerts include a sample failing URL and the metric delta?

If any answer is "no", that's where the next silent failure will come from. Start there.


This is the system Directory Datasets uses on Agency Vista and OnlineJobs.ph. Every dataset ships with the four layers above wired in.