Directory Datasets

How do you detect when a scraper silently breaks?

A practical, code-first guide to scraper drift detection — synthetic tests, schema validation rates, baselines, and alerting. Catch upstream UI changes before your buyers do.

A scraper silently breaks when its code keeps running, its job keeps succeeding, and its output keeps flowing — but the data is now subtly wrong. The agency name field is empty for 18% of records. The salary field is suddenly the company's name. Nothing crashes. No alert fires. Your buyer notices in three weeks when their CRM is full of garbage.

This post is the complete playbook for catching silent scraper failures before they reach a buyer. It's the system Directory Datasets uses on every Apify Actor we ship.

What are the four ways scrapers silently fail?

Every silent scraper failure falls into one of these buckets:

Failure mode | What changed | What you see in the data
------------ | ------------ | ------------------------
Selector drift | Upstream renamed a CSS class or DOM structure | Field is empty/null for many records
Field semantics drift | Upstream changed what a field means | Field is populated but wrong (e.g., raw HTML in a "description" field)
Pagination drift | Upstream changed page-size or cursor behavior | Total record count is silently lower
Anti-bot drift | Upstream rolled out new fingerprinting or rate limits | Records 1-50 look fine, records 51+ are systematically blocked

A monitoring system has to catch all four. Health-check pings and HTTP 200 responses catch zero of them.

What does a complete drift-detection stack look like?

┌──────────────────────────────────────────────────┐
│ 1. SCHEMA-VALIDATE EVERY RECORD AT WRITE TIME    │  → Zod / Pydantic gate
│    Drop + log records that fail validation       │
└──────────────────────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────────┐
│ 2. SYNTHETIC TESTS DAILY                         │  → fixed seed of known URLs
│    Assert known-good fields are non-empty        │     compared against a baseline
│    Assert known record counts                    │
└──────────────────────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────────┐
│ 3. STATISTICAL BASELINES PER RUN                 │  → schema-pass-rate, null-rate
│    Compare run-over-run, alert on drift          │     per field, distribution shifts
└──────────────────────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────────┐
│ 4. ALERTS THAT TELL YOU THE WHY                  │  → "field X null-rate jumped
│    Open an issue with diff + sample failing URL  │     12% → 78% — see issue #N"
└──────────────────────────────────────────────────┘

How do you implement layer 1: schema validation at write time?

The single highest-leverage thing you can do is validate every record against a strict schema before it gets pushed to the dataset.

In Node.js, that's Zod:

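// Assumption: Dataset and log come from Crawlee; adjust the import to the SDK you use.
import { Dataset, log } from 'crawlee';
import { z } from 'zod';

// Strict schema: a record must match this shape before it is persisted.
const AgencyRecord = z.object({
  recordId: z.string(),
  source: z.literal('agency_vista'),
  agencyName: z.string().min(1),
  services: z.array(z.string()),
  rating: z.object({
    value: z.number().nullable(),
    count: z.number().int().nullable(),
  }).nullable(),
  scrapedAt: z.string().datetime(),
});

// Assumed to exist elsewhere in the Actor: `extractedRecords` (raw scraped
// objects) and `metrics` (any counter client that exposes increment()).
for (const candidate of extractedRecords) {
  const parsed = AgencyRecord.safeParse(candidate);
  if (!parsed.success) {
    // Drop the record, but leave a trail: the URL plus the exact schema issues.
    log.warning('Validation failed', {
      url: candidate.sourceUrl,
      issues: parsed.error.issues,
    });
    metrics.increment('records.validation_failed');
    continue;
  }
  // Only validated data reaches the dataset.
  await Dataset.pushData(parsed.data);
  metrics.increment('records.validated');
}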

In Python, the equivalent is Pydantic. The principle is identical: a record that doesn't pass the schema doesn't reach the buyer.

This single layer turns a class of silent failures (selector drift) into loud failures (records dropped, metric incremented, log emitted). Your buyer never sees the bad rows.

How do you implement layer 2: synthetic tests against known-good fixtures?

Schema validation catches malformed records. It does not catch records where every field is structurally valid but semantically wrong — e.g., the scraper now puts the company description into the agencyName field, and both are non-empty strings.

The fix is synthetic tests: a small fixed set of known URLs whose expected output you've manually verified once and committed to the repo.

A synthetic test for the Agency Vista scraper:

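import { strict as assert } from 'node:assert';
// Assumption: `get` is lodash's dotted-path getter.
import { get } from 'lodash';

const FIXTURES = [
  {
    url: 'https://agencyvista.com/agency/page-1-media/marketing-agency-boca-raton-florida-us',
    expect: {
      agencyName: 'Page 1 Media',
      'location.city': 'Boca Raton',
      services_min_length: 1,
      verified: true,
    },
  },
  // 5-10 more fixtures covering different agency types
];

// `runActor` is a placeholder for however you invoke the Actor and read back one record.
for (const fixture of FIXTURES) {
  const record = await runActor({ startUrls: [fixture.url] });
  for (const [key, expected] of Object.entries(fixture.expect)) {
    if (key.endsWith('_min_length')) {
      // "services_min_length" asserts on the length of the "services" array,
      // so strip the suffix before looking the field up.
      const path = key.slice(0, -'_min_length'.length);
      const actual = get(record, path);
      // Parenthesize the cast: `a >= b as number` would cast the comparison result.
      assert(Array.isArray(actual) && actual.length >= (expected as number),
        `${fixture.url}: ${path} length=${actual?.length} < ${expected}`);
    } else {
      const actual = get(record, key);
      assert(actual === expected,
        `${fixture.url}: ${key}=${JSON.stringify(actual)} !== ${JSON.stringify(expected)}`);
    }
  }
}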

Run synthetic tests on a schedule, not just in CI. The whole point of drift detection is that the upstream site changed, not your code. CI passing on git push proves nothing about whether agencyvista.com shipped a UI change last Tuesday.

The Directory Datasets pattern: a separate Apify Actor that runs once a week, fires the main Actor against the fixtures, and opens a GitHub issue if any assertion fails. Total cost per week: a few cents.
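A scheduled runner usually wants to collect every failed expectation and report them together, rather than stop at the first failed assertion. The sketch below is one dependency-free way to do that, assuming the same fixture shape as above. The names `Fixture`, `getPath`, and `checkFixture` are illustrative, not part of any library.

```typescript
// Illustrative names only: Fixture, getPath and checkFixture are not library APIs.
type Expected = string | number | boolean;

interface Fixture {
  url: string;
  expect: Record<string, Expected>;
}

const MIN_LENGTH_SUFFIX = '_min_length';

// Resolve a dotted path such as "location.city" on a plain object.
function getPath(record: unknown, path: string): unknown {
  let current: unknown = record;
  for (const key of path.split('.')) {
    if (current === null || typeof current !== 'object') return undefined;
    current = (current as Record<string, unknown>)[key];
  }
  return current;
}

// Return one message per failed expectation; an empty array means the fixture passed.
function checkFixture(fixture: Fixture, record: unknown): string[] {
  const failures: string[] = [];
  for (const key of Object.keys(fixture.expect)) {
    const expected = fixture.expect[key];
    const suffixAt = key.length - MIN_LENGTH_SUFFIX.length;
    const isMinLength = suffixAt > 0 && key.indexOf(MIN_LENGTH_SUFFIX, suffixAt) === suffixAt;
    if (isMinLength) {
      // "services_min_length: 1" means the "services" array needs at least 1 item.
      const path = key.slice(0, suffixAt);
      const actual = getPath(record, path);
      const length = Array.isArray(actual) ? actual.length : 0;
      if (length < Number(expected)) {
        failures.push(`${fixture.url}: ${path} length=${length} < ${expected}`);
      }
    } else {
      const actual = getPath(record, key);
      if (actual !== expected) {
        failures.push(`${fixture.url}: ${key}=${JSON.stringify(actual)} !== ${JSON.stringify(expected)}`);
      }
    }
  }
  return failures;
}
```

The returned messages can go straight into the body of the issue the scheduled Actor opens, so a single run reports every drifted field at once.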

How do you implement layer 3: statistical baselines per run?

Synthetic tests catch failures on the fixed seed. They do not catch a 12% null-rate increase on a 50,000-record full run.

Per-run baselines do. After every full Actor run, emit summary metrics to your storage of choice:

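// `records` holds the validated records pushed in this run; `countNulls` is a
// helper you supply that counts records where the given field is missing.
const totalSeen = pushedCount + failedCount;

const summary = {
  actorId: 'agency-vista',
  runId,
  timestamp: new Date().toISOString(),
  records_pushed: pushedCount,
  records_validation_failed: failedCount,
  // Guard against an empty run, which would otherwise produce NaN.
  validation_pass_rate: totalSeen === 0 ? 0 : pushedCount / totalSeen,
  field_null_rates: {
    description: countNulls(records, 'description') / records.length,
    website: countNulls(records, 'website') / records.length,
    teamSize: countNulls(records, 'teamSize') / records.length,
    rating: countNulls(records, 'rating') / records.length,
  },
};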
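The `countNulls(records, field)` call in the summary is not defined in this post. A minimal sketch of what such a helper might look like follows. Treating empty or whitespace-only strings as missing is an assumption made here, and `ScrapedRecord`, `isMissing`, and `nullRates` are illustrative names.

```typescript
// Illustrative helper for the countNulls(records, field) call in the summary.
type ScrapedRecord = Record<string, unknown>;

// Assumption: null, undefined, and blank strings all count as "missing".
function isMissing(value: unknown): boolean {
  return value === null
    || value === undefined
    || (typeof value === 'string' && value.trim() === '');
}

// Count how many records lack a usable value for the given top-level field.
function countNulls(records: ScrapedRecord[], field: string): number {
  let missing = 0;
  for (const record of records) {
    if (isMissing(record[field])) missing += 1;
  }
  return missing;
}

// Compute the null rate for several fields at once; an empty run yields 0 rather than NaN.
function nullRates(records: ScrapedRecord[], fields: string[]): Record<string, number> {
  const rates: Record<string, number> = {};
  for (const field of fields) {
    rates[field] = records.length === 0 ? 0 : countNulls(records, field) / records.length;
  }
  return rates;
}
```

Counting blank strings as missing matters for selector drift, because a selector that stops matching often yields an empty string rather than null.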

Compare each run's summary against the rolling 30-run median. Alert when:

  • validation_pass_rate drops below 99%
  • Any field_null_rate jumps by more than 15 percentage points run-over-run
  • records_pushed drops by more than 25% on a deterministic input

These thresholds catch the failure modes synthetic tests miss: gradual upstream-side data quality degradation, new anti-bot rules that block half your traffic, and pagination drift that quietly halves your output.
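The three alert rules can be sketched as a single comparison function. This version measures the null-rate and record-count rules against the rolling median of up to 30 prior runs, which is one reading of the rules above. The `RunSummary` shape mirrors the summary object, while `median` and `detectDrift` are illustrative names.

```typescript
// Illustrative sketch: compares one run against the median of recent runs.
interface RunSummary {
  validation_pass_rate: number;
  records_pushed: number;
  field_null_rates: Record<string, number>;
}

function median(values: number[]): number {
  if (values.length === 0) return NaN;
  const sorted = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

function pct(fraction: number): string {
  return `${(fraction * 100).toFixed(0)}%`;
}

// Returns one human-readable alert per violated rule; empty means no drift detected.
function detectDrift(current: RunSummary, previousRuns: RunSummary[]): string[] {
  const alerts: string[] = [];

  // Rule 1: absolute floor on the schema pass rate.
  if (current.validation_pass_rate < 0.99) {
    alerts.push(`validation_pass_rate: ${(current.validation_pass_rate * 100).toFixed(1)}% is below 99%`);
  }

  const baseline = previousRuns.slice(-30);
  if (baseline.length === 0) return alerts; // nothing to compare against yet

  // Rule 2: any field's null rate more than 15 percentage points above baseline.
  for (const field of Object.keys(current.field_null_rates)) {
    const past = baseline
      .map((run) => run.field_null_rates[field])
      .filter((v) => typeof v === 'number');
    if (past.length === 0) continue;
    const base = median(past);
    const now = current.field_null_rates[field];
    if (now - base > 0.15) {
      alerts.push(`field_null_rates.${field}: ${pct(base)} -> ${pct(now)}`);
    }
  }

  // Rule 3: record count more than 25% below baseline.
  const basePushed = median(baseline.map((run) => run.records_pushed));
  if (basePushed > 0 && current.records_pushed < basePushed * 0.75) {
    alerts.push(`records_pushed: ${basePushed} -> ${current.records_pushed}`);
  }
  return alerts;
}
```

Using a median rather than the previous run alone keeps one unusually bad or good run from shifting the baseline.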

How do you implement layer 4: actionable alerts (not just "investigate")?

The final piece: when an alert fires, it has to be actionable in under five minutes. That means the alert payload includes:

  1. The metric that drifted (field_null_rates.website)
  2. Before/after values (12% → 78%)
  3. A sample failing record URL (https://agencyvista.com/agency/...)
  4. A pre-filled GitHub issue link with all of the above embedded

If your alert is "scraper degraded — investigate", you've built a paging system, not a drift-detection system.
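As a sketch of the payload described above, the function below builds a pre-filled GitHub issue link from the metric, its before and after values, and a sample URL. It relies on the `title` and `body` query parameters that GitHub's new-issue page accepts. The `DriftAlert` shape, the function names, and the repository slug passed in are assumptions for illustration.

```typescript
// Illustrative payload shape; field names are assumptions, not a fixed format.
interface DriftAlert {
  metric: string;    // e.g. "field_null_rates.website"
  before: number;    // baseline value as a fraction (0.12 = 12%)
  after: number;     // current value as a fraction
  sampleUrl: string; // one source URL that exhibits the failure
}

function formatPercent(fraction: number): string {
  return `${Math.round(fraction * 100)}%`;
}

// Build a link that opens GitHub's new-issue form with title and body filled in.
// `repo` is an "owner/name" slug supplied by the caller.
function buildIssueUrl(repo: string, alert: DriftAlert): string {
  const delta = `${formatPercent(alert.before)} -> ${formatPercent(alert.after)}`;
  const title = `Drift: ${alert.metric} ${delta}`;
  const body = [
    `Metric: ${alert.metric}`,
    `Before/after: ${delta}`,
    `Sample failing URL: ${alert.sampleUrl}`,
  ].join('\n');
  return `https://github.com/${repo}/issues/new`
    + `?title=${encodeURIComponent(title)}`
    + `&body=${encodeURIComponent(body)}`;
}
```

Whoever receives the alert can click the link, see the delta and a reproducible URL, and start debugging without first querying the metrics store.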

Why does drift detection matter more in 2026 than in 2022?

Two changes make drift detection load-bearing in a way it wasn't a few years ago:

  • AI-driven UI changes upstream are constant. Sites are A/B-testing layouts, regenerating components from design tokens, and rolling out new structures weekly. Selector drift used to be quarterly; it's now bi-weekly on active sites.
  • LLM-extraction-based scrapers are rampant. Many AI scraping tools quietly hallucinate fields when they can't find them. The output looks right; the data is wrong. The validation gate is no longer optional — it's the only thing between you and a buyer's CRM.

What four questions should I ask of my current scraper?

If you can't answer "yes" to all four of these for every scraper you ship, you have silent failure exposure:

  1. Does every record pass a strict schema before it's persisted, with failures dropped and counted?
  2. Do you run a fixed-fixture synthetic test on a schedule (not just on git push)?
  3. Do you compare per-field null-rates run-over-run?
  4. Do your alerts include a sample failing URL and the metric delta?

If any answer is "no", that's where the next silent failure will come from. Start there.

Frequently asked questions

What's the difference between scraper monitoring and drift detection?

Monitoring asks "did the scraper run?" Drift detection asks "did the scraper produce the right data?" A scraper can run successfully (HTTP 200, no exceptions, jobs complete) and still be silently wrong. Drift detection catches that gap.

How often should I run synthetic tests?

For most production scrapers, daily is the right cadence — enough to catch upstream UI changes within a day, not so often that the synthetic test itself becomes a load issue. For high-volume revenue-critical scrapers, hourly synthetic checks are reasonable.

What's the right schema-validation pass rate threshold?

99% is the bar for a healthy scraper on a stable upstream. Below 95% means the scraper is broken and needs immediate investigation. Between 95% and 99% is borderline; check whether the failures are systematic (a field semantically changed) or scattered (a few malformed entries upstream).
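These bands, and the systematic-versus-scattered check, can be sketched in a few lines. The `ValidationIssue` shape below mirrors the `path` array found on Zod issues. `classifyPassRate` and `dominantField` are illustrative names, and what share counts as "systematic" is left to the caller.

```typescript
// Illustrative sketch of the thresholds above.
type Health = 'healthy' | 'borderline' | 'broken';

function classifyPassRate(passRate: number): Health {
  if (passRate >= 0.99) return 'healthy';
  if (passRate >= 0.95) return 'borderline';
  return 'broken';
}

// Minimal issue shape: the path to the field that failed validation.
interface ValidationIssue {
  path: Array<string | number>;
}

// Find the field responsible for the largest share of failures.
// A share near 1 suggests a systematic change; a low share suggests scattered noise.
function dominantField(issues: ValidationIssue[]): { field: string; share: number } | null {
  if (issues.length === 0) return null;
  const counts: Record<string, number> = {};
  for (const issue of issues) {
    const field = issue.path.join('.');
    counts[field] = (counts[field] || 0) + 1;
  }
  let topField = '';
  let topCount = 0;
  for (const field of Object.keys(counts)) {
    if (counts[field] > topCount) {
      topField = field;
      topCount = counts[field];
    }
  }
  return { field: topField, share: topCount / issues.length };
}
```

In a borderline run, a single field accounting for most failures points to an upstream change to that field rather than random bad entries.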

Should I use Zod, Pydantic, or JSON Schema for validation?

Match the language. Zod for TypeScript/JavaScript scrapers; Pydantic for Python. Both can express the same kinds of constraints: field types, required versus nullable fields, string formats, and numeric bounds. JSON Schema is the least ergonomic but is portable across runtimes if you need cross-language consistency.

How much does drift detection add to operating cost?

Roughly 0.5-2% of the main scraper's compute cost. A weekly synthetic run against 10-50 known-good URLs costs cents per week on most platforms. The savings from catching one silent failure dwarf the entire annual drift-detection budget.

How do I detect drift if the upstream site doesn't change frequently?

Even seemingly stable sites ship A/B tests, layout refreshes, and component refactors monthly. Run synthetic tests anyway: they're cheap, and the cost of a missed change is much higher than the cost of running the check.


This is the system Directory Datasets uses on Agency Vista and OnlineJobs.ph. Every dataset ships with the four layers above wired in.