Directory Datasets

How I scraped 47K Agency Vista profiles without a headless browser

Walkthrough of building the Agency Vista Apify Actor with Cheerio and __NEXT_DATA__ extraction — deterministic, schema-validated, and cheap.

Most marketing agency directories don't expose an API. Agency Vista is one of them — 47,000+ agency profiles sitting behind a search UI, no public endpoint, no bulk export. The conventional answer is to spin up a headless browser and click around. I wanted something cheaper, faster, and far more durable.

Here's how the Agency Vista scraper works, and why every record is schema-validated before it leaves the Actor.

The data is already on the page

Agency Vista is a Next.js application. Like any Next.js site built on the Pages Router, it ships serialized server data inside a <script id="__NEXT_DATA__" type="application/json"> block on every page. That payload contains the props the server used to render the page — including the agency objects the page uses to build the UI.

There are two page types I care about:

  1. List pages like /agency/all/all — they embed up to 50 list-stub agency objects with id, name, slug, description preview, city, social links, badges, and rating.
  2. Detail pages like /agency/{slug}/summary — each embeds a single rich agency object with services breakdown, industry focus, full client list, verification status, and the rest.

So the Actor's job is straightforward: visit a list URL, grab the stubs from __NEXT_DATA__, then for each stub fetch its /summary detail page and merge the two records.

No JavaScript needs to execute. No browser. Just an HTTP GET and a regex (well, Cheerio) to extract the JSON blob.
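If you want to sanity-check that before writing any crawler code, a one-off Node script is enough. A minimal sketch, assuming Node 18+ for global fetch and that the list page lives at the agencyvista.com path mentioned above:

import * as cheerio from 'cheerio';

// One-off check: fetch a list page and peek at the keys Next.js serialized for it.
const html = await (await fetch('https://agencyvista.com/agency/all/all')).text();
const $ = cheerio.load(html);
const nextData = JSON.parse($('script#__NEXT_DATA__').first().text());
console.log(Object.keys(nextData.props.pageProps));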

Cheerio + Crawlee, no Puppeteer

The Actor is built on Crawlee's CheerioCrawler. Cheerio parses HTML server-side and exposes a jQuery-style API — perfect for pulling a script tag's text content:

// Inside the CheerioCrawler's requestHandler — Crawlee hands the handler a loaded Cheerio instance as `$`
const raw = $('script#__NEXT_DATA__').first().text();
const data = JSON.parse(raw);

That's it. The data object holds everything Next.js serialized for the route. From there, extraction is a series of safe property reads — data.props.pageProps.agency on detail pages, data.props.pageProps.agencies on list pages.
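Wired into Crawlee, the whole list-to-detail flow fits in two route handlers. This is a simplified sketch rather than the Actor's actual source — the domain, the labels, and the exact pageProps field names are my assumptions:

import { CheerioCrawler, createCheerioRouter, Dataset } from 'crawlee';

const router = createCheerioRouter();

router.addHandler('LIST', async ({ $, crawler }) => {
  const data = JSON.parse($('script#__NEXT_DATA__').first().text());
  const stubs = data?.props?.pageProps?.agencies ?? [];
  // Enqueue each agency's detail page, carrying the list stub along for the merge.
  await crawler.addRequests(stubs.map((stub) => ({
    url: `https://agencyvista.com/agency/${stub.slug}/summary`,
    label: 'DETAIL',
    userData: { stub },
  })));
});

router.addHandler('DETAIL', async ({ $, request }) => {
  const data = JSON.parse($('script#__NEXT_DATA__').first().text());
  const detail = data?.props?.pageProps?.agency ?? {};
  // Merge the list stub with the richer detail object; detail fields win on conflict.
  // (In the real Actor the merged record goes through the Zod gate shown below, not straight to the dataset.)
  await Dataset.pushData({ ...request.userData.stub, ...detail });
});

const crawler = new CheerioCrawler({ requestHandler: router });
await crawler.run([{ url: 'https://agencyvista.com/agency/all/all', label: 'LIST' }]);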

The win is real: a CheerioCrawler run on Apify's infrastructure costs roughly 1/10th of a PuppeteerCrawler run for the same throughput. No Chrome binary, no headless rendering, no CDP overhead. Just HTTP and HTML parsing.

Schema-validated output, every record, every time

The thing I care about more than performance is trust. Buyers of scraped data have one big fear: silent breakage. The Actor returns 50,000 records, you ship them to your CRM, and three weeks later you discover that 4,000 of them have the wrong services array because some upstream UI change broke an extractor.

The fix: validate every record against a Zod schema before it gets pushed to the Apify dataset. If validation fails, the record is dropped (not silently massaged) and the Actor logs the failure with the source URL.

import { z } from 'zod';
import { Dataset, log } from 'crawlee';
 
const AgencyRecord = z.object({
  recordId: z.string(),
  source: z.literal('agency_vista'),
  agencyName: z.string().min(1),
  services: z.array(z.string()),
  rating: z.object({
    value: z.number().nullable(),
    count: z.number().int().nullable(),
  }).nullable(),
  // ...
});
 
for (const candidate of extractedRecords) {
  const parsed = AgencyRecord.safeParse(candidate);
  if (!parsed.success) {
    log.warning('Validation failed', {
      url: candidate.sourceUrl,
      issues: parsed.error.issues,
    });
    continue;
  }
  await Dataset.pushData(parsed.data);
}

Pay-per-result pricing on Apify charges per pushed record. So the Zod gate is also a money gate: the buyer doesn't get charged for malformed records, and they get a clean dataset. Both sides win.

Drift detection runs every week

The other half of trust is catching breakage before the buyer does. There's a tiny separate scheduled Actor that runs once a week, fires the main Agency Vista Actor against a fixed seed of detail URLs, and asserts:

  • Total records returned matches the seed count (no quiet drops)
  • Specific known-good fields are still present (e.g. agencyName non-empty, services array length > 0 for the seed agencies)
  • Schema validation rate stays above 99%

If anything fails, the drift bot opens a GitHub issue against the Actor's source repo. I see the issue, I fix the extractor, the dataset stays clean.
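A minimal sketch of those checks using the apify-client package — the actor ID, input field name, and seed URL below are placeholders, and the validation-rate check and GitHub issue creation are left out:

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });
// Placeholder seed list — the real drift bot keeps a fixed set of known-good detail URLs.
const SEED_URLS = ['https://agencyvista.com/agency/some-known-agency/summary'];

// Run the main Actor against the seed and read back its dataset.
const run = await client.actor('username/agency-vista-scraper').call({
  startUrls: SEED_URLS.map((url) => ({ url })),
});
const { items } = await client.dataset(run.defaultDatasetId).listItems();

const failures = [];
if (items.length !== SEED_URLS.length) {
  failures.push(`expected ${SEED_URLS.length} records, got ${items.length}`);
}
for (const item of items) {
  if (!item.agencyName) failures.push(`empty agencyName in ${item.recordId}`);
  if (!Array.isArray(item.services) || item.services.length === 0) {
    failures.push(`empty services in ${item.recordId}`);
  }
}
// Failing the run is what raises the alert (and, from there, the GitHub issue).
if (failures.length > 0) throw new Error(`Drift detected:\n${failures.join('\n')}`);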

Why I think this approach generalizes

Almost every modern directory site is built on a JS framework that ships its data into the page as serialized JSON. Next.js (Pages Router) has __NEXT_DATA__. SvelteKit inlines its payload in __sveltekit_*-prefixed script blocks. Nuxt 3 has __NUXT_DATA__ (Nuxt 2 exposes window.__NUXT__). Even sites built on Gatsby/Astro/Remix usually have a per-route data island somewhere in the markup.

If you can find that blob, you don't need a browser. You just need a cheap HTTP fetch, an HTML parser, a Zod schema, and the discipline to never ship a record that fails validation.
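As a rough illustration, a probe for the common data-island selectors might look like the sketch below — the selector list is not exhaustive, and some payloads (Nuxt 3's devalue output, for instance) need framework-specific decoding rather than a plain JSON.parse:

import * as cheerio from 'cheerio';

// Probe a URL for the usual serialized-data blobs before reaching for a browser.
async function findDataIsland(url) {
  const html = await (await fetch(url)).text();
  const $ = cheerio.load(html);

  for (const selector of ['script#__NEXT_DATA__', 'script#__NUXT_DATA__']) {
    const raw = $(selector).first().text();
    if (raw) return { selector, raw };
  }
  // Fallback: any JSON script tag big enough to plausibly be a per-route payload.
  const fallback = $('script[type="application/json"]')
    .toArray()
    .map((el) => ({ selector: 'script[type="application/json"]', raw: $(el).text() }))
    .find((entry) => entry.raw.length > 1000);
  return fallback ?? null;
}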

That's the recipe. Two more datasets coming with the same playbook.