How I scraped 47K Agency Vista profiles without a headless browser
Walkthrough of building the Agency Vista Apify Actor with Cheerio and __NEXT_DATA__ extraction — deterministic, schema-validated, and cheap.
Most marketing agency directories don't expose an API. Agency Vista is one of them — 47,000+ agency profiles sitting behind a search UI, no public endpoint, no bulk export. The conventional answer is to spin up a headless browser and click around. I wanted something cheaper, faster, and far more durable.
Here's how the Agency Vista scraper works, and why every record is schema-validated before it leaves the Actor.
The data is already on the page
Agency Vista is a Next.js application. Like any Next.js site built on the pages router, it ships serialized server data inside a `<script id="__NEXT_DATA__" type="application/json">` block on every page. That payload contains the full rendered tree's props, including the agency objects the page uses to build the UI.
There are two page types I care about:
- List pages like `/agency/all/all` embed up to 50 list-stub agency objects with id, name, slug, description preview, city, social links, badges, and rating.
- Detail pages like `/agency/{slug}/summary` each embed a single rich agency object with services breakdown, industry focus, full client list, verification status, and the rest.
So the Actor's job is straightforward: visit a list URL, grab the stubs from
__NEXT_DATA__, then for each stub fetch its /summary detail page and merge
the two records.
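As a sketch of that merge step, a stub-plus-detail merge can be a one-liner where detail fields win on conflict and the stub fills the gaps. The field names here are illustrative, not the Actor's actual schema:

```js
// Sketch of the stub + detail merge. Field names are illustrative.
// Detail-page fields win on conflict; null/undefined detail fields are
// dropped so they don't clobber data the list-page stub already had.
function mergeAgencyRecords(stub, detail) {
  return {
    ...stub,
    ...Object.fromEntries(
      Object.entries(detail).filter(([, v]) => v != null),
    ),
  };
}

const stub = { slug: 'acme', name: 'Acme', city: 'Austin', rating: 4.8 };
const detail = { slug: 'acme', services: ['SEO', 'PPC'], city: null };
const merged = mergeAgencyRecords(stub, detail);
// merged keeps city from the stub and gains services from the detail page
```

The filter is the important part: detail pages occasionally omit fields the list stub had, and a naive spread would overwrite them with null.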
No JavaScript needs to execute. No browser. Just an HTTP GET and a regex (well, Cheerio) to extract the JSON blob.
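To make that concrete, here is the fully dependency-free version: a regex over the raw HTML. The Actor itself uses Cheerio (next section), so treat this as a sketch of why no browser is needed, not the production code:

```js
// Sketch: pull the __NEXT_DATA__ JSON out of raw HTML with a regex.
// The Actor uses Cheerio instead; this just shows the blob is plain text.
function extractNextData(html) {
  const m = html.match(
    /<script id="__NEXT_DATA__"[^>]*>([\s\S]*?)<\/script>/,
  );
  if (!m) throw new Error('__NEXT_DATA__ script tag not found');
  return JSON.parse(m[1]);
}

const html = `<html><body>
<script id="__NEXT_DATA__" type="application/json">{"props":{"pageProps":{"agencies":[{"slug":"acme"}]}}}</script>
</body></html>`;
const data = extractNextData(html);
// data.props.pageProps.agencies[0].slug === 'acme'
```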
Cheerio + Crawlee, no Puppeteer
The Actor is built on Crawlee's CheerioCrawler. Cheerio parses the HTML server-side and exposes a jQuery-like API, perfect for pulling a script tag's text content:
```js
// CheerioCrawler hands the request handler a Cheerio `$` directly:
// no page object, no browser.
const raw = $('script#__NEXT_DATA__').first().text();
const data = JSON.parse(raw);
```

That's it. The `data` object holds the entire Next.js page tree. From there, extraction is a series of safe property reads: `data.props.pageProps.agency` on detail pages, `data.props.pageProps.agencies` on list pages.
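Those safe property reads are just optional chaining with a fallback. A minimal sketch, using the two paths named above, that normalizes both page types to an array:

```js
// Safe reads over the parsed __NEXT_DATA__ tree: optional chaining plus
// a fallback, so a missing branch yields [] instead of a TypeError.
function readAgencies(data) {
  const list = data?.props?.pageProps?.agencies; // list pages: array of stubs
  const single = data?.props?.pageProps?.agency; // detail pages: one object
  return list ?? (single ? [single] : []);
}

const listPage = { props: { pageProps: { agencies: [{ slug: 'acme' }] } } };
const detailPage = { props: { pageProps: { agency: { slug: 'acme' } } } };
const emptyPage = { props: {} };
```

Normalizing to an array means the downstream validation loop doesn't care which page type produced the record.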
The win is real: a CheerioCrawler run on Apify's infrastructure costs roughly
1/10th of a PuppeteerCrawler run for the same throughput. No Chrome
binary, no headless rendering, no CDP overhead. Just HTTP and HTML parsing.
Schema-validated output, every record, every time
The thing I care about more than performance is trust. Buyers of scraped
data have one big fear: silent breakage. The Actor returns 50,000 records, you
ship them to your CRM, and three weeks later you discover that 4,000 of them
have the wrong services array because some upstream UI change broke an
extractor.
The fix: validate every record against a Zod schema before it gets pushed to the Apify dataset. If validation fails, the record is dropped (not silently massaged) and the Actor logs the failure with the source URL.
```js
import { z } from 'zod';
import { Dataset, log } from 'crawlee';

const AgencyRecord = z.object({
  recordId: z.string(),
  source: z.literal('agency_vista'),
  agencyName: z.string().min(1),
  services: z.array(z.string()),
  rating: z
    .object({
      value: z.number().nullable(),
      count: z.number().int().nullable(),
    })
    .nullable(),
  // ...
});

for (const candidate of extractedRecords) {
  const parsed = AgencyRecord.safeParse(candidate);
  if (!parsed.success) {
    log.warning('Validation failed', {
      url: candidate.sourceUrl,
      issues: parsed.error.issues,
    });
    continue;
  }
  await Dataset.pushData(parsed.data);
}
```

Pay-per-result pricing on Apify charges per pushed record, so the Zod gate is also a money gate: the buyer doesn't get charged for malformed records, and they get a clean dataset. Both sides win.
Drift detection runs every week
The other half of trust is catching breakage before the buyer does. There's a tiny separate scheduled Actor that runs once a week, fires the main Agency Vista Actor against a fixed seed of detail URLs, and asserts:
- Total records returned matches the seed count (no quiet drops)
- Specific known-good fields are still present (e.g. `agencyName` non-empty, `services` array length > 0 for the seed agencies)
- Schema validation rate stays above 99%
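The three assertions reduce to a small pure function. This is a sketch of the check logic only, not the scheduled Actor; the parameter and field names are illustrative:

```js
// Sketch of the weekly drift check. `records` is the main Actor's output
// for the fixed seed; `validationRate` comes from the run's stats.
// Names are illustrative, not the scheduled Actor's actual code.
function driftCheck(records, seedCount, validationRate) {
  const failures = [];
  if (records.length !== seedCount) {
    failures.push(`expected ${seedCount} records, got ${records.length}`);
  }
  for (const r of records) {
    if (!r.agencyName) failures.push(`empty agencyName: ${r.sourceUrl}`);
    if (!(r.services?.length > 0)) failures.push(`empty services: ${r.sourceUrl}`);
  }
  if (validationRate < 0.99) {
    failures.push(`validation rate ${validationRate} below 0.99`);
  }
  return failures; // empty => healthy; non-empty => open a GitHub issue
}

const ok = driftCheck(
  [{ agencyName: 'Acme', services: ['SEO'], sourceUrl: 'https://example.com' }],
  1,
  1.0,
);
// ok is [] when every assertion holds
```

Returning the failure list (rather than throwing on the first problem) means one GitHub issue can report every broken assertion at once.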
If anything fails, the drift bot opens a GitHub issue against the Actor's source repo. I see the issue, I fix the extractor, the dataset stays clean.
Why I think this approach generalizes
Almost every modern directory site is built on a JS framework that ships its
data into the page as serialized JSON. Next.js has __NEXT_DATA__. SvelteKit
has __sveltekit_data. Nuxt has __NUXT_DATA__. Even sites built on
Gatsby/Astro/Remix usually have a per-route data island somewhere in the
markup.
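One way to see the generalization is to parameterize the extractor by framework. The Next.js and Nuxt 3 script ids below are real; the rest of the shape is a sketch, and note that Nuxt 3 payloads are devalue-encoded rather than plain JSON, so the function returns the raw string:

```js
// Sketch: the same data-island extractor, keyed by framework.
// Returns the raw payload string; decoding differs per framework
// (plain JSON for Next.js, devalue encoding for Nuxt 3).
const DATA_ISLANDS = {
  nextjs: /<script id="__NEXT_DATA__"[^>]*>([\s\S]*?)<\/script>/,
  nuxt3: /<script[^>]*id="__NUXT_DATA__"[^>]*>([\s\S]*?)<\/script>/,
};

function extractIsland(html, framework) {
  const m = html.match(DATA_ISLANDS[framework]);
  return m ? m[1] : null;
}

const page =
  '<script id="__NEXT_DATA__" type="application/json">{"a":1}</script>';
```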
If you can find that blob, you don't need a browser. You just need a cheap HTTP fetch, an HTML parser, a Zod schema, and the discipline to never ship a record that fails validation.
That's the recipe. Two more datasets coming with the same playbook.