How can you scrape 47K Agency Vista profiles without a headless browser?
Walkthrough of building the Agency Vista Apify Actor with Cheerio and __NEXT_DATA__ extraction — deterministic, schema-validated, and 10× cheaper than Playwright.
To scrape 47,000+ Agency Vista profiles without a headless browser, you read the serialized JSON Agency Vista already ships in every page's __NEXT_DATA__ script tag. The site is a Next.js application that hydrates from server-rendered JSON; a CheerioCrawler plus JSON.parse() extracts the full agency record. No Chrome, no Puppeteer, no Playwright — about 10× cheaper than the headless-browser equivalent and far more durable.
This post walks through how the Agency Vista scraper works, the Zod gate that validates every record before it bills the buyer, and why this approach generalizes to most modern directory sites.
Why is the data already on the page?
Agency Vista is a Next.js application. Like every Next.js site, it ships serialized server data inside a <script id="__NEXT_DATA__" type="application/json"> block on every page. That payload contains the full rendered tree's props — including the agency objects the page uses to build the UI.
There are two page types I care about:
- List pages like /agency/all/all: they embed up to 50 list-stub agency objects with id, name, slug, description preview, city, social links, badges, and rating.
- Detail pages like /agency/[slug]/summary: each embeds a single rich agency object with services breakdown, industry focus, full client list, verification status, and the rest.
So the Actor's job is straightforward: visit a list URL, grab the stubs from __NEXT_DATA__, then for each stub fetch its /summary detail page and merge the two records.
No JavaScript needs to execute. No browser. Just an HTTP GET and a regex (well, Cheerio) to extract the JSON blob.
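To make the "just an HTTP GET and a regex" claim concrete, here is a minimal dependency-free sketch, not the Actor's actual code. The helper name extractNextData and the sample HTML are assumptions for illustration; the only thing taken from the site is the script tag's id.

```typescript
// Hypothetical helper: pull the __NEXT_DATA__ JSON out of raw HTML without a DOM parser.
// Assumes the script tag carries id="__NEXT_DATA__", as described above.
function extractNextData(html: string): unknown {
  const match = html.match(
    /<script[^>]*\bid=["']__NEXT_DATA__["'][^>]*>([\s\S]*?)<\/script>/i,
  );
  const payload = match?.[1];
  if (payload === undefined) return null; // tag missing: no blob on this page
  try {
    return JSON.parse(payload);
  } catch {
    return null; // tag present, but the payload is not valid JSON
  }
}

// Usage with an inline sample page (the agency shape here is illustrative only).
const sampleHtml = `<html><body>
<script id="__NEXT_DATA__" type="application/json">{"props":{"pageProps":{"agency":{"name":"Acme"}}}}</script>
</body></html>`;
const sampleData = extractNextData(sampleHtml);
```

A regex is enough for a sketch because the tag has a fixed id and its JSON body should not contain a literal closing script tag. In production, an HTML parser such as Cheerio is the more robust choice.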
How does the Cheerio + Crawlee stack work?
The Actor is built on Crawlee's CheerioCrawler. Cheerio parses the fetched HTML server-side and exposes a jQuery-like API, which is perfect for pulling a script tag's text content:
// Inside the CheerioCrawler requestHandler, Crawlee supplies `$` (the parsed page) on the context.
const raw = $('script#__NEXT_DATA__').first().text();
const data = JSON.parse(raw);
That's it. The data object holds the entire Next.js page tree. From there, extraction is a series of safe property reads: data.props.pageProps.agency on detail pages, data.props.pageProps.agencies on list pages.
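The "safe property reads" and the stub-plus-detail merge can be sketched roughly as follows. Only the props.pageProps.agency and props.pageProps.agencies paths and the /agency/[slug]/summary pattern come from the text above; the type names, helper names, and individual fields are assumptions for illustration.

```typescript
// Field names below are illustrative assumptions, not Agency Vista's real schema.
type AgencyStub = { slug: string; name: string; city?: string };
type AgencyDetail = { slug: string; services?: string[]; industries?: string[] };
type MergedAgency = AgencyStub & Partial<AgencyDetail>;

type NextData = {
  props?: { pageProps?: { agency?: AgencyDetail; agencies?: AgencyStub[] } };
};

// List pages: return the stubs, or an empty array if the path is missing.
function readStubs(data: NextData): AgencyStub[] {
  return data.props?.pageProps?.agencies ?? [];
}

// Detail pages: return the single agency object, or null if it is absent.
function readDetail(data: NextData): AgencyDetail | null {
  return data.props?.pageProps?.agency ?? null;
}

// Build the detail path for a stub, following the /agency/[slug]/summary pattern.
function detailPath(stub: AgencyStub): string {
  return `/agency/${encodeURIComponent(stub.slug)}/summary`;
}

// Merge stub and detail; detail fields win on conflict because they are richer.
function mergeAgency(stub: AgencyStub, detail: AgencyDetail | null): MergedAgency {
  return { ...stub, ...(detail ?? {}) };
}

// Usage with hand-written sample payloads.
const listData: NextData = {
  props: { pageProps: { agencies: [{ slug: 'acme-digital', name: 'Acme Digital', city: 'Austin' }] } },
};
const detailData: NextData = {
  props: { pageProps: { agency: { slug: 'acme-digital', services: ['SEO'] } } },
};
const mergedAgencies = readStubs(listData).map((stub) => mergeAgency(stub, readDetail(detailData)));
```

Optional chaining means a missing pageProps yields an empty list or null instead of a thrown TypeError, which leaves the decision about what counts as a failure to the validation step.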
The win is real: a CheerioCrawler run on Apify's infrastructure costs roughly 1/10th of a PuppeteerCrawler run for the same throughput. No Chrome binary, no headless rendering, no CDP overhead. Just HTTP and HTML parsing.
How does schema validation prevent silent breakage?
The thing I care about more than performance is trust. Buyers of scraped data have one big fear: silent breakage. The Actor returns 50,000 records, you ship them to your CRM, and three weeks later you discover that 4,000 of them have the wrong services array because some upstream UI change broke an extractor.
The fix: validate every record against a Zod schema before it gets pushed to the Apify dataset. If validation fails, the record is dropped (not silently massaged) and the Actor logs the failure with the source URL.
import { Dataset, log } from 'crawlee';
import { z } from 'zod';
// Every record must match this shape before it is pushed (and billed).
const AgencyRecord = z.object({
  recordId: z.string(),
  source: z.literal('agency_vista'),
  agencyName: z.string().min(1),
  services: z.array(z.string()),
  rating: z.object({
    value: z.number().nullable(),
    count: z.number().int().nullable(),
  }).nullable(),
  // ...
});
for (const candidate of extractedRecords) {
  const parsed = AgencyRecord.safeParse(candidate);
  if (!parsed.success) {
    // Drop the record and log why, so failures stay visible.
    log.warning('Validation failed', {
      url: candidate.sourceUrl,
      issues: parsed.error.issues,
    });
    continue;
  }
  await Dataset.pushData(parsed.data);
}
Pay-per-result pricing on Apify charges per pushed record. So the Zod gate is also a money gate: the buyer doesn't get charged for malformed records, and they get a clean dataset. Both sides win.
How does drift detection catch upstream changes before buyers see them?
The other half of trust is catching breakage before the buyer does. There's a tiny separate scheduled Actor that runs once a week, fires the main Agency Vista Actor against a fixed seed of detail URLs, and asserts:
- Total records returned matches the seed count (no quiet drops)
- Specific known-good fields are still present (e.g. agencyName non-empty, services array length > 0 for the seed agencies)
- Schema validation rate stays above 99%
If anything fails, the drift bot opens a tracking issue against the Actor's source repo. I see the issue, I fix the extractor, the dataset stays clean.
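The three assertions can be sketched as a single pure function. This is an assumed shape, not the drift bot's real code: the names checkDrift, DriftInput, and candidatesSeen are hypothetical, and the record fields mirror the Zod schema shown earlier.

```typescript
// Hypothetical shapes for illustration; only the three checks come from the text.
type SeedRecord = { sourceUrl: string; agencyName: string; services: string[] };

type DriftInput = {
  seedUrls: string[];     // fixed seed of detail URLs
  records: SeedRecord[];  // records that passed validation and were returned
  candidatesSeen: number; // records extracted before validation
};

// Returns a list of human-readable failures; an empty list means no drift detected.
function checkDrift(input: DriftInput, minPassRate = 0.99): string[] {
  const failures: string[] = [];
  const { seedUrls, records, candidatesSeen } = input;

  // 1. No quiet drops: one record per seed URL.
  if (records.length !== seedUrls.length) {
    failures.push(`expected ${seedUrls.length} records, got ${records.length}`);
  }
  // 2. Known-good fields are still populated.
  for (const record of records) {
    if (record.agencyName.trim() === '') failures.push(`empty agencyName: ${record.sourceUrl}`);
    if (record.services.length === 0) failures.push(`empty services: ${record.sourceUrl}`);
  }
  // 3. Validation pass rate has not fallen below the threshold.
  const passRate = candidatesSeen === 0 ? 0 : records.length / candidatesSeen;
  if (passRate < minPassRate) {
    failures.push(`validation pass rate ${(passRate * 100).toFixed(1)}% is below threshold`);
  }
  return failures;
}

// Usage: a healthy run produces no failures.
const healthyRun = checkDrift({
  seedUrls: ['https://example.com/agency/a/summary'],
  records: [{ sourceUrl: 'https://example.com/agency/a/summary', agencyName: 'A', services: ['SEO'] }],
  candidatesSeen: 1,
});
```

Returning a list of failures rather than throwing on the first one means a single tracking issue can report everything that drifted in one run.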
Does this approach work on other directory sites?
Almost every modern directory site is built on a JS framework that ships its data into the page as serialized JSON.
- Next.js uses __NEXT_DATA__
- Nuxt uses __NUXT_DATA__
- SvelteKit uses __sveltekit_data
- Astro / Remix typically embed per-route data islands in the markup
If you can find that blob, you don't need a browser. You just need a cheap HTTP fetch, an HTML parser, a Zod schema, and the discipline to never ship a record that fails validation.
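A quick way to triage a new target is to fetch one page and look for these markers in the raw HTML. The sketch below is illustrative only: the marker strings are the ones named in the list above and may differ across framework versions, and detectDataBlob is a hypothetical helper name.

```typescript
// Marker strings come from the list above; treat them as starting points, not guarantees.
const DATA_MARKERS: Array<{ framework: string; marker: string }> = [
  { framework: 'Next.js', marker: '__NEXT_DATA__' },
  { framework: 'Nuxt', marker: '__NUXT_DATA__' },
  { framework: 'SvelteKit', marker: '__sveltekit_data' },
];

// Returns the first framework whose marker appears in the HTML, or null if none match.
function detectDataBlob(html: string): string | null {
  const hit = DATA_MARKERS.find(({ marker }) => html.includes(marker));
  return hit ? hit.framework : null;
}

// Usage with a tiny inline sample.
const detected = detectDataBlob('<script id="__NEXT_DATA__" type="application/json">{}</script>');
```

A null result does not prove a browser is needed. It only means none of these particular markers were found, so it is still worth reading the page source by hand.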
That's the recipe. Two more datasets coming with the same playbook.
Frequently asked questions
What's the difference between __NEXT_DATA__ and React Server Components?
__NEXT_DATA__ is the Pages-Router serialization format Next.js used for years; it's still emitted by many Next.js apps for backwards compatibility and for any page that uses getServerSideProps / getStaticProps. React Server Components (App Router) use a different streaming format with the data inlined as self.__next_f.push([...]). Sites built fresh after late 2023 increasingly use the RSC format, which is parseable but more complex.
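For a sense of what "parseable but more complex" means, here is a rough sketch that collects the string chunks from self.__next_f.push([...]) calls. It rests on my understanding that entries whose first element is 1 carry text data, and that each call sits in its own script tag; both are assumptions worth verifying against the target site. The helper name extractFlightChunks and the sample markup are hypothetical.

```typescript
// Hypothetical helper: concatenate the string payloads pushed into self.__next_f.
// Assumes each push call is a JSON-compatible array literal at the end of its script tag.
function extractFlightChunks(html: string): string {
  const pushCall = /self\.__next_f\.push\((\[[\s\S]*?\])\)\s*<\/script>/g;
  let stream = '';
  let found: RegExpExecArray | null;
  while ((found = pushCall.exec(html)) !== null) {
    const literal = found[1];
    if (literal === undefined) continue;
    try {
      const parsed: unknown = JSON.parse(literal);
      // Keep only [1, "<text>"] entries (assumed to be the data chunks).
      if (Array.isArray(parsed) && parsed[0] === 1 && typeof parsed[1] === 'string') {
        stream += parsed[1];
      }
    } catch {
      // Argument was not plain JSON; skip this chunk.
    }
  }
  return stream;
}

// Usage: String.raw keeps the backslashes exactly as they would appear in page source.
const rscSample = String.raw`<script>self.__next_f.push([1,"0:{\"name\":\"Acme\"}\n"])</script>`;
const flightText = extractFlightChunks(rscSample);
```

The result is still the raw streamed text, made up of newline-separated rows with id prefixes, so a second parsing pass is needed before any agency object falls out. That extra step is the added complexity compared with a single JSON.parse() on __NEXT_DATA__.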
Is __NEXT_DATA__ always present on a Next.js site?
On Pages Router pages, yes. On App Router pages, no — App Router uses RSC streaming instead. Most directory sites still use Pages Router or have hybrid setups, so __NEXT_DATA__ is present on the majority of public Next.js directories in 2026.
How do I check if a site has __NEXT_DATA__?
Open the target page in your browser, hit Ctrl+U (View Source), and search for __NEXT_DATA__. If you find a <script id="__NEXT_DATA__"> block, the data you need is almost certainly inside it.
Doesn't this break when the site changes structure?
That's exactly what schema validation and drift detection are for. When the site changes, the schema validation pass rate drops, the drift bot fires, and the extractor is patched. The data buyer never sees a broken record because records that fail validation are dropped before billing.
What about anti-bot protection like Cloudflare or Datadome?
Agency Vista does not use aggressive anti-bot protection. For sites that do, you'll need either residential proxies or managed scraping infrastructure (Browserbase, Bright Data). The __NEXT_DATA__ approach still saves cost there because you can send simple HTTP requests through a proxy rather than run full headless browsers.
Can I use this approach on JavaScript-heavy SPAs that don't server-render?
No. If the data is fetched client-side after page load, it's not in the initial HTML. For those, you need a headless browser or you need to find and call the underlying XHR endpoint directly.