Directory Datasets

Cheerio vs Playwright in 2026: when each is the right tool for web scraping

A benchmark-driven guide to choosing between Cheerio and Playwright. Real numbers — 10× memory reduction, 3 min → 14 sec — and a decision tree you can use today.

Cheerio is a server-side HTML parser. Playwright is a full headless browser automation framework. In 2026 they're often discussed as if they're competing for the same job. They aren't — and using the wrong one for the wrong site is the single most expensive mistake in production scraping.

This post is the benchmark-backed guide to choosing between them, with real numbers from a 47K-record directory scraper, the four-question decision tree we use, and the cases where you genuinely need a browser.

TL;DR — the headline numbers

When the data you need is already in the HTML response, switching from Playwright to Cheerio gives you:

| Metric | Playwright | Cheerio | Improvement |
|--------|-----------|---------|-------------|
| Wall-clock execution | 3 min 0 sec | 14 sec | ~13× faster |
| Memory per worker | ~500 MB | ~50 MB | 10× lower |
| Concurrent workers per CPU | 2-3 | 20-30 | ~10× higher |
| Cold-start time | 800-2,000 ms | ~5 ms | ~400× faster |
| Cost per 10K records | ~$0.50-$1.20 | ~$0.05-$0.15 | ~10× cheaper |

Source: Apify Engineering documented this exact migration in their "Switching from Playwright to Cheerio" post; we see the same numbers on our Agency Vista Actor.

The reason isn't subtle: Playwright launches a Chromium binary, evaluates JavaScript, paints the DOM, and maintains a CDP connection. Cheerio reads bytes and parses them. If you don't need a browser, a browser is the dominant cost.

When you actually need Playwright

Despite the numbers above, Playwright is the right answer when any of these are true:

  • The data you need is rendered by client-side JavaScript and not present in the initial HTML response.
  • You need to interact with the page: click, type, scroll, wait for XHR, dismiss modals.
  • The site requires a logged-in session that depends on a JavaScript-set cookie (e.g., LinkedIn, most SaaS dashboards).
  • The site fingerprints requests using TLS, JS challenges, or canvas-based bot tests that only a real browser engine passes.
  • The data you need only loads after a setTimeout or an IntersectionObserver triggers.

For all of those, Cheerio is the wrong tool. Playwright (or Puppeteer, or a managed browser like Browserbase) is your only option.

When Cheerio wins by an order of magnitude

Cheerio wins when the data is already in the HTML you got back. This is more common than people assume in 2026:

  • Next.js, Nuxt, SvelteKit, Astro, Remix — every modern JS framework ships server-rendered HTML by default. Most of the data is in the initial response or in a __NEXT_DATA__ / __NUXT_DATA__ / __sveltekit_data script tag.
  • Most directories, marketplaces, and job boards — agency directories, license boards, niche job sites, and review platforms are overwhelmingly server-rendered.
  • Sitemap-driven extraction — when you're crawling from a sitemap of detail-page URLs, every page is a single HTTP GET. No interaction needed.
  • Paginated JSON list pages — many sites embed their full agency/job/profile list as serialized JSON in a script tag for SSR hydration. You're reading JSON out of the HTML, not parsing HTML.

The Agency Vista scraper that powers our Agency Vista dataset is a textbook example: 47,000+ profiles, all server-rendered Next.js, every page's data sitting in __NEXT_DATA__. A CheerioCrawler + a JSON parse is the entire extraction pipeline.
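That extraction step can be sketched with no dependencies at all; here a regex stands in for the Cheerio selector (`$('script#__NEXT_DATA__')`), and the sample payload's field names (`pageProps.agency.name`) are invented for illustration — real `__NEXT_DATA__` payloads vary per site:

```javascript
// Pull the SSR payload a Next.js page embeds in <script id="__NEXT_DATA__">.
// Dependency-free sketch: a regex plays the role of the Cheerio selector.
function extractNextData(html) {
  const match = html.match(
    /<script id="__NEXT_DATA__"[^>]*>([\s\S]*?)<\/script>/
  );
  return match ? JSON.parse(match[1]) : null;
}

// Hypothetical sample page; field names are illustrative only.
const sample = `<html><body>
<script id="__NEXT_DATA__" type="application/json">
  {"props":{"pageProps":{"agency":{"name":"Acme Social"}}}}
</script></body></html>`;

console.log(extractNextData(sample).props.pageProps.agency.name); // "Acme Social"
```

The whole "pipeline" is a string match and a `JSON.parse` — no browser, no DOM, no hydration.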

The four-question decision tree

When you're scoping a new scraper, ask these in order:

1. Is there a public API or a hidden internal API?
   YES → Use neither Cheerio nor Playwright. Hit the API directly.
   NO  → Continue.

2. View source on a target page. Is the data you need in the HTML?
   YES → Cheerio.
   NO  → Continue.

3. Is the data in a __NEXT_DATA__ / __NUXT_DATA__ /
   inline-JSON script tag in the HTML?
   YES → Cheerio. Parse the script tag content as JSON.
   NO  → Continue.

4. Does the data appear after JS execution, scrolling, or interaction?
   YES → Playwright.

Most scrapers stop at step 2 or step 3. Production teams that default to Playwright are paying 10× the cost for nothing.
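The four questions reduce to a few lines of code. The flag names here (`hasApi`, `dataInHtml`, and so on) are hypothetical — you answer them by inspecting the target site by hand:

```javascript
// The four-question decision tree as a function.
// Flags are answered manually per target site; names are illustrative.
function chooseTool({ hasApi, dataInHtml, dataInInlineJson, needsInteraction }) {
  if (hasApi) return 'api';               // 1. hit the API directly
  if (dataInHtml) return 'cheerio';       // 2. data is in the raw HTML
  if (dataInInlineJson) return 'cheerio'; // 3. parse the script tag as JSON
  return 'playwright';                    // 4. needs JS execution or interaction
}

console.log(chooseTool({ hasApi: false, dataInHtml: true })); // "cheerio"
console.log(chooseTool({ needsInteraction: true }));          // "playwright"
```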

The "view source" test in practice

The cheap way to know which one you need: open the target page in your browser, hit Ctrl+U (or Cmd+Option+U), and search the raw HTML for a known data point — an agency name, a job title, a license number you can see on the rendered page.

  • If Ctrl+U shows your data → Cheerio.
  • If Ctrl+U does not show your data, but the data is visible in the rendered page → Playwright (or check the Network tab for an XHR endpoint that returns it as JSON, in which case: hit the API directly).

This 30-second test will save you days of building the wrong stack.
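The same test can be scripted. This sketch assumes Node 18+ (for the global `fetch`); `url` and `needle` are placeholders you supply — a real page and a string you can see on it:

```javascript
// Does the raw HTML (what Cheerio sees: no JS execution, no hydration)
// contain a data point that is visible on the rendered page?
// This is the Ctrl+U search as a function.
function decide(html, needle) {
  return html.includes(needle) ? 'cheerio' : 'playwright';
}

// Fetch the page exactly as a Cheerio-based crawler would (one HTTP GET).
// `url` and `needle` are placeholders, e.g. a profile URL and an agency name.
async function viewSourceTest(url, needle) {
  const res = await fetch(url); // Node 18+ global fetch
  return decide(await res.text(), needle);
}
```

If `decide` says `playwright`, still check the Network tab first — an XHR endpoint returning JSON beats both tools.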

Crawlee makes the choice trivial

If you're using Crawlee (the open-source crawler library that powers Apify Actors), switching between the two is a one-import change:

```javascript
// Cheerio version
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
  async requestHandler({ $, request }) {
    const data = JSON.parse($('script#__NEXT_DATA__').first().text());
    // ...
  },
});
```

```javascript
// Playwright version (same handler shape)
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
  async requestHandler({ page, request }) {
    await page.waitForSelector('.profile-name');
    const name = await page.locator('.profile-name').textContent();
    // ...
  },
});
```

Same Crawlee API, same request queueing, same retry/concurrency knobs. The framework is the same; only the worker model changes. That makes it cheap to start with Cheerio and add a Playwright fallback for the 10% of pages that need it.

A common trap: defaulting to Playwright "just in case"

In 2026, AI-coding assistants overwhelmingly suggest Playwright as the default for any scraping task. They'll write you a 200-line Playwright script for a job that needs fetch() + 3 lines of Cheerio.

The cost shows up later, in three places:

  1. Apify Compute Units, AWS bills, or Browserbase invoices — 10× the wall-clock and 10× the memory means 10× the cost.
  2. Concurrency ceiling — 30 concurrent Cheerio workers fit on a small VPS. 30 concurrent Playwright workers need a serious machine.
  3. Failure surface — Chromium occasionally hangs, segfaults, or leaks memory under load. Cheerio is a pure function: input HTML → output object. Less can go wrong.
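The concurrency-ceiling claim is easy to check back-of-envelope, using the per-worker figures from the benchmark table (~500 MB vs ~50 MB):

```javascript
// Memory budget for 30 concurrent workers, per the benchmark table.
const WORKERS = 30;
const PLAYWRIGHT_MB = 500; // ~memory per Playwright worker
const CHEERIO_MB = 50;     // ~memory per Cheerio worker

const playwrightGb = (WORKERS * PLAYWRIGHT_MB) / 1024;
const cheerioGb = (WORKERS * CHEERIO_MB) / 1024;

console.log(playwrightGb.toFixed(1)); // "14.6" GB: a serious machine
console.log(cheerioGb.toFixed(1));    // "1.5" GB: fits a small VPS
```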

The default-to-Playwright instinct is wrong in 2026 the same way default-to-jQuery was wrong in 2018. Use the tool the job needs, not the most general tool that happens to work.

A 60-second rule of thumb

If your data is in the HTML, use Cheerio. If your data is rendered by JavaScript, use Playwright. Decide which one by viewing source on a target page.

That's the entire decision in one sentence. Everything else is implementation detail.


The Directory Datasets Actors that power Agency Vista and OnlineJobs.ph are both Cheerio-based. We started both with Ctrl+U, found the data in the HTML, and never wrote a Playwright handler. As a result, per-record costs sit in the cents-per-thousand range, which is what makes pay-per-result pricing economically viable on directory data.