Perplexity Search API: an intro for developers

Search APIs have historically been built for human eyeballs rather than AI systems. They returned pages, not the specific snippets your model needs; they were slow when scraped from third-party engines; and they went stale quickly. Perplexity claims to have rebuilt the stack for AI: a large, frequently refreshed index, hybrid lexical-semantic retrieval, and sub-document results designed to drop straight into model context. The public launch on Sep 25, 2025 exposes the same infrastructure that powers Perplexity’s answer engine, along with an SDK and an open evaluation framework, search_evals.

Perplexity positions the API as fast and fresh at internet scale: hundreds of billions of pages tracked, tens of thousands of CPUs in play, hot storage in the hundreds of petabytes, and continuous index updates. They report a median (p50) latency near 358 ms from AWS us-east-1, with quality wins on the SimpleQA, FRAMES, BrowseComp, and HLE benchmarks in their research write-up. Treat these as vendor claims, but they are specific and reproducible via their OSS harness.

What you get, concretely

At its core there is a single endpoint:

POST https://api.perplexity.ai/search
Authorization: Bearer <token>
Content-Type: application/json

{
  "query": "latest AI developments 2025",
  "max_results": 10,
  "max_tokens_per_page": 1024,
  "country": "US"
}

The response returns ranked results with title, url, snippet, and date fields. The SDK mirrors the same contract. Multi-query is supported by sending query as an array, which returns per-query result sets.
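
For orientation, a response to the request above has roughly the shape below. The values are invented and only the four documented fields are shown per result, so treat it as an illustration rather than the full schema; multi-query requests return one such result set per query.

{
  "results": [
    {
      "title": "Major AI developments in 2025",
      "url": "https://example.com/ai-2025-roundup",
      "snippet": "A sub-document extract, sized by max_tokens_per_page ...",
      "date": "2025-09-20"
    }
  ]
}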

The mental model

Think of the Search API as a retrieval primitive to feed your downstream model. You control:

  • Breadth with max_results per query.
  • Depth of extraction per result with max_tokens_per_page.
  • Scope with the country geographic filter, domain filters, and the “web” vs “academic” search modes described in the guides.

Below is a simple diagram of how a single query flows through Perplexity’s described pipeline and into your app.

sequenceDiagram
    actor Dev as Your App
    participant API as Perplexity Search API
    participant IR as Hybrid Retrieval
    participant Rank as Multi-stage Rerankers
    participant Index as Fresh Index (sub-doc units)

    Dev->>API: POST /search {query, filters}
    API->>IR: Lexical + Semantic retrieval
    IR->>Index: Fetch doc + sub-document candidates
    Index-->>IR: Candidates with snippets
    IR->>Rank: Fast scorers → cross-encoders
    Rank-->>API: Top-k results with snippets, dates
    API-->>Dev: JSON results (ranked)

Perplexity’s research article details the hybrid retrieval and multi-stage ranking used to produce those results.

Quick start in practice

Python

from perplexity import Perplexity

client = Perplexity()  # uses PERPLEXITY_API_KEY

search = client.search.create(
    query="how to deploy Flarum site:dropletdrift.com",
    max_results=8,
    max_tokens_per_page=1024
)

for r in search.results:
    print(r.date, r.title, r.url)

TypeScript (Node)

import { Perplexity } from "perplexityai";

const client = new Perplexity({ apiKey: process.env.PERPLEXITY_API_KEY! });

const res = await client.search.create({
  query: "why DNS propogation takes so long site:dropletdrift.com",
  max_results: 6,
  max_tokens_per_page: 1024,
});
res.results.forEach(r => console.log(r.date, r.title, r.url));

Both mirror the REST example and parameters in the docs.

Patterns that actually work

1) Answer grounding for chat

Use search to fetch 3–6 high-quality snippets and frame them into a concise, source-attributed context for your model. Keep the context compact; models degrade with bloated input. Perplexity surfaces sub-document snippets precisely to reduce context bloat.

from perplexity import Perplexity
from textwrap import shorten

client = Perplexity()
q = "EU Chat Control, implications 2025"
search = client.search.create(
    query=q,
    max_results=6,
    max_tokens_per_page=800
)

context = "\n".join(
    f"- [{r.title}]({r.url}) — {shorten(r.snippet, width=260)}"
    for r in search.results
)

prompt = f"""Question: {q}
Use only the sources below. Cite inline as [n].

Sources:
{context}
"""
print(prompt)

The goal is minimal, current, and attributable grounding.

2) Multi-query “deepening” for agents

Break a broad task into parallel, specific queries. The guides note support for up to five queries per request and show async patterns and rate-limit handling.

import asyncio
from perplexity import AsyncPerplexity

queries = [
  "developer salaries site:news.ycombinator.com",
  "css tutorials site:dropletdrift.com",
  "privacy-friendly analytics site:github.com"
]

async def main():
    async with AsyncPerplexity() as client:
        tasks = [client.search.create(query=q, max_results=5) for q in queries]
        results = await asyncio.gather(*tasks)
        for i, res in enumerate(results):
            print(f"Q{i+1}:")
            for r in res.results:
                print(" ", r.title, r.url)

asyncio.run(main())

3) Academic mode for literature scans

Use academic mode and domain filters to reduce noise when you need papers and official guidance.

from perplexity import Perplexity
client = Perplexity()

s = client.search.create(
    query="SGLT2 inhibitors heart failure 2024 randomized trial",
    # shown in guides; see docs for exact names/values
    search_mode="academic",
    search_domain_filter=["nejm.org","thelancet.com","ahajournals.org"],
    max_results=10
)

The best-practices guide shows mode and filtering patterns.

Choosing parameters with intent

  • max_results. Request only what you will use. More results increase latency and cost. Perplexity defaults to 10 and caps at 20.
  • max_tokens_per_page. This controls how much text gets extracted per URL. Use 512–1024 for general Q&A and 1500–2048 when you plan to chunk downstream. The guide shows trade-offs explicitly.
  • Filters. Prefer allow-lists of authoritative domains plus a recency filter when freshness matters. The guide shows domain filtering and recency patterns.
  • Regionalization. Use country (an ISO 3166-1 code) to bias results toward a region when laws, prices, or news vary by region. A combined sketch follows this list.
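
A combined call under those guidelines might look like the sketch below. The country, max_results, max_tokens_per_page, and domain-filter parameters appear elsewhere in this article; the recency filter is left as a comment because its exact parameter name and accepted values are in the filter guide.

from perplexity import Perplexity

client = Perplexity()  # reads PERPLEXITY_API_KEY

search = client.search.create(
    query="central bank rate decision coverage this week",  # illustrative query
    max_results=8,                 # request only what the prompt will use
    max_tokens_per_page=1024,      # enough text for direct Q&A without chunking
    country="GB",                  # ISO 3166-1 code to bias regional results
    search_domain_filter=["bankofengland.co.uk", "reuters.com"],  # allow-list of vetted domains
    # recency: see the filter guide for the exact parameter name and values
)
for r in search.results:
    print(r.date, r.title, r.url)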

Operational concerns that will save you later

  • Rate limits and retries. Implement exponential backoff with jitter and batch concurrency; a minimal backoff sketch follows this list. The docs provide Python and TS examples with AsyncPerplexity.
  • Caching. Cache successful searches keyed by normalized query plus filters. Respect freshness; when you need today’s data, prefer short TTLs and set a recency filter at query time.
  • Deduplication. Many news topics syndicate across domains. Deduplicate by canonical URL and by title similarity before constructing model context.
  • Attribution. Keep 3–6 sources. Shorten snippets and force the model to cite as [1], [2], etc. This improves user trust and guards against hallucinations.
  • Crawler stance for your own properties. If you operate sites and want Perplexity to surface them, allow PerplexityBot in robots.txt and, if needed, whitelist IPs in your WAF. The crawler page lists user-agents and published IP ranges.
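
Backoff is easy to get subtly wrong, so here is a minimal sketch. The exception types worth retrying depend on the SDK, so they are passed in explicitly rather than guessed; substitute the SDK's rate-limit error where indicated.

import asyncio
import random

async def with_backoff(call, retry_on, max_attempts=5, base_delay=0.5):
    # Retry an async callable with exponential backoff plus full jitter.
    # `retry_on` is a tuple of exception types worth retrying; anything else
    # propagates immediately.
    for attempt in range(max_attempts):
        try:
            return await call()
        except retry_on:
            if attempt == max_attempts - 1:
                raise
            await asyncio.sleep(random.uniform(0, base_delay * 2 ** attempt))

# Usage with the async client (replace RateLimitError with the SDK's actual class):
# result = await with_backoff(
#     lambda: client.search.create(query="example query", max_results=5),
#     retry_on=(RateLimitError,),
# )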

Quality, latency, and cost: how to think about it

Perplexity’s research article reports p50 latency near 358 ms and leads on quality across SimpleQA, FRAMES, BrowseComp, and HLE, evaluated via their open search_evals framework. Treat this as a starting point, and run your own evals on your domain.

If you are comparing alternatives, look at index freshness, sub-document support, latency under load, and data rights. Brave Search API publishes transparent pricing tiers with high request ceilings; SERP-based approaches vary and often trade latency for coverage. Benchmarks and pricing posts from Brave’s own docs provide useful context while you run your bake-off.

Running your own evaluation

Use search_evals

search_evals provides agent harnesses with pluggable support for multiple search APIs. Clone it, add your tasks, and run head-to-head tests that reflect your product, not just public benchmarks.

git clone https://github.com/perplexityai/search_evals.git
cd search_evals
# set your API keys as env vars per README
# run built-in suites or point to your JSONL tasks
python -m search_evals.run --suite simpleqa --provider perplexity
python -m search_evals.run --suite simpleqa --provider brave
python -m search_evals.compare --suite simpleqa --providers perplexity brave

Define a few dozen representative questions, including cases where timeliness matters, where you need sub-page citations, and where domain filters are essential. Measure task success, time to first useful snippet, and end-to-end latency for your full pipeline, not just the search call.
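
It also helps to time your own pipeline outside the harness. A rough sketch, where the inline context assembly stands in for whatever you do between search and generation:

import time
from perplexity import Perplexity

client = Perplexity()

def timed_run(question: str):
    t0 = time.perf_counter()
    search = client.search.create(query=question, max_results=6)
    t_search = time.perf_counter() - t0

    # Stand-in for your own prompt assembly and model call.
    context = "\n".join(f"{r.title} {r.url}" for r in search.results)
    t_total = time.perf_counter() - t0
    return {
        "search_s": round(t_search, 3),
        "pipeline_s": round(t_total, 3),
        "n_results": len(search.results),
        "context_chars": len(context),
    }

print(timed_run("When did the EU AI Act enter into force?"))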

Security, privacy, and compliance notes

  • Keys and groups. Manage API keys and billing groups in the portal; rotate keys and scope usage by environment.
  • Data handling. Use allow-lists for domains when your policy requires vetted sources. Keep logs of URLs used to answer regulated queries.
  • Robots and IPs. If you host content, consult Perplexity Crawlers for user-agents and IP JSON endpoints to control access.

Troubleshooting playbook

  1. Results feel stale. Add a recency constraint, lower cache TTL, and raise max_tokens_per_page slightly to capture updated snippets.
  2. Too much noise. Switch to academic mode or apply domain filters. Tune max_results down to 5–8.
  3. Rate limits. Use the async client with batching and backoff. Log retry counts and successes.
  4. Latency spikes. Reduce max_results, lower max_tokens_per_page, and collapse near-duplicate queries into a single multi-query call (a sketch follows this list).
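
For the last point, here is a small sketch that collapses near-duplicate queries before issuing one multi-query request. It assumes the SDK accepts the same array form for query as the REST endpoint, and the similarity threshold is arbitrary.

from difflib import SequenceMatcher
from perplexity import Perplexity

def collapse_queries(queries, threshold=0.85):
    # Keep a query only if it is not a near-duplicate of one already kept.
    kept = []
    for q in queries:
        if all(SequenceMatcher(None, q.lower(), k.lower()).ratio() < threshold for k in kept):
            kept.append(q)
    return kept

client = Perplexity()
queries = [
    "how long does DNS propagation take",
    "how long DNS propagation takes",    # near-duplicate, dropped
    "how to speed up DNS propagation",
]
search = client.search.create(query=collapse_queries(queries), max_results=5)
# The response contains per-query result sets; see the docs for the exact shape.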

End-to-end example: building a grounded answerer

import asyncio
import re
from textwrap import shorten
from urllib.parse import urlparse
from perplexity import AsyncPerplexity

CITATION_RE = re.compile(r"\[(\d+)\]")  # for checking [n] citations in the model's answer, if you post-process it

def build_sources(results):
    # keep top 5 authoritative sources, dedupe by domain
    seen = set()
    out = []
    for r in results:
        dom = urlparse(r.url).netloc  # dedupe by host
        if dom in seen:
            continue
        seen.add(dom)
        out.append(r)
        if len(out) == 5: break
    return out

async def grounded_answer(question: str):
    async with AsyncPerplexity() as client:
        search = await client.search.create(
            query=question,
            max_results=12,
            max_tokens_per_page=900
        )
        sources = build_sources(search.results)
        context = "\n".join(
            f"[{i+1}] {s.title} {s.url}\n{shorten(s.snippet, 260)}"
            for i, s in enumerate(sources)
        )
        # hand off to your model here; we just return a template
        return f"""Q: {question}

Use only the sources below. Cite as [n].

Sources:
{context}

Answer:"""

print(asyncio.run(grounded_answer(
    "EU Chat Control implications for global privacy 2025"))))

This shows search assembly, simple dedupe, and a clean prompt for your completion model.

When to reach for this API

Use it when your agent needs current, attributable facts, when it benefits from granular snippets over whole pages, and when latency matters in user flows. Pair it with your existing LLM stack for grounded generation, or as the retrieval layer for internal tools. If you need a comparative baseline, run your search_evals suite across Perplexity and an alternative like Brave, then choose based on your task-level success, latency, and total cost in your environment.
