
How to build a news API with MarkUDown

Three endpoints, two Python dependencies. A pipeline that discovers articles via RSS and site mapping, extracts content from each one, and returns structured JSON with source, title, subtitle, date, author, and body — from any portal, without maintaining CSS selectors.

April 14, 2026 · 14 min read · By Scrape Technology

The problem every data analyst has faced

You need to know what the press is saying about your sector. It could be an investment fund monitoring portfolio company coverage. A startup tracking competitor mentions. A marketing team measuring the impact of a launch. A researcher building an economic journalism corpus.

It sounds like a simple problem. You have the URLs. You have Python. You open the terminal and start with requests.get(url).text.

And then reality hits:

  • G1 returns HTML with an empty article — the content is loaded via JavaScript
  • Folha returns 403 to requests from datacenter IPs
  • InfoMoney has a different structure after every redesign
  • You extract the text but menus, ads, and footers come along
  • Each portal uses a different format for date, author, and subtitle

What looked like an afternoon script turns into a constant maintenance project. You fix the G1 selector, then Valor changes its layout; you fix Valor, then Folha updates its anti-bot protection, and the cycle begins again.

Three endpoints. Problem solved.

MarkUDown has three endpoints that, together, cover the entire news pipeline:

/rss — Discovery from feeds

Pass the URL of any RSS feed. Get back an array of {url, title, summary} for each article. No library, no manual XML parsing.

/map — Discovery on portals without RSS

For sites that don't provide a feed, /map crawls the section page and returns the URLs it finds. A URL pattern filter keeps only the article links.

/extract — Structured extraction in one call

Pass the article URL and the field schema. The endpoint accesses the page, extracts clean content, maps the fields with AI, and returns ready-to-use JSON.

Anti-bot portals included

MarkUDown has three internal extraction layers. If the first layers are blocked, it automatically escalates to Abrasio, a browser service with patched Chromium and Brazilian residential IPs. You configure nothing.
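To make the idea of layered escalation concrete, here is a minimal sketch of the general pattern. It is not MarkUDown's implementation: the layer names and the three stand-in fetchers are hypothetical and exist only so the example runs without network access.

```python
from typing import Callable, Optional

# Hypothetical fetcher type: returns HTML, or None when the layer is blocked.
Fetcher = Callable[[str], Optional[str]]

def fetch_with_escalation(url: str, layers: list[tuple[str, Fetcher]]) -> tuple[str, str]:
    """Try each layer in order and return (layer_name, html) from the first that succeeds."""
    for name, fetch in layers:
        html = fetch(url)
        if html:  # None or an empty string counts as a blocked layer
            return name, html
    raise RuntimeError(f"all layers failed for {url}")

# Stand-in layers: the first two simulate a block, the third succeeds.
def plain_http(url: str) -> Optional[str]:
    return None

def headless_browser(url: str) -> Optional[str]:
    return None

def residential_browser(url: str) -> Optional[str]:
    return "<html>article</html>"

layers = [
    ("plain_http", plain_http),
    ("headless_browser", headless_browser),
    ("residential_browser", residential_browser),
]
used_layer, html = fetch_with_escalation("https://example.com/news", layers)
print(used_layer)
```

The caller sees a single result regardless of which layer produced it, which is why the API consumer has nothing to configure.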

Tutorial

You'll need a MarkUDown API key. Create yours for free in the dashboard.

1

Create your account and get the API key

Go to the MarkUDown dashboard, create a free account, and copy your API key. Send it in the X-API-KEY header on every call.

2

Install dependencies

Just two Python libraries:

terminal
pip install requests python-dotenv

Create a .env in the root:

.env
# .env
MARKUDOWN_API_KEY=your_key_here

Never commit your API key

Add .env to .gitignore. The key grants access to your account and your request balance.

3

Define the news schema

The schema is a dictionary that describes the fields you want to extract. The AI uses the descriptions to understand what to look for on each page — regardless of the portal's layout.

schema.py
NEWS_SCHEMA = {
    "fonte":           "Nome do portal ou veículo de comunicação",
    "titulo":          "Título principal da matéria",
    "subtitulo":       "Subtítulo ou linha de apoio, se existir",
    "data_publicacao": "Data e hora de publicação no formato ISO 8601",
    "autor":           "Nome do autor ou repórter",
    "texto":           "Corpo completo da matéria, sem anúncios ou menus",
}
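Because the fields are filled in by a model, some may come back empty on pages that lack them (a subtitle, for instance). A small check like the one below can flag incomplete records; missing_fields is a hypothetical helper, and the short schema and record are made-up examples.

```python
def missing_fields(record: dict, schema: dict) -> list[str]:
    """List schema keys that are absent, None, or blank in an extracted record."""
    missing = []
    for field in schema:
        value = record.get(field)
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(field)
    return missing

# Shortened schema and a sample record, for illustration only.
schema = {"titulo": "headline", "subtitulo": "subheading", "autor": "byline"}
record = {"titulo": "Ibovespa sobe 1,2%", "subtitulo": "", "url": "https://example.com/a"}
print(missing_fields(record, schema))  # ['subtitulo', 'autor']
```

Whether a missing field is an error depends on the field: a blank subtitulo is often legitimate, while a blank texto usually means the extraction failed.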

4

Phase 1 — Discovery via /rss

For portals with RSS feeds, one call to /rss already returns all recent articles with URL, title, and summary. No parsing library, no dealing with XML.

rss_discovery.py
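import os

import requests
from dotenv import load_dotenv

load_dotenv()  # read MARKUDOWN_API_KEY from the .env file created in step 2

MARKUDOWN_API_KEY = os.getenv("MARKUDOWN_API_KEY")
BASE_URL = "https://api.scrapetechnology.com"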

def descobrir_via_rss(feed_url: str) -> list[dict]:
    """Return a list of {url, title, summary} dicts, one per feed item."""
    resp = requests.post(
        f"{BASE_URL}/rss",
        headers={"X-API-KEY": MARKUDOWN_API_KEY},
        json={"url": feed_url},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("items", [])

feeds = [
    "https://g1.globo.com/rss/g1/economia/",
    "https://feeds.folha.uol.com.br/mercado/rss091.xml",
    "https://www.infomoney.com.br/feed/",
    "https://www.valor.com.br/rss",
]

itens = []
for feed_url in feeds:
    novos = descobrir_via_rss(feed_url)
    itens.extend(novos)
    print(f"  {feed_url}: {len(novos)} articles")

RSS first, always

If the portal has RSS, use /rss. It's the fastest and most stable way to discover articles. Reserve /map for portals that don't provide a feed.
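The RSS-first rule can be encoded as a small routing function. This is a sketch under assumptions: the portal entries use hypothetical "feed" and "section" keys, and the /map payload mirrors the one used in the /map step of this tutorial.

```python
def plan_discovery(portal: dict, limit: int = 50) -> tuple[str, dict]:
    """Pick the discovery endpoint and JSON payload for one portal entry."""
    feed = portal.get("feed")
    if feed:
        # A feed is available: prefer /rss.
        return "/rss", {"url": feed}
    # No feed: fall back to /map on the section page.
    return "/map", {
        "url": portal["section"],
        "limit": limit,
        "filter_pattern": "/[0-9]{4}/",
    }

portals = [
    {"feed": "https://www.infomoney.com.br/feed/"},
    {"section": "https://www.cnnbrasil.com.br/economia/"},
]
for portal in portals:
    endpoint, payload = plan_discovery(portal)
    print(endpoint, payload["url"])
```

Keeping the choice in one function means adding a portal is a one-line change to the list rather than a new code path.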

5

Phase 1b — Discovery via /map

For portals without RSS feeds, /map crawls the section page and returns the URLs it finds. The filter_pattern parameter keeps only URLs that match the article pattern, here a four-digit year in the path.

map_discovery.py
# Continues rss_discovery.py: reuses requests, BASE_URL, MARKUDOWN_API_KEY, and itens.
def descobrir_via_map(site_url: str, max_urls: int = 50) -> list[str]:
    resp = requests.post(
        f"{BASE_URL}/map",
        headers={"X-API-KEY": MARKUDOWN_API_KEY},
        json={
            "url": site_url,
            "limit": max_urls,
            "filter_pattern": "/[0-9]{4}/",  # keep only URLs with a year in the path
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("urls", [])

# Portals without an RSS feed
portais_sem_rss = [
    "https://www.cnnbrasil.com.br/economia/",
]

for portal in portais_sem_rss:
    urls = descobrir_via_map(portal)
    itens.extend([{"url": u} for u in urls])
    print(f"  {portal}: {len(urls)} URLs")
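It is worth checking a pattern against a few real URLs before sending it to /map. The sketch below tests the same expression locally with Python's re module. It assumes the endpoint applies the pattern as a regular-expression search over the URL, which this article does not state explicitly.

```python
import re

# Same expression as the filter_pattern above.
ARTICLE_PATTERN = re.compile(r"/[0-9]{4}/")

def looks_like_article(url: str) -> bool:
    """True when the URL contains a path segment made of exactly four digits."""
    return ARTICLE_PATTERN.search(url) is not None

samples = [
    "https://www1.folha.uol.com.br/mercado/2026/04/banco-central-mantem-selic.shtml",
    "https://www.cnnbrasil.com.br/economia/",
    "https://www.infomoney.com.br/mercados/ibovespa-sobe-128450-pontos/",
]
for url in samples:
    print(looks_like_article(url), url)
```

The third URL is an article but does not match, because its path carries no year segment. A year-based pattern only suits portals that put the date in the URL, so adjust it per site.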

6

Phase 2 — Extraction with /extract

For each discovered URL, a single call to /extract does everything: accesses the page, renders JavaScript, removes noise (ads, menus, footer), and maps content to your schema fields.

extractor.py
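# Assumes requests, BASE_URL, and MARKUDOWN_API_KEY from rss_discovery.py and NEWS_SCHEMA from schema.py.
def extrair_materia(url: str) -> dict | None: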
    try:
        resp = requests.post(
            f"{BASE_URL}/extract",
            headers={"X-API-KEY": MARKUDOWN_API_KEY},
            json={
                "url": url,
                "schema_fields": NEWS_SCHEMA,
                "extract_query": "Extraia os detalhes completos desta matéria jornalística",
            },
            timeout=60,
        )
        resp.raise_for_status()
        body = resp.json()

        if not body.get("success") or not body.get("data"):
            return None

        result = body["data"]
        result["url"] = url
        return result

    except requests.RequestException as e:
        print(f"  ✗ {url}: {e}")
        return None
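Since extrair_materia returns None on any failure, transient errors such as timeouts can be retried from the outside. The helper below is a generic sketch with exponential backoff; with_retries and flaky_extract are hypothetical names, and flaky_extract only simulates two failures followed by a success.

```python
import time
from typing import Callable, Optional

def with_retries(
    fn: Callable[[], Optional[dict]],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Call fn until it returns a non-None result, doubling the wait after each failure."""
    for attempt in range(attempts):
        result = fn()
        if result is not None:
            return result
        if attempt < attempts - 1:  # no wait after the final attempt
            sleep(base_delay * 2 ** attempt)
    return None

# Stand-in for extrair_materia(url): fails twice, then succeeds.
calls = {"n": 0}

def flaky_extract() -> Optional[dict]:
    calls["n"] += 1
    return {"titulo": "ok"} if calls["n"] >= 3 else None

waits: list[float] = []
result = with_retries(flaky_extract, sleep=waits.append)  # record waits instead of sleeping
print(result, waits)
```

In the pipeline this would be called as with_retries(lambda: extrair_materia(url)). Each retry is a new request against your balance, so keep attempts low.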

7

Complete pipeline

Combining both phases into a script that runs end to end:

news_pipeline.py
"""
News collection and extraction pipeline with MarkUDown.

Usage:
    python news_pipeline.py

Output:
    noticias.json
"""

import json
import os
import time
from datetime import datetime

import requests
from dotenv import load_dotenv

load_dotenv()

MARKUDOWN_API_KEY = os.getenv("MARKUDOWN_API_KEY")
BASE_URL = "https://api.scrapetechnology.com"

NEWS_SCHEMA = {
    "fonte":           "Nome do portal ou veículo de comunicação",
    "titulo":          "Título principal da matéria",
    "subtitulo":       "Subtítulo ou linha de apoio, se existir",
    "data_publicacao": "Data e hora de publicação no formato ISO 8601",
    "autor":           "Nome do autor ou repórter",
    "texto":           "Corpo completo da matéria, sem anúncios ou menus",
}

RSS_FEEDS = [
    "https://g1.globo.com/rss/g1/economia/",
    "https://feeds.folha.uol.com.br/mercado/rss091.xml",
    "https://www.infomoney.com.br/feed/",
    "https://www.valor.com.br/rss",
]

SITES_SEM_RSS = [
    "https://www.cnnbrasil.com.br/economia/",
]

# ── Phase 1: Discovery ────────────────────────────────────────────────────────

def descobrir_urls() -> list[str]:
    urls = []
    print("→ Phase 1: URL discovery")

    for feed_url in RSS_FEEDS:
        try:
            resp = requests.post(
                f"{BASE_URL}/rss",
                headers={"X-API-KEY": MARKUDOWN_API_KEY},
                json={"url": feed_url},
                timeout=30,
            )
            resp.raise_for_status()
            itens = resp.json().get("items", [])
            novos = [item["url"] for item in itens if item.get("url")]
            urls.extend(novos)
            print(f"  RSS  {feed_url[:55]}: {len(novos)} articles")
        except requests.RequestException as e:
            print(f"  ✗ RSS {feed_url}: {e}")

    for site in SITES_SEM_RSS:
        try:
            resp = requests.post(
                f"{BASE_URL}/map",
                headers={"X-API-KEY": MARKUDOWN_API_KEY},
                json={"url": site, "limit": 30, "filter_pattern": "/[0-9]{4}/"},
                timeout=30,
            )
            resp.raise_for_status()
            novos = resp.json().get("urls", [])
            urls.extend(novos)
            print(f"  Map  {site[:55]}: {len(novos)} URLs")
        except requests.RequestException as e:
            print(f"  ✗ Map {site}: {e}")

    seen, unique = set(), []
    for url in urls:
        if url not in seen:
            seen.add(url)
            unique.append(url)

    print(f"\n  Total: {len(unique)} unique URLs\n")
    return unique

# ── Phase 2: Extraction ───────────────────────────────────────────────────────

def extrair_materia(url: str) -> dict | None:
    try:
        resp = requests.post(
            f"{BASE_URL}/extract",
            headers={"X-API-KEY": MARKUDOWN_API_KEY},
            json={
                "url": url,
                "schema_fields": NEWS_SCHEMA,
                "extract_query": "Extraia os detalhes completos desta matéria jornalística",
            },
            timeout=60,
        )
        resp.raise_for_status()
        body = resp.json()

        if not body.get("success") or not body.get("data"):
            return None

        result = body["data"]
        result["url"] = url
        return result

    except requests.RequestException:
        return None


def extrair_todas(urls: list[str], delay: float = 1.0) -> list[dict]:
    print("→ Phase 2: Article extraction")
    resultados = []

    for i, url in enumerate(urls, 1):
        print(f"  [{i:02d}/{len(urls)}] {url[:70]}...", end=" ", flush=True)
        materia = extrair_materia(url)
        if materia:
            resultados.append(materia)
            # titulo may come back as None, so fall back to an empty string before slicing
            print(f"✓  {(materia.get('titulo') or '')[:45]}...")
        else:
            print("✗  skipped")
        time.sleep(delay)

    print(f"\n  Successfully extracted: {len(resultados)}/{len(urls)}\n")
    return resultados

# ── Main ──────────────────────────────────────────────────────────────────────

def main():
    start = datetime.now()
    urls = descobrir_urls()
    noticias = extrair_todas(urls[:20])  # remove [:20] to process everything

    output_path = "noticias.json"
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(noticias, f, ensure_ascii=False, indent=2)

    elapsed = (datetime.now() - start).total_seconds()
    print(f"✓ {len(noticias)} articles saved to '{output_path}' ({elapsed:.1f}s)")


if __name__ == "__main__":
    main()

Run it

python news_pipeline.py
The script prints progress in real time and saves noticias.json at the end.
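As written, each run overwrites noticias.json. If you schedule the script, you may prefer to merge new articles into the existing file instead. The sketch below shows one way, keyed on the url field the pipeline adds to every record; merge_articles is a hypothetical helper, and the demo writes to a temporary directory.

```python
import json
import tempfile
from pathlib import Path

def merge_articles(path: Path, new_articles: list[dict]) -> list[dict]:
    """Append articles whose url is not already stored in the JSON file, then rewrite it."""
    existing = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []
    seen = {a.get("url") for a in existing}
    for article in new_articles:
        url = article.get("url")
        if url and url not in seen:
            existing.append(article)
            seen.add(url)
    path.write_text(json.dumps(existing, ensure_ascii=False, indent=2), encoding="utf-8")
    return existing

# Demo: two runs, the second one repeats an article from the first.
with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "noticias.json"
    merge_articles(store, [{"url": "https://example.com/a", "titulo": "A"}])
    merged = merge_articles(store, [
        {"url": "https://example.com/a", "titulo": "A"},
        {"url": "https://example.com/b", "titulo": "B"},
    ])
    print(len(merged))  # 2
```

The same set of stored URLs can also be used before Phase 2 to skip articles that were already extracted, which saves requests.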

Result

Each article becomes a clean JSON object, ready to index, process with an LLM, feed a dashboard, or publish through your own API:

noticias.json
[
  {
    "fonte": "InfoMoney",
    "titulo": "Ibovespa sobe 1,2% e fecha a 128.450 pontos com alívio externo",
    "subtitulo": "Bolsa acompanhou recuperação dos mercados internacionais após dados de inflação nos EUA",
    "data_publicacao": "2026-04-14T18:32:00-03:00",
    "autor": "Redação InfoMoney",
    "texto": "O Ibovespa, principal índice da bolsa brasileira, encerrou a sessão desta segunda-feira em alta de 1,2%, aos 128.450 pontos. O movimento seguiu a recuperação dos mercados internacionais...",
    "url": "https://www.infomoney.com.br/mercados/ibovespa-sobe-128450-pontos/"
  },
  {
    "fonte": "Folha de S.Paulo",
    "titulo": "Banco Central mantém Selic em 10,5% ao ano pela segunda reunião seguida",
    "subtitulo": "Decisão unânime do Copom surpreendeu parte do mercado que esperava corte de 0,25 ponto",
    "data_publicacao": "2026-04-13T21:00:00-03:00",
    "autor": "Eduardo Cucolo",
    "texto": "O Comitê de Política Monetária (Copom) do Banco Central manteve a taxa Selic em 10,5% ao ano...",
    "url": "https://www1.folha.uol.com.br/mercado/2026/04/banco-central-mantem-selic.shtml"
  }
]

What you can build with this

Mention monitor

Detect when your company, product, or competitor is cited in the press. Run the pipeline every hour.

Custom thematic feed

Aggregate news from multiple sector-filtered sources in a single endpoint — without manually opening each portal.

Sentiment analysis

Pass the texto field to an LLM to classify each article as positive, neutral, or negative, and measure brand perception at scale.

Editorial dashboard

Feed an internal panel with the day's already-normalized articles. No copy and paste.

Trend alerts

Compare coverage volume and sentiment over time. Identify when a topic starts to gain traction.

AI corpus

Build a real journalism database to train language models, embeddings, or RAG systems.
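As a starting point for the mention monitor and trend alert ideas above, here is a small sketch that works directly on the records in noticias.json. The helper names and the sample records are illustrative assumptions; matching is a plain case-insensitive substring search, so short terms can also match inside longer words.

```python
from collections import Counter

def mentions(article: dict, terms: list[str]) -> list[str]:
    """Return the terms that appear in the title or body, ignoring case."""
    haystack = f"{article.get('titulo') or ''} {article.get('texto') or ''}".casefold()
    return [t for t in terms if t.casefold() in haystack]

def daily_volume(articles: list[dict]) -> Counter:
    """Count articles per calendar day using the YYYY-MM-DD prefix of data_publicacao."""
    return Counter(
        a["data_publicacao"][:10]
        for a in articles
        if isinstance(a.get("data_publicacao"), str)
    )

articles = [
    {"titulo": "Ibovespa sobe 1,2%", "texto": "O Ibovespa encerrou em alta.",
     "data_publicacao": "2026-04-14T18:32:00-03:00"},
    {"titulo": "Banco Central mantém Selic", "texto": "O Copom manteve a taxa.",
     "data_publicacao": "2026-04-13T21:00:00-03:00"},
]
hits = [a["titulo"] for a in articles if mentions(a, ["selic", "copom"])]
print(hits)
print(daily_volume(articles))
```

Running the count over several days of output gives the baseline you need to notice when coverage of a topic jumps.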

Get started with MarkUDown

Create your free account and run this tutorial's pipeline in less than 10 minutes.