How to build a news API with MarkUDown
Three endpoints, two Python dependencies. A pipeline that discovers articles via RSS and site mapping, extracts content from each one, and returns structured JSON with source, title, subtitle, date, author, and body — from any portal, without maintaining CSS selectors.
The problem every data analyst has faced
You need to know what the press is saying about your sector. It could be an investment fund monitoring portfolio company coverage. A startup tracking competitor mentions. A marketing team measuring the impact of a launch. A researcher building an economic journalism corpus.
It sounds like a simple problem. You have the URLs. You have Python. You open the terminal and start with requests.get(url).text.
And then reality hits:
- G1 returns HTML with an empty article — the content is loaded via JavaScript
- Folha returns 403 from datacenter IPs
- InfoMoney has a different structure after every redesign
- You extract the text but menus, ads, and footers come along
- Each portal uses a different format for date, author, and subtitle
What looked like an afternoon script turns into a constant maintenance project. You fix the G1 selector, Valor changes the layout, you fix Valor, Folha updates the anti-bot, and the cycle begins again.
Three endpoints. Problem solved.
MarkUDown has three endpoints that, together, cover the entire news pipeline:
/rss — Discovery from feeds
Pass the URL of any RSS feed. Get back an array of {url, title, summary} for each article. No library, no manual XML parsing.
/map — Discovery on portals without RSS
For sites that don't provide a feed, /map crawls the section page and returns article URLs. A URL pattern filter isolates only articles.
/extract — Structured extraction in one call
Pass the article URL and the field schema. The endpoint fetches the page, extracts clean content, maps the fields with AI, and returns ready-to-use JSON.
Anti-bot portals included
Tutorial
You'll need a MarkUDown API key. Create yours for free in the dashboard.
Create your account and get the API key
Go to the MarkUDown dashboard, create a free account, and copy your API key. Send it in the X-API-KEY header on every call.
Install dependencies
Just two Python libraries:
pip install requests python-dotenv
Create a .env file in the project root:
# .env
MARKUDOWN_API_KEY=your_key_here
Never commit your API key
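Because every call needs the key, it can help to fail early when the variable is missing instead of finding out on the first rejected request. A minimal sketch, assuming the key lives in MARKUDOWN_API_KEY as above; the helper names load_api_key and auth_headers are illustrative and not part of MarkUDown:

```python
import os
from collections.abc import Mapping


def load_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the API key from the environment, or raise if it is missing.

    Call dotenv.load_dotenv() beforehand so values from .env are visible here.
    """
    key = (env.get("MARKUDOWN_API_KEY") or "").strip()
    if not key:
        raise RuntimeError("MARKUDOWN_API_KEY is not set; add it to .env")
    return key


def auth_headers(api_key: str) -> dict[str, str]:
    """Build the header the tutorial sends with every request."""
    return {"X-API-KEY": api_key}
```

Passing the environment as an argument keeps the check easy to exercise with a plain dictionary, without touching the real environment.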
Define the news schema
The schema is a dictionary that describes the fields you want to extract. The AI uses the descriptions to understand what to look for on each page — regardless of the portal's layout.
NEWS_SCHEMA = {
    "fonte": "Nome do portal ou veículo de comunicação",
    "titulo": "Título principal da matéria",
    "subtitulo": "Subtítulo ou linha de apoio, se existir",
    "data_publicacao": "Data e hora de publicação no formato ISO 8601",
    "autor": "Nome do autor ou repórter",
    "texto": "Corpo completo da matéria, sem anúncios ou menus",
}
Phase 1: Discovery via /rss
For portals with RSS feeds, one call to /rss already returns all recent articles with URL, title, and summary. No parsing library, no dealing with XML.
import os

import requests
from dotenv import load_dotenv

load_dotenv()  # read MARKUDOWN_API_KEY from the .env file created earlier
MARKUDOWN_API_KEY = os.getenv("MARKUDOWN_API_KEY")
BASE_URL = "https://api.scrapetechnology.com"


def descobrir_via_rss(feed_url: str) -> list[dict]:
    """Return a list of {url, title, summary} dicts, one per feed item."""
    resp = requests.post(
        f"{BASE_URL}/rss",
        headers={"X-API-KEY": MARKUDOWN_API_KEY},
        json={"url": feed_url},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("items", [])


feeds = [
    "https://g1.globo.com/rss/g1/economia/",
    "https://feeds.folha.uol.com.br/mercado/rss091.xml",
    "https://www.infomoney.com.br/feed/",
    "https://www.valor.com.br/rss",
]

itens = []
for feed_url in feeds:
    novos = descobrir_via_rss(feed_url)
    itens.extend(novos)
    print(f" {feed_url}: {len(novos)} matérias")
RSS first, always
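When it is unclear whether a portal's feed is usable, one way to apply the RSS-first rule is to try the feed and fall back to /map only when it returns nothing. The sketch below is an assumption-laden illustration: the helper name descobrir is hypothetical, and the two discovery functions are passed in as callables (for example descobrir_via_rss above and the /map function from the next section).

```python
from typing import Callable, Optional


def descobrir(
    feed_url: Optional[str],
    section_url: str,
    via_rss: Callable[[str], list[dict]],
    via_map: Callable[[str], list[str]],
) -> list[dict]:
    """Try the RSS feed first; fall back to mapping the section page."""
    if feed_url:
        try:
            itens = via_rss(feed_url)
        except Exception as e:  # e.g. requests.RequestException from the HTTP call
            print(f" RSS failed for {feed_url}: {e}")
            itens = []
        if itens:
            return itens
    # No feed, or the feed came back empty: wrap the URLs found by /map
    # in dicts so both paths return the same shape.
    return [{"url": u} for u in via_map(section_url)]
```

Items that come from /map carry only a url key, so downstream code should not rely on title or summary being present.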
Phase 1b: Discovery via /map
For portals without an RSS feed, /map crawls the section page and returns the URLs it finds. The filter_pattern parameter keeps only URLs that match the article pattern, here a four-digit year in the path.
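Before sending a pattern to the API, it can be worth checking locally which URLs it would keep. The sketch below assumes filter_pattern behaves like a regular expression searched anywhere in the URL, which the text implies but does not state; the helper name parece_materia is hypothetical.

```python
import re

# Assumption: filter_pattern is applied as a regex search over the full URL.
ARTICLE_PATTERN = re.compile(r"/[0-9]{4}/")


def parece_materia(url: str) -> bool:
    """True when the URL contains a four-digit path segment such as /2026/."""
    return ARTICLE_PATTERN.search(url) is not None


exemplos = [
    "https://www1.folha.uol.com.br/mercado/2026/04/banco-central-mantem-selic.shtml",
    "https://www.cnnbrasil.com.br/economia/",
]
for url in exemplos:
    print(parece_materia(url), url)
```

The pattern matches any four-digit path segment, not only years, so a numeric section ID of that length would pass too. A tighter pattern such as /20[0-9]{2}/ is one option if that becomes a problem.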
def descobrir_via_map(site_url: str, max_urls: int = 50) -> list[str]:
    """Return article URLs found on a section page."""
    resp = requests.post(
        f"{BASE_URL}/map",
        headers={"X-API-KEY": MARKUDOWN_API_KEY},
        json={
            "url": site_url,
            "limit": max_urls,
            "filter_pattern": "/[0-9]{4}/",  # keep only URLs with a year in the path
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("urls", [])


# Portals without an RSS feed
portais_sem_rss = [
    "https://www.cnnbrasil.com.br/economia/",
]

for portal in portais_sem_rss:
    urls = descobrir_via_map(portal)
    itens.extend([{"url": u} for u in urls])
    print(f" {portal}: {len(urls)} URLs")
Phase 2: Extraction with /extract
For each discovered URL, a single call to /extract does everything: accesses the page, renders JavaScript, removes noise (ads, menus, footer), and maps content to your schema fields.
def extrair_materia(url: str) -> dict | None:
    try:
        resp = requests.post(
            f"{BASE_URL}/extract",
            headers={"X-API-KEY": MARKUDOWN_API_KEY},
            json={
                "url": url,
                "schema_fields": NEWS_SCHEMA,
                "extract_query": "Extraia os detalhes completos desta matéria jornalística",
            },
            timeout=60,
        )
        resp.raise_for_status()
        body = resp.json()
        if not body.get("success") or not body.get("data"):
            return None
        result = body["data"]
        result["url"] = url
        return result
    except requests.RequestException as e:
        print(f" ✗ {url}: {e}")
        return None
Complete pipeline
Combining both phases into a script that runs end to end:
"""
News collection and extraction pipeline with MarkUDown.

Usage:
    python news_pipeline.py
Output:
    noticias.json
"""
import json
import os
import time
from datetime import datetime

import requests
from dotenv import load_dotenv

load_dotenv()
MARKUDOWN_API_KEY = os.getenv("MARKUDOWN_API_KEY")
BASE_URL = "https://api.scrapetechnology.com"

NEWS_SCHEMA = {
    "fonte": "Nome do portal ou veículo de comunicação",
    "titulo": "Título principal da matéria",
    "subtitulo": "Subtítulo ou linha de apoio, se existir",
    "data_publicacao": "Data e hora de publicação no formato ISO 8601",
    "autor": "Nome do autor ou repórter",
    "texto": "Corpo completo da matéria, sem anúncios ou menus",
}

RSS_FEEDS = [
    "https://g1.globo.com/rss/g1/economia/",
    "https://feeds.folha.uol.com.br/mercado/rss091.xml",
    "https://www.infomoney.com.br/feed/",
    "https://www.valor.com.br/rss",
]

SITES_SEM_RSS = [
    "https://www.cnnbrasil.com.br/economia/",
]
# ── Phase 1: Discovery ────────────────────────────────────────────────────────
def descobrir_urls() -> list[str]:
    urls = []
    print("→ Fase 1: Descoberta de URLs")

    # Portals with a feed: one /rss call per feed
    for feed_url in RSS_FEEDS:
        try:
            resp = requests.post(
                f"{BASE_URL}/rss",
                headers={"X-API-KEY": MARKUDOWN_API_KEY},
                json={"url": feed_url},
                timeout=30,
            )
            resp.raise_for_status()
            itens = resp.json().get("items", [])
            novos = [item["url"] for item in itens if item.get("url")]
            urls.extend(novos)
            print(f" RSS {feed_url[:55]}: {len(novos)} matérias")
        except requests.RequestException as e:
            print(f" ✗ RSS {feed_url}: {e}")

    # Portals without a feed: map the section page
    for site in SITES_SEM_RSS:
        try:
            resp = requests.post(
                f"{BASE_URL}/map",
                headers={"X-API-KEY": MARKUDOWN_API_KEY},
                json={"url": site, "limit": 30, "filter_pattern": "/[0-9]{4}/"},
                timeout=30,
            )
            resp.raise_for_status()
            novos = resp.json().get("urls", [])
            urls.extend(novos)
            print(f" Map {site[:55]}: {len(novos)} URLs")
        except requests.RequestException as e:
            print(f" ✗ Map {site}: {e}")

    # Deduplicate while preserving discovery order
    seen, unique = set(), []
    for url in urls:
        if url not in seen:
            seen.add(url)
            unique.append(url)
    print(f"\n Total: {len(unique)} URLs únicas\n")
    return unique
# ── Phase 2: Extraction ───────────────────────────────────────────────────────
def extrair_materia(url: str) -> dict | None:
    try:
        resp = requests.post(
            f"{BASE_URL}/extract",
            headers={"X-API-KEY": MARKUDOWN_API_KEY},
            json={
                "url": url,
                "schema_fields": NEWS_SCHEMA,
                "extract_query": "Extraia os detalhes completos desta matéria jornalística",
            },
            timeout=60,
        )
        resp.raise_for_status()
        body = resp.json()
        if not body.get("success") or not body.get("data"):
            return None
        result = body["data"]
        result["url"] = url
        return result
    except requests.RequestException:
        return None
def extrair_todas(urls: list[str], delay: float = 1.0) -> list[dict]:
    print("→ Fase 2: Extração das matérias")
    resultados = []
    for i, url in enumerate(urls, 1):
        print(f" [{i:02d}/{len(urls)}] {url[:70]}...", end=" ", flush=True)
        materia = extrair_materia(url)
        if materia:
            resultados.append(materia)
            # "titulo" may come back as None, so fall back to an empty string
            print(f"✓ {(materia.get('titulo') or '')[:45]}...")
        else:
            print("✗ skipped")
        time.sleep(delay)  # pause between calls
    print(f"\n Extraídas com sucesso: {len(resultados)}/{len(urls)}\n")
    return resultados
# ── Main ──────────────────────────────────────────────────────────────────────
def main():
    start = datetime.now()
    urls = descobrir_urls()
    noticias = extrair_todas(urls[:20])  # remove [:20] to process everything
    output_path = "noticias.json"
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(noticias, f, ensure_ascii=False, indent=2)
    elapsed = (datetime.now() - start).total_seconds()
    print(f"✓ {len(noticias)} notícias salvas em '{output_path}' ({elapsed:.1f}s)")


if __name__ == "__main__":
    main()
Run it
python news_pipeline.py
The script prints progress in real time and saves noticias.json at the end.
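After a run, a quick sanity check on noticias.json shows how many articles each source produced and which records came back with empty fields. This is a sketch, not part of the pipeline: the helper names and the choice of which fields count as required are assumptions (subtitulo and autor are treated as optional here).

```python
import json
from collections import Counter

# Assumption: these are the fields worth flagging when empty.
CAMPOS_OBRIGATORIOS = ("fonte", "titulo", "data_publicacao", "texto")


def resumir(noticias: list[dict]) -> tuple[Counter, list[tuple[str, list[str]]]]:
    """Count articles per source and list records with empty required fields."""
    por_fonte = Counter(n.get("fonte") or "(sem fonte)" for n in noticias)
    incompletas = []
    for n in noticias:
        faltando = [c for c in CAMPOS_OBRIGATORIOS if not n.get(c)]
        if faltando:
            incompletas.append((n.get("url", ""), faltando))
    return por_fonte, incompletas


def resumir_arquivo(path: str = "noticias.json"):
    """Load the pipeline output from disk and summarize it."""
    with open(path, encoding="utf-8") as f:
        return resumir(json.load(f))
```

Calling resumir_arquivo() in the directory where the pipeline ran returns the per-source counter and the list of incomplete records.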
Result
Each article becomes a clean JSON object, ready to index, process with an LLM, feed a dashboard, or publish via your own API:
[
  {
    "fonte": "InfoMoney",
    "titulo": "Ibovespa sobe 1,2% e fecha a 128.450 pontos com alívio externo",
    "subtitulo": "Bolsa acompanhou recuperação dos mercados internacionais após dados de inflação nos EUA",
    "data_publicacao": "2026-04-14T18:32:00-03:00",
    "autor": "Redação InfoMoney",
    "texto": "O Ibovespa, principal índice da bolsa brasileira, encerrou a sessão desta segunda-feira em alta de 1,2%, aos 128.450 pontos. O movimento seguiu a recuperação dos mercados internacionais...",
    "url": "https://www.infomoney.com.br/mercados/ibovespa-sobe-128450-pontos/"
  },
  {
    "fonte": "Folha de S.Paulo",
    "titulo": "Banco Central mantém Selic em 10,5% ao ano pela segunda reunião seguida",
    "subtitulo": "Decisão unânime do Copom surpreendeu parte do mercado que esperava corte de 0,25 ponto",
    "data_publicacao": "2026-04-13T21:00:00-03:00",
    "autor": "Eduardo Cucolo",
    "texto": "O Comitê de Política Monetária (Copom) do Banco Central manteve a taxa Selic em 10,5% ao ano...",
    "url": "https://www1.folha.uol.com.br/mercado/2026/04/banco-central-mantem-selic.shtml"
  }
]
What you can build with this
Mention monitor
Detect when your company, product, or competitor is cited in the press. Run the pipeline every hour.
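A mention monitor can start as a plain keyword filter over the extracted records. The sketch below assumes the records use the titulo, texto, and url fields from the schema in this tutorial; the function name encontrar_mencoes is hypothetical.

```python
def encontrar_mencoes(noticias: list[dict], termos: list[str]) -> list[dict]:
    """Return articles whose title or body mentions any term (case-insensitive)."""
    alvo = [t.casefold() for t in termos if t]
    achadas = []
    for n in noticias:
        # Fields may be missing or None, so fall back to empty strings.
        texto = f"{n.get('titulo') or ''} {n.get('texto') or ''}".casefold()
        citados = [t for t in alvo if t in texto]
        if citados:
            achadas.append(
                {"url": n.get("url"), "titulo": n.get("titulo"), "termos": citados}
            )
    return achadas
```

This is substring matching, so a short term can also hit inside longer words. Word-boundary regular expressions are a natural next step when precision matters.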
Custom thematic feed
Aggregate news from multiple sector-filtered sources in a single endpoint — without manually opening each portal.
Sentiment analysis
Pass the texto field to an LLM to classify each article as positive, neutral, or negative. This measures brand perception at scale.
Editorial dashboard
Feed an internal panel with the day's already-normalized articles. No copy and paste.
Trend alerts
Compare coverage volume and sentiment over time. Identify when a topic starts to scale.
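For the volume side of trend detection, the ISO 8601 dates requested in the schema can be bucketed per day with the standard library. A minimal sketch, assuming data_publicacao really comes back as an ISO 8601 string (AI-extracted values may not always comply, so unparseable dates are skipped); the function name is hypothetical.

```python
from collections import Counter
from datetime import datetime


def volume_por_dia(noticias: list[dict]) -> dict[str, int]:
    """Count articles per publication day, skipping missing or invalid dates."""
    contagem: Counter = Counter()
    for n in noticias:
        bruto = n.get("data_publicacao")
        if not isinstance(bruto, str):
            continue
        try:
            # The day is taken in the offset written in the string (no conversion).
            dia = datetime.fromisoformat(bruto).date().isoformat()
        except ValueError:
            continue
        contagem[dia] += 1
    return dict(sorted(contagem.items()))
```

Running this over successive pipeline outputs gives a per-day series that can be compared against a baseline to flag unusual spikes.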
AI corpus
Build a real journalism database to train language models, embeddings, or RAG systems.
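For embeddings or RAG, article bodies are usually split into smaller overlapping passages before indexing. The sketch below shows one common approach, fixed-size word windows; the window sizes and the function name are arbitrary assumptions and not something MarkUDown prescribes.

```python
def dividir_em_trechos(texto: str, tamanho: int = 200, sobreposicao: int = 40) -> list[str]:
    """Split text into windows of `tamanho` words that overlap by `sobreposicao`."""
    if not 0 <= sobreposicao < tamanho:
        raise ValueError("sobreposicao must be >= 0 and smaller than tamanho")
    palavras = texto.split()
    passo = tamanho - sobreposicao
    trechos = []
    for inicio in range(0, len(palavras), passo):
        trechos.append(" ".join(palavras[inicio:inicio + tamanho]))
        # Stop once the current window already reaches the end of the text.
        if inicio + tamanho >= len(palavras):
            break
    return trechos
```

Each passage can then be stored alongside the article's url and data_publicacao so retrieved snippets stay traceable to their source.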
Get started with MarkUDown
Create your free account and run this tutorial's pipeline in less than 10 minutes.