Wikipedia

Website: https://wikipedia.org
CLI Tool: curl
Authentication: None required for reading (anonymous access allowed)

Description

Wikipedia is the world's largest free online encyclopedia, containing over 60 million articles in 300+ languages. AI agents can access Wikipedia content through the MediaWiki API, which provides structured access to articles, search functionality, and metadata. The API is designed for programmatic access and returns JSON or XML responses.

Commands

Search Articles

# Search Wikipedia articles
curl "https://en.wikipedia.org/w/api.php?action=opensearch&search=Artificial%20Intelligence&limit=10&format=json"

Search for Wikipedia articles by keyword. Returns article titles, descriptions, and URLs. Use URL encoding for spaces and special characters.

Get Article Content (Plain Text)

# Get article extract in plain text
curl "https://en.wikipedia.org/w/api.php?action=query&titles=Artificial%20Intelligence&prop=extracts&explaintext=true&format=json"

Retrieve article content as plain text. Use explaintext=true to remove HTML formatting. Returns full article text or summary.

Get Article Content (HTML)

# Get article content with HTML formatting
curl "https://en.wikipedia.org/w/api.php?action=query&titles=Machine%20Learning&prop=extracts&format=json"

Retrieve article content with HTML formatting preserved. Useful for preserving structure and links.

Get Article Summary

# Get article summary (first section only)
curl "https://en.wikipedia.org/w/api.php?action=query&titles=Python&prop=extracts&exintro=true&explaintext=true&format=json"

Get just the introduction/summary of an article. Use exintro=true to limit to the first section. Ideal for quick lookups.

Get Article by Page ID

# Get article by numeric page ID
curl "https://en.wikipedia.org/w/api.php?action=query&pageids=1234567&prop=extracts&explaintext=true&format=json"

Retrieve article using its numeric page ID instead of title. Useful when you have stored page IDs.

Get Multiple Articles

# Get multiple articles in one request
curl "https://en.wikipedia.org/w/api.php?action=query&titles=Python|JavaScript|Ruby&prop=extracts&exintro=true&explaintext=true&format=json"

Fetch multiple articles in a single API call. Separate titles with pipe character (|). Maximum 50 titles per request.
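The batched request can also be built programmatically. A minimal sketch using only the standard library: `urlencode` escapes the pipe separator to `%7C`, which the API accepts.

```python
from urllib.parse import urlencode

# Pipe-joined titles; urlencode escapes the pipe to %7C, which the API accepts.
titles = ["Python (programming language)", "JavaScript", "Ruby (programming language)"]
params = {
    "action": "query",
    "titles": "|".join(titles),
    "prop": "extracts",
    "exintro": "true",
    "explaintext": "true",
    "format": "json",
}
url = "https://en.wikipedia.org/w/api.php?" + urlencode(params)
print(url)
```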

Get Article Metadata

# Get article info (page ID, last edit, length)
curl "https://en.wikipedia.org/w/api.php?action=query&titles=Wikipedia&prop=info&format=json"

Retrieve metadata about an article including page ID, last revision timestamp, page length, and protection status.

Get Article Categories

# Get categories for an article
curl "https://en.wikipedia.org/w/api.php?action=query&titles=Albert%20Einstein&prop=categories&format=json"

List all categories assigned to an article. Useful for understanding article classification.

Get Article Links

# Get all links from an article
curl "https://en.wikipedia.org/w/api.php?action=query&titles=Quantum%20Computing&prop=links&pllimit=50&format=json"

Get all internal Wikipedia links from an article. Use pllimit to control number of results (max 500).

Get Article Images

# Get images used in an article
curl "https://en.wikipedia.org/w/api.php?action=query&titles=Solar%20System&prop=images&format=json"

List all images included in an article. Returns image filenames.

Get Image URL

# Get actual URL of an image file
curl "https://en.wikipedia.org/w/api.php?action=query&titles=File:Example.jpg&prop=imageinfo&iiprop=url&format=json"

Get the full URL to download an image file. Prefix the title with File: for image lookups.

Search with Suggestions

# Get search suggestions (autocomplete)
curl "https://en.wikipedia.org/w/api.php?action=opensearch&search=Quantum&limit=10&format=json"

Get search suggestions for partial queries. Useful for autocomplete functionality. Includes typo correction.
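Note that opensearch returns a positional JSON array rather than an object, which is easy to mis-parse. A sketch of unpacking it, using an abridged, illustrative sample response (in current API versions the descriptions array is typically empty):

```python
# opensearch returns a positional array: [query, [titles], [descriptions], [urls]]
# Sample response (abridged, for illustration only):
sample = [
    "Quantum",
    ["Quantum", "Quantum mechanics", "Quantum computing"],
    ["", "", ""],
    ["https://en.wikipedia.org/wiki/Quantum",
     "https://en.wikipedia.org/wiki/Quantum_mechanics",
     "https://en.wikipedia.org/wiki/Quantum_computing"],
]
query, titles, _descriptions, urls = sample
suggestions = list(zip(titles, urls))  # pair each title with its URL
print(suggestions[0])
```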

Advanced Search (Full Text)

# Full-text search with snippets
curl "https://en.wikipedia.org/w/api.php?action=query&list=search&srsearch=artificial%20intelligence&srlimit=10&format=json"

Perform full-text search across Wikipedia. Returns snippets showing search term context. More detailed than opensearch.

Get Random Article

# Get random article
curl "https://en.wikipedia.org/w/api.php?action=query&list=random&rnnamespace=0&rnlimit=1&format=json"

Get a random Wikipedia article. Use rnnamespace=0 for main articles only (excludes talk pages, etc.).

Get Article Revisions

# Get revision history for an article
curl "https://en.wikipedia.org/w/api.php?action=query&titles=Blockchain&prop=revisions&rvlimit=10&rvprop=timestamp|user|comment&format=json"

Get revision history showing who edited an article and when. Use rvlimit to control number of revisions returned.

Get Article in Different Language

# Get article title in other languages
curl "https://en.wikipedia.org/w/api.php?action=query&titles=Computer&prop=langlinks&lllimit=50&format=json"

Get links to the same article in other language editions of Wikipedia. Useful for multilingual content.

Check if Article Exists

# Check if page exists
curl "https://en.wikipedia.org/w/api.php?action=query&titles=Example%20Article&format=json"

Check if an article exists. Response includes "missing" key if page doesn't exist.
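The "missing" check can be wrapped in a small helper. A sketch against an illustrative sample response (missing pages come back with a negative page ID key):

```python
# Sample "query" response for a nonexistent title (illustrative):
data = {
    "query": {
        "pages": {
            "-1": {"ns": 0, "title": "Example Article", "missing": ""}
        }
    }
}

def page_exists(data):
    """True only if every returned page exists (no 'missing' key)."""
    pages = data["query"]["pages"]
    return all("missing" not in page for page in pages.values())

print(page_exists(data))
```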

Get Article Coordinates

# Get geographic coordinates for an article
curl "https://en.wikipedia.org/w/api.php?action=query&titles=Eiffel%20Tower&prop=coordinates&format=json"

Get GPS coordinates for articles about places. Returns latitude and longitude.

Get Page View Statistics

# Get page view count (requires different API)
curl "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Python/daily/20240101/20240131"

Get page view statistics for an article over a date range. Uses Wikimedia REST API (separate from MediaWiki API).
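Because the pageviews endpoint encodes every parameter in the URL path, it helps to build the URL from parts. A sketch (the helper name is ours, not part of any library); titles go in the path, so they must be fully percent-encoded:

```python
from urllib.parse import quote

def pageviews_url(project, article, start, end,
                  access="all-access", agents="all-agents", granularity="daily"):
    """Build a Wikimedia REST pageviews URL. start/end are YYYYMMDD strings."""
    # The title sits in the URL path, so escape everything, including slashes.
    title = quote(article.replace(" ", "_"), safe="")
    return ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
            f"{project}/{access}/{agents}/{title}/{granularity}/{start}/{end}")

url = pageviews_url("en.wikipedia", "Python (programming language)",
                    "20240101", "20240131")
print(url)
```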

Examples

Simple Article Lookup Workflow

# Search for article
SEARCH=$(curl -s "https://en.wikipedia.org/w/api.php?action=opensearch&search=Python%20programming&limit=5&format=json")
echo "$SEARCH" | jq -r '.[1][0]'  # First result title

# Get article summary
curl -s "https://en.wikipedia.org/w/api.php?action=query&titles=Python%20(programming%20language)&prop=extracts&exintro=true&explaintext=true&format=json" | jq '.query.pages[].extract'

Research Topic Workflow

# Get main article
TOPIC="Artificial%20Intelligence"   # URL-encode spaces in the title
curl -s "https://en.wikipedia.org/w/api.php?action=query&titles=$TOPIC&prop=extracts|categories|links&explaintext=true&format=json" > article.json

# Extract text
jq '.query.pages[].extract' article.json

# Get related topics via links
jq '.query.pages[].links[].title' article.json | head -20

# Get categories
jq '.query.pages[].categories[].title' article.json

Multi-Language Content Access

# Get article in English
curl -s "https://en.wikipedia.org/w/api.php?action=query&titles=Berlin&prop=extracts&exintro=true&explaintext=true&format=json" | jq '.query.pages[].extract'

# Get same article in German
curl -s "https://de.wikipedia.org/w/api.php?action=query&titles=Berlin&prop=extracts&exintro=true&explaintext=true&format=json" | jq '.query.pages[].extract'

# Get article in French
curl -s "https://fr.wikipedia.org/w/api.php?action=query&titles=Berlin&prop=extracts&exintro=true&explaintext=true&format=json" | jq '.query.pages[].extract'

Fact-Checking Workflow

# Search for topic
curl -s "https://en.wikipedia.org/w/api.php?action=query&list=search&srsearch=climate%20change&format=json" | jq '.query.search[] | {title: .title, snippet: .snippet}'

# Get full article with metadata
curl -s "https://en.wikipedia.org/w/api.php?action=query&titles=Climate%20change&prop=extracts|info|revisions&explaintext=true&rvlimit=1&format=json" > climate.json

# Check when last updated
jq '.query.pages[].revisions[0].timestamp' climate.json

# Get article text
jq '.query.pages[].extract' climate.json

Image Extraction Workflow

# Get images from article
IMAGES=$(curl -s "https://en.wikipedia.org/w/api.php?action=query&titles=Mars&prop=images&format=json")
echo "$IMAGES" | jq '.query.pages[].images[].title'

# Get URL for first image (file titles often contain spaces, so let curl encode them)
IMAGE_NAME=$(echo "$IMAGES" | jq -r '.query.pages[].images[0].title')
curl -s -G "https://en.wikipedia.org/w/api.php" \
    --data-urlencode "action=query" \
    --data-urlencode "titles=$IMAGE_NAME" \
    --data-urlencode "prop=imageinfo" \
    --data-urlencode "iiprop=url" \
    --data-urlencode "format=json" | jq '.query.pages[].imageinfo[0].url'

Python Script Example

import requests

def get_wikipedia_summary(title):
    """Get Wikipedia article summary."""
    url = "https://en.wikipedia.org/w/api.php"
    params = {
        "action": "query",
        "titles": title,
        "prop": "extracts",
        "exintro": True,
        "explaintext": True,
        "format": "json"
    }

    response = requests.get(url, params=params)
    data = response.json()

    # Extract the page content
    pages = data["query"]["pages"]
    page_id = list(pages.keys())[0]

    if "missing" in pages[page_id]:
        return None

    return pages[page_id]["extract"]

# Usage
summary = get_wikipedia_summary("Machine Learning")
print(summary)

Batch Article Processing

# Create list of topics
TOPICS=("Python" "JavaScript" "Ruby" "Go" "Rust")

# Fetch all articles
for topic in "${TOPICS[@]}"; do
    echo "=== $topic ==="
    curl -s "https://en.wikipedia.org/w/api.php?action=query&titles=$topic&prop=extracts&exintro=true&explaintext=true&format=json" | jq -r '.query.pages[].extract' | head -5
    echo ""
done

Monitor Article Changes

# Get current revision info
curl -s "https://en.wikipedia.org/w/api.php?action=query&titles=Bitcoin&prop=revisions&rvlimit=1&rvprop=timestamp|user|comment&format=json" > bitcoin_latest.json

# Check latest edit
jq '.query.pages[] | {
    timestamp: .revisions[0].timestamp,
    user: .revisions[0].user,
    comment: .revisions[0].comment
}' bitcoin_latest.json

Geographic Data Extraction

# Get articles with coordinates near a location
curl -s "https://en.wikipedia.org/w/api.php?action=query&list=geosearch&gscoord=37.7749|-122.4194&gsradius=10000&gslimit=10&format=json" | jq '.query.geosearch[] | {title: .title, dist: .dist}'

# Get coordinates for specific place
curl -s "https://en.wikipedia.org/w/api.php?action=query&titles=Golden%20Gate%20Bridge&prop=coordinates&format=json" | jq '.query.pages[].coordinates[0] | {lat: .lat, lon: .lon}'

Notes

  • No Authentication Required: Wikipedia's API is fully open for reading. No API keys or registration needed for anonymous access.

  • Rate Limits:

  • No hard rate limit for anonymous reads, but excessive usage may be throttled or blocked
  • Recommended: make requests serially (wait for each response before sending the next); the Wikimedia REST API documents a 200 requests/second ceiling, but sustained traffic should stay far below that
  • Use a User-Agent header to identify your bot: curl -A "MyBot/1.0 (contact@example.com)"
  • Respectful usage is enforced by the User-Agent and API etiquette policies, not technical limits
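Both recommendations can be baked into a tiny client helper. A sketch using only the standard library; the User-Agent string is the placeholder from above, to be replaced with your own contact details:

```python
import time
import urllib.request

# Placeholder identity; replace with your own bot name and contact address.
USER_AGENT = "MyBot/1.0 (contact@example.com)"

def make_opener():
    """Opener that sends a descriptive User-Agent with every request."""
    opener = urllib.request.build_opener()
    opener.addheaders = [("User-Agent", USER_AGENT)]
    return opener

def throttle(min_interval=1.0, _last=[0.0]):
    """Sleep just enough to keep calls at least min_interval seconds apart."""
    wait = _last[0] + min_interval - time.monotonic()
    if wait > 0:
        time.sleep(wait)
    _last[0] = time.monotonic()
```

Call `throttle()` before each `opener.open(url)` to keep request rates polite.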

  • API Endpoints:

  • Action API: https://en.wikipedia.org/w/api.php (main API, used in all examples)
  • REST API: https://en.wikipedia.org/api/rest_v1/ (newer, mobile-focused)
  • Wikimedia API: https://wikimedia.org/api/rest_v1/ (cross-wiki statistics)

  • Language Support: Change domain for different languages:

  • English: en.wikipedia.org
  • Spanish: es.wikipedia.org
  • French: fr.wikipedia.org
  • German: de.wikipedia.org
  • Full list: https://meta.wikimedia.org/wiki/List_of_Wikipedias

  • Output Formats:

  • JSON (recommended): format=json
  • XML: format=xml
  • Legacy formats (php, yaml, and others) have been removed from modern MediaWiki
  • Always use format=json for AI agent consumption

  • Text Extraction Options:

  • explaintext=true: Returns plain text without HTML
  • exintro=true: Returns only the introduction section
  • exsentences=N: Returns first N sentences
  • exchars=N: Returns first N characters (approximate)

  • API Limits Per Request:

  • Multiple titles: Max 50 per request (pipe-separated: Python|Java|C%2B%2B — URL-encode special characters such as +)
  • Links: Max 500 per request (use pllimit=500)
  • Categories: Max 500 per request
  • Images: Max 500 per request
  • Revisions: Max 500 per request (use rvlimit=500)

  • Best Practices for AI Agents:

  • Always include a descriptive User-Agent header
  • Cache responses to avoid repeated requests for same content
  • Use exintro=true for summaries instead of full articles when possible
  • Batch requests using pipe-separated titles when fetching multiple articles
  • Use continue parameter for paginated results
  • Handle "missing" pages gracefully in your code
  • Respect Wikipedia's content licenses (CC BY-SA 4.0)
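The continue-parameter pattern from the list above can be sketched as a small generator. The `fetch` callable stands in for an HTTP call (e.g. `requests.get(API, params=params).json()`); the fake fetcher and its responses are illustrative only:

```python
def paginate(fetch, params):
    """Yield each API response, following MediaWiki 'continue' tokens."""
    params = dict(params)
    while True:
        data = fetch(params)
        yield data
        if "continue" not in data:
            break
        # Merge continuation tokens into the next request's parameters.
        params.update(data["continue"])

# Offline demonstration with a fake two-page fetcher:
responses = [
    {"query": {"pages": {"1": {"title": "A"}}}, "continue": {"plcontinue": "1|0|B"}},
    {"query": {"pages": {"1": {"title": "A"}}}},
]
def fake_fetch(params):
    return responses.pop(0)

pages = list(paginate(fake_fetch, {"action": "query"}))
print(len(pages))  # 2
```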

  • Error Handling:

  • Missing page: Response includes "missing": "" key
  • Invalid title: Response includes "invalid": "" key
  • API errors: Check "error" key in response
  • Network timeouts: Implement retry logic with exponential backoff
  • Always check response structure before accessing nested fields
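The retry-with-exponential-backoff advice can be sketched as a generic wrapper (the helper name is ours); the flaky function below simulates two transient network failures:

```python
import time

def with_retries(call, attempts=4, base_delay=0.5):
    """Retry call() with exponential backoff; re-raise after the last attempt."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...

# Offline demonstration: fails twice, then succeeds.
state = {"calls": 0}
def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

result = with_retries(flaky, base_delay=0.01)
print(result)  # ok
```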

  • URL Encoding:

  • Spaces: Use %20 or + in URLs
  • Special characters: URL encode using standard encoding
  • Bash: Use curl with quotes to handle spaces
  • Python: Use requests library which handles encoding automatically
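When building URLs by hand in Python, `urllib.parse.quote` does the encoding; note the `safe` parameter, which defaults to leaving `/` unescaped:

```python
from urllib.parse import quote

# quote() percent-encodes everything outside `safe` (default safe="/").
print(quote("Artificial Intelligence"))  # Artificial%20Intelligence
print(quote("C++", safe=""))             # C%2B%2B
print(quote("AT&T", safe=""))            # AT%26T
```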

  • Content Parsing Tips:

  • Use jq for JSON parsing in bash scripts
  • Page ID is the key in .query.pages object (not always sequential)
  • Extract page content: .query.pages[].extract
  • Handle multiple pages: Iterate over .query.pages | to_entries
  • Remove HTML: Use explaintext=true or parse HTML with library
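The iterate-over-pages tip matters because page IDs, not titles, are the keys of .query.pages. A sketch against an abridged, illustrative two-page response:

```python
# Sample two-page extracts response (abridged); page IDs are the dict keys.
data = {
    "query": {
        "pages": {
            "23862": {"pageid": 23862, "title": "Python (programming language)",
                      "extract": "Python is a programming language..."},
            "9845": {"pageid": 9845, "title": "JavaScript",
                     "extract": "JavaScript is a programming language..."},
        }
    }
}

# Iterate over values() rather than assuming which page IDs came back.
extracts = {page["title"]: page.get("extract", "")
            for page in data["query"]["pages"].values()}
print(sorted(extracts))
```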

  • Page Namespaces:

  • 0: Main articles (default)
  • 1: Talk pages
  • 2: User pages
  • 6: Files/Images
  • 14: Categories
  • Use rnnamespace=0 to limit results to main articles

  • MediaWiki API Documentation:

  • Full API docs: https://www.mediawiki.org/wiki/API:Main_page
  • API sandbox (interactive): https://en.wikipedia.org/wiki/Special:ApiSandbox
  • Query examples: https://www.mediawiki.org/wiki/API:Query
  • Help with specific action: Add action=help&modules=query to any request

  • Content Licensing:

  • Text: Creative Commons BY-SA 4.0 and GFDL
  • Images: Various licenses (check individual file pages)
  • You must attribute Wikipedia and preserve license
  • Commercial use is allowed with proper attribution
  • See: https://en.wikipedia.org/wiki/Wikipedia:Copyrights

  • Alternative Tools:

  • wikipedia Python library: Simplified Wikipedia API wrapper
  • wptools Python library: Advanced Wikipedia/Wikidata tool
  • wtf_wikipedia JavaScript library: Wikipedia text parser
  • mwclient Python library: MediaWiki API client
  • pywikibot Python framework: Bot framework for Wikipedia

  • Advanced Features:

  • Wikidata integration: Get structured data via Wikidata API
  • Page previews: Use REST API for mobile-optimized previews
  • Nearby pages: Use geosearch for location-based queries
  • Citation extraction: Parse references from article HTML
  • Infobox data: Parse from HTML or use Wikidata API

  • Performance Optimization:

  • Use compression: Add Accept-Encoding: gzip header
  • Request only needed properties: Limit prop= parameter
  • Use page IDs when possible: Faster than title lookups
  • Enable HTTP/2: Supported on all Wikipedia domains
  • Keep-alive connections: Reuse TCP connections for multiple requests

  • Common Gotchas:

  • Page titles are case-sensitive (except first character)
  • Disambiguation pages have "(disambiguation)" suffix
  • Redirects: redirect pages carry a "redirect" key in prop=info responses; add &redirects to the query to resolve them automatically
  • Some articles have protection (can't be edited)
  • Mobile vs desktop: Different content sometimes
  • Infoboxes and tables: Difficult to parse from plain text, use HTML

  • Mobile/Summary API (Alternative):

# Simpler summary endpoint
curl "https://en.wikipedia.org/api/rest_v1/page/summary/Python_(programming_language)"

  • Returns structured summary, image, and coordinates
  • Easier to parse than main API
  • Mobile-optimized content
  • Includes thumbnail image URL
