WhyML provides powerful web scraping capabilities that can intelligently convert existing websites into clean, maintainable YAML manifests. The scraping system includes structure simplification, selective content extraction, and comprehensive page analysis.
```bash
# Basic scraping
whyml scrape https://example.com --output manifest.yaml

# With structure simplification
whyml scrape https://example.com --output manifest.yaml --simplify-structure

# Limit nesting depth
whyml scrape https://example.com --output manifest.yaml --max-depth 5

# Specify output file
whyml scrape https://example.com --output my-manifest.yaml

# Save to directory
whyml scrape https://example.com --output-dir ./scraped-sites/

# Generate multiple formats
whyml scrape https://example.com --output manifest.yaml --also-html --also-react
```
Reduce HTML nesting complexity by limiting depth:
```bash
# Limit to 3 levels of nesting
whyml scrape https://example.com --max-depth 3 --output simplified.yaml
```
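Conceptually, depth limiting walks the element tree and discards everything nested deeper than the cutoff. A minimal sketch of the idea (this is not WhyML's internal code; elements are modeled as plain dicts):

```python
def prune_depth(node: dict, max_depth: int, depth: int = 1) -> dict:
    """Return a copy of the tree with children below max_depth removed."""
    pruned = {"tag": node["tag"], "children": []}
    if depth < max_depth:
        pruned["children"] = [
            prune_depth(child, max_depth, depth + 1)
            for child in node.get("children", [])
        ]
    return pruned

# A div > div > p tree pruned to 2 levels keeps only the two divs
tree = {"tag": "div", "children": [
    {"tag": "div", "children": [{"tag": "p", "children": []}]}
]}
```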
Before (complex nesting):
```html
<div class="wrapper">
  <div class="container">
    <div class="inner">
      <div class="content">
        <div class="text-wrapper">
          <p>Content</p>
        </div>
      </div>
    </div>
  </div>
</div>
```
After (simplified):
```html
<div class="container">
  <div class="content">
    <p>Content</p>
  </div>
</div>
```
Remove unnecessary wrapper divs:
```bash
whyml scrape https://example.com --flatten-containers --output flattened.yaml
```
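The flattening pass can be pictured as replacing each wrapper-class div with its own children. A rough sketch (not WhyML's actual implementation; elements are modeled as dicts):

```python
WRAPPER_CLASSES = {"wrapper", "container", "inner", "outer", "row", "col", "grid-item"}

def flatten(node: dict) -> list:
    """Lift children of wrapper divs up one level, recursively."""
    children = [kept for child in node.get("children", []) for kept in flatten(child)]
    if node.get("tag") == "div" and node.get("class") in WRAPPER_CLASSES:
        return children  # drop the wrapper itself, keep its contents
    return [{**node, "children": children}]
```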
Automatically removes divs with classes like `wrapper`, `container`, `inner`, `outer`, `row`, `col`, and `grid-item`.

Preserve important HTML5 semantic elements:
```bash
# Preserve semantic elements (default)
whyml scrape https://example.com --preserve-semantic

# Disable semantic preservation
whyml scrape https://example.com --no-preserve-semantic
```
Preserved elements include `header`, `main`, `article`, `section`, `footer`, `nav`, `aside`, `figure`, `figcaption`, and headings (`h1`-`h6`).

Extract only specific parts of the manifest:
```bash
# Extract only metadata and structure
whyml scrape https://example.com --section metadata --section structure

# Extract analysis data only
whyml scrape https://example.com --section analysis --output analysis.yaml

# Multiple sections
whyml scrape https://example.com \
  --section metadata \
  --section styles \
  --section imports \
  --output partial.yaml
```
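A partial manifest extracted with only the `metadata` section might look like the following (the field names here are illustrative, not a guaranteed schema):

```yaml
# partial.yaml (hypothetical output)
metadata:
  title: "Example Domain"
  description: "Example page for documentation"
  author: "Unknown"
```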
Available sections:

- `metadata` - Page title, description, author info
- `structure` - HTML structure tree
- `styles` - CSS styling information
- `imports` - External CSS/JS dependencies
- `analysis` - Page analysis and metrics
- `variables` - Extracted template variables

WhyML automatically detects page types:
```bash
whyml scrape https://blog.example.com --analyze --output blog.yaml
```
The detected page type is recorded in the manifest's analysis section.
Comprehensive SEO analysis of scraped pages:
```yaml
analysis:
  seo:
    title_length: 65              # Optimal: 50-60 characters
    meta_description: true        # Has meta description
    meta_description_length: 155  # Optimal: 150-160 characters
    heading_structure: "good"     # H1 -> H2 -> H3 hierarchy
    social_meta: true             # OpenGraph/Twitter cards
    keywords_density: 2.5         # Keyword density percentage
```
Accessibility compliance checking:
```yaml
analysis:
  accessibility:
    alt_text_coverage: 85      # Percentage of images with alt text
    language_attribute: true   # Has lang attribute
    heading_hierarchy: "good"  # Proper heading structure
    aria_labels: 12            # Number of ARIA labels found
    wcag_compliance: "AA"      # WCAG compliance level
```
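The `alt_text_coverage` figure is simply the share of `<img>` elements that carry a non-empty `alt` attribute. One way to compute it with Python's standard library (an illustration, not WhyML's actual checker):

```python
from html.parser import HTMLParser

class AltTextAuditor(HTMLParser):
    """Counts <img> tags and how many carry a non-empty alt attribute."""
    def __init__(self):
        super().__init__()
        self.total = 0
        self.with_alt = 0

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.total += 1
            if dict(attrs).get("alt"):
                self.with_alt += 1

def alt_text_coverage(html: str) -> float:
    auditor = AltTextAuditor()
    auditor.feed(html)
    if auditor.total == 0:
        return 100.0  # no images means nothing is missing alt text
    return round(100 * auditor.with_alt / auditor.total, 1)
```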
Detailed content analysis:
```yaml
analysis:
  content:
    word_count: 1250     # Total word count
    paragraph_count: 15  # Number of paragraphs
    heading_count: 8     # Number of headings
    link_count: 23       # Number of links
    image_count: 5       # Number of images
    reading_time: "5 min"  # Estimated reading time
```
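A `reading_time` estimate is typically derived from the word count at an assumed reading speed. A sketch of that arithmetic (the 250 words-per-minute figure is an assumption for illustration, not a documented WhyML constant):

```python
def estimated_reading_time(word_count: int, words_per_minute: int = 250) -> str:
    """Round up to whole minutes at an assumed reading speed."""
    minutes = max(1, -(-word_count // words_per_minute))  # ceiling division
    return f"{minutes} min"

# 1250 words at 250 wpm gives the "5 min" shown above
```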
Test the accuracy of scraping and regeneration:
```bash
# Complete testing workflow
whyml scrape https://example.com \
  --test-conversion \
  --output original.yaml \
  --output-html regenerated.html
```
This workflow scrapes the page into a manifest, regenerates HTML from it, and compares the result against the original, reporting similarity metrics:
```yaml
testing:
  similarity:
    content_similarity: 95.2    # Text content match percentage
    structure_similarity: 88.7  # HTML structure match
    visual_similarity: 92.1     # Layout preservation
    semantic_similarity: 96.8   # Semantic meaning preservation
```
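WhyML's exact similarity algorithm isn't shown here, but a content-similarity percentage of this kind can be approximated with the standard library's `difflib`, comparing the two pages' word sequences:

```python
import difflib

def content_similarity(original: str, regenerated: str) -> float:
    """Percentage match between two word sequences (illustrative metric only)."""
    matcher = difflib.SequenceMatcher(None, original.split(), regenerated.split())
    return round(100 * matcher.ratio(), 1)
```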
```bash
# Validate manifest quality
whyml validate original.yaml --strict

# Test conversion accuracy
whyml test-conversion original.yaml --target-url https://example.com
```
Target specific elements for scraping:
```bash
# Scrape only main content
whyml scrape https://example.com \
  --selector "main, .content, #main-content" \
  --output content-only.yaml

# Exclude elements
whyml scrape https://example.com \
  --exclude ".ads, .sidebar, .comments" \
  --output clean.yaml
```
Filter content during scraping:
```bash
# Skip images and media
whyml scrape https://example.com --no-images --no-media

# Skip external scripts
whyml scrape https://example.com --no-scripts

# Skip inline styles
whyml scrape https://example.com --no-inline-styles
```
Scrape multiple URLs:
```bash
# From file list
whyml scrape-batch urls.txt --output-dir ./scraped/

# With pattern
whyml scrape https://example.com/page-{1..10} --output-dir ./pages/
```
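Note that the `{1..10}` range is expanded by the shell, not by WhyML. If your shell lacks brace expansion, the same URL list can be written to a file for `scrape-batch`:

```shell
# Build urls.txt for `whyml scrape-batch urls.txt`
for i in $(seq 1 10); do
  echo "https://example.com/page-$i"
done > urls.txt
```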
Use configuration file for complex scraping:
```yaml
# scrape-config.yaml
scraping:
  max_depth: 4
  flatten_containers: true
  preserve_semantic: true
  sections: ["metadata", "structure", "styles"]
  filters:
    exclude_selectors: [".ads", ".popup", ".cookie-notice"]
    include_selectors: ["main", ".content"]

analysis:
  enable_seo: true
  enable_accessibility: true
  enable_performance: true

output:
  format: "yaml"
  optimize: true
  minify: false
```
```bash
whyml scrape https://example.com --config scrape-config.yaml
```
```bash
# Scrape multiple pages concurrently
whyml scrape-batch urls.txt --concurrent 5 --output-dir ./results/
```
```bash
# Enable caching for repeated scraping
whyml scrape https://example.com --cache --cache-dir ./cache/

# Use cached version if available
whyml scrape https://example.com --use-cache --max-age 3600
```
```bash
# Add delays between requests
whyml scrape-batch urls.txt --delay 2 --output-dir ./results/

# Respect robots.txt
whyml scrape https://example.com --respect-robots
```
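The `--respect-robots` behavior corresponds to a standard robots.txt check, which Python's `urllib.robotparser` can demonstrate (the agent string here is illustrative):

```python
from urllib.robotparser import RobotFileParser

def allowed_by_robots(robots_txt: str, url: str, agent: str = "WhyML") -> bool:
    """Check a URL against already-fetched robots.txt rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

rules = "User-agent: *\nDisallow: /private/"
```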
```bash
# Set timeout and retries
whyml scrape https://example.com \
  --timeout 30 \
  --retries 3 \
  --output manifest.yaml
```
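The `--retries`/`--timeout` pair maps onto a simple retry loop. A generic sketch (`fetch_with_retries` is a hypothetical helper, not a WhyML API):

```python
import time

def fetch_with_retries(fetch, retries=3, timeout=30, delay=0.0):
    """Call fetch(timeout=...), retrying on failure up to `retries` times."""
    last_error = None
    for _ in range(retries):
        try:
            return fetch(timeout=timeout)
        except Exception as error:
            last_error = error
            time.sleep(delay)  # back off before the next attempt
    raise last_error
```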
WhyML handles malformed HTML gracefully. When results look wrong, enable verbose or debug output to inspect what was parsed:
```bash
# Enable verbose output
whyml scrape https://example.com --verbose --output debug.yaml

# Save debug information
whyml scrape https://example.com --debug --debug-output debug.json
```
```yaml
# .github/workflows/scrape-monitor.yml
name: Website Monitoring

on:
  schedule:
    - cron: '0 0 * * *'  # Daily

jobs:
  scrape-and-compare:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Install WhyML
        run: pip install whyml
      - name: Scrape website
        run: |
          whyml scrape https://example.com \
            --output current.yaml \
            --test-conversion \
            --compare-with baseline.yaml
```
```python
import asyncio

from whyml import URLScraper


async def monitor_website():
    scraper = URLScraper(
        max_depth=5,
        simplify_structure=True,
        enable_analysis=True
    )
    result = await scraper.scrape_url("https://example.com")

    # Check for changes
    if result.analysis.structure_changes > 10:
        print("Significant structure changes detected!")

    # Save manifest
    with open("current.yaml", "w") as f:
        f.write(result.to_yaml())

asyncio.run(monitor_website())
```
```bash
# Check what WhyML sees
whyml scrape https://example.com --dry-run --verbose

# Validate scraped manifest
whyml validate scraped.yaml --detailed

# Test conversion accuracy
whyml test-conversion scraped.yaml --metrics
```