{"id":38401,"date":"2026-05-21T10:58:33","date_gmt":"2026-05-21T10:58:33","guid":{"rendered":"https:\/\/nodemaven.com\/?p=38401"},"modified":"2026-05-21T12:24:26","modified_gmt":"2026-05-21T12:24:26","slug":"python-web-scraping","status":"publish","type":"post","link":"https:\/\/nodemaven.com\/ru\/blog\/python-web-scraping\/","title":{"rendered":"Web Scraping with Python: The Complete Guide [2026]"},"content":{"rendered":"<p>Python web scraping has evolved far beyond simple scripts that extract HTML from static pages. Modern websites rely heavily on JavaScript rendering, aggressive anti-bot systems, fingerprinting, and rate limits, which means successful web scraping with Python now requires more than just requests and BeautifulSoup.<\/p>\n\n\n\n<p>In this guide, you\u2019ll learn how web scraping in Python actually works in 2026, how to scrape both static and dynamic websites, and how to choose the right tools for different targets.<\/p>\n\n\n\n<p>We\u2019ll cover everything from requests, BeautifulSoup, and lxml to Playwright, Scrapy, and curl_cffi, along with practical techniques for handling pagination, rotating proxies, browser fingerprinting, Cloudflare protection, and large-scale scraping workflows.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What is Web Scraping?<\/h2>\n\n\n\n<p><a href=\"https:\/\/nodemaven.com\/ru\/use-cases\/web-scraping-proxies\/\">Web scraping<\/a> is the automated extraction of data from websites. 
You write a program that visits a URL, downloads the page\u2019s HTML, locates the elements containing the data you need \u2014 prices, product names, news articles, contact details \u2014 and saves that data in a structured format like CSV, JSON, or a database.<\/p>\n\n\n\n<p>Python is the language of choice for web scraping in 2026 for three reasons: its libraries cover every step of the pipeline out of the box, the code is readable enough for non-engineers to maintain, and it has the largest community producing scraping-specific tooling. According to most developer surveys, more than 70% of web scrapers are written in Python.<\/p>\n\n\n\n<p>Whether you\u2019re using Python to scrape data for small research projects or to build production-scale data pipelines, it offers mature libraries for HTTP requests, HTML parsing, browser automation, async crawling, and anti-bot handling.<\/p>\n\n\n\n<p><strong>Common use cases for Python web scraping:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Price monitoring<\/strong> \u2014 track competitor pricing on e-commerce sites<\/li>\n\n\n\n<li><strong>Lead generation<\/strong> \u2014 collect business directories, contact pages, job boards<\/li>\n\n\n\n<li><strong>Market research<\/strong> \u2014 aggregate product reviews, social sentiment, news coverage<\/li>\n\n\n\n<li><strong>Academic research<\/strong> \u2014 build datasets from public sources for NLP or ML training<\/li>\n\n\n\n<li><strong>Real estate data<\/strong> \u2014 gather listings, pricing trends, property 
details<\/li>\n\n\n\n<li><strong>SEO monitoring<\/strong> \u2014 track rankings, extract SERP features, monitor backlinks<\/li>\n\n\n\n<li><strong>Travel & hospitality<\/strong> \u2014 scrape flight prices, hotel availability, reviews<\/li>\n<\/ul>\n\n\n<h2 class=\"wp-block-heading\">Is Web Scraping legal?<\/h2>\n\n\n\n<p>Web scraping publicly available data sits in a legal grey zone that varies by jurisdiction, target site, and how the scraping is conducted. The landmark 2022 ruling in <em>hiQ Labs v. 
LinkedIn<\/em> (US Ninth Circuit) held that scraping publicly accessible data likely does not violate the Computer Fraud and Abuse Act, but that ruling doesn\u2019t give blanket permission for everything.<\/p>\n\n\n\n<p><strong>The practical checklist before scraping any site:<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><td><strong>Factor<\/strong><\/td><td><strong>What to check<\/strong><\/td><td><strong>Risk if ignored<\/strong><\/td><\/tr><\/thead><tbody><tr><td>robots.txt<\/td><td>Check \/robots.txt for Disallow directives<\/td><td>ToS violation, civil claim<\/td><\/tr><tr><td>Terms of Service<\/td><td>Read the ToS \u2014 many explicitly prohibit automated access<\/td><td>Contract violation, account ban<\/td><\/tr><tr><td>Personal data (GDPR\/CCPA)<\/td><td>Don\u2019t collect or store names, emails, identifiers without legal basis<\/td><td>Regulatory fine (up to \u20ac20M or 4% of global annual turnover under GDPR)<\/td><\/tr><tr><td>Rate limiting<\/td><td>Add delays \u2014 aggressive scraping can constitute DoS in some jurisdictions<\/td><td>Criminal liability<\/td><\/tr><tr><td>Login-required content<\/td><td>Never scrape behind authentication you aren\u2019t authorized to use<\/td><td>CFAA violation<\/td><\/tr><tr><td>Copyright<\/td><td>Copyrighted creative works (text, images) carry their own protections, separate from access rules<\/td><td>DMCA takedown, lawsuit<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">How Web Scraping works<\/h2>\n\n\n\n<p>Before you write a single line of Python, it helps to understand what actually happens under the hood; it makes everything easier to debug.<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>HTTP Request<\/strong><\/li>\n<\/ol>\n\n\n\n<p>Your scraper sends an HTTP GET request to a URL. 
The server receives it and decides whether to respond with HTML or block you.<\/p>\n\n\n\n<ol start=\"2\" class=\"wp-block-list\">\n<li><strong>Server Response<\/strong><\/li>\n<\/ol>\n\n\n\n<p>The server returns the page\u2019s HTML (static sites) or an initial HTML shell that JavaScript then populates (dynamic sites). You need to know which type you\u2019re dealing with before picking a tool.<\/p>\n\n\n\n<ol start=\"3\" class=\"wp-block-list\">\n<li><strong>HTML Parsing<\/strong><\/li>\n<\/ol>\n\n\n\n<p>Your parser reads the HTML tree and locates elements by their tag, class, ID, or XPath. This is where you extract the specific data you want.<\/p>\n\n\n\n<ol start=\"4\" class=\"wp-block-list\">\n<li><strong>Data Cleaning<\/strong><\/li>\n<\/ol>\n\n\n\n<p>Raw HTML contains whitespace, special characters, and formatting noise. You strip and normalize it into clean, usable values.<\/p>\n\n\n\n<ol start=\"5\" class=\"wp-block-list\">\n<li><strong>Storage<\/strong><\/li>\n<\/ol>\n\n\n\n<p>Save to CSV, JSON, a database, or push to an API. The right format depends on what you\u2019re doing with the data next.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Static vs. Dynamic pages: this determines everything<\/h2>\n\n\n\n<p>The most important question before writing any scraper is: is the data in the raw HTML source, or is it loaded by JavaScript?<\/p>\n\n\n\n<p>Right-click the page \u2192 View Page Source. If your data is visible in that source, it\u2019s static. If you see a mostly empty shell with <em>&lt;div id=\"app\"&gt;&lt;\/div&gt;<\/em>, it\u2019s dynamic and you\u2019ll need a browser automation tool like Playwright.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Python libraries: choosing the right tool<\/h2>\n\n\n\n<p>There\u2019s no single \u201cbest\u201d library for Python web scraping. The right tool depends on the type of target page, the scale of your project, and your latency requirements. 
Here\u2019s the full landscape:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><td><strong>Library<\/strong><\/td><td><strong>Role<\/strong><\/td><td><strong>Handles JS?<\/strong><\/td><td><strong>Speed<\/strong><\/td><td><strong>Best for<\/strong><\/td><\/tr><\/thead><tbody><tr><td>requests<\/td><td>HTTP fetching<\/td><td>\ud83d\udd34 No<\/td><td>\ud83d\udfe2 Fast<\/td><td>Static pages, APIs<\/td><\/tr><tr><td>BeautifulSoup4<\/td><td>HTML parsing<\/td><td>\ud83d\udd34 No<\/td><td>\ud83d\udfe1 Medium<\/td><td>Parsing HTML with simple selectors<\/td><\/tr><tr><td>lxml<\/td><td>HTML\/XML parsing<\/td><td>\ud83d\udd34 No<\/td><td>\ud83d\udfe2 Very fast<\/td><td>Large pages, XPath power users<\/td><\/tr><tr><td>Playwright<\/td><td>Browser automation<\/td><td>\ud83d\udfe2 Yes<\/td><td>\ud83d\udfe1 Slower<\/td><td>JS-heavy sites, form interaction<\/td><\/tr><tr><td>Selenium<\/td><td>Browser automation (legacy)<\/td><td>\ud83d\udfe2 Yes<\/td><td>\ud83d\udd34 Slowest<\/td><td>Legacy projects, existing test suites<\/td><\/tr><tr><td>Scrapy<\/td><td>Full crawling framework<\/td><td>\ud83e\udde9 Plugin<\/td><td>\ud83d\udfe2 Very fast<\/td><td>1,000+ pages, production pipelines<\/td><\/tr><tr><td>curl_cffi<\/td><td>TLS-fingerprint-safe HTTP<\/td><td>\ud83d\udd34 No<\/td><td>\ud83d\udfe2 Fast<\/td><td>Cloudflare-protected sites<\/td><\/tr><tr><td>httpx<\/td><td>Async HTTP client<\/td><td>\ud83d\udd34 No<\/td><td>\ud83d\udfe2 Fast<\/td><td>Async scraping, HTTP\/2 support<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">Library decision Tree<\/h3>\n\n\n\n<p>Is the data in View Source (raw HTML)?<\/p>\n\n\n\n<p>\u251c\u2500\u2500 YES<\/p>\n\n\n\n<p>\u2502\u00a0\u00a0 \u251c\u2500\u2500 Small 
project (1\u2013100 pages)?\u00a0 \u2192\u00a0 requests + BeautifulSoup<\/p>\n\n\n\n<p>\u2502\u00a0\u00a0 \u251c\u2500\u2500 Need maximum speed \/ XPath?\u00a0\u00a0 \u2192\u00a0 requests + lxml<\/p>\n\n\n\n<p>\u2502\u00a0\u00a0 \u2514\u2500\u2500 Large crawl (1,000+ pages)?\u00a0\u00a0 \u2192\u00a0 Scrapy<\/p>\n\n\n\n<p>\u2514\u2500\u2500 NO (JavaScript-rendered)<\/p>\n\n\n\n<p>\u00a0\u00a0\u00a0 \u251c\u2500\u2500 Is there a JSON API in DevTools \u2192 Network \u2192 XHR?<\/p>\n\n\n\n<p>\u00a0\u00a0\u00a0 \u2502\u00a0\u00a0 \u2514\u2500\u2500 YES\u00a0 \u2192\u00a0 requests (call the API directly \u2014 fastest!)<\/p>\n\n\n\n<p>\u00a0\u00a0\u00a0 \u2514\u2500\u2500 NO real API<\/p>\n\n\n\n<p>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \u251c\u2500\u2500 Getting blocked by Cloudflare?\u00a0 \u2192\u00a0 curl_cffi or Playwright + stealth<\/p>\n\n\n\n<p>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \u2514\u2500\u2500 Standard JS rendering?\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \u2192\u00a0 Playwright (preferred over Selenium)<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">First Python Web Scraper<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Setup & Installation<\/h3>\n\n\n<figure class=\"rhino-code-snippet\" data-lang=\"python\"><button type=\"button\" class=\"rhino-code-snippet__copy\" aria-label=\"Copy code to clipboard\"><svg class=\"rhino-code-snippet__icon-copy\" viewbox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\" aria-hidden=\"true\"><rect x=\"9\" y=\"9\" width=\"13\" height=\"13\" rx=\"2\" ry=\"2\"><\/rect><path d=\"M5 15H4a2 2 0 0 1-2-2V4a2 2 0 0 1 2-2h9a2 2 0 0 1 2 2v1\"><\/path><\/svg><svg class=\"rhino-code-snippet__icon-check\" viewbox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2.5\" stroke-linecap=\"round\" stroke-linejoin=\"round\" aria-hidden=\"true\"><polyline points=\"20 6 9 17 4 12\"><\/polyline><\/svg><\/button><span 
class=\"rhino-code-snippet__sr\" aria-live=\"polite\"><\/span><pre class=\"line-numbers\"><code class=\"language-python\" data-rhino-code=\"%23%20Create%20a%20virtual%20environment%20%28keeps%20things%20clean%29%0Apython%20-m%20venv%20scraping-env%0Asource%20scraping-env%2Fbin%2Factivate%20%20%23%20Windows%3A%20scraping-env%5CScripts%5Cactivate%0A%0A%23%20Install%20core%20libraries%0Apip%20install%20requests%20beautifulsoup4%20lxml\"><\/code><\/pre><\/figure>\n\n\n<h3 class=\"wp-block-heading\">Inspect before you code<\/h3>\n\n\n\n<p>This step saves hours of frustration. Before writing any Python, open your browser\u2019s DevTools (F12), click the <strong>Elements<\/strong> tab, and hover over the data you want to extract. Note the HTML tag, class name, and any parent structure. The selector you\u2019ll use in Python maps directly to what you see here.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Complete working scraper<\/h3>\n\n\n\n<p>We\u2019ll scrape <a href=\"https:\/\/books.toscrape.com\/\">books.toscrape.com<\/a>, a sandboxed site made for practicing scraping, so it\u2019s completely legal and won\u2019t block you.<\/p>\n\n\n<figure class=\"rhino-code-snippet\" data-lang=\"python\"><button type=\"button\" class=\"rhino-code-snippet__copy\" aria-label=\"Copy code to clipboard\"><svg class=\"rhino-code-snippet__icon-copy\" viewbox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\" aria-hidden=\"true\"><rect x=\"9\" y=\"9\" width=\"13\" height=\"13\" rx=\"2\" ry=\"2\"><\/rect><path d=\"M5 15H4a2 2 0 0 1-2-2V4a2 2 0 0 1 2-2h9a2 2 0 0 1 2 2v1\"><\/path><\/svg><svg class=\"rhino-code-snippet__icon-check\" viewbox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2.5\" stroke-linecap=\"round\" stroke-linejoin=\"round\" aria-hidden=\"true\"><polyline points=\"20 6 9 17 4 12\"><\/polyline><\/svg><\/button><span class=\"rhino-code-snippet__sr\" aria-live=\"polite\"><\/span><pre 
class=\"line-numbers\"><code class=\"language-python\" data-rhino-code=\"import%20requests%0Afrom%20bs4%20import%20BeautifulSoup%0Aimport%20csv%0A%0A%23%20Always%20set%20a%20User-Agent%20%E2%80%94%20bare%20requests%20is%20an%20instant%20flag%0AHEADERS%20%3D%20%7B%0A%20%20%20%20%22User-Agent%22%3A%20%28%0A%20%20%20%20%20%20%20%20%22Mozilla%2F5.0%20%28Windows%20NT%2010.0%3B%20Win64%3B%20x64%29%20%22%0A%20%20%20%20%20%20%20%20%22AppleWebKit%2F537.36%20%28KHTML%2C%20like%20Gecko%29%20%22%0A%20%20%20%20%20%20%20%20%22Chrome%2F124.0.0.0%20Safari%2F537.36%22%0A%20%20%20%20%29%2C%0A%20%20%20%20%22Accept-Language%22%3A%20%22en-US%2Cen%3Bq%3D0.9%22%2C%0A%7D%0A%0Adef%20scrape_books%28url%29%3A%0A%20%20%20%20%22%22%22Fetch%20a%20page%20and%20extract%20all%20book%20listings.%22%22%22%0A%20%20%20%20response%20%3D%20requests.get%28url%2C%20headers%3DHEADERS%2C%20timeout%3D15%29%0A%20%20%20%20response.raise_for_status%28%29%20%20%23%20Raises%20exception%20for%204xx%2F5xx%0A%0A%20%20%20%20soup%20%3D%20BeautifulSoup%28response.text%2C%20%22lxml%22%29%0A%20%20%20%20books%20%3D%20%5B%5D%0A%0A%20%20%20%20for%20article%20in%20soup.select%28%22article.product_pod%22%29%3A%0A%20%20%20%20%20%20%20%20title%20%20%3D%20article.select_one%28%22h3%20a%22%29%5B%22title%22%5D%0A%20%20%20%20%20%20%20%20price%20%20%3D%20article.select_one%28%22p.price_color%22%29.text.strip%28%29%0A%20%20%20%20%20%20%20%20rating%20%3D%20article.select_one%28%22p.star-rating%22%29%5B%22class%22%5D%5B1%5D%0A%20%20%20%20%20%20%20%20in_stock%20%3D%20%22In%20stock%22%20in%20article.select_one%28%22p.availability%22%29.text%0A%0A%20%20%20%20%20%20%20%20books.append%28%7B%0A%20%20%20%20%20%20%20%20%20%20%20%20%22title%22%3A%20%20%20%20title%2C%0A%20%20%20%20%20%20%20%20%20%20%20%20%22price%22%3A%20%20%20%20price%2C%0A%20%20%20%20%20%20%20%20%20%20%20%20%22rating%22%3A%20%20%20rating%2C%0A%20%20%20%20%20%20%20%20%20%20%20%20%22in_stock%22%3A%20in_stock%2C%0A%20%20%20%20%20%20%20%20%7D%29%0A%0A%20%20%20%20return%20books%0A%0
A%23%20Scrape%20and%20save%20to%20CSV%0Adata%20%3D%20scrape_books%28%22https%3A%2F%2Fbooks.toscrape.com%2F%22%29%0A%0Awith%20open%28%22books.csv%22%2C%20%22w%22%2C%20newline%3D%22%22%2C%20encoding%3D%22utf-8%22%29%20as%20f%3A%0A%20%20%20%20writer%20%3D%20csv.DictWriter%28f%2C%20fieldnames%3Ddata%5B0%5D.keys%28%29%29%0A%20%20%20%20writer.writeheader%28%29%0A%20%20%20%20writer.writerows%28data%29%0A%0Aprint%28f%22Scraped%20%7Blen%28data%29%7D%20books%20%E2%86%92%20books.csv%22%29\"><\/code><\/pre><\/figure>\n\n\n<p><strong>\ud83d\ude80<\/strong><strong> <\/strong><strong>Tip: <\/strong>Use <em>lxml<\/em> as the BeautifulSoup parser (<em>BeautifulSoup(html, \"lxml\")<\/em>) instead of <em>html.parser<\/em>. It\u2019s significantly faster for large pages and handles malformed HTML more gracefully.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">CSS selectors & XPath: finding your data<\/h2>\n\n\n\n<p>Choosing the right selector is the difference between a scraper that works reliably for months and one that breaks every time the site updates its CSS. 
Here\u2019s the practical guide.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">CSS Selectors (recommended for most use cases)<\/h3>\n\n\n<figure class=\"rhino-code-snippet\" data-lang=\"python\"><button type=\"button\" class=\"rhino-code-snippet__copy\" aria-label=\"Copy code to clipboard\"><svg class=\"rhino-code-snippet__icon-copy\" viewbox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\" aria-hidden=\"true\"><rect x=\"9\" y=\"9\" width=\"13\" height=\"13\" rx=\"2\" ry=\"2\"><\/rect><path d=\"M5 15H4a2 2 0 0 1-2-2V4a2 2 0 0 1 2-2h9a2 2 0 0 1 2 2v1\"><\/path><\/svg><svg class=\"rhino-code-snippet__icon-check\" viewbox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2.5\" stroke-linecap=\"round\" stroke-linejoin=\"round\" aria-hidden=\"true\"><polyline points=\"20 6 9 17 4 12\"><\/polyline><\/svg><\/button><span class=\"rhino-code-snippet__sr\" aria-live=\"polite\"><\/span><pre class=\"line-numbers\"><code class=\"language-python\" 
data-rhino-code=\"%23%20By%20tag%0Asoup.select%28%22h1%22%29%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%23%20all%20%3Ch1%3E%20tags%0A%0A%23%20By%20class%0Asoup.select%28%22.product-price%22%29%20%20%20%20%20%20%20%20%23%20class%3D%22product-price%22%0A%0A%23%20By%20ID%0Asoup.select_one%28%22%23main-content%22%29%20%20%20%20%23%20id%3D%22main-content%22%0A%0A%23%20Combined%3A%20tag%20%2B%20class%0Asoup.select%28%22span.price%22%29%0A%0A%23%20Nested%3A%20div%20containing%20a%20span%0Asoup.select%28%22div.product%20span.price%22%29%0A%0A%23%20Attribute%20selector%0Asoup.select%28%27a%5Bhref%5E%3D%22%2Fproducts%22%5D%27%29%20%20%23%20href%20starts%20with%20%2Fproducts%0A%0A%23%20First%20child%0Asoup.select%28%22ul.items%20li%3Afirst-child%22%29%0A%0A%23%20Get%20text%20vs%20attribute%0Ael%20%3D%20soup.select_one%28%22h1.title%22%29%0Ael.text.strip%28%29%20%20%20%20%20%20%20%23%20inner%20text%0Ael%5B%22data-id%22%5D%20%20%20%20%20%20%20%20%23%20attribute%20value%0Ael.get%28%22data-id%22%29%20%20%20%20%23%20safe%20get%20%28returns%20None%20if%20missing%29\"><\/code><\/pre><\/figure>\n\n\n<h3 class=\"wp-block-heading\">XPath (best for complex traversals)<\/h3>\n\n\n<figure class=\"rhino-code-snippet\" data-lang=\"python\"><button type=\"button\" class=\"rhino-code-snippet__copy\" aria-label=\"Copy code to clipboard\"><svg class=\"rhino-code-snippet__icon-copy\" viewbox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\" aria-hidden=\"true\"><rect x=\"9\" y=\"9\" width=\"13\" height=\"13\" rx=\"2\" ry=\"2\"><\/rect><path d=\"M5 15H4a2 2 0 0 1-2-2V4a2 2 0 0 1 2-2h9a2 2 0 0 1 2 2v1\"><\/path><\/svg><svg class=\"rhino-code-snippet__icon-check\" viewbox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2.5\" stroke-linecap=\"round\" stroke-linejoin=\"round\" aria-hidden=\"true\"><polyline points=\"20 6 9 17 4 12\"><\/polyline><\/svg><\/button><span 
class=\"rhino-code-snippet__sr\" aria-live=\"polite\"><\/span><pre class=\"line-numbers\"><code class=\"language-python\" data-rhino-code=\"from%20lxml%20import%20html%0A%0A%23%20Assumes%20response%20comes%20from%20an%20earlier%20requests.get%28url%29%20call%0Atree%20%3D%20html.fromstring%28response.content%29%0A%0A%23%20XPath%20examples%0Atree.xpath%28%22%2F%2Fdiv%5B%40class%3D%27product%27%5D%2F%2Fspan%5B%40class%3D%27price%27%5D%2Ftext%28%29%22%29%0Atree.xpath%28%22%2F%2Fa%5Bcontains%28%40href%2C%20%27%2Fproduct%2F%27%29%5D%2F%40href%22%29%0Atree.xpath%28%22%2F%2Ftable%2F%2Ftr%5Bposition%28%29%3E1%5D%22%29%20%20%20%23%20skip%20header%20row%0Atree.xpath%28%22%2F%2Fdiv%5Bnot%28contains%28%40class%2C%27ad%27%29%29%5D%22%29\"><\/code><\/pre><\/figure>\n\n\n<p><strong>\ud83d\ude80<\/strong><strong> <\/strong><strong>Tip:<\/strong>\u00a0In Chrome DevTools, right-click any element \u2192 Copy \u2192 Copy selector (or Copy XPath). This gives you a starting point, though auto-generated selectors are often brittle. Simplify them by targeting stable attributes like\u00a0<em>data-*<\/em>\u00a0attributes, IDs, or semantic class names rather than positional selectors.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Scraping JavaScript-rendered pages with Playwright<\/h2>\n\n\n\n<p>A significant portion of modern websites \u2014 e-commerce, SaaS, social platforms \u2014 render their content via JavaScript after the initial HTML loads. If you can\u2019t find your data in View Source, you need a tool that runs a real browser.<\/p>\n\n\n\n<p><strong>Playwright is the modern choice<\/strong> over Selenium in 2026: it\u2019s faster, has a cleaner API, supports async natively, and has better built-in waiting mechanisms. 
Selenium is still viable for legacy projects, but for new work, start with Playwright.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Setup<\/h3>\n\n\n<figure class=\"rhino-code-snippet\" data-lang=\"python\"><button type=\"button\" class=\"rhino-code-snippet__copy\" aria-label=\"Copy code to clipboard\"><svg class=\"rhino-code-snippet__icon-copy\" viewbox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\" aria-hidden=\"true\"><rect x=\"9\" y=\"9\" width=\"13\" height=\"13\" rx=\"2\" ry=\"2\"><\/rect><path d=\"M5 15H4a2 2 0 0 1-2-2V4a2 2 0 0 1 2-2h9a2 2 0 0 1 2 2v1\"><\/path><\/svg><svg class=\"rhino-code-snippet__icon-check\" viewbox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2.5\" stroke-linecap=\"round\" stroke-linejoin=\"round\" aria-hidden=\"true\"><polyline points=\"20 6 9 17 4 12\"><\/polyline><\/svg><\/button><span class=\"rhino-code-snippet__sr\" aria-live=\"polite\"><\/span><pre class=\"line-numbers\"><code class=\"language-python\" data-rhino-code=\"pip%20install%20playwright%0Aplaywright%20install%20chromium%20%20%20%23%20installs%20the%20browser%20binary\"><\/code><\/pre><\/figure>\n\n\n<h3 class=\"wp-block-heading\">Basic Playwright scraper<\/h3>\n\n\n<figure class=\"rhino-code-snippet\" data-lang=\"python\"><button type=\"button\" class=\"rhino-code-snippet__copy\" aria-label=\"Copy code to clipboard\"><svg class=\"rhino-code-snippet__icon-copy\" viewbox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\" aria-hidden=\"true\"><rect x=\"9\" y=\"9\" width=\"13\" height=\"13\" rx=\"2\" ry=\"2\"><\/rect><path d=\"M5 15H4a2 2 0 0 1-2-2V4a2 2 0 0 1 2-2h9a2 2 0 0 1 2 2v1\"><\/path><\/svg><svg class=\"rhino-code-snippet__icon-check\" viewbox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2.5\" stroke-linecap=\"round\" stroke-linejoin=\"round\" aria-hidden=\"true\"><polyline 
points=\"20 6 9 17 4 12\"><\/polyline><\/svg><\/button><span class=\"rhino-code-snippet__sr\" aria-live=\"polite\"><\/span><pre class=\"line-numbers\"><code class=\"language-python\" data-rhino-code=\"from%20playwright.sync_api%20import%20sync_playwright%0A%0Awith%20sync_playwright%28%29%20as%20p%3A%0A%20%20%20%20browser%20%3D%20p.chromium.launch%28headless%3DTrue%29%0A%20%20%20%20page%20%3D%20browser.new_page%28%29%0A%0A%20%20%20%20%23%20Set%20realistic%20headers%0A%20%20%20%20page.set_extra_http_headers%28%7B%0A%20%20%20%20%20%20%20%20%22User-Agent%22%3A%20%22Mozilla%2F5.0%20%28Windows%20NT%2010.0%3B%20Win64%3B%20x64%29%20Chrome%2F124.0.0.0%22%2C%0A%20%20%20%20%20%20%20%20%22Accept-Language%22%3A%20%22en-US%2Cen%3Bq%3D0.9%22%2C%0A%20%20%20%20%7D%29%0A%0A%20%20%20%20page.goto%28%22https%3A%2F%2Fexample.com%2Fproducts%22%2C%20wait_until%3D%22networkidle%22%29%0A%0A%20%20%20%20%23%20Wait%20for%20the%20data%20to%20render%0A%20%20%20%20page.wait_for_selector%28%22.product-list%22%2C%20timeout%3D10000%29%0A%0A%20%20%20%20%23%20Extract%20data%20from%20the%20rendered%20DOM%0A%20%20%20%20products%20%3D%20page.query_selector_all%28%22.product-item%22%29%0A%20%20%20%20for%20product%20in%20products%3A%0A%20%20%20%20%20%20%20%20name%20%20%3D%20product.query_selector%28%22.name%22%29.inner_text%28%29%0A%20%20%20%20%20%20%20%20price%20%3D%20product.query_selector%28%22.price%22%29.inner_text%28%29%0A%20%20%20%20%20%20%20%20print%28name%2C%20price%29%0A%0A%20%20%20%20browser.close%28%29\"><\/code><\/pre><\/figure>\n\n\n<h3 class=\"wp-block-heading\">Async Playwright (for scraping multiple pages concurrently)<\/h3>\n\n\n<figure class=\"rhino-code-snippet\" data-lang=\"python\"><button type=\"button\" class=\"rhino-code-snippet__copy\" aria-label=\"Copy code to clipboard\"><svg class=\"rhino-code-snippet__icon-copy\" viewbox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\" aria-hidden=\"true\"><rect x=\"9\" y=\"9\" width=\"13\" height=\"13\" rx=\"2\" ry=\"2\"><\/rect><path d=\"M5 15H4a2 2 0 0 1-2-2V4a2 2 0 0 1 2-2h9a2 2 0 0 1 2 2v1\"><\/path><\/svg><svg class=\"rhino-code-snippet__icon-check\" viewbox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2.5\" stroke-linecap=\"round\" stroke-linejoin=\"round\" aria-hidden=\"true\"><polyline points=\"20 6 9 17 4 12\"><\/polyline><\/svg><\/button><span 
class=\"rhino-code-snippet__sr\" aria-live=\"polite\"><\/span><pre class=\"line-numbers\"><code class=\"language-python\" data-rhino-code=\"import%20asyncio%0Afrom%20playwright.async_api%20import%20async_playwright%0A%0Aasync%20def%20scrape_page%28url%3A%20str%29%20-%3E%20str%3A%0A%20%20%20%20async%20with%20async_playwright%28%29%20as%20p%3A%0A%20%20%20%20%20%20%20%20browser%20%3D%20await%20p.chromium.launch%28headless%3DTrue%29%0A%20%20%20%20%20%20%20%20page%20%3D%20await%20browser.new_page%28%29%0A%20%20%20%20%20%20%20%20await%20page.goto%28url%29%0A%20%20%20%20%20%20%20%20content%20%3D%20await%20page.content%28%29%0A%20%20%20%20%20%20%20%20await%20browser.close%28%29%0A%20%20%20%20%20%20%20%20return%20content%0A%0Aasync%20def%20main%28urls%29%3A%0A%20%20%20%20%23%20asyncio.run%28%29%20expects%20a%20coroutine%2C%20so%20gather%28%29%20is%20awaited%20inside%20one%0A%20%20%20%20return%20await%20asyncio.gather%28%2A%28scrape_page%28u%29%20for%20u%20in%20urls%29%29%0A%0Aurls%20%3D%20%5B%22https%3A%2F%2Fexample.com%2Fpage%2F1%22%2C%20%22https%3A%2F%2Fexample.com%2Fpage%2F2%22%5D%0Aresults%20%3D%20asyncio.run%28main%28urls%29%29\"><\/code><\/pre><\/figure>\n\n\n<p><strong>\ud83d\ude80<\/strong><strong> <\/strong><strong>Tip: Check the Network tab first. <\/strong>Before switching to Playwright, open DevTools \u2192 Network \u2192 Fetch\/XHR and reload the page. Many sites that look JS-rendered actually expose a clean JSON API endpoint. Calling that directly with requests is often 10\u201350x faster than spinning up a browser and usually far more stable.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Handling pagination<\/h2>\n\n\n\n<p>Real scraping targets almost never fit on a single page. Here are the two common patterns and how to handle both.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Pattern 1: URL-Based pagination<\/h3>\n\n\n\n<p>Many sites use predictable URL patterns: <em>\/page\/2<\/em>, <em>?page=3<\/em>, <em>&start=40<\/em>. 
These are the easiest to handle.<\/p>\n\n\n<figure class=\"rhino-code-snippet\" data-lang=\"python\"><button type=\"button\" class=\"rhino-code-snippet__copy\" aria-label=\"Copy code to clipboard\"><svg class=\"rhino-code-snippet__icon-copy\" viewbox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\" aria-hidden=\"true\"><rect x=\"9\" y=\"9\" width=\"13\" height=\"13\" rx=\"2\" ry=\"2\"><\/rect><path d=\"M5 15H4a2 2 0 0 1-2-2V4a2 2 0 0 1 2-2h9a2 2 0 0 1 2 2v1\"><\/path><\/svg><svg class=\"rhino-code-snippet__icon-check\" viewbox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2.5\" stroke-linecap=\"round\" stroke-linejoin=\"round\" aria-hidden=\"true\"><polyline points=\"20 6 9 17 4 12\"><\/polyline><\/svg><\/button><span class=\"rhino-code-snippet__sr\" aria-live=\"polite\"><\/span><pre class=\"line-numbers\"><code class=\"language-python\" data-rhino-code=\"import%20requests%2C%20time%2C%20random%0Afrom%20bs4%20import%20BeautifulSoup%0A%0ABASE%20%3D%20%22https%3A%2F%2Fbooks.toscrape.com%2Fcatalogue%2Fpage-%7B%7D.html%22%0Aall_books%20%3D%20%5B%5D%0A%0Afor%20page_num%20in%20range%281%2C%2051%29%3A%0A%20%20%20%20response%20%3D%20requests.get%28BASE.format%28page_num%29%2C%20timeout%3D10%29%0A%0A%20%20%20%20if%20response.status_code%20%3D%3D%20404%3A%0A%20%20%20%20%20%20%20%20break%20%20%23%20No%20more%20pages%0A%0A%20%20%20%20soup%20%3D%20BeautifulSoup%28response.text%2C%20%22lxml%22%29%0A%20%20%20%20titles%20%3D%20%5Ba%5B%22title%22%5D%20for%20a%20in%20soup.select%28%22article.product_pod%20h3%20a%22%29%5D%0A%20%20%20%20all_books.extend%28titles%29%0A%0A%20%20%20%20time.sleep%28random.uniform%280.8%2C%202.0%29%29%20%20%23%20random%20delay%20%E2%80%94%20be%20polite%0A%0Aprint%28f%22Total%3A%20%7Blen%28all_books%29%7D%20books%22%29\"><\/code><\/pre><\/figure>\n\n\n<h3 class=\"wp-block-heading\">Pattern 2: \u201cNext\u201d Button Crawling<\/h3>\n\n\n\n<p>When URLs aren\u2019t 
predictable, follow the next-page link directly from the HTML.<\/p>\n\n\n<figure class=\"rhino-code-snippet\" data-lang=\"python\"><button type=\"button\" class=\"rhino-code-snippet__copy\" aria-label=\"Copy code to clipboard\"><svg class=\"rhino-code-snippet__icon-copy\" viewbox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\" aria-hidden=\"true\"><rect x=\"9\" y=\"9\" width=\"13\" height=\"13\" rx=\"2\" ry=\"2\"><\/rect><path d=\"M5 15H4a2 2 0 0 1-2-2V4a2 2 0 0 1 2-2h9a2 2 0 0 1 2 2v1\"><\/path><\/svg><svg class=\"rhino-code-snippet__icon-check\" viewbox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2.5\" stroke-linecap=\"round\" stroke-linejoin=\"round\" aria-hidden=\"true\"><polyline points=\"20 6 9 17 4 12\"><\/polyline><\/svg><\/button><span class=\"rhino-code-snippet__sr\" aria-live=\"polite\"><\/span><pre class=\"line-numbers\"><code class=\"language-python\" data-rhino-code=\"import%20time%0Afrom%20urllib.parse%20import%20urljoin%0A%0Aimport%20requests%0Afrom%20bs4%20import%20BeautifulSoup%0A%0ABASE_URL%20%3D%20%22https%3A%2F%2Fexample.com%2Flistings%22%0Aurl%20%3D%20BASE_URL%0Aall_items%20%3D%20%5B%5D%0A%0Awhile%20url%3A%0A%20%20%20%20soup%20%3D%20BeautifulSoup%28requests.get%28url%2C%20timeout%3D10%29.text%2C%20%22lxml%22%29%0A%0A%20%20%20%20for%20item%20in%20soup.select%28%22.listing-item%22%29%3A%0A%20%20%20%20%20%20%20%20all_items.append%28item.text.strip%28%29%29%0A%0A%20%20%20%20nxt%20%3D%20soup.select_one%28%22a%5Brel%3D%27next%27%5D%2C%20a.next-page%2C%20li.next%20a%22%29%0A%20%20%20%20%23%20Resolve%20relative%20links%20against%20the%20current%20page%2C%20not%20the%20start%20URL%0A%20%20%20%20url%20%3D%20urljoin%28url%2C%20nxt%5B%22href%22%5D%29%20if%20nxt%20else%20None%0A%20%20%20%20time.sleep%281%29%0A%0Aprint%28f%22Scraped%20%7Blen%28all_items%29%7D%20items%20across%20all%20pages%22%29\"><\/code><\/pre><\/figure>\n\n\n<h2 class=\"wp-block-heading\">Storing scraped data<\/h2>\n\n\n\n<p>The right storage format depends entirely on what you\u2019re doing with the data downstream. 
Here\u2019s the decision guide and implementation for each option.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><td><strong>Format<\/strong><\/td><td><strong>Best for<\/strong><\/td><td><strong>Max scale<\/strong><\/td><td><strong>Queryable?<\/strong><\/td><\/tr><\/thead><tbody><tr><td>CSV<\/td><td>One-off exports, Excel\/pandas consumption<\/td><td>~100K rows<\/td><td>No<\/td><\/tr><tr><td>JSON<\/td><td>APIs, nested\/irregular data structures<\/td><td>~100K rows<\/td><td>No<\/td><\/tr><tr><td>SQLite<\/td><td>Deduplication, local querying, medium scale<\/td><td>~10M rows<\/td><td>Yes<\/td><\/tr><tr><td>PostgreSQL<\/td><td>Production pipelines, multi-user, large scale<\/td><td>Unlimited<\/td><td>Yes<\/td><\/tr><tr><td>pandas DataFrame<\/td><td>Immediate data analysis\/visualization<\/td><td>RAM limit<\/td><td>Yes<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n<figure class=\"rhino-code-snippet\" data-lang=\"python\"><button type=\"button\" class=\"rhino-code-snippet__copy\" aria-label=\"Copy code to clipboard\"><svg class=\"rhino-code-snippet__icon-copy\" viewbox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\" aria-hidden=\"true\"><rect x=\"9\" y=\"9\" width=\"13\" height=\"13\" rx=\"2\" ry=\"2\"><\/rect><path d=\"M5 15H4a2 2 0 0 1-2-2V4a2 2 0 0 1 2-2h9a2 2 0 0 1 2 2v1\"><\/path><\/svg><svg class=\"rhino-code-snippet__icon-check\" viewbox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2.5\" stroke-linecap=\"round\" stroke-linejoin=\"round\" aria-hidden=\"true\"><polyline points=\"20 6 9 17 4 12\"><\/polyline><\/svg><\/button><span class=\"rhino-code-snippet__sr\" aria-live=\"polite\"><\/span><pre class=\"line-numbers\"><code class=\"language-python\" 
data-rhino-code=\"import%20csv%2C%20json%2C%20sqlite3%0A%0Adata%20%3D%20%5B%0A%20%20%20%20%7B%22title%22%3A%20%22Book%20A%22%2C%20%22price%22%3A%20%22%C2%A312.99%22%2C%20%22rating%22%3A%20%22Four%22%7D%2C%0A%20%20%20%20%7B%22title%22%3A%20%22Book%20B%22%2C%20%22price%22%3A%20%22%C2%A39.99%22%2C%20%20%22rating%22%3A%20%22Five%22%7D%2C%0A%5D%0A%0A%23%20%E2%94%80%E2%94%80%20CSV%20%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%0Awith%20open%28%22output.csv%22%2C%20%22w%22%2C%20newline%3D%22%22%2C%20encoding%3D%22utf-8%22%29%20as%20f%3A%0A%20%20%20%20w%20%3D%20csv.DictWriter%28f%2C%20fieldnames%3Ddata%5B0%5D.keys%28%29%29%0A%20%20%20%20w.writeheader%28%29%3B%20w.writerows%28data%29%0A%0A%23%20%E2%94%80%E2%94%80%20JSON%20%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%0Awith%20open%28%22output.json%22%2C%20%22w%22%2C%20encoding%3D%22utf-8%22%29%20as%20f%3A%0A%20%20%20%20json.dump%28data%2C%20f%2C%20indent%3D2%2C%20ensure_ascii%3DFalse%29%0A%0A%23%20%E2%94%80%E2%94%80%20SQLite%20%28with%20deduplication%29%20%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94
%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%0Aconn%20%3D%20sqlite3.connect%28%22books.db%22%29%0Aconn.execute%28%22%22%22%0A%20%20%20%20CREATE%20TABLE%20IF%20NOT%20EXISTS%20books%20%28%0A%20%20%20%20%20%20%20%20id%20%20%20%20%20INTEGER%20PRIMARY%20KEY%20AUTOINCREMENT%2C%0A%20%20%20%20%20%20%20%20title%20%20TEXT%20UNIQUE%2C%0A%20%20%20%20%20%20%20%20price%20%20TEXT%2C%0A%20%20%20%20%20%20%20%20rating%20TEXT%0A%20%20%20%20%29%0A%22%22%22%29%0Afor%20row%20in%20data%3A%0A%20%20%20%20conn.execute%28%0A%20%20%20%20%20%20%20%20%22INSERT%20OR%20IGNORE%20INTO%20books%20%28title%2C%20price%2C%20rating%29%20VALUES%20%28%3F%2C%3F%2C%3F%29%22%2C%0A%20%20%20%20%20%20%20%20%28row%5B%22title%22%5D%2C%20row%5B%22price%22%5D%2C%20row%5B%22rating%22%5D%29%0A%20%20%20%20%29%0Aconn.commit%28%29%3B%20conn.close%28%29\"><\/code><\/pre><\/figure>\n\n\n<h2 class=\"wp-block-heading\">Why scrapers get blocked and how to fix it<\/h2>\n\n\n\n<p>This is the section that most Python web scraping tutorials skip entirely, and the reason most scrapers fail in production. 
Anti-bot systems work in layers, and understanding each one is the first step to bypassing it.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">The Detection Stack (ordered by when each layer fires)<\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><td><\/td><td><strong>Layer<\/strong><\/td><td><strong>What it checks<\/strong><\/td><td><strong>Fix<\/strong><\/td><\/tr><\/thead><tbody><tr><td>1<\/td><td><strong>TLS Fingerprinting<\/strong><\/td><td>JA3\/JA4 hash of your TLS ClientHello; fires before headers are read<\/td><td>curl_cffi to impersonate a real browser TLS stack<\/td><\/tr><tr><td>2<\/td><td><strong>HTTP Headers<\/strong><\/td><td>Bare requests headers look nothing like a real browser<\/td><td>Set full, realistic header set including Sec-Fetch-*<\/td><\/tr><tr><td>3<\/td><td><strong>IP Reputation<\/strong><\/td><td>Datacenter IPs are flagged; too many requests from one IP = block<\/td><td>Rotate residential proxies per request<\/td><\/tr><tr><td>4<\/td><td><strong>Request Timing<\/strong><\/td><td>Machine-perfect timing is a bot signal<\/td><td>Random delays (1\u20134s), jitter on intervals<\/td><\/tr><tr><td>5<\/td><td><strong>Browser Fingerprint<\/strong><\/td><td>Headless browser leaks: navigator.webdriver, missing plugins, canvas hash<\/td><td>Playwright with playwright-stealth<\/td><\/tr><tr><td>6<\/td><td><strong>Behavioral Analysis<\/strong><\/td><td>No mouse movement, scroll, or interaction patterns<\/td><td>Playwright with randomized mouse\/scroll simulation<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\">Layer 1: TLS fingerprint bypass with curl_cffi<\/h4>\n\n\n\n<p>This is one of the most commonly missed fixes in 2026. Cloudflare, Akamai, and DataDome inspect the TLS <em>ClientHello<\/em> message before your HTTP headers even arrive. 
Python\u2019s widely used <em>requests<\/em> library creates a fingerprint that\u2019s trivially identified as non-browser. The fix is <em>curl_cffi<\/em>:<\/p>\n\n\n<figure class=\"rhino-code-snippet\" data-lang=\"python\"><button type=\"button\" class=\"rhino-code-snippet__copy\" aria-label=\"Copy code to clipboard\"><svg class=\"rhino-code-snippet__icon-copy\" viewbox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\" aria-hidden=\"true\"><rect x=\"9\" y=\"9\" width=\"13\" height=\"13\" rx=\"2\" ry=\"2\"><\/rect><path d=\"M5 15H4a2 2 0 0 1-2-2V4a2 2 0 0 1 2-2h9a2 2 0 0 1 2 2v1\"><\/path><\/svg><svg class=\"rhino-code-snippet__icon-check\" viewbox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2.5\" stroke-linecap=\"round\" stroke-linejoin=\"round\" aria-hidden=\"true\"><polyline points=\"20 6 9 17 4 12\"><\/polyline><\/svg><\/button><span class=\"rhino-code-snippet__sr\" aria-live=\"polite\"><\/span><pre class=\"line-numbers\"><code class=\"language-python\" data-rhino-code=\"%23%20pip%20install%20curl-cffi%0A%0Afrom%20curl_cffi%20import%20requests%20as%20cffi_requests%0A%0A%23%20Impersonate%20a%20real%20Chrome%20browser%27s%20TLS%20stack%0Aresponse%20%3D%20cffi_requests.get%28%0A%20%20%20%20%22https%3A%2F%2Fcloudflare-protected-site.com%22%2C%0A%20%20%20%20impersonate%3D%22chrome124%22%20%20%23%20other%20targets%20exist%3B%20check%20the%20curl_cffi%20docs%20for%20supported%20names%0A%29%0Aprint%28response.text%29\"><\/code><\/pre><\/figure>\n\n\n<h4 class=\"wp-block-heading\">Layer 2: setting realistic HTTP headers<\/h4>\n\n\n<figure class=\"rhino-code-snippet\" data-lang=\"python\"><button type=\"button\" class=\"rhino-code-snippet__copy\" aria-label=\"Copy code to clipboard\"><svg class=\"rhino-code-snippet__icon-copy\" viewbox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\" aria-hidden=\"true\"><rect x=\"9\" 
y=\"9\" width=\"13\" height=\"13\" rx=\"2\" ry=\"2\"><\/rect><path d=\"M5 15H4a2 2 0 0 1-2-2V4a2 2 0 0 1 2-2h9a2 2 0 0 1 2 2v1\"><\/path><\/svg><svg class=\"rhino-code-snippet__icon-check\" viewbox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2.5\" stroke-linecap=\"round\" stroke-linejoin=\"round\" aria-hidden=\"true\"><polyline points=\"20 6 9 17 4 12\"><\/polyline><\/svg><\/button><span class=\"rhino-code-snippet__sr\" aria-live=\"polite\"><\/span><pre class=\"line-numbers\"><code class=\"language-python\" data-rhino-code=\"HEADERS%20%3D%20%7B%0A%20%20%20%20%22User-Agent%22%3A%20%28%0A%20%20%20%20%20%20%20%20%22Mozilla%2F5.0%20%28Windows%20NT%2010.0%3B%20Win64%3B%20x64%29%20%22%0A%20%20%20%20%20%20%20%20%22AppleWebKit%2F537.36%20%28KHTML%2C%20like%20Gecko%29%20%22%0A%20%20%20%20%20%20%20%20%22Chrome%2F124.0.0.0%20Safari%2F537.36%22%0A%20%20%20%20%29%2C%0A%20%20%20%20%22Accept%22%3A%20%20%20%20%20%20%20%20%20%20%20%20%20%22text%2Fhtml%2Capplication%2Fxhtml%2Bxml%2Capplication%2Fxml%3Bq%3D0.9%2C%2A%2F%2A%3Bq%3D0.8%22%2C%0A%20%20%20%20%22Accept-Language%22%3A%20%20%20%20%22en-US%2Cen%3Bq%3D0.5%22%2C%0A%20%20%20%20%22Accept-Encoding%22%3A%20%20%20%20%22gzip%2C%20deflate%2C%20br%22%2C%0A%20%20%20%20%22Connection%22%3A%20%20%20%20%20%20%20%20%20%22keep-alive%22%2C%0A%20%20%20%20%22Upgrade-Insecure-Requests%22%3A%20%221%22%2C%0A%20%20%20%20%22Sec-Fetch-Dest%22%3A%20%20%20%20%20%22document%22%2C%0A%20%20%20%20%22Sec-Fetch-Mode%22%3A%20%20%20%20%20%22navigate%22%2C%0A%20%20%20%20%22Sec-Fetch-Site%22%3A%20%20%20%20%20%22none%22%2C%0A%20%20%20%20%22Sec-Fetch-User%22%3A%20%20%20%20%20%22%3F1%22%2C%0A%7D%0A%23%20Keep%20User-Agent%20current%20%E2%80%94%20browsers%20from%202022%20are%20a%20detection%20signal%20in%202026\"><\/code><\/pre><\/figure>\n\n\n<h4 class=\"wp-block-heading\">Layer 5\u20136: stealth Playwright<\/h4>\n\n\n<figure class=\"rhino-code-snippet\" data-lang=\"python\"><button type=\"button\" class=\"rhino-code-snippet__copy\" 
aria-label=\"Copy code to clipboard\"><svg class=\"rhino-code-snippet__icon-copy\" viewbox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\" aria-hidden=\"true\"><rect x=\"9\" y=\"9\" width=\"13\" height=\"13\" rx=\"2\" ry=\"2\"><\/rect><path d=\"M5 15H4a2 2 0 0 1-2-2V4a2 2 0 0 1 2-2h9a2 2 0 0 1 2 2v1\"><\/path><\/svg><svg class=\"rhino-code-snippet__icon-check\" viewbox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2.5\" stroke-linecap=\"round\" stroke-linejoin=\"round\" aria-hidden=\"true\"><polyline points=\"20 6 9 17 4 12\"><\/polyline><\/svg><\/button><span class=\"rhino-code-snippet__sr\" aria-live=\"polite\"><\/span><pre class=\"line-numbers\"><code class=\"language-python\" data-rhino-code=\"%23%20pip%20install%20playwright-stealth%0A%0Afrom%20playwright.sync_api%20import%20sync_playwright%0Afrom%20playwright_stealth%20import%20stealth_sync%0A%0Awith%20sync_playwright%28%29%20as%20p%3A%0A%20%20%20%20browser%20%3D%20p.chromium.launch%28headless%3DTrue%29%0A%20%20%20%20page%20%3D%20browser.new_page%28%29%0A%20%20%20%20stealth_sync%28page%29%20%20%23%20patches%20navigator.webdriver%20and%20other%20common%20headless%20fingerprint%20signals%0A%0A%20%20%20%20%23%20Simulate%20human-like%20behaviour%0A%20%20%20%20page.goto%28%22https%3A%2F%2Fprotected-site.com%22%29%0A%20%20%20%20page.mouse.move%28400%2C%20300%29%0A%20%20%20%20page.wait_for_timeout%281500%29%0A%20%20%20%20page.evaluate%28%22window.scrollBy%280%2C%20400%29%22%29%0A%20%20%20%20page.wait_for_timeout%28800%29%0A%0A%20%20%20%20browser.close%28%29\"><\/code><\/pre><\/figure>\n\n\n<h2 class=\"wp-block-heading\">Using residential proxies in Python<\/h2>\n\n\n\n<p>IP blocking is one of the most common reasons Python scrapers fail in production. Once a site identifies your IP through rate limits, datacenter ASN detection, or fingerprinting, every request from that address gets blocked. 
The most reliable solution is <a href=\"https:\/\/nodemaven.com\/ru\/proxies\/rotating-residential-proxies\/\">proxy rotation using residential IPs<\/a>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Why residential proxies, specifically?<\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><td><strong>Proxy type<\/strong><\/td><td><strong>Detection risk<\/strong><\/td><td><strong>Speed<\/strong><\/td><td><strong>Best for<\/strong><\/td><\/tr><\/thead><tbody><tr><td>Datacenter<\/td><td>\ud83d\udd34 High: ASN easily flagged<\/td><td>\ud83d\udfe2 Fast<\/td><td>Low-protection sites only<\/td><\/tr><tr><td>Residential<\/td><td>\ud83d\udfe2 Low: real ISP IPs<\/td><td>\ud83d\udfe1 Medium<\/td><td>Most e-commerce, news, data sites<\/td><\/tr><tr><td>ISP (Static Residential)<\/td><td>\ud83d\udfe2 Low: residential trust + speed<\/td><td>\ud83d\udfe2 Fast<\/td><td>Session-based scraping, login flows<\/td><\/tr><tr><td>Mobile (4G\/5G)<\/td><td>\ud83d\udfe2 Very low: carrier IPs are trusted<\/td><td>\ud83d\udfe1 Varies<\/td><td>Highly protected sites, social platforms<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p><a href=\"https:\/\/nodemaven.com\/ru\/proxies\/residential-proxies\/\">Residential proxies<\/a> route your requests through real household IP addresses assigned by ISPs, the same type of IP that a person browsing from their home uses. To a target website, the traffic looks identical to organic user activity. 
This is why they\u2019re the standard choice for serious Python web scraping.<\/p>\n\n\n<div\n\t\t\t\n\t\t\tclass=\"so-widget-rhinocore-addons-rhino-alert-banner so-widget-rhinocore-addons-rhino-alert-banner-default-d75171398898\"\n\t\t\t\n\t\t><div class=\"rhino-widget rhino-widget--rhinocore-addons-rhino-alert-banner section-alert\"    style=\"--alert-background-color: #E6E6FF\"\n>\n            <div class=\"section-alert__icon\">\n            <img decoding=\"async\" src=\"https:\/\/nodemaven.com\/wp-content\/uploads\/2026\/02\/icon-4.svg\" alt=\"\" loading=\"lazy\" width=\"64\" height=\"64\">        <\/div>\n    \n            <div class=\"section-alert__main\">\n            \n                            <div class=\"section-alert__description\"><p><strong>NodeMaven\u2019s IP Quality Filter pre-screens every IP: only clean, low-fraud addresses enter the pool<\/strong><\/p>\n<\/div>\n                    <\/div>\n    \n            <a\n            class=\"section-alert__button b-btn b-btn--static-xl b-btn--secondary-black\"\n            href=\"https:\/\/dashboard.nodemaven.com\/accounts\/signup\/?next=\/checkout\/pag\/trial&_gl=1*lri4ul*_gcl_aw*R0NMLjE3NzkyODYzNDMuQ2p3S0NBand0N1hRQmhCa0Vpd0F0U3RwcDBSV2xNVVBsMXk5M2xzV2JJUnVkT0dPRjdDc1M4enh5X2JGb0tabEZJMGtBSXFZMHFlTVdCb0MwMzBRQXZEX0J3RQ..*_gcl_au*MTk3NzAwNDQ4My4xNzcyNDc5NzU3*_ga*MTAxNzQyMTIwOC4xNzYyODcwMDE5*_ga_33JL89XFQ5*czE3NzkzNTk0MzMkbzE4MCRnMSR0MTc3OTM2MDAxNCRqNDYkbDAkaDI1MTU5Mjk0NA..\"\n             target=\"_blank\" rel=\"noopener noreferrer\">\n            Try it        <\/a>\n    <\/div>\n<\/div>\n\n\n<h3 class=\"wp-block-heading\">Start scraping safely with NodeMaven proxies<\/h3>\n\n\n\n<p><a href=\"https:\/\/nodemaven.com\/ru\/use-cases\/proxies-for-python\/\">NodeMaven\u2019s proxies for Python<\/a> draw on 30M+ pre-filtered residential IPs and deliver >98% success rates for scrapers.<\/p>\n\n\n\n<p>Every IP passes a <a 
href=\"https:\/\/nodemaven.com\/ru\/features\/ip-quality-filter\/\">quality filter<\/a> \u2014 no burned, flagged, or recycled addresses in the pool. Includes rotating and static options, SOCKS5 + HTTPS, and ZIP-level geo-targeting across <a href=\"https:\/\/nodemaven.com\/ru\/locations\/\">190+ locations.<\/a><\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Basic proxy integration with requests<\/h3>\n\n\n<figure class=\"rhino-code-snippet\" data-lang=\"python\"><button type=\"button\" class=\"rhino-code-snippet__copy\" aria-label=\"Copy code to clipboard\"><svg class=\"rhino-code-snippet__icon-copy\" viewbox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\" aria-hidden=\"true\"><rect x=\"9\" y=\"9\" width=\"13\" height=\"13\" rx=\"2\" ry=\"2\"><\/rect><path d=\"M5 15H4a2 2 0 0 1-2-2V4a2 2 0 0 1 2-2h9a2 2 0 0 1 2 2v1\"><\/path><\/svg><svg class=\"rhino-code-snippet__icon-check\" viewbox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2.5\" stroke-linecap=\"round\" stroke-linejoin=\"round\" aria-hidden=\"true\"><polyline points=\"20 6 9 17 4 12\"><\/polyline><\/svg><\/button><span class=\"rhino-code-snippet__sr\" aria-live=\"polite\"><\/span><pre class=\"line-numbers\"><code class=\"language-python\" data-rhino-code=\"import%20requests%0A%0A%23%20NodeMaven%20proxy%20credentials%0APROXY%20%3D%20%22http%3A%2F%2FUSERNAME%3APASSWORD%40proxy.nodemaven.com%3A8080%22%0A%0Aproxies%20%3D%20%7B%22http%22%3A%20PROXY%2C%20%22https%22%3A%20PROXY%7D%0A%0Aresponse%20%3D%20requests.get%28%0A%20%20%20%20%22https%3A%2F%2Fhttpbin.org%2Fip%22%2C%0A%20%20%20%20proxies%3Dproxies%2C%0A%20%20%20%20timeout%3D15%0A%29%0Aprint%28response.json%28%29%29%20%20%23%20Returns%20the%20proxy%20IP%2C%20not%20yours\"><\/code><\/pre><\/figure>\n\n\n<h3 class=\"wp-block-heading\">Rotating proxies per request<\/h3>\n\n\n\n<p>For maximum anti-detection, rotate the proxy on every single request so each one appears to come 
from a different user:<\/p>\n\n\n<figure class=\"rhino-code-snippet\" data-lang=\"python\"><button type=\"button\" class=\"rhino-code-snippet__copy\" aria-label=\"Copy code to clipboard\"><svg class=\"rhino-code-snippet__icon-copy\" viewbox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\" aria-hidden=\"true\"><rect x=\"9\" y=\"9\" width=\"13\" height=\"13\" rx=\"2\" ry=\"2\"><\/rect><path d=\"M5 15H4a2 2 0 0 1-2-2V4a2 2 0 0 1 2-2h9a2 2 0 0 1 2 2v1\"><\/path><\/svg><svg class=\"rhino-code-snippet__icon-check\" viewbox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2.5\" stroke-linecap=\"round\" stroke-linejoin=\"round\" aria-hidden=\"true\"><polyline points=\"20 6 9 17 4 12\"><\/polyline><\/svg><\/button><span class=\"rhino-code-snippet__sr\" aria-live=\"polite\"><\/span><pre class=\"line-numbers\"><code class=\"language-python\" data-rhino-code=\"import%20requests%2C%20random%2C%20time%0A%0A%23%20NodeMaven%20provides%20a%20pool%20of%20rotating%20endpoints%0APROXY_POOL%20%3D%20%5B%0A%20%20%20%20%22http%3A%2F%2FUSER%3APASS%40proxy1.nodemaven.com%3A8080%22%2C%0A%20%20%20%20%22http%3A%2F%2FUSER%3APASS%40proxy2.nodemaven.com%3A8080%22%2C%0A%20%20%20%20%22http%3A%2F%2FUSER%3APASS%40proxy3.nodemaven.com%3A8080%22%2C%0A%5D%0A%0Adef%20get_proxy%28%29%3A%0A%20%20%20%20p%20%3D%20random.choice%28PROXY_POOL%29%0A%20%20%20%20return%20%7B%22http%22%3A%20p%2C%20%22https%22%3A%20p%7D%0A%0Aurls%20%3D%20%5Bf%22https%3A%2F%2Fexample.com%2Fproduct%2F%7Bi%7D%22%20for%20i%20in%20range%281%2C%20101%29%5D%0A%0Afor%20url%20in%20urls%3A%0A%20%20%20%20try%3A%0A%20%20%20%20%20%20%20%20response%20%3D%20requests.get%28url%2C%20proxies%3Dget_proxy%28%29%2C%20timeout%3D15%29%0A%20%20%20%20%20%20%20%20print%28f%22%5B%7Bresponse.status_code%7D%5D%20%7Burl%7D%22%29%0A%20%20%20%20except%20requests.exceptions.ProxyError%20as%20e%3A%0A%20%20%20%20%20%20%20%20print%28f%22Proxy%20error%20on%20%7Burl%7D%3A%20%7Be%7D
%22%29%0A%20%20%20%20except%20requests.exceptions.Timeout%3A%0A%20%20%20%20%20%20%20%20print%28f%22Timeout%20on%20%7Burl%7D%22%29%0A%0A%20%20%20%20time.sleep%28random.uniform%280.5%2C%202.5%29%29\"><\/code><\/pre><\/figure>\n\n\n<h3 class=\"wp-block-heading\">Session-based proxies (for login flows)<\/h3>\n\n\n\n<p>When scraping behind a login \u2014 or any workflow that requires the same IP across multiple requests \u2014 use a sticky session proxy:<\/p>\n\n\n<figure class=\"rhino-code-snippet\" data-lang=\"python\"><button type=\"button\" class=\"rhino-code-snippet__copy\" aria-label=\"Copy code to clipboard\"><svg class=\"rhino-code-snippet__icon-copy\" viewbox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\" aria-hidden=\"true\"><rect x=\"9\" y=\"9\" width=\"13\" height=\"13\" rx=\"2\" ry=\"2\"><\/rect><path d=\"M5 15H4a2 2 0 0 1-2-2V4a2 2 0 0 1 2-2h9a2 2 0 0 1 2 2v1\"><\/path><\/svg><svg class=\"rhino-code-snippet__icon-check\" viewbox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2.5\" stroke-linecap=\"round\" stroke-linejoin=\"round\" aria-hidden=\"true\"><polyline points=\"20 6 9 17 4 12\"><\/polyline><\/svg><\/button><span class=\"rhino-code-snippet__sr\" aria-live=\"polite\"><\/span><pre class=\"line-numbers\"><code class=\"language-python\" 
data-rhino-code=\"import%20requests%0A%0A%23%20HEADERS%20is%20the%20dict%20from%20the%20Layer%202%20example%0Asession%20%3D%20requests.Session%28%29%0Asession.proxies%20%3D%20%7B%0A%20%20%20%20%22http%22%3A%20%20%22http%3A%2F%2FUSER%3APASS%40proxy.nodemaven.com%3A8080%22%2C%0A%20%20%20%20%22https%22%3A%20%22http%3A%2F%2FUSER%3APASS%40proxy.nodemaven.com%3A8080%22%2C%0A%7D%0Asession.headers.update%28HEADERS%29%0A%0A%23%20All%20requests%20reuse%20this%20proxy%20endpoint%3B%20the%20exit%20IP%20stays%20fixed%20only%20if%20the%20provider%20keeps%20the%20session%20sticky%0Asession.post%28%22https%3A%2F%2Fexample.com%2Flogin%22%2C%20data%3D%7B%22user%22%3A%20%22me%22%2C%20%22pass%22%3A%20%22secret%22%7D%29%0Adashboard%20%3D%20session.get%28%22https%3A%2F%2Fexample.com%2Fdashboard%22%29%0Adata_page%20%20%3D%20session.get%28%22https%3A%2F%2Fexample.com%2Fdata%22%29\"><\/code><\/pre><\/figure>\n\n\n<h3 class=\"wp-block-heading\">Geo-Targeted Proxies for Localized Data<\/h3>\n\n\n\n<p>One of the most powerful use cases for <a href=\"https:\/\/nodemaven.com\/ru\/proxies\/residential-proxies\/\">residential proxies<\/a> in Python scraping is accessing region-specific content: localized pricing, search results, product availability, or geo-blocked pages. 
NodeMaven supports ZIP-level targeting, the most granular geo-targeting available:<\/p>\n\n\n<figure class=\"rhino-code-snippet\" data-lang=\"python\"><button type=\"button\" class=\"rhino-code-snippet__copy\" aria-label=\"Copy code to clipboard\"><svg class=\"rhino-code-snippet__icon-copy\" viewbox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\" aria-hidden=\"true\"><rect x=\"9\" y=\"9\" width=\"13\" height=\"13\" rx=\"2\" ry=\"2\"><\/rect><path d=\"M5 15H4a2 2 0 0 1-2-2V4a2 2 0 0 1 2-2h9a2 2 0 0 1 2 2v1\"><\/path><\/svg><svg class=\"rhino-code-snippet__icon-check\" viewbox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2.5\" stroke-linecap=\"round\" stroke-linejoin=\"round\" aria-hidden=\"true\"><polyline points=\"20 6 9 17 4 12\"><\/polyline><\/svg><\/button><span class=\"rhino-code-snippet__sr\" aria-live=\"polite\"><\/span><pre class=\"line-numbers\"><code class=\"language-python\" data-rhino-code=\"%23%20Country-level%20targeting%0Aproxy_us%20%3D%20%22http%3A%2F%2FUSER%3APASS%40proxy.nodemaven.com%3A8080%3Fcountry%3DUS%22%0Aproxy_de%20%3D%20%22http%3A%2F%2FUSER%3APASS%40proxy.nodemaven.com%3A8080%3Fcountry%3DDE%22%0A%0A%23%20City-level%20targeting%0Aproxy_nyc%20%3D%20%22http%3A%2F%2FUSER%3APASS%40proxy.nodemaven.com%3A8080%3Fcountry%3DUS%26city%3DNewYork%22%0A%0A%23%20Compare%20prices%20across%20markets%0Afor%20country%2C%20proxy%20in%20%5B%28%22US%22%2C%20proxy_us%29%2C%20%28%22DE%22%2C%20proxy_de%29%5D%3A%0A%20%20%20%20resp%20%3D%20requests.get%28%0A%20%20%20%20%20%20%20%20%22https%3A%2F%2Fshop.example.com%2Fproduct%2F123%22%2C%0A%20%20%20%20%20%20%20%20proxies%3D%7B%22http%22%3A%20proxy%2C%20%22https%22%3A%20proxy%7D%2C%0A%20%20%20%20%20%20%20%20timeout%3D15%0A%20%20%20%20%29%0A%20%20%20%20soup%20%3D%20BeautifulSoup%28resp.text%2C%20%22lxml%22%29%0A%20%20%20%20price%20%3D%20soup.select_one%28%22.price%22%29.text%0A%20%20%20%20print%28f%22%7Bcountry%7D%3A%20%7Bprice
%7D%22%29\"><\/code><\/pre><\/figure>\n\n<div\n\t\t\t\n\t\t\tclass=\"so-widget-rhinocore-addons-rhino-alert-banner so-widget-rhinocore-addons-rhino-alert-banner-default-d75171398898\"\n\t\t\t\n\t\t><div class=\"rhino-widget rhino-widget--rhinocore-addons-rhino-alert-banner section-alert\"    style=\"--alert-background-color: #E6E6FF\"\n>\n            <div class=\"section-alert__icon\">\n            <img decoding=\"async\" src=\"https:\/\/nodemaven.com\/wp-content\/uploads\/2026\/02\/icon-4.svg\" alt=\"\" loading=\"lazy\" width=\"64\" height=\"64\">        <\/div>\n    \n            <div class=\"section-alert__main\">\n            \n                            <div class=\"section-alert__description\"><p><strong>Scrape localized prices & content with ZIP-level targeting across 190+ locations<\/strong><\/p>\n<\/div>\n                    <\/div>\n    \n            <a\n            class=\"section-alert__button b-btn b-btn--static-xl b-btn--secondary-black\"\n            href=\"https:\/\/dashboard.nodemaven.com\/accounts\/signup\/?next=\/checkout\/pag\/trial&_gl=1*lri4ul*_gcl_aw*R0NMLjE3NzkyODYzNDMuQ2p3S0NBand0N1hRQmhCa0Vpd0F0U3RwcDBSV2xNVVBsMXk5M2xzV2JJUnVkT0dPRjdDc1M4enh5X2JGb0tabEZJMGtBSXFZMHFlTVdCb0MwMzBRQXZEX0J3RQ..*_gcl_au*MTk3NzAwNDQ4My4xNzcyNDc5NzU3*_ga*MTAxNzQyMTIwOC4xNzYyODcwMDE5*_ga_33JL89XFQ5*czE3NzkzNTk0MzMkbzE4MCRnMSR0MTc3OTM2MDAxNCRqNDYkbDAkaDI1MTU5Mjk0NA..\"\n             target=\"_blank\" rel=\"noopener noreferrer\">\n            Try it        <\/a>\n    <\/div>\n<\/div>\n\n\n<h3 class=\"wp-block-heading\">Proxies with Playwright<\/h3>\n\n\n<figure class=\"rhino-code-snippet\" data-lang=\"python\"><button type=\"button\" class=\"rhino-code-snippet__copy\" aria-label=\"Copy code to clipboard\"><svg class=\"rhino-code-snippet__icon-copy\" viewbox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\" aria-hidden=\"true\"><rect 
x=\"9\" y=\"9\" width=\"13\" height=\"13\" rx=\"2\" ry=\"2\"><\/rect><path d=\"M5 15H4a2 2 0 0 1-2-2V4a2 2 0 0 1 2-2h9a2 2 0 0 1 2 2v1\"><\/path><\/svg><svg class=\"rhino-code-snippet__icon-check\" viewbox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2.5\" stroke-linecap=\"round\" stroke-linejoin=\"round\" aria-hidden=\"true\"><polyline points=\"20 6 9 17 4 12\"><\/polyline><\/svg><\/button><span class=\"rhino-code-snippet__sr\" aria-live=\"polite\"><\/span><pre class=\"line-numbers\"><code class=\"language-python\" data-rhino-code=\"from%20playwright.sync_api%20import%20sync_playwright%0A%0Aproxy_config%20%3D%20%7B%0A%20%20%20%20%22server%22%3A%20%20%20%22http%3A%2F%2Fproxy.nodemaven.com%3A8080%22%2C%0A%20%20%20%20%22username%22%3A%20%22YOUR_USERNAME%22%2C%0A%20%20%20%20%22password%22%3A%20%22YOUR_PASSWORD%22%2C%0A%7D%0A%0Awith%20sync_playwright%28%29%20as%20p%3A%0A%20%20%20%20browser%20%3D%20p.chromium.launch%28proxy%3Dproxy_config%29%0A%20%20%20%20context%20%3D%20browser.new_context%28%29%0A%20%20%20%20page%20%3D%20context.new_page%28%29%0A%20%20%20%20page.goto%28%22https%3A%2F%2Fhttpbin.org%2Fip%22%29%0A%20%20%20%20print%28page.content%28%29%29%0A%20%20%20%20browser.close%28%29\"><\/code><\/pre><\/figure>\n\n\n<h3 class=\"wp-block-heading\">Production Retry Logic<\/h3>\n\n\n<figure class=\"rhino-code-snippet\" data-lang=\"python\"><button type=\"button\" class=\"rhino-code-snippet__copy\" aria-label=\"Copy code to clipboard\"><svg class=\"rhino-code-snippet__icon-copy\" viewbox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\" aria-hidden=\"true\"><rect x=\"9\" y=\"9\" width=\"13\" height=\"13\" rx=\"2\" ry=\"2\"><\/rect><path d=\"M5 15H4a2 2 0 0 1-2-2V4a2 2 0 0 1 2-2h9a2 2 0 0 1 2 2v1\"><\/path><\/svg><svg class=\"rhino-code-snippet__icon-check\" viewbox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2.5\" stroke-linecap=\"round\" 
stroke-linejoin=\"round\" aria-hidden=\"true\"><polyline points=\"20 6 9 17 4 12\"><\/polyline><\/svg><\/button><span class=\"rhino-code-snippet__sr\" aria-live=\"polite\"><\/span><pre class=\"line-numbers\"><code class=\"language-python\" data-rhino-code=\"import%20requests%0Afrom%20requests.adapters%20import%20HTTPAdapter%0Afrom%20urllib3.util.retry%20import%20Retry%0A%0Adef%20make_session%28proxy%3A%20str%29%20-%3E%20requests.Session%3A%0A%20%20%20%20session%20%3D%20requests.Session%28%29%0A%20%20%20%20session.proxies%20%3D%20%7B%22http%22%3A%20proxy%2C%20%22https%22%3A%20proxy%7D%0A%20%20%20%20session.headers.update%28HEADERS%29%0A%0A%20%20%20%20retry%20%3D%20Retry%28%0A%20%20%20%20%20%20%20%20total%3D4%2C%0A%20%20%20%20%20%20%20%20backoff_factor%3D2%2C%20%20%20%20%20%20%20%20%20%20%20%23%20exponential%20backoff%3A%20the%20delay%20doubles%20on%20each%20retry%0A%20%20%20%20%20%20%20%20status_forcelist%3D%5B429%2C%20500%2C%20502%2C%20503%2C%20504%5D%2C%0A%20%20%20%20%20%20%20%20allowed_methods%3D%5B%22GET%22%2C%20%22POST%22%5D%2C%20%20%23%20note%3A%20retrying%20POST%20can%20repeat%20side%20effects%0A%20%20%20%20%29%0A%20%20%20%20adapter%20%3D%20HTTPAdapter%28max_retries%3Dretry%29%0A%20%20%20%20session.mount%28%22http%3A%2F%2F%22%2C%20adapter%29%0A%20%20%20%20session.mount%28%22https%3A%2F%2F%22%2C%20adapter%29%0A%20%20%20%20return%20session\"><\/code><\/pre><\/figure>\n\n\n<p>NodeMaven\u2019s IP Quality Filter sets it apart from generic proxy providers. Before an IP enters the pool, it\u2019s checked against fraud databases and scored. Only IPs with clean records and <70% fraud scores are served, meaning you get fewer 403s, fewer CAPTCHAs, and longer scraping sessions without needing to rotate as aggressively. 
<a href=\"https:\/\/nodemaven.com\/ru\/features\/ip-quality-filter\/\">Learn about the quality filter<\/a><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Scaling with Scrapy<\/h2>\n\n\n\n<p>For projects that require scraping thousands or millions of pages, or need to run on a schedule with retry logic, rate limiting, and structured data pipelines, Scrapy is the right choice. It handles concurrency, middleware, item pipelines, and deployment out of the box.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Quick Setup<\/h3>\n\n\n<figure class=\"rhino-code-snippet\" data-lang=\"python\"><button type=\"button\" class=\"rhino-code-snippet__copy\" aria-label=\"Copy code to clipboard\"><svg class=\"rhino-code-snippet__icon-copy\" viewbox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\" aria-hidden=\"true\"><rect x=\"9\" y=\"9\" width=\"13\" height=\"13\" rx=\"2\" ry=\"2\"><\/rect><path d=\"M5 15H4a2 2 0 0 1-2-2V4a2 2 0 0 1 2-2h9a2 2 0 0 1 2 2v1\"><\/path><\/svg><svg class=\"rhino-code-snippet__icon-check\" viewbox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2.5\" stroke-linecap=\"round\" stroke-linejoin=\"round\" aria-hidden=\"true\"><polyline points=\"20 6 9 17 4 12\"><\/polyline><\/svg><\/button><span class=\"rhino-code-snippet__sr\" aria-live=\"polite\"><\/span><pre class=\"line-numbers\"><code class=\"language-python\" data-rhino-code=\"pip%20install%20scrapy%0Ascrapy%20startproject%20bookscrawler%0Acd%20bookscrawler%0Ascrapy%20genspider%20books%20books.toscrape.com\"><\/code><\/pre><\/figure>\n\n\n<h3 class=\"wp-block-heading\">Production spider with proxy middleware<\/h3>\n\n\n<figure class=\"rhino-code-snippet\" data-lang=\"python\"><button type=\"button\" class=\"rhino-code-snippet__copy\" aria-label=\"Copy code to clipboard\"><svg class=\"rhino-code-snippet__icon-copy\" viewbox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2\" stroke-linecap=\"round\" 
stroke-linejoin=\"round\" aria-hidden=\"true\"><rect x=\"9\" y=\"9\" width=\"13\" height=\"13\" rx=\"2\" ry=\"2\"><\/rect><path d=\"M5 15H4a2 2 0 0 1-2-2V4a2 2 0 0 1 2-2h9a2 2 0 0 1 2 2v1\"><\/path><\/svg><svg class=\"rhino-code-snippet__icon-check\" viewbox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2.5\" stroke-linecap=\"round\" stroke-linejoin=\"round\" aria-hidden=\"true\"><polyline points=\"20 6 9 17 4 12\"><\/polyline><\/svg><\/button><span class=\"rhino-code-snippet__sr\" aria-live=\"polite\"><\/span><pre class=\"line-numbers\"><code class=\"language-python\" data-rhino-code=\"import%20scrapy%0A%0Aclass%20BooksSpider%28scrapy.Spider%29%3A%0A%20%20%20%20name%20%3D%20%22books%22%0A%20%20%20%20start_urls%20%3D%20%5B%22https%3A%2F%2Fbooks.toscrape.com%2F%22%5D%0A%0A%20%20%20%20custom_settings%20%3D%20%7B%0A%20%20%20%20%20%20%20%20%22DOWNLOAD_DELAY%22%3A%20%20%20%20%20%20%20%20%201.5%2C%0A%20%20%20%20%20%20%20%20%22CONCURRENT_REQUESTS%22%3A%20%20%20%208%2C%0A%20%20%20%20%20%20%20%20%22AUTOTHROTTLE_ENABLED%22%3A%20%20%20True%2C%0A%20%20%20%20%20%20%20%20%22AUTOTHROTTLE_MAX_DELAY%22%3A%2010%2C%0A%20%20%20%20%20%20%20%20%22RETRY_TIMES%22%3A%20%20%20%20%20%20%20%20%20%20%20%203%2C%0A%20%20%20%20%20%20%20%20%22RETRY_HTTP_CODES%22%3A%20%20%20%20%20%20%20%5B429%2C%20500%2C%20503%5D%2C%0A%20%20%20%20%20%20%20%20%22FEEDS%22%3A%20%7B%22books.json%22%3A%20%7B%22format%22%3A%20%22json%22%7D%7D%2C%0A%20%20%20%20%7D%0A%0A%20%20%20%20def%20parse%28self%2C%20response%29%3A%0A%20%20%20%20%20%20%20%20for%20book%20in%20response.css%28%22article.product_pod%22%29%3A%0A%20%20%20%20%20%20%20%20%20%20%20%20yield%20%7B%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%22title%22%3A%20%20book.css%28%22h3%20a%3A%3Aattr%28title%29%22%29.get%28%29%2C%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%22price%22%3A%20%20book.css%28%22p.price_color%3A%3Atext%22%29.get%28%29%2C%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%22rating%22%3A%20book.css%28%22p.star-rati
ng%3A%3Aattr%28class%29%22%29.get%28%29.split%28%29%5B-1%5D%2C%0A%20%20%20%20%20%20%20%20%20%20%20%20%7D%0A%0A%20%20%20%20%20%20%20%20nxt%20%3D%20response.css%28%22li.next%20a%3A%3Aattr%28href%29%22%29.get%28%29%0A%20%20%20%20%20%20%20%20if%20nxt%3A%0A%20%20%20%20%20%20%20%20%20%20%20%20yield%20response.follow%28nxt%2C%20self.parse%29\"><\/code><\/pre><\/figure>\n\n<figure class=\"rhino-code-snippet\" data-lang=\"python\"><button type=\"button\" class=\"rhino-code-snippet__copy\" aria-label=\"Copy code to clipboard\"><svg class=\"rhino-code-snippet__icon-copy\" viewbox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\" aria-hidden=\"true\"><rect x=\"9\" y=\"9\" width=\"13\" height=\"13\" rx=\"2\" ry=\"2\"><\/rect><path d=\"M5 15H4a2 2 0 0 1-2-2V4a2 2 0 0 1 2-2h9a2 2 0 0 1 2 2v1\"><\/path><\/svg><svg class=\"rhino-code-snippet__icon-check\" viewbox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2.5\" stroke-linecap=\"round\" stroke-linejoin=\"round\" aria-hidden=\"true\"><polyline points=\"20 6 9 17 4 12\"><\/polyline><\/svg><\/button><span class=\"rhino-code-snippet__sr\" aria-live=\"polite\"><\/span><pre class=\"line-numbers\"><code class=\"language-python\" 
data-rhino-code=\"import%20random%0A%0Aclass%20NodeMavenProxyMiddleware%3A%0A%20%20%20%20proxies%20%3D%20%5B%0A%20%20%20%20%20%20%20%20%22http%3A%2F%2FUSER%3APASS%40proxy1.nodemaven.com%3A8080%22%2C%0A%20%20%20%20%20%20%20%20%22http%3A%2F%2FUSER%3APASS%40proxy2.nodemaven.com%3A8080%22%2C%0A%20%20%20%20%20%20%20%20%22http%3A%2F%2FUSER%3APASS%40proxy3.nodemaven.com%3A8080%22%2C%0A%20%20%20%20%5D%0A%0A%20%20%20%20def%20process_request%28self%2C%20request%2C%20spider%29%3A%0A%20%20%20%20%20%20%20%20request.meta%5B%22proxy%22%5D%20%3D%20random.choice%28self.proxies%29%0A%0A%23%20settings.py%20additions%3A%0A%23%20DOWNLOADER_MIDDLEWARES%20%3D%20%7B%0A%23%20%20%20%22bookscrawler.middlewares.NodeMavenProxyMiddleware%22%3A%20100%2C%0A%23%20%20%20%22scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware%22%3A%20110%2C%0A%23%20%7D\"><\/code><\/pre><\/figure>\n\n<figure class=\"rhino-code-snippet\" data-lang=\"python\"><button type=\"button\" class=\"rhino-code-snippet__copy\" aria-label=\"Copy code to clipboard\"><svg class=\"rhino-code-snippet__icon-copy\" viewbox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\" aria-hidden=\"true\"><rect x=\"9\" y=\"9\" width=\"13\" height=\"13\" rx=\"2\" ry=\"2\"><\/rect><path d=\"M5 15H4a2 2 0 0 1-2-2V4a2 2 0 0 1 2-2h9a2 2 0 0 1 2 2v1\"><\/path><\/svg><svg class=\"rhino-code-snippet__icon-check\" viewbox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2.5\" stroke-linecap=\"round\" stroke-linejoin=\"round\" aria-hidden=\"true\"><polyline points=\"20 6 9 17 4 12\"><\/polyline><\/svg><\/button><span class=\"rhino-code-snippet__sr\" aria-live=\"polite\"><\/span><pre class=\"line-numbers\"><code class=\"language-python\" data-rhino-code=\"scrapy%20crawl%20books%20-o%20books.csv%0Ascrapy%20crawl%20books%20-o%20books.json%0Ascrapy%20crawl%20books%20-s%20LOG_LEVEL%3DWARNING%20%20%23%20quieter%20output\"><\/code><\/pre><\/figure>\n\n\n<h2 
class=\"wp-block-heading\">Debugging & error handling<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><td><strong>Error \/ Symptom<\/strong><\/td><td><strong>Likely cause<\/strong><\/td><td><strong>Fix<\/strong><\/td><\/tr><\/thead><tbody><tr><td>403 Forbidden<\/td><td>Missing headers or IP blocked<\/td><td>Add full headers; switch proxy<\/td><\/tr><tr><td>429 Too Many Requests<\/td><td>Rate limit hit<\/td><td>Add\/increase delays; rotate proxies<\/td><\/tr><tr><td>AttributeError: \u2018NoneType\u2019<\/td><td>select_one() returned nothing<\/td><td>Print raw HTML; verify selector in DevTools<\/td><\/tr><tr><td>Empty list from select()<\/td><td>JS-rendered content<\/td><td>Switch to Playwright; check XHR for API<\/td><\/tr><tr><td>CAPTCHA page returned<\/td><td>Bot detection triggered<\/td><td>Residential proxies + stealth headers<\/td><\/tr><tr><td>ConnectionError \/ ProxyError<\/td><td>Proxy failure or timeout<\/td><td>Retry logic; test proxy with httpbin.org<\/td><\/tr><tr><td>Data looks wrong or truncated<\/td><td>Wrong selector or encoding<\/td><td>Print soup.prettify(); check response.encoding<\/td><\/tr><tr><td>SSLError<\/td><td>Certificate issue<\/td><td>verify=False (dev only) or update certs<\/td><\/tr><tr><td>Playwright timeout<\/td><td>Selector never appeared (JS failed)<\/td><td>Increase timeout; add networkidle wait<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n<div\n\t\t\t\n\t\t\tclass=\"so-widget-rhinocore-addons-rhino-alert-banner so-widget-rhinocore-addons-rhino-alert-banner-default-d75171398898\"\n\t\t\t\n\t\t><div class=\"rhino-widget rhino-widget--rhinocore-addons-rhino-alert-banner section-alert\"    style=\"--alert-background-color: #E6E6FF\"\n>\n            <div class=\"section-alert__icon\">\n            <img decoding=\"async\" 
src=\"https:\/\/nodemaven.com\/wp-content\/uploads\/2026\/02\/icon-4.svg\" alt=\"\" loading=\"lazy\" width=\"64\" height=\"64\">        <\/div>\n    \n            <div class=\"section-alert__main\">\n                            <div class=\"section-alert__title\">Stop getting 403 errors. NodeMaven residential IPs look identical to real browser traffic<\/div>\n            \n                            <div class=\"section-alert__description\"><p>Rotating proxies with >98% stable performance \u2014 built for Python web scraping at scale<\/p>\n<\/div>\n                    <\/div>\n    \n            <a\n            class=\"section-alert__button b-btn b-btn--static-xl b-btn--secondary-black\"\n            href=\"https:\/\/dashboard.nodemaven.com\/accounts\/signup\/?next=\/checkout\/pag\/trial&_gl=1*lri4ul*_gcl_aw*R0NMLjE3NzkyODYzNDMuQ2p3S0NBand0N1hRQmhCa0Vpd0F0U3RwcDBSV2xNVVBsMXk5M2xzV2JJUnVkT0dPRjdDc1M4enh5X2JGb0tabEZJMGtBSXFZMHFlTVdCb0MwMzBRQXZEX0J3RQ..*_gcl_au*MTk3NzAwNDQ4My4xNzcyNDc5NzU3*_ga*MTAxNzQyMTIwOC4xNzYyODcwMDE5*_ga_33JL89XFQ5*czE3NzkzNTk0MzMkbzE4MCRnMSR0MTc3OTM2MDAxNCRqNDYkbDAkaDI1MTU5Mjk0NA..\"\n             target=\"_blank\" rel=\"noopener noreferrer\">\n            Try it        <\/a>\n    <\/div>\n<\/div>\n\n\n<h3 class=\"wp-block-heading\">The Golden Debug Rule<\/h3>\n\n\n\n<p>When a selector returns nothing, the first thing to do is print what you actually received \u2014 not what you expected:<\/p>\n\n\n<figure class=\"rhino-code-snippet\" data-lang=\"python\"><button type=\"button\" class=\"rhino-code-snippet__copy\" aria-label=\"Copy code to clipboard\"><svg class=\"rhino-code-snippet__icon-copy\" viewbox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\" aria-hidden=\"true\"><rect x=\"9\" y=\"9\" width=\"13\" height=\"13\" rx=\"2\" ry=\"2\"><\/rect><path d=\"M5 15H4a2 2 0 0 1-2-2V4a2 2 0 0 1 2-2h9a2 2 0 0 1 2 
2v1\"><\/path><\/svg><svg class=\"rhino-code-snippet__icon-check\" viewbox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2.5\" stroke-linecap=\"round\" stroke-linejoin=\"round\" aria-hidden=\"true\"><polyline points=\"20 6 9 17 4 12\"><\/polyline><\/svg><\/button><span class=\"rhino-code-snippet__sr\" aria-live=\"polite\"><\/span><pre class=\"line-numbers\"><code class=\"language-python\" data-rhino-code=\"%23%20Step%201%3A%20What%20did%20we%20actually%20get%3F%0Aprint%28response.status_code%29%0Aprint%28response.url%29%20%20%20%20%20%20%20%20%20%20%20%23%20redirected%3F%0Aprint%28response.text%5B%3A2000%5D%29%20%20%23%20first%202000%20chars%0A%0A%23%20Step%202%3A%20Does%20the%20selector%20return%20anything%3F%0Aresults%20%3D%20soup.select%28%22.my-class%22%29%0Aprint%28f%22Found%20%7Blen%28results%29%7D%20elements%22%29%0A%0A%23%20Step%203%3A%20If%20zero%2C%20check%20what%27s%20actually%20on%20the%20page%0Aprint%28soup.prettify%28%29%5B%3A3000%5D%29%0A%0A%23%20Common%20result%3A%20you%27re%20getting%20a%20CAPTCHA%20page%20or%20%22Access%20Denied%22%0A%23%20%E2%86%92%20fix%3A%20residential%20proxy%20%2B%20proper%20headers\"><\/code><\/pre><\/figure>\n\n\n<h2 class=\"wp-block-heading\">Complete cheat sheet<\/h2>\n\n\n<figure class=\"rhino-code-snippet\" data-lang=\"python\"><button type=\"button\" class=\"rhino-code-snippet__copy\" aria-label=\"Copy code to clipboard\"><svg class=\"rhino-code-snippet__icon-copy\" viewbox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\" aria-hidden=\"true\"><rect x=\"9\" y=\"9\" width=\"13\" height=\"13\" rx=\"2\" ry=\"2\"><\/rect><path d=\"M5 15H4a2 2 0 0 1-2-2V4a2 2 0 0 1 2-2h9a2 2 0 0 1 2 2v1\"><\/path><\/svg><svg class=\"rhino-code-snippet__icon-check\" viewbox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2.5\" stroke-linecap=\"round\" stroke-linejoin=\"round\" aria-hidden=\"true\"><polyline points=\"20 6 9 17 4 
12\"><\/polyline><\/svg><\/button><span class=\"rhino-code-snippet__sr\" aria-live=\"polite\"><\/span><pre class=\"line-numbers\"><code class=\"language-python\" data-rhino-code=\"%23%20%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%0A%23%20PYTHON%20WEB%20SCRAPING%20CHEAT%20SHEET%202026%0A%23%20%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%E2%95%90%0A%0Aimport%20requests%2C%20time%2C%20random%2C%20csv%2C%20json%2C%20sqlite3%0Afrom%20bs4%20import%20BeautifulSoup%0Afrom%20lxml%20import%20html%20as%20lxml_html%0Afrom%20urllib.parse%20import%20urljoin%0Afrom%20requests.adapters%20import%20HTTPAdapter%0Afrom%20urllib3.util.retry%20import%20Retry%0A%0A%23%20%E2%94%80%E2%94%80%20STATIC%20PAGE%20%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%8
0%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%0AHEADERS%20%3D%20%7B%0A%20%20%20%20%22User-Agent%22%3A%20%22Mozilla%2F5.0%20%28Windows%20NT%2010.0%3B%20Win64%3B%20x64%29%20Chrome%2F124.0.0.0%22%2C%0A%20%20%20%20%22Accept%22%3A%20%22text%2Fhtml%2Capplication%2Fxhtml%2Bxml%3Bq%3D0.9%2C%2A%2F%2A%3Bq%3D0.8%22%2C%0A%20%20%20%20%22Accept-Language%22%3A%20%22en-US%2Cen%3Bq%3D0.9%22%2C%0A%7D%0Ar%20%3D%20requests.get%28url%2C%20headers%3DHEADERS%2C%20timeout%3D10%29%0Ar.raise_for_status%28%29%0Asoup%20%3D%20BeautifulSoup%28r.text%2C%20%22lxml%22%29%0A%0Asoup.select_one%28%22div.class%22%29%20%20%20%20%20%20%20%23%20first%20match%0Asoup.select%28%22ul%20li%22%29%20%20%20%20%20%20%20%20%20%20%20%20%20%20%23%20all%20matches%0Ael.text.strip%28%29%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%23%20inner%20text%0Ael%5B%22href%22%5D%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%23%20attribute%0Ael.get%28%22href%22%2C%20%22%22%29%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%23%20safe%20get%0A%0A%23%20%E2%94%80%E2%94%80%20PROXIES%20%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%0APROXY%20%3D%20%22http%3A%2F%2FUSER%3APASS%40proxy.nodemaven.com%3A8080%22%0Ar%20%3D%20requests.get%28url%2C%20proxies%3D%7B%22http%22%3A%20PROXY%2C%20%22https%22%3A%20PROXY%7D%2C%20timeout%3D15%29%0A%0A%23%20%E2%94%80%E2%94%80%20JS%20PAGES%20%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%
94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%0Afrom%20playwright.sync_api%20import%20sync_playwright%0Awith%20sync_playwright%28%29%20as%20p%3A%0A%20%20%20%20browser%20%3D%20p.chromium.launch%28headless%3DTrue%29%0A%20%20%20%20page%20%3D%20browser.new_page%28%29%0A%20%20%20%20page.goto%28url%2C%20wait_until%3D%22networkidle%22%29%0A%20%20%20%20page.wait_for_selector%28%22.target%22%29%0A%20%20%20%20text%20%3D%20page.query_selector%28%22.target%22%29.inner_text%28%29%0A%20%20%20%20browser.close%28%29%0A%0A%23%20%E2%94%80%E2%94%80%20PAGINATION%20%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%0Awhile%20url%3A%0A%20%20%20%20soup%20%3D%20BeautifulSoup%28requests.get%28url%29.text%2C%20%22lxml%22%29%0A%20%20%20%20%23%20...%20extract%20...%0A%20%20%20%20nxt%20%3D%20soup.select_one%28%22a%5Brel%3D%27next%27%5D%22%29%0A%20%20%20%20url%20%3D%20urljoin%28base%2C%20nxt%5B%22href%22%5D%29%20if%20nxt%20else%20None%0A%20%20%20%20time.sleep%28random.uniform%281%2C%203%29%29%0A%0A%23%20%E2%94%80%E2%94%80%20SAVE%20CSV%20%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%9
4%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%0Awith%20open%28%22out.csv%22%2C%20%22w%22%2C%20newline%3D%22%22%29%20as%20f%3A%0A%20%20%20%20w%20%3D%20csv.DictWriter%28f%2C%20fieldnames%3Ddata%5B0%5D.keys%28%29%29%0A%20%20%20%20w.writeheader%28%29%3B%20w.writerows%28data%29%0A%0A%23%20%E2%94%80%E2%94%80%20RETRY%20SESSION%20%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%0As%20%3D%20requests.Session%28%29%0As.mount%28%22https%3A%2F%2F%22%2C%20HTTPAdapter%28max_retries%3DRetry%28%0A%20%20%20%20total%3D3%2C%20backoff_factor%3D2%2C%20status_forcelist%3D%5B429%2C%20500%2C%20503%5D%0A%29%29%29\"><\/code><\/pre><\/figure>\n\n<div\n\t\t\t\n\t\t\tclass=\"so-widget-rhinocore-addons-rhino-alert-banner so-widget-rhinocore-addons-rhino-alert-banner-default-d75171398898\"\n\t\t\t\n\t\t><div class=\"rhino-widget rhino-widget--rhinocore-addons-rhino-alert-banner section-alert\"    style=\"--alert-background-color: #E6E6FF\"\n>\n            <div class=\"section-alert__icon\">\n            <img decoding=\"async\" src=\"https:\/\/nodemaven.com\/wp-content\/uploads\/2026\/02\/icon-4.svg\" alt=\"\" loading=\"lazy\" width=\"64\" height=\"64\">        <\/div>\n    \n            <div class=\"section-alert__main\">\n                            <div class=\"section-alert__title\">Scraping social platforms or heavily protected sites? 
Use NodeMaven 5G\/LTE mobile proxies<\/div>\n            \n                            <div class=\"section-alert__description\"><p>Carrier-grade IPs with 24h+ sessions and guaranteed quality \u2014 the lowest detection risk available<\/p>\n<\/div>\n                    <\/div>\n    \n            <a\n            class=\"section-alert__button b-btn b-btn--static-xl b-btn--secondary-black\"\n            href=\"https:\/\/dashboard.nodemaven.com\/accounts\/signup\/?next=\/checkout\/pag\/trial&_gl=1*lri4ul*_gcl_aw*R0NMLjE3NzkyODYzNDMuQ2p3S0NBand0N1hRQmhCa0Vpd0F0U3RwcDBSV2xNVVBsMXk5M2xzV2JJUnVkT0dPRjdDc1M4enh5X2JGb0tabEZJMGtBSXFZMHFlTVdCb0MwMzBRQXZEX0J3RQ..*_gcl_au*MTk3NzAwNDQ4My4xNzcyNDc5NzU3*_ga*MTAxNzQyMTIwOC4xNzYyODcwMDE5*_ga_33JL89XFQ5*czE3NzkzNTk0MzMkbzE4MCRnMSR0MTc3OTM2MDAxNCRqNDYkbDAkaDI1MTU5Mjk0NA..\"\n             target=\"_blank\" rel=\"noopener noreferrer\">\n            Try it        <\/a>\n    <\/div>\n<\/div>\n\n\n<h2 class=\"wp-block-heading\">Frequently Asked Questions About Python Web Scraping<\/h2>\n\n\n<div\n\t\t\t\n\t\t\tclass=\"so-widget-rhinocore-addons-faq so-widget-rhinocore-addons-faq-default-d75171398898\"\n\t\t\t\n\t\t>    <div class=\"rhino-widget rhino-widget--rhinocore-addons-faq section-faq\">\n        <div class=\"section-faq__list section-faq__list--columns-1\" role=\"list\" aria-label=\"Frequently Asked Questions About Python Web Scraping\">\n                            <div class=\"section-faq__column\">\n                                            <div class=\"section-faq__item\" 
data-accordion=\"wrapper\" data-accordion-group=\"faq\" role=\"listitem\">\n                            <button class=\"section-faq__trigger\" data-accordion=\"trigger\" type=\"button\" aria-expanded=\"false\">\n                                <span class=\"section-faq__question\">What is the best Python library for web scraping in 2026?<\/span>\n                                <svg width=\"28\" height=\"28\" viewbox=\"0 0 28 28\" fill=\"none\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\">\n                                    <path d=\"M7 10.5L14 17.5L21 10.5\" stroke=\"#5D5D5D\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\" \/>\n                                <\/svg>\n                            <\/button>\n                            <div class=\"section-faq__content\">\n                                <div class=\"section-faq__answer\">\n                                    <p>For static pages,\u00a0<strong>requests + BeautifulSoup<\/strong>\u00a0is the most beginner-friendly combination and covers the majority of scraping targets. For JavaScript-rendered sites,\u00a0<strong>Playwright<\/strong>\u00a0is now generally preferred over Selenium: it\u2019s faster, supports async, and has a cleaner API. For large-scale production crawls involving thousands of pages,\u00a0<strong>Scrapy<\/strong>\u00a0provides built-in concurrency, retry logic, and pipeline management.<\/p>\n<p>If you\u2019re being blocked by Cloudflare, use\u00a0<em>curl_cffi<\/em>, which impersonates a real browser\u2019s TLS fingerprint. 
For the hardest targets, Playwright with\u00a0<em>playwright-stealth<\/em>\u00a0and <a href=\"https:\/\/nodemaven.com\/ru\/proxies\/residential-proxies\/\">residential proxies<\/a> is the combination most likely to work.<\/p>\n                                <\/div>\n                            <\/div>\n                        <\/div>\n                                            <div class=\"section-faq__item\" data-accordion=\"wrapper\" data-accordion-group=\"faq\" role=\"listitem\">\n                            <button class=\"section-faq__trigger\" data-accordion=\"trigger\" type=\"button\" aria-expanded=\"false\">\n                                <span class=\"section-faq__question\">Why do I keep getting 403 errors even with a User-Agent set?<\/span>\n                                <svg width=\"28\" height=\"28\" viewbox=\"0 0 28 28\" fill=\"none\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\">\n                                    <path d=\"M7 10.5L14 17.5L21 10.5\" stroke=\"#5D5D5D\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\" \/>\n                                <\/svg>\n                            <\/button>\n                            <div class=\"section-faq__content\">\n                                <div class=\"section-faq__answer\">\n                                    <p>A User-Agent alone is not enough. Modern anti-bot systems check multiple signals simultaneously: TLS fingerprint (before headers are read), the full set of HTTP headers (not just User-Agent), IP reputation, and request timing patterns.<\/p>\n<p>The most common fix in 2026 is to switch from\u00a0requests\u00a0to\u00a0<em>curl_cffi<\/em>, which mimics a browser\u2019s TLS handshake,\u00a0<em>and<\/em>\u00a0to send a full header set including\u00a0<em>Accept<\/em>,\u00a0<em>Accept-Language<\/em>, and the\u00a0<em>Sec-Fetch-*<\/em>\u00a0headers. 
If you\u2019re still getting 403s, the IP is likely flagged; switching to residential proxies usually fixes this.<\/p>\n                                <\/div>\n                            <\/div>\n                        <\/div>\n                                            <div class=\"section-faq__item\" data-accordion=\"wrapper\" data-accordion-group=\"faq\" role=\"listitem\">\n                            <button class=\"section-faq__trigger\" data-accordion=\"trigger\" type=\"button\" aria-expanded=\"false\">\n                                <span class=\"section-faq__question\">What\u2019s the difference between rotating and static residential proxies?<\/span>\n                                <svg width=\"28\" height=\"28\" viewbox=\"0 0 28 28\" fill=\"none\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\">\n                                    <path d=\"M7 10.5L14 17.5L21 10.5\" stroke=\"#5D5D5D\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\" \/>\n                                <\/svg>\n                            <\/button>\n                            <div class=\"section-faq__content\">\n                                <div class=\"section-faq__answer\">\n                                    <p><a href=\"https:\/\/nodemaven.com\/ru\/proxies\/rotating-residential-proxies\/\"><strong>Rotating residential proxies<\/strong><\/a>\u00a0give you a different IP address on each request (or each session, depending on configuration). 
This is ideal for high-volume scraping where you want maximum anonymity and can\u2019t afford to have any single IP associated with your traffic pattern.<\/p>\n<p><strong>Static residential proxies<\/strong>\u00a0(also called <a href=\"https:\/\/nodemaven.com\/ru\/proxies\/isp-proxies\/\">ISP proxies<\/a>) give you a persistent IP that stays the same across requests. These are better for login-based scraping, multi-step workflows, or any task where the website needs to maintain a consistent session identity. NodeMaven offers both, with static ISP proxies running 5x faster than standard residential while maintaining the same low fraud scores.<\/p>\n                                <\/div>\n                            <\/div>\n                        <\/div>\n                                            <div class=\"section-faq__item\" data-accordion=\"wrapper\" data-accordion-group=\"faq\" role=\"listitem\">\n                            <button class=\"section-faq__trigger\" data-accordion=\"trigger\" type=\"button\" aria-expanded=\"false\">\n                                <span class=\"section-faq__question\">How do I scrape websites that use infinite scroll?<\/span>\n                                <svg width=\"28\" height=\"28\" viewbox=\"0 0 28 28\" fill=\"none\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\">\n                                    <path d=\"M7 10.5L14 17.5L21 10.5\" stroke=\"#5D5D5D\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\" \/>\n                                <\/svg>\n                            <\/button>\n                            <div class=\"section-faq__content\">\n                                <div class=\"section-faq__answer\">\n                                    <p>First, check the Network tab in DevTools as you scroll \u2014 most 
infinite-scroll sites make a background XHR\/Fetch request to an API endpoint that returns JSON. Calling that endpoint directly with\u00a0<em>requests<\/em>\u00a0is far more reliable than trying to automate scrolling.<\/p>\n                                <\/div>\n                            <\/div>\n                        <\/div>\n                                            <div class=\"section-faq__item\" data-accordion=\"wrapper\" data-accordion-group=\"faq\" role=\"listitem\">\n                            <button class=\"section-faq__trigger\" data-accordion=\"trigger\" type=\"button\" aria-expanded=\"false\">\n                                <span class=\"section-faq__question\">Is Python good for web scraping in 2026?<\/span>\n                                <svg width=\"28\" height=\"28\" viewbox=\"0 0 28 28\" fill=\"none\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\">\n                                    <path d=\"M7 10.5L14 17.5L21 10.5\" stroke=\"#5D5D5D\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\" \/>\n                                <\/svg>\n                            <\/button>\n                            <div class=\"section-faq__content\">\n                                <div class=\"section-faq__answer\">\n                                    <p>Yes \u2014 Python remains the industry standard for modern web scraping in 2026 because it combines beginner-friendly syntax with one of the largest ecosystems of scraping libraries available. Python web scraping workflows can handle everything from simple HTML extraction to large-scale browser automation, async crawling, and anti-bot bypassing.<\/p>\n<p>For static pages, libraries like requests and BeautifulSoup are usually enough. For JavaScript-heavy websites, Playwright has become the preferred choice for web scraping with Python because it can automate a full browser and render dynamic content reliably. 
For production pipelines involving thousands of pages, Scrapy provides concurrency, retry systems, and built-in throttling.<\/p>\n                                <\/div>\n                            <\/div>\n                        <\/div>\n                                            <div class=\"section-faq__item\" data-accordion=\"wrapper\" data-accordion-group=\"faq\" role=\"listitem\">\n                            <button class=\"section-faq__trigger\" data-accordion=\"trigger\" type=\"button\" aria-expanded=\"false\">\n                                <span class=\"section-faq__question\">What is the best Python web scraping stack for beginners?<\/span>\n                                <svg width=\"28\" height=\"28\" viewbox=\"0 0 28 28\" fill=\"none\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\">\n                                    <path d=\"M7 10.5L14 17.5L21 10.5\" stroke=\"#5D5D5D\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\" \/>\n                                <\/svg>\n                            <\/button>\n                            <div class=\"section-faq__content\">\n                                <div class=\"section-faq__answer\">\n                                    <p>The easiest way to start a Python web scraping project is with:<\/p>\n<ol>\n<li>requests \u2014 download page HTML<\/li>\n<li>BeautifulSoup \u2014 parse HTML and extract data<\/li>\n<li>CSV or pandas \u2014 save scraped data<\/li>\n<\/ol>\n<p>This stack is lightweight, beginner-friendly, and ideal for learning selectors, pagination, and data extraction. 
Most Python web scraping tutorials start here before moving on to browser automation or large-scale crawling.<\/p>\n                                <\/div>\n                            <\/div>\n                        <\/div>\n                                            <div class=\"section-faq__item\" data-accordion=\"wrapper\" data-accordion-group=\"faq\" role=\"listitem\">\n                            <button class=\"section-faq__trigger\" data-accordion=\"trigger\" type=\"button\" aria-expanded=\"false\">\n                                <span class=\"section-faq__question\">What\u2019s the best BeautifulSoup workflow for web scraping in Python?<\/span>\n                                <svg width=\"28\" height=\"28\" viewbox=\"0 0 28 28\" fill=\"none\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\">\n                                    <path d=\"M7 10.5L14 17.5L21 10.5\" stroke=\"#5D5D5D\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\" \/>\n                                <\/svg>\n                            <\/button>\n                            <div class=\"section-faq__content\">\n                                <div class=\"section-faq__answer\">\n                                    <p>The most common BeautifulSoup workflow for web scraping in Python looks like this:<\/p>\n<ol>\n<li>Send an HTTP request with requests<\/li>\n<li>Parse the HTML using BeautifulSoup<\/li>\n<li>Locate elements with CSS selectors<\/li>\n<li>Clean and normalize extracted data<\/li>\n<li>Export to CSV, JSON, or a database<\/li>\n<\/ol>\n                                <\/div>\n                            <\/div>\n                        <\/div>\n                                            <div class=\"section-faq__item\" data-accordion=\"wrapper\" data-accordion-group=\"faq\" role=\"listitem\">\n                            <button class=\"section-faq__trigger\" data-accordion=\"trigger\" type=\"button\" aria-expanded=\"false\">\n                                
<span class=\"section-faq__question\">Can Python scrape JavaScript-rendered websites?<\/span>\n                                <svg width=\"28\" height=\"28\" viewbox=\"0 0 28 28\" fill=\"none\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\">\n                                    <path d=\"M7 10.5L14 17.5L21 10.5\" stroke=\"#5D5D5D\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\" \/>\n                                <\/svg>\n                            <\/button>\n                            <div class=\"section-faq__content\">\n                                <div class=\"section-faq__answer\">\n                                    <p>Yes \u2014 modern web scraping in Python often involves JavaScript-rendered websites built with React, Vue, or Next.js. Traditional requests-based scrapers only download the initial HTML response, which may contain little or no actual data.<\/p>\n<p>For dynamic websites, the preferred solution is Playwright. It launches a real browser, executes JavaScript, waits for content to render, and then extracts the final page state.<\/p>\n                                <\/div>\n                            <\/div>\n                        <\/div>\n                                            <div class=\"section-faq__item\" data-accordion=\"wrapper\" data-accordion-group=\"faq\" role=\"listitem\">\n                            <button class=\"section-faq__trigger\" data-accordion=\"trigger\" type=\"button\" aria-expanded=\"false\">\n                                <span class=\"section-faq__question\">Can I use Python to scrape Google search results?<\/span>\n                                <svg width=\"28\" height=\"28\" viewbox=\"0 0 28 28\" fill=\"none\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\">\n                                    <path d=\"M7 10.5L14 17.5L21 10.5\" stroke=\"#5D5D5D\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\" \/>\n                                <\/svg>\n                            
<\/button>\n                            <div class=\"section-faq__content\">\n                                <div class=\"section-faq__answer\">\n                                    <p>Technically yes, but Google\u2019s anti-bot systems are among the most sophisticated in existence. Scraping Google directly with a standard Python script will get you blocked almost immediately. You\u2019ll need residential proxies with aggressive rotation, TLS fingerprint spoofing via\u00a0<em>curl_cffi<\/em>, and CAPTCHA handling.<\/p>\n<p>For most use cases, using the official Google Search API or a third-party SERP API is far more reliable and cost-effective than building and maintaining your own Google scraper.<\/p>\n                                <\/div>\n                            <\/div>\n                        <\/div>\n                                            <div class=\"section-faq__item\" data-accordion=\"wrapper\" data-accordion-group=\"faq\" role=\"listitem\">\n                            <button class=\"section-faq__trigger\" data-accordion=\"trigger\" type=\"button\" aria-expanded=\"false\">\n                                <span class=\"section-faq__question\">How many requests per second can I send before getting blocked?<\/span>\n                                <svg width=\"28\" height=\"28\" viewbox=\"0 0 28 28\" fill=\"none\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\">\n                                    <path d=\"M7 10.5L14 17.5L21 10.5\" stroke=\"#5D5D5D\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\" \/>\n                                <\/svg>\n                            <\/button>\n                            <div class=\"section-faq__content\">\n                                <div class=\"section-faq__answer\">\n                                    <p>There\u2019s no universal answer \u2014 it depends entirely on the target site\u2019s infrastructure and anti-bot configuration. 
As a safe starting point, send 1 request every 1\u20132 seconds per IP. With rotating residential proxies, you can raise the overall rate significantly because rate limits are usually applied per IP, not per scraper.<\/p>\n<p>A practical approach is to start slow and enable Scrapy\u2019s\u00a0<em>AutoThrottle<\/em>\u00a0extension (the <em>AUTOTHROTTLE_ENABLED<\/em> setting), which automatically adjusts the download delay based on observed response latency.<\/p>\n                                <\/div>\n                            <\/div>\n                        <\/div>\n                                            <div class=\"section-faq__item\" data-accordion=\"wrapper\" data-accordion-group=\"faq\" role=\"listitem\">\n                            <button class=\"section-faq__trigger\" data-accordion=\"trigger\" type=\"button\" aria-expanded=\"false\">\n                                <span class=\"section-faq__question\">What\u2019s the difference between BeautifulSoup and Scrapy?<\/span>\n                                <svg width=\"28\" height=\"28\" viewbox=\"0 0 28 28\" fill=\"none\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\">\n                                    <path d=\"M7 10.5L14 17.5L21 10.5\" stroke=\"#5D5D5D\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\" \/>\n                                <\/svg>\n                            <\/button>\n                            <div class=\"section-faq__content\">\n                                <div class=\"section-faq__answer\">\n                                    <p>BeautifulSoup is an HTML parsing library \u2014 it takes an HTML string and lets you extract data from it. It has no built-in HTTP client, scheduler, or pipeline system. 
You pair it with\u00a0<em>requests<\/em>\u00a0to fetch pages, then use it to parse those pages.<\/p>\n<p>Scrapy is a complete web crawling\u00a0<em>framework<\/em>\u00a0that handles everything: sending requests (with concurrency), following links, retrying failures, parsing responses, cleaning data, and saving it. It natively supports CSS selectors and XPath for parsing. Use BeautifulSoup for simple one-off scrapers; use Scrapy when you need a production-grade pipeline.<\/p>\n                                <\/div>\n                            <\/div>\n                        <\/div>\n                                            <div class=\"section-faq__item\" data-accordion=\"wrapper\" data-accordion-group=\"faq\" role=\"listitem\">\n                            <button class=\"section-faq__trigger\" data-accordion=\"trigger\" type=\"button\" aria-expanded=\"false\">\n                                <span class=\"section-faq__question\">Can I scrape e-commerce websites with Python?<\/span>\n                                <svg width=\"28\" height=\"28\" viewbox=\"0 0 28 28\" fill=\"none\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\">\n                                    <path d=\"M7 10.5L14 17.5L21 10.5\" stroke=\"#5D5D5D\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\" \/>\n                                <\/svg>\n                            <\/button>\n                            <div class=\"section-faq__content\">\n                                <div class=\"section-faq__answer\">\n                                    <p>Yes. Scraping e-commerce websites with Python is one of the most common scraping use cases today. 
Companies scrape e-commerce platforms for:<\/p>\n<ul>\n<li>Price monitoring<\/li>\n<li>Stock tracking<\/li>\n<li>Review aggregation<\/li>\n<li>Seller analysis<\/li>\n<li>Competitor monitoring<\/li>\n<\/ul>\n<p>However, e-commerce sites also deploy some of the strongest anti-bot protections:<\/p>\n<ul>\n<li>Cloudflare<\/li>\n<li>DataDome<\/li>\n<li>Akamai<\/li>\n<li>PerimeterX<\/li>\n<\/ul>\n<p>NodeMaven <a href=\"https:\/\/nodemaven.com\/ru\/proxies\/rotating-residential-proxies\/\">rotating residential proxies<\/a> are especially useful for e-commerce scraping because requests can rotate across clean residential IPs automatically, reducing rate limits and detection risk.<\/p>\n                                <\/div>\n                            <\/div>\n                        <\/div>\n                                            <div class=\"section-faq__item\" data-accordion=\"wrapper\" data-accordion-group=\"faq\" role=\"listitem\">\n                            <button class=\"section-faq__trigger\" data-accordion=\"trigger\" type=\"button\" aria-expanded=\"false\">\n                                <span class=\"section-faq__question\">Can I build a Python script for web scraping without proxies?<\/span>\n                                <svg width=\"28\" height=\"28\" viewbox=\"0 0 28 28\" fill=\"none\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\">\n                                    <path d=\"M7 10.5L14 17.5L21 10.5\" stroke=\"#5D5D5D\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\" \/>\n                                <\/svg>\n                            <\/button>\n                            <div class=\"section-faq__content\">\n                                <div class=\"section-faq__answer\">\n                                    <p>Technically yes, but only for 
low-protection websites or very small scraping workloads. A basic Python web scraping script may work temporarily with a normal IP, but once request volume increases, most modern sites will begin rate limiting or blocking traffic.<\/p>\n<p>For reliable scraping at scale, <a href=\"https:\/\/nodemaven.com\/ru\/proxies\/residential-proxies\/\">residential proxies<\/a> are now standard infrastructure. They distribute requests across real ISP IP addresses, making traffic look like normal user activity.<\/p>\n<p>NodeMaven residential proxies are particularly useful for:<\/p>\n<ul>\n<li>e-commerce scraping<\/li>\n<li>localized search results<\/li>\n<li>account-based scraping<\/li>\n<li>Google scraping<\/li>\n<li>large-scale data collection<\/li>\n<\/ul>\n<p>Because the <a href=\"https:\/\/nodemaven.com\/ru\/features\/ip-quality-filter\/\">IP pool is pre-filtered<\/a> for quality and fraud risk, scrapers experience fewer CAPTCHAs and fewer 403 responses during long scraping sessions.<\/p>\n                                <\/div>\n                            <\/div>\n                        <\/div>\n                                    <\/div>\n                    <\/div>\n    <\/div>\n<\/div>","protected":false},"excerpt":{"rendered":"Complete Python web scraping guide covering requests, Playwright, proxy rotation, JavaScript rendering, and scaling techniques","protected":false},"author":80,"featured_media":38406,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[213,205],"class_list":["post-38401","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized","tag-guides-tutorials","tag-web-scraping"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.3 - 
https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Web Scraping with Python: Full Step-by-Step Guide for 2026<\/title>\n<meta name=\"description\" content=\"Complete guide to Python web scraping with BeautifulSoup, Playwright, proxies, and anti-bot techniques for dynamic and static websites\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/nodemaven.com\/ru\/blog\/python-web-scraping\/\" \/>\n<meta property=\"og:locale\" content=\"ru_RU\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Web Scraping with Python: Full Step-by-Step Guide for 2026\" \/>\n<meta property=\"og:description\" content=\"Complete guide to Python web scraping with BeautifulSoup, Playwright, proxies, and anti-bot techniques for dynamic and static websites\" \/>\n<meta property=\"og:url\" content=\"https:\/\/nodemaven.com\/ru\/blog\/python-web-scraping\/\" \/>\n<meta property=\"og:site_name\" content=\"NodeMaven\" \/>\n<meta property=\"article:published_time\" content=\"2026-05-21T10:58:33+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2026-05-21T12:24:26+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/nodemaven.com\/wp-content\/uploads\/2025\/03\/cropped-Untitled-design-8-1.png\" \/>\n\t<meta property=\"og:image:width\" content=\"512\" \/>\n\t<meta property=\"og:image:height\" content=\"512\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Olga Kotko\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"\u041d\u0430\u043f\u0438\u0441\u0430\u043d\u043e \u0430\u0432\u0442\u043e\u0440\u043e\u043c\" \/>\n\t<meta name=\"twitter:data1\" content=\"Olga Kotko\" \/>\n\t<meta name=\"twitter:label2\" content=\"\u041f\u0440\u0438\u043c\u0435\u0440\u043d\u043e\u0435 \u0432\u0440\u0435\u043c\u044f 
\u0434\u043b\u044f \u0447\u0442\u0435\u043d\u0438\u044f\" \/>\n\t<meta name=\"twitter:data2\" content=\"10 \u043c\u0438\u043d\u0443\u0442\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/nodemaven.com\\\/blog\\\/python-web-scraping\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/nodemaven.com\\\/blog\\\/python-web-scraping\\\/\"},\"author\":{\"name\":\"Olga Kotko\",\"@id\":\"https:\\\/\\\/nodemaven.com\\\/#\\\/schema\\\/person\\\/79a9c10c7956e31a5628504fe9cffe2e\"},\"headline\":\"Web Scraping with Python: The Complete Guide [2026]\",\"datePublished\":\"2026-05-21T10:58:33+00:00\",\"dateModified\":\"2026-05-21T12:24:26+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/nodemaven.com\\\/blog\\\/python-web-scraping\\\/\"},\"wordCount\":2248,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\\\/\\\/nodemaven.com\\\/#organization\"},\"image\":{\"@id\":\"https:\\\/\\\/nodemaven.com\\\/blog\\\/python-web-scraping\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/nodemaven.com\\\/wp-content\\\/uploads\\\/2026\\\/05\\\/scrapinf-featured.svg\",\"keywords\":[\"Guides &amp; Tutorials\",\"Web Scraping\"],\"articleSection\":[\"Uncategorized\"],\"inLanguage\":\"ru-RU\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/nodemaven.com\\\/blog\\\/python-web-scraping\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/nodemaven.com\\\/blog\\\/python-web-scraping\\\/\",\"url\":\"https:\\\/\\\/nodemaven.com\\\/blog\\\/python-web-scraping\\\/\",\"name\":\"Web Scraping with Python: Full Step-by-Step Guide for 
2026\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/nodemaven.com\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/nodemaven.com\\\/blog\\\/python-web-scraping\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/nodemaven.com\\\/blog\\\/python-web-scraping\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/nodemaven.com\\\/wp-content\\\/uploads\\\/2026\\\/05\\\/scrapinf-featured.svg\",\"datePublished\":\"2026-05-21T10:58:33+00:00\",\"dateModified\":\"2026-05-21T12:24:26+00:00\",\"description\":\"Complete guide to Python web scraping with BeautifulSoup, Playwright, proxies, and anti-bot techniques for dynamic and static websites\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/nodemaven.com\\\/blog\\\/python-web-scraping\\\/#breadcrumb\"},\"inLanguage\":\"ru-RU\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/nodemaven.com\\\/blog\\\/python-web-scraping\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"ru-RU\",\"@id\":\"https:\\\/\\\/nodemaven.com\\\/blog\\\/python-web-scraping\\\/#primaryimage\",\"url\":\"https:\\\/\\\/nodemaven.com\\\/wp-content\\\/uploads\\\/2026\\\/05\\\/scrapinf-featured.svg\",\"contentUrl\":\"https:\\\/\\\/nodemaven.com\\\/wp-content\\\/uploads\\\/2026\\\/05\\\/scrapinf-featured.svg\",\"caption\":\"NodeMaven proxy infrastructure illustration for Python web scraping, proxy routing, and anti-bot protection\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/nodemaven.com\\\/blog\\\/python-web-scraping\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/nodemaven.com\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Web Scraping with Python: The Complete Guide 
[2026]\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/nodemaven.com\\\/#website\",\"url\":\"https:\\\/\\\/nodemaven.com\\\/\",\"name\":\"NodeMaven\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\\\/\\\/nodemaven.com\\\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/nodemaven.com\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"ru-RU\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/nodemaven.com\\\/#organization\",\"name\":\"NodeMaven\",\"url\":\"https:\\\/\\\/nodemaven.com\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"ru-RU\",\"@id\":\"https:\\\/\\\/nodemaven.com\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"https:\\\/\\\/nodemaven.com\\\/wp-content\\\/uploads\\\/2025\\\/03\\\/cropped-Untitled-design-8-1.png\",\"contentUrl\":\"https:\\\/\\\/nodemaven.com\\\/wp-content\\\/uploads\\\/2025\\\/03\\\/cropped-Untitled-design-8-1.png\",\"width\":512,\"height\":512,\"caption\":\"NodeMaven\"},\"image\":{\"@id\":\"https:\\\/\\\/nodemaven.com\\\/#\\\/schema\\\/logo\\\/image\\\/\"}},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/nodemaven.com\\\/#\\\/schema\\\/person\\\/79a9c10c7956e31a5628504fe9cffe2e\",\"name\":\"Olga Kotko\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"ru-RU\",\"@id\":\"https:\\\/\\\/nodemaven.com\\\/wp-content\\\/uploads\\\/2026\\\/05\\\/olga-kotko_avatar-96x96.jpg\",\"url\":\"https:\\\/\\\/nodemaven.com\\\/wp-content\\\/uploads\\\/2026\\\/05\\\/olga-kotko_avatar-96x96.jpg\",\"contentUrl\":\"https:\\\/\\\/nodemaven.com\\\/wp-content\\\/uploads\\\/2026\\\/05\\\/olga-kotko_avatar-96x96.jpg\",\"caption\":\"Olga Kotko\"},\"description\":\"I write about proxies and automation, translating complicated digital topics into research-driven content people can actually enjoy 
reading\",\"url\":\"https:\\\/\\\/nodemaven.com\\\/ru\\\/author\\\/olga-kotko\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Web Scraping with Python: Full Step-by-Step Guide for 2026","description":"Complete guide to Python web scraping with BeautifulSoup, Playwright, proxies, and anti-bot techniques for dynamic and static websites","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/nodemaven.com\/ru\/blog\/python-web-scraping\/","og_locale":"ru_RU","og_type":"article","og_title":"Web Scraping with Python: Full Step-by-Step Guide for 2026","og_description":"Complete guide to Python web scraping with BeautifulSoup, Playwright, proxies, and anti-bot techniques for dynamic and static websites","og_url":"https:\/\/nodemaven.com\/ru\/blog\/python-web-scraping\/","og_site_name":"NodeMaven","article_published_time":"2026-05-21T10:58:33+00:00","article_modified_time":"2026-05-21T12:24:26+00:00","og_image":[{"width":512,"height":512,"url":"https:\/\/nodemaven.com\/wp-content\/uploads\/2025\/03\/cropped-Untitled-design-8-1.png","type":"image\/png"}],"author":"Olga Kotko","twitter_card":"summary_large_image","twitter_misc":{"\u041d\u0430\u043f\u0438\u0441\u0430\u043d\u043e \u0430\u0432\u0442\u043e\u0440\u043e\u043c":"Olga Kotko","\u041f\u0440\u0438\u043c\u0435\u0440\u043d\u043e\u0435 \u0432\u0440\u0435\u043c\u044f \u0434\u043b\u044f \u0447\u0442\u0435\u043d\u0438\u044f":"10 \u043c\u0438\u043d\u0443\u0442"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/nodemaven.com\/blog\/python-web-scraping\/#article","isPartOf":{"@id":"https:\/\/nodemaven.com\/blog\/python-web-scraping\/"},"author":{"name":"Olga Kotko","@id":"https:\/\/nodemaven.com\/#\/schema\/person\/79a9c10c7956e31a5628504fe9cffe2e"},"headline":"Web Scraping with Python: The Complete Guide 
[2026]","datePublished":"2026-05-21T10:58:33+00:00","dateModified":"2026-05-21T12:24:26+00:00","mainEntityOfPage":{"@id":"https:\/\/nodemaven.com\/blog\/python-web-scraping\/"},"wordCount":2248,"commentCount":0,"publisher":{"@id":"https:\/\/nodemaven.com\/#organization"},"image":{"@id":"https:\/\/nodemaven.com\/blog\/python-web-scraping\/#primaryimage"},"thumbnailUrl":"https:\/\/nodemaven.com\/wp-content\/uploads\/2026\/05\/scrapinf-featured.svg","keywords":["Guides &amp; Tutorials","Web Scraping"],"articleSection":["Uncategorized"],"inLanguage":"ru-RU","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/nodemaven.com\/blog\/python-web-scraping\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/nodemaven.com\/blog\/python-web-scraping\/","url":"https:\/\/nodemaven.com\/blog\/python-web-scraping\/","name":"Web Scraping with Python: Full Step-by-Step Guide for 2026","isPartOf":{"@id":"https:\/\/nodemaven.com\/#website"},"primaryImageOfPage":{"@id":"https:\/\/nodemaven.com\/blog\/python-web-scraping\/#primaryimage"},"image":{"@id":"https:\/\/nodemaven.com\/blog\/python-web-scraping\/#primaryimage"},"thumbnailUrl":"https:\/\/nodemaven.com\/wp-content\/uploads\/2026\/05\/scrapinf-featured.svg","datePublished":"2026-05-21T10:58:33+00:00","dateModified":"2026-05-21T12:24:26+00:00","description":"Complete guide to Python web scraping with BeautifulSoup, Playwright, proxies, and anti-bot techniques for dynamic and static 
websites","breadcrumb":{"@id":"https:\/\/nodemaven.com\/blog\/python-web-scraping\/#breadcrumb"},"inLanguage":"ru-RU","potentialAction":[{"@type":"ReadAction","target":["https:\/\/nodemaven.com\/blog\/python-web-scraping\/"]}]},{"@type":"ImageObject","inLanguage":"ru-RU","@id":"https:\/\/nodemaven.com\/blog\/python-web-scraping\/#primaryimage","url":"https:\/\/nodemaven.com\/wp-content\/uploads\/2026\/05\/scrapinf-featured.svg","contentUrl":"https:\/\/nodemaven.com\/wp-content\/uploads\/2026\/05\/scrapinf-featured.svg","caption":"NodeMaven proxy infrastructure illustration for Python web scraping, proxy routing, and anti-bot protection"},{"@type":"BreadcrumbList","@id":"https:\/\/nodemaven.com\/blog\/python-web-scraping\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/nodemaven.com\/"},{"@type":"ListItem","position":2,"name":"Web Scraping with Python: The Complete Guide [2026]"}]},{"@type":"WebSite","@id":"https:\/\/nodemaven.com\/#website","url":"https:\/\/nodemaven.com\/","name":"NodeMaven","description":"","publisher":{"@id":"https:\/\/nodemaven.com\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/nodemaven.com\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"ru-RU"},{"@type":"Organization","@id":"https:\/\/nodemaven.com\/#organization","name":"NodeMaven","url":"https:\/\/nodemaven.com\/","logo":{"@type":"ImageObject","inLanguage":"ru-RU","@id":"https:\/\/nodemaven.com\/#\/schema\/logo\/image\/","url":"https:\/\/nodemaven.com\/wp-content\/uploads\/2025\/03\/cropped-Untitled-design-8-1.png","contentUrl":"https:\/\/nodemaven.com\/wp-content\/uploads\/2025\/03\/cropped-Untitled-design-8-1.png","width":512,"height":512,"caption":"NodeMaven"},"image":{"@id":"https:\/\/nodemaven.com\/#\/schema\/logo\/image\/"}},{"@type":"Person","@id":"https:\/\/nodemaven.
com\/#\/schema\/person\/79a9c10c7956e31a5628504fe9cffe2e","name":"Olga Kotko","image":{"@type":"ImageObject","inLanguage":"ru-RU","@id":"https:\/\/nodemaven.com\/wp-content\/uploads\/2026\/05\/olga-kotko_avatar-96x96.jpg","url":"https:\/\/nodemaven.com\/wp-content\/uploads\/2026\/05\/olga-kotko_avatar-96x96.jpg","contentUrl":"https:\/\/nodemaven.com\/wp-content\/uploads\/2026\/05\/olga-kotko_avatar-96x96.jpg","caption":"Olga Kotko"},"description":"I write about proxies and automation, translating complicated digital topics into research-driven content people can actually enjoy reading","url":"https:\/\/nodemaven.com\/ru\/author\/olga-kotko\/"}]}},"_links":{"self":[{"href":"https:\/\/nodemaven.com\/ru\/wp-json\/wp\/v2\/posts\/38401","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/nodemaven.com\/ru\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/nodemaven.com\/ru\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/nodemaven.com\/ru\/wp-json\/wp\/v2\/users\/80"}],"replies":[{"embeddable":true,"href":"https:\/\/nodemaven.com\/ru\/wp-json\/wp\/v2\/comments?post=38401"}],"version-history":[{"count":8,"href":"https:\/\/nodemaven.com\/ru\/wp-json\/wp\/v2\/posts\/38401\/revisions"}],"predecessor-version":[{"id":38414,"href":"https:\/\/nodemaven.com\/ru\/wp-json\/wp\/v2\/posts\/38401\/revisions\/38414"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/nodemaven.com\/ru\/wp-json\/wp\/v2\/media\/38406"}],"wp:attachment":[{"href":"https:\/\/nodemaven.com\/ru\/wp-json\/wp\/v2\/media?parent=38401"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/nodemaven.com\/ru\/wp-json\/wp\/v2\/categories?post=38401"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/nodemaven.com\/ru\/wp-json\/wp\/v2\/tags?post=38401"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}