When I’m writing web scrapers I mostly just pivot between selenium (when the website is too “fancy” and really needs a browser) and plain requests calls (both in conjunction with bs4).

But when reading about scrapers, scrapy is often the first Python package mentioned. What am I missing out on if I’m not using it?
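For reference, the requests + bs4 path I mean is basically just this (simplified sketch, the URL is only a placeholder):

```python
import requests
from bs4 import BeautifulSoup

# Plain-HTML site: one GET, then parse the markup.
response = requests.get("https://example.com", timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")
print(soup.title.get_text())
```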

2 points

The huge feature of scrapy is its pipeline system: you scrape a page, pass it to the filtering stage, then to the deduplication stage, then to the DB, and so on.

Hugely useful when you’re scraping and extracting structured data; if you’re only grabbing raw pages then it’s less useful, I reckon. Rough sketch of a pipeline stage below.
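A minimal sketch of one stage (the item’s `url` field and the `DedupPipeline` name are just for illustration):

```python
# pipelines.py -- one stage of the item pipeline (illustrative sketch)
from scrapy.exceptions import DropItem


class DedupPipeline:
    """Drop any item whose URL has already been seen in this crawl."""

    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider):
        if item["url"] in self.seen:
            raise DropItem(f"duplicate: {item['url']}")
        self.seen.add(item["url"])
        return item  # handed on to the next pipeline stage
```

You chain the stages together in settings.py; lower numbers run first (the project and class names here are hypothetical):

```python
ITEM_PIPELINES = {
    "myproject.pipelines.DedupPipeline": 300,
    "myproject.pipelines.DatabasePipeline": 800,  # hypothetical DB-writing stage
}
```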

1 point

Oh shit that sounds useful. I just did a project where I implemented a custom stream class to chain together calls to requests and beautifulsoup.
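Roughly the idea, minus the stream class (simplified sketch, not my actual code):

```python
import requests
from bs4 import BeautifulSoup


def fetch(urls):
    # Download each URL and yield (url, response) pairs.
    with requests.Session() as session:
        for url in urls:
            yield url, session.get(url, timeout=10)


def parse(pairs):
    # Parse each response with BeautifulSoup and yield (url, soup) pairs.
    for url, response in pairs:
        yield url, BeautifulSoup(response.text, "html.parser")


# Chained generators: nothing is fetched until the loop pulls items through.
for url, soup in parse(fetch(["https://example.com"])):
    print(url, soup.title)
```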

2 points

Yep, try scrapy. It also handles the concurrency of your pipeline items for you, configuration for every component, and so on.
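e.g. concurrency and throttling are plain settings, no manual threading or async code (the values below are just an example):

```python
# settings.py -- scrapy drives the concurrency for you
CONCURRENT_REQUESTS = 16            # global cap on in-flight requests
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # per-domain cap
DOWNLOAD_DELAY = 0.5                # seconds between requests to the same domain
AUTOTHROTTLE_ENABLED = True         # adapt the delay to how the server responds
```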
