5

Has anyone here worked on news scraping?

I am currently working on an academic project in which I need to scrape news headlines. I have built scrapers for some news sources using their native APIs. I also tried newsapi.org, but it returns only 10 results.

If anyone has worked on similar projects or knows of any, some advice would be highly appreciated.

Comments
  • 3
    I started working on my first scraper today 😀 I'm scraping something similar but without an API: I just fetch the whole page, and if there's a Next button I loop to the next page.

    *Disclaimer: this answer is based on knowledge I gained in the last 3 hours.
  • 2
    I did news scraping the year before last using Java. It was in fact a crawler: I used jsoup to fetch and parse HTML, and Redis to store the URL queue.

    One thing I can tell you is that running a crawler continuously takes a lot of resources, and managing the queues of links already crawled and links still to crawl is a daunting task.

    You can use RSS feeds too. Most publications provide them, and you can easily find an RSS parser or write one yourself. Just run it periodically.

    *Edit: typo correction
  • 2
    @vortex Good for you! 👍 I recommend using dataquest.io to learn more about scraping. They have great interactive tutorials.
  • 2
    @ergo thank you very much for the suggestion. Some sources have poor APIs which are slow and the JSON objects cannot be customized. I will definitely consider the RSS feed.
  • 0
    I'm delighted to find your post. I've been wondering for a long time if anyone was working on news scraping. News scraping, the process of extracting information from various online news sources, is becoming increasingly popular for real-time data collection and analytics. However, one of the challenges of news scraping is dealing with the anti-bots used by websites to prevent automatic data extraction. And a tool like https://www.zenrows.com/ offers a comprehensive solution to this problem by handling all anti-bot bypass mechanisms for newsgathering users. Whether it's CAPTCHAs, IP blocking, or other anti-bot measures, ZenRows provides users with the tools and technology they need to seamlessly overcome these obstacles. I think this should be of interest to you.
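For the RSS suggestion in the comments, a minimal sketch in Python using only the standard library might look like the following. The sample feed, the `parse_headlines` name, and the example.com URLs are assumptions made up for illustration. The sketch assumes an RSS 2.0 feed (Atom feeds use different element names). A real feed would first be downloaded, for example with `urllib.request.urlopen`, and its contents passed in the same way.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; a real one would be fetched from the publisher.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>First headline</title>
      <link>https://example.com/a</link>
    </item>
    <item>
      <title>Second headline</title>
      <link>https://example.com/b</link>
    </item>
  </channel>
</rss>"""


def parse_headlines(feed_xml):
    """Return a list of (title, link) pairs from an RSS 2.0 document."""
    root = ET.fromstring(feed_xml)
    headlines = []
    # Each story in an RSS 2.0 feed is an <item> inside <channel>.
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        if title:  # skip items that carry no headline text
            headlines.append((title, link))
    return headlines


if __name__ == "__main__":
    for title, link in parse_headlines(SAMPLE_FEED):
        print(title, link)
```

Running this periodically (for example from cron) and remembering which links have already been seen would probably be enough for a headline-collection project, without the cost of a full crawler.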
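The "fetch the page and loop while there is a Next button" approach from the first comment can be sketched roughly as below. This is only an illustration under stated assumptions: it assumes headlines sit in `<h2>` tags and that the Next button is an `<a rel="next">` link, which will differ from site to site. `NextLinkParser`, `scrape_all_pages`, and `FAKE_SITE` are hypothetical names, and the `fetch` argument stands in for whatever function downloads a page as text.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkParser(HTMLParser):
    """Collects <h2> text as headlines and the href of an <a rel="next"> link."""

    def __init__(self):
        super().__init__()
        self.headlines = []
        self.next_href = None
        self._in_h2 = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h2":
            self._in_h2 = True
            self.headlines.append("")
        elif tag == "a" and attrs.get("rel") == "next":
            self.next_href = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False

    def handle_data(self, data):
        if self._in_h2:
            self.headlines[-1] += data


def scrape_all_pages(start_url, fetch, max_pages=50):
    """Follow "next" links from start_url and return all headlines found."""
    headlines, url, seen = [], start_url, set()
    # Stop when there is no next link, a page repeats, or the cap is hit.
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        parser = NextLinkParser()
        parser.feed(fetch(url))
        parser.close()
        headlines.extend(h.strip() for h in parser.headlines)
        # Resolve a relative href against the page it was found on.
        url = urljoin(url, parser.next_href) if parser.next_href else None
    return headlines


# Hypothetical in-memory "site" so the sketch runs without network access.
FAKE_SITE = {
    "https://example.com/news?page=1":
        '<h2>Alpha</h2><h2>Beta</h2><a rel="next" href="?page=2">Next</a>',
    "https://example.com/news?page=2":
        "<h2>Gamma</h2>",
}

if __name__ == "__main__":
    print(scrape_all_pages("https://example.com/news?page=1", FAKE_SITE.__getitem__))
```

The `seen` set and the `max_pages` cap are there so that a site whose last page links back to itself does not cause an endless loop.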