9

Can you help me understand how to start building a Web crawler?

I need to understand if my idea is impossible.

Comments
  • 8
    * make request to website
    * parse html
    * optional: execute js
    * get your wanted information
    * recursion: get links and start at the beginning
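The steps listed above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a production crawler: the names `LinkParser`, `fetch`, `extract_links`, and `crawl` are made up for this example, the optional "execute js" step is skipped, and `max_depth` is an assumed safety limit so the recursion stops somewhere.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Make the request and return the response body as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_links(html, base_url):
    """Parse the HTML and return the links as absolute URLs."""
    parser = LinkParser()
    parser.feed(html)
    # Resolve relative links against the page URL and drop #fragments.
    return [urldefrag(urljoin(base_url, href)).url for href in parser.links]


def crawl(url, fetch_page=fetch, max_depth=2, seen=None):
    """Follow links recursively; returns the set of URLs discovered."""
    seen = set() if seen is None else seen
    if url in seen or max_depth < 0:
        return seen
    seen.add(url)
    try:
        html = fetch_page(url)
    except (OSError, ValueError):
        return seen  # unreachable page or a scheme urlopen cannot handle
    for link in extract_links(html, url):
        crawl(link, fetch_page, max_depth - 1, seen)
    return seen
```

Passing `fetch_page` as a parameter makes it possible to try the logic against a dictionary of fake pages before pointing it at a real site. A real crawler would also want to respect robots.txt and rate limits, which this sketch leaves out.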
  • 3
    @plusgut so is it pretty much html parsing?
  • 6
    @AntaresStar Sure, but there are libraries for it. I wouldn't recommend writing one on your own...
  • 2
    @AntaresStar
    There are different kinds of web crawling and scraping. What exactly are you trying to do?
  • 3
    @coolq I just want to extract all links inside a website.
  • 2
    @Alice I'll do it! :)
  • 6
    @AntaresStar In my opinion you don't even need to parse everything.
    It would be far more efficient and easier to build a regular expression that searches for strings starting with http or https.
    Add every string you find to a list.
    Make sure you can't add duplicates. A sorted list might be useful for that, or even a database if the whole thing gets bigger.
    Mark the URLs you have analyzed and keep iterating over the unmarked elements in the list.
    You could do this recursively, but that can get really memory-consuming.
    It's also better to store the depth with each element in the list, in case you want to stop somewhere. Just set the depth of a new item to the depth of the current item + 1.
    There is a lot of room for optimization, for example multiprocessing (even with multiple clients, if you use a database for storage).
    Extend the regex if you want.
    A good start:
    ^["']https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)["']
  • 2
    correction:
    remove the \b in the regex
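The regex approach described above (collect URLs, skip duplicates, track depth, iterate instead of recursing) might look roughly like this in Python. It is a sketch under some assumptions: the pattern is an adaptation of the one posted, with the leading `^` and the `\b` dropped so it can match anywhere in a page; a dict and a queue stand in for the suggested sorted list with marks; and the names `URL_RE`, `find_urls`, `fetch`, and `crawl` are invented for the example. Note that a regex like this only sees absolute, quoted http(s) URLs, so relative links are missed.

```python
import re
from collections import deque
from urllib.request import urlopen

# Adapted from the pattern in the comment: quoted absolute http(s) URLs.
URL_RE = re.compile(
    r"""["'](https?://(?:www\.)?[-a-zA-Z0-9@:%._+~#=]{2,256}"""
    r"""\.[a-z]{2,6}[-a-zA-Z0-9@:%_+.~#?&/=]*)["']"""
)


def find_urls(text):
    """Return the quoted URLs found in text, in order of appearance."""
    return URL_RE.findall(text)


def fetch(url):
    """Download a page and return its body as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch, max_depth=2):
    """Iteratively collect URLs; returns a dict mapping URL to depth."""
    depth_of = {start_url: 0}       # also serves as the duplicate check
    unvisited = deque([start_url])  # the "unmarked" entries
    while unvisited:
        url = unvisited.popleft()
        depth = depth_of[url]
        if depth >= max_depth:
            continue  # recorded, but not expanded any further
        try:
            text = fetch_page(url)
        except OSError:
            continue  # skip pages that cannot be downloaded
        for found in find_urls(text):
            if found not in depth_of:
                depth_of[found] = depth + 1
                unvisited.append(found)
    return depth_of
```

Because the queue is processed first in, first out, every URL is recorded at the shallowest depth at which it was found, and memory use is bounded by the number of distinct URLs rather than by recursion depth.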
  • 1
    @jAsE
    Ummm, you never told me this 😂
  • 0
    @jAsE
    It's all good mate, no need to talk about it.

    If you ever feel up to it, maybe a future rant? Let it all out, but that would be hard. I don't know what it's like, so I can only hope you're all right?
  • 0
    @jAsE
    Thanks for elaborating.

    Sounds to me like you've got this under control 👍
  • 1
    Take a look at PhantomJS. You can build a link scraper in maybe 10 lines of code with it.
  • 0
    @-Neo
    PhantomJS is good, try pairing it with Selenium.
  • 0
    @-Neo PhantomJS is dead. The maintainer said that everyone should use headless Chrome.
  • 3
    Guys, so it's easier to write a post on devRant than to just Google the keyword and read any of the thousands of articles on web crawlers with examples, pointers and links to resources? o.O
  • 0
    @kargaroth That's what I am doing.

    But now that I am part of this community, I think I can also learn a lot from your experience.

    So thank you all for your comments :)