
When your crawler starts to find very weird pages on the internet...

Comments
  • 0
    Self-made?
  • 0
    @linuxer4fun crawler*
  • 2
    Ours has been running for over a month and has that too! The oddest sites you will ever see get crawled 😂
  • 1
    @linuxer4fun Yes, I started off regexing everything myself, just a BufferedReader and regexes over the raw HTML. I then moved on to JSoup because, well, it offered everything I needed 😂
    I added some features and am now working with a cluster-like engine. That means there is a master server, which is actually a bot that adds links to a queue and, every 10 links, sends a packet with those links to a slave that processes them (rough sketch below). You can have several slave instances connected to the master. The slaves are multi-threaded, one thread per link.
    The communication is done with Netty.
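    A rough sketch of what the slave-side processing could look like, assuming JSoup for the parsing; LinkWorker and its fields are hypothetical names for illustration, and the Netty transport back to the master is left out:
    ```java
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical slave-side worker: handles one link from the master's
    // 10-link batch and extracts the outgoing links from the page.
    public class LinkWorker implements Runnable {
        private final String url;
        private final List<String> discovered = new ArrayList<>();

        public LinkWorker(String url) {
            this.url = url;
        }

        @Override
        public void run() {
            try {
                // JSoup replaces the old BufferedReader-plus-regex approach
                Document doc = Jsoup.connect(url).timeout(5000).get();
                for (Element a : doc.select("a[href]")) {
                    discovered.add(a.absUrl("href")); // resolve relative URLs
                }
                // In the real setup these would go back to the master over
                // Netty; printing stands in for that here.
                discovered.forEach(System.out::println);
            } catch (Exception e) {
                System.err.println("Failed to crawl " + url + ": " + e.getMessage());
            }
        }
    }
    ```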
  • 0
    @Jappe woooow, that's insane 😂😂😂
    Which URL did you start off with?
  • 1
    @DataSec Dmoz.org
  • 1
    @Jappe That's an awesome site to start off with. Did you write it in Java?
  • 0
    @DataSec Nope. Python. That was the simplest language for us to build our crawler in
  • 0
    @Jappe Have you calculated how many links you can retrieve per minute, for example? I'm quite curious, because I'd like to know which approach is actually more efficient. To be honest I could only guess
  • 0
    We know that it crawls around 100,000 links per hour.

    But it depends on how many crawlers are running. For 100,000 links per hour, about 20 crawlers are needed.
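    (That works out to 100,000 / 20 = 5,000 links per crawler per hour, i.e. roughly 83 links per crawler per minute.)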
  • 0
    @Jappe Oh, I expected more. What download rate have you got?
    I have a semi-fixed thread count: each slave connected to the master uses a fixed thread pool that sizes itself by the number of available processors (sketch below).
    With a 100 Mbit download rate it fetches 100,000 links per minute and probably fully crawls and indexes 70,000 per minute, if not more.
    That means with 1 Gbit you could fetch almost a million links per minute 😂
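    A minimal sketch of that pool sizing, using the standard java.util.concurrent executors; the 10-link batch and the LinkWorker task are the hypothetical pieces from the sketch above:
    ```java
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class SlaveExecutor {
        public static void main(String[] args) {
            // Fixed pool sized by the number of available processors,
            // as described above; every connected slave does the same.
            int threads = Runtime.getRuntime().availableProcessors();
            ExecutorService pool = Executors.newFixedThreadPool(threads);

            // Hypothetical 10-link batch received from the master.
            List<String> batch = List.of("https://example.com", "https://example.org");

            // One task per link; the fixed pool bounds how many run at once.
            for (String url : batch) {
                pool.submit(new LinkWorker(url));
            }
            pool.shutdown();
        }
    }
    ```
    Note the 1 Gbit claim is just the 100 Mbit figure scaled linearly, which assumes the network, not the CPU or the target sites, stays the bottleneck.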
  • 2
    Cats are indeed weird crawlers.
  • 0
    All right, that's pretty awesome, but what are the specs of your server/computer? We only have two regular PCs, each with only 2 GB of RAM.. 😕

    Oh, and we run it on a crappy school network. So optimisation is all we can do to make it faster... 😂
  • 0
    @Jappe Yeah, I was pretty impressed 😂
    I totally forgot to mention that. I run a Windows computer, and until now it has just run inside IntelliJ, not on a server. I have 8 GB DDR4 and an i7-6700HQ @ 2.6 GHz, a quad-core, so it's decent hardware. On a server I would probably use a VPS with 2-4 GB RAM and a decent CPU.
    But my internet download rate is actually the most limiting factor 😂
    Tested it on a school network and it threw a lot of exceptions 😂😂
  • 0
    @DataSec That's awesome!! We are going to upgrade both PCs from 2 GB to 4 GB each, so it's gonna be a little faster than it is right now..😎
  • 1
    @Jappe That sounds very promising! 😏
    I'm sure it will speed up the crawling :D
  • 1
    Look at our daily result 😇 with @hahaha1234 and @papierbouwer