3
retoor
8d

Strangely enough, there's a lack of decent site downloaders, so I've written one myself.

It's battle tested: I used it to download the WHOLE of devRant and a big part of molodetz, both big sites. It makes the downloaded sites portable by rewriting absolute URLs as relative ones.
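
As a rough illustration of the absolute-to-relative idea (a sketch only, not the tool's actual code), the rewrite can be done with the Python standard library. The function name `to_relative` and the example.com URLs are made up for this example, and the post doesn't say what language the tool itself is written in.

```python
import posixpath
from urllib.parse import urlsplit


def to_relative(link: str, page_url: str) -> str:
    """Rewrite an absolute same-site link so it is relative to page_url.

    Hypothetical helper for illustration; links to other hosts (and links
    that are already relative) are returned unchanged.
    """
    link_parts = urlsplit(link)
    page_parts = urlsplit(page_url)

    # Only rewrite links that point at the same host as the page itself.
    if link_parts.netloc != page_parts.netloc:
        return link

    # Directory of the page that contains the link, e.g. "/a/c" for "/a/c/d.html".
    page_dir = posixpath.dirname(page_parts.path) or "/"
    target = link_parts.path or "/"

    relative = posixpath.relpath(target, start=page_dir)

    # Keep query string and fragment, if any.
    if link_parts.query:
        relative += "?" + link_parts.query
    if link_parts.fragment:
        relative += "#" + link_parts.fragment
    return relative


print(to_relative("https://example.com/a/b.html", "https://example.com/a/c/d.html"))
```

With every same-site link rewritten like this, the downloaded folder can be moved around or opened straight from disk without the links breaking.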

It downloads with high concurrency.
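
One common way to get high but bounded concurrency is an `asyncio` semaphore. The sketch below only assumes that pattern; it is not taken from the tool. `download_all`, `fetch` and `limit` are hypothetical names, and `fetch` stands in for whatever coroutine does the real HTTP request.

```python
import asyncio


async def download_all(urls, fetch, limit=10):
    """Fetch all urls concurrently, with at most `limit` requests in flight.

    `fetch` is any coroutine function taking a URL and returning its content
    (an assumption of this sketch; plug in a real HTTP client there).
    """
    semaphore = asyncio.Semaphore(limit)

    async def worker(url):
        # The semaphore blocks here once `limit` fetches are already running.
        async with semaphore:
            return url, await fetch(url)

    results = await asyncio.gather(*(worker(url) for url in urls))
    return dict(results)


async def demo_fetch(url):
    # Stand-in for a real download: waits briefly, then returns fake content.
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


if __name__ == "__main__":
    pages = asyncio.run(download_all(["/a", "/b", "/c"], demo_fetch, limit=2))
    print(len(pages))
```

The limit keeps the crawler from opening hundreds of connections to one server at the same time.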

The reason I made this is that I want all this data so I have a lot of spam examples to train a model on.

Project page and features here: https://retoor.molodetz.nl/retoor/.... Source code at bottom as always.

I hope someone will give it a try :)

And yes, the docs took almost as much time as the code. The code doesn't contain unit tests; it's production tested instead. I applied many optimizations suggested by my review tool. When I was done, I was too tired for unit tests.

Comments
  • 3
    I've put the whole of devRant in a full-text search database and can see exactly how many times someone talked about a certain subject, for example. Also, who is mentioning whom. For some reason, devRant is a fun dataset for me to play with, since I know a lot about it, but far from everything. So I can run tests where I expect a certain outcome, but surprises are still possible.
  • 2
    If you're already doing that, make devrant archives, in case the site goes down since it's not being maintained anymore
  • 2
    @SoldierOfCode that's also the plan. I'm gonna convert the local HTML data to a database in the structure of the API objects.
  • 1
    Why !. I'll give the downie /* Cool name, BTW. */ a spin in a day or two.

    I remember needing that sort of software a few years back. I used `HTTrack` back then.

    ...probably did the job, but can't remember.
  • 2
    @D-4got10-01 found two bugs. Dammit. It wasn't a problem for my use case tho.