StanTheMan

7y

Hi, I am using a Wikipedia scraper in one of my open source projects. The data it extracts is then stored in Elasticsearch. Now I have decided to turn it into a library so that other people can use it too. My question is: should I also include the Elasticsearch storage module in the library, or just the scraper? Please let me know your thoughts.

rant

wikipedia

python

scraper

Ranter

Comments

2

metamourge

7989

7y

If possible, try to keep the storage separate from the rest as much as you can.
That way other people can use your storage module or plug in their own, without having to change your library.
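
A rough sketch of what that separation could look like, not the actual project code. The names `Storage`, `InMemoryStorage` and `scrape_and_store` are made up for illustration; an Elasticsearch backend would simply be another class with the same `save()` method.

```python
from typing import Dict, Protocol


class Storage(Protocol):
    """Anything with a matching save() method can act as a storage backend."""

    def save(self, title: str, text: str) -> None: ...


class InMemoryStorage:
    """Toy backend (hypothetical); an Elasticsearch one would expose the same method."""

    def __init__(self) -> None:
        self.pages: Dict[str, str] = {}

    def save(self, title: str, text: str) -> None:
        self.pages[title] = text


def scrape_and_store(pages: Dict[str, str], storage: Storage) -> int:
    """Normalise whitespace in each page and hand it to whichever backend was passed in."""
    for title, raw in pages.items():
        storage.save(title, " ".join(raw.split()))
    return len(pages)


store = InMemoryStorage()
count = scrape_and_store({"Python": "  A  programming\nlanguage. "}, store)
print(count, store.pages)
```

The scraper only knows about `save()`, so users who do not run Elasticsearch can pass their own backend without touching the library.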
1

commanderkeen

2056

7y

A bit off-topic: doesn't Wikipedia have an API, so you would not need a scraper?
1

StanTheMan

1310

7y

@commanderkeen I am using the Wikipedia API for searching and fetching pages, but for extracting and parsing the data I rely on the scraper. This lets me extract only the relevant text and ignore everything else. It also extracts data from tables, image descriptions and so on...
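
To give an idea of "relevant text only", here is a minimal sketch using just the standard library. It is not the real scraper: the class name `ParagraphText` and the sample HTML are invented for illustration, and it only keeps text inside `<p>` tags.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect text found inside <p> tags and skip everything else."""

    def __init__(self) -> None:
        super().__init__()
        self.depth = 0          # > 0 while we are inside a <p>
        self.paragraphs = []    # one string per paragraph

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Text outside paragraphs (menus, sidebars) is ignored.
        if self.depth:
            self.paragraphs[-1] += data


parser = ParagraphText()
parser.feed('<div class="nav">Menu</div><p>Python is a <b>language</b>.</p>')
print(parser.paragraphs)
```

Tables and image descriptions would need their own handlers along the same lines.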
1

Polarina

118

7y

You know that it is possible to download everything on Wikipedia, officially? There is no need for scraping.

https://en.wikipedia.org/wiki/...
0

StanTheMan

1310

7y

@Polarina True, but I don't want to do that...
0

StanTheMan

1310

7y

@StanTheMan @Polarina I understand why you would suggest this, though, since I am querying Wikipedia pages and storing them in Elasticsearch either way...
0

py2js

2724

7y

Why not just use the Wikipedia data dump? It's around 10 GB. Use that and parse the data; there is no need for a scraper. A scraper will take forever.
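
Something like this could stream the dump page by page instead of loading it all at once. It is a sketch under assumptions: the sample XML and its namespace are placeholders that only mimic the `<page>`/`<title>`/`<text>` layout of the export format, and `iter_pages` is a made-up helper name.

```python
import io
import xml.etree.ElementTree as ET


def local(tag: str) -> str:
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) pairs one page at a time."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "page":
            continue
        title, text = None, ""
        for child in elem.iter():
            if local(child.tag) == "title":
                title = child.text
            elif local(child.tag) == "text":
                text = child.text or ""
        yield title, text
        elem.clear()  # release the finished page's children


# Placeholder data; a real dump is far larger and uses MediaWiki's own namespace.
SAMPLE = b"""<mediawiki xmlns="http://example.org/export">
  <page><title>Python</title><revision><text>A language.</text></revision></page>
  <page><title>Elasticsearch</title><revision><text>A search engine.</text></revision></page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
print(pages)
```

For a compressed dump file, the stream could come from `bz2.open(path, "rb")` instead of `io.BytesIO`.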
1

StanTheMan

1310

7y

@py2js @Polarina Maybe you guys are right: there is no point in publishing that scraper, since Wikipedia provides the data dumps either way. It worked for my specific use case, but there is no point in making it a library if no one is going to use it...
Well, effort saved, thanks.

devRant © 2021 Hexical Labs LLC