Hi, I am using a Wikipedia scraper in one of my open source projects. The extracted data is then stored in Elasticsearch. I have now decided to turn it into a library so that other people can use it too. My question is: should I also include the Elasticsearch storage module in the library, or just the scraper? Please let me know your thoughts.

Comments
  • 2
    If possible, try to keep the storage code as separate as possible from the rest.
    That way other people can use your storage module or plug in their own, without having to change your library.
  • 1
    A bit off-topic: doesn't Wikipedia have an API, so that you would not need a scraper?
  • 1
    @commanderkeen I am using the Wikipedia API for searching and fetching pages, but to extract and parse the data I rely on the scraper. This lets me extract only the relevant text and ignore the rest. It also extracts data from tables, image descriptions, and so on...
  • 1
    You know that it is possible to download everything on Wikipedia, officially? There is no need for scraping.

    https://en.wikipedia.org/wiki/...
  • 0
    @Polarina True, but I don't want to do that...
  • 0
    @StanTheMan @Polarina I understand why you would suggest this though, as I am querying Wikipedia pages and storing them in Elasticsearch either way...
  • 0
    Why not just use the Wikipedia data dump? It's around 10 GB. Use that and parse the data; there is no need for a scraper. A scraper will take forever.
  • 1
    @py2js @Polarina Maybe you guys are right; there is no point in turning that scraper into a library, since Wikipedia provides the data dumps either way. It worked for my specific use case, but there is no point in making it a library if no one is going to use it...
    Well, effort saved, thanks.
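
For anyone who does end up packaging something like this, the suggestion in the first comment (keeping storage separate from the scraper) could look roughly like the sketch below. None of these names come from the actual project: `PageStore`, `InMemoryStore`, `extract_text`, and `scrape_and_store` are all hypothetical, and `extract_text` is only a placeholder for the real parsing step. An Elasticsearch-backed store would presumably be one more class exposing the same `save()` method, shipped as an optional module.

```python
from typing import Dict, Protocol


class PageStore(Protocol):
    """Hypothetical storage interface: any backend with this method works."""

    def save(self, title: str, text: str) -> None: ...


class InMemoryStore:
    """Stand-in backend for illustration (assumption, not the project's code).

    An Elasticsearch-backed class would expose the same save() method.
    """

    def __init__(self) -> None:
        self.pages: Dict[str, str] = {}

    def save(self, title: str, text: str) -> None:
        self.pages[title] = text


def extract_text(raw_page: str) -> str:
    """Placeholder for the scraper's parsing step: only collapses whitespace."""
    return " ".join(raw_page.split())


def scrape_and_store(title: str, raw_page: str, store: PageStore) -> str:
    """Parse a page and hand the result to whichever store the caller supplied."""
    text = extract_text(raw_page)
    store.save(title, text)
    return text


store = InMemoryStore()
scrape_and_store("Python", "  Python is a\n programming   language. ", store)
print(store.pages["Python"])  # prints "Python is a programming language."
```

Because the scraper only depends on the `save()` method, users could pass in their own store without touching the library, which is the point the first comment was making.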