Got a report from a customer saying that our scraper does not correctly scrape the content of one of their news articles. After two seconds of investigation it turned out that the "article" is just one huge JPG file with text, photos, and even something that looks like links.

  • 3
    I just fucking hate the stupid bastard customers. I’d love to call them and say ‘it’s a fucking jpeg you stupid cunt, stop wasting my time due to your fucking lack of brain cells... wanker. Piss off!!!’
  • 2
    Apache Tika can analyse images (OCR) too.
    It's pretty useful as a microservice / Docker container, and it works together with Apache Nutch, Lucene / Solr and OpenNLP.
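
For anyone who wants to try the Tika suggestion on an image-only "article", here is a rough sketch of sending the JPG to a Tika server and asking for plain text back. It assumes a tika-server instance is listening on `http://localhost:9998` (its usual default port) and that the server has Tesseract available for OCR. The helper names `build_tika_request` and `extract_text` are made up for this example and are not part of any library.

```python
import urllib.request

# Assumption: a tika-server instance is reachable at this address.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(image_bytes, content_type="image/jpeg", url=TIKA_URL):
    """Build a PUT request that uploads the image and asks for plain text."""
    return urllib.request.Request(
        url,
        data=image_bytes,
        method="PUT",
        headers={"Content-Type": content_type, "Accept": "text/plain"},
    )


def extract_text(image_bytes, content_type="image/jpeg", url=TIKA_URL, timeout=60):
    """Send the image to Tika and return whatever text it extracted.

    OCR output only comes back if the server was set up with Tesseract;
    otherwise the response is likely to be empty for a pure image.
    """
    req = build_tika_request(image_bytes, content_type, url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Usage would be something like `extract_text(open("article.jpg", "rb").read())`, where `article.jpg` is a placeholder file name. The "something looking like links" in the image would come back as plain text at best, so the scraper would still have no real URLs to follow.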