So I just had this thought that nlegs.com (NSFW) kinda feels like a test.

When I first found it, the front-end/layout was basically a Bootstrap grid (and it still is).

It was super easy to scrape.

Then over time, the owner made small tweaks and changes that felt like "oh, you guys are still here.... let's make it a bit harder and see who drops out next."

So it got more and more tricky to scrape or fool the site.

But it never became completely unfoolable. I figured if he signed up for Cloudflare, that would probably make it impossible to scrape....

Well, I was curious today, so I did a whois... and one of the things it mentioned was Cloudflare...

So now I'm like.... Hmmm.... What???!!! Ok.... ¯\_(ツ)_/¯

Comments
  • 3
I got good at scraping and reverse engineering sites to get info. It's always fun to deal with these little changes to the layout (the Billboard charts did them frequently).

Be mindful of what headers you send and what user agent is specified. If the user agent is outdated or has a misspelling, Cloudflare will block the scraper from reaching the site.
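The tip above about headers and user agents can be sketched with the standard library. This is a minimal example, not anything from the thread's actual scrapers; the header values are illustrative and would need to track current browser releases to stay convincing:

```python
import urllib.request

# A plausible, browser-like header set. An outdated or misspelled
# User-Agent string is an easy signal for Cloudflare to block on.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

def build_request(url: str) -> urllib.request.Request:
    """Attach browser-like headers to a request before it is sent."""
    return urllib.request.Request(url, headers=HEADERS)

req = build_request("https://example.com/")
```

Sending a bare default user agent (e.g. `Python-urllib/3.x`) is exactly the kind of thing that gets filtered long before any JavaScript challenge runs.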
  • 2
@MrCSharp I switched to the Selenium driver a long time ago. Then if I need to download an image, I create an HTTP request with the same headers, cookies...

But it's not exactly super advanced? Cloudflare should be able to detect/prevent that with some JS stuff...
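The trick described above, letting the browser handle the session and then reusing its cookies for a plain HTTP download, looks roughly like this. A hedged sketch: the thread's code is C#, this is the Python equivalent, and `download_image` is a hypothetical helper, not something from the thread:

```python
import urllib.request

def cookies_to_header(selenium_cookies):
    """Flatten the list of dicts from driver.get_cookies() into a
    single Cookie header value like "name1=value1; name2=value2"."""
    return "; ".join(f"{c['name']}={c['value']}" for c in selenium_cookies)

def download_image(url, selenium_cookies, user_agent):
    """Fetch one asset with the same cookies/UA the browser session used,
    so the request looks like it came from the already-cleared browser."""
    req = urllib.request.Request(url, headers={
        "User-Agent": user_agent,
        "Cookie": cookies_to_header(selenium_cookies),
    })
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

The cookie copy is the key part: any clearance token the site set in the browser rides along on the raw request.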
  • 0
Ahh yeah. I haven't played with Selenium much, but most of my scraping was done with bespoke code in C#.
  • 2
@MrCSharp I used to use just WebClient, but it was too easy to detect... I always had the thought "if only I could just control a browser, things would be so much easier (no need to reinvent the wheel, i.e. the browser)," and then Chromium and headless browsers were born.

I actually learned about it at work: someone did a POC for a web app testing framework with it (which didn't really go far) and I was like "hmmm, that's just what I always wanted!"

I had a few scraper apps, and when they got blocked, I switched them to Selenium.
  • 1
    Yeah, sounds like you had the perfect use case for it.

    For me, the code was distributed with the application so with user agent headers and other small tweaks, the risk of it getting blocked was very small.

I'm planning on centralizing some of the scraping work for performance reasons, so I think I'll end up switching to Selenium with the C# driver.

    Do you have any good resource you used?
  • 1
@MrCSharp Not sure you'd use Selenium for performance. It's basically a remote control or COM-style interface: it starts a browser either in the background or foreground, so compared to System.Net it's going to be relatively slow.

For resources, I don't quite remember; I think I just started with the official driver docs and Googled a lot.

Maybe I can share a project with you via GitHub in the next few days, but here are a few tips/warnings for now:

1. I usually install it via NuGet; you have to install the core package plus a browser driver package like Firefox or Chrome. Chrome, I think, is more powerful, with more options.

2. Selenium is technically meant for automated testing; to use it as an app component, you need to add chromedriver.exe to the project folder as a resource (there are probably some docs you can Google on this; I forget what I searched for).

3. If you use the default startup config, it will use whatever version of Chrome is currently installed... and if Chrome updates, you will need to update chromedriver.exe and the library.

The way I get around this: I download the Chromium binaries and place them in a separate folder on my computer, and there's a ChromeOption/argument you can pass in on init to specify the exact Chrome exe to run. All my apps have a Settings window where I can set this.
  • 3
It's been almost 20 years since I worked with scraping, and back then there were no real protections except broken HTML ;) and that was most likely not intentional ;)

We also used it not specifically to fetch things, but rather to check if some parts had changed, like an external change subscription.

And the scraping was set up to ignore any really dynamic content like times or dates.
  • 0
    @tits-r-us the site, yes...
  • 0
@billgates to clarify, the performance improvement comes from doing the data scraping on the server side with caching, so the client app doesn't consume too much CPU.
Thanks for the tips, they'll sure come in handy mate.