Just a legal question here.

Is web scraping legal in the USA? I'm asking here because I'm sure someone here has built projects that use web scraping.

I've heard that Walmart does it a lot.

  • 2
    This made me wonder and I found an interesting read on the subject.

  • 0
    UPDATE: You have to log in to scrape the data. Is it legal or illegal?
  • 0
    @aldoblack I'm pretty sure it's illegal if you have to log in and you agreed not to do it in the ToS. It's considered trespassing, or something. Can't remember exactly why atm.
  • 3
    @taigrr breaking a contract is not illegal. Illegal means breaking the law.
  • 1
    @electrineer I'm aware that's the case for most things. But I think there was a common law case where using a login + breaking the ToS constituted trespassing.

    Hence my mention of trespassing.
  • 0
    I doubt it.
    The information is public.

    If you need a login and have one, it's still available to you.

    Now, if you take information given to you and share it with others without access, that could get you in trouble. But as @electrineer said, doing so is breaking a contract, but probably not the law. So it's likely not illegal.
  • 0
    I think you might also need to comply with any cease and desist requests.
  • 1
    @Root AT&T got Andrew Auernheimer thrown in jail for accessing information that they mistakenly made public that he was able to receive without a login. He did not profit off of it -- just obtained it.
  • 3
    @steaksauce Sounds like a paid off judge to me. That's like saying a newspaper could have someone thrown in jail for copying a classified ad the paper had mistakenly published.

    Once published, the information is public. If anyone should go to jail, it should be those responsible for publishing it.
  • 4
    @Root well they're AT&T.. duh the judge was paid off
  • 0
    It's written in the HTTP/1.1 release papers that you should respectfully not make crawlers, as they generate huge amounts of traffic.
  • 1
    @beggarboy That's not law, though. That's documentation.
  • 3
    I love this debate. I've done it enough times before, so let's dive in:

    define scraping? many people simply say using a bot to request a web page, and then act on the returned page.

    you could argue a browser is just a scraper. what it scrapes is determined by user input, and its action is to display it.
    or that `curl http://somewebsite.com/ >/dev/null` is just scraping; the action you take is to disregard the results completely!

    "now that's a bit out there. you're intentionally playing it loose to make all web activity seem like scraping!"

    then let's try again: scraping is just browsing, but faster! much faster.

    so 1000 users accessing the same site at the same time manually, provided they all come from the same IP, look nearly indistinguishable from a scraper. does it count? up to you.

    if you want an interesting read on this, look into hiQ Labs v. LinkedIn
  • 2
    Scraping, done correctly, should look like normal traffic. Selenium+Chrome is awesome at this because you can’t scrape fast.
  • 3
    another question: from the server side, the only real difference is speed of the requests (assuming the bot is smart enough to not put WebScraper/2.1 in its user agent. we're trying to look legit here).

    how fast do humans have to refresh for it to count?

    how slow do scripts have to run for it to not?

    or is it all about what we actually do to the data? or are we just against the idea that every web request need not start with a human and a keyboard?

    I don't think web scraping could ever be illegal. at least not enforceably. it's just not defined what is and isn't scraping. nor can you tell from the server side, assuming the scraper is decent.

    so I guess a better question for webadmins is rate limiting: how often is too often in terms of periodically refreshing?
  • 0
    UPDATE 2: The data that are supposed to be scraped are ONLY FOR INTERNAL USE inside the company. Data analysis mostly. NOT FOR RESALE.
  • 3
    @aldoblack exposing the data to you doesn't sound like internal use
  • 0
    @aldoblack you mentioned Walmart in your original post and then with Update 2 gave some verbiage that sounds like Retail Link, either the DSS or OTIF pages. There are companies that do that (Harvest Corp, Retail Solutions, Mookster, Atlas), and a lot of suppliers do it internally (if they don’t get their RL data via EDI). I have a program that scrapes DSS and OTIF data, but I don’t have an RL login so I can’t use it. As long as you use valid RL logins and you work for the supplier (internal), you are fine. If you don’t work for the supplier, WM has a “third party” form that authorizes your access; the company will have to engage the RM for that form.
  • 0
    @Root Why does everything have to be a law nowadays for people to conform to something that many people agree upon?

    That's basically like saying: "Well, there's no law that says I can't drive my car like an absolute asshole, so I will"

    No one can stop you from doing it directly, but you'll not make any friends that way.
  • 2
    @beggarboy the question was about legality. And driving like an asshole is probably illegal, depending on the type of asshole you mean.
  • 0
    @beggarboy Scraping isn't immoral or mean though? So your analogy doesn't fit. Also, the original question was about legality anyway.
  • 0
    @Root ... Scraping is like a small form of DoSing because you are generating a lot of unnecessary exponential traffic.
  • 2
    @beggarboy Depends on rate, but no matter how you look at it increased traffic isn't a denial of service. Imagine the amount of scraping you would have to do in order to overload a server! Most sites don't have even close to that much content.

    You might cost the company a few cents more in hosting, but if you're going to be collecting the same data manually anyway, the traffic would only differ in speed, not volume, so it would cost them the same regardless.
  • 3
    @beggarboy depends what you mean by unnecessary.

    the way I see it, if I'm getting use out of constantly scraping, then I see no problem. if they dislike the speed at which I'm doing it, it's their job to rate limit me or revoke my access to the data. and I take the same approach with the servers I manage.

    usually I intentionally throttle myself to the minimum speed needed out of respect for their bandwidth, but I see that as more of a courtesy.
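
For anyone wondering what "throttle myself to the minimum speed needed" could look like in practice, here is a minimal sketch in Python. It is only one way to do it, not anyone's actual scraper: the `Throttle` class, the `fetch_all` helper, and the `fetch` callable are hypothetical names made up for this example, and the right interval depends entirely on the site.

```python
import time


class Throttle:
    """Enforce a minimum interval (in seconds) between successive requests.

    `clock` and `sleep` are injectable so the behaviour can be checked
    without actually waiting; by default they use the real clock.
    """

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, None before the first

    def wait(self):
        """Sleep just long enough to respect min_interval; return seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
            if delay > 0:
                self._sleep(delay)
        # Record when this request is allowed to go out.
        self._last = self._clock()
        return delay


def fetch_all(urls, throttle, fetch):
    """Call `fetch(url)` for each URL, pausing via `throttle` before each call.

    `fetch` is any callable that takes a URL and returns the page
    (hypothetical here; plug in whatever HTTP client you already use).
    """
    results = []
    for url in urls:
        throttle.wait()
        results.append(fetch(url))
    return results
```

With something like `fetch_all(urls, Throttle(2.0), my_fetch)` the script never sends more than one request every two seconds, no matter how fast the pages come back.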