scraping

Ranter

TopsyKretts

2451

Comments

1

steaksauce

1422

6y

This made me wonder and I found an interesting read on the subject.

https://resources.distilnetworks.com/...
0

TopsyKretts

2451

6y

UPDATE: You have to login to scrape the data. Is it legal or illegal?
0

taigrr

862

6y

@aldoblack I'm pretty sure it's illegal if you have to log in and you agreed in the ToS not to do it. It's considered trespassing, or something. Can't remember exactly why atm.
2

electrineer

28373

6y

@taigrr breaking a contract is not illegal. Illegal means breaking the law.
1

taigrr

862

6y

@electrineer I'm aware that's the case for most things. But I think there was a common law case where using a login + breaking the ToS constituted trespassing.

Hence my mention of trespassing.
0

Root

77451

6y

I doubt it.
The information is public.

If you need a login and have one, it's still available to you.

Now, if you take information given to you and share it with others without access, that could get you in trouble. But as @electrineer said, doing so is breaking a contract, but probably not the law. So it's likely not illegal.
0

taigrr

862

6y

I think you might also need to comply with any cease and desist requests
0

steaksauce

1422

6y

@Root AT&T got Andrew Auernheimer thrown in jail for accessing information that they mistakenly made public that he was able to receive without a login. He did not profit off of it -- just obtained it.
3

Root

77451

6y

@steaksauce Sounds like a paid off judge to me. That's like saying a newspaper could have someone thrown in jail for copying a classified ad the paper had mistakenly published.

Once published, the information is public. If anyone should go to jail, it should be those responsible for publishing it.
2

steaksauce

1422

6y

@Root well they're AT&T.. duh the judge was paid off
0

beggarboy

2123

6y

It's written in the HTTP/1.1 Release Papers that you should respectfully not make crawlers, as they generate huge amounts of traffic.
1

Root

77451

6y

@beggarboy That's not law, though. That's documentation.
2

deadPix3l

2221

6y

I love this debate. I've done it enough times before, so let's dive in:

define scraping? many people simply say using a bot to request a web page, and then act on the returned page.

you could argue a browser is just a scraper. what it scrapes is determined by user input, and its action is to display it.
or that `curl http://somewebsite.com/ >/dev/null` is just scraping; the action you take is to disregard the results completely!

"now that's a bit out there. you're intentionally playing it loose to make all web activity seem like scraping!"

then let's try again: scraping is just browsing, but faster! much faster.

so 1000 users accessing the same site at the same time manually, provided they all come from the same IP, look near indistinguishable from a scraper. does it count? up to you.

if you want an interesting read on this, look into LinkedIn v. hiQ
2

bkwilliams

7195

6y

Scraping, done correctly, should look like normal traffic. Selenium+Chrome is awesome at this because you can’t scrape fast.
2

deadPix3l

2221

6y

another question: from the server side, the only real difference is speed of the requests (assuming the bot is smart enough to not put WebScraper/2.1 in its user agent. we're trying to look legit here.)

how fast do humans have to refresh for it to count?

how slow do scripts have to run for it to not?

or is it all about what we actually do to the data? or are we just against the idea that every web request need not start with a human and a keyboard?

I don't think web scraping could ever be illegal. at least not enforceably. it's just not defined what is and isn't scraping. nor can you tell from the server side, assuming the scraper is decent.

so I guess a better question for webadmins is rate limiting: how often is too often in terms of periodically refreshing?
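For the rate-limiting question, one common server-side answer is a token bucket: each client may burst up to some number of requests, and the allowance refills at a steady rate. The following is only a minimal sketch under assumptions; the `TokenBucket` name, the `rate` and `capacity` parameters, and the injectable `clock` are all made up for illustration and are not taken from any particular server or library.

```python
import time


class TokenBucket:
    """Hypothetical per-client limiter: bursts up to `capacity` requests,
    refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate              # tokens added per second
        self.capacity = capacity      # maximum burst size
        self.clock = clock            # injectable so the logic can be tested
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self):
        """Return True if the request may proceed, False if it should be rejected."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

In practice you would keep one bucket per client (for example in a dict keyed by IP or account) and answer with HTTP 429 when `allow()` returns False. What counts as "too often" then comes down to picking `rate` and `capacity`, which is a policy decision rather than anything this sketch can settle.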
0

TopsyKretts

2451

6y

UPDATE 2: The data that are supposed to be scraped are ONLY FOR INTERNAL USE inside the company. Data analysis mostly. NOT FOR RESALE.
1

electrineer

28373

6y

@aldoblack exposing the data to you doesn't sound like internal use
0

bkwilliams

7195

6y

@aldoblack you mentioned Walmart in your original post, and Update 2 has some verbiage that sounds like Retail Link, either the DSS or OTIF pages. There are companies that do that (Harvest Corp, Retail Solutions, Mookster, Atlas), and a lot of suppliers do it internally (if they don’t get their RL data via EDI). I have a program that scrapes DSS and OTIF data, but I don’t have an RL login so I can’t use it. As long as you use valid RL logins and you work for the supplier (internal), you are fine. If you don’t work for the supplier, WM has a “third party” form that authorizes your access; the company will have to engage the RM for that form.
0

beggarboy

2123

6y

@Root Why does everything have to be a law nowadays for people to conform to something that many people agree upon?

That's like you basically saying: "Well there's no law that I can't drive my car like an absolute asshole, so I will"

No one can stop you from doing it directly, but you'll not make any friends that way.
1

electrineer

28373

6y

@beggarboy the question was about legality. And driving like an asshole is probably illegal, depending on the type of asshole you mean.
0

Root

77451

6y

@beggarboy Scraping isn't immoral or mean though? So your analogy doesn't fit. Also, the original question was about legality anyway.
0

beggarboy

2123

6y

@Root ....Scraping is like a small form of DOSing because you are generating a lot of unnecessary exponential traffic.
2

Root

77451

6y

@beggarboy Depends on rate, but no matter how you look at it increased traffic isn't a denial of service. Imagine the amount of scraping you would have to do in order to overload a server! Most sites don't have even close to that much content.

You might cost the company a few cents more in hosting, but if you're going to be collecting the same data manually anyway, the traffic would only differ in speed, not volume, so it would cost them the same regardless.
3

deadPix3l

2221

6y

@beggarboy depends what you mean by unnecessary.

the way I see it, if I'm getting use out of constantly scraping, then I see no problem. if they dislike the speed at which I'm doing it, it's their job to rate limit me or revoke my access to the data. and I take the same approach with the servers I manage.

usually I intentionally throttle myself to the minimum speed needed out of respect for their bandwidth, but I see that as more of a courtesy.
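Self-throttling on the client side can be as small as enforcing a minimum gap between requests. This is a rough sketch, not anyone's actual scraper: the `Throttle` class, its `min_interval` parameter, and the injectable `clock` and `sleep` hooks are hypothetical names chosen for the example.

```python
import time


class Throttle:
    """Hypothetical helper that keeps at least `min_interval` seconds
    between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock                    # injectable for testing
        self.sleep = sleep                    # injectable for testing
        self.next_allowed = float("-inf")     # first call never waits

    def wait(self):
        """Block until the next request is allowed; return the seconds slept."""
        delay = max(0.0, self.next_allowed - self.clock())
        if delay > 0:
            self.sleep(delay)
        # Schedule the earliest time the following request may go out.
        self.next_allowed = self.clock() + self.min_interval
        return delay


# Usage sketch (the URL list and the fetch step are placeholders):
# throttle = Throttle(min_interval=2.0)
# for url in urls:
#     throttle.wait()
#     ...fetch url here...
```

Setting `min_interval` to the slowest pace that still gets the job done matches the courtesy described above; the server can still rate limit or revoke access on top of that.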