Search - "crawlers"
-
The only hacked sites I had to fix were running on ... [prepare your stomach] ... Joomla.
I'm not sure if there is even one single solid developer for Joomla. This shit piece has more vulnerabilities than a crack hobo infested with pest-ebola-hyperAIDS.
The sites were full of hidden viagra and pr0n ads and links so the crawlers would list them.
Luckily for me, I was able to persuade the clients in all 3 cases to build a new site from scratch on a different CMS.
-
It's finally happened. For about a year I've used my mail servers to give out a different email address on my domain to everything I sign up for online. My "actual" email address, the one that receives all this email for the whole domain, was the single one I used outbound, and only for private communications.
This worked well for a long time: when spam came in, I could see where it came from by looking at the email address it was sent to. Each company's email would be sent not only from an email address that they chose, but also to an email address that I chose. It allowed me to easily determine where there were problems. For example, on Freenode IRC my vhost happened to make my username@host there a valid email address. It eventually got blacklisted due to too much incoming spam as crawlers started detecting it. Another one was "nickname"@my.domain, as I posted it a few times here. Got crawled as well. But it allowed me to easily blacklist each.
I never thought my actual outbound email address, my real one, would get crawled though. That would require the mail server of a company I explicitly communicated with to get hacked. But today that happened. I wonder whose it is, but I can't tell.
Time to bind my outgoing email to a designated email address per recipient as well. I want to know which companies this happens to, even if they don't disclose it.
-
Spent a month working on a website that relied on crawled data
Got the memory leaks and usage down from 700mb to ~150mb
CPU usage from ~100% to <5%
Shrink-wrapped the DB requirements based on data
Created self-supporting services and what not
When everything FINALLY worked good enough for me to look at it and go "damn, this actually worked"
the whole monitoring sys got dyed in red :v
A quick lookup showed that my crawlers had exhausted GoDaddy's per-user DB limits.
Kill me.
Just fuckin kill me.
-
So I just launched a website where you can create web-scrapers with just the click of a button:
http://scrape.host
-
I love writing crawlers to test stuff. A client wanted me to crawl his page for certain errors... seems I DDoSed them.
-
*The things you find watching logs*
From an innocent script I wrote to crawl Alexa categories for the data science team.
-
Me: This element is located by ID (and visible) on your page, so the page must contain it.
Selenium: I've never seen that element!
-
Question for Web Server Gurus and Security Ninjas.
How do you prevent bots, crawlers and spammers from sending a flood of requests to your web servers?
The web server keeps getting requests to routes like /admin, /ssh, /phpmyadmin and all kinds of other stuff.
Is there a way to automatically block those stupid IPs :/
-
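One way to approach this, as a minimal sketch: scan the access log for requests to routes the site never serves, count them per client IP, and hand the repeat offenders to the firewall. The log format (common/combined log format), the `PROBE_PATHS` list and the `threshold` value are all assumptions for illustration; tools such as fail2ban apply the same idea far more robustly.

```python
import re
from collections import Counter

# Hypothetical list of routes this site never serves; adjust to your own app.
PROBE_PATHS = ("/admin", "/ssh", "/phpmyadmin")

# Matches the client IP and request path of a common/combined log format line.
LOG_LINE = re.compile(r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "(?:[A-Z]+) (?P<path>\S+)')


def find_offenders(log_lines, threshold=3):
    """Return the IPs that requested probe paths at least `threshold` times."""
    hits = Counter()
    for line in log_lines:
        match = LOG_LINE.match(line)
        # str.startswith accepts a tuple, so one call checks every probe path.
        if match and match.group("path").lower().startswith(PROBE_PATHS):
            hits[match.group("ip")] += 1
    return sorted(ip for ip, count in hits.items() if count >= threshold)
```

The returned list could then be fed into whatever blocking mechanism the server already has (a firewall rule set or a web server deny list). That part is left out here because it depends on the setup.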
There are a few email addresses on my domain that I keep receiving spam on, because I shared them on forums or whatever and crawlers picked them up.
I run Postfix as a mail server in a catch-all configuration. For whatever reason, blacklisting email addresses doesn't work in this setup, and given Postfix's complexity I gave up after a few days. Instead I wrote a little bash script called "unspam" that logs into the mail server, greps all the emails in the mail directory for those particular email addresses, and moves whatever comes up to the .Junk directory.
On SSD it seems reasonably fast, and ZFS caching sure helps a lot too (although it's limited to 1GB of memory max). It could've been a lot slower than it currently is. I'm not exactly proud of myself for doing that. But hey, it works!
-
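The approach described above can be sketched roughly as follows. This is a Python illustration, not the original bash script: the function name, the directory arguments and the one-message-per-file layout are assumptions, and it searches whole message files the way a plain grep would.

```python
import shutil
from pathlib import Path


def unspam(maildir, junk_dir, addresses):
    """Move every message file that mentions one of `addresses` into `junk_dir`."""
    needles = [address.lower().encode() for address in addresses]
    junk_dir = Path(junk_dir)
    junk_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    # sorted() materializes the listing first, so moving files is safe.
    for message in sorted(Path(maildir).iterdir()):
        if not message.is_file():
            continue
        content = message.read_bytes().lower()
        if any(needle in content for needle in needles):
            shutil.move(str(message), str(junk_dir / message.name))
            moved.append(message.name)
    return moved
```

Like the grep-based original, this reads every message in full, so its speed depends mostly on disk and cache performance.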
It's ironic in my case. Python is so simple and fast to implement that I end up using it for all my projects (web dev, ML, crawlers, etc.). But I still can't use Python for competitive programming: it feels unfamiliar the moment I don't have access to Google. Still a way to go in learning Python, though I'm able to think Pythonically nowadays.. ;p
-
In my experience, any BE dev or old architect/lead programmer that says they “can do frontend” does shit like writing Ajax calls in script tags directly in the html. They are the ones who add style attributes directly in html. They are the ones who google how to center a div and they still use float positioning because all of them are old, arrogant BE devs who get caught in a single framework who convince themselves they are an expert. They can’t give any good UX advice. They don’t know how to use a screen reader. They don’t know what WCAG means. They don’t constantly keep up to date on what browsers are supporting and what’s being released in the unstable versions. They don’t know what a web component is. They don’t know what a closure is. They don’t know anything about optimizing web perf metrics. They couldn’t tell you what web crawlers look for. They couldn’t tell you anything about design principles and anti-patterns. They don’t know how to manage a web application that will be seen by millions AND keep it nice, shiny, and refactorable on the code side. What do they really fucking know? how to write an MVC app? How to connect APIs and integrate code that other people wrote? I do full stack all day and writing anything not-client-facing is super easy.
Take that stick out of your ass and get over yourself you asshole. You haven’t written anything close to amazing even though you constantly act like you’re a god-tier programmer and your shit doesn’t stink.
Hit the books like the rest of us you fuck.
The Frontend is anything but fucking easy.
-
Just this tiny website, that's a complete database of all cars ever created, of course with every variant and different versions through the years.
Supposed to be searchable, so that you can compare cars within a class or by features or something completely different.
And as the final icing, it should have crawlers searching used-car sites to inform the user of changes in price over time.
-
Searched an error on Google
Only one result was relevant to my search.
It had the entire error line in it. Yay!
It was the GitHub source page of the compilation code that generates the actual error 💫
GitHub must be disallowing web crawlers from indexing files with programming-language extensions.
-
I set up a Linux server at home on a Rock 64 connected to an external hard drive. Now I can download torrents and run crawlers from anywhere using my phone.
-
A beginner in learning Java here. I've been beating around the bush on the internet for the past decade. As per my understanding up to now: suppose a bottle of water. Here the bottle may be considered the CLASS and the water in it the objects (atoms); some objects may be of the same kind and others may differ in some properties. Another way of understanding it would be that human being is the CLASS and male and female are objects of class Human Being. Here again, objects may differ in properties such as gender, age, body parts. Zoo might be a class with animals (objects), elephants (objects), tigers (objects) and others too. The human properties above can be added as well, so the Zoo class has properties such as male, female, body parts, age, eating habits, crawlers, four-legged, two-legged, flying, water animals, mammals, herbivores, carnivores.. whatever.. This is my understanding so far. Corrections are always welcome. I'll be happy if my answer gets modified, comment below.
And for basic level.
Learn from input, output devices
Then memory-wise: cache (quick access), RAM (temporary runtime memory), hard disk (permanent memory), all will be in the CPU machine. To express the above clearly as per my knowledge: right now I'm writing this answer on my mobile with the net on. If I suddenly switch off my phone and switch it on again: cache is there for instant access to navigation, network etc. RAM is temporary, so my Quora answer will be lost, as it was stored in RAM before the switch-off. But my Quora app, my gallery and the rest are on permanent internal storage (in PCs, generally hard disks) and won't be affected. This all happens in the CPU, right. Okay, now one question: who manages all these commands, inputs, outputs? That's software: it may be Windows, Mac, iOS, or Android for mobiles. These are the managers of the computer's component setup, the different OSes.
Java is a high-level language, whereas computers understand only binary, a low-level language of 0's and 1's. They understand only 00101, 1110000101, 0010, 1100 (let these be ABCD in binary). Numbers are coded in 0's and 1's, lowercase letters are in 0's and 1's, and other symbols too. The program we write is compiled to bytecode by the Java compiler (javac) and then given to the JVM (Java virtual machine), which acts as the interpreter. But not in C.
Let us C…
Do comment. Thank you
-
It goes back two years: I was writing web crawlers with Scrapy. I don't remember how long I worked, but I think it took a full day.
Why: because web crawling is so much fun, and also I was young and stupid.
-
Every ten years, a new social nexus, from Usenet to Reddit. Every day, a flame war. Every year, a great leader that wins flame wars, convinces people to follow them. The question is, what happens next? What do you preach to the gullible masses you won over?
Every single time it gets to politics, and then, to philosophy. Yet, there are no large strides in sight to world peace.
You've seen that meme where everything is just applied math. Well, math is applied philosophy, and philosophy is a product of misunderstanding the language.
In the end, the flame war you won never mattered. Archived threads, Wayback Machine, inactive Usenet mirrors. Acres upon acres of human thought, passionately expressed in computer text, roamed by no one but web crawlers. Give them three days, and they'll forget what you taught them.
WWI had shown us that we couldn't improve the masses with art and education. There is no vaccine against stupidity.
Life on Earth is hell. People are hell. Living among people is hell. If your life isn't hell, you're fortunate enough to be paying criminals that are stronger than other criminals around them, for protection.
Only the habit of systematically denying yourself pleasures your inner animal wants, plus a healthy dose of doubt, can make you human. Without restraint, a man is merely a greedy beast.
-
let's say i want to host my own local search engine, and i have the application ready.
now i want to activate my crawlers to scrape and index the web.
would i be in hot water for doing this? is there any implementation-level rule that i can check other than robots.txt?
any thoughts or inputs on the subject, other than it being a huge waste of time and resources :D
-
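For the robots.txt part at least, the check can be done mechanically with Python's standard library. This is a minimal sketch: the `USER_AGENT` name and the helper functions are hypothetical, and the rules are parsed from a string here so the example does not need network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical user agent name for the crawler; pick your own.
USER_AGENT = "my-local-search-bot"


def load_rules(robots_txt):
    """Parse the text of a robots.txt file into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_crawl(parser, url, user_agent=USER_AGENT):
    """Return True if the parsed rules allow `user_agent` to fetch `url`."""
    return parser.can_fetch(user_agent, url)
```

A crawler would call `may_crawl` before every fetch, and could also honor `parser.crawl_delay(USER_AGENT)`, which returns the site's `Crawl-delay` value or `None` when the site does not set one.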
man if i could figure out how to do stuff and had the money to do stuff i'd be dangerous as fuck, but as of now i can only posit questions... it sucks.
Examples:
- What do modern browsers/crawlers do when hit with, say, an "HTTP 450 Blocked by Windows Parental Controls" or an "HTTP 374" status code?
- What happens if I do <xyz minor edge case thing> on <system?> (just use your imagination, this happens for every edge case i can think of for every system and the list wouldn't fit in a few megs' worth of half-byte ASCII, much less *here*)
- What if I made like a board to fuck with busses while systems were on? Press a button and for like five bus clock cycles pins like 6 and 7 are shorted? That sort of thing. As for system/bus types, *literally any* (old consoles with expansion ports, PCI/-e/-X/whatever, southbridge, etc.)
- What if I did <filetype> shenanigans by doing <something indescribably horrible> to this file? How do things react?
-
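On the first question, HTTP itself gives a partial answer: RFC 9110 says a client must treat a status code it does not recognize as equivalent to the x00 code of the same class, so an unknown 450 is handled like a 400 and an unknown 374 like a 300. A minimal sketch of that rule follows; the `KNOWN` set and the function name are illustrative assumptions, and real browsers and crawlers layer their own behavior on top.

```python
# Illustrative subset of status codes this hypothetical client understands.
KNOWN = {200, 301, 302, 304, 400, 403, 404, 500, 503}


def effective_status(code, known=KNOWN):
    """Map a status code to the code the client should act on."""
    if not 100 <= code <= 599:
        # RFC 9110 treats values outside 100..599 as invalid.
        raise ValueError(f"invalid HTTP status code: {code}")
    if code in known:
        return code
    # Unrecognized code: fall back to the x00 code of its class.
    return (code // 100) * 100
```

With these assumptions, `effective_status(450)` falls back to 400 and `effective_status(374)` to 300, while a recognized code such as 404 is returned unchanged.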
More news from The Verge: as the internet itself becomes smarter, big tech companies rethink internet policy.
https://theverge.com/24067997/...
https://en.wikipedia.org/wiki/...
#news #links #theverge #wikipedia #interesting #changing #bigtech #ai