Do all the things like ++ or -- rants, post your own rants, comment on others' rants and build your customized dev avatarSign Up
From the creators of devRant, Pipeless lets you power real-time personalized recommendations and activity feeds using a simple APILearn More
Search - "dataset"
We have a few pieces of news we're very excited to share with everyone today. Apologies for the long post, but there's a lot to cover!
First, as some of you might have already seen, we just launched the "subscribed" tab in the devRant app on iOS and Android. This feature shows you a feed of the most recent rant posts, likes, and comments from all of the people you subscribe to. This activity feed is updated in real-time (although you have to manually refresh it right now), so you can quickly see the latest activity. Additionally, the feed also shows recommended users (based on your tastes) that you might want to subscribe to. We think both of these aspects of the feed will greatly improve the devRant content discovery experience.
This new feature leads directly into this next announcement. Tim (@trogus) and I just launched a public SaaS API service that powers the features above (and can power many more use-cases across recommendations and activity feeds, with more to come). The service is called Pipeless (https://pipeless.io) and it is currently live (beta), and we encourage everyone to check it out. All feedback is greatly appreciated. It is called Pipeless because it removes the need to create complicated pipelines to power features/algorithms, by instead utilizing the flexibility of graph databases.
Pipeless was born out of the years of experience Tim and I have had working on devRant and from the desire we've seen from the community to have more insight into our technology. One of my favorite (and earliest) devRant memories is from around when we launched, and we instantly had many questions from the community about what tech stack we were using. That interest is what encouraged us to create the "about" page in the app that gives an overview of what technologies we use for devRant.
Since launch, the biggest technology powering devRant has always been our graph database. It's been fun discussing that technology with many of you. Now, we're excited to bring this technology to everyone in the form of a very simple REST API that you can use to quickly build projects that include real-time recommendations and activity feeds. Tim and I are really looking forward to hopefully seeing members of the community make really cool and unique things with the API.
Pipeless has a free plan where you get 75,000 API calls/month and 75,000 items stored. We think this is a solid amount of calls/storage to test out and even build cool projects/features with the API. Additionally, as a thanks for continued support, for devRant++ subscribers who were subscribed before this announcement was posted, we will give some bonus calls/data storage. If you'd like that special bonus, you can just let me know in the comments (as long as your devRant email is the same as Pipeless account email) or feel free to email me (email@example.com).
Lastly, and also related, we think Pipeless is going to help us fulfill one of the biggest pieces of feedback we’ve heard from the community. Now, it is going to be our goal to open source the various components of devRant. Although there’s been a few reasons stated in the past for why we haven’t done that, one of the biggest reasons was always the highly proprietary and complicated nature of our backend storage systems. But now, with Pipeless, it will allow us to start moving data there, and then everyone has access to the same system/technology that is powering the devRant backend. The first step for this transition was building the new “subscribed” feed completely on top of Pipeless. We will be following up with more details about this open sourcing effort soon, and we’re very excited for it and we think the community will be too.
Anyway, thank you for reading this and we are really looking forward to everyone’s feedback and seeing what members of the community create with the service. If you’re looking for a very simple way to get started, we have a full sample dataset (1 click to import!) with a tutorial that Tim put together (https://docs.pipeless.io/docs/...) and a full dev portal/documentation (https://docs.pipeless.io).
Let us know if you have any questions and thanks everyone!
- David & Tim (@dfox & @trogus)51
Seven months ago:
Project Manager: - "Guys, we need to make this brand new ProjectX, here are the specs. What do you think?"
Bored Old Lead: - "I was going to resign this week but you've convinced me, this is a challenge, I never worked with this stack, I'm staying! I'll gladly play with this framework I never used before, it seems to work with this libA I can use here and this libB that I can use here! Such fun!"
Project Manager: - "Awesome! I'm counting on you!"
Six months ago:
Cprn: - "So this part you asked me to implement is tons of work due to the way you're using libA. I really don't think we need it here. We could use a more common approach."
Bored Old Lead: - "No, I already rewrote parts of libB to work with libA, we're keeping it. Just do what's needed."
Cprn: - "Really? Oh, I see. It solves this one issue I'm having at least. Did you push the changes upstream?"
Bored Old Lead: - "No, nobody uses it like that, people don't need it."
Cprn: - "Wait... What? Then why did you even *think* about using those two libs together? It makes no sense."
Bored Old Lead: - "Come on, it's a challenge! Read it! Understand it! It'll make you a better coder!"
Four months ago:
Cprn: - "That version of the framework you used is loosing support next month. We really should update."
Bored Old Lead: - "Yeah, we can't. I changed some core framework mechanics and the patches won't work with the new version. I'd have to rewrite these."
Cprn: - "Please do?"
Bored Old Lead: - "Nah, it's a waste of time! We're not updating!"
Three months ago:
Bored Old Lead: - "The code you committed doesn't pass the tests."
Cprn: - "I just run it on my working copy and everything passes."
Bored Old Lead: - "Doesn't work on mine."
Cprn: - "Let me take a look... Ah! Here you go! You've misused these two options in the framework config for your dev environment."
Bored Old Lead: - "No, I had to hack them like that to work with libB."
Cprn: - "But the new framework version already brings everything we need from libB. We could just update and drop it."
Bored Old Lead: - "No! Can't update, remember?"
Bored Old Lead: - "You need to rewrite these tests. They work really slow. Two hours to pass all."
Cprn: - "What..? How come? I just run them on revision from this morning and all passed in a minute."
Bored Old Lead: - "Pull the changes and try again. I changed few input dataset objects and then copied results from error messages to assertions to make the tests pass and now it takes two hours. I've narrowed it to those weird tests here."
Cprn: - "Yeah, all of those use ORM. Maybe it's something with the model?"
Bored Old Lead: - "No, all is fine with the model. I was just there rewriting the way framework maps data types to accommodate for my new type that's really just an enum but I made it into a special custom object that needs special custom handling in the ORM. I haven't noticed any issues."
Cprn: - "What!? This makes *zero* sense! You're rewriting vendor code and expect everything to just work!? You're using libs that aren't designed to work together in production code because you wanted a challenge!?? And when everything blows up you're blaming my test code that you're feeding with incorrect dataset!??? See you on Monday, I'm going home! *door slam*"
Project Manager: - "Cprn, Bored Old Lead left on Friday. He said he can't work with you. You're responsible for Project X now."26
I’m so mad I’m fighting back anger tears. This is a long rant and I apologize but I’m so freaking mad.
So a few weeks ago I was asked by my lead staff person to do a data analysis project for the director of our dept. It was a pretty big project, spanning thousands of users. I was excited because I love this sort of thing and I really don’t have anything else to do. Well I don’t have access to the dataset, so I had to get it from my lead and he said he’d do it when he had a chance. Three days later he hadn’t given it to me yet. I approach him and he follows me to my desk, gives me his login and password to login to the secure freaking database, then has me clone it and put it on my computer.
So, I start working on it. It took me about six hours to clean the database, 2 to set up the parameters and plan of attack, and two or three to visualize the data. I realized about halfway through that my lead wasn’t sure about the parameters of the analysis, and I mentioned to him that the director had asked for more information than what he was having me do. He tells me he will speak with director.
So, our director is never there, so I give my lead about a week to speak with her, in the mean time I finish the project to the specifications that the director gave. I even included notes about information that I would need to make more accurate predictions, to draw conclusions, etc. It was really well documented.
Finally, exasperated, and with the project finished but just sitting on my computer for a week, I approached my director on a Saturday when I was working overtime. She confirmed that I needed to what she said in the project specs (duh), and also mentioned she needed a bigger data set than what I was working with if we had one. She told me to speak to my lead on Monday about this, but said that my work looked great.
Monday came and my lead wasn’t there so I spoke with my supervisor and she said that what I was using was the entire dataset, and that my work looked great and I could just send it off. So, at this point 2/3 of my bosses have seen the project, reviewed it, told me it was great, and confirmed that I was doing the right thing.
I sent it off to the director to disseminate to the appropriate people. Again, she looked at it and said it was great.
A week later (today) one of the people that the project was sent to approaches me and tells me that i did a great job and thank you so much for blah blah blah. She then asks me if the dataset I used included blahblah, and I said no, that I used what was given to me but that I’d be happy to go in and fix it if given the necessary data.
She tells me, “yeah the director was under the impression that these numbers were all about blahblah, so I think there was some kind of misunderstanding.” And then implied that I would not be the one fixing the mistake.
I’m being taken off of the project for two reasons: 1. it took to long to get the project out in the first place,
2. It didn’t even answer the questions that they needed answered.
I fucking told them in the notes and ALL THROUGH THE VISUALIZATIONS that I needed additional data to compare these things I’m so fucking mad. I’m so mad.15
That awkward moment when you realize the code you have been debugging for an hour actually works fine and updating the entire dataset. You've just been returning only the top 1000 rows.2
This was WAY back in my first job as a programmer where I was working on a custom built CMS that we took over from another dev shop. So a standard feature was of course pagination for a section that had well over 400,000 records. The client would always complain about this section always being very slow to load. My boss at that job would tell me to not look at the problem as it wasn't a part of the scope.
But being a young enthusiastic programmer, I decided to delve into the problem anyway. What I came to discover was that the pagination was simply doing a select all 400,000 records, and then looping through the entire dataset until it got to the slice it needed to display.
So I fixed the pagination and page loads went from around 1 min to only a few seconds. I felt pretty proud about that. But I later got told off by my boss as he now can't bill for that fix. Personally I didn't care since I learned a bit about SQL pagination, and just how terrible some developers can be.5
I've optimised so many things in my time I can't remember most of them.
Most recently, something had to be the equivalent off `"literal" LIKE column` with a million rows to compare. It would take around a second average each literal to lookup for a service that needs to be high load and low latency. This isn't an easy case to optimise, many people would consider it impossible.
It took my a couple of hours to reverse engineer the data and implement a few hundred line implementation that would look it up in 1ms average with the worst possible case being very rare and not too distant from this.
In another case there was a lookup of arbitrary time spans that most people would not bother to cache because the input parameters are too short lived and variable to make a difference. I replaced the 50000+ line application acting as a middle man between the application and database with 500 lines of code that did the look up faster and was able to implement a reasonable caching strategy. This dropped resource consumption by a minimum of factor of ten at least. Misses were cheaper and it was able to cache most cases. It also involved modifying the client library in C to stop it unnecessarily wrapping primitives in objects to the high level language which was causing it to consume excessive amounts of memory when processing huge data streams.
Another system would download a huge data set for every point of sale constantly, then parse and apply it. It had to reflect changes quickly but would download the whole dataset each time containing hundreds of thousands of rows. I whipped up a system so that a single server (barring redundancy) would download it in a loop, parse it using C which was much faster than the traditional interpreted language, then use a custom data differential format, TCP data streaming protocol, binary serialisation and LZMA compression to pipe it down to points of sale. This protocol also used versioning for catchup and differential combination for additional reduction in size. It went from being 30 seconds to a few minutes behind to using able to keep up to with in a second of changes. It was also using so much bandwidth that it would reach the limit on ADSL connections then get throttled. I looked at the traffic stats after and it dropped from dozens of terabytes a month to around a gigabyte or so a month for several hundred machines. The drop in the graphs you'd think all the machines had been turned off as that's what it looked like. It could now happily run over GPRS or 56K.
I was working on a project with a lot of data and noticed these huge tables and horrible queries. The tables were all the results of queries. Someone wrote terrible SQL then to optimise it ran it in the background with all possible variable values then store the results of joins and aggregates into new tables. On top of those tables they wrote more SQL. I wrote some new queries and query generation that wiped out thousands of lines of code immediately and operated on the original tables taking things down from 30GB and rapidly climbing to a couple GB.
Another time a piece of mathematics had to generate all possible permutations and the existing solution was factorial. I worked out how to optimise it to run n*n which believe it or not made the world of difference. Went from hardly handling anything to handling anything thrown at it. It was nice trying to get people to "freeze the system now".
I build my own frontend systems (admittedly rushed) that do what angular/react/vue aim for but with higher (maximum) performance including an in memory data base to back the UI that had layered event driven indexes and could handle referential integrity (overlay on the database only revealing items with valid integrity) or reordering and reposition events very rapidly using a custom AVL tree. You could layer indexes over it (data inheritance) that could be partial and dynamic.
So many times have I optimised things on automatic just cleaning up code normally. Hundreds, thousands of optimisations. It's what makes my clock tick.4
Context: we analyzed data from the past 10 years.
So the fuckface who calls himself head of research tried to put blame on me again, what a surprise. He asked for a tool what basically adds a lot of numbers together with some tweakable stuff, doesnt really matter. Now of course all the datanwas available already so i just grabbed it off of our api, and did the math thing. Then this turdnugget spends 4 literal weeks, tryna feed a local csv file into the program, because he 'wanted to change some values'. One, this isnt what we agreed on, he wanted the data from the original. When i told him this, he denied it so i had to dig out a year old email. Two, he never explicitly specified anything so i didnt use a local file because why the fuck would i do that. Three, i clearly told him that it pulls data from the server. Four, what the fuck does he wanna change past values for, getting ready for time travel? Five, he ranted for like 3 pages, when a change can be done by currentVal - changedVal + newVal, even a fucking 10 year old could figure that out. Also, when i allowed changes in a temporary api, he bitched about how the additional info, what was calculated from yet another original dataset, doesnt get updated, when he fucking just randimly changes values in the end set. Pinnacle of professionalism.2
Look, I'm not even mad that your dataset is the spaghettiest of all spaghetti, but why do you have ten different jupyter notebook files lying around?
I mean, I'm not implying that a monkey has more brain in his armpit than you have in your entire body, but like, you call this a dataset while all over seen so far is half-processed garbage. You could've just dipped your pc in sewage and the results would still be cleaner than this.
Luckily, your paper is half decent so what the hell, let's see if I can fish anything useful out of this. But I swear to god if I come across another static path in this... And here we go! Another static path! Ladies and gentlemen, I propose we get this guy's phd back until he learns to fucking do a decent code.
(It's actually a massively complicated project, so it kinda makes sense to be this big of a mess. But still!)10
FFS customer, if you want me to test with real data and solve your problem quickly, then maybe NOT send me an export in which the social security number says "12" for about a quarter of the entries in the dataset and in which the date format is still wrong.
Sincerely, the pissed off dev who told you this about 4 times already.
PS: I hope you step on a Lego.4
So I just finished a prototyp for my thesis. Still need to segment the real data myself and collect some statistics stuff to write about the network, but I am pretty proud of the result considering the dataset is very small.
For now I need some god damn sleep.6
Someone put a fucking \b in this dataset I'm working with, which just so happens to be an illegal character for xml.
FUCKING HOW. FUCKING WHY. FUCKING WHERE ARE YOU AND WHY DO YOU WANT TO SEE ME SUFFER THIS MUCH4
Dude, publish your damn dataset with your damn ML study!!!! I'm not even asking for your Godforsaken model!
If anyone has been keeping up with my data warehouse from hell stories, we're reaching the climax. Today I reached my breaking point and wrote a strongly worder email about the situation. I detailed 3 separate cases of violated referential integrity (this warehouse has no constraints) and a field pulling from THE WRONG FLIPPING TABLE. Each instance was detailed with the lying ER diagram, highlighted the violating key pairs, the dangers they posed, and how to fix it. Note that this is a financial document; a financial document with nondeterministic behavior because the previous contractors' laziness. I feel like the flipping harbinger of doom with a cardboard sign saying "the end is near" and keep having to self-validate that if I was to change anything about this code, **financial numbers would change**, names would swap, description codes would change, and because they're edge cases in a giant dataset, they'll be hard to find. My email included SQL queries returning values where integrity is violated 15+ times. There's legacy data just shoved in ignoring all constraints. There are misspellings where a new one was made instead of updating, leaving the pk the same.
Now I'd just put sorting and other algos, but the data is processed by a crystal report. It has no debugger. No analysis tools. 11 subreports. The thing takes an hour to run and 77k queries to the oracle backend. It's one of the most disgusting infrastructures I've ever seen. There's no other solution to this but to either move to a general programming language or get the contractor to fix the data warehouse. I feel like I've gotten nowhere trying to debug this for 2 months. Now that I've reached what's probably the root issue, the office beaucracy is resisting the idea of throwing out the fire hazard and keeping the good parts. The upper management wants to just install sprinklers, and I'm losing it.
Random af project idea that will see me burned alive by the internet (because if I do it I intend to put it in dev.to which is full of "that offends me" people):
Generate a classifier that will scan text from different websites and categorize where the person might be from.
Example: "plz send bob and vagene" <--- we all know
"mami que ricas nalgas" <--- Mexican for the most part.
"there, their, they're and similar text" <--- my fellow Americans for the most part....
"cyka blyat" <--- 0.o we know
"pompous statement about the way Americans do shit" <--- European, meaning, from Yurop.
"angry as fuck rant/banter" <-- German
"lol whatever Trump is the best president ever" <--- some moron from the south of the U.S (south much like myself but I am not a Trump supporter nor a republican)
What makes this complex is that I would have to put together my own dataset in the highly likely chance of something like that not existing already for me to use.
Can you imagine the chaos?18
when you start machine learning on you laptop, and want to it take to next level, then you realize that the data set is even bigger that your current hard-disk's size. fuuuuuucccckkk😲😲
P. S. even metadata csv file was 500 mb. Took at least 1 min to open it. 😭😧11
Today I had to present my final year project called segmentation and detection of glioma using deep learning.
It took 15 minutes for the evaluators to understand what an mri image dataset (BRaTs) dataset looks like (they are voxels and not pixels). My god, these peasants...
And I was there expecting them to understand down sampling convolutions and up scaling convolutions of U Net model 😂. Life is so convoluted right now!2
Where do you personally draw the line between "good enough" and "not good enough" code?
I just spent a considerable amount of time refactoring a function into 4 sub functions. The original one was fetching alerts for 4 related entities. All of the alerts are needed together in one dataset. A lot of variables and some of the logic was shared, therefore I originally chose to keep it as one, admittedly very large function. Where do you guys draw the line between refactoring or not refactoring in cases like these? How much weight do you give to separation of concerns vs simplicity? I'm curious.9
So I downloaded the handwritten character dataset from EMNIST, took an hour to extract. Found out it has 814k + pngs. I haven't optimized all of them with Gimp 😕 change them to idx3-ubyte format, or make labels for each char in a c++ automation.. While Gimp was frozen in bulk process, I started hearing crackling sounds from my desk 😨😨 I'm like fckkk shits gonna blow..
That's when the poorly taped 0x7E2 calendar fell from the wall.🤣🤣🤣2
Started off with a prototype with about 20 backend data points per hour and 10 concurrent front-end users. Total dataset in the 10,000s.
Now about 5 backend data points per second and 500 concurrent front-end users. Total dataset in excess of 20,000,000.
A few days ago PM started asking me once a day when we will have fixed an error he saw at customer site.
I always tell him that it is not a software error, just missing data. I try to explain the issue and that the root cause is the incomplete data given by the customer.
Then he says he will talk with the data import guys if they can fix something and I tell him that from my point of view the data import is fine, but the customer has to provide a full dataset and the "error" will vanish immediately.
He walks to the data import colleague anyway and gets told that everything is ok with the import.
Next day he appears at my desk to tell me that the import seems ok and asks me how we could fix the error and I tell him that it's not a software issue, try to explain it...
I wonder how long he will keep up on it.3
Data scientist and related devs, how do you handle large datasets?
I was given a .txt file containing +1M edges of a directed graph. I tried to analyze it with networkx, but my computer killed the process as it was eating too much CPU/memory.
I would be grateful for any advice!24
When your teacher tells you to run a model that uses a 1gb dataset on a computer that has 256 mb ram.
Ah, what sin have I done, my lord? :/
Had some fun with textgenrnn (Tensorflow text generating thingy on Github). So I created a tiny dataset with some example c# code and let it train for a while.
Sorry people, but I ruined our jobs. We don't need to write code anymore.
Update: image was unreadable due to compression. Let me find an alternative.7
Well... I can think of several bugs that I found on a previous project, but one of the worst (if not the worst, because the damage scope) it's one bug that only appears for a couple of days at the end of every month.
What happens is the following: this bug occurs in a submodule designed (heh) to control the monthly production according the client requirements (client says "I want 1000 thoot picks", that submodule calculates the daily production requirements in order to full fill the order).
Ideally, that programming need to be done once a week (for the current month), because the quantities are updated by client on the same schedule, and one of the edge cases is that when the current date is >= 16th of the month, the user can start programming the production of the following month.
So, according to this specific case, there's an unidentified, elusive, and nasty bug that only shows up on the two last days of every month, when it doesn't allow to modify/create anything for the following month. I mean, normally, whenever you try to edit/create new data, the application shows either an estimated of the quantities to produce, or the previous saved data. But on those specific days it doesn't show any information at all, disregarding of there's something saved or not.
The worst thing is that such process involves both a very overcomplicated stored procedure, and an overcomplicated functionality on the client side (did I mentioned that it dynamically generates a pseudo-spreadsheet with the procedure dataset? Cell by cell), that absolutely no one really fully understands, and the dude that made those artifacts is no longer available (and by now, I'm not so sure that he even remember what he done there).
One of the worst thing is that at this point, it's easier to handle with that error rather to redesign all of that (not because technical limitations, but for bureaucratic and management issues).
The another worst thing (the most important none) is that this specific bug can create a HUGE mess as it prevents the programming of the production to be done the next day (you know, people tends to procrastinate and start doing things at the very end of the day/week/month)... And considering that the company could lose a huge amount of money by every minute without production, you can guess the damage scope of this single bug.
Anyway, this bug has existed since, I don't know, 2015 (Q4?) and we have tried so many things trying to solve it, but that spaghettis refuse to be understood (specially the stored procedure, as it has dynamically generated queries). During my tenure (that ended last year) I spent a good amount of time (considering what I mentioned on the last rant, about the toxic environment) trying to solve that, just giving up after the first couple of weeks.
Anyway... I'm guessing that this particular bug will survive another 4-ish years, or even outlive the current full development team... But, who knows ¯\_(ツ)_/¯ ?
Late night kaggle session, and I'm enjoying how cute and clean this dataset is!
I'm jealous if data scientists always get to work with such neat sets! Dude! I got .95 acc without any effort! This is so... Weird. 🤔4
-----------Jr Dev Fucked by Sr Dev RANT------
Huge data set (300X) that looks like this :
( Primary_key, group_id,100more columns) .
Dataset to be split in records of X sized files such that all primary_key(s) of same group_id has to go in same file.
Sde2 with MS from Australia, 12 years of 'experience' generates an 'algo'. 70% Test case FAILED.
I write a bin packing algo with 100% test case pass, raises pull request to MASTER in < 1 day. Same sde2 does not approve, blocking same day release.
|-_-| What the fuck |-_-| Incompetent people getting 2x my salary with <.5x my work2
I used to be a big security guy, not allowing stuff like most of the social media, not bringing my phone anywhere, carrying a RPi tablet for privacy reasons. Very Stallman stuff.
Recently I noticed that I don't care so much.. I see these things as opportunities, for instance Microsoft products could be benefitial for job opportunities, I have some workout sessions on my phone.
I could restrict myself... but is it worth it just to decline some capitalist/politician's row in a dataset for analysis?
But then again I feel as a society I think we should either do this or request this data to be distributed to us as well.
Should you be playing a game of cards, when the enemy can see your hand? What do u think?6
Started doing deep learning.
Me: I guess the training will take 3 days to get about 60% accuracy
Electricity: I dont think so! *Power cut every day lately
My dataset: I dont think so! *Running training only achieve 30% of learning accuracy and 19% of validation accuracy
Project submission: next week
So I decided to run mozilla deep speech against some of my local language dataset using transfer learning from existing english model.
I adjusted alphabet and begin the learning.
I have pc with gtx1080 laying around so I utilized that but I recommend to use at least newest rtx 3080 to not waste time ( you can read about how much time it took below ).
Waited for 3 days and error goes to about ~30 so I switched the dataset and error went to about ~1 after a week.
Yeah I waited whole got damn week cause I don’t use this computer daily.
So I picked some audio from youtube to translate speech to text and it works a little. It’s not a masterpiece and I didn’t tested it extensively also didn’t fine tuned it but it works as I expected. It recognizes some words perfectly, other recognize partially, other don’t recognize.
I stopped test at this point as I don’t have any business use or plans for this but probably I’m one of the couple of companies / people right now who have my native language speech to text machine learning model.
I was doing transfer learning for the first time, also first time training from audio and waiting for results for such long time. I can say I’m now convinced that ML is something big.
To sum up, probably with right amount of money and time - about 1-3 months you can make decent speech to text software at home that will work good with your accent and native language.
So when I am pissed at everything in life, I take out my frustration at my laptop.
I give it to clone 2 400MBs repository.
I give it to load 2 400 MBs dataset to load and train a resNet and say "yeah Multiprocess bitch"1
Today I came across a very strange thing or a coincidence(maybe).
I was working on my predictive analytics project and I had registered on Kaggle(repository for datasets) long back and was searching on how to scrape websites, as I couldn't find any relevant dataset. So, while I was searching for ways to scrape a website, suddenly after visiting a few websites, I get notifications of a new email. And it was from Kaggle with the subject line
"How to Scrape a Tidy Dataset for Analysis"
Now I don't how to feel about it. Mixed feelings! It is either a wild coincidence, or Kaggle is tracking all the pages visited by the user. The latter makes more sense. By the way, Kaggle wasn't open in any of the tabs on my browser.1
This is why code reviews are important.
Instead of loading a relevant dataset from the database once, the developer was querying the database for every field, every time the method interacted with it.
What should have been one call for 200k records ended up as 50+ calls for 200k records for every one of 300+ users.
The whole production application server was locked.2
"""Itty bitty frustration"""
# wannabe mode on
iris_dataset = sklearn.datasets.load_iris()
I've noticed loading half of a small dataset twice is faster than loading the full small dataset once from files around half a terabyte.
Just because I didn't know the direction to work on doesn't mean I didn't do shit
Also, aren't you the professor so you please tell me what to do
And no you don't need to focus on the sample dataset I'm working on. Yes its name is "Breast Cancer" SO WHAT!!!2
What is this format? Spaces are separated by + sign and columns are separated by & sign and it’s value is followed by = sign. Can I directly convert it to Spark dataframe?8
Today i received a hard drive contraining one million malicious non-PE files for a ML baed project.
It's going to be a fun week.14
Have been working all day long on a dataset.. Worked like charm in the end.. Went for dinner, now it's outputs have changed all together and I haven't touch the code, I think Skynet is here.
Reading over a note I left myself "Numbers returned will be slightly off due to new implementation and the Fuck that we use a different dataset"
Oh boy. This was there for 2 weeks, Im so lucky no one saw this.
Looks like unzipping on disk drive where you keep your dataset can crash machine learning training session.
Thanks nvidia for your great drivers.
I like your solutions.
Just spent a week creating a distributed api architecture which I found out won't work due to a singular issue which can't be solved - not unless I hack stuff to a degree where I might as well write my own frameworks.
I've been aiming the user application's requests towards my wsgi, which based on a custom header will proxy it towards the correct api. Each customer base has their own api and dataset, but they all visit the same address.
I've handled CORS manually, just picking up when there's an options request, asserting the origin, then returning the correct headers. Cool everyone's happy. Turns out, socket.io includes session id and handshake info as part of their options preflight, which I can't pair with my api header (or cookie, for that matter) which means my wsgi doesn't know where to send it. You get a 400! You get a 400! You get a 401! </oprah>
So my option is to either roll my own sockets engine or just assign each api to a subdomain or give it some url prefix or something. Subdomains are probably pretty clean and tidy, but that doesn't change having to rewrite a bunch of stuff and the hours I spent staring at empty headers in options preflights.
At least this discussion saved me some time in trying to make it work. One of my bad habits is getting in those grooves of "but surely... what the hell, surely there's a way. There has to be"
Deep learning. Working on an image classification problem for a big company. The "boss" ask me to teach an AI to classify images into a few classes.
"Mmm, ok...I just need to create the dataset and then build the AI...so.."
Where is the problem??
The problem is that the classes are so perfectly similar that no one knows how to help me create the dataset and I have to do it alone.
That's how you spend your weeks in a loop where you look at thousands of images over and over just to have something decent start your work.
After that I felt like...
"I'm the hero they deserves, but not the one they need right now" - Cit2
How hard can it be to let sql just multiply some values and sum the results, right? As it turns out, damn hard!
I hear you thinking, surely you can just do select SUM(price*amount) AS total right? Nope! I mean, yes you can, but it fucks up. Oddly. It always ends up giving me wrong results. Always. Wtf sql? And it's not like I'm running a massive dataset or anything, it's like 100 records at most?28
I just ended up making a program that shows news depending on the user's current emotions.
I implemented OpenCV and CK+ dataset to make this possible. Also, Feedparser for getting news. Now looking into PyQt for making all of this into a GUI app.
So happy that all of this was completed within a week and without much hitches.
Anyone ever tried using user-to-user collaborative filtering to classify the mnist digits dataset?
This is about as far as I got:
It's literal copy-and-paste frankencode because this is only the second time I've ever done something like this, so pardon the hatchjob.1
I'm looking for a image segmentation and classification web based tool to create ground truth for my dataset in next Deep Learning project, what tool do You use?1
Any idea to get around the cluster storage limitation?
I have to train a model on a large dataset, but the limit I have is about 39GB. There is space on my local disk but I don't know if I can store the data on my computer and have the model train on the cluster resources.1
So I wrote a while ago .ndjson shapes dataset player.
https://quickdraw.withgoogle.com/ this site contain peoples shitdrawings of particular object.
Grab dataset from here in .ndjson format logging into google account
go to here
browse the file and press play when it's enabled.
Attached picture is a frog.2
Power BI: wonderful tool, pretty graphics, and can do a lot of powerful stuff.
But it’s also quite frustrating when you want to do advanced things, as it’s such a closed platform.
* No way to run powerquery scripts in a command line
* Unit testing is a major pain, and doesn’t really test all the data munging capabilities
* The various layers (offline/online, visualisation, DAX, Powerquery, Dataset, Dataflow) are a bit too seamless: locating where an issue is happening when debugging can be pain, especially as filtering works differently in Query Editing mode than Query Visualisation mode.
And my number 1 pet peeve:
* No version control
It’s seriously disconcerting to go back to a no version control system, especially as you need to modify “live code” sometimes in order to debug a visual.
At best, I’ve been looking into extracting the code from the file, and then checking that into git, but it’s still a one-way street that means a lot of copying and pasting back into the program in order to roll back, and makes forking quite difficult.
It’s rewarding to work with the system, but these frustrations can really get to me sometimes2
When I browsed for a Food Recipes (Especially Indian Food) Dataset, I could not find one (that I could use) online. So, I decided to create one.
The dataset can be found here: https://lnkd.in/djdh9nX
It contains following fields (self-explanatory) - ['RecipeName', 'TranslatedRecipeName', 'Ingredients', 'TranslatedIngredients', 'Prep', 'Cook', 'Total', 'Servings', 'Cuisine', 'Course', 'Diet', 'Instructions', 'TranslatedInstructions']. The datset contains a csv and a xls file. Sometimes, the content in Hindi is not visible in the csv format.
You might be wondering what the columns with the prefix 'Translated' are. So, a lot of entries in the dataset were in Hindi language. To take care of such entries and translating them to English for consistency, I went ahead and used 'googletrans'. It is a python library that implements Google Translate API underneath.
The code for the crawler, cleaning and transformation is on Github (Repo:https://lnkd.in/dYp3sBc) (@kanishk307).
The dataset has been created using Archana's Kitchen Website (https://lnkd.in/d_bCPWV). It is a great website and hosts a ton of useful content. You should definitely consider viewing it if you are interested.
#python #dataAnalytics #Crawler #Scraper #dataCleaning #dataTransformation
Is a picture worth a thousand words?
Super fun data driven analysis based on Google's Conceptual Captions Dataset.
#python #dataanalysis #exploratorydataanalysis #statistics #bigdata3
It feels good to be a gangsta. Convinced my coworkers to use Dapper.net (used by StackOverflow) as ORM of the choice instead of old internal ORM which heavily utilizes DataSet.
I have a project in need of machine learning. It takes an image and turns it into text. How do I begin acquiring the data needed to feed the machine? Should I just start taking pictures of this particular item on many different devices and get as many friends to do the same? How do I begin gathering my data is the question?4
That moment when you waste two hours of your work life trying to find a dataset in a sea of crap to answer your bosse's question...
Here's the dataset, model training, and output phases for a generative adversarial network I wrote that basically learned about...Me, and subsequently created a custom social media avatar.
I wrote the damn thing and it still couldn't figure it out. I'm too complex. My therapist was right.4
Grrrrrrr!!!!!!! How you frustrate me SQL SERVER REPORTING SERVICES! Designing a report changed query on dataset to include new field, fields started displaying all sorts of random stuff, booleans in text fields etc. Just spent 20 mins "checking" by rebuilding the first few bits of report and first dataset it's something weird with SSRS. Bye bye Sunday evening!!!
Hi guys, for school I need to use OPTICS (clustering algorithm) on at least two data sets for a project. For the first one, I'm thinking of doing some classic document clustering. But for the second one, I'd like to do something a bit funnier. I'm thinking of clustering faces. The goal would be to cluster faces that are similar, and the dataset would have many images of the same person to see if the clustering works. I think I must reduce the size of the image first.
Do you think it's feasible and how would you prepare the data, would you feed directly the matrix representation of the image?
Just Started learning unsupervised learning algorithms, and i write this: Unsupervised Learning is an AI procedure, where you don’t have to set the standard. Preferably, you have to allow the model to take a chance at its own to see data.
Unsupervised Learning calculations allow you to make increasingly complex planning projects contrasted with managed learning. Albeit, Unsupervised Learning can be progressively whimsical contrasted and other specific learning plans.
Unsupervised machine learning algorithm induces patterns from a dataset without relating to known or checked results. Not at all like supervised machine learning, Unsupervised Machine Learning approaches can’t be legitimately used to loss or an order issue since you have no proof of what the conditions for the yield data may be, making it difficult for you to prepare the estimate how you usually would. Unsupervised Learning can preferably be used to get the essential structure of the data.
Not enough disk space error..just when I am done writing code and unzipping the bigger dataset.
Hours later.. Now mounted 200Gigs to machine.
Feels like a boss.!