Search - "datasets"
-
That moment when a friend was talking to you about an artificial intelligence he is building that is supposed to be a voice assistant and "even better" than Cortana. After a long time I asked him for the code, as I wanted to check out the revolutionary machine learning techniques he was talking about. So here is a short part of the 600-line "voice logic".
I almost started crying 😂😂
-
I saw this yesterday and thought it was kind of nice. Probably not everybody will understand it, as it's German. But the level of creativity is definitely gratifying.
-
I've dreamed of learning business intelligence and handling big data.
So I went to a university info event today for "MAS Data Science".
Everything sounded great. Finding insights in complex datasets, check! Great possibilities and salary.
Yay! 😀
Only after an hour did they explain that the main focus of this course is on leading a library, museum or archive. 😟 Huh, why? WTF?
Turns out, they relabelled their librarian education course to "Data Science" to get people's attention.
Hey, you cocksuckers! I want my 2 hours of wasted lifetime back!
Fuck this "English title whitewashing".
-
I recently got mailed by our support department.
A customer using my program experienced performance issues after updating the whole system; attached is a video recorded by the customer clearly showing the difference.
I watched the video: the old version took 20 seconds to load, the new version 26. After querying, the data is shown in a list.
The updated version showed 61 datasets instead of the 6 in the old version.
I asked the support guy if he had even watched the video, and he answered: sure, and I can clearly see the performance issue!
-
A couple of weeks ago, I got to the second stage of a recruitment process with a relatively big fintech in the crypto space (I know) - all went well, and although I did not think much of it at first, with all the information I had gathered I came to realize this might as well be the best opportunity I've had in my pursuit of finding a new job (i.e. looking for high technical challenges, being unsure of where I see myself in 5 years, wanting to give full-remote work a try, etc.).
Cut to the end of the interview:
"That's great! I really enjoyed speaking with you, your technical background seems excellent so we would like to move to the next stage which is a take-home test to do in your free time.", said the interviewer.
"Wow! Much amaze, well of course! What's it gonna be?", said the naive interviewee.
"I'm sending you the details via email, please send it back in 48 hours, buhbye now", she hangs up.
...
"48 hours?? Right, this should be easy then, probably some online leetcoding platform, as usual.", thought the naive interviewee, who evidently went through this sh*t numerous times already.
A day later I receive the email: this was the whole deal. The take-home test supreme with bacon and cheese. A full-blown project, with tests, a project structure, a docker image, testing and bullet points for bonus points! The assessment was poorly written with lots of typos and overall ambiguity, a few datasets were also provided but bloated with inconsistent comments and trailing whitespace.
What the actual fck??? Am I supposed to sleep deprive myself to death while also working my day job? What are you trying to assess? How much of my life I'm willing to sacrifice for your stupid useless coding challenge? You are not all Google, have some respect, jeez.
I did not get the job.
-
I was recently interviewing a fresh college graduate. According to his resume, he had practically invented machine learning. He eats Kaggle datasets for breakfast and was all set to invent a cure for cancer with deep learning. And then I ask him a simple question: what is the 's' in 'https'? And he says... "simple".
-
A little late but whatever.
About half a year ago, I started working on setting up self-hosted (slippy) maps. For one, because of privacy reasons; for two, because it'd be under my own control and I could, with enough knowledge, be entirely in control of how this works.
While the process has been going on for hours every day for about half a year (with regular exceptions), I'll briefly lay out what I've accomplished.
I started with the OpenMapTiles project and tried to implement it myself. This went well but there were two major pitfalls:
1. It was based on a Postgres database. This is fine, but when you want to have the entire world... the queries took insanely long (minutes, at lower zoom levels) and quite intimate Postgres/tooling knowledge was required, which I don't have.
2. Due to the long queries and such, the performance was so bad that the maps could take minutes to render, and when you want that in production... yeah, no.
After quite some time I finally let that idea sail and started looking into the MBTiles solution: generating SQLite databases of GeoJSON features. Very fast data serving, but the rendering can take quite some time.
After some more months, I finally got the hang of it to the point that I automated 50-70 percent of the entire process. The one problem? It takes a shitload of resources and time to generate a worldwide MBTiles database.
After endless trial and error, I figured out that one can divide a 'render' (MBTiles, aka an SQLite database) into multiple layers (one for building data, one for water, one for roads and so on), so I started doing renders that way.
Result? Styling became way easier and more logical, and one could pick specific data to display. Only want to display the roads? It's way simpler this way. (Not impossible otherwise, but figuring out how that works... good luck.)
Started rendering all the countries, continents and such this way, and while this seemed like a great idea, the entire world is at 3-4 percent after about a month. And while 40-70 percent generates 10 times as fast, that's still way too slow.
Then, I figured out that you can fetch data per individual layer/source. Thus, I could render every layer separately which is way faster.
Tried that with a few very tiny datasets and bam, it works. (And still very fast).
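For anyone curious what serving those per-layer renders looks like, here's a minimal sketch (file names and tile coordinates are made up; the tiles table layout and the TMS row flip come from the MBTiles spec):

```python
# Minimal sketch of serving tiles from one MBTiles (SQLite) file per layer.
# File names and coordinates below are made up; the tiles table layout and the
# TMS row order (hence the y flip) come from the MBTiles spec.
import sqlite3
from typing import Optional

LAYER_FILES = {
    "roads": "europe_roads.mbtiles",
    "water": "europe_water.mbtiles",
    "buildings": "europe_buildings.mbtiles",
}

def read_tile(layer: str, z: int, x: int, y: int) -> Optional[bytes]:
    """Fetch one tile blob (usually a gzipped vector-tile PBF) for a single layer."""
    tms_y = (2 ** z - 1) - y  # MBTiles stores rows in TMS order, XYZ URLs need a flip
    with sqlite3.connect(LAYER_FILES[layer]) as db:
        row = db.execute(
            "SELECT tile_data FROM tiles "
            "WHERE zoom_level = ? AND tile_column = ? AND tile_row = ?",
            (z, x, tms_y),
        ).fetchone()
    return row[0] if row else None

if __name__ == "__main__":
    blob = read_tile("roads", 14, 8514, 5504)  # arbitrary example tile
    print(len(blob) if blob else "tile not found")
```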
So, now, I'm generating all layers per continent. I want to do it world based but figured out that that's just not manageable with my resources/budget.
Besides that, I'm working on an API which will have exactly the features I want/need!
-
While scraping web sources to build datasets, has legality ever been a concern?
Is it standard practice to check whether a site prohibits scraping?
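For the second question: checking robots.txt before fetching is the usual bare minimum (it only reflects the site's stated crawling policy, not its terms of service or the law). A minimal sketch with Python's standard library:

```python
# Sketch: ask a site's robots.txt whether a URL may be fetched before scraping it.
# This only reflects the site's stated crawling policy, not its ToS or local law.
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def allowed_to_fetch(url: str, user_agent: str = "my-dataset-bot") -> bool:
    parts = urlparse(url)
    rp = RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()  # downloads and parses robots.txt
    return rp.can_fetch(user_agent, url)

if __name__ == "__main__":
    print(allowed_to_fetch("https://example.com/some/page"))
```
-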
Data Disinformation: the Next Big Problem
Automatic code generation LLMs like ChatGPT are capable of producing SQL snippets. Regardless of quality, those snippets can retrieve data (from prepared datasets) based on user prompts.
That data may, however, be garbage. This will lead to garbage decisions by lowly literate stakeholders.
Like with network neutrality and PII/PSI ownership, we must act now to avoid yet another calamity.
Imagine a scenario where a middle-manager level illiterate barks some prompts to the corporate AI and it writes and runs an SQL query in company databases.
The AI outputs some interactive charts that show that the average worker spends 92.4 minutes on lunch daily.
The middle manager gets furious and enacts an Orwellian policy of facial recognition punch clock in the office.
Two months and millions of dollars in contractors later, and the middle manager checks the same prompt again... and the average lunch time is now 107.2 minutes!
Finally the middle manager gets a literate person to check the data... and the piece of shit SQL behind the number is sourcing from the "off-site scheduled meetings" database.
Why? Because the dataset that does have the data for lunch breaks is labeled "labour board compliance 3", and the LLM thought that the metadata for the wrong dataset better matched the user's prompt.
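To make that metadata-matching failure concrete, here's a toy sketch (all names, descriptions and the scoring are made up) of how the nicely described wrong dataset beats the poorly labeled right one:

```python
# Toy illustration (all data made up) of the failure mode above: the dataset is
# chosen by naive keyword overlap between the prompt and each dataset's metadata,
# so the well-described wrong table beats the poorly labeled right one.
PROMPT = "average time workers spend on lunch break daily"

DATASETS = {
    "offsite_scheduled_meetings": "daily time workers spend on scheduled off-site meetings and breaks",
    "labour_board_compliance_3": "compliance export 3",  # the actual lunch-break data lives here
}

def overlap(prompt: str, description: str) -> int:
    return len(set(prompt.lower().split()) & set(description.lower().split()))

best = max(DATASETS, key=lambda name: overlap(PROMPT, DATASETS[name]))
print(best)  # -> offsite_scheduled_meetings: wrong source, confidently picked
```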
Thus, given the very real-world scenario of mislabeled data, LLMs' inability to understand what they are saying or accessing, and the average manager's complete data illiteracy, we might have to wrangle some actions to prepare for this type of tomfoolery.
I don't think that access restriction will save our souls here, decision-flumberers usually have the authority to overrule RACI/ACL restrictions anyway.
Making "data analysis" an AI-GMO-Free zone is laughable, that is simply not how the tech market works. Auto tools are coming to make our jobs harder and less productive, tech people!
I thought about detecting new automation-enhanced data access and visualization, and enacting awareness policies. But it would be of poor help, after a shithead middle manager gets hooked on a surreal indicator value it is nigh impossible to yank them out of it.
Gotta get this snowball rolling, we must have some idea of future AI housetraining best practices if we are to avoid a complete social-media style meltdown of data-driven processes.
Anyone care to pitch in?
-
So I just started going to university and have a subject called "programming", where we are taught Java, Haskell and Prolog. Every week there is a sheet with homework - programming tasks. Often we get something like a boilerplate, so we implement some methods and stuff like that. Those tasks are prepared and created by scientific assistants, who upload the boilerplate and sheets. Take a look at the programming style they follow in Java. Actually, I can't find a pattern they follow, except for the spacing between the lines. We are 1000 students in the informatics course, of which probably 10% know how to properly program 😅
So roughly 900 people see and adopt/learn these really bad coding conventions. It really pisses me off that they basically don't give a shit about conventions or about teaching them. I have to say that the logic is sometimes as bad as the conventions 😓
Besides, I'm not being cocky about conventions, but I think a high-class university should teach proper conventions.
-
I have a new UNTRAINED bot on my site. It's based on OpenAI now. And that's why it's blazing fast and blazingly useless.
I can tell you why bots are so boring and will surely cause the dead internet theory. My datasets, for example, never contain really disturbing stuff ACCORDING TO NORMAL PEOPLE. Yet EVERY TIME:
"The job failed due to an invalid training file. This training file was blocked by our moderation system because it contains too many examples that violate OpenAI's usage policies, or because it attempts to create model outputs that violate OpenAI's usage policies."
Now I'm really done. I'm gonna email them about their unusable training system.
In theory, I could test the messages one by one first to see if they are bad. I don't want to do, or pay for, that. There should be an option to skip the data it considers disturbing instead of cancelling a whole dataset over 0.1%. You also don't want to know how long it takes BEFORE it is finished validating your set. I think someone is doing it manually and clicking an 'Uh uh..' button..
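For reference, that "test the messages one by one first" idea would look roughly like this - a sketch only, assuming the openai-python v1 client and a chat-format JSONL training file; treat the exact response fields as an assumption and check them against the current docs:

```python
# Sketch of the "test the messages one by one first" idea: run each fine-tuning
# example through the moderation endpoint and keep only the unflagged ones.
# Assumes the openai-python v1 client and a chat-format JSONL file; the exact
# response fields are an assumption - verify against the current API docs.
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def keep_example(example: dict) -> bool:
    text = " ".join(m["content"] for m in example.get("messages", []))
    result = client.moderations.create(input=text)
    return not result.results[0].flagged

with open("training.jsonl") as src, open("training.clean.jsonl", "w") as dst:
    for line in src:
        if line.strip() and keep_example(json.loads(line)):
            dst.write(line)
```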
Also, for the people who think they have GPT-4o by having the API: you're being lied to. The 'own GPT' option on the paid OpenAI plan is way more advanced than the ones you make locally.
They don't give us the real good stuff!
Oh, btw! The input data for my training is based on FORMER conversations with the bot. I automated a script to repeat a conversation I had, selected those messages and clicked 'train'. So it even complained about its OWN data! That data already contained stuff like 'I can't help you with that' IN my training data. So, you 'corrected' and corrupted my data, and now it's still not good enough for round 2?
I would really love to go back to local LLMs, but I can't imagine ever having a machine that generates as fast as the real GPT does. I also prefer to do it myself, but it's David vs. Goliath, even with a 5k computer. I'm sure.
Low quality rant, I know. I'm typing while still frustrated. For the people who think censorship is often needed, this is the result! According to someone else, YOU are the one who has to be censored. Don't forget that.
-
Yes, I can surely calculate the distribution of datasets when they look like
9 x x x x x x x x x x x x x x x 1
That will be SO accurate.
-
Proud of finally having pushed some projects to GitHub 💪
Feels really pleasant to finally have something to show off!
Please feel free to check it out: https://github.com/DataSecs
-
At our company we use one of the largest Python ORMs and we don't code on it ourselves, even though I can code. It's a special contract our General Manager made before we as devs were on the project, and everything is provided by the external company as a service. The servers are in our own datacenter, but we don't have access.
We have our consultants (project managers) as paid hires, and they have their own devs.
I'm in the lead for code reviews and interfaces. I'm also on the "Run" team, which observes, debugs and keeps the system alive as third level (application managers).
What I'm trying to achieve is moving away from legacy .csv/SFTP connections to REST APIs and, for large datasets, GraphQL. Before I was on the project, they built really crappy interfaces.
Before I joined the project at my company, I was a dev for a couple of finance applications and web services, where I also coded on business-critical applications with high-demand scaling.
So I was moved over to the project by my boss, because it wasn't doing so well and they needed our own devs on it.
A lot of issues/mistakes I identified in the software:
- Lots of Code Bugs
- Missing Process Logic
- No Lifecycle
- Very fast growing Database
- A lot of Bad Practices
Since my switch I've fixed a lot of bugs and was the man of the hour for fixing major incidents, and so on and so forth. A lot of improvements have been made. The team spirit of the 15+ people inside the project also got better, because they could consult me for solutions/problems.
But damn, I hate our consultants. We pay them, yet I need to sketch out the concepts because they are too dumb for it. They don't understand REST or APIs in general; I need to teach them a lot about best practices and how to code an API. Then they question everything and bring a crooked, flawed prototype back to me.
WE F* PAY THEM FOR BULLCRAP! THEY DON'T EVEN WRITE DOCUMENTATION, THEY ARE SO LAZY!
I even had a meeting with the main consultant about performance problems and how we should approach them from a technical side and a process side. The software is core-business relevant and has been running for over 3 years. He just argued around the problem and didn't provide solutions.
I've confronted our General Manager with this a couple of times, but it's been going on and on for 3 years.
I'm happy with my team and my boss, they have my back and I love my job, but dealing with these nutjob consultants is draining my nerves/energy.
I'm really at my wits' end about how to deal with this. I've been pulling through for a year now. I want to stay at my company because everything else besides the nutjob consultants is great.
I told my boss about it a couple of times and she agrees with me, but the General Manager won't let go of these consultants.
Even when they fuck up hard and crash production, they fucking bill us... even though it's their fault :(
-
The Zen Of Ripping Off Airtable:
(patterned after The Zen of Python, for all those shamelessly copying Airtable's basic functionality)
* Columns can be *reordered* for visual priority and ease of use.
* Rows are purely presentational, and mostly for grouping and formatting.
* Data cells are objects in their own right, so they can control their own rendering, and formatting.
* Columns (as objects) are where linkages and other column specific data are stored.
* Rows (as objects) are where row specific data (full-row formatting) are stored.
* Rows are views or references *into* columns which hold references to the actual data cells
* Tables are meant for managing and structuring *small* amounts of data (less than 10k rows) per table.
* Just as you might do "=A1:A5" to reference a cell range in Google Sheets or Excel, you might do "opt(table1:columnN)" in a column header to create a 'type' for the cells in that column (a tiny parser sketch follows this list).
* An enumeration is a table with a single column, useful for doing the equivalent of Airtable's options and tags. You will never be able to decide if it should be stored on a specific column, on a specific table for ease of reuse, or separately where it and its brothers will visually clutter your list of tables. Take a shot if you are here.
* Typing or linking a column should be accomplishable first through a command-driven type language, held in column headers and cells as text.
* Take a shot if you somehow ended up creating any of the following: an FSM, a custom regex parser, a new programming language.
* A good structuring system gives us options or tags (multiple select), selections (single select), and many other datatypes, and should first be programmatically available through a simple command-driven language, like how commands are done in data cells in Excel or Google Sheets.
* Columns are a means to organize data cells, and set constraints and formatting on an entire range.
* Row height can be overridden by the settings of a cell. If a cell overrides the row and column render/graphics settings, then it must be drawn last--drawing over the default grid.
* The header of a column is itself a datacell.
* Columns have no order among themselves. Order is purely presentational, and stored on the table itself.
* The last statement is because this allows us to pluck individual columns out of tables for specialized views.
* *Very* fast scrolling on large datasets, with row and cell height variability, is complicated. Thinking about it makes me want to drink. You should drink too before you embark on implementing it.
* Wherever possible, don't use a database.
* If you're thinking about using a database, see the previous koan.
* If you use a database, expect to pick and choose among column-oriented stores, and json, while factoring for platform support, api support, whether you want your front-end users to be forced to install and setup a full database,
and if not, what file-based .so or .dll database engine is out there that also supports video, audio, images, and custom types.
* For each time you ignore one of these nuggets of wisdom, take a shot, question your sanity, quit halfway, and then write another koan about what you learned.
* If you do not have liquor on hand, for each time you would take a shot, spank yourself on the ass. For those who think this is a reward, for each time you would spank yourself on the ass, instead *don't* spank yourself on the ass.
* Take a sip if you *definitely* wildly misused terms from OOP, MVP, and spreadsheets.
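And since the koans mention it, here's what the smallest possible version of that header "type language" parser might look like - a sketch, with a syntax invented purely for illustration:

```python
# A tiny sketch of the column-header "type language" mentioned above, e.g.
# "opt(table1:columnN)". The syntax here is invented purely for illustration.
import re

HEADER_CMD = re.compile(r"^(?P<cmd>\w+)\((?P<table>\w+):(?P<column>\w+)\)$")

def parse_header(header: str):
    """Return (command, table, column) for a command header, or None for a plain label."""
    m = HEADER_CMD.match(header.strip())
    return (m.group("cmd"), m.group("table"), m.group("column")) if m else None

print(parse_header("opt(table1:columnN)"))  # ('opt', 'table1', 'columnN')
print(parse_header("Customer Name"))        # None -> just a label
```
-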
I have been discussing for quite a while now, in a WhatsApp group, which IDE is better: IntelliJ or Eclipse. I personally prefer IntelliJ, for a lot of reasons. But what is your opinion, and what are the pros/cons of each IDE for you?
-
Did MIT seriously just take down one of their own huge tiny image datasets for machine learning because of the same problem as GitHub's master branch?
-
One of my minions (erm, I mean, "a valued junior member of my team") asked to be assigned to tasks more "data science related".
Regardless of the very last-decade-sounding request, I tried to explain to the Jr that there is more to "data science" than distilling custom LLMs and downloading PyTorch models. There are several entire fields of study. And those are all sciences. In this context, science equals math.
But they said they were not scared of math.
I've seen them using their phones to calculate freaking tips. If you can't do 15% of a lunch bill in your head, hypothesis tests might be a bit more than challenging.
But, ok then. Here we go.
So I had them do some semi-supervised clustering, on a database as raw as dirt, but with barely 5 GB, few dimensions, and a subject with easily available experts.
Even better, we had hundreds of manually classified training and test cases.
The Jr came back a month later with some convoluted mess of convoluted networks; just the serialized weights of the poor thing were about as large as the database itself.
And when I tried it on some other manually classified test datasets... Freaking 41% error rate, for something that should be a slam dunk. Little better than a coin toss.
One month of their time wasted on an overfitted unusable mess.
I had to re-assign the task to someone else, more experienced, last Friday. By Monday they had come up with an iterative KNN approach giving error rates for several values of K... some of them with less than 15% error on the test dataset.
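For the record, the baseline that beat a month of deep learning is about this much code - a sketch assuming scikit-learn, pandas, and placeholder file/column names:

```python
# Sketch of the iterative KNN baseline: several values of K against a manually
# classified train/test split, reporting the test error rate for each.
# File and column names are placeholders; assumes pandas and scikit-learn.
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

train = pd.read_csv("train_labeled.csv")  # hypothetical manually classified sets
test = pd.read_csv("test_labeled.csv")
X_train, y_train = train.drop(columns=["label"]), train["label"]
X_test, y_test = test.drop(columns=["label"]), test["label"]

for k in (1, 3, 5, 11, 21, 51):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    error = 1.0 - model.score(X_test, y_test)  # score() is mean accuracy
    print(f"k={k:>3}  test error = {error:.1%}")
```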
WTF are schools teaching and calling "data science" nowadays?!?!?
I reeeeally need to watch those juniors more closely. Maybe ask for mid-sprint demonstrations. But those are soooo boring and waste so much time from people who know what they are doing...
Does anyone have a better idea to prevent this type of off-track deviation? Without being a total bore, that is.
And... should I start asking people "gotcha" data analysis questions before giving them free rein on this type of task? Or is it an asshole boss move? I would hate someone giving me a pop quiz before letting me work... but I've got no other ideas.
-
There are days I like to pull my hair out and create a dynamic 4D map that holds a list of records. 🤯
Yes, there's a valid reason to build this map. Generally I'm against this kind of depth (2 or 3 is usually where I draw the line), but I need something searchable against multiple indexes that doesn't entail querying the database over and over again, as it will be used against large dynamic datasets, and the only thing I could come up with was a tree to filter down on as required.
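Roughly what I mean, as a sketch (the four key levels are made up for illustration): build the nested map once from a single query, then filter in memory:

```python
# Sketch of the four-level map: the key levels here (region -> customer ->
# product -> month) are made up for illustration. Build it once from a single
# query, then filter in memory instead of hitting the database repeatedly.
from collections import defaultdict

index = defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: defaultdict(list))))

def add_record(rec: dict) -> None:
    index[rec["region"]][rec["customer"]][rec["product"]][rec["month"]].append(rec)

def lookup(region, customer=None, product=None, month=None):
    """Walk down only as far as keys are given; returns a sub-tree or a record list."""
    node = index[region]
    for key in (customer, product, month):
        if key is None:
            break
        node = node[key]
    return node

add_record({"region": "EU", "customer": "acme", "product": "widget", "month": "2024-01", "qty": 3})
print(lookup("EU", "acme"))  # every product/month for one customer
```
-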
Playing Pokémon. I'm extremely into the series and have played nearly every gen since the first one. I even bought the cheap 2DS just to play the 3DS versions, even though I actually had no time during exams, and sometimes I even played the different editions of the same gen just because I can. 😅
And I have the habit of thinking about "how is it done?" with everything that is displayed on a screen or just blinks. And with 800 Pokémon and their stats and subforms and IVs and all that nerdshit (back then compressed into such a small ROM and running on pretty low-end devices), I began to think about data structures, their organization and such, especially when there are many big, wide datasets. 😪
-
Today I came across a very strange thing, or a coincidence (maybe).
I was working on my predictive analytics project. I had registered on Kaggle (a repository for datasets) long back and was searching for how to scrape websites, as I couldn't find any relevant dataset. So, while I was searching for ways to scrape a website, suddenly, after visiting a few websites, I got a notification of a new email. And it was from Kaggle, with the subject line
"How to Scrape a Tidy Dataset for Analysis"
Now I don't know how to feel about it. Mixed feelings! It is either a wild coincidence, or Kaggle is tracking all the pages visited by the user. The latter makes more sense. By the way, Kaggle wasn't open in any of the tabs in my browser.
-
First rant here...
A handful of devs have to create a huge web platform that can shovel a lot of data around, in about two months, which is impossible...
The project lead has left major decisions, like which database we want to use, in the hands of interns, because no question can be answered by that person. An inexperienced intern has chosen a fucking NoSQL database for highly relational datasets... why? Because new tech...
Development began and a bunch of problems arose... the database was accessible from the internet from day one. Random crashes because of out-of-memory exceptions. Every possible feature had a description of at most 10 words... and no standards were enforced on anything.
Now that we're finaaaally switching to SQL after almost a year of prototype-grade production, everybody keeps coding new features, so I have to port all the crap to the new database...
Best part: a bunch of clients on different operating systems have to be ported as well!
Even better part: I have to do that because everybody else has practically no experience in any field...
And now the joke: I got hired for GUI/desktop application development.
Am I a wizard now?
-
We had made an API with endpoints for each domain model, so /user, /company, the usual. Beyond being RESTful, they all had basic filtering and pagination.
We also had an endpoint to return an entity from any set based on a GUID, for when you needed to attach a related entity to notifications, logging and such.
We received a bug report about how you couldn't use filtering or pagination on this endpoint, and after weeks of asking what they needed it for, we just had to implement it.
You can imagine how non-trivial it is to "just" filter across different datasets, but we eventually got it working, so now you can get a user via /user/123 or /entity?type=user&id=123. They only use it for one type and ID at a time.
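Conceptually, the fix boiled down to something like this sketch (the repository layer and all names are made up): resolve the type, then delegate to the same filtering/pagination the typed endpoints already use:

```python
# Sketch only (repository layer and names are made up): the generic /entity
# endpoint resolves the type to its repository and reuses the same filtering
# and pagination code path the typed endpoints already have.
from dataclasses import dataclass, field

@dataclass
class Page:
    items: list = field(default_factory=list)
    total: int = 0

REPOSITORIES: dict = {}  # e.g. {"user": UserRepository(), "company": CompanyRepository()}

def get_entities(entity_type: str, filters: dict, page: int = 1, size: int = 50) -> Page:
    repo = REPOSITORIES.get(entity_type)
    if repo is None:
        raise ValueError(f"unknown entity type: {entity_type}")
    # same query path /user and /company already use
    items, total = repo.query(filters=filters, offset=(page - 1) * size, limit=size)
    return Page(items=items, total=total)
```
-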
There are sites that list different datasets, which makes them datasets of datasets.
As there are multiple of these sites, and they are listed on Google, that makes Google a dataset of dataset datasets.
-
For me that would be Proxmox. I know, people like it - but for no apparent reason it decided to nuke half my ZFS datasets in a pool, with no logic behind it whatsoever. All disks were tested, all came out good. Within the same pool there were datasets that were lost and some that remained.
I really don't get it. Looking at Proxmox' source code, it's more or less the command line tools and then there's the web interface (e.g. https://github.com/proxmox/...). Oh and they have the audacity to use their own file extension. Why not I guess?
Anyway, half my data was gone. I couldn't tell how or why or what the fuck even happened there. But Proxmox runs Debian underneath and I've been rather pissed about Proxmox' idea of "don't touch the host system aaa" for a while at that point. So I figured, fuck it I'll just take pure Debian then and write my own slightly better garbage on top of that. And as such the distribution project was born. I've been working on it for a little over a year now. And I've never had such issues again.
I somewhat get the idea of "don't touch the host" now, but still not quite. Yes, the more you do in the containers, the better. And the less you do on the host in terms of reconfiguration, the longer it will stay alive. That goes for any system - more reconfiguration usually means less stability and makes it harder to replace. But sometimes you just have to work from the host. Like, say, migrating a container between hosts, which my code can do. You can't do that from a container, at all. There are good reasons to work with the host. Proxmox doesn't tell you that. Do they expect their users to be idiots? Only enterprise sysadmins, amirite?
So yeah, that project - while I do take inspiration from it in mine - I don't like it. It's enterprise, it has the ZFS and the Ceph and the LXC and the VMs - woohoo! Not like anyone could implement that on a base Debian system. But they have the configuration database (pmxcfs), the distributed configuration database a couple of MB large and capped there, woah!
Ok, sure, it isn't Microsoft or IBM or Oracle or whatever, and those are definitely worse. But those are usually vendor lock-ins.. I avoid those on that premise alone :)
-
Now I have 6 projects in JIRA and lots of data collected from all of them (for example, visitors on the website, how many of them donated money to our project, etc.).
I am looking for some software/app where I could import the data from these reports, so I can compare datasets or see some overview.
What should I use?
-
This is not a rant, rather just a question or an ask for advice, as I have seen a lot of people talk about web development around here. I am planning to create a website for my search engine. I created a REST API on my VPS so I can make HTTP requests and retrieve some links for certain keywords. But I need some good ideas for how to do this from a website, as I am not sure what the best way to make HTTP requests would be. As far as I know it's possible with JS and PHP, but I am not sure which is better, more secure or more convenient. So here I am to ask you guys, especially those who have experience with this, what I should consider doing.
Oh, and please forgive my limited knowledge of JS and PHP 😅😊
-
For you graphics devs, Disney has released Clouds and the island from Moana.
https://disneyanimation.com/technol...
-
Should I participate in a hackathon that interests me, even though I have no idea what I could make out of the datasets they'll give us? (The hackathon revolves around some datasets.)