Search - "etl"
To all the data engineers in here: WTF is going on in your field?
I've worked closely with a dozen data engineers in the last 5 years (and talked to friends and internet strangers about this and gotten similar responses), and none of them seem to know how to use a computer!
They don't understand git, ORMs, best practices, how to use a terminal, DAGs (important for using modern ETL scheduling tools like Airflow and Prefect), etc.
Guys with 10 years of experience on their resume and they can't wrap a model into a Flask app with 1 endpoint. They'll reference local files on their machine in a Jupyter notebook and are shocked it won't work on other computers!
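For anyone unsure what a DAG means in the scheduling sense, here is a minimal sketch using only Python's standard-library `graphlib`. The task names are hypothetical, and this is not how Airflow or Prefect define their DAGs; it only shows the underlying idea that each task lists what it depends on and the scheduler derives a valid run order.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical ETL tasks: each key maps to the set of tasks it depends on.
etl_dag = {
    "extract_orders": set(),
    "extract_customers": set(),
    "join_tables": {"extract_orders", "extract_customers"},
    "load_warehouse": {"join_tables"},
}


def run_order(dag):
    """Return one valid execution order for the dependency graph."""
    return list(TopologicalSorter(dag).static_order())


def is_acyclic(dag):
    """A dependency graph is only a DAG if it contains no cycles."""
    try:
        run_order(dag)
    except CycleError:
        return False
    return True


order = run_order(etl_dag)
print(order)
```

Every task appears after everything it depends on, which is the guarantee a scheduler needs before it can run the steps.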
Time to switch to offline and hide in some dark corner to get work done. Tired of all the IMs and visits to my desk from 1 person for “critical” work. If they’re all critical then none of them are truly critical. If you sit on the data for 2 months, and then today is the day it becomes critical, and the compliance issue is because of your ineptitude, then it’s a you problem, not an IT problem. And if on top of that you submit your data to be loaded in the incorrect request form and spreadsheet format, you can go fuck yourself asking for this to be done in an hour. It could be done in 15 minutes if you had it in the correct format as specified in the 20 meetings over the past year, which removed all manual analysis and automated the entire process, you idiot. Now I have to get it into the correct format in that hour so I don’t have to do the analysis for you.
I have other things to do besides your ETL tickets, like finding the actual problems in our actual critical applications. You know, the ones where the VPs of this giant corporation start calling if they go down.
Sorry for the rambling guys.
Data wrangling is messy
I'm doing the vegetation maps for the game today, maybe rivers if it all goes smoothly.
I could probably do it by hand, but there's something like 60-70 ecoregions to chart, each with their own species, both fauna and flora. And each has an elevation range it's found at in real life, so I want to use the heightmap to dictate that. Who has time for that? It's a lot of manual work.
And the night prior I'm thinking "oh this will be easy."
(Also why does Devrant have to mangle my line breaks? -_-)
Laid out the requirements, how I could go about it, and the more I look the more involved it gets.
So what I think I'll do is automate it. I already automated some of the map extraction, so I don't see why I shouldn't just go the distance.
Also it means, later on, when I have access to better, higher resolution geographic data, updating it will be a smoother process. And even though I'm only interested in flora at the moment, there's no reason I can't reuse the same system to extract fauna information.
Of course, in game design there are some things you'll want to fudge. When the players are exploring outside the Rockies in a mountainous area, maybe I still want to spawn the occasional mountain lion as a mid-tier enemy, even though our survivor might be outside the cat's natural habitat. This could even be the prelude to a task you have to do: go take care of a dangerous creature outside its normal hunting range. And who knows why it is there? Wildfire? Hunted by something *more* dangerous? Poaching? Maybe a nuke plant exploded and drove all the wildlife from an adjoining region?
Having the extraction mostly automated goes a long way to updating those lists down the road.
But for now, flora.
For deciding plants and other features of the terrain what I can do is:
* rewrite pixeltile to take file names as input,
* along with a series of colors as a key (which are put into a SET to check each pixel against)
* input each region, one at a time, as the key, and the heightmap as the source image
* output only the region in the heightmap that corresponds to the ecoregion in the key.
* write a function to extract the palette from the outputted heightmap. (is this really needed?)
* arrange colors on the bottom or side of the image by hand, along with (in text) the elevation in feet for reference.
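The masking and palette steps above can be sketched roughly like this. It is a stdlib-only illustration, not the actual pixeltile code: the images are plain nested lists (biome map as RGB tuples, heightmap as grayscale ints), and the names `extract_region` and `extract_palette` are made up for the example. A real version would load the pixels from files with an image library.

```python
def extract_region(biome_map, heightmap, key_colors):
    """Keep heightmap pixels whose biome-map color is in the key; blank the rest.

    Blank pixels are None so that a real elevation value of 0 is not lost.
    """
    keys = set(key_colors)  # set membership keeps the per-pixel check cheap
    if len(biome_map) != len(heightmap):
        raise ValueError("biome map and heightmap must have the same height")
    region = []
    for biome_row, height_row in zip(biome_map, heightmap):
        if len(biome_row) != len(height_row):
            raise ValueError("biome map and heightmap must have the same width")
        region.append(
            [h if color in keys else None for color, h in zip(biome_row, height_row)]
        )
    return region


def extract_palette(region):
    """Return the sorted, distinct height values that survived the mask."""
    return sorted({h for row in region for h in row if h is not None})


# Tiny 3x2 example: two hypothetical ecoregion colors and made-up heights.
FOREST = (34, 139, 34)
DESERT = (210, 180, 140)
biome_map = [[FOREST, FOREST, DESERT], [FOREST, DESERT, DESERT]]
heightmap = [[10, 20, 30], [40, 50, 60]]

forest_heights = extract_region(biome_map, heightmap, [FOREST])
print(forest_heights)
print(extract_palette(forest_heights))
```

Running it once per ecoregion key gives one masked heightmap per region, and the palette answers the "is this really needed?" question cheaply enough to try.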
For automating this entire process I can go one step further:
* Do this entire process with the key colors I already snagged by hand, outputting region IDs as the file names.
* set up selenium
* selenium opens a link related to each elevation-map of a specific biome, and saves the text links (so I don't have to hand-open them)
* I'll save the species and text by hand (assuming elevation data isn't listed)
* once I have a list of species and other details, I save them to csv, or json, or another format
* then selenium opens this list, opens wikipedia for each, one at a time, and searches the text for elevation
* selenium saves out the species name (or an "unknown") for the species, and elevation, to a text file, along with the biome ID, and maybe the elevation code (from the heightmap) as a number or a color (probably a number, simplifies changing the heightmap later on)
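The "search the text for elevation" and "save to a text file" steps could look something like the sketch below. It leaves selenium out entirely and assumes the page text has already been fetched as a string. The regex, the function names, and the CSV columns are all assumptions for illustration, and real Wikipedia prose will need more patterns than this.

```python
import csv
import io
import re

# A number, optionally "to"/"-" and a second number, then a length unit.
ELEVATION_RE = re.compile(
    r"(\d[\d,]*\d|\d)(?:\s*(?:to|-|–)\s*(\d[\d,]*\d|\d))?\s*(metres|meters|feet|ft|m)\b",
    re.IGNORECASE,
)
FEET_PER_METRE = 3.28084


def find_elevation_range_ft(text):
    """Return (low_ft, high_ft) from the first sentence mentioning elevation, else None."""
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        if not re.search(r"elevation|altitude", sentence, re.IGNORECASE):
            continue  # skip numbers about plant height, lifespan, and so on
        match = ELEVATION_RE.search(sentence)
        if match:
            low, high, unit = match.groups()
            factor = FEET_PER_METRE if unit.lower().startswith("m") else 1.0
            low_ft = round(float(low.replace(",", "")) * factor)
            high_ft = round(float(high.replace(",", "")) * factor) if high else low_ft
            return low_ft, high_ft
    return None


def records_to_csv(records):
    """records: iterable of (biome_id, species_name, page_text) tuples."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["biome_id", "species", "low_ft", "high_ft"])
    for biome_id, species, text in records:
        found = find_elevation_range_ft(text)
        low, high = found if found else ("unknown", "unknown")
        writer.writerow([biome_id, species, low, high])
    return buffer.getvalue()


sample = "The tree grows to 30 m tall. It is found at elevations of 1,200 to 3,000 m."
print(records_to_csv([(7, "example tree", sample)]))
```

Only sentences that mention elevation or altitude are searched, so the "30 m tall" in the sample is ignored. Anything the pattern misses is written out as "unknown" and can be filled in by hand.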
Having done all this, I can start to assign species types to specific world tiles. The outputs for each region act as reference.
The only problem with the existing biome map (you can see it below, it's ugly) is that it has a lot of "inbetween" colors. There's a few things I can do here. I can treat those as a "mixing" between regions, dictating the chance of one biome's plants or the other's spawning. This seems a little complicated and dependent on a scraped together standard rather than actual data. So I'm thinking instead what I'll do is I'll implement biome transitions in code, which makes more sense, and decouples it from relying on the underlying data. It also prevents species and terrain from generating in, say, towns on the borders of a region, where certain plants or terrain features would be unnatural. Part of what makes an ecoregion unique is that geography has led to relative isolation and evolutionary development of each region (usually thanks to mountains, rivers, and large impassable expanses like deserts).
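One way biome transitions in code might work, as a rough sketch: the chance of borrowing a species from the neighboring biome depends on how far the tile is from the border. The linear falloff, the 50/50 split right at the border, and every name here are assumptions, not something taken from the map data.

```python
import random


def neighbor_weight(distance_to_border, transition_width):
    """Chance of spawning from the neighboring biome.

    0.5 right at the border, falling linearly to 0.0 once the tile is
    transition_width tiles (or more) inside its home biome.
    """
    distance = max(0, distance_to_border)
    if transition_width <= 0 or distance >= transition_width:
        return 0.0
    return 0.5 * (1.0 - distance / transition_width)


def pick_species(home_species, neighbor_species, distance_to_border,
                 transition_width, rng, is_town=False):
    """Pick a species for one tile, or None where nothing should spawn."""
    if is_town:
        return None  # keep wild flora and terrain features out of towns
    weight = neighbor_weight(distance_to_border, transition_width)
    pool = neighbor_species if rng.random() < weight else home_species
    return rng.choice(pool)


rng = random.Random(42)  # seeded so a world regenerates the same way
home = ["example pine", "example fir"]
neighbor = ["example sagebrush"]
print([pick_species(home, neighbor, 1, 4, rng) for _ in range(5)])
```

Because the blend lives in code, the "inbetween" colors on the source map can simply be snapped to the nearest region and ignored.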
Maybe I'll stuff it all into a giant bson file or maybe sqlite. Don't know yet.
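If it ends up in sqlite, a schema along these lines would probably do. This is only a sketch using Python's built-in `sqlite3` module. The table and column names are invented for the example, and the inserted rows are placeholders, not real species data.

```python
import sqlite3

SCHEMA = """
CREATE TABLE ecoregion (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE species (
    id           INTEGER PRIMARY KEY,
    ecoregion_id INTEGER NOT NULL REFERENCES ecoregion(id),
    name         TEXT NOT NULL,
    kind         TEXT NOT NULL CHECK (kind IN ('flora', 'fauna')),
    min_elev_ft  INTEGER,  -- NULL means unknown
    max_elev_ft  INTEGER   -- NULL means unknown
);
"""


def build_db(path=":memory:"):
    """Create the tables and return the open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def species_at(conn, ecoregion_id, elevation_ft):
    """Names of species in a region whose known range covers the elevation."""
    rows = conn.execute(
        "SELECT name FROM species "
        "WHERE ecoregion_id = ? "
        "AND (min_elev_ft IS NULL OR min_elev_ft <= ?) "
        "AND (max_elev_ft IS NULL OR max_elev_ft >= ?) "
        "ORDER BY name",
        (ecoregion_id, elevation_ft, elevation_ft),
    )
    return [row[0] for row in rows]


conn = build_db()
conn.execute("INSERT INTO ecoregion (id, name) VALUES (1, 'example region')")
conn.executemany(
    "INSERT INTO species (ecoregion_id, name, kind, min_elev_ft, max_elev_ft) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        (1, "example pine", "flora", 5000, 9000),
        (1, "example sagebrush", "flora", 2000, 6000),
        (1, "example lichen", "flora", None, None),
    ],
)
print(species_at(conn, 1, 5500))
```

Unknown elevations are stored as NULL and treated as "matches anywhere", so species without data still show up instead of silently vanishing.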
As an entry level programmer I may not know what I'm doing, and I may be supposed to be looking for a job, but that won't stop me from procrastinating.
Data wrangling is fun.
Working on a feature which heavily relies on a data pipeline. I noticed it is a couple of lambda functions calling each other (fuck you to the guy who made it). The best way to get sanity back is to build a proper ETL pipeline. Any suggestions for building a reliable ETL pipeline in Python?
Options already considered
1. Celery tasks - Worked well but no overview of the single task progress across celery tasks
2. Airflow - Gives good overview but the docs make less sense than a 10-year-old talking. Mostly because they introduced a new syntax and not everything has migrated fully yet. Also no support for reusing DAGs
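For a sense of what the two concerns above (reliability, plus an overview of each step's progress) look like at their smallest, here is a plain-Python sketch. It is not a replacement for Celery or Airflow, and `run_pipeline` and the step functions are hypothetical names. Steps run in order, each one is retried a few times, and a status dict records how far the run got.

```python
import time


def run_pipeline(steps, max_attempts=3, delay_seconds=0.0):
    """Run (name, func) steps in order, passing each result to the next step.

    Returns (result, status). If a step exhausts its attempts the run stops,
    result is None, and later steps are left as "pending" in the status.
    """
    status = {name: "pending" for name, _ in steps}
    data = None
    for name, func in steps:
        for attempt in range(1, max_attempts + 1):
            try:
                data = func(data)
                status[name] = f"ok (attempt {attempt})"
                break
            except Exception as exc:  # broad on purpose: any failure is retried
                status[name] = f"failed: {exc}"
                if attempt < max_attempts:
                    time.sleep(delay_seconds)
        else:
            return None, status  # every attempt failed
    return data, status


calls = {"transform": 0}


def extract(_):
    return [1, 2, 3]


def flaky_transform(rows):
    calls["transform"] += 1
    if calls["transform"] == 1:
        raise RuntimeError("temporary glitch")  # simulated one-off failure
    return [row * 10 for row in rows]


def load(rows):
    return sum(rows)


result, status = run_pipeline(
    [("extract", extract), ("transform", flaky_transform), ("load", load)]
)
print(result, status)
```

The status dict is the part the chained lambdas lack: after a failed run it shows exactly which step broke and which ones never started.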
It wasn't an entirely solo project but every part of it was completely solo. I felt very proud of the ETL, DICOM metadata search database and CI/CD pipelines that I built for an oil and gas company. They didn't understand the CI/CD parts so didn't take it anywhere after we'd finished.