"So Alecx, how did you solve the issues with the data provided to you by HR for <X> application?"

Said the VP of my institution in charge of my department.

"It was complex, sir. I could not figure out much of the general structure of the data schema, since it came from a bunch of people not trained in I.T. (HR), so I had to run some experiments on the data to find the relationships within it. That surfaced 4 different relations; the program determined them for me based on the most common type of data, and the model deemed it a "user". From that I just extracted the information I needed and generated the tables through Golang's gorm."

VP, nodding and listening intently: "How did you make those relationships?" Me: "I started a simple pattern recognition module through supervised mach..." VP: "Machine learning? That sounds like A.I."

Me: "Yes sir, it was, but the problem was fairly easy for the schema to determ..." VP: "A.I., at our institution! Back in my day it was a dream to have such technology. You are the director of web tech, what is it to you to know of this?"

Me: "I just like to experiment with new stuff. It was the easiest route to determine these things, and I just felt that I should use it if I can."

VP: "This is amazing, I'll go by your office later."

Dude speaks wonders of me. The idea was simple: read through the CSV that was provided to me, do the parsing in a notebook, have it determine the relationships in the data and spit out a bunch of JSON that I could use. Hook that up to a simple gorm Golang script and generate the tables from it. Much simpler than the bullshit that we have in PHP. I used this to create a new database, since the previous application had issues. The app will still have a PHP frontend and backend, but now I don't leave the parsing of the data to PHP, which quite frankly PHP sucks at, imho. The Python codebase creates the JSON files through the predictive modeling (98% accuracy) and then the Go program populates the DB for me.
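The CSV-in, JSON-schema-out step can be sketched roughly like this. This is a minimal illustration, not the actual pipeline: the column names, sample rows, and the coarse type-inference rules are made up, and a real predictive model would be doing far more than this.

```python
# Minimal sketch of the notebook step: take HR CSV rows, infer a rough
# per-column schema, and emit JSON that a downstream Go/gorm script
# could consume. Names and rules are illustrative only.
import json

def infer_type(values):
    """Guess a coarse column type from sample string values."""
    non_empty = [v for v in values if v.strip()]
    if not non_empty:
        return "string"
    if all(v.lstrip("-").isdigit() for v in non_empty):
        return "integer"
    try:
        for v in non_empty:
            float(v)
        return "float"
    except ValueError:
        return "string"

def csv_to_schema(rows, fieldnames):
    """Map each column name to an inferred type, sampling up to 100 rows."""
    schema = {}
    for col in fieldnames:
        samples = [row[col] for row in rows[:100]]
        schema[col] = infer_type(samples)
    return schema

# Made-up sample rows standing in for csv.DictReader output.
rows = [
    {"employee_id": "1", "name": "Ana", "dept": "HR"},
    {"employee_id": "2", "name": "Bo", "dept": "IT"},
]
schema = csv_to_schema(rows, ["employee_id", "name", "dept"])
print(json.dumps(schema))
```

The Go side would then read that JSON and build gorm models from it; the split keeps the messy inference in Python and the table generation in Go.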

There are also some Node scripts that help test the data, since the data is JSON.

All in all a good day of work. The VP seems scared since he knows no one on this side of town knows about this kind of tech. Me? I am just happy I get to experiment. Y'all should have seen his face when I showed him a rather large app written in Clojure, the man just went 0.0 when he saw Lisp code.

I think I scare him.

  • 1
    How did you do pattern recognition?

    I need to do a similar task. I have a lot of addresses which are not formatted correctly, and I need to extract city, street and so on.

    I think I need to find out which patterns the addresses are formatted in first, before I clean them into the correct format.
  • 0
    @mr-user how about a parser that parses the format you defined, with everything else going through other parsers, and the results then updated in the db?
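    A minimal sketch of this fall-through parser idea, in Python for illustration; the two formats and their regexes are made up, and real data would need many more parsers:

```python
# Try a list of format-specific parsers in order; the first one that
# matches wins, and anything no parser accepts is flagged for review.
import re

def parse_comma_format(addr):
    """Made-up format: 'street, city'."""
    m = re.fullmatch(r"(?P<street>[^,]+),\s*(?P<city>[^,]+)", addr)
    return m.groupdict() if m else None

def parse_slash_format(addr):
    """Made-up format: 'street/city'."""
    m = re.fullmatch(r"(?P<street>[^/]+)/(?P<city>[^/]+)", addr)
    return {k: v.strip() for k, v in m.groupdict().items()} if m else None

PARSERS = [parse_comma_format, parse_slash_format]

def parse_address(addr):
    for parser in PARSERS:
        result = parser(addr)
        if result is not None:
            return result
    return None  # unparsed; queue for manual review

print(parse_address("12 Main St, Springfield"))
print(parse_address("12 Main St/Springfield"))
```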
  • 0
    It has such an Indian accent I can't understand what I'm reading :D
  • 0

    Addresses, phone numbers, etc. have a plethora of possible formats and rules.

    The more international the addresses, the lower the chance of getting something useful out of them.

    Just as an example of the most funny thing: https://en.m.wikipedia.org/wiki/...

    Ireland didn't have postcodes till 2014.

    I'm not saying it's impossible, rather that it gets harder the more deviation you have, based on different standards in different countries.

    Finding relations is easier in my opinion.
  • 0

    Ireland did not have postal codes until 2014. My country did not have postal codes.
  • 0

    I thought of that before, but since users manually input the data there are different separators like slash, comma and so on.

    I also thought of creating a list of all the possible cities in my country and searching against that list to extract city names, but there are problems:

    1. Users misspell city names.

    2. Some city names are the same as state names, like New York City and New York State.
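    One way to soften the misspelling problem (point 1) is fuzzy matching against the city list. A minimal sketch using Python's stdlib difflib, with a made-up city list:

```python
# Fuzzy-match a token against a known city list so common misspellings
# still resolve; anything below the similarity cutoff returns None.
import difflib

CITIES = ["New York", "Los Angeles", "Houston", "Chicago"]  # made-up list

def match_city(token, cutoff=0.8):
    """Return the closest known city name, or None if nothing is close."""
    hits = difflib.get_close_matches(token, CITIES, n=1, cutoff=cutoff)
    return hits[0] if hits else None

print(match_city("Huston"))       # misspelled Houston still matches
print(match_city("Springfield"))  # unknown city: no match
```

    It doesn't solve point 2 (city vs. state names); that needs extra context like position in the string or a postal code.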
  • 1
    @mr-user Plus, city names aren't unique.

    Without a unique identifier like a zip code it's really shotgun mode xD
  • 1
    @blindXfish I am from Texas, so you can go eat shit :D
  • 0
    @mr-user yours seems more a problem for regex than pattern recognition. At the same time, what general format do you have for those addresses?
  • 1

    That's the thing. I am trying to find a general format. There is not much data education, and people can fill it out in whatever format they want.

    Imagine a single text field called Address. They can either put their street name at the front or use whatever format they think of.

    I was trying to look at those millions of unclean addresses and figure out how many of the customers live in which city, on which street and so on.

    I am not hoping for 100% accuracy since I know that is impossible, but I at least want to clean a large portion of it.
  • 0
    @mr-user then some data transformation will have to take place in order to determine categorical positioning. Take two tables with the exact same "type of data", where one of them has a column for address. Just by the label, that column is something you already know and can represent with a 0 or a 1, even if users put the information in different ways, e.g. "Street, apt #" versus "Apt # Street"; the label is well known to represent an address.

    Now, as far as having a system that can properly generalize and fix addresses to your format, that seems rather heavy. And if you want your system to pick a string of text out of another data set and determine whether it is an address or not, that would be even more difficult.

    At this time, standard machine learning might not be able to pinpoint whether something is an address, but a neural net that, as an additional step, checks candidates against something like Google might.
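    The 0-or-1 labeling idea above can be sketched minimally like this; the label set and column names are made up for illustration:

```python
# Encode "is this column an address column?" as 0/1 purely from the
# column label, regardless of how users formatted the values inside it.
ADDRESS_LABELS = {"address", "addr", "street_address"}  # made-up label set

def is_address_column(label):
    """1 if the column label names an address column, else 0."""
    return 1 if label.strip().lower() in ADDRESS_LABELS else 0

columns = ["name", "Address", "phone", "addr"]
encoding = {c: is_address_column(c) for c in columns}
print(encoding)  # {'name': 0, 'Address': 1, 'phone': 0, 'addr': 1}
```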
  • 1
    @IntrusionCM for some reason I went back and started thinking of Silicon Valley, where an AI started correcting user inputs. Don't wanna say more in case I drop a big spoiler. Apologies if I did and someone here wants to watch the TV show.