Currently getting into Machine Learning and working on a joke project to identify the main programming language of GitHub repositories based on commit messages. For half of the commits, the language is predicted correctly out of 53 possible languages, which is not too bad given that I have no clue what I'm doing...

  • 11
    That's a great idea. You can also try to classify the actual source code.
  • 8
    @hack that would be too easy. Now let's try guessing whether it contains a web site, web app, server/client, mobile or desktop application. That would be impressive.
  • 1
    For those interested, I'm doing a bit of pre-processing on the crawled commits, then a TF-IDF vectorization without stemming and without stop word removal. Finally, a random forest classification on around 5000 data points (~1 min).
    Suggestions for improvement are welcome 😊
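
For anyone who wants to try something similar, the pipeline described above could be sketched roughly like this. It assumes scikit-learn; the toy commit messages, the three labels, and the hyperparameters are made up for illustration and are not the poster's actual data or settings.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in data: commit messages labelled with the
# main language of the repository they came from.
messages = [
    "fix pip install and update requirements.txt",
    "add pytest fixtures for the django views",
    "bump numpy version in setup.py",
    "update package.json and npm scripts",
    "fix eslint warnings in react component",
    "add webpack config for the node server",
    "update Cargo.toml and run cargo fmt",
    "fix borrow checker error in the parser",
    "add clippy lints and fix unsafe block",
]
labels = ["Python"] * 3 + ["JavaScript"] * 3 + ["Rust"] * 3

# TF-IDF with no stop word removal (stop_words=None) and no stemming
# (TfidfVectorizer does not stem), followed by a random forest.
model = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words=None),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(messages, labels)

# Predict the language for an unseen commit message.
prediction = model.predict(["update npm scripts in package.json"])[0]
print(prediction)
```

With real data you would hold out a test split (for example with `train_test_split`) before reporting accuracy; nine samples are only enough to show the shape of the code.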
  • 4
    "I have no idea what I'm doing, but it's working, so I must be doing something right"
    One of the most hilarious (but also frustrating) aspects of programming 🤣
  • 3
    My last commit message: "improved all the things". Good luck ;)
  • 1
    This is amazing!
  • 2
    Are you using neural nets or classic classifiers?

    My first approach would be a bag of words or a string kernel to calculate the (contextual) dissimilarities between the entries. Then you can use a random projection or PCA to reduce the number of features, and finally a simple k-nearest-neighbour classifier to find the class for your entry.
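
A rough sketch of that alternative, using only the Python standard library (3.8+ for `math.dist`). It picks bag of words over a string kernel and a Gaussian random projection over PCA, and measures dissimilarity with plain Euclidean distance; the helper names, the toy data, and the choice of 8 components are all assumptions for illustration.

```python
import math
import random
from collections import Counter


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def make_projection(n_features, n_components, seed=0):
    """Build a dense Gaussian random projection matrix."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(n_components)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n_features)]
            for _ in range(n_components)]


def project(vector, projection):
    """Map a feature vector into the lower-dimensional space."""
    return [sum(w * x for w, x in zip(row, vector)) for row in projection]


def knn_predict(train_vectors, train_labels, query, k=3):
    """Majority vote among the k training points closest to the query."""
    nearest = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: commit messages with their repo's language.
train_docs = [
    "fix pip install and update requirements.txt",
    "add pytest fixtures for the django views",
    "update package.json and npm scripts",
    "fix eslint warnings in react component",
]
train_labels = ["Python", "Python", "JavaScript", "JavaScript"]

vocabulary = sorted({w for doc in train_docs for w in doc.lower().split()})
projection = make_projection(len(vocabulary), n_components=8)
train_vectors = [project(bag_of_words(doc, vocabulary), projection)
                 for doc in train_docs]

query = project(bag_of_words("update npm scripts", vocabulary), projection)
print(knn_predict(train_vectors, train_labels, query, k=3))
```

On a real vocabulary of thousands of words the projection would reduce the dimension far more drastically than it does here, which is where it starts to pay off for the distance computations.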
  • 0
    Can you share the code?
  • 1