3picName

6y

Currently getting into Machine Learning and working on a joke project to identify the main programming language of GitHub repositories based on commit messages. For half of the commits, the language is predicted correctly out of 53 possible languages, which is not too bad given that I have no clue what I'm doing...

random

project

machine learning

ml

supervised learning

Ranter

Comments

11

hack

6181

6y

That's a great idea. You can also try to classify the actual source code.
7

p100sch

1375

6y

@hack that would be too easy. Now let's try guessing whether it contains a website, web app, server/client, mobile or desktop application. That would be impressive.
1

3picName

726

6y

For those interested, I'm doing a bit of pre-processing on the crawled commits, then a TF-IDF vectorization without stemming and without stop-word removal, and finally a random forest classification on around 5000 data points (~1 min).
Suggestions for improvement are welcome 😊
4
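
The pipeline 3picName describes (TF-IDF without stemming or stop-word removal, then a random forest) could look roughly like the sketch below. It assumes scikit-learn, which the thread does not confirm; the toy commit messages, labels, and hyperparameters are made up for illustration, and the pre-processing step is left out because it is not described.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; the real project used around 5000 crawled commits.
messages = [
    "fix NullPointerException in UserService",
    "add maven dependency for junit",
    "update requirements.txt and fix pytest fixtures",
    "bump pip version, add __init__.py",
    "npm install webpack, fix eslint warnings",
    "update package.json scripts",
]
languages = ["Java", "Java", "Python", "Python", "JavaScript", "JavaScript"]

model = make_pipeline(
    # stop_words=None keeps stop words; no stemmer is applied anywhere.
    TfidfVectorizer(lowercase=True, stop_words=None),
    # n_estimators and random_state are assumed values, not the author's.
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(messages, languages)

# Predict the language for an unseen commit message.
predictions = model.predict(["fix failing pytest run"])
print(predictions)
```

With only six training messages the prediction is not meaningful; the point is the shape of the pipeline, which stays the same with a real dataset and a held-out test split.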

endor

5476

6y

"I have no idea what I'm doing, but it's working, so I must be doing something right"
One of the most hilarious (but also frustrating) aspects of programming 🤣
2

JohnnyBvo

80

6y

My last commit message: "improved all the things". Good luck ;)
1

AlmondSauce

15618

6y

This is amazing!
2

Emphiliis

1839

6y

Are you using neural nets or classic classifiers?

My first approach would be a bag of words or a string kernel to calculate the (contextual) dissimilarities between the entries. Then you can use a random projection or PCA to reduce the number of features, and finally a simple k-nearest-neighbour classifier to find the class for your entry.
0
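
A minimal sketch of the approach Emphiliis suggests, using only the Python standard library: bag-of-words counts, a Gaussian random projection to shrink the feature space, and a k-nearest-neighbour vote. The toy data, the output dimension of 4, the Euclidean distance, and all helper names are assumptions made for illustration; the string-kernel and PCA variants are not shown.

```python
import math
import random
from collections import Counter

def tokenize(text):
    return text.lower().split()

def build_vocabulary(texts):
    return sorted({tok for text in texts for tok in tokenize(text)})

def bag_of_words(text, vocabulary):
    # Count how often each vocabulary word occurs; unknown words are ignored.
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

def make_random_projection(in_dim, out_dim, seed=0):
    # Fixed Gaussian matrix, scaled so distances are roughly preserved.
    rng = random.Random(seed)
    matrix = [[rng.gauss(0.0, 1.0) / math.sqrt(out_dim) for _ in range(in_dim)]
              for _ in range(out_dim)]
    def project(vector):
        return [sum(w * x for w, x in zip(row, vector)) for row in matrix]
    return project

def knn_predict(query, points, labels, k=3):
    # Majority vote among the k training points closest to the query.
    nearest = sorted(range(len(points)),
                     key=lambda i: math.dist(query, points[i]))[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

# Hypothetical toy data standing in for crawled commit messages.
messages = [
    "fix NullPointerException in UserService",
    "add maven dependency for junit",
    "update requirements.txt and fix pytest fixtures",
    "bump pip version and add __init__.py",
    "npm install webpack and fix eslint warnings",
    "update package.json scripts",
]
languages = ["Java", "Java", "Python", "Python", "JavaScript", "JavaScript"]

vocabulary = build_vocabulary(messages)
project = make_random_projection(len(vocabulary), out_dim=4)
points = [project(bag_of_words(m, vocabulary)) for m in messages]

def classify(text, k=3):
    return knn_predict(project(bag_of_words(text, vocabulary)), points, languages, k)

print(classify("fix eslint warnings"))
```

The same projection function has to be reused for training and query vectors, otherwise the distances between them are meaningless.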

mt3o

1877

6y

Can you share the code?
1

3picName

726

6y

https://github.com/kvnmlr/...

devRant © 2021 Hexical Labs LLC