Comments
@AlgoRythm Stupid suggestion, bro. They need an algorithm, not a programming language. Trust me, I'm an HTML programmer.
@ihatecomputers Bro, bro, HTML has super duper algorithms for big data and AI. Trust me, I'm an HTML engineer.
@ewpratten Let's do it in Scratch, compile it with level 69 optimization, and encode it with the IAMVERYSMALL encoding algo. Should accomplish the task in 3 bytes of storage and 14 bits of memory.
I would suggest feature reduction methods. Since PCA won't work efficiently with that many dimensions, you could remove highly frequent words (they are often stopwords) and very infrequent words (they often don't provide much information).
Other than that, use stemming and convert all words to lowercase during tokenization. This will further reduce the number of words in your dictionary.
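A minimal sketch of that document-frequency pruning idea (the thresholds here are illustrative assumptions, not values from the thread):

```python
from collections import Counter

def prune_vocabulary(docs, min_df=2, max_df_ratio=0.5):
    """Keep words that appear in at least min_df documents but in
    no more than max_df_ratio of all documents (likely stopwords)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # count each word once per document, lowercased during tokenization
        doc_freq.update({w.lower() for w in doc.split()})
    return {w for w, c in doc_freq.items()
            if c >= min_df and c / n_docs <= max_df_ratio}
```

Library vectorizers expose the same knobs (e.g. `min_df`/`max_df` in scikit-learn's `TfidfVectorizer`), so in practice you'd usually set those instead of rolling your own.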
What is wrong with all of you? If you have bullshit diarrhea, go shit in someone else's comment section.
@TheSilent I can't remove stop words because I got an abstract dataset (only word IDs). Same for lower/upper case, I don't know how they handled it when they constructed the dataset.
I did an SVD to reduce the dimensionality, which worked OK. Can you suggest a clustering algorithm other than k-means that can handle big data?
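For reference, the SVD-based reduction can be sketched in plain NumPy (a hand-rolled stand-in for a library routine such as scikit-learn's TruncatedSVD; `k` is a free parameter you'd tune):

```python
import numpy as np

def truncated_svd(X, k):
    """Project the rows of X onto the top-k singular directions,
    reducing an (n_docs x n_words) matrix to (n_docs x k)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k]  # document coordinates in the reduced space
```

On a sparse tf*idf matrix of this size you'd want a randomized or sparse SVD rather than the dense one shown here, but the projection it computes is the same.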
@NickyBones Sadly, clustering is not really my field of expertise (I've mostly worked with classification and topic models). But I found a research paper that might be of interest to you when it comes to reducing the number of features: https://stat.berkeley.edu/~mmahoney...
There are other general techniques for reducing features. You could train a neural network as an autoencoder that maps to lower-dimensional vectors. Other than that, aggressive pruning using variance and correlation might help.
When it comes to clustering algorithms I can't really help you much, but you could look into subspace clustering. It seems to be one of the go-to approaches for clustering high-dimensional data.
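The "aggressive pruning using variance" suggestion can be sketched as dropping near-constant feature columns (the threshold is an illustrative assumption):

```python
import numpy as np

def variance_prune(X, threshold=1e-3):
    """Drop feature columns whose variance is at or below the threshold;
    near-constant features carry little information for clustering."""
    keep = X.var(axis=0) > threshold
    return X[:, keep], keep
```

Correlation-based pruning works the same way in spirit: compute pairwise feature correlations and drop one column from each highly correlated pair.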
@TheSilent It's just a project for a data analytics course, and I don't really have an NN background. I try to do as little as possible :)
I found a really nice embedding that allows me to cluster in lower dimensions.
This:
https://youtube.com/watch/...
I don't have enough mathematical background to understand how it works, but it runs way faster than t-SNE, and it's visually pleasing :)
@12bitfloat I let the dataset listen to Infected Mushroom tracks. Beautifully psychedelic results!
Clustering high dimensional data - SOS!
I have to cluster documents. I computed the tf*idf, but I have ~45K docs with 28K words.
I did minibatch k-means, which works alright, but everything else runs forever. I need to compare two clustering algorithms, so I need at least one more that works. In this lifetime.
Suggestions?
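For context, the minibatch k-means mentioned above can be sketched in plain NumPy (a simplified stand-in for scikit-learn's MiniBatchKMeans; taking the first k rows as initial centers is an assumption made here for brevity):

```python
import numpy as np

def minibatch_kmeans(X, k, batch_size=64, n_iters=200, seed=0):
    """Mini-batch k-means: update centers from small random batches
    instead of the full dataset, so each iteration is cheap."""
    rng = np.random.default_rng(seed)
    centers = X[:k].astype(float).copy()  # first k points as initial centers
    counts = np.zeros(k)
    for _ in range(n_iters):
        batch = X[rng.integers(0, len(X), batch_size)]
        # squared distances from each batch point to each center
        d2 = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        for x, c in zip(batch, d2.argmin(1)):
            counts[c] += 1
            # per-center learning rate shrinks as the center sees more points
            centers[c] += (x - centers[c]) / counts[c]
    return centers
```

For a cheap second algorithm to compare against, running something like agglomerative clustering or BIRCH on the SVD-reduced vectors (rather than the raw 28K-dimensional tf*idf) is one way to keep the runtime tractable.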
question
clustering
data science
text
python