10

Clustering high dimensional data - SOS!

I have to cluster documents. I computed tf-idf, but I have ~45K docs with 28K words.
Mini-batch k-means works alright, but everything else runs forever. I need to compare two clustering algorithms, so I need at least one more that works. In this lifetime.
Suggestions?
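
(For anyone wondering why mini-batch k-means is the one that finishes: it updates centroids from small random batches instead of the full dataset on every pass. A toy NumPy sketch of the idea, not my actual code; all names and parameters are illustrative:)

```python
import numpy as np

def minibatch_kmeans(X, k, batch_size=50, n_iters=100, seed=0):
    """Mini-batch k-means: each iteration assigns one small random batch
    to the nearest centroids and nudges those centroids toward the batch."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    counts = np.zeros(k)  # per-centroid assignment counts (sets the step size)
    for _ in range(n_iters):
        batch = X[rng.choice(len(X), size=batch_size)]
        # squared distances from every batch point to every centroid
        d = ((batch[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        for x, c in zip(batch, d.argmin(axis=1)):
            counts[c] += 1
            eta = 1.0 / counts[c]  # decaying per-centroid learning rate
            centroids[c] = (1 - eta) * centroids[c] + eta * x
    return centroids

# tiny demo on two well-separated blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (200, 2)), rng.normal(5, 0.1, (200, 2))])
C = minibatch_kmeans(X, k=2)
```

Each centroid only ever sees a batch per iteration, so the cost per step is independent of the 45K docs; that's the whole trick.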

Comments
  • 5
    Use HTML
  • 4
    @AlgoRythm Stupid suggestion bro. They need an algorithm, not a programming language. Trust me, I'm an HTML programmer.
  • 5
    @ihatecomputers Bro, bro, HTML has super duper algorithms for big data and AI. Trust me, I'm an HTML engineer.
  • 5
    @AlgoRythm That doesn't sound right. Let me check with my manager. brb

    edit: no, you are correct
  • 4
    @AlgoRythm css now supports algo btw
  • 4
    @devTea No, that won't do, trust me, you need HTML for this.
  • 3
    @AlgoRythm enjoy your high level language, I'm going to do it in plain text
  • 3
    @ewpratten Fuck, I dunno, I might do it in .docx
  • 3
    @AlgoRythm that's too much encoding.

    Gotta keep it small and efficient so it can run on my abacus
  • 3
    @ewpratten Let's do it in scratch, compile it with level 69 optimization, and encode it with IAMVERYSMALL encoding algo. Should accomplish the task in 3 bytes of storage and 14 bits of memory.
  • 2
    Fuck me sideways, that evolved right quick.
  • 2
    I would suggest feature-reduction methods. Since PCA won't work efficiently at dimensions like that, you could instead remove very frequent words (they are often stopwords) and very rare words (they often don't carry much information).

    Other than that, use stemming and convert all words to lowercase during tokenization. This will further reduce the number of words in your dictionary.
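
    Concretely, the document-frequency filtering could look like this (toy count matrix and made-up thresholds, just to show the idea):

```python
import numpy as np

# hypothetical toy doc-term count matrix: rows = docs, cols = vocabulary terms
X = np.array([
    [3, 1, 0, 0, 1],
    [2, 0, 1, 0, 1],
    [4, 0, 0, 1, 1],
    [3, 1, 0, 0, 1],
])

df = (X > 0).mean(axis=0)           # fraction of docs containing each term
keep = (df >= 0.25) & (df <= 0.75)  # drop near-ubiquitous and near-unique terms
X_reduced = X[:, keep]              # terms 0 and 4 appear in every doc and get cut
```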
  • 1
    What is wrong with all of you? If you have bullshit diarrhea go shit in someone else's comment section.

    @TheSilent I can't remove stop words because I have an abstract dataset (only word IDs). Same for upper/lower case; I don't know how it was handled when the dataset was constructed.
    I did an SVD to reduce the dimensionality, which worked OK. Can you suggest a clustering algorithm other than k-means that works with big data?
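
    The SVD step was roughly along these lines (a toy NumPy version on random data, not the real matrix; k is a placeholder):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 50))        # stand-in for the real tf-idf matrix

# truncated SVD: keep only the top-k singular directions
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 10
X_reduced = U[:, :k] * s[:k]     # documents projected into k dimensions
```

    By Eckart-Young this is the best rank-k approximation in the least-squares sense, so the larger you make k, the less structure you throw away.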
  • 1
    @NickyBones Sadly, clustering is not really my field of expertise (I've mostly worked with classification and topic models). But I found a research paper that might interest you when it comes to reducing the number of features: https://stat.berkeley.edu/~mmahoney...

    There are other general techniques for reducing features. You could train a neural network as an autoencoder that maps documents to lower-dimensional vectors. Other than that, aggressive pruning using variance and correlation might help.

    When it comes to clustering algorithms I can't really help you much, but you could look into subspace clustering. It seems to be one of the go-to approaches for clustering high-dimensional data.
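
    The variance/correlation pruning I mean, in code (toy matrix; the 1e-8 and 0.95 cutoffs are arbitrary picks):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((200, 6))
X[:, 2] = 0.5                   # a constant (zero-variance) feature
X[:, 5] = X[:, 0]               # a perfectly correlated duplicate feature

# step 1: drop near-constant features
X1 = X[:, X.var(axis=0) > 1e-8]

# step 2: drop one feature out of each highly correlated pair
corr = np.corrcoef(X1, rowvar=False)
upper = np.triu(np.abs(corr), k=1)   # consider each pair only once
drop = (upper > 0.95).any(axis=0)
X2 = X1[:, ~drop]
```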
  • 1
    @TheSilent It's just a project for a data analytics course, and I don't really have an NN background. I try to do as little as possible :)
    I found a really nice embedding that allows me to cluster in lower dimensions.
    This:
    https://youtube.com/watch/...
    I don't have enough mathematical background to understand how it works, but it runs way faster than t-SNE, and it's visually pleasing :)
  • 1
    Have you tried miracle sort?
  • 2
    @12bitfloat I let the dataset listen to Infected Mushroom tracks. Beautifully psychedelic results!