Just finished dumping all Ethereum transactions into one big 30 GB CSV.

Only thing left is to configure an Apache Spark cluster.

  • 0
    what's your goal? :)
  • 0
    For 30 GB? Spark? It's way cheaper, easier and faster to go without Spark on this one.
  • 1
    @achintyakumar yes.

    Spark is a big hardware overhead. Its benefit is that it can scale indefinitely, not being limited by the hardware of a single machine.

    But if one machine is as large as several smaller machines combined, it will do the job faster.

    You just have to organize your data properly to exploit the full potential of the machine you have. If you have a lot of RAM, use it: load data structures into it for processing, caching and so on. If you have lots of disk space, add indexes wherever an operation would otherwise cost more than about twice O(n log n). Pre-sort your data, partition it to minimize page lookups, and normalize it if you read it in many different ways, each time touching only a small subset of attributes.

    It depends on what you want to do with it, but 30 GB is a truly TINY size. I spin up larger random data sets just to test the performance of different algorithms, on a mid-range VPS from online.net.
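
    To make the "stream it on one machine" point concrete, here is a minimal sketch of chunked processing of a large CSV with pandas, so RAM usage stays bounded no matter the file size. The column names (`from_address`, `value`) are assumptions about the export schema, not something from the original post.

    ```python
    # Sketch: single-machine aggregation over a huge CSV, streamed in chunks.
    # Column names below are assumed; adapt them to the actual export schema.
    import pandas as pd

    def total_sent_per_address(csv_path, chunksize=1_000_000):
        """Sum the 'value' column per 'from_address', reading the file in
        bounded-size chunks and merging partial results as we go."""
        totals = {}
        for chunk in pd.read_csv(csv_path,
                                 usecols=["from_address", "value"],
                                 chunksize=chunksize):
            partial = chunk.groupby("from_address")["value"].sum()
            for addr, v in partial.items():
                totals[addr] = totals.get(addr, 0) + v
        return totals
    ```

    A chunked pass like this is sequential I/O, which is exactly the access pattern a single large disk handles well; the same idea extends to filtering, sorting into partitions, or building indexes before heavier queries.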