Doing linguistic research where I need to parse 2000 files of a total of 36 GB. Since we are using python the first thin

Ranter

K-ASS

2494

Comments

12

K-ASS

2494

8y

Also after I used multi processing the runtime now is 7 hours.
7

K-ASS

2494

8y

There was also one time my runtime shortened from 5 min to 10 sec when I used hash set to store my items, took me a while to believe that I did not fuck up anything
0

ragecoder

472

8y

@K-ASS I'm always confused with python when it comes to this. I can't remember if its multiple threads or processes.
7

K-ASS

2494

8y

@ragecoder so multi threading is like giving a core multiple work, making it overload while other cores are just chilling at the side watching. Multi processing is like distributing works among cores so everybody is working and everyone get paid.
0

Pokka

79

8y

Why not Spark? 🙃
1

K-ASS

2494

8y

@dprm oh boy I'm still trying to implement spark in python, if you know a good tutorial of how to use spark in python please reply to this comment
1

magicMirror

10338

8y

@K-ASS ok. that is pretty usefull knowledge. Now go read about async programing in python. Also go read about the GIL and why android threads work like that. From a performance point of view:
async>multiprocess>multithread
0

Pokka

79

8y

@K-ASS what I do is implement whole hadoop stack with Apache Ambari. Of course hadoop, spark, and zeppelin will be included. So you can run zeppelin notebook then run spark on top of that. Not the best method imo, but it works 👍
0

K-ASS

2494

8y

@magicMirror that is something new, I will look into it
1

dsteiner

1683

8y

Well the only time.i remember when I have done such a performance optimisation was when i worked at a startup that created a music app. We parsed iTunes data via the apple Enterprise feed and the original parser (python) took more than 30 hours for a feed file with all updates for the day (which is ridiculous as you are behind half a day when the next feed file is out) - I've rewritten that program in c# and it only took somewhat like 20 minutes to import the whole feed file.
0

micfort

137

8y

Actually, this holds in python. Pythons version of multitrheading can't use multiple cores. The problem comes from the global interpreter lock, GIL for short. That makes sure that a single python instruction is executed at all times. Not all python implementations have this limitation/feature btw. If you go to another language like c#/java/c++, then multitrheading can use all the cores and you don't have to start up another processes and use a communication layer.
0

JustAGuy

66

8y

@K-ASS

Try to use structs instead of classes to see something incredible!

Hint: structs are value type and they are stored in memory one near another so when your cpu calls for the first value the one near is also get because the cpu take 32/24 bits or more depending of your CPU. I made it out from 7 hours to 3 seconds with 2 billions items in a list.

Add Comment

undefined