31
K-ASS
7y

Doing linguistic research where I need to parse 2000 files of a total of 36 GB. Since we are using python the first thing I thought was to implement multi threading. Now I changed the total runtime from three days to like one day and a half. But then when I checked the activity monitor I saw only 20 percent of the CPU usage. After a searching process I started to understand how multi threading and multi processing works. Moral of the story: if you want to ping a website till they block you or do easy tasks that will not use up all power of one core, do multi thrading. If you need to do something complicated that can easily consume all the powers of a single CPU core, split up the work and do multi processing. In my case, when I tried to grab information from a website, I did multi thrading since the work is easy and I really wanted to pin the website 16 times simultaneously but only have 4 cores. But when it come to text processing which a single file will take 80 percent of cpu, split it up and do multi processing.

This is just a post for those who are confused with when to use which.

Comments
  • 12
    Also after I used multi processing the runtime now is 7 hours.
  • 7
    There was also one time my runtime shortened from 5 min to 10 sec when I used hash set to store my items, took me a while to believe that I did not fuck up anything
  • 0
    @K-ASS I'm always confused with python when it comes to this. I can't remember if its multiple threads or processes.
  • 7
    @ragecoder so multi threading is like giving a core multiple work, making it overload while other cores are just chilling at the side watching. Multi processing is like distributing works among cores so everybody is working and everyone get paid.
  • 0
    Why not Spark? 🙃
  • 1
    @dprm oh boy I'm still trying to implement spark in python, if you know a good tutorial of how to use spark in python please reply to this comment
  • 1
    @K-ASS ok. that is pretty usefull knowledge. Now go read about async programing in python. Also go read about the GIL and why android threads work like that. From a performance point of view:
    async>multiprocess>multithread
  • 0
    @K-ASS what I do is implement whole hadoop stack with Apache Ambari. Of course hadoop, spark, and zeppelin will be included. So you can run zeppelin notebook then run spark on top of that. Not the best method imo, but it works 👍
  • 0
    @magicMirror that is something new, I will look into it
  • 1
    Well the only time.i remember when I have done such a performance optimisation was when i worked at a startup that created a music app. We parsed iTunes data via the apple Enterprise feed and the original parser (python) took more than 30 hours for a feed file with all updates for the day (which is ridiculous as you are behind half a day when the next feed file is out) - I've rewritten that program in c# and it only took somewhat like 20 minutes to import the whole feed file.
  • 0
    Actually, this holds in python. Pythons version of multitrheading can't use multiple cores. The problem comes from the global interpreter lock, GIL for short. That makes sure that a single python instruction is executed at all times. Not all python implementations have this limitation/feature btw. If you go to another language like c#/java/c++, then multitrheading can use all the cores and you don't have to start up another processes and use a communication layer.
  • 0
    @K-ASS

    Try to use structs instead of classes to see something incredible!

    Hint: structs are value type and they are stored in memory one near another so when your cpu calls for the first value the one near is also get because the cpu take 32/24 bits or more depending of your CPU. I made it out from 7 hours to 3 seconds with 2 billions items in a list.
Add Comment