
I would like to know if anyone has created a CSV file with 10,000,000 objects?

1) The data is received via an API call.
2) The maximum amount of data received is 1,000 objects at once, so it needs to be retrieved and inserted in some kind of loop.

Comments
  • 4
    10 million - no. 60 million.
    If using Python, simply use csv.writer in a loop.
    I used psql directly to export CSV. In both cases the speed was bound by the write speed of the output medium.

    In your case you can even do parallel API reads if the order of objects isn't a concern; and if it is, you can still do it in parallel but do some extra work to ensure order (commit once all threads are ready).
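
    A minimal sketch of that parallel variant in Python (fetch_page() is a hypothetical stand-in for the real API call, and the column names are made up):

    import csv
    from concurrent.futures import ThreadPoolExecutor

    PAGE_SIZE = 1000            # the API returns at most 1,000 objects per call
    TOTAL_OBJECTS = 10_000_000  # adjust to the real count
    FIELDS = ["id", "name", "value"]  # assumed column names

    def fetch_page(page):
        # placeholder: replace with the real API call for this page
        start = page * PAGE_SIZE
        return [{"id": start + i, "name": "x", "value": 0} for i in range(PAGE_SIZE)]

    with ThreadPoolExecutor(max_workers=8) as pool, \
            open("export.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        # executor.map yields results in submission order, so the rows stay
        # ordered even though the API calls themselves run in parallel
        for rows in pool.map(fetch_page, range(TOTAL_OBJECTS // PAGE_SIZE)):
            writer.writerows(rows)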
  • 1
    @njpugh90 It needs to be repeated every day. As for technology, I was using Node.js, but I would like to know the Python solution if you have one in mind. The technology can be changed.
  • 0
    I think Python is the best option; it has a native library (the csv module) to handle CSV files.

    https://docs.python.org/3/library/...
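
    For example, the most basic use of that library (the column names here are just placeholders):

    import csv

    with open("export.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "name", "value"])  # header row
        writer.writerow([1, "example", 42])       # one data row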
  • 2
    I'd agree with using Python if you need to modify the content of the CSV, but if the only thing you need is to join the files, a few lines of bash should do the trick: tail to remove the first line if it's the header row, and then >> to append to the final file.

    tail -n +2 "$PARTIAL_FILE" >> "$FINAL_FILE"

    should do the trick
  • 0
    If they need to be processed by some other process later on, you might be better off creating multiple files, by the way... like 100k lines in each file or so...

    That way, you can have your other process on a cron or so, just processing one file at a time again...
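
    Roughly like this (the chunk size and file naming are arbitrary choices here):

    import csv

    CHUNK_SIZE = 100_000  # rows per output file

    def write_chunk(index, rows, fields):
        # each chunk of rows becomes its own file, e.g. chunk_0001.csv
        with open(f"chunk_{index:04d}.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fields)
            writer.writeheader()
            writer.writerows(rows)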
  • 0
    @joykill We did consider multiple files as an option, but it would be too slow. If there are 80,000 objects and, as suggested, we created files of just 100 lines, i.e. 100 rows each, there would be 800 files that we would need to merge. That is a tedious and slow process with just 80,000; imagine it for 10,000,000. It's just not viable. Even if we increase the row count to 1,000, that's still 10,000 files for 10 million.
  • 2
    Is there a reason you dump it into a CSV file instead of a database? You could simply use SQLite or, if you want to be "fancy", Postgres and Elasticsearch.
  • 1
    @Wack "The Clients demand", He needs a excel file.
  • 2
    @MacDev I personally would still dump it into a DB first and then export it. A DB has the advantage that it takes care of "failures" and parallel access. Assuming there is a key column that you can use for ordering, you can even parallelize your crawler without having to deal with write locks and so on on a single file.
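
    A minimal sketch of that staging step with SQLite (the table layout is an assumption; any DB with a unique key works the same way):

    import sqlite3

    db = sqlite3.connect("staging.db")
    db.execute("CREATE TABLE IF NOT EXISTS objects (id INTEGER PRIMARY KEY, payload TEXT)")

    def store_batch(rows):
        # rows: list of (id, payload) tuples from one API call;
        # with several crawler threads, give each worker its own connection
        with db:  # one transaction per batch
            db.executemany("INSERT OR REPLACE INTO objects (id, payload) VALUES (?, ?)", rows)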
  • 0
    @Wack I am already fetching the data from a database built on MongoDB; why would I fetch it, dump it into another database, and then export it to CSV?
  • 2
    Didn't you say you can only access it over an API in 1k chunks?

    I would have a DB on your end and then a parallelized script querying the API simultaneously. Assuming the API data has some field like "id", you can then query your DB with ORDER BY id ASC and dump everything into a CSV at once. After that, either delete the data or use it to run whatever analysis your client now does manually, fully automated/with a nice web UI.
    If you just used a CSV file, you'd most likely be stuck with one process at a time, or you'd have to do some reordering on the finished file. Additionally, you'd need write locks to ensure all data is written correctly.
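
    The export step itself could then look roughly like this (reusing the staging table sketched above):

    import csv
    import sqlite3

    db = sqlite3.connect("staging.db")
    with open("export.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "payload"])
        # ORDER BY id gives a deterministic order regardless of how
        # the parallel crawlers inserted the rows
        for row in db.execute("SELECT id, payload FROM objects ORDER BY id ASC"):
            writer.writerow(row)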