
I would like to know if anyone has created a CSV file with 10,000,000 objects?

1) The data is received via an API call.
2) The maximum amount of data received is 1,000 objects at once, so it needs to be retrieved and inserted in some kind of loop.

Comments
  • 4
    10 million - no. 60 million.
    If using Python, simply use csv.writer in a loop.
    I used psql directly to export CSV. In both cases the speed was bound by the write speed of the output medium.

    In your case you can even do parallel API reads if the order of objects isn't a concern; and if it is, you can still do it in parallel but do some extra work to ensure order (commit once all threads are ready).
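
    A minimal sketch of that parallel variant in Python (fetch_page() is a hypothetical stand-in for the real API call, and the column names are made up):

    import csv
    from concurrent.futures import ThreadPoolExecutor

    PAGE_SIZE = 1000            # the API returns at most 1,000 objects per call
    TOTAL_OBJECTS = 10_000_000  # adjust to the real count
    FIELDS = ["id", "name", "value"]  # assumed column names

    def fetch_page(page):
        # placeholder: replace with the real API call for this page
        start = page * PAGE_SIZE
        return [{"id": start + i, "name": "x", "value": 0} for i in range(PAGE_SIZE)]

    with ThreadPoolExecutor(max_workers=8) as pool, \
            open("export.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        # executor.map yields results in submission order, so the rows stay
        # ordered even though the API calls themselves run in parallel
        for rows in pool.map(fetch_page, range(TOTAL_OBJECTS // PAGE_SIZE)):
            writer.writerows(rows)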
  • 1
    @njpugh90 It needs to be repeated every day. As for technology, I was using Node.js, but I would like to know the Python solution if you have one in mind. The technology can be changed.
  • 0
    I think Python is the best option; it has a native library (the csv module) to handle CSV files.

    https://docs.python.org/3/library/...
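
    For example, the most basic use of that library (the column names here are just placeholders):

    import csv

    with open("export.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "name", "value"])  # header row
        writer.writerow([1, "example", 42])       # one data row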
  • 2
    I'd agree with using Python if you need to modify the content of the CSV, but if the only thing you need is to join the files, a few lines of bash should do the trick: tail to remove the first line if it's the header row, and then >> to append to the final file.

    tail -n +2 "$PARTIAL_FILE" >> "$FINAL_FILE"

    should do the trick
  • 0
    If they need to be processed by some other process later on, you might be better off creating multiple files, by the way... like 100k lines in each file or so...

    That way, you can have your other process on a cron or so, just processing one file at a time again...
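
    Roughly like this (the chunk size and file naming are arbitrary choices here):

    import csv

    CHUNK_SIZE = 100_000  # rows per output file

    def write_chunk(index, rows, fields):
        # each chunk of rows becomes its own file, e.g. chunk_0001.csv
        with open(f"chunk_{index:04d}.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fields)
            writer.writeheader()
            writer.writerows(rows)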
  • 0
    @joykill We did consider multiple files as an option, but it would be too slow. If there are 80,000 objects and, as suggested, we created files of just 100 lines, i.e. 100 rows each, there would be 800 files that we would need to merge. That is a tedious and slow process with just 80,000; imagine it for 10,000,000. It's just not viable. Even if we increase the row count to 1,000, that's still 10,000 files for 10 million.
  • 2
    Is there a reason you dump it into a CSV file instead of a database? You could simply use SQLite or, if you want to be "fancy", Postgres and Elasticsearch.
  • 1
    @Wack "The Clients demand", He needs a excel file.
  • 2
    @MacDev I personally would still dump it into a DB first and then export it. A DB has the advantage that it takes care of "failures" and parallel access. Assuming there is a key column that you can use for ordering, you can even parallelize your crawler without having to deal with write locks and so on on a single file.
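
    A minimal sketch of that staging step with SQLite (the table layout is an assumption; any DB with a unique key works the same way):

    import sqlite3

    db = sqlite3.connect("staging.db")
    db.execute("CREATE TABLE IF NOT EXISTS objects (id INTEGER PRIMARY KEY, payload TEXT)")

    def store_batch(rows):
        # rows: list of (id, payload) tuples from one API call;
        # with several crawler threads, give each worker its own connection
        with db:  # one transaction per batch
            db.executemany("INSERT OR REPLACE INTO objects (id, payload) VALUES (?, ?)", rows)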
  • 0
    @Wack I am already fetching the data from a database built on MongoDB; why would I fetch it, dump it into another database, and then export it to CSV?
  • 2
    Didn't you say you can only access it over an API in 1k chunks?

    I would have a DB on your end and then a parallelized script querying the API simultaneously. Assuming the API data has some field like "id", you can then query your DB with ORDER BY id ASC and dump everything into a CSV at once. After that, either delete the data or use it to run whatever analysis your client now does manually, fully automated/with a nice web UI.
    If you just used a CSV file, you'd most likely be stuck with one process at a time, or you'd have to do some reordering on the finished file. Additionally, you'd need write locks to ensure all data is written correctly.
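
    The export step itself could then look roughly like this (reusing the staging table sketched above):

    import csv
    import sqlite3

    db = sqlite3.connect("staging.db")
    with open("export.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "payload"])
        # ORDER BY id gives a deterministic order regardless of how
        # the parallel crawlers inserted the rows
        for row in db.execute("SELECT id, payload FROM objects ORDER BY id ASC"):
            writer.writerow(row)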