Modeling a CSV file: What is the standard? Python or SQL?

Question

I have a wide CSV file of about 350mb, and want to load it into a SQL database and properly model the data to make it easier to use for analysis.

I could split the data into tables with python and then loaded into sql
Or load the file into the database as a table, and then split it using sql.

What would be the standard approach? Or how should I choose?

You only expect to do this once? Use whatever gets the job done fastest. — Thorbjørn Ravn Andersen, Commented Dec 22, 2023 at 16:34
For the current task I plan to do it only once, but I imagine that I'll have to perform similar tasks in the future. Thanks for the insight though, its definitely a viable option. — HappilyCoding, Commented Dec 22, 2023 at 17:32
"properly model the data" without even a hint what that means leaves a door open big enough to drive a Bagger 288 through. — whatsisname, Commented Dec 22, 2023 at 21:49
load it all into sql and then work on it. The reason being, you will want to report on the errored lines. having the raw data in a "rawData" table makes it easy to say "couldn't parse row 123" and work out what the issue is, rather than errored on line 12312412, col x start again! — Ewan, Commented Dec 22, 2023 at 22:01

Doc Brown · Accepted Answer · 2023-12-22 17:11:00Z

8

It may be unsatisfying for you, but for this kind of task (as well as for the majority of other software engineering tasks), the answer is

There is no standard.

I have actually seen both kind of approaches working (and working well) for comparable tasks - ETL processes can be designed with "Extract" and "Transform" fully inside the database server using stored procedures, or on a client, with whatever programming language one is most familiar with.

So as long as you superiors don't tell you "at our team, we prefer X over Y, since X is our internal standard", choose whichever approach you feel more comfortable with.

answered Dec 22, 2023 at 17:11

Doc Brown

218k34 gold badges402 silver badges612 bronze badges

Thank you, Its good to know that both approaches work well in production. Do you know if pandas would be a good tool to use in the python approach in production? I've used it and seen it widely used in ML competitions and such, but I don't think I've come across any job postings that mention it, unlike spark.
– HappilyCoding
Commented Dec 22, 2023 at 17:36
2

@HappilyCoding: No I did not use Pandas in the past. But even if I would have, I think it is never a good idea to give recommendations for or against a tool without knowing if the tool fits to the requirements.
– Doc Brown
Commented Dec 22, 2023 at 19:22
@HappilyCoding You don't even need to load the file into a DB to use SQL. There is a Python library called csvquery (for example) which supports using SQL directly on set of CSV files.
– JimmyJames supports Canada
Commented Dec 27, 2023 at 16:27

Add a comment |

Stack Exchange Network

Modeling a CSV file: What is the standard? Python or SQL?

1 Answer 1

Hot Network Questions

Modeling a CSV file: What is the standard? Python or SQL?

1 Answer 1

Related

Hot Network Questions