-1

I have a wide CSV file of about 350mb, and want to load it into a SQL database and properly model the data to make it easier to use for analysis.

  1. I could split the data into tables with python and then loaded into sql
  2. Or load the file into the database as a table, and then split it using sql.

What would be the standard approach? Or how should I choose?

4
  • 2
    You only expect to do this once? Use whatever gets the job done fastest. Commented Dec 22, 2023 at 16:34
  • For the current task I plan to do it only once, but I imagine that I'll have to perform similar tasks in the future. Thanks for the insight though, its definitely a viable option. Commented Dec 22, 2023 at 17:32
  • "properly model the data" without even a hint what that means leaves a door open big enough to drive a Bagger 288 through. Commented Dec 22, 2023 at 21:49
  • 1
    load it all into sql and then work on it. The reason being, you will want to report on the errored lines. having the raw data in a "rawData" table makes it easy to say "couldn't parse row 123" and work out what the issue is, rather than errored on line 12312412, col x start again!
    – Ewan
    Commented Dec 22, 2023 at 22:01

1 Answer 1

8

It may be unsatisfying for you, but for this kind of task (as well as for the majority of other software engineering tasks), the answer is

There is no standard.

I have actually seen both kind of approaches working (and working well) for comparable tasks - ETL processes can be designed with "Extract" and "Transform" fully inside the database server using stored procedures, or on a client, with whatever programming language one is most familiar with.

So as long as you superiors don't tell you "at our team, we prefer X over Y, since X is our internal standard", choose whichever approach you feel more comfortable with.

3
  • Thank you, Its good to know that both approaches work well in production. Do you know if pandas would be a good tool to use in the python approach in production? I've used it and seen it widely used in ML competitions and such, but I don't think I've come across any job postings that mention it, unlike spark. Commented Dec 22, 2023 at 17:36
  • 2
    @HappilyCoding: No I did not use Pandas in the past. But even if I would have, I think it is never a good idea to give recommendations for or against a tool without knowing if the tool fits to the requirements.
    – Doc Brown
    Commented Dec 22, 2023 at 19:22
  • @HappilyCoding You don't even need to load the file into a DB to use SQL. There is a Python library called csvquery (for example) which supports using SQL directly on set of CSV files. Commented Dec 27, 2023 at 16:27

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.