2
\$\begingroup\$

I tried to create a peice of code that helps me remove duplicates from my sql database and reset the ID's as well. I am not sure that it's error proof.

import mysql.connector

# Database connection parameters
host = 'host
database = 'database'
user = 'user'
password = 'password'
port = numbers

try:
    # Establishing the connection
    conn = mysql.connector.connect(
        host=host,
        database=database,
        user=user,
        password=password,
        port=port
    )
    
    if conn.is_connected():
        print('Connected to the database')

        cursor = conn.cursor()

        # Remove duplicates
        delete_duplicates_query = """
        DELETE t1 FROM articles t1
        INNER JOIN articles t2 
        WHERE 
            t1.ID < t2.ID AND 
            t1.headline = t2.headline AND 
            t1.description = t2.description AND 
            t1.link = t2.link AND 
            t1.date = t2.date AND 
            t1.hour = t2.hour;
        """
        cursor.execute(delete_duplicates_query)
        conn.commit()
        print('Duplicates removed')

        # Reset primary key
        reset_primary_key_query = """
        SET @new_id = 0;
        UPDATE articles SET ID = (@new_id := @new_id + 1);
        ALTER TABLE articles AUTO_INCREMENT = 1;
        """
        cursor.execute(reset_primary_key_query)
        conn.commit()
        print('Primary key reset')

except mysql.connector.Error as e:
    print(f"Error: {e}")
finally:
    if conn.is_connected():
        cursor.close()
        conn.close()
        print('Database connection closed')

For reference

Table name : articles

ID

headline

description

link

date

hour

\$\endgroup\$

1 Answer 1

1
\$\begingroup\$

UNIQUE index

The OP statement to DELETE dups looks fine.

It's unclear why we would want to INSERT those rows in the first place, only to change our mind and rip them out later. Consider creating a UNIQUE articles index on (headline, description, link, date, hour). Then errant INSERT attempts will immediately report an error, and there will be no need for any DELETE cleanup.

timestamp

The separate {date, hour} columns seem inconvenient. Consider storing a single timestamp column. You can always extract a date or an hour from it later if you need that.

It should store UTC time in seconds since the epoch (1970), with no timezone information. Worry about zone offsets on the way in and the way out, preferring to store just UTC on disk.

reset primary key

This seems ill advised. Recommend you not attempt to change a PK id after it has been stored.

Suppose a line of code syslog's an id, and then you change it. Now the logfile no longer speaks truth.

Suppose you wind up with FK relationships. Now UPDATE articles SET id ... will be updating random rows on those other tables.

Prefer to think of each id as immutable.

\$\endgroup\$

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.