A recurring pattern in my code is chaining together a lot of functions, because a given task often needs a large number of processing steps. This could be, for example, data processing and visualization, or my most recent example:
For some financial analysis, I want to read in a couple of PDFs that are bank statements, extract the text, find the transactions using regular expressions, and convert the resulting strings to the proper data types. After that, I would analyze those transactions in some way.
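To make the regex-and-convert step concrete, here is a minimal sketch of what I mean. The line format (`DD.MM.YYYY description amount`), the pattern, and the names `TRANSACTION_RE` and `parse_line` are assumptions for illustration only; real statements will look different.

```python
import re
from datetime import date, datetime
from decimal import Decimal
from typing import Optional, Tuple

# Assumed line format (hypothetical): "03.01.2024 Grocery store -45.90"
TRANSACTION_RE = re.compile(
    r"^(?P<date>\d{2}\.\d{2}\.\d{4})\s+(?P<text>.+?)\s+(?P<amount>-?\d+\.\d{2})$"
)


def parse_line(line: str) -> Optional[Tuple[date, str, Decimal]]:
    """Return (date, description, amount), or None if the line is not a transaction."""
    match = TRANSACTION_RE.match(line.strip())
    if match is None:
        return None
    # Convert the matched strings to proper data types.
    booked = datetime.strptime(match["date"], "%d.%m.%Y").date()
    return booked, match["text"], Decimal(match["amount"])
```

I use `Decimal` rather than `float` here so that amounts are not subject to binary rounding.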
After some problem solving, I ended up with something like this in my main function:
transactions = []
num_of_pages = len(pages)
for page_number, page in enumerate(pages):
    text_in_lines = extract_page_to_text(page_number, page, num_of_pages)
    matches = match_dates(text_in_lines)
    groups = assume_groups(matches)
    curr_transactions = create_transactions(text_in_lines, groups)
    transactions.extend(curr_transactions)
This code handles a single PDF (so it would be wrapped in another for loop), and I haven't done any analysis with the data yet. My issue is that I quickly start to lose any structure if I just chain those functions together.
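As a rough sketch of that outer loop, the per-PDF steps could live in one function that the loop over all PDFs calls. The four step functions are the ones from my snippet, passed in as a tuple here so the sketch runs on its own; `transactions_from_pages`, `transactions_from_pdfs` and the `steps` parameter are names I made up, and I assume each PDF is already available as a list of pages.

```python
def transactions_from_pages(pages, steps):
    """Run the four per-page steps over every page of one PDF."""
    extract_page_to_text, match_dates, assume_groups, create_transactions = steps
    num_of_pages = len(pages)
    transactions = []
    for page_number, page in enumerate(pages):
        text_in_lines = extract_page_to_text(page_number, page, num_of_pages)
        matches = match_dates(text_in_lines)
        groups = assume_groups(matches)
        transactions.extend(create_transactions(text_in_lines, groups))
    return transactions


def transactions_from_pdfs(pdfs, steps):
    """Outer loop: collect the transactions of every PDF into one list."""
    all_transactions = []
    for pages in pdfs:  # assumption: each PDF is a list of pages
        all_transactions.extend(transactions_from_pages(pages, steps))
    return all_transactions
```

This keeps the main function short, but it is still the same chain of calls, just moved one level down.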
So my thought is that there should be a design pattern for this kind of problem, right? Long chains of functions with a few for loops and not many conditionals should be easy to structure, in theory.
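To illustrate the kind of structure I have in mind, a chain of one-argument steps could be written as a list and applied in order. This is only a sketch: `pipeline` and the toy steps are made-up names, and my real steps take more than one argument, so they would not fit this shape without some adapting.

```python
from functools import reduce


def pipeline(*steps):
    """Return a function that feeds its input through each step in order."""
    def run(value):
        return reduce(lambda acc, step: step(acc), steps, value)
    return run


# Toy steps standing in for the real processing functions (hypothetical).
def split_lines(text):
    return text.splitlines()


def drop_blank(lines):
    return [line for line in lines if line.strip()]


# Split the text into lines, drop blank lines, then count what is left.
process = pipeline(split_lines, drop_blank, len)
```

Calling `process(text)` then reads like the list of steps itself, which is the readability I am missing in my main function.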
One refactoring I thought of was to split the functions up into classes: a PDF class, a Page class and a Transaction class. Is OOP the way to go, even if it just means gathering a bunch of functions in a class and then calling them from inside the class? But this opens up new questions for me: how do I structure those classes? Do they exist in parallel, so that in the end it's the same code but I call methods instead of functions? Or do I nest them inside each other, something like this:
class Pdf:
    def __init__(self, pdf):
        self.pdf = pdf
        self.pages: list[Page] = []  # start with an empty list so append() works

    def get_page_content(self):
        # do something with pdf to extract content
        page_contents = []  # placeholder for the extracted page contents
        for page_content in page_contents:
            self.pages.append(Page(page_content))


class Page:
    def __init__(self, page_content):
        self.page_content = page_content
        self.transactions: list[Transaction] = []

    def get_transactions(self):
        # do something with page to extract transactions
        raw_transactions = []  # placeholder for the extracted raw transactions
        for transaction_raw in raw_transactions:
            self.transactions.append(Transaction(transaction_raw))


class Transaction:
    def __init__(self, transaction_raw):
        self.transaction_raw = transaction_raw

    def some_further_processing(self):
        ...
This looks better in my opinion, but I'm nesting classes three levels deep, and it seems like a lot of OOP just to chain some "simple" functions together.
You may have noticed that I'm kind of lost on this topic, so I'm looking forward to any kind of feedback :)