
A recurring pattern I see in my code is chaining together a lot of functions, the result of a large number of processing steps needed for a given task. This could be, for example, data processing and visualization, or my most recent case:

For some financial analysis I want to read in a couple of PDFs which are bank statements, extract the text, get transactions by using regular expressions, and do some conversions from strings to the proper data types. After that, I would analyze those transactions in some way.

After some problem solving, I ended up with something like this in my main function:

transactions = []
for page_number, page in enumerate(pages):
    num_of_pages = len(pages)
    text_in_lines = extract_page_to_text(page_number, page, num_of_pages)
    matches = match_dates(text_in_lines)
    groups = assume_groups(matches)
    curr_transactions = create_transactions(text_in_lines, groups)
    transactions.extend(curr_transactions)

This code is for a single PDF (so it would be wrapped in another for loop), and I haven't done any analysis with the data yet. My issue is that I pretty quickly start to lose any structure if I just chain those functions together.

So my thought is: there should be a design pattern for this kind of problem, right? Long chains of functions with some for loops and not many conditionals should be easy to structure, in theory?

One refactoring possibility I thought of was to split the functions up into classes: a PDF class, a Page class and a Transaction class. Is OOP the way to go, even if it just means gathering a bunch of functions in a class and then calling them from inside the class? But this opens up new questions for me again: how do I structure those classes? Do they exist in parallel, so that in the end it's the same thing but I call methods instead of functions? Or do I nest them inside of each other, something like this:

class Pdf:
    def __init__(self, pdf):
        self.pdf = pdf
        self.pages: list[Page] = []

    def get_page_content(self):
        # do something with self.pdf to extract the content of each page
        for page_content in extracted_page_contents:
            self.pages.append(Page(page_content))

class Page:
    def __init__(self, page_content):
        self.page_content = page_content
        self.transactions: list[Transaction] = []

    def get_transactions(self):
        # do something with self.page_content to extract raw transactions
        for transaction_raw in extracted_transactions:
            self.transactions.append(Transaction(transaction_raw))

class Transaction:
    def __init__(self, transaction_raw):
        self.transaction_raw = transaction_raw

    def some_further_processing(self):
        ...

This looks better in my opinion, but I'm nesting classes three levels deep and it seems like a lot of OOP just to chain some "simple" functions together.

You may have noticed that I'm kind of lost on this topic, so I'm looking forward to any kind of feedback :)

  • When writing code we have to deal with inherent or essential complexity – the problem we're solving is complicated – and accidental complexity – our solution has some complexity of its own. The point about inherent complexity is that it can never disappear; we can only shuffle it around. There is nothing inherently wrong with having a fairly large chunk of code that orchestrates things. Here, you have already abstracted over a lot of that complexity by creating functions for the different steps. But abstractions can also introduce excessive complexity, and to me your classes feel like that.
    – amon
    Commented Nov 4, 2023 at 19:42
  • The simplest way to simplify a long chain of functions isn’t better structure. It’s meaningful names. Names that break it up into shorter chains that can be chained together. Commented Nov 4, 2023 at 20:31

1 Answer


The term you're looking for is "processing pipeline".

More generally we might have a DAG of data dependencies, as with make or airflow. But the simplest DAG is just a sequence of processing steps in a straight line.


Unix | pipelines are a powerful means of composing simple functions over text records.

Analogously, in Python we can write generators which yield a sequence of records to subsequent processing stages. They compose quite nicely.
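As a toy illustration (not the question's code; read_records, parse_records, keep_nonempty and the file name are made-up placeholders), each stage below is a generator that consumes the previous stage's records and yields its own:

def read_records(path):
    # stage 1: yield one raw text line per record
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")

def parse_records(lines):
    # stage 2: split each raw line into fields
    for line in lines:
        yield line.split(";")

def keep_nonempty(records):
    # stage 3: drop records that contain no data
    for record in records:
        if any(field.strip() for field in record):
            yield record

# Composition reads like a Unix pipeline; nothing runs until the
# result is actually consumed (here, by building a list).
rows = list(keep_nonempty(parse_records(read_records("statement.txt"))))

Each stage only knows about the records flowing through it, not about where they came from or where they go next.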


Obligatory code review remark:

for page_number, page in enumerate(pages):
    num_of_pages = len(pages)

Better to hoist that constant out of the loop, as it isn't changing:

num_of_pages = len(pages)
for page_number, page in enumerate(pages):

Or even better, save typing a few characters by simply using that expression as the third argument, since it's cheap to compute and needn't be cached.
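That is, roughly (reusing the question's extract_page_to_text):

for page_number, page in enumerate(pages):
    text_in_lines = extract_page_to_text(page_number, page, len(pages))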


Your pipeline is already well structured.

Notice that once we have extracted a list of text lines, we no longer refer to page. This suggests a cleavage point where we might break out a stage of the pipeline.

So stage1 iterates over pages and yields text line lists. And stage2 iterates over such lists to come up with transactions, which can then be conveniently collected by a list comprehension. Or those records might be sent onward to a third stage of processing.
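A sketch of what those two stages might look like as generators, reusing the question's helper functions (the stage names pages_to_lines and lines_to_transactions are made up here):

def pages_to_lines(pages):
    # stage 1: yield one list of text lines per page
    for page_number, page in enumerate(pages):
        yield extract_page_to_text(page_number, page, len(pages))

def lines_to_transactions(lines_per_page):
    # stage 2: yield the transactions found on each page
    for text_in_lines in lines_per_page:
        matches = match_dates(text_in_lines)
        groups = assume_groups(matches)
        yield from create_transactions(text_in_lines, groups)

# Collect everything at the end of the pipeline.
transactions = [t for t in lines_to_transactions(pages_to_lines(pages))]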

What does that accomplish? It reduces coupling, letting local temp vars go out of scope so we've fewer things to worry about at any given stage. And the clearly defined interfaces make it easier to create unit tests for individual stages in the middle of the pipeline.
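For instance, assuming the helpers yield nothing for a page without matches, a middle stage such as lines_to_transactions from the sketch above can be exercised with a small in-memory input, without touching a PDF at all:

def test_lines_to_transactions_ignores_empty_pages():
    # assumption: a page with no lines produces no transactions
    assert list(lines_to_transactions([[]])) == []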

  • I was going to write something similar as an answer, so let me just add here some extra information I would have given. Modern Python has got a lot of enhancements for generators, some of which make it unnecessary to use yield. For example, one could write pages_with_numbering = ((page_number, page, len(pages)) for page_number, page in enumerate(pages)) in one line and get a generator which includes the number of pages. I also found this page where programming with iterator chains is explained by example.
    – Doc Brown
    Commented Nov 5, 2023 at 8:21
  • ... another point I would highlight is that designing the code this way moves the control structures (the outer for loop) into the individual steps of the pipeline. So the final composition of the pipeline is done only by chaining, without any control logic, which makes it unnecessary to write a unit test with complex mocks for the composing code. One only needs unit tests for each step, and an integration test for the composed pipeline.
    – Doc Brown
    Commented Nov 5, 2023 at 8:31
  • Thank you to both of you for your thoughts! I'll try to summarize what both of you said, please correct me if I got something wrong: In a processing pipeline I want to make control structures part of the functions/processing steps. For that, I use generator functions/expressions at every point where I would normally use a for loop. This reduces the amount of lists which are created and iterated over, moves the execution to the last step, and reduces coupling, which also makes testing easier. As a result, each step of my process (pdf -> pages -> transactions) is a generator func/expression.
    – Jan
    Commented Nov 6, 2023 at 8:21
  • @Jan: sounds good. Not sure what you mean by "it reduces the amount of lists", but the design avoids entangling an outer for loop with the function calls. It separates strictly between "integration functions" and "operational functions", which is also known as IOSP.
    – Doc Brown
    Commented Nov 6, 2023 at 15:05
