2

We're working an agile project and designing as we go on a new python commandline app / systemd service for some fancy in-house project. Right now, we're supposed to be adding an observability / logging layer to the program. According to the product owner and the guy on our team that knows the most about observability (not me) we're supposed to send to an OTEL collector basically every function call we do so we can 100% recreate what the program did.

This is supposed to create hierarchical spans of work so we can use a dashboard to see what it did (I'll basically look like a gantt chart / timeline of all the work so we can see bottlenecks or whatever).

I'm of the opinion that this is rather insane and if I log every function call I'll ruin my app's performance and denude the program of any semblance of thread safety (not that that has been a concern thus far).

Furthermore, it seems like, in order to create all these child spans for OTEL collectors you need to have some global variables watching everything (which also makes multithreading a bigger chore and muddies up the whole program)

I'm not opposed to a global logger, but since that doesn't really keep any state, it doesn't hurt too much, what kinds of patterns do people use when they're adding observability to python code that doesn't either:

  • add extra, unnecessary variables to every function
  • add shakey and unsafe read/write global variables
  • slow down the program to a screeching halt on the network IO.

Right now, I've got a pretty good rating with the linter and the complexity of the program is low. It seems like a violently abhorrent tradeoff to add observability into a relatively simple program just so managers can tick a box - but I don't want to get into the philosophy of this, just figure out a way to do it without making the program look like crap.

2
  • I'm not familiar with what you call observability in a programming context. Are you sure they aren't looking for monitoring or profiling? Commented Apr 14, 2024 at 1:19
  • @GregBurghardt it's a moderately standard term in the industry these days, see e.g. What is observability? Not just logs, metrics and traces (an obviously somewhat biased source) Commented Apr 14, 2024 at 21:02

1 Answer 1

5

threading

You made some remarks about threading, for a GIL-bound python program, which seemed surprising. Is this really a flask app you're describing? The context matters.

In any event, create per-thread datastructures and log to those. Then you have the simpler problem of gathering the logs once per second or whatever, and sending them to your central monitoring server. The familiar queue is one candidate, but certainly there are others.

intrusiveness

add extra, unnecessary variables to every function

Right, we don't want to uglify everything in the codebase.

We might decorate each function with @observe, which logs the input args and the {result, exception raised}. But that's tedious, especially for two-line helpers. Probably better to create a function which, given a module, will iterate through its functions and decorate them en masse.

Hopefully much of your code is methods within classes. So define a class decorator which will iterate over relevant methods to instrument them. An excellent example of this technique can be found in beartype. It's very practical.

latency

slow down the program to a screeching halt on network IO

Your instrumentation should not be awaiting network roundtrips. Append to a list or other datastructure, and worry about flushing it to the network every few seconds as a separate problem. Then you have throughput concerns, not latency concerns.

efficiency

Take a look at tqdm. It logs events, but does a very good job of suppressing "too many logs" when there's an event arriving every millisecond. Also take a look at the cProfile module, which does accounting on number of calls, time spent in self, and total time for self + children.

log every function call I'll ruin my app's performance

That is a cogent concern. So pay attention to measures like "elapsed time" and "CPU seconds", in order to suppress log records of trivial helpers. If it turns out that a helper is always uninteresting (sub-millisecond elapsed times), then monkey patch it to un-instrument it.

GUIDs

You didn't mention IDs, so I will just make this explicit.

I would expect each unit-of-work to be tagged with a GUID, which follows it through the stack. And then each (large enough) sub-unit of work might get its own GUID, associated with the parent GUID.

1
  • It's not a flask app (yet) - we might be asked to convert the guys of it into something that can run via fast API and I don't want to back myself into a corner there. This app pretty much has to run as a singleton because its modifying the OS itself (firewalls and networking), so I have a global lock when it is started. What I might need to do, inside the program, is create threads to process some of the interfaces and firewall rules. Either way, I don't want to make every function in the program not thread safe just to justify paying for Dynatrace. Commented Apr 14, 2024 at 23:05

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.