But could this OOP be a disadvantage for software where performance matters, i.e. how fast the program executes?
Often, yes! But...
In other words, could many references between many different objects, or using many methods from many classes, result in a "heavy" implementation?
Not necessarily. This depends on the language/compiler. For example, an optimizing C++ compiler, provided that you don't use virtual functions, will often squash down your object overhead to zero. You can do things like write a wrapper over an int there, or a scoped smart pointer over a plain old pointer, which performs just as fast as using these plain old data types directly.
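As a rough sketch of the kind of zero-overhead wrapper meant here (the Meters name and its tiny interface are made up for illustration, not something from the text above):
class Meters
{
public:
    // A trivial wrapper over an int. With no virtual functions and no extra
    // state, an optimizing compiler will typically generate the same code
    // for this as for a raw int.
    explicit Meters(int value): value(value) {}

    Meters operator+(Meters other) const { return Meters(value + other.value); }
    int get() const { return value; }

private:
    int value; // same size and alignment as a plain int on typical implementations
};
On typical implementations sizeof(Meters) == sizeof(int), and the arithmetic above compiles down to the same instructions as plain int arithmetic once optimizations are on.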
In other languages like Java, there is a bit of an overhead to an object (often quite small in many cases, but astronomical in some rare cases with really teeny objects). For example, Integer there is considerably less efficient than int (takes 16 bytes as opposed to 4 on 64-bit). Yet this isn't just blatant waste or anything of that sort. In exchange, Java offers things like reflection on every single user-defined type uniformly, as well as the ability to override any function not marked as final.
Yet let's take the best-case scenario: the optimizing C++ compiler which can reduce object interfaces down to zero overhead. Even then, OOP will often degrade performance and keep it from reaching its peak. That might sound like a complete paradox: how could that be? The problem lies in:
Interface Design and Encapsulation
The problem is that even when a compiler can squash an object's structure down to zero overhead (which is at least very often true for optimizing C++ compilers), the encapsulation and interface design (and dependencies accumulated) of fine-grained objects will often prevent the most efficient data representation for objects that are intended to be aggregated by the masses (which is often the case for performance-critical software).
Take this example:
class Particle
{
public:
    ...

private:
    double birth;   // 8 bytes
    float x;        // 4 bytes
    float y;        // 4 bytes
    float z;        // 4 bytes
    /*padding*/     // 4 bytes of padding
};
Particle particles[1000000]; // 1mil particles (~24 megs)
Let's say our memory access pattern is to simply loop through these particles sequentially and move them around each frame repeatedly, bouncing them off the corners of the screen and then rendering the result.
Already we can see a glaring 4-byte padding overhead required to align the birth member properly when particles are aggregated contiguously. Already ~16.7% of the memory is wasted with dead space used for alignment.
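If in doubt, this is easy to check directly against the Particle definition above (a minimal sketch; the exact layout is implementation-defined, but this is the typical result on a 64-bit target):
// On a typical 64-bit target, the 8-byte double forces 8-byte alignment for
// the whole struct, so the 20 bytes of fields get padded out to 24.
static_assert(sizeof(Particle) == 24, "expected 4 bytes of tail padding");
static_assert(alignof(Particle) == 8, "alignment is driven by the double member");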
This might seem moot because we have gigabytes of DRAM these days. Yet even the most beastly machines we have today often only have a mere 8 megabytes when it comes to the slowest and biggest region of the CPU cache (L3). The less we can fit in there, the more we pay for it in terms of repeated DRAM access, and the slower things get. Suddenly, wasting 16.7% of memory no longer seems like a trivial deal.
We can easily eliminate this overhead without any impact on field alignment:
class Particle
{
public:
    ...

private:
    float x;    // 4 bytes
    float y;    // 4 bytes
    float z;    // 4 bytes
};
Particle particles[1000000];    // 1mil particles (~12 megs)
double particle_birth[1000000]; // 1mil particle births (~8 megs)
Now we've reduced the memory from 24 megs to 20 megs. With a sequential access pattern, the machine will now consume this data quite a bit faster.
But let's look at this birth field a bit more closely. Let's say it records the starting time when a particle is born (created). Imagine the field is only accessed when a particle is first created, and every 10 seconds to see if a particle should die and become reborn in a random location on the screen. In that case, birth is a cold field. It's not accessed in our performance-critical loops.
As a result, the performance-critical data is no longer 20 megabytes but a 12-megabyte contiguous block. The hot memory we access frequently has shrunk to half its original size! Expect significant speed-ups over our original 24-megabyte solution (it doesn't need to be measured -- I've done this kind of thing a thousand times, but feel free to measure if in doubt).
Yet notice what we did here. We completely broke the encapsulation of this particle object. Its state is now split between a Particle type's private fields and a separate, parallel array. And that's where granular object-oriented design gets in the way.
We can't express the optimal data representation when confined to the interface design of a single, very granular object like a single particle, a single pixel, even a single 4-component vector, possibly even a single "creature" object in a game, etc. A cheetah's speed will be wasted if it's standing on a teeny island that's 2 sq. meters, and that's what very granular object-oriented design often does in terms of performance: it confines the data to a sub-optimal representation.
To take this further, let's say that since we're just moving particles around, we can actually access their x/y/z fields in three separate loops. In that case, we can benefit from SoA-style SIMD intrinsics with AVX registers, which can vectorize 8 single-precision floating-point (SPFP) operations in parallel. But to do this, we must now use this representation:
float particle_x[1000000]; // 1mil particle X positions (~4 megs)
float particle_y[1000000]; // 1mil particle Y positions (~4 megs)
float particle_z[1000000]; // 1mil particle Z positions (~4 megs)
double particle_birth[1000000]; // 1mil particle births (~8 megs)
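To make that concrete, here is a minimal sketch of what such a vectorized loop could look like with AVX intrinsics. The particle_vx velocity array is hypothetical (it isn't part of the arrays above), and the sketch assumes 32-byte-aligned storage and a count that's a multiple of 8:
#include <immintrin.h>

// Hypothetical sketch: advance the X positions by their velocities,
// 8 single-precision floats at a time with AVX.
void advance_x(float* particle_x, const float* particle_vx, int count, float dt)
{
    __m256 vdt = _mm256_set1_ps(dt);
    for (int i = 0; i < count; i += 8)
    {
        __m256 x  = _mm256_load_ps(particle_x + i);   // aligned load of 8 floats
        __m256 vx = _mm256_load_ps(particle_vx + i);
        x = _mm256_add_ps(x, _mm256_mul_ps(vx, vdt)); // x += vx * dt, 8 lanes at a time
        _mm256_store_ps(particle_x + i, x);           // aligned store back
    }
}
The y and z loops look identical. With the interleaved AoS layout, we'd have to shuffle fields in and out of registers instead of doing straight aligned loads and stores.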
Now we're flying with the particle simulation, but look what happened to our particle design. It has been completely demolished, and we're now looking at 4 parallel arrays and no object to aggregate them whatsoever. Our object-oriented Particle design has gone sayonara.
This has happened to me many times working in performance-critical fields where users demand speed, with correctness being the only thing they demand more. These teeny little object-oriented designs had to be demolished, and the cascading breakages often forced us into a slow deprecation strategy towards the faster design.
Solution
The above scenario only presents a problem with granular object-oriented designs. In those cases, we often end up having to demolish the structure in order to express more efficient representations: SoA layouts, hot/cold field splitting, padding reduction for sequential access patterns (padding is sometimes helpful for performance with random-access patterns in AoS cases, but almost always a hindrance for sequential access patterns), and so on.
Yet we can take that final representation we settled on and still model an object-oriented interface:
// Represents a collection of particles.
class ParticleSystem
{
public:
    ...

private:
    double particle_birth[1000000]; // 1mil particle births (~8 megs)
    float particle_x[1000000];      // 1mil particle X positions (~4 megs)
    float particle_y[1000000];      // 1mil particle Y positions (~4 megs)
    float particle_z[1000000];      // 1mil particle Z positions (~4 megs)
};
Now we're good. We can get all the object-oriented goodies we like. The cheetah has a whole country to run across as fast as it can. Our interface designs no longer trap us into a bottleneck corner.
ParticleSystem can potentially even be abstract and use virtual functions. It's moot now: we're paying for the overhead at the collection-of-particles level instead of per particle. The overhead is 1/1,000,000th of what it would be if we were modeling objects at the individual particle level.
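As a rough sketch of what that can look like (this is a made-up variant of the ParticleSystem above, and in practice you'd likely heap-allocate or use std::vector for arrays this large):
class IParticleSystem
{
public:
    virtual ~IParticleSystem() = default;

    // One virtual call per frame for the entire collection of particles.
    virtual void update(float dt) = 0;
};

class ParticleSystem: public IParticleSystem
{
public:
    void update(float dt) override
    {
        // The hot loop touches only the contiguous position arrays;
        // the cold particle_birth array isn't touched here at all.
        for (int i = 0; i < count; ++i)
            particle_x[i] += 10.0f * dt; // purely illustrative drift
    }

private:
    static const int count = 1000000;
    float particle_x[count];
    float particle_y[count];
    float particle_z[count];
    double particle_birth[count];
};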
So that's the solution in true performance-critical areas that handle a heavy load, and it applies to all kinds of programming languages (this technique benefits C, C++, Python, Java, JavaScript, Lua, Swift, etc.). And it can't easily be labeled "premature optimization", since this relates to interface design and architecture. We can't write a codebase modeling a single particle as an object, with a boatload of client dependencies on a Particle's public interface, and then easily change our minds later. I've had to do exactly that when called in to optimize legacy codebases, and it can end up taking months of carefully rewriting tens of thousands of lines of code to use the bulkier design. Ideally this affects how we design things upfront, provided we can anticipate a heavy load.
I keep echoing this answer in some form or another in many performance questions, and especially ones that relate to object-oriented design. Object-oriented design can still be compatible with the highest-demand performance needs, but we have to change the way we think about it a little bit. We have to give that cheetah some room to run as fast as it can, and that's often impossible if we design teeny little objects that barely store any state.