5

I'm working on code which will provide a generic C++ interface to relational datasets. Some of the data will be in tables in memory (e.g. std::set or std::vector), but some of it will be computed - either produced by a function computing results of a logic program, or by joining memory tables. Other datasets will be streamed in from disk from SQLite queries, but will be tableized (cached) as they are read.

I've found whitepapers on similar systems, only in functional languages, and they tend to be based on Streams. I want to implement a similar system, but base it on concepts familiar to and interoperable with the C++ world. I had an epiphany last night - "not streams; iterators!" (Being a generic library, mine will give the user the choice what template policy to use to tableize with: a std::set, a std::vector, perhaps even to a fixed buffer accessed by pointers. The thing these all have in common is they can be abstracted using iterators.)

However another thought gave me pause: I'm not sure if I can articulate the differences between (C++) streams and (STL) iterators. I know each when I see it, but I also know one can be converted to the other with a wrapper object. I have an intuition that the iterator interface will provide opportunities to make better optimizations, but I am interested in what is the conceptual or API difference between the two. In other words, at what point is something actually an iterator and not just a stream-wrapped-in-an-iterator, and at what point is something actually a stream and not just an iterator-being-streamed?

(As an example, suppose we have one memory table, one SQLite query object, and a function computing a logic predicate on their cross product. The table is a std::vector accessed via an STL iterator. The SQLite query is a proxy iterator that will stream rows from disk when advanced, but return rows from cache if decremented/reset. The logic query iterator will compute zero to many results based on the value from dereferencing each of the two sources. A super-iterator representing the whole thing, when advanced, would ++ or reset each of its constituent iterators to compute all permutations. Is this last thing an iterator, or a stream? On the one hand it is streaming results of a complex orchestration of parts, but from another point of view it is performing a very similar operation to a custom iterator that would scan each cell of an array int[M][N][K], which is clearly an iterator.)

3
  • It's an iterator when you can pass decltype(it) to std::iterator_traits and a stream when you can << or >> values into / out of it. I.e. they can be interchangeable
    – Caleth
    Commented Apr 27, 2017 at 14:44
  • For your information,check this out first stackoverflow.com/questions/26824281/…
    – quintumnia
    Commented Apr 27, 2017 at 15:12
  • I doubt that use of iterators could ever be optimized to a point where they were more efficient than the equivalent streams. Generally, streams return values as soon as they are computed, as opposed to iterating through the entire collection before returning anything.
    – Nate T
    Commented Jun 5, 2021 at 6:42

2 Answers 2

3

Take a look at the documentation - specifically the category table and the stream iterators section.

The stream iterators all belong to the InputIterator or OutputIterator categories (only the first is relevant for your case). We can approximately say that input streams and InputIterators are equivalently powerful/expressive, but IMO streams have a lot of extra baggage.

In favour of iterators:

  • allow range-base loops like for (auto &record : query_result)
  • support standard algorithms like find_if, transform, etc.
  • allow more flexible traversal if you want and can support it (reverse iteration, multiple pass, random-access)

Against streams:

  • they're usually used for generalized formatted I/O in C++ (so there's a possible expectation mismatch)
  • detecting end-of-input stream is much uglier than detecting end-of-iterator range.

    See the plethora of while(!cin.eof()) questions on SO for reference: you actually have to try reading in the next record, and then remember to check whether it succeeded before you can use it.

    If the only sane way to do this is wrapping it in an istream_iterator<MyComplexRecord> anyway, you could just write the iterator.

3

I hope it's not poor form to answer my own question, but upon trying to implement a LogicProgramAnswersetIterator, I ran into some problems that lead me to think either I would have to go to "heroic" measures to pretend that my objects are iterable, or else admit that iterator is not a good match for this use case. I think that the reasons why that is the case illuminate differences between streams and iterators; the objects I'm working with are either "actually streams" or "something else", but they are not iterators.

tl;dr: Streams are different from iterators in one or both of two of the following aspects. First, an iterator to N results is post-advanced N times to consume all results, but a stream pointing to N results is pre-advanced (i.e. something is fetched) N+1 times to consume all results. Second, most iterators are ephemeral pointers to persistent state, and most streams are persistent pointers to ephemeral state.

The first problem I ran into is that there's nothing (no safe thing) to point at to return a real reference from operator *(). That leaves me needing to return a proxy object that acts like a reference, a la std::vector<bool>. This is not a showstopper, but it does mean that many of the limitations/surprises mentioned in articles about std::vector<bool> necessarily apply here too.

The second problem I ran into is that my objects require a "force the first answer" operation before you can get the first valid value, and there's no good place to put this call. It's potentially expensive, and mutates either the iterator, or the thing the iterator points to, or both. If the "force an answer" operation could be confined to the pre-increment operator ++(int) it could be justifiable, but unfortunately this must be done before returning from the dereference operator *(). We are left with a few choices:

  • Expect the user to call ++iter first, then call *iter after to get the 0th item.
    • This is just not how iterators work.
  • Force the first result in the .begin() method before constructing the iterator.
    • Violates expectations because a call like "do_only_if(false_condition, src.begin(), src.end())" will actually execute the logic program at least once.
    • (And what happens if you call .begin() twice?)
  • Force the first result in the iterator constructor itself.
    • Has all the problems of doing it in .begin().
  • Force the first result in operator *().
    • Requires a needs_eval flag. This flag has to be a mutable member variable, because it needs to be updated in const versions of the member functions.
    • The flag check logic needs to be duplicated in operator ++(int) in case the user didn't dereference the iterator.
    • Worse / most astonishingly, this flag check and force result needs to be done in operator ==()! (Because although you can easily "fake a failure" to create a "not-successful" iterator return from the .end() function, you can't know that an unevaluated iterator is going to fail (and thus, equals .end()) until you force the first result.)

Wait a minute! Isn't the std::istream_iterator wrapper class subject to all of these same design constraints? Indeed, it is, and the caveats required to meet them in its implementation lead Eric Niebler in his blog post to advise: "Naked input iterators like std::istream_iterator that don’t “belong” to a range have serious problems." [Because converting the istream to an iterator eagerly fetches a value, and calling .begin() twice actually advances the underlying stream twice.]

I think the last sentence of Useless's answer, "If the only sane way to do this is wrapping it in an istream_iterator<MyComplexRecord> anyway, you could just write the iterator," could be charitably taken as an answer in its inverse: "If the only sane way [to make this into an iterator] is wrapping it in an istream_iterator, [then maybe it's really a stream]."

The nature of the incompatibility between streams and iterators is something I've decided to call the "fence-post mismatch problem", in analogy to the class of bugs called "fence-post errors". A "fence-post error" is a logic error committed when failing to account for the fact that a fence with N sections has N+1 posts. The "fence-post mismatch problem" is one of:

  • The difficulty arising when trying to adapt an object for which N results is associated with N+1 advancements, onto an interface for which N results is associated N advancements. Graphically: |--|--| onto |--|--
  • The difficulty arising when trying to adapt an object with results-between-advancements onto an interface designed for advancements-between-results. Graphically: |--|--| onto --|--|--.
  • A combination of these two things. Graphically: |--|--| onto --|--|

The canonical model of an iterator range in C++ is an array where begin points to the first element of the array and is dereferenceable, and end points to the one past the end of the array and is not dereferenceable. Since begin starts out already pointing at a valid element (provided the array was not empty), the number of elements is N and the number of advancements to the end is also N.

Contrast this with the result set from a logic expression evaluation: the initial state of the result, before the first answer is forced, is not dereferencable. The result that marks the end, "failure", is also not dereferenceable. So the answer set is book-ended by advancement operations on both sides, like posts on a fence, therefore N results are associated with N+1 advancements.

Clearly this situation satisfies the first case of the fence-post mismatch problem.

The second case I defined for the fence-post mismatch problem is more subtle; but it is not just word-play - even if it sounds like a trivial difference of (English language) semantics, the question of whether the iterator is advancing between results or the result is moving between advancements has a concrete implementation in the program operational semantics.

Observe that in the case of iterators pointing into an array, the results are part of a larger persistent object and previous results still exist when the iterator is advanced. This is advancements-between-results because the results make up a fixed environment and it is the advancement operation which is moving between results.

In contrast the case of "iterating" through the results of a logic program has no actualized collection in memory. The computational mechanism ticking and whirring out results is the only persistent object, and the result is a temporary read off of the current state. As we compute the next answer, the states of parts of this machine reconfigure into a new result - therefore it is the result which is moving between advancements. This is results-between-advancements.

The promised simple explanation in operational semantics is this: for advancements-between-results we can return a reference (or reference proxy) to the current result and then advance the iterator without affecting the previously returned result and without making a copy of the result. For results-between-advancements we must violate at least one of the three conditions: either we have to return results by value (not reference), or we are returning a reference to an internal variable that is going to get overwritten by advancing, or else we are copying results somewhere.

This case of which-is-between-what differs from the case of "N vs N+1" because the latter concerns "when" and "how many times" the iterator is/must be advanced, and the former concerns "what happens when you dereference the iterator." Both, however, are amenable to the metaphor of fences and fence-posts.

While this answer as a whole may have been a roundabout and long-winded alternative to the observation from this Boost page: that standard iterator categories unfortunately conflate the two orthogonal traits of traversal and dereference, I want to close by talking about why I think that the fence-post problem pushes things across the line from "it's tricky to make an adapter" to "probably indicates these things should not be treated the same."

The cases of the fence-post mismatch problem involve two things that C++ is not inherently designed to allow code to be agnostic to: when (or how many times) something happens, and who owns the memory to which those things are happening. That is, it's not that you can't make it all happen automatically (by going to heroic measures), it's just that as a language C++ expects the user to have an intent in mind for these things, and that's inherently not compatible with shoehorning objects into an API which masks these details. Without garbage collection you can't just say "I don't care where the memory comes from or what happens to it later" and without non-strict evaluation with memoization you can't just say "I don't care when / how many times this code is evaluated".

In conclusion: iterators and streams are different things because they have different fence-post attributes. While it is possible to adapt one to the other, ignoring the differences can lead to astonishing results. In C++ in particular, one should consider carefully whether it is worthwhile to try to force one into the interface of the other, because hiding the differences requires making assumptions about things that C++ expects the user to make conscious choices about.

1
  • As a (very late) note, it's fine both to answer your own question, and to accept your own answer. However, your very interesting essay on the semantic differences between streams and iterators isn't very clear about what exactly the issue is with your motivating example. A solution to "forcing the first answer in begin()" isn't necessarily to avoid begin, but to separate your Query from it's iterable QueryResult which can fetch the first result at construction time.
    – Useless
    Commented Oct 13, 2017 at 15:21

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.