Index, Count, Offset, Size

146

iingve 3 days ago 70 commentsRead Article on tigerbeetle.com

Discussion (70 Comments)

jp1016•about 6 hours ago

Banning "length" from the codebase and splitting the concept into count vs size is one of those things that sounds pedantic until you've spent an hour debugging an off-by-one in serialization code where someone mixed up "number of elements" and "number of bytes." After that you become a true believer.

The big-endian naming convention (source_index, target_index instead of index_source, index_target) is also interesting. It means related variables sort together lexicographically, which helps with grep and IDE autocomplete. Small thing but it adds up when you're reading unfamiliar code.

One thing I'd add: this convention is especially valuable during code review. When every variable that represents a byte quantity ends in _size and every item count ends in _count, a reviewer can spot dimensional mismatches almost mechanically without having to load the full algorithm into their head.

maleldil•about 6 hours ago

Big-endian naming is great. I've adopted it since I first read it about it in matklad's blog.

akst•about 5 hours ago

Have you got a link to this blog post?

dd82•about 4 hours ago

not sure about a post, but have https://github.com/tigerbeetle/tigerbeetle/blob/main/docs/TI... bookmarked

cb321•about 8 hours ago

As @SkiFire correctly observes[^1], off-by-1 problems are more fundamental than 0-based or 1-based indices, but the latter still vary enough that some kind of discrimination is needed.

For many years (decades?) now, I've been using "index" for 0-based and "number" for 1-based as in "column index" for a C/Python style [ix] vs. "column number" for a shell/awk/etc. style $1 $2. Not sure this is the best terminology, but it is nice to have something consistent. E.g., "offset" for 0-based indices means "off" and even the letter "o" in some case becomes "the zero of some range". So, "offset" might be better than "index" for 0-based.

[^1]: https://news.ycombinator.com/item?id=47100056

matklad•about 7 hours ago

Ha! I also use `line_number = line_index + 1` convention!

cb321•about 4 hours ago

:-)

If it helps anyone explain the SkiFire point any better, I like to analogize it to an I-bar cursor vs. a block cursor for text entry. An I-bar is unambiguously "between characters" while a block cursor is not. So, there are questions that arise for block cursors that basically never arise for I-bar cursors. When just looking at an integer like 2 or 3, there is no cursor at all. So, we must instead rely on names/conventions/assumptions with their attendant issues.

To be clear, I liked the SkiFire explanation, but having multiple ways to describe/think about a problem is usually helpful.

throwaway27448•about 7 hours ago

Ordinal is nice because it explicitly starts at 1.

adrian_b•about 6 hours ago

Nit pick: only in few human languages the ordinal numbers start at 1.

In most modern languages, the ordinal numbers start at 2. In most old languages, and also in English, the ordinal numbers start at 3.

The reason for this is the fact that ordinal numbers have been created only recently, a few thousand years ago.

Before that time, there were special words only for certain positions of a sequence, i.e. for the first and for the last element and sometimes also for a few elements adjacent to those.

In English, "first", "second" and "last", are not ordinal numbers, but they are used for the same purpose as ordinal numbers, though more accurately is to say that the ordinal numbers are used for the same purpose with these words, as the ordinal numbers were added later.

The ancient Indo-European languages had a special word for the other element of a pair, i.e. the one that is not the first element of a pair. This word was used for what is now named "second". In late Latin, the original word that meant "the other of a pair" has been replaced with a word meaning "the following", which has been eventually also taken by English through French in the form of "second".

MarkusQ•about 4 hours ago

Meta nit pick: You are conflating linguist's jargon with mathematician's jargon.

In much the same way as physicists co-opted common words (e.g. "work" and "energy") to mean very specific things in technical contexts, both linguists and mathematicians gave "ordinal" a specific meaning in their respective domains. These meanings are similar but different, and your nit pick is mistakenly asserting that one of these has priority over the other.

"Ordinal" in linguistics is a word for a class of words. The words being classified may be old, but the use of "ordinal" to denote them is a comparatively modern coinage, roughly contemporary with the mathematicians usage. Both come from non-technical language describing putting things in an "orderly" row (c.f. cognates such as "public order", "court order", etc.) which did not carry the load you are trying to place on them.

wahern•about 16 hours ago

Relatedly, a survey of array nomenclature was performed for the ISO C committee when choosing the name of the new countof operator: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3469.htm

It was originally proposed as lengthof, but the results of the public poll and the ambiguity convinced the committee to choose countof, instead.

pansa2•about 6 hours ago

The reason many languages prefer `length` to `count`, I think, is that the former is clearly a noun and the latter could be a verb. `length` feels like a simple property of a container whereas `count` could be an algorithm.

`countof` removes the verb possibility - but that means that a preference for `countof` over `lengthof` isn't necessarily a preference for `count` over `length`.

ncruces•about 6 hours ago

But count is more clearly a dimensionless number of elements, and not a size measured in some unit (e.g. bytes).

JSR_FDED•about 13 hours ago

Using the same length of related variable names is definitely a good thing.

Just lining things up neatly helps spot bugs.

It’s the one thing I don’t like about strict formatters, I can no longer use spaces to line things up.

craig552uk•about 11 hours ago

I've never yet seen a linter option for assignment alignment, but would definitely use it if it were available

ivanjermakov•about 10 hours ago

AlignConsecutiveAssignments in clang-format might be the right fit.

https://clang.llvm.org/docs/ClangFormatStyleOptions.html

skydhash•about 9 hours ago

I know prettier can isolate a code section from changes by adding comments. And I think others can too.

stephc_int13•about 4 hours ago

Could not approve more a I use near identical naming convention in my C codebase. Not using the standard C library to avoid inconsistencies and the awful naming habits of that era.

Fraterkes•about 8 hours ago

Is there any reason to not just switch to 1-based indexing if we could? Seems like 0-based indexing really exacerbates off-by-one errors without much benefit

dgrunwald•about 5 hours ago

When accessing individual elements, 0-based and 1-based indexing are basically equally usable (up to personal preference). But this changes for other operations! For example, consider how to specify the index of where to insert in a string. With 0-based indexing, appending is str.insert(str.length(), ...). With 1-based indexing, appending is str.insert(str.length() + 1, ...). Similarly, when it comes to substr()-like operations, 0-based indexing with ranges specified by inclusive start and exclusive end works very nicely, without needing any +1/-1 adjustments. Languages with 1-based indexing tend to use inclusive-end for substr()-like operations instead, but that means empty substrings now are odd special cases. When writing something like a text editor where such operations happen frequently, it's the 1-based indexing that ends up with many more +1/-1 in the codebase than an editor written with 0-based indexing.

SkiFire13•about 8 hours ago

I'm not sure what that has to do with the article, but anyway: https://www.cs.utexas.edu/~EWD/transcriptions/EWD08xx/EWD831...

That said, I'm not sure how 1-based indexing will solve off-by-1 errors. They naturally come from the fencepost problem, i.e. the fact that sometimes we use indexes to indicate elements and sometimes to indicate boundaries between them. Mixing between them in our reasoning ultimately results in off-by-1 issues.

Fraterkes•about 8 hours ago

This is an article that (among other things) talks about off-by-one errors being caused by mixing up index and count (and having to remember to subtract 1 when converting between the two). That's what it has to with it.

adrian_b•about 7 hours ago

If you always use half-open intervals, you never have to subtract 1 from anything.

With half-open intervals, the count of elements is the difference between the interval bounds, adjacent intervals share 1 bound and merging 2 adjacent intervals preserves the extreme bounds.

Any programming problem is simplified when 0-based indexing together with half-open intervals are always used, without exceptions.

The fact that most programmers have been taught when young to use 1-based ordinal numbers and closed intervals is a mental handicap, but normally it is easy to get rid of this, like also getting rid of the mental handicap of having learned to use decimal numbers, when there is no reason to ever use them instead of binary numbers.

SkiFire13•about 6 hours ago

I must have missed that part, my bad

GuB-42•about 5 hours ago

Because it is not how computers work. It doesn't matter much for high level languages like LUA, you rarely manipulate raw bytes and pointers, but in system programming languages like Zig, it matters.

To use the terminology from the article, with 0-based indexing, offset = index * node_size. If it was 1-based, you would have offset = (index - 1) * node_size + 1.

And it became a convention even for high level languages, because no matter what you prefer, inconsistency is even worse. An interesting case is Perl, which, in classic Perl fashion, lets you choose by setting the $[ variable. Most people, even Perl programmers consider it a terrible feature and 0-based indexing is used by default.

pansa2•about 6 hours ago

Fundamentally, CPUs use 0-based addresses. That's unavoidable.

We can't choose to switch to 1-based indexing - either we use 0-based everywhere, or a mixture of 0-based and 1-based. Given the prevalence of off-by-one errors, I think the most important thing is to be consistent.

adrian_b•about 7 hours ago

This is a matter of opinion.

My opinion is that 1-based indexing really exacerbates off-by-one errors, besides requiring a more complex implementation in compilers, which is more bug-prone (with 1-based addressing, the compilers must create and use, in a manner transparent for the programmer, pointers that do not point to the intended object but towards an invalid location before the object, which must never be accessed through the pointer; this is why using 1-based addressing was easier in languages without pointers, like the original FORTRAN, but it would have been more difficult in languages that allow pointers, like C, the difficulty being in avoiding to expose the internal representation of pointers to the programmer).

Off-by-one errors are caused by mixing conventions for expressing indices and ranges.

If you always use a consistent convention, e.g. 0-based indexing together with half-open intervals, where the count of elements equals the difference between the interval bounds, there are no chances for ever making off-by-one errors.

tialaramex•about 8 hours ago

I would bet that in the opposite circumstance you'd say the same thing:

"Is there any reason to not just switch to 0-based indexing if we could? Seems like 1-based indexing really exacerbates off-by-one errors without much benefit"

The problem is that humans make off-by-one errors and not that we're using the wrong indexing system.

Fraterkes•about 8 hours ago

No indexing system is perfect, but one can be better than another. Being able to do array[array.length()] to get the last item is more concise and less error prone than having to add -1 every time.

Programming languages are filled with tiny design choices that don’t completely prevent mistakes (that would be impossible) but do make them less likely.

tialaramex•about 3 hours ago

array[array.length()] is nonsense if the array is empty.

You should prefer a language, like Rust, in which [T]::last is Option<&T> -- that is, we can ask for a reference to the last item, but there might not be one and so we're encouraged to do something about that.

IMNSHO The pit of success you're looking for is best dug with such features and not via fiddling with the index scheme.

adrian_b•about 7 hours ago

Having to use something like array[length] to get the last element demonstrates a defect of that programming language.

There are better programming languages, where you do not need to do what you say.

Some languages, like Ada, have special array attributes for accessing the first and the last elements.

Other languages, like Icon, allow the use of both non-negative indices and of negative indices, where non-negative indices access the array from its first element towards its last element, while negative indices access the array from its last element towards its first element.

I consider that your solution, i.e. using array[length] instead of array[length-1], is much worse. While it scores a point for simplifying this particular expression, it loses points by making other expressions more complex.

There are a lot of better programming languages than the few that due to historical accidents happen to be popular today.

It is sad that the designers of most of the languages that attempt today to replace C and C++ have not done due diligence by studying the history of programming languages before designing a new programming language. Had they done that, they could have avoided repeating the same mistakes of the languages with which they want to compete.

GoblinSlayer•about 6 hours ago

If your design works better in one scenario usually means it works worse in other scenarios, you just shuffled garbage around.

bruce343434•about 8 hours ago

You say "seems like", can you argue/show/prove this?

Fraterkes•about 8 hours ago

I think that many obo errors are caused by common situations where people can mistakenly mix up index and count. You could eliminate a (small) set of those situations with 1-based indexing: accessing items from the ends of arrays/lists.

meindnoch•about 7 hours ago

And in turn you'd introduce off by one errors when people confuse the new 1-based indexes with offsets (which are inherently 0-based).

So yeah, no. People smarter than you have thought about this before.

naasking•about 5 hours ago

> Is there any reason to not just switch to 1-based indexing if we could? Seems like 0-based indexing really exacerbates off-by-one errors without much benefit

You'd just get a different set of off-by-one errors with 1-based indexing.

qouteall•about 15 hours ago

With modern IDE and AI there is no need to save letters in identifier (unless too long). It should be "sizeInBytes" instead of "size". It should be "byteOffset" "elementOffset" instead of "offset".

pveierland•about 14 hours ago

When correctness is important I much prefer having strong types for most primitives, such that the name is focused on describing semantics of the use, and the type on how it is represented:

    struct FileNode {
        parent: NodeIndex<FileNode>,
        content_header_offset: ByteOffset,
        file_size: ByteCount,
    }

Where `parent` can then only be used to index a container of `FileNode` values via the `std::ops::Index` trait.

Strong typing of primitives also help prevent bugs like mixing up parameter ordering etc.

kqr•about 13 hours ago

I agree. Including the unit in the name is a form of Hungarian notation; useful when the language doesn't support defining custom types, but looks a little silly otherwise.

canucker2016•about 8 hours ago

Depends on what variant of Hungarian you're talking about.

There's Systems Hungarian as used in the Windows header files or Apps Hungarian as used in the Apps division at Microsoft. For Apps Hungarian, see the following URL for a reference - https://idleloop.com/hungarian/

For Apps Hungarian, the variable incorporates the type as well as the intent of the variable - in the Apps Hungarian link from above, these are called qualifiers.

so for the grandparent example, rewritten in C, would be something like:

    struct FileNode {
        FileNode *pfnParent;
        DWORD ibHdrContent;
        DWORD cb;
    }

For Apps Hungarian, one would know that the ibHdrContent and cb fields are the same type 'b'. ib represents an index/offset in bytes - HdrContent is just descriptive, while cb is a count of bytes. The pfnParent field is a pointer to a fn-type with name Parent.

One wouldn't mix an ib with a pfn since the base types don't match (b != fn). But you could mix ibHdrContent and cb since the base types match and presumably in this small struct, they refer to index/offset and count for the FileNode. You'd have only one cb for the FileNode but possibly one or more ibXXXX-related fields if you needed to keep track of that many indices/offsets.

groundzeros2015•about 13 hours ago

Long names become burdensome to read when they are used frequently in the same context

ivanjermakov•about 10 hours ago

When the same name is used a thousand times in a codebase, shorter names start to make sense. See aviation manuals or business documentation, how abbreviation-dense they are.

throwaway2027•about 15 hours ago

Isn't that more tokens though?

post-it•about 14 hours ago

Only until they develop some kind of pre-AI minifier and sourcemap tool.

0x457•about 15 hours ago

Sure you get one or two word extra worth of tokens, but you save a lot more compute and time figuring what exactly this offset is.

Onavo•about 15 hours ago

Not significantly, it's one word.

meindnoch•about 7 hours ago

Tokens are not words.

akdor1154•about 13 hours ago

The 'same length for complementary names' thing is great.

kgwxd•about 3 hours ago

So many arguments could have been avoided if the convention was to use o instead of i in c-like for loops.

navane•about 7 hours ago

I hoped to learn some more excel lookup tactics, alas

donkeybeer•about 6 hours ago

I was thinking sql.

card_zero•about 17 hours ago

I can't read the starts of any lines, the entire page is offset about 100 pixels to the left. :) Best viewed in Lynx?

Flow•about 12 hours ago

Looks perfect here. iOS Safari

zephen•about 15 hours ago

The invariant of index < count, of course, only works when using Djikstra's half-open indexing standard, which seems to have a few very vocal detractors.

tromp•about 11 hours ago

See https://www.cs.utexas.edu/~EWD/transcriptions/EWD08xx/EWD831... for Dijkstra's thoughts on indexing.

GolDDranks•about 15 hours ago

Fortunately only a few. Djikstra's is obviously the most reasonable system.

zephen•about 4 hours ago

Obviously to you and me, but you can see comments right here where others disagree.

And the detractors certainly have momentum in certain segments on their side.

Historically, of course, it was languages like Fortran and COBOL and even Smalltalk, but even today we have MATLAB, R, Lua, Mathematica, and julia.

Big-endian won in network byte order, but lost the CPUs. One-based indexing won in mathematical computing so far, and lost main-stream languages so far, but the julia folks are trying to change that.

dataflow•about 17 hours ago

Is there any other example of "length" meaning "byte length", or is it just Rust just being confusing? I've never seen this elsewhere.

Offset is ordinarily just a difference of two indices. In a container I don't recall seeing it implicitly refer to byte offset.

SabrinaJewson•about 16 hours ago

In general in Rust, “length” refers to “count”. If you view strings as being sequences of Unicode scalar values, then it might seem odd that `str::len` counts bytes, but if you view strings as being a subset of byte slices it makes perfect sense that it gives the number of UTF-8 code units (and it is analoguous to, say, how Javascript uses `.length` to return the number of UTF-16 code units). So I think it depends on perspective.

dataflow•about 13 hours ago

That makes sense, I agree -- seems Rust is on board here too.

AlotOfReading•about 16 hours ago

It's the usual convention for systems programming languages and has been for decades, e.g. strlen() and std::string.length(). Byte length is also just more useful in many cases.

dataflow•about 12 hours ago

No, those are counts by definition, and byte lengths only by coincidence. Look at wcslen() and std::wstring::length().

wyldfire•about 14 hours ago

A length could refer to lots of different units - elements, pages, sectors, blocks, N-aligned bytes, kbytes, characters, etc.

Always good to qualify your identifiers with units IMO (or types that reflect units).

userbinator•about 12 hours ago

Or learn an array language and never worry about indexing or naming ;-)

Everything else looks disgustingly verbose once you get used to them.