Use marker values instead of null (or whatever your language calls it)!

null means data is missing, but provides no semantic meaning for its absence. Conveying why data is missing is critical for data integrity.

None of what I post is "ai" generated. "AI" does not exist and what is being called "ai" vomits up misinformation as facts mixed in with a sprinkling of actual facts to make it extremely harmful to use.

What you read here, I wrote.

What is NULL?

Tony Hoare introduced the concept of the null reference in 1965 while designing the ALGOL W programming language.

A null reference means that a reference variable does not point to any object.

He has explicitly stated that the null reference was included "simply because it was so easy to implement." In the context of designing ALGOL W, it provided a quick and straightforward way to represent the absence of a value.

"I call it my billion-dollar mistake" - Tony Hoare (QCon London 2009)

Granted, in hindsight it has been misused and abused like just about every other feature of every other language or language paradigm. null in and of itself is not “bad”. The problem is all the people who do not understand why they should not use it in modern code. There are plenty of alternatives that were not available in 1965.

Programs written in 1965 were tiny compared to even those of the 1980’s. They had to fit in far smaller amounts of RAM and had to use every bit of memory and every processing cycle for something relevant. null was a convenient way to signal the lack of data cheaply in both time (cycles) and space (bits).

“Your father’s null reference. This is the tool of a diligent programmer. Not as clumsy or random as a raw pointer; an elegant solution for a more civilized age.” - with apologies to Obi Wan Kenobi (and George Lucas)

While the concept of null was a valid solution to a real problem of its time, it is an anachronism that has no place in modern programming languages. Relying on null references in 2025 is just lazy, and it causes more work and harm than avoiding it.

There is no reason to use null in any language

Regardless of the language you are using, there is no reason to use null (or whatever your language calls it). We are not constrained by the memory or processor limitations of 1985. We are also dealing with code bases that are many orders of magnitude larger than they were in 1965.

Avoiding null should be the default in modern languages, but it persists as something we all have to deal with manually in every popular language in use today.

And Optional types, Maybe, or whatever copium you wrap a null in so you can check a bool to see if the reference is null, are no better than just returning null, because they carry no semantic meaning about why nothing was returned.

This Page Left Intentionally Blank

The phrase "this page intentionally left blank" will be familiar to anyone that remembers reading actual physical manuals.

It served a practical purpose rooted in the mechanics of printing and bookbinding. To avoid confusion and prevent readers from thinking that pages were missing, printers included the "intentionally left blank" notice when a page was blank because of layout or formatting.

In important documents, such as legal contracts or technical manuals, a missing page could have serious consequences. The phrase became standard with the rise of mass printing and the need for unambiguous, consistent documentation of the printer's intent.

This Reference Left Intentionally Unassigned

The problem with null is that there is no semantic meaning to it; it just means this variable does not point to any valid data. It tells you there is nothing to read, but nothing more. This is the “mistake” of null. Is there no data because of an error? Is there no data because it was asked for and simply never supplied? Or was it asked for and actively withheld?

Programming is all about unambiguous instructions to be carried out. We need programming languages to be as unambiguous as possible, and we need programmers to use them in as unambiguous a way as possible.

In 1965, large programs were only a few thousand lines of code. 10K lines of code was on the high end of what was even possible given the constraints on available memory. This meant one person could read through the code and figure out why some data was missing, and hopefully a comment would explain why the value was not set.

Null has no default value

The primary misuse of null was as a marker for a default value. This propagated to databases: instead of storing default values, they would be converted to null values to save space, and then converted back to some default value when they were read and found to be missing.

The problem with this is that there is no way to tell whether the value was intentionally set to null in the database or whether it was an error. If it is not an error, what is it supposed to represent downstream?

The source of the data may not even be consistent between what it converts to null when storing values and what it converts null back to when reading them from storage.
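A sketch of how that inconsistency corrupts data, using invented store()/load() helpers to stand in for the writer's and reader's conversion rules:

```javascript
// The writer drops a "default" value down to null to save space; a reader
// written years later guesses a different default back.
function store(record) {
    // writer's rule: 0 is "the default", so store null instead
    return { ...record, score: record.score === 0 ? null : record.score };
}

function load(stored) {
    // reader's rule, decided independently: null means -1
    return { ...stored, score: stored.score === null ? -1 : stored.score };
}

const roundTripped = load(store({ id: 'u1', score: 0 }));
console.log(roundTripped.score); // → -1, the deliberate 0 was silently corrupted
```

The deliberately-stored 0 comes back as -1, and nothing anywhere records that a substitution even happened.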

A null reference to a string is not “”; that would be a valid reference to a string with a length of zero bytes, an empty string without any characters.
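A quick JavaScript illustration of the difference:

```javascript
// '' is a real, valid string whose length happens to be zero;
// null is no string at all.
console.log('' === null);  // → false
console.log(''.length);    // → 0
// ''.length is fine; null.length throws a TypeError
```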

History Lesson

I started writing programs in 6502 assembly language in the early/mid 1980’s. I understand and appreciate the mistakes that we all made when the craft of programming was new. We were all learning; even 20 years after Tony Hoare invented null, we still did not have any reason to think it was a problem. Programs were still written on the same scale in terms of lines of code.

Back then, the semantics of null vs ”” were not important. They were the same in practice, even if not in concept.

In C a string was an array of bytes terminated with a confusingly named null terminating character.

A null character was a byte with all the bits set to ZERO: a zero byte. Almost all numbers that represented data in programs were written in hexadecimal back then, and it was easy enough to convert in your head because nearly every computer was 8 bits. A single byte ranged from 00 to FF. That is 0 to 255 in value.

In C, a string was encoded as an array of one byte per character plus a terminating zero byte to mark the end of the data. “Hello, World!” would be encoded as 48 65 6C 6C 6F 2C 20 57 6F 72 6C 64 21 00 in memory or physical storage. The 00 was the “This Page Left Intentionally Blank” marker saying no more valid data exists past this point.
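The same layout can be sketched in JavaScript with TextEncoder: one byte per ASCII character, plus the terminating zero byte appended by hand.

```javascript
// Encode "Hello, World!" the way a C string lays it out in memory.
const bytes = [...new TextEncoder().encode('Hello, World!'), 0x00];
const hex = bytes
    .map(b => b.toString(16).toUpperCase().padStart(2, '0'))
    .join(' ');
console.log(hex); // → 48 65 6C 6C 6F 2C 20 57 6F 72 6C 64 21 00
```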

In almost every other language since C, the string type is usually represented in memory as a pointer to the array of bytes plus the length of the array, and the terminating null character is not needed to mark the end of the data. Whether you work with ASCII string values or UTF-8 encoded Unicode, storing the pointer to the string data and its length is only slightly more data than storing just the pointer; it just depends on the size of the pointer itself. The space trade-off is marginal compared to the benefit of eliminating entire classes of errors.

Perversions begat Perversions

With the perversion of Alan Kay’s ideas into the “Object Oriented Programming” paradigm of the 1990’s came even more corruption of valid concepts, debased into “billion dollar mistakes” because of a lack of reading comprehension skills or just plain laziness. Remember, these were still pre-internet-brained days; there were no excuses, not that there should be today either.

OOPy programmers immediately saw the flaws and problems with “OO” languages like C++, and instead of fixing the languages, they created contrivances they called design “Patterns”.

These workarounds were so numerous and commonplace that they filled an entire book written by a gang of four people. These “Patterns” were canonized as silver bullet solutions for lazy or incompetent programmers and are treated as religious truths over three decades later.

One of them was called the “Null Object Pattern”, first documented in "Pattern Languages of Program Design" by James Coplien in 1995.

Even though this was conceived during the height of the “OOP” madness, it is arguably the best solution to the problem of dealing with missing data. At least to the extent that it semantically shows that data is missing on purpose and not by accident, up to a point.

No Data Value > Null Object > Optional[T]

A competing concept to the “Null Object” is “Optional[T]”, which has its origins in functional programming, specifically the Maybe type from the programming language Haskell.

In functional programming languages, where monads work seamlessly in the language, Optional is an elegant solution. In OOPy languages with C++-based syntax, like Java and C#, and in hybrid languages like Go that are not OOPy and not really functional either, it is nothing more than an if == null check with extra steps.

import java.util.Optional;

Optional<String> maybeName = getNameFromDatabase(userId);

if (maybeName.isPresent()) {
    String name = maybeName.get();
    System.out.println("Name: " + name);
} else {
    System.out.println("Name not found.");
}

is nothing but more code and more indirection compared to just doing

String name = getNameFromDatabase(userId);

if (name != null) {
    System.out.println("Name: " + name);
} else {
    System.out.println("Name not found.");
}

This is just “functional programming languages have better abstractions, but I really do not want to learn how to think in an actual functional manner, so let’s wedge the idea into Java-brained stupidity.”

The same people that promote Optional as the solution to the null problem are the same ones that complain about performance and think fewer lines of code means better. They have no internal consistency.

String missingName = "\u0000"; // zero byte character; traditionally the null terminating character.

String name = getNameFromDatabase(userId);

if (!name.equals(missingName)) {
    System.out.println("Name: " + name);
} else {
    System.out.println("Name not found");
}

Now this is not any fewer lines of code than the raw null check, but it is more semantically rich in that it explicitly informs the reader that no name was returned.

I use the traditional zero byte null character from C to represent that no data was provided. This value is useful in that if you try to print it out you do not get an error, and you can test for a value that should not appear in any valid data in any real world system.

If, for some exceptionally rare case, your system does treat a single zero byte character (or the Unicode equivalent) as valid data, there are literally hundreds (thousands if using Unicode) of other characters you can choose from.

Here are some Unicode characters that are specifically suited for use as a marker character for intentionally missing data.

  • Null Character (U+0000):

    • As demonstrated in the previous Java example, this is a classic choice. It's explicitly designed to represent "null."

    • However, it can be difficult to display and handle in some environments.

  • Unit Separator (U+001F):

    • This is a control character that's rarely used in normal text.

    • It is intended for separating data units; by itself, this single character can be read as marking the absence of data.

  • Object Replacement Character (U+FFFC):

    • This character is often used to represent an unknown or unrenderable object.

    • While not strictly `null`, it can be used to indicate the absence of a meaningful value.

  • Zero Width No-Break Space (U+FEFF):

    • While this is often used as a Byte Order Mark, it is also a zero width character, and therefore would not visually show up. It is also very unlikely to be used in normal text.
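Collected as constants, any of the four candidates above can be detected with a single membership test (the constant and function names here are my own):

```javascript
// The four candidate marker characters, collected into one set.
const MISSING_MARKERS = new Set([
    '\u0000', // Null Character
    '\u001F', // Unit Separator
    '\uFFFC', // Object Replacement Character
    '\uFEFF', // Zero Width No-Break Space
]);

// True when a value is a single-character missing-data marker.
function isMissingMarker(value) {
    return typeof value === 'string' &&
           value.length === 1 &&
           MISSING_MARKERS.has(value);
}

console.log(isMissingMarker('\u0000')); // → true
console.log(isMissingMarker('Alice'));  // → false
```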

But how do you communicate why the data is missing?

“\u0000” is a better value to use than null to represent missing or no-value data.

It is not perfect; it is still missing the semantics of why the data is missing.

All you have to do is expand on this idea of invalid printable/invisible characters, and you can encode as much information into your missing data as you see fit.

If you look up some data from a data store or external system and the data is missing because it was not found, then this would be a semantically appropriate return value.

Object Replacement Character (U+FFFC):

  • This character is often used to represent an unknown or unrenderable object.

  • This can be used to indicate the absence of a meaningful value.

You can always combine one of the above alternatives with one of the following 6400 values to create as many “missing data, and here is why” reason codes as you need!

  • Private Use Characters (U+E000–U+F8FF):

    • These characters are reserved for private use, so you can choose any character within this range.

    • This guarantees that it won't conflict with any standard Unicode characters.

    • This however makes the code less portable.
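As a sketch of that combination (the constant names and the reason assignments are my own invention), take one base marker plus one Private Use Area code point per reason:

```javascript
// Base "no data" marker: the Object Replacement Character.
const BASE_MARKER = '\uFFFC';

// Reason codes from the Private Use Area; these assignments are arbitrary.
const MISSING_NOT_FOUND = BASE_MARKER + '\uE000'; // looked up, no record exists
const MISSING_NOT_ASKED = BASE_MARKER + '\uE001'; // value was never requested
const MISSING_WITHHELD  = BASE_MARKER + '\uE002'; // requested, deliberately not provided

// True when the value is a reason-coded missing-data marker.
function isMissing(value) {
    return typeof value === 'string' &&
           value.length === 2 &&
           value[0] === BASE_MARKER;
}

// Map a marker back to a human-readable reason.
function missingReason(marker) {
    switch (marker) {
        case MISSING_NOT_FOUND: return 'not found';
        case MISSING_NOT_ASKED: return 'never requested';
        case MISSING_WITHHELD:  return 'deliberately withheld';
        default:                return 'missing, reason unknown';
    }
}
```

The consumer no longer has to guess: the value itself says both that the data is missing and why.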

If you want to get even more clever, you can shift an entire printable error message up into the Private Use Area by adding 0xE000 to each character code so that none of the characters are printable, and shift them back down to print out the error message.

const PUA_BASE = 0xE000; // start of the Unicode Private Use Area (U+E000–U+F8FF)

// Shift every code point up into the Private Use Area so that none of
// the characters are printable. This works for code points up to U+18FF;
// anything higher would shift past the end of the range.
function encodeErrorMsg(str) {
    return Array.from(str)
        .map(ch => String.fromCodePoint(ch.codePointAt(0) + PUA_BASE))
        .join('');
}

// Shift the code points back down to recover the readable message.
function decodeErrorMsg(str) {
    return Array.from(str)
        .map(ch => String.fromCodePoint(ch.codePointAt(0) - PUA_BASE))
        .join('');
}

// A string is an error message if every code point sits inside the
// Private Use Area.
function isErrorMsg(str) {
    return str.length > 0 && Array.from(str).every(ch => {
        const cp = ch.codePointAt(0);
        return cp >= 0xE000 && cp <= 0xF8FF;
    });
}

// Example usage:
const inputString = "Hello, world!";
console.log(inputString);

const encoded = encodeErrorMsg(inputString);
console.log("isError:", isErrorMsg(encoded));
console.log(Array.from(encoded).map(ch => ch.codePointAt(0).toString(16)).join(' '));

const decodedString = decodeErrorMsg(encoded);
console.log(decodedString);

This outputs the following.

Hello, world!
isError: true
e048 e065 e06c e06c e06f e02c e020 e077 e06f e072 e06c e064 e021
Hello, world!

This was just an example using the lowly string; you are probably wondering how you would deal with a more complex data type, like a Person.


/**
 * string marker value for not being initialized
 * this is more semantically informational that using null or undefined
 * since this shows deliberate intention; ie:This Page Intentionally Blank
 * @type {string}
 */
export const no_string_data = '\u0000';

/**
 * Date marker value for not being initialized
 * since this shows deliberate intention; ie:This Page Intentionally Blank
 * @type {Date}
 */
export const no_date_data = new Date('0001-01-01T00:00:00.000Z');

/**
 * uint marker value for not being initialized
 * since there is only one number type, using a negative value shows
 * two things: that the variable is intended to represent an unsigned
 * integer, and deliberate intention; ie: This Page Intentionally Blank
 * @type {number}
 */
export const no_uint_data = -1;

/**
 * this string represents the lack of a boolean value
 * the idea is to union string with boolean as valid values
 * and test for this specific string if the value is a string type.
 * Zero Width No-Break Space - \uFEFF
 * @type {string}
 */
export const no_bool_data = "\uFEFF";

export function isNoBooleanData(value) {
  if (typeof value === 'string') {
    return value === no_bool_data;
  }
  return false;
}

/**
 *
 * @type {{id: string,
           first_name: string, 
           last_name: string, 
           opinionated: (boolean|string), 
           birthday: Date, 
           age: number
          }}
 */
const no_person_data = {
    id: no_string_data,
    first_name: no_string_data,
    last_name: no_string_data,
    opinionated: no_bool_data,
    birthday: no_date_data,
    age: no_uint_data,
}

The one obvious exception to these approaches, and there is always at least one, is the Boolean type.

In this example, I could have done the lazy thing and just set opinionated = false and been done with it, but that would not be the wise choice given that all the other datatypes have no_data alternatives to check for.

With only two values, both of which are equally valid as data, how do you deal with missing data?

In languages that are not strongly typed, like JavaScript, you have alternatives.

Here I have used the no_bool_data value, which I set to the Zero Width No-Break Space. The test becomes simple: isNoBooleanData() is no more code than == null or isPresent(), and it is much more semantically informational.
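As a standalone usage sketch (redeclaring the article's no_bool_data constant and isNoBooleanData() check so the snippet runs on its own):

```javascript
// Zero Width No-Break Space as the "no boolean answer" marker.
const no_bool_data = '\uFEFF';

// A boolean field holding this string means "no answer was ever given".
function isNoBooleanData(value) {
    return typeof value === 'string' && value === no_bool_data;
}

// The field's type is effectively boolean | string:
// a real answer, or the marker for "no answer".
let opinionated = no_bool_data;
console.log(isNoBooleanData(opinionated)); // → true

opinionated = true;
console.log(isNoBooleanData(opinionated)); // → false
```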

Specific No Data Types

In strictly typed languages you can create specific no data types and check for the type or use a language feature to test for missing data. This is just one example of one approach in Go.

package main

import (
    "fmt"
)

type NotProvided string

const firstNameNotProvided = NotProvided("\u0000\uF8FF")
const lastNameNotProvided = NotProvided("\u0000\uF900")


func isNotProvidedType[T string | NotProvided](val T) bool {
    _, ok := interface{}(val).(NotProvided)
    return ok
}


func main() {
    fmt.Printf("data not provided: %t\n", isNotProvidedType("this is a test"))
    fmt.Printf("first name not provided: %t\n", isNotProvidedType(firstNameNotProvided))
    fmt.Printf("last name not provided: %t\n", isNotProvidedType(lastNameNotProvided))
}

this results in the following output

data not provided: false
first name not provided: true
last name not provided: true

Languages with pattern matching, like Erlang, make this even easier; but then again, functional languages have elegant compositional chaining that procedural languages usually do not.

Strongly Typed Languages

Languages like Go have the concept of nil, which is effectively a strongly typed null reference; for pointers, slices, maps, channels, functions, and interfaces, nil is that type's Zero Value.

It still suffers from a lack of semantic meaning beyond “nothing of this type”, and I find myself using No Data types that I can test against for the reasons why something is not returning data.

Go is not the best example either, because of its idiom of returning an error value whenever a return value could be nil, to explain why it is nil; and of course, the way to check that there is no error is to check whether the returned error is nil.

Yet some functions return a boolean ok/exists/success value to signal that a value was returned successfully, as when asserting that an interface{} is of a type, or checking to see if a map has a value for a key.

// pre-generics way to test if something is of a type
var val interface{} = "some value"
if s, ok := val.(string); ok {
    fmt.Println(s)
}


// does a map[string]string contain a key?
// if the value were a pointer type you could test
// v != nil, but that is less informational than exists
m := make(map[string]string)
if v, exists := m["somekey"]; exists {
    fmt.Println(v)
}

Conclusion

There is no reason to compromise your data integrity and code clarity by slinging null all over your code base.

The more important concept is downstream data consumers.

When others access your data they need to know why data is missing so they can feel confident in the validity of the data. Especially if you are the authoritative source of the data.

If you change your rules for what default value null represents when it is read back in, there is no way for downstream consumers to know about the change. They might have seen that null actually means 0 for some datum, but you decide that it now means -1.

This causes confusion and havoc for the downstream consumers that use the data and forward what is now wrong data further downstream when they replace the null with what they think the default value should be.

When that data comes full circle back to the authoritative source, more time and money will be lost than it would take to avoid the possibility in the first place.