System Design goal: efficient reads when multiple source define the same property key

System Design goal: efficient reads when multiple source define the same property key

Β·

7 min read

πŸ”” This article was originally posted on my site, MihaiBojin.com. πŸ””


A problem of ownership

I set out to build the props library to standardize retrieving (key,value) pairs from multiple sources while also allowing a simple mechanism to establish key precedence when more than one source is aware of the same key.

Sources can be property files, system properties, or they can be coming from a key-value store or database.

Since each (key,value) pair can change independently on each source, we need to establish a key's precedence, basically answering the "Who provides the value for this key?" question.

We do so by introducing the concept of layers.

Each added layer can retrieve data from a source. As we add more layers, each takes precedence over the previous layer, for any keys it defines.

We call this concept ownership.


Sources vs. layers

Let's think for a moment about sources and layers.

A naΓ―ve approach would be to model a 1:1 relationship between Sources and Layers, basically creating a single entity to represent both, then passing a list of these entities to a registry.

Two sources can point to the same file on disk. However, if we have two objects that independently read the same data they create a potential for contention and consistency problems (e.g., Source 1 reads data, the file is updated on disk, Source2 reads different data).

Consistency is not always possible and cannot always be a property of a system, but in this case, we can ensure that a source has the same view of its data in a running JVM process.

Rather than have _source_s and _layer_s be the same (1:1), let's instead have a single source representing a dataset and multiple layers (1:N) which receive updates when the dataset changes. An added benefit of this design is that our system only needs to do one read operation per dataset instead of one per layer. Since I/O is generally more "expensive" than in-memory operations, this is a win!

Here are the components of our system:

  • Source: loads the data from its origin (e.g., filesystem), reloads it using user-defined logic (e.g., an interval), and notifies registered layers of any reload events

  • Layer: holds copies of the data for each registry

  • Registry: queries data from each of its layers

Retrieving the effective value

But what if a layer adds a new key or removes an existing one!?

Between two reads of X, the owner can change from Layer 3 to Layer 2.

A simple algorithm to solve this problem is always to query all layers, e.g.:

  • ask Layer 1-3: Do you define X?

  • Layers 2 and 3 respond yes

  • Layer 3 has precedence and owns the key

  • X is unset from Layer 3

  • we repeat the first three steps and establish Layer 2 as the owner of X

This choice is not ideal. First, we'd have to repeat this on every read, which would result in O(K) time-complexity (where K is the number of layers in the registry) for every single read property. This doesn't sound performant!

A side objective here is to deal with a high frequency of ownership changes; for example, a source dataset may repeatedly set and unset one or more keys in a short interval.

Type casting values

Before we talk about ownership, let's quickly segue into data types. The values will always be serialized as strings as we're reading them from property files, system properties, and the environment. (The library will support loading values from other data stores, e.g., databases, but the problem of casting strings from property files remains)

Here's an example of a standard Java property file:


string.prop=some value

int.prop=42

float.prop=1.23

Even though we have integer and float values, the relevant Java APIs will still return a Map, leaving the job of casting these values to the appropriate types to the developer.

In practice, while developers "come and go" in projects, this results in many ways to, not-so-optimally, load strings and repeatedly convert them to the desired type.

When I set out to build the props library, I wanted to give developers "one" way of loading the values corresponding to each key without worrying about types, simple validations, or caching. My goal is to implement a no-hassle, efficient library for dealing with application properties!

How do we establish key ownership?

Ok, back to ownership!

The registry needs to keep track of a large enough number of keys (let's say in the 10,000s range) and retrieve values from the appropriate layer.

Developers will be able to rely on a registry to retrieve values cast to their desired type.

For example, a client queries the registry (1), which in turn checks all its layers (2), awaiting responses (3), then determining the owner and returning the effective value to the client (4).

In practical applications, we can assume a registry will have a few layers (K).

If our application handles N properties (keys), then reading all the properties at least once would be an O(N*K) operation. Given that ultimately a key is owned by a single layer, the best we can hope for is O(N) for N keys, or O(1), constant time, per key.

This time-complexity is not only more efficient but also totally achievable!


Internally, the registry needs to keep track of which layers own each key.

Since layers can be updated concurrently, we need a thread-safe data structure.

Since we need to "map" (pun intended) keys to layers, we need to use a Map (duh!)

The Java SDK provides a very efficient, thread-safe map: ConcurrentHashMap.

Let's see how this could look in code:


public class Registry {

ConcurrentHashMap owners;

// retrieves the value for the specified key

public T get(String key, Class clz) {

// finds the owner

Layer owner = owners.get(key);

// retrieves the value

String effectiveValue = owner.get(key);

// casts it

return clz.cast(effectiveValue);

}

}

public interface Layer{

// retrieves the value for the specified key, from this layer alone

String get(String key);

}

The last piece of the puzzle is updating ownership. Below, we examine the new key case.


public class Source{

List layers;

// reloads data periodically, at the specified interval

void reload(Duration interval) {

// the trigger will be periodically scheduled at an interval

var trigger = () -> {

// data is reloaded from source

Map data = ...;

// and then sent to all registered layers for this source

for (Layer layer : layers) {

layer.onReload(data);

}

// we may also want a sync stage here

// to make the new dataset available to all sources at the same time

// but it's beyond the scope, for the moment

}

}

}

public class Layer{

// processes the reloaded data

public void onReload(Map data) {

// for each key in the dataset

for (var entry : data.entrySet()) {

// updates the value

updateValue(entry.getKey(), entry.getValue());

// and notifies the registry that a new owner might be available

registry().bindKey(entry.getKey(), this);

}

}

// decides the priority of this layer, in the current registry

int priority();

// reference to the registry that owns this layer

abstract Registry registry();

// stores the (key,value) pair

abstract void updateValue(String key, String value);

}

public class Registry {

ConcurrentHashMap owners;

// registers ownership of a layer over a key

void bindKey(String key, Layer layer) {

// finds the current owner

Layer owner = owners.get(key);

// determines if ownership change is required

if (owner.priority() < layer.priority()) {

// change the current owner

owners.put(key, layer);

// from this point, all "get"s of "key" will be routed to "layer"

}

}

}

The scenario above details how the system will handle new keys appearing in a layer.

A full implementation also needs to consider deleted keys and updated keys. For those cases, we'd want to avoid calling registry().bindKey(...) since it would result in a no-op anyway.


Overall, I am willing to bet that ownership changes will be less frequent than value changes in real-world scenarios. So, the upfront cost of adding and deleting keys, represented by registering and unregistering keys from a layer, will be worth it to achieve constant time (O(1)) reads per key.

Conclusion

At this point, I am genuinely excited to implement this code!

I'm sure I will learn more as I go along and may change my mind on some of these choices, but this is the plan for now!

Until next time...


If you liked this article and want to read more like it, please subscribe to my newsletter; I send one out every few weeks!

Β