3

Embedded software engineer here.

If you had to pick a data structure that will have to potentially hold thousands of entries, and that should be as fast as possible at lookups, what would be your first choice?

Comments
  • 1
    @maushax +1 to sorted dictionary
  • 2
    Hashmap would be the way to go for random access. For sequential access, there are a lot of choices lists, arrays, queues, pipes, etc
  • 1
    Dictionary. Depending on language and key type, there might be further optimizations. For example, if your key values are strings, then some languages have culture, case, etc. options for finding the key.
  • 0
    As I said, it's fir an embedded system, so the only language option is C. The access is totally random and the key is a 48 bit integer.
    I thought about hashamps at first as well, but they tend to loose a lot of performance when they get too crowded and you have to resolve conflicts. I don't know much about sorted dictionaries. Anybody know any good reference/implementation?
  • 2
    @matteovigni c is far not the only language used in embedded systems... Even java has an embedded version.

    I doubt hashmap is the way to go. It'd only add overhead: hashing the value and iterating over a set to look it up. Oh, and also hashmaps do not have duplicates, so there's also that.
  • 2
    @netikras true, but C is also the most commonly used and supported by all MCU vendors. And in my experience if you're talking about programming embedded systems, it 99 times out of 100 you're implying programming in C
  • 1
    So it depends on a lot of things
    Do you want to do range queries, i.e. access elements in sorted order along with pure lookups? (If yes, you would want to avoid hashtables and go for a binary search tree based thing)

    How much do you write vs read, what's your memory access pattern like? (would allow you to choose between different types of trees for example, or if you have very specific access pattern you could even use a more exotic thing like a splay tree. Or implement caching. Or use hashtables if that fits your usage).

    What are the latencies of your memory hardware? Do you have a cache and how bad is a cache miss? (Could help to have a shallow and wide structure like a B-tree rather than a deeper structure like an AVL tree)

    Also, do you have a prebuilt library versus having to code it yourself. Also, data structure overhead.

    Depending on those I would choose anything from a regular array, red-black tree, AVL tree, B-tree, hashtable. Or more exotic variants thereof.
  • 1
    @matteovigni C is only third after python and C++ on IEEE Spectrum rating for embedded
  • 2
    Folks, nobody does Java for embedded below Cortex-A. And next to nobody does C++.

    As for the data structure, it depends on how often you will change entries. If it's generated at compile time, a linked list with skiplist helper table will be fine.

    If it's dynamic content, you can use an array of records that contain full hash plus data and determine the start index by approprately shifting the hash value. Then the entry plus the two following entries are relevant.

    Storage is: slot 0 if empty, slot 1 if the new entry is more "valuable" than the existing one, slot 2 otherwise (always replace). Lookup is shifting the hash, then scanning the next up to 3 entries. Be careful that the table needs room for two additional entries at the end to avoid out of bounds access.

    Of course, this means that entries in the table may get overwritten, but since the runtime of an embedded system is typically long and the memory small, there will probably be more data than you can store at once anyway.
  • 2
    Ah when storing: slot 0 not only if empty, but also if same key as the new one.
  • 0
    @Fast-Nop I like your idea. If I got it right, it would be log n for lookups right?
    Since my "key" (the value I'm sorting for) is already a simple number, I guess I could avoid completely the hash functions and use that directly instead.
    Problem is that entries can be randomly accessed, added and removed since it all depends on what happens on an external communication bus.
    If the data structure gets "big" I guess insertions and removals can get quite expensive
  • 1
    @matteovigni The first version for constant data in ROM will be always O(N) like every linear list. Without helper table, you'll need to scan through N/2 entries on average. With helper table, it's (N/2)/256 for a 256 entry helper table. You can choose any helper table size that is a power of 2, bigger or smaller.

    The second version is O(1) because the operations are shifting the hash value, using that as index and jumping right into the data array, plus the next or overnext entry, so that's constant effort.

    Adding/replacing are also constant effort. The drawback is that new data can overwrite old data, but you get easy eviction management.
  • 2
    @Fast-Nop unless I'm still not understanding, the thing is that my key is a 48 bits number, and of course I cannot build a 2^48 long array 😂
    So I was thinking to keep that array sorted and use bitshifts as you suggested to do a binary search for access
  • 2
    @matteovigni then you'd need to sort every time you get a new entry, and since the array is still smaller than your key space, you'll also need some eviction management. That's already included in that approach.

    Since you shift the key to use as index both for read and store, there is no need to sort. You'll always be at the right spot.

    Of course, you could also simplify the strategy to store always to slot 0 and read always from there, i.e. no slot 1 and 2. That would be even faster, but won't use the available memory as efficiently.
Add Comment