Sequences
Motivation
As of March 1, 2025, the sequence implementation has several critical limitations that impact system performance and scalability:
-
Unbound Memory Growth: Sequence data for all workspaces is loaded into memory simultaneously, creating a direct correlation between memory usage and the number of workspaces. This approach becomes unsustainable as applications scale.
-
Prolonged Startup Times: During command processor initialization, a resource-intensive "recovery process" must read and process the entire PLog to determine the last used sequence numbers. This causes significant startup delays that worsen as event volume grows.
The proposed redesign addresses these issues through intelligent caching, background updates, and optimized storage mechanisms that maintain sequence integrity while dramatically improving resource utilization and responsiveness.
Introduction
This document outlines the design for sequence number management within the Voedger platform.
A Sequence in Voedger is defined as a monotonically increasing series of numbers. The platform provides a unified mechanism for sequence generation that ensures reliable, ordered number production.
As of March 1, 2025, Voedger implements four specific sequence types using this mechanism:
- PLogOffsetSequence: Tracks write positions in the PLog
- Starts from 1
- WLogOffsetSequence: Manages offsets in the WLog
- Starts from 1
- To read all events by SELECT
- CRecordIDSequence: Generates unique identifiers for CRecords
- Starts from 322685000131072
- Motivation:
- Efficient CRecord caching on the DBMS side (Most CRecords reside in the same partition)
- Simple iteration over CRecords
- OWRecordIDSequence: Provides sequential IDs for ORecords/WRecords (OWRecords)
- Starts from 322680000131072
- There are a potentially lot of such records, so it is not possible to use SELECT to read all of them
As the Voedger platform evolves, the number of sequence types is expected to expand. Future development will enable applications to define their own custom sequence types, extending the platform's flexibility to meet diverse business requirements beyond the initially implemented system sequences.
These sequences ensure consistent ordering of operations, proper transaction management, and unique identification across the platform's distributed architecture. The design prioritizes performance and scalability by implementing an efficient caching strategy and background updates that minimize memory usage and recovery time.
Background
- #688: record ID leads to different tables
- VIEW RecordsRegistry
- [Singleton IDs]https://github.com/voedger/voedger/blob/ec85a5fed968e455eb98983cd12a0163effdc096/pkg/istructs/consts.go#L101
Existing design
- const MinReservedBaseRecordID = MaxRawRecordID + 1
- 65535 + 1
- const MaxReservedBaseRecordID = MinReservedBaseRecordID + 0xffff // 131071
- const FirstSingletonID = MinReservedBaseRecordID // 65538
- const MaxSingletonID = MaxReservedBaseRecordID // 66047, 512 singletons
- ClusterAsRegisterID = 0xFFFF - 1000 + iota
- ClusterAsCRecordRegisterID
- const FirstSingletonID
- cmdProc.appsPartitions
- command/impl.go/getIDGenerator
- command/impl.go: Put IDs to response
- command/impl.go: (idGen *implIDGenerator) NextID
- istructmem/idgenerator.go: (g *implIIDGenerator) NextID
Previous flow
- Recovery on the first request into the workspace
- save the event after cmd exec:
Definitions
APs: Applcation Partitions
SequencesTrustLevel:
The SequencesTrustLevel setting determines how events and table records are written.
| Level | Events, write mode | Table Records, write mode |
|---|---|---|
| 0 | InsertIfNotExists | InsertIfNotExists |
| 1 | InsertIfNotExists | Put |
| 2 | Put | Put |
Note
SequencesTrustLevel is not used for the case when we're calling PutPlog() to mark the event as corrupted. Put() always used in this case
Analysis
Sequencing strategies
As of March 1, 2025, record ID sequences may overlap, and only 5,000,000,000 IDs are available for OWRecords, since OWRecord IDs start from 322680000131072, while CRecord IDs start from 322685000131072.
Solutions:
- One sequence for all records:
- Pros:
- 👍Clean for Voedger users
- 👍IDs are more human-readable
- 👍Simpler Command Processor
- ❌Cons: CRecords are not cached efficiently
- Solution: Let the State read copies of CRecords from sys.Collection, or possibly from an alternative optimized storage to handle large CRecord data
- ❌Cons: Why we need CRecords then
- 👍Pros: Separation of write and read models
- Solution: Let the State read copies of CRecords from sys.Collection, or possibly from an alternative optimized storage to handle large CRecord data
- Pros:
- Keep as is:
- Pros
- 👍Easy to implement
- Cons
- ❌ No separation between write and read models
- ❌ Only 5 billions of OWRecords (ClusterAsRegisterID < ClusterAsCRecordRegisterID)
- Solution: Configure sequencer to use multiple ranges to avoid collisions
- 👍Pros: Better control over sequences
- Solution: Configure sequencer to use multiple ranges to avoid collisions
- Pros
SequencesTrustLevel: Performance impact
- https://snapshots.raintank.io/dashboard/snapshot/zEW5AQHECtKLIcUeO2PJnmy3nkQDhp9m?orgId=0
- Zero SequencesTrustLevel was introduced to the Air performance testbench on 2025-04-29
- Latency is increased from 40 ms to 120 ms with spikes up to 160 ms
- Testbench throughput reduced from 4000 command per seconds to 1400 cps
- CPU usage is decreased from 75% to 42%
- So we can make an educated guess that maximum thoughtput would be reduced by 4000 / 1400 * 42 / 75 = 1.6 times
Solution overview
The proposed approach implements a more efficient and scalable sequence management system through the following principles:
- Projection-Based Storage: Each application partition will maintain sequence data in a dedicated projection ???(
SeqData). SeqData is a map that eliminates the need to load all sequence data into memory at once - Offset Tracking:
SeqDatawill include aSeqDataOffsetattribute that indicates the PLog partition offset for which the stored sequence data is valid, enabling precise recovery and synchronization - LRU Cache Implementation: Sequence data will be accessed through a Most Recently Used (LRU) cache that prioritizes frequently accessed sequences while allowing less active ones to be evicted from memory
- Background Updates: As new events are written to the PLog, sequence data will be updated in the background, ensuring that the system maintains current sequence values without blocking operations
- Batched Writes: Sequence updates will be collected and written in batches to reduce I/O operations and improve throughput
- Optimized Actualization: The actualization process will use the stored
SeqDataOffsetto process only events since the last known valid state, dramatically reducing startup times
This approach decouples memory usage from the total number of workspaces and transforms the recovery process from a linear operation dependent on total event count to one that only needs to process recent events since the last checkpoint.
Use cases
VVMHost: Configure SequencesTrustLevel mode for VVM
~tuc.VVMConfig.ConfigureSequencesTrustLevel~covrd1✅
VVMHost uses cmp.VVMConfig.SequencesTrustLevel.
CP: Handling SequencesTrustLevel for Events
~tuc.SequencesTrustLevelForPLog~covrd2✅- When PLog is written then SequencesTrustLevel is used to determine the write mode
- Note: except the
update corruptedcase
~tuc.SequencesTrustLevelForWLog~covrd3✅- When WLog is written then SequencesTrustLevel is used to determine the write mode
- Note: except the case when the wlog event was already stored before. Consider PutWLog is called to re-apply the last event
CP: Handling SequencesTrustLevel for Table Records
~tuc.SequencesTrustLevelForRecords~uncvrd4❓- When a record is inserted SequencesTrustLevel is used to determine the write mode
- When a record is updated - nothing is done in connection with SequencesTrustLevel
CP: Command processing
~tuc.StartSequencesGeneration~uncvrd5❓- When: CP starts processing a request
- Flow:
partitionIDis calculated using request WSID and amount of partitions declared in AppDeploymentDescriptor here- sequencer, err := IAppPartition.Sequencer() err
- nextPLogOffest, ok, err := sequencer.Start(wsKind, WSID)
- if !ok
- Actualization is in progress
- Flushing queue is full
- Returns 503: "server is busy"
- if !ok
~tuc.NextSequenceNumber~uncvrd6❓- When: After CP starts sequences generation
- Flow:
- sequencer.Next(sequenceId)
~tuc.FlushSequenceNumbers~uncvrd7❓- When: After CP saves the PLog record successfully
- Flow:
- sequencer.Flush()
~tuc.ReactualizeSequences~uncvrd8❓- When: After CP fails to save the PLog record
- Flow:
- sequencer.Actualize()
IAppPartions implementation: Instantiate sequencer on Application deployment
~tuc.InstantiateSequencer~uncvrd9❓
- When: Partition with the
partitionIDis deployed - Flow:
- Instantiate the implementation of the
isequencer.ISeqStorage(appparts.internal.seqStorage, see below) - Instantiate
sequencer := isequencer.New(*isequencer.Params) - Save
sequencerso that it will be returned by IAppPartition.Sequencer()
- Instantiate the implementation of the
Technical design: Components
IAppPartition.Sequencer
~cmp.IAppPartition.Sequencer~uncvrd10❓- Description: Returns
isequencer.ISequencer - Covers:
tuc.StartSequencesGeneration
- Description: Returns
VVMConfig.SequencesTrustLevel
~cmp.VVMConfig.SequencesTrustLevel~covrd11✅- Covers: tuc.VVMConfig.ConfigureSequencesTrustLevel
pkg/isequencer
Core:
~cmp.ISequencer~covrd12✅: Interface for working with sequences~cmp.sequencer~covrd13✅: Implementation of theisequencer.ISequencerinterface~cmp.sequencer.Start~covrd14✅: Starts Sequencing Transaction for the given WSID~cmp.sequencer.Next~covrd15✅: Returns the next sequence number for the given SeqID~cmp.sequencer.Flush~covrd16✅: Completes Sequencing Transaction~cmp.sequencer.Actualize~covrd17✅: Cancels Sequencing Transaction and starts the Actualization process
Tests:
~test.isequencer.mockISeqStorage~covrd18✅- Mock implementation of
isequencer.ISeqStoragefor testing purposes
- Mock implementation of
~test.isequencer.NewMustStartActualization~uncvrd19❓isequencer.New()must start the Actualization process, Start() must return0, false- Design: blocking hook in mockISeqStorage
~test.isequencer.Race~uncvrd20❓- If !t.Short() run something like
go test ./... -count 50 -race
- If !t.Short() run something like
Some edge case tests:
~test.isequencer.LongRecovery~covrd21✅- Params.MaxNumUnflushedValues = 5 // Just a guess
- For numOfEvents in [0, 10*Params.MaxNumUnflushedValues]
- Create a new ISequencer instance
- Check that Next() returns correct values after recovery
~test.isequencer.MultipleActualizes~covrd22✅- Repeat { Start {Next} randomly( Flush | Actualize ) } cycle 100 times
- Check that the system recovers well
- Check that the sequence values are increased monotonically
~test.isequencer.FlushPermanentlyFails~covrd23✅- Recovery has worked but then ISeqStorage.WriteValuesAndOffset() fails permanently
- First Start/Flush must be ok
- Some of the next Start must not be ok
- Flow:
- MaxNumUnflushedValues = 5
- Recover
- Mock error on WriteValuesAndOffset
- Start/Next/Flush must be ok
- loop Start/Next/Flush until Start() is not ok (the 6th times till unflushed values exceed the limit)
- Recovery has worked but then ISeqStorage.WriteValuesAndOffset() fails permanently
interface.go
/*
* Copyright (c) 2025-present unTill Software Development Group B. V.
* @author Maxim Geraskin
*/
package isequencer
import (
"context"
"time"
)
type SeqID uint16
type WSKind uint16
type WSID uint64
type Number uint64
type PLogOffset uint64
type NumberKey struct {
WSID WSID
SeqID SeqID
}
type SeqValue struct {
Key NumberKey
Value Number
}
// To be injected into the ISequencer implementation.
//
type ISeqStorage interface {
// If number is not found, returns 0
ReadNumbers(WSID, []SeqID) ([]Number, error)
ReadNextPLogOffset() (PLogOffset, error)
// IDs in batch.Values are unique
// len(batch) may be 0
// offset: Next offset to be used
// batch MUST be written first, then offset
WriteValuesAndNextPLogOffset(batch []SeqValue, nextPLogOffset PLogOffset) error
// ActualizeSequencesFromPLog scans PLog from the given offset and send values to the batcher.
// Values are sent per event, unordered, ISeqValue.Keys are not unique.
// err: ctx.Err() if ctx is closed
ActualizeSequencesFromPLog(ctx context.Context, offset PLogOffset, batcher func(batch []SeqValue, offset PLogOffset) error) (err error)
}
// ISequencer defines the interface for working with sequences.
// ISequencer methods must not be called concurrently.
// Use: { Start {Next} ( Flush | Actualize ) }
//
// Definitions
// - Sequencing Transaction: Start -> Next -> (Flush | Actualize)
// - Actualization: Making the persistent state of the sequences consistent with the PLog.
// - Flushing: Writing the accumulated sequence values to the storage.
// - LRU: Least Recently Used cache that keep the most recent next sequence values in memory.
type ISequencer interface {
// Start starts Sequencing Transaction for the given WSID.
// Marks Sequencing Transaction as in progress.
// Panics if Sequencing Transaction is already started.
// Normally returns the next PLogOffset, true
// Returns `0, false` if:
// - Actualization is in progress
// - The number of unflushed values exceeds the maximum threshold
// If ok is true, the caller must call Flush() or Actualize() to complete the Sequencing Transaction.
Start(wsKind WSKind, wsID WSID) (plogOffset PLogOffset, ok bool)
// Next returns the next sequence number for the given SeqID.
// Panics if Sequencing Transaction is not in progress.
// err: ErrUnknownSeqID if the sequence is not defined in Params.SeqTypes.
Next(seqID SeqID) (num Number, err error)
// Flush completes Sequencing Transaction.
// Panics if Sequencing Transaction is not in progress.
Flush()
// Actualize cancels Sequencing Transaction and starts the Actualization process.
// Panics if Actualization is already in progress.
// Flow:
// - Mark Sequencing Transaction as not in progress
// - Do Actualization process
// - Cancel and wait Flushing
// - Empty LRU
// - Write next PLogOffset
Actualize()
}
// Params for the ISequencer implementation.
type Params struct {
// Sequences and their initial values.
// Only these sequences are managed by the sequencer (ref. ErrUnknownSeqID).
SeqTypes map[WSKind]map[SeqID]Number
SeqStorage ISeqStorage
MaxNumUnflushedValues int // 500
// Size of the LRU cache, NumberKey -> Number.
LRUCacheSize int // 100_000
BatcherDelay time.Duration // 5 * time.Millisecond
}
Implementation requirements
// filepath: pkg/isequencer: impl.go
import (
"context"
"time"
"github.com/hashicorp/golang-lru/v2"
)
// Implements isequencer.ISequencer
// Keeps next (not current) values in LRU and type ISeqStorage interface
type sequencer struct {
params *Params
actualizerInProgress atomic.Bool
// Set by s.Actualize(), never cleared (zeroed).
// Used by s.cleanup().
actualizerCtxCancel context.CancelFunc
actualizerWG *sync.WaitGroup
// Cleared by s.Actualize()
lru *lru.Cache
// Initialized by Start()
// Example:
// - 4 is the offset ofthe last event in the PLog
// - nextOffset keeps 5
// - Start() returns 5 and increments nextOffset to 6
nextOffset PLogOffset
// If Sequencing Transaction is in progress then currentWSID has non-zero value.
// Cleared by s.Actualize()
currentWSID WSID
currentWSKind WSKind
// Set by s.actualizer()
// Closed when flusher needs to be stopped
flusherCtxCancel context.CancelFunc
// Used to wait for flusher goroutine to exit
// Set to nil when flusher is not running
// Is not accessed concurrently since
// - Is accessed by actualizer() and cleanup()
// - cleanup() first shutdowns the actualizer() then flusher()
flusherWG *sync.WaitGroup
// Buffered channel [1] to signal flusher to flush
// Written (non-blocking) by Flush()
flusherSig chan struct{}
// To be flushed
toBeFlushed map[NumberKey]Number
// Will be 6 if the offset of the last processed event is 4
toBeFlushedOffset PLogOffset
// Protects toBeFlushed and toBeFlushedOffset
toBeFlushedMu sync.RWMutex
// Written by Next()
inproc map[NumberKey]Number
}
// New creates a new instance of the Sequencer type.
// Instance has actualizer() goroutine started.
// cleanup: function to stop the actualizer.
func New(*isequencer.Params) (isequencer.ISequencer, cleanup func(), error) {
// ...
}
// Flush implements isequencer.ISequencer.Flush.
// Flow:
// Copy s.inproc and s.nextOffset to s.toBeFlushed and s.toBeFlushedOffset
// Clear s.inproc
// Increase s.nextOffset
// Non-blocking write to s.flusherSig
func (s *sequencer) Flush() {
// ...
}
// Next implements isequencer.ISequencer.Next.
// It ensures thread-safe access to sequence values and handles various caching layers.
//
// Flow:
// - Validate equencing Transaction status
// - Get initialValue from s.params.SeqTypes and ensure that SeqID is known
// - Try to obtain the next value using:
// - Try s.lru (can be evicted)
// - Try s.inproc
// - Try s.toBeFlushed (use s.toBeFlushedMu to synchronize)
// - Try s.params.SeqStorage.ReadNumber()
// - Read all known numbers for wsKind, wsID
// - If number is 0 then initial value is used
// - Write all numbers to s.lru
// - Write value+1 to s.lru
// - Write value+1 to s.inproc
// - Return value
func (s *sequencer) Next(seqID SeqID) (num Number, err error) {
// ...
}
// batcher processes a batch of sequence values and writes maximum values to storage.
//
// Flow:
// - Wait until len(s.toBeFlushed) < s.params.MaxNumUnflushedValues
// - Lock/Unlock
// - Wait s.params.BatcherDelay
// - check ctx (return ctx.Err())
// - s.nextOffset = offset + 1
// - Store maxValues in s.toBeFlushed: max Number for each SeqValue.Key
// - s.toBeFlushedOffset = offset + 1
//
func (s *sequencer) batcher(ctx ctx.Context, values []SeqValue, offset PLogOffset) (err error) {
// ...
}
// Actualize implements isequencer.ISequencer.Actualize.
// Flow:
// - Validate Actualization status (s.actualizerInProgress is false)
// - Set s.actualizerInProgress to true
// - Set s.actualizerCtxCancel, s.actualizerWG
// - Start the actualizer() goroutine
func (s *sequencer) Actualize() {
// ...
}
/*
actualizer is started in goroutine by Actualize().
Flow:
- if s.flusherWG is not nil
- s.cancelFlusherCtx()
- Wait for s.flusherWG
- s.flusherWG = nil
- Clean s.lru, s.nextOffset, s.currentWSID, s.currentWSKind, s.toBeFlushed, s.inproc, s.toBeFlushedOffset
- s.flusherWG, s.flusherCtxCancel + start flusher() goroutine
- Read nextPLogOffset from s.params.SeqStorage.ReadNextPLogOffset()
- Use s.params.SeqStorage.ActualizeSequencesFromPLog() and s.batcher()
ctx handling:
- if ctx is closed exit
Error handling:
- Handle errors with retry mechanism (500ms wait)
- Retry mechanism must check `ctx` parameter, if exists
*/
func (s *sequencer) actualizer(ctx context.Context) {
// ...
}
/*
flusher is started in goroutine by actualizer().
Flow:
- Wait for ctx.Done() or s.flusherSig
- if ctx.Done() exit
- Lock s.toBeFlushedMu
- Copy s.toBeFlushedOffset to flushOffset (local variable)
- Copy s.toBeFlushed to flushValues []SeqValue (local variable)
- Unlock s.toBeFlushedMu
- s.params.SeqStorage.WriteValues(flushValues, flushOffset)
- Lock s.toBeFlushedMu
- for each key in flushValues remove key from s.toBeFlushed if values are the same
- Unlock s.toBeFlushedMu
Error handling:
- Handle errors with retry mechanism (500ms wait)
- Retry mechanism must check `ctx` parameter, if exists
*/
func (s *sequencer) flusher(ctx context.Context) {
// ...
}
// cleanup stops the actualizer() and flusher() goroutines.
// Flow:
// - if s.actualizerInProgress
// - s.cancelActualizerCtx()
// - Wait for s.actualizerWG
// - if s.flusherWG is not nil
// - s.cancelFlusherCtx()
// - Wait for s.flusherWG
// - s.flusherWG = nil
func (s *sequencer) cleanup() {
// ...
}
ISeqStorage implementation
~cmp.ISeqStorageImplementation~covrd24✅: Implementation of the isequencer.ISeqStorage interface
- Package: cmp.appparts.internal.seqStorage
~cmp.ISeqStorageImplementation.New~covrd25✅- Per App per Partition by AppParts
- PartitionID is not passed to the constructor
~cmp.ISeqStorageImplementation.i688~uncvrd26❓- Handle #688: record ID leads to different tables
- If existing number is less than ??? 322_680_000_000_000 - do not send it to the batcher
- Uses VVMSeqStorage Adapter
VVMStorage Adapter
~cmp.VVMSeqStorageAdapter~covrd27✅: Adapter that reads and writes sequence data to the VVMStorage
- PLogOffset in Partition storage: ((pKeyPrefix_SeqStorage_Part, PartitionID) PLogOffsetCC(0) )
- Numbers: ((pKeyPrefix_SeqStorage_WS, AppID, WSID) SeqID)
Technical design: Tests
Integration tests for SequencesTrustLevel mode
Method:
- Test for Record
- Create a new VIT instance on an owned config with
VVMConfig.TrustedSequences = false - Insert a doc to get the last recordID: simply exec
c.sys.CUDand get the ID of the new record - Corrupt the storage: Insert a conflicting key that will be used on creating the next record:
VIT.IAppStorageProvider.AppStorage(test1/app1).Put()- Build
pKey,cColsfor the record, use just inserted recordID+1 - Value does not matter, let it be
[]byte{1}
- Try to insert one more record using
c.sys.CUD - Expect panic
- Create a new VIT instance on an owned config with
- Test for PLog, WLog offsets - the same tests but sabotage the storage building keys for the event
Tests:
~it.SequencesTrustLevel0~uncvrd34❓: Intergation test forSequencesTrustLevel = 0~it.SequencesTrustLevel1~uncvrd35❓: Intergation test forSequencesTrustLevel = 1~it.SequencesTrustLevel2~uncvrd36❓: Intergation test forSequencesTrustLevel = 2
Intergation tests for built-in sequences
~it.BuiltInSequences~uncvrd37❓: Test for initial values: WLogOffsetSequence, WLogOffsetSequence, CRecordIDSequence, OWRecordIDSequence
Addressed issues
- #3215: Sequences - Initial requirements and discussion
References
Design process:
History: