88. Design Resumable Iterator for Large Dataset
Design Resumable Iterator for Large Dataset
Design and implement a resumable iterator for a large dataset. The iterator must be able to pause mid-traversal and later resume from the exact same position.
The dataset may be too large to fit in memory, so the iterator must be memory-efficient and must not require loading the entire dataset at once.
Additionally, the iterator should support serializing its state (cursor) so it can be persisted across sessions.
Key Definitions
- Dataset: an ordered collection of records, identified by integer
datasetId and integer datasetVersion.
- Record: a single item in the dataset, represented as a
String.
- Cursor / State Token: a
String that encodes the iterator’s resume position and safety metadata.
- Position: a 0-indexed integer representing the index of the next record that would be returned by
next().
- Dataset Change: if dataset content changes between pause and resume, the iterator must detect it using
datasetVersion.
Method Signatures
Constructor
ResumableIterator(int datasetId, List<String> records, int datasetVersion)
datasetId uniquely identifies the dataset.
records represents the dataset records (in real systems this could be streamed; in this in-memory problem it is provided as a list).
datasetVersion is used to detect dataset changes between pause and resume.
Iteration
boolean hasNext()
- Returns
true if there is at least one more record to read.
- If the iterator is currently paused, it may still return
true if records remain.
String next()
- Returns the next record and advances the iterator by 1 position.
- If there is no next record, returns an empty string
"".
- If the iterator is paused, returns an empty string
"" (must call resume(...) to continue).
State / Resuming
int position()
- Returns the current 0-indexed integer position of the next record (the index of the next record that
next() would return).
- Does not change iterator state (does not pause).
- This value must be the same
position that is embedded inside the token returned by pause().
String pause()
- Pauses the iterator and returns a serialized cursor token.
- After pausing,
next() must return "" until resume(...) succeeds.
- Calling
pause() multiple times should be safe and return a token for the same position.
- The returned token must be formatted exactly as:
datasetId=<int>,version=<int>,position=<int>
- Example:
datasetId=144,version=16,position=501
position inside the token is the 0-indexed index of the next record.
version inside the token refers to the dataset's current datasetVersion.
boolean resume(String stateToken)
- Attempts to resume the iterator from the provided cursor token.
- Returns
true if resume succeeds, otherwise false.
- The token must be parsed using the required format:
datasetId=<int>,version=<int>,position=<int>.
- Resume must fail (return
false) if:
- The token is invalid/corrupted and cannot be parsed into three integer fields.
- The token belongs to a different
datasetId.
- The token’s
version does not match the current datasetVersion (dataset changed).
- The token’s
position points outside 0..records.size().
- If resume succeeds, the iterator becomes unpaused and continues from that position.
Constraints
1 ≤ records.size() ≤ 10^7 (conceptually large; your solution must not copy or buffer all records again).
0 ≤ position ≤ records.size() (position is 0-indexed and refers to the next record).
1 ≤ datasetId ≤ 10^9.
1 ≤ datasetVersion ≤ 10^9.
0 ≤ record.length() ≤ 10^4.
- The iterator’s extra memory (excluding the provided dataset list) should be
O(1).
- Cursor token is a
String but must follow the exact formatting rules described above.
Notes on Dataset Changes
- Your cursor token must include enough information to detect changes safely (at minimum:
datasetId, version, and position).
- If the dataset changes between pause and resume, do not attempt to “best-effort” resume; return
false.
- Resuming at the end (position equals
records.size()) is valid:
hasNext() returns false.
next() returns "".
Examples
Example 1: Basic iteration, position tracking, pause
ResumableIterator it = new ResumableIterator(datasetId=1, records=List.of("A","B","C","D"), datasetVersion=10)
it.hasNext() → true
it.position() → 0
it.next() → "A"
it.position() → 1
it.pause() → "datasetId=1,version=10,position=1"
it.next() → "" (paused; must resume to continue)
Example 2: Resume later using saved state token
String token = "datasetId=1,version=10,position=1"
ResumableIterator it2 = new ResumableIterator(datasetId=1, records=List.of("A","B","C","D"), datasetVersion=10)
it2.resume(stateToken=token) → true
it2.next() → "B"
it2.next() → "C"
it2.position() → 3
Example 3: Resume fails when dataset version changes
String oldToken = "datasetId=1,version=10,position=2"
ResumableIterator it3 = new ResumableIterator(datasetId=1, records=List.of("A","B","X","C","D"), datasetVersion=11)
it3.resume(stateToken=oldToken) → false (version mismatch: 10 vs 11)
it3.position() → 0 (unchanged because resume failed)
it3.next() → "A"
Example 4: Resume at end of dataset
ResumableIterator it4 = new ResumableIterator(datasetId=2, records=List.of("P","Q"), datasetVersion=7)
it4.next() → "P"
it4.next() → "Q"
it4.hasNext() → false
it4.position() → 2
String endToken = it4.pause() → "datasetId=2,version=7,position=2"
ResumableIterator it5 = new ResumableIterator(datasetId=2, records=List.of("P","Q"), datasetVersion=7)
it5.resume(stateToken=endToken) → true
it5.hasNext() → false
it5.next() → ""
it5.position() → 2
Example 5: Invalid token (corrupted / wrong dataset)
ResumableIterator it6 = new ResumableIterator(datasetId=1, records=List.of("A","B","C"), datasetVersion=10)
it6.resume(stateToken="not-a-valid-token") → false
it6.resume(stateToken="datasetId=999,version=10,position=1") → false (datasetId mismatch)
it6.resume(stateToken="datasetId=1,version=10,position=999999") → false (position out of bounds)
it6.position() → 0