52. OA - Minimize Total Overhead by Replacing '?'

OA - Minimize Total Overhead by Replacing '?'
The artificial intelligence researchers at Amazon are building an advanced data augmentation system to expand the training dataset for enhanced model performance.

In one such system, there are 26 different labels possible and the ith data sample is classified to belong to the samples[i] label where samples is a string of lowercase English letters. However, for some data samples, samples[i] is equal to ? representing that the corresponding data sample label is missing and needs to be replaced with some lowercase English letter.

The overhead (or penalty) of any index of the string samples is defined as the number of indices before it that also have the same label.

For example,

  • For the string "hello" the overheads are [0, 0, 0, 1, 0] corresponding to each index.
  • For the string "abc" the overheads are [0,0,0] corresponding to each index.
  • For the string "aaccbbc" the overheads are [0,1,0,1,0,1,2] corresponding to each index because before the last c character, there are 2 more c characters.

The task is to replace all the characters ? so that the sum of the indices' overhead (or penalty) is minimum.

Given the string samples, interpolate the data such that the total overhead of all the indices is minimized. If there are multiple ways to get minimum overhead, print the lexicographically smallest possible string that satisfies the goal.

Note : A string p is lexicographically smaller than string q, if p is a prefix of q, is not equal to q, or there exists i, such that pi < qi and for all j < i it is satisfied that pj = qj. Note that while comparing pj and qj we consider their ASCII values, i.e. '[' and ']' are greater than any uppercase English letters. For example, "ABC" is lexicographically smaller than "BCD" and "ABCDE" but larger than "AAC" and "AACDE".

Example

Given the string samples = "abcd?"

Some of the possible replacements are:

Replacement Overhead
abcda 1
abcdd 1
abcde 0
abcdz 0

The strings "abcda", "abcdb", "abcdc" and "abcdd" overhead 1 because the last element has a duplicate character. We can see that the minimum overhead possible is 0. Since there can be multiple possible strings with 0 overhead, we choose the lexicographically smaller one i.e. abcde.

Function Description

Complete the function determineOptimalSamples(String samples) in the editor below.

Constraints

  • 1 ≤ |samples| ≤ 105
  • It is guaranteed that s contains at least one character '?' and others are lowercase English letters or the character '?'

Input Format For Custom Testing

Enter a string samples consisting of the lowercase English alphabet and the character '?'.

Sample Case 0

Sample Input For Custom Testing

STDIN FUNCTION
aaaa?aaaa samples = "aaaa?aaaa"

Sample Output

aaaabaaaa

Explanation

Putting 'b' does not contribute to the total overhead and is lexicographically minimum.

Sample Case 1

Sample Input For Custom Testing

STDIN FUNCTION
?????? samples = "??????"

Sample Output

abcdef

Explanation

This is the lexicographically smallest string that keeps the overhead least. The overhead will be 0 as there will be no duplicate character present in it.

Sample Case 2 (More Complex)

Given the string samples = "aaaa?bbb??c?dd?e????"

Fixed character counts are: a=4, b=3, c=1, d=2, e=1. These fixed letters already contribute a total overhead of C(4,2) + C(3,2) + C(2,2) = 6 + 3 + 1 = 10.

There are 9 question marks. To keep the total overhead minimum, we should replace each ? using letters that have not appeared yet (so each such placement adds 0 overhead). The smallest unused letters are: f, g, h, i, j, k, l, m, n.

Order of '?' encountered (left to right) Chosen replacement Reason
1 f Smallest letter with current count 0
2 g Next smallest letter with current count 0
3 h Still avoids creating any new duplicate pair
4 i Still avoids creating any new duplicate pair
5 j Still avoids creating any new duplicate pair
6 k Still avoids creating any new duplicate pair
7 l Still avoids creating any new duplicate pair
8 m Still avoids creating any new duplicate pair
9 n Still avoids creating any new duplicate pair

Output: aaaafbbbghciddjeklmn

Total overhead: 10. (All replacements are distinct new letters, so they add 0 extra overhead beyond the fixed part.)





Please use Laptop/Desktop or any other large screen to add/edit code.