In one such system, there are 26 different labels possible and the ith data sample is classified to belong to the samples[i] label where samples is a string of lowercase English letters. However, for some data samples, samples[i] is equal to ? representing that the corresponding data sample label is missing and needs to be replaced with some lowercase English letter.
The overhead (or penalty) of any index of the string samples is defined as the number of indices before it that also have the same label.
For example,
The task is to replace all the characters ? so that the sum of the indices' overhead (or penalty) is minimum.
Given the string samples, interpolate the data such that the total overhead of all the indices is minimized. If there are multiple ways to get minimum overhead, print the lexicographically smallest possible string that satisfies the goal.
Note : A string p is lexicographically smaller than string q, if p is a prefix of q, is not equal to q, or there exists i, such that pi < qi and for all j < i it is satisfied that pj = qj. Note that while comparing pj and qj we consider their ASCII values, i.e. '[' and ']' are greater than any uppercase English letters. For example, "ABC" is lexicographically smaller than "BCD" and "ABCDE" but larger than "AAC" and "AACDE".
Given the string samples = "abcd?"
Some of the possible replacements are:
| Replacement | Overhead |
|---|---|
| abcda | 1 |
| abcdd | 1 |
| abcde | 0 |
| abcdz | 0 |
The strings "abcda", "abcdb", "abcdc" and "abcdd" overhead 1 because the last element has a duplicate character. We can see that the minimum overhead possible is 0. Since there can be multiple possible strings with 0 overhead, we choose the lexicographically smaller one i.e. abcde.
Complete the function determineOptimalSamples(String samples) in the editor below.
Enter a string samples consisting of the lowercase English alphabet and the character '?'.
| STDIN | FUNCTION |
|---|---|
| aaaa?aaaa | samples = "aaaa?aaaa" |
aaaabaaaa
Putting 'b' does not contribute to the total overhead and is lexicographically minimum.
| STDIN | FUNCTION |
|---|---|
| ?????? | samples = "??????" |
abcdef
This is the lexicographically smallest string that keeps the overhead least. The overhead will be 0 as there will be no duplicate character present in it.
Given the string samples = "aaaa?bbb??c?dd?e????"
Fixed character counts are: a=4, b=3, c=1, d=2, e=1. These fixed letters already contribute a total overhead of C(4,2) + C(3,2) + C(2,2) = 6 + 3 + 1 = 10.
There are 9 question marks. To keep the total overhead minimum, we should replace each ? using letters that have not appeared yet (so each such placement adds 0 overhead). The smallest unused letters are: f, g, h, i, j, k, l, m, n.
| Order of '?' encountered (left to right) | Chosen replacement | Reason |
|---|---|---|
| 1 | f | Smallest letter with current count 0 |
| 2 | g | Next smallest letter with current count 0 |
| 3 | h | Still avoids creating any new duplicate pair |
| 4 | i | Still avoids creating any new duplicate pair |
| 5 | j | Still avoids creating any new duplicate pair |
| 6 | k | Still avoids creating any new duplicate pair |
| 7 | l | Still avoids creating any new duplicate pair |
| 8 | m | Still avoids creating any new duplicate pair |
| 9 | n | Still avoids creating any new duplicate pair |
Output: aaaafbbbghciddjeklmn
Total overhead: 10. (All replacements are distinct new letters, so they add 0 extra overhead beyond the fixed part.)