6/26/08

Write an efficient function to find the first nonrepeated character in a string.

At first, this task seems almost trivial. If a character is repeated, it must appear in at least two places in the string. Therefore, you can determine whether a particular character is repeated by comparing it with all other characters in the string. It’s a simple matter to perform this search for each character in the string, starting with the first. When you find a character that has no match elsewhere in the string you’ve found the first nonrepeated character.

What’s the time order of this solution? If the string is n characters long, then in the worst case you’ll make almost n comparisons for each of the n characters. That gives worst case O(n2) for this algorithm. You are unlikely to encounter the worst case for single-word strings, but for longer strings, such as a paragraph of text, it’s likely that most characters would be repeated, and the most common case might be close to the worst case. The ease with which you arrived at this solution suggests that there are better alternatives - if the answer were truly this trivial, the interviewer wouldn’t bother you with the problem. There must be an algorithm with a worst case better than O(n2).


Tip

The algorithm described can be improved somewhat by comparing each character with only the characters following it because it has already been compared with the characters preceding it. This would give you a total of (n – 1) + (n – 2)+ + 1 comparisons. As discussed in Chapter 3, this is still O(n2).

Why was the previous algorithm O(n2)? One factor of n came from checking each character in the string to determine whether it was nonrepeated. Because the nonrepeated character could be anywhere in the string, it seems unlikely that you’ll be able to improve efficiency here. The other factor of n was due to searching the entire string when trying to look up matches for each character. If you improve the efficiency of this search, you’ll improve the efficiency of the overall algorithm. The easiest way to improve search efficiency on a set of data is to put it in a data structure that allows more efficient searching. What data structures can be searched more efficiently than O(n)? Binary trees can be searched in O(log(n)). Arrays and hash tables both have constant time element lookup. (Hash tables have worst-case lookup of O(n) but the average case is O(1).) Begin by trying to take advantage of an array or hash table because these data structures offer the greatest potential for improvement.

You’ll want to be able to quickly determine whether a character is repeated, so you need to be able to search the data structure by character. This means you have to use the character as the index (in an array) or key (in a hash table). What values would you store in these data structures? A nonrepeated character appears only once in the string, so if you stored the number of times each character appeared, it would help you identify nonrepeating characters. You’ll have to scan the entire string before you have the final counts for each character.


Tip

You can convert a character to an integer in order to use it as an index. If the strings are restricted to ASCII characters, this gives you 128 different possible character values. Unicode characters as used in Java or C#, however, have 65,536 possible values.

Once you’ve completed this, you could scan through all the count values in the array or hash table looking for a 1. That would find a nonrepeated character, but it wouldn’t necessarily be the first one in the original string.

Therefore, you need to search your count values in the order of the characters in the original string. This isn’t difficult - you just look up the count value for each character until you find a 1. When you find a 1, you’ve located the first nonrepeated character.

Consider whether this new algorithm is actually an improvement. You will always have to go through the entire string to build the count data structure. In the worst case, you might have to look up the count value for each character in the string to find the first nonrepeated character. Because the operations on the array or hash you’re using to hold the counts are constant time, the worst case would be two operations for each character in the string, giving 2n, which is O(n) - a major improvement over the previous attempt.

Both hash tables and arrays provide constant-time lookup; you need to decide which one you will use. On the one hand, hash tables have a higher lookup overhead than arrays. On the other hand, an array would initially contain random values that you would have to take time to set to zero, whereas a hash table initially has no values. Perhaps the greatest difference is in memory requirements. An array would need an element for every possible value of a character. This would amount to a relatively reasonable 128 elements if you were processing ASCII strings, but if you had to process Unicode strings you would need more than 65,000 elements, assuming a 16-bit Unicode encoding. In contrast, a hash table would require storage for only the characters that actually exist in the input string. Therefore, arrays are a better choice for long strings with a limited set of possible character values; hash tables are more efficient for shorter strings or when there are many possible character values.

You could implement the solution either way. Assume the code may need to process Unicode strings (a safe bet these days) and choose the hash table implementation. You might choose to write the function in Java or C#, which have built-in support for both hash tables and Unicode. In outline form, the function you’ll be writing looks like this:

First, build the character count hash table:
For each character
If no value is stored for the character, store 1
Otherwise, increment the value
Second, scan the string:
For each character
Return character if count in hash table is 1
If no characters have count 1, return null

Now implement the function. Because you don’t know what class your function would be part of, implement it as a public static function (this is equivalent to a normal C function). Remember that the Java Hashtable stores references to Object, which means you can store the reference type Integer, but not the fundamental type int:

public static Character firstNonRepeated( String str ){
Hashtable charHash = new Hashtable();
int i, length;
Character c;
Integer intgr;

length = str.length();

// Scan str, building hash table
for (i = 0; i < length; i++) {
c = new Character(str.charAt(i));
intgr = (Integer) charHash.get(c);
if (intgr == null) {
charHash.put(c, new Integer(1));
} else {
// Increment count corresponding to c
charHash.put(c, new Integer(intgr.intValue() + 1));
}
}

// Search hashtable in order of str
for (i = 0; i < length; i++) {
c = new Character(str.charAt(i));
if (((Integer)charHash.get(c)).intValue() == 1)
return c;
}
return null;

}

The preceding code is actually quite inefficient from a memory standpoint, however. It allocates a lot of objects unnecessarily. The emphasis is not actually on how many times a character is repeated. You just want to know whether it occurs zero times, one time, or more than one time. Instead of storing integers in the hash table, why not just reserve two Object values for use as your “one time” and “more than one time” flags (with the null object meaning “zero times,” of course) and store those in the hash table. Here’s the simplified version:

public static Character firstNonRepeated( String str ){
Hashtable charHash = new Hashtable();
int i, length;
Character c;
Object seenOnce = new Object();
Object seenTwice = new Object();

length = str.length();

// Scan str, building hash table
for( i = 0; i < length; i++ ){
c = new Character(str.charAt(i));
Object o = charHash.get(c);
if( o == null ){
charHash.put( c, seenOnce );
} else if( o == seenOnce ){
// Increment count corresponding to c
charHash.put( c, seenTwice );
}
}

// Search hashtable in order of str
for( i = 0; i < length; i++ ){
c = new Character(str.charAt(i));
if( charHash.get(c) == seenOnce ){
return c;
}
}
return null;
}

A (significantly) further speedup could be achieved by implementing a faster char to Character mapping, possibly using an array to cache the mappings, or at least the most frequent mappings (such as for ASCII characters). Or use a hash table implementation that could directly store character char values.

No comments:

ITUCU