6/26/08

Remove Specified Characters from string.

Because this problem involves string manipulation and the string class is immutable, you’ll need to either convert the string to a character array or use a StringBuilder for your manipulations. Using an array will save you a lot of overhead, so that’s what you should use.

This problem breaks down into two separate tasks. For each character in str, you must determine whether it should be deleted. Then, if appropriate, you must delete the character. The second task, deletion, is discussed first.

Your initial task is to delete an element from an array. An array is a contiguous block of memory, so you can’t simply remove an element from the middle as you might with a linked list. Instead, you’ll have to rearrange the data in the array so it remains a contiguous sequence of characters after the deletion. For example, if you want to delete “c” from the string “abcd” you could either shift “a” and “b” forward one position (toward the end) or shift “d” back one position (toward the beginning). Either approach leaves you with the characters “abd” in contiguous elements of the array.

In addition to shifting the data, you need to decrease the size of the string by one character. If you shift characters before the deletion forward, you need to eliminate the first element; if you shift the characters after the deletion backward you need to eliminate the last element. Because you’re already keeping track of the string’s length (strings are not null-terminated as they are in C), shifting characters backward and decrementing the string length is the cleanest choice.

How would the proposed algorithm fare in the worst-case scenario in which you need to delete all the characters in str? For each deletion, you would shift all the remaining characters back one position. If str were n characters long, you would move the last character n – 1 times, the next to last n – 2 times, and so on, giving worst-case O(n2) for the deletion. Moving the same characters many times seems extremely inefficient. How might you avoid this?


Tip

If you start at the end of the string and work back toward the beginning, it’s somewhat more efficient but still O(n2) in the worst case.

What if you allocated a temporary string buffer and built your modified string there instead of in place? Then you could simply copy the characters you need to keep into the temporary string, skipping the characters you want to delete. When you finish building the modified string, you can copy it from the temporary buffer back into str. This way, you move each character at most twice, giving O(n) deletion. However, you’ve incurred the memory overhead of a temporary buffer the same size as the original string, and the time overhead of copying the modified string back over the original string. Is there any way you could avoid these penalties while retaining your O(n) algorithm?

To implement the O(n) algorithm just described, you need to track a source position for the read location in the original string and a destination position for the write position in the temporary buffer. These positions both start at zero. The source position is incremented every time you read, and the destination position is incremented every time you write. In other words, when you copy a character you increment both positions, but when you delete a character you increment only the source position. This means the source position will always be the same as or ahead of the destination position. Once you read a character from the original string (that is, the source position has advanced past it), you no longer need that character - in fact, you’re just going to copy the modified string over it. Because the destination position in the original string is always a character you don’t need anymore, you can write directly into the original string, eliminating the temporary buffer entirely. This is still an O(n) algorithm, but without the memory and time overhead of the earlier version.

Now that you know how to delete characters, consider the task of deciding whether to delete a particular character. The easiest way to do this is to compare the character to each character in remove and delete it if it matches any of them. How efficient is this? If str is n characters long and remove is m characters long, then in the worst case you make m comparisons for each of n characters, so the algorithm is O(nm). You can’t avoid checking each of the n characters in str, but perhaps you can make the lookup that determines whether a given character is in remove better than O(m).

If you’ve already read the solution to the section “Find the First Nonrepeated Character,” this should sound very familiar. Just as you did in that problem, you can use remove to build an array or hash table that has constant time lookup, thus giving an O(n) solution. The trade-offs between hashes and arrays have been discussed. In this case, an array is most appropriate when str and remove are long and characters have relatively few possible values (for example, ASCII strings). A hash table may be a better choice when str and remove are short or characters have many possible values (for example, Unicode strings). This time, the assumption is that you’re processing long ASCII strings and using an array instead of a hash table.


Tip

Why build an array? Isn’t remove already an array? Yes, it is, but it is an array of characters indexed by an arbitrary (that is, meaningless for this problem) position, requiring you to search through each element. The array that is referred to here would be an array of Boolean values indexed by all the possible values for a char. This enables you to determine whether a character is in remove by checking a single element.

Your function will have three parts:

  1. Set all the elements in your lookup array to false.

  2. Iterate through each character in remove, setting the corresponding value in the lookup array to true.

  3. Iterate through str with a source and destination index, copying each character only if its corresponding value in the lookup array is false.

Now that you’ve combined both subtasks into a single algorithm, analyze the overall efficiency for str of length n and remove of length m. Because the size of a character is fixed for a given platform, zeroing the array is constant time. You perform a constant time assignment for each character in remove, so building the lookup array is O(m). Finally, you do at most one constant time lookup and one constant time copy for each character in str, giving O(n) for this stage. Summing these parts yields O(n + m), so the algorithm has linear running time.

Having justified and analyzed your solution, you’re ready to code it:

stri ng removeChars( string str, string remove ){
char[] s = str.toCharArray();
char[] r = remove.toCharArray();
bool[] flags = new bool[128]; // assumes ASCII!
int len = s.Length;
int src, dst;

// Set flags for characters to be removed
for( src = 0; src < len; ++src ){
flags[r[src]] = true;
}
src = 0;
dst = 0;

// Now loop through all the characters,
// copying only if they aren't flagged
while( src < len ){
if( !flags[ (int)s[src] ] ){
s[dst++] = s[src];
}
++src;
}

return new string( s, 0, dst );
}

2 comments:

Tamer said...

There is a little correction here, when settings the flag for characters to be removed, you limit your loop to the length of the remove array, not the original string array, like in:

for (src = 0; src < removeArrLength; ++src)
{
flags[r[src]] = true;
}

Sumedh said...

Thats correct.
Thanks

ITUCU