Creating a C++ swear filter

black hole sun · Mar 22, 2006

Hi guys, I thought it might be fun to try to create a swear filter in C++. So, after many hours of labor and imported code from another project, here's what I've come up with.

Rage3d's own swear filter is messing up my results which is kind of lame, but you can use your imaginations to fill in the blanks.

Code:

// censor.cpp -- 3/22/06
// replaces naughty words with euphemisms or asterisks

#include <iostream>
#include <cstring>	// strstr, strlen, strcmp, strcat
#include "swearEnum.h"

using namespace std;

int main(void)
{
	char str[500];
	cout << "Enter a string full of bad words -- I'll censor it. \n";
	cin.getline(str, 499);
	
	//check the string for profanity
	checkForProfanity(str);

	//print out the (now-clean) string
	cout << str << endl;

	return 0;
}


//search the string for swear words
void checkForProfanity(char * str)
{
	if (strstr(str, "mother****er") != NULL)
		delProfanity(str, "mother****er", 12, mother****er);

	if (strstr(str, "mother****ing") != NULL)
		delProfanity(str, "mother****ing", 13, mother****ing);

	if (strstr(str, "****") != NULL)
		delProfanity(str, "****", 4, ****);		
	
	if (strstr(str, "****") != NULL)
		delProfanity(str, "****", 4, ****);
	
	if (strstr(str, "damn") != NULL)
		delProfanity(str, "damn", 4, damn);
	
	if (strstr(str, "*****") != NULL)
		delProfanity(str, "*****", 5, *****);
		
	if (strstr(str, "faggot") != NULL)
		delProfanity(str, "faggot", 6, faggot);

	if (strstr(str, "fag") != NULL)
		delProfanity(str, "fag", 3, fag);
}


/*  delProfanity accepts four args: 

	char *string	: the swear-laden user-inputted sentence;
								    
	char *srchTerm	: the swear word the function looks for and replaces;
						
	int wordCount	: the length of the swear word, in characters;

	char *replacement: a pointer to the array containing the replacement
					  for the swear word; usually, this array is full 
					  of stars (e.g., '****'), but that can be changed by modifying
					  swearEnum.h.  */


void delProfanity(char *string, char *srchTerm, int wordCount, char *replacement)
{	start:

	int lenOfSwearReplacement = strlen(replacement);

	char tempString[500] = {0};	// must be as large as the caller's 500-char buffer
	int  i = 0;
	char *ptrToFirstLetter = strstr(string, srchTerm);
	// the below pointer is the same as the above, but this
	// one remains unmodified so the same memory location
	// can be used later
	char *permPtrToFirstLetter = strstr(string, srchTerm);
	
	// split the sentence where the search term is
	while(*ptrToFirstLetter)
	{
		tempString[i] = *ptrToFirstLetter;
		ptrToFirstLetter++;
		i++;
	}
		 
	int lenOfSentence	  = strlen(string);
	int lenOfTempString	  = strlen(tempString);
	int difference		  = (lenOfSentence - lenOfTempString);
		
	// in case the search term happens to be at the end of 
	// a string -- in that case, the temporary string and
	// the search term will be equal.
	if(strcmp(tempString, srchTerm) == 0)
	{
		// chop off the swear
		string[difference] = '\0'; 
		// insert euphemism -- stars, usually, but it can be a replacement phrase
		strcat(string, replacement);
		return;
	}
		
	// chop the original string right where the temp
	// string is copied. 
	*permPtrToFirstLetter = '\0';

	// insert euphemism -- stars, usually, but it can be a replacement phrase
	strcat(string, replacement);

	//increment 'difference' the length of the euphemism
	difference += lenOfSwearReplacement;

	// now copy back the fragment in tempString[], omitting the search term.
	for(int j = (wordCount); j <= lenOfTempString; j++)
	{
		string[difference] = tempString[j];
		difference++;	
	}

	// if there is more than one occurrence of a word that needs to be deleted,
	// do all the above again. much love, goto!
	if(strstr(string, srchTerm))
		goto start;

}

The header file:

Code:

// swear dictionary, function declarations

#ifndef swear
#define swear

void checkForProfanity(char*);
void delProfanity(char*, char*, int, char*);

// common four-letter-words -- and their euphemisms.
// use these instead of stars for some comedic value 
// when the sentences are printed

/*  The below have been raped by Rage's swear filter...

char ****[] = "have sex with ",
       ****[] = "defecate ",
       damn[] = "darn ",
       *****[] = "female dog ",
       mother****er[] = "incestuous boy ",
       mother****ing[] = "incestuous ";
*/	 

// current replacement table:

char	   fag[] = "***",
	   faggot[] = "******",
	   ****[] = "****",
	   ****[] = "****",
	   damn[] = "****",
	   *****[] = "*****",
	   mother****er[] = "************",
   	   mother****ing[] = "*************";


#endif

It works just great, but my question is this: checkForProfanity(char * str) is kind of lame with all its if-statements. Is there any way to do this better? Like, with enums or something? (I've not learned enums yet, so I wouldn't know...)

Zenitram · Mar 22, 2006

hahaha, this is hilarious. Well, I think a better approach to this would be to first tokenize the string and then check an array of strings against a dictionary file. So the file would look something like...

<begin file>
f*ck
sh*t
d*ck
c*nt
p*ssy
<end file>

and so on. These are the bad-word tokens to check for. A simple implementation would tokenize the string at each space and check each whole token against every entry in the dictionary.

The next implementation would do that and also search for each dictionary entry inside each token, to automatically catch combined bad words such as "motherf*cker." A search for "f*ck" inside the token would come back positive, so you can filter the word out.

Another (more complex) method would be to derive a state machine for each string in the dictionary and then go through the input one letter at a time. If your state ends up at a valid end state for a bad word, you can propagate backward to filter out the characters that led to that state. This would filter out only the bad words and not combination words, unless you apply some rule for space containment.
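A minimal sketch of the first approach (split on spaces, compare whole tokens against the dictionary) might look like the following. The function name `censorTokens` and the placeholder words used to exercise it are assumptions for illustration, not part of the original code.

```cpp
#include <set>
#include <sstream>
#include <string>

// Hypothetical helper: replaces every whitespace-separated token that
// appears in badWords with a run of asterisks of the same length.
std::string censorTokens(const std::string& text,
                         const std::set<std::string>& badWords)
{
    std::istringstream in(text);
    std::string token, result;
    while (in >> token) {
        if (badWords.count(token) != 0)
            token.assign(token.size(), '*');   // same length, all stars
        if (!result.empty())
            result += ' ';                     // rejoin with single spaces
        result += token;
    }
    return result;
}
```

With a dictionary holding the placeholder words "darn" and "heck", `censorTokens("well darn it", words)` gives "well **** it". Note the limits of this sketch: matching is case-sensitive, runs of whitespace collapse to one space, and a token with punctuation attached (such as "darn!") is not caught.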

Have fun! =)

black hole sun · Mar 22, 2006

Whoa man! Slow it down, haha. What's a 'token'? I've only learned about three months' worth of C++, so my knowledge is pretty much limited to what's in my source file... I do know structs though, but I don't see how they'd help.

Zenitram · Mar 23, 2006

black hole sun said:
Whoa man! Slow it down, haha. What's a 'token'? I've only learned about three months' worth of C++, so my knowledge is pretty much limited to what's in my source file... I do know structs though, but I don't see how they'd help.

sorry mang. look up "strtok" and the string class on MSDN. that might give ya an idea what to do. the string class is like a struct but better: a struct with behavior and functionality specific to what it is. otherwise a token can be seen as a part of something. so the string:

Hi, I rock

could tokenize into something like:

Hi
I
rock
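As a rough sketch, the split above could be done with `strtok` like this. The helper name `tokenize` and the choice of space and comma as delimiters are assumptions made for the example.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: splits a sentence on spaces and commas using strtok.
// strtok writes into the buffer it scans, so it works on a private copy.
std::vector<std::string> tokenize(const std::string& sentence)
{
    std::vector<char> buffer(sentence.begin(), sentence.end());
    buffer.push_back('\0');                    // strtok needs a C string

    std::vector<std::string> tokens;
    for (char* tok = std::strtok(&buffer[0], " ,");
         tok != NULL;
         tok = std::strtok(NULL, " ,"))        // NULL means "keep going"
    {
        tokens.push_back(tok);
    }
    return tokens;
}
```

Calling `tokenize("Hi, I rock")` yields the three tokens "Hi", "I" and "rock". Keep in mind that `strtok` holds hidden state between calls, so two tokenizing loops cannot be interleaved.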

just makes it easier to apply rules to strings. good luck! keep asking questions if ya confused.

Squeek3018 · Mar 24, 2006

You could also make a .txt file with all the swear words and write a function that loops through the file. Instead of adding more code when you add a bad word, you just add the word to the .txt file.

absolutefunk · Mar 29, 2006

In Java, string tokens = the StringTokenizer class. Very easy to use. Ahhh, Java, gotta love it

-Brian

Synetech · Apr 17, 2006

Like you pointed out, the multiple if statements are not ideal. This is because they all perform the same action but with a different variable, which is exactly what loops are for. So, the short answer (you should have 4 months by now) is to replace the multiple if statements with a loop/array combination.

For example, instead of something like this:

Code:

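// compare with strcmp; str="foo" would assign, not compare
if (strcmp(str, "foo") == 0)
  doaction(str, "foo");
else if (strcmp(str, "bar") == 0)
  doaction(str, "bar");
else if (strcmp(str, "baz") == 0)
  doaction(str, "baz");
else if (strcmp(str, "test") == 0)
  doaction(str, "test");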

Use something like this:

Code:

#define numswears 4
const char *swearlist[numswears]={"foo", "bar", "baz", "test"};
for (int i=0; i<numswears; i++) {
  // strcmp returns 0 when the two strings are equal
  if (strcmp(str, swearlist[i]) == 0) {
    doaction(str, swearlist[i]);
    break;
  }
}
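Applied to checkForProfanity, the loop/array idea might look like the sketch below. It swaps the char buffer for `std::string` to keep the example short, and the `SwearEntry` struct, the table name and the placeholder entries are all assumptions rather than the original word list.

```cpp
#include <cstddef>
#include <string>

// One row per filtered word: what to look for and what to print instead.
struct SwearEntry {
    const char* term;
    const char* replacement;
};

// Placeholder table (illustrative entries only). Longer terms come
// first so that "darned" is handled before "darn".
const SwearEntry kSwearTable[] = {
    { "darned", "******" },
    { "darn",   "****"   },
    { "heck",   "****"   }
};
const std::size_t kSwearCount = sizeof(kSwearTable) / sizeof(kSwearTable[0]);

// Replaces every occurrence of every table entry in str.
void checkForProfanity(std::string& str)
{
    for (std::size_t i = 0; i < kSwearCount; ++i) {
        const std::string term(kSwearTable[i].term);
        const std::string repl(kSwearTable[i].replacement);
        std::string::size_type pos = 0;
        while ((pos = str.find(term, pos)) != std::string::npos) {
            str.replace(pos, term.size(), repl);
            pos += repl.size();   // resume the search after the inserted text
        }
    }
}
```

Unlike the snippet above, this version has no `break`, since a sentence may contain several different words from the table. Adding a word means adding one row, not another if statement.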

HTH

AluminumHaste · Apr 21, 2006

Zenitram said:
hahaha, this is hilarious. Well, I think a better approach to this would be to first tokenize the string and then check an array of strings against a dictionary file. So the file would look something like...

<begin file>
f*ck
sh*t
d*ck
c*nt
p*ssy
<end file>

and so on. These are the bad-word tokens to check for. A simple implementation would tokenize the string at each space and check each whole token against every entry in the dictionary.

The next implementation would do that and also search for each dictionary entry inside each token, to automatically catch combined bad words such as "motherf*cker." A search for "f*ck" inside the token would come back positive, so you can filter the word out.

Another (more complex) method would be to derive a state machine for each string in the dictionary and then go through the input one letter at a time. If your state ends up at a valid end state for a bad word, you can propagate backward to filter out the characters that led to that state. This would filter out only the bad words and not combination words, unless you apply some rule for space containment.

Have fun! =)

Isn't there a function called contains(String)?? That's part of the string class?

So:

Code:

    If myString.contains("****")
                //do your remove ****

FX-Overclocking · Apr 23, 2006

Wait a minute... you thought it would be "FUN" to sit down for hours creating a swear filter??? LOL

Zenitram · Apr 23, 2006

AluminumHaste said:
Isn't there a function called contains(String)?? That's part of the string class?

So:

Code:

If myString.contains("****") //do your remove ****

well... knowing that a string contains something is a little different from getting the information needed to remove it. So my string has a word that needs to be removed; great, we still need to figure out how to remove it. This can be used to skip unnecessary iterations, or, once the string is tokenized, to remove entire tokens and then reassemble the string. Useful function, but it does not solve all your problems.
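To make the difference concrete, here is a small sketch using `std::string::find`, which reports where a match starts rather than only whether one exists. The helper name `removeFirst` is an assumption for the example.

```cpp
#include <string>

// Removes the first occurrence of word from text, if present.
// Returns true when something was removed.
bool removeFirst(std::string& text, const std::string& word)
{
    const std::string::size_type pos = text.find(word);
    if (pos == std::string::npos)
        return false;              // a "contains" check would have said no
    text.erase(pos, word.size());  // the position is what makes removal possible
    return true;
}
```

The yes/no answer is the return value, and the position returned by `find` is the extra piece of information needed to actually erase or replace the word.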

Zenitram · Apr 24, 2006

FX-Overclocking said:
Wait a minute... you thought it would be "FUN" to sit down for hours creating a swear filter??? LOL

heck yea!

Synetech · Apr 24, 2006

A (proper) swear filter is actually more difficult than you would expect. Some (a lot?) of filters check for words inside others to catch things like !badword! and zBADWORDz. This is bad because it also catches legitimate words. Here's a test I did over at TV.com to test their swear filter:

Ping pong balls.
Butter cookies.
Mixed nuts.
A mangy pussycat.
It's all poppycock.
A stealthy assassin.
Professor Dick Solomon.
Accumulated goods.
The Shitepoke Rally
Pricked on a needle.
A Rolex wristwatch.

It filtered pussycat and wristwatch but nothing else. Like I said there, Hmmm... that's quite a retarded filter.
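One way to avoid that kind of false positive is to match only when the word is not embedded in a longer word. The sketch below is a simple take on that idea, with the function names and the placeholder word "hell" chosen for illustration.

```cpp
#include <cctype>
#include <string>

// True if index i is inside s and the character there is a letter.
static bool isLetterAt(const std::string& s, std::string::size_type i)
{
    return i < s.size() && std::isalpha(static_cast<unsigned char>(s[i])) != 0;
}

// Stars out word only where it stands alone, i.e. where it has no
// letter directly before or after it.
std::string censorWholeWord(std::string text, const std::string& word)
{
    if (word.empty())
        return text;
    std::string::size_type pos = 0;
    while ((pos = text.find(word, pos)) != std::string::npos) {
        const bool letterBefore = pos > 0 && isLetterAt(text, pos - 1);
        const bool letterAfter  = isLetterAt(text, pos + word.size());
        if (!letterBefore && !letterAfter)
            text.replace(pos, word.size(), word.size(), '*');
        pos += word.size();        // move past this match either way
    }
    return text;
}
```

With this rule, `censorWholeWord("hello from hell!", "hell")` leaves "hello" alone and stars out only the standalone word. The trade-off is the one described above: punctuation-wrapped forms like !badword! are still caught, but zBADWORDz slips through, as does any variation in letter case.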

black hole sun · Apr 24, 2006

FX-Overclocking said:
Wait a minute... you thought it would be "FUN" to sit down for hours creating a swear filter??? LOL

I have no life and it was better than doing nothing.

And anyway, Synetech, yeah, it isn't exactly world class, but hey, what do you want? I've only been doing this C/C++ thing for a few months, and I think it's pretty neat.

Thanks for the enum suggestion though I think that would help.

AluminumHaste · Apr 24, 2006

Zenitram said:
well... knowing that a string contains something is a little different from getting the information needed to remove it. So my string has a word that needs to be removed; great, we still need to figure out how to remove it. This can be used to skip unnecessary iterations, or, once the string is tokenized, to remove entire tokens and then reassemble the string. Useful function, but it does not solve all your problems.

Visual Basic though not C++ :(

Code:

Dim myString As String
myString = "hairyballs"
Dim removeString As String = "balls"
Dim index As Integer
' IndexOf finds the whole substring; IndexOfAny would match any single character of it
index = myString.IndexOf(removeString)
myString = myString.Remove(index, removeString.Length)

MsgBox(myString)

Zenitram · Apr 24, 2006

AluminumHaste said:
Visual Basic though not C++

There is your problem :P

hehe

SyneTech said:
A (proper) swear filter is actually more difficult than you would expect. Some (a lot?) of filters check for words inside others to catch things like !badword! and zBADWORDz. This is bad because it also catches legitimate words.

Aye. This is where pretty detailed FSMs (finite state machines) come in. Any systematic way of applying a ruleset will define how the filter works. You just gotta ask yourself how much pain you really want/need.
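For a rough idea of the state machine approach, here is a small sketch that builds one machine from the whole word list (a trie) and stars out the characters that led to an accepting state. The class name `WordMachine` and its interface are assumptions, and it deliberately has no word-boundary rules, so it catches embedded words too.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// A trie used as a state machine: each state maps an input character
// to the next state, and some states are marked as accepting.
class WordMachine {
public:
    WordMachine() : states_(1) {}              // state 0 is the start state

    void add(const std::string& word)
    {
        std::size_t s = 0;
        for (std::size_t i = 0; i < word.size(); ++i) {
            std::map<char, std::size_t>::iterator it = states_[s].next.find(word[i]);
            if (it == states_[s].next.end()) {
                states_.push_back(State());    // create the missing state
                const std::size_t fresh = states_.size() - 1;
                states_[s].next[word[i]] = fresh;
                s = fresh;
            } else {
                s = it->second;
            }
        }
        states_[s].accepting = true;
    }

    // Runs the machine from every start position. When an accepting state
    // is reached, the characters that led there are overwritten with '*'.
    std::string censor(std::string text) const
    {
        const std::string original = text;     // match against unmodified input
        for (std::size_t start = 0; start < original.size(); ++start) {
            std::size_t s = 0;
            for (std::size_t i = start; i < original.size(); ++i) {
                std::map<char, std::size_t>::const_iterator it =
                    states_[s].next.find(original[i]);
                if (it == states_[s].next.end())
                    break;                     // no transition: restart later
                s = it->second;
                if (states_[s].accepting)
                    for (std::size_t k = start; k <= i; ++k)
                        text[k] = '*';
            }
        }
        return text;
    }

private:
    struct State {
        State() : accepting(false) {}
        std::map<char, std::size_t> next;
        bool accepting;
    };
    std::vector<State> states_;
};
```

After `add("darn")` and `add("heck")`, calling `censor("oh darn, what the heck")` returns "oh ****, what the ****". How much pain you want then comes down to the rules layered on top: word boundaries, letter case, look-alike characters and so on.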
