Regular Expressions in Java
As of release 1.4, Java 2 comes with a regular expression package that supports patterns similar in style to Perl regular expressions. This page gives you a brief introduction to regular expressions if you're not already familiar with them, then covers Java-specific topics such as compile and match methods, Pattern objects and PatternSyntaxExceptions.

BASIC HANDLING OF STRINGS

The Java language includes a primitive data type char, which holds a 16-bit Unicode character. You can hold multiple characters in a String object, or in a StringBuffer object.

Methods such as equals and equalsIgnoreCase, startsWith and endsWith allow you to test Strings against one another. Methods such as indexOf and substring allow you to perform operations on a String. parseInt and other similarly named methods allow you to extract a number (in this example an int) from a String, although you do need to remember to catch the NumberFormatException that may be thrown.

In certain specialist applications, such as bioinformatics, multiple characters can also be usefully held in a char array, where they're likely to be dealt with character-by-character in a loop, as in DNA and RNA sequencing.

The StringTokenizer class allows you to take a String and step through it element-by-element (token-by-token) to handle it in chunks or sections. You can choose which character or characters separate the elements, so that the string is broken up in the way you want.

But, until Java release 1.4, the standard classes didn't include any way to ask "does this string look like xxxxxxx?". Why would we want to? Well, we might want to ask, "Does this string look like an email address?" and go on to define (in simple terms) an email address as a series of non-spaces, followed by an @ character, followed by another series of non-spaces.
Let's see how that is solved in Java 1.4:

    $ java Reg1 email graham@wellho.net or lisa@wellho.net for information
    "email" is NOT an email address
    "graham@wellho.net" IS a possible email address
    "or" is NOT an email address
    "lisa@wellho.net" IS a possible email address
    "for" is NOT an email address
    "information" is NOT an email address
    $

Looks good. The Reg1 class is short too, but it includes some strange looking text strings:

    import java.util.regex.*;

    public class Reg1 {
        public static void main (String [] args) {
            // Compile the pattern once, outside the loop
            Pattern email = Pattern.compile("^\\S+@\\S+$");
            for (int j=0; j<args.length; j++) {
                // Test each command line argument against the pattern
                Matcher fit = email.matcher(args[j]);
                if (fit.matches()) {
                    System.out.println ( "\"" + args[j] +
                        "\" IS a possible email address");
                } else {
                    System.out.println ( "\"" + args[j] +
                        "\" is NOT an email address");
                }
            }
        }
    }

AN INTRODUCTION TO REGULAR EXPRESSIONS

A regular expression works by matching a String against a template or pattern (a Pattern object in Java), and in its simplest form, returning a boolean to say "yes, the string does look like the pattern" or "no, that doesn't match".

Regular expressions have been around for many years. They originated in Unix utilities such as "grep" (the name comes from the ed editor command g/re/p, "globally search for a regular expression and print"), and are now supported by most modern languages. They are, however, a mystery to many people. That string of strange looking characters in our example above:

    "^\\S+@\\S+$"

would be enough to put many people off.

If you're familiar with grep, or have programmed in Perl, PHP, awk or Tcl, you'll have come across regular expressions already, at least to some extent. Beware. Each language has its own regular expression engine (PHP has two!); although the basics are the same, more advanced regular expressions differ from one language to another.
In the early days of Java, a regular expression engine was written by Jonathan Locke, and donated to the Apache Software Foundation; you can download it from:

    http://jakarta.apache.org/regexp/index.html

As of release 1.4 of Java, though, there's a standard package java.util.regex that's shipped with the JRE, and that's what we'll look at in this module.

THE ELEMENTS OF A REGULAR EXPRESSION

A regular expression match is a "woolly match". When you don't want to ask "does A equal B?" but rather "does A look like B?", then you're probably looking for a regular expression.

Does this seem a bit alien to the world of computers? Are you expecting to be giving precise programming instructions and getting a bit worried when I tell you that the match is woolly or fuzzy? Fear not; it's up to the programmer to define every element of the fuzziness.

Let's have a look at our crude email address matcher. What does an email address look like?

    It STARTS WITH
    A NON-SPACE CHARACTER
    (and there's ONE OR MORE OF THOSE).
    That's followed by
    LITERALLY AN @ CHARACTER
    then A NON-SPACE CHARACTER
    (and there's ONE OR MORE OF THOSE)
    and that's then THE END OF THE STRING

If you read it carefully, the description is reasonable enough, even though I have laid it out in an odd way. That's so that you can see in a moment how we translate this description in English into a regular expression.

Whether we're looking at English or a regular expression, we have four types of element in our description above:
-- Assertions ("Starts With" and "Ends With")
-- Literal characters ("Literally an @")
-- Any character from a group ("a non-space character")
-- Counts (1 or more of those)
and those are the basic element groupings in a regular expression.

Translating, we'll add the regular expression down the right hand side:

    It STARTS WITH                          ^
    A NON-SPACE CHARACTER                   \S
    (and there's ONE OR MORE OF THOSE).     +
    That's followed by
    LITERALLY AN @ CHARACTER                @
    then A NON-SPACE CHARACTER              \S
    (and there's ONE OR MORE OF THOSE)      +
    and that's then THE END OF THE STRING.  $

All we then need to do is combine it into a single string, and add extra \ characters to ensure that the \ of \S gets past the Java language and into the regex class, thus:

    "^\\S+@\\S+$"

Within Java, there are three stages to regular expression matching.

Firstly, you define a Pattern object against which you can do the matching, and that typically takes a String parameter. Chances are you'll have a number of matches to do using the same pattern, so you don't want your Java to be slowed down interpreting its curious input string every time. So:

    Pattern email = Pattern.compile("^\\S+@\\S+$");

doesn't actually do any matching; it runs the regular expression compiler and prepares to do the matching using the object called email as the pattern or template.

Secondly, you apply the pattern to a particular input (and we happen to do this within a loop):

    Matcher fit = email.matcher(args[j]);

This creates an object of type Matcher that ties the pattern to the String you want to test.

The final stage of our matching is to get the result. In this first instance, we're simply interested in knowing if the match succeeded or not, so we'll use the boolean matches method to find out:

    if (fit.matches()) {

EXPANDING THE POWER OF REGULAR EXPRESSIONS

There are complete books on regular expressions, and on our Perl and PHP courses we spend up to half a day studying them. We won't go quite so far in this module, but we will look at each of the elements of the regular expression handler in Java and show you the power and flexibility it gives you. It's not yet a "core Java" subject, but in a short time it may be!

LITERAL CHARACTERS

If you use most characters within a regular expression, then they are matched exactly.
For example:

    Pattern feline = Pattern.compile("cat");

sets up a pattern which, when you look for it within a String, will match any String that contains the letters c-a-t in that order somewhere within, so it matches
-- cat
-- catalogue
-- The use of regular expressions is vindicated

To match literally a Unicode or other character that's special to the String handler, you precede it with a \ in the usual way, thus you may have literals such as:

    \n    \t    \u00a3    \\

To match literally a character that's special to the regular expression engine, you also need to precede the character with a \, but in this case the \ must itself be protected as you require it to be passed on through the double quote handler and reach the compile method. Thus, to match a String that literally contains a + character, you would write:

    Pattern adder = Pattern.compile("\\+");

ANCHORS AND ASSERTIONS

You'll have noticed that a regular expression that contains only literal characters looks for the given pattern within the string. It does not force a match against the whole String. If you want to match at the start of a String, start your pattern with a ^ character; if you want to match at the end, conclude your pattern with a $ character. Should you specify both a ^ and a $, then you're looking to match the complete String to your regular expression.

The ^ and $ elements are known as "anchors" as they tie the start and/or the end of the String down; this group as a whole is also known as "assertions" because they don't match any specific characters in the incoming string, they just assert that a certain condition must hold at the given point in the match.
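As a runnable sketch of the anchors described above, the following program tries four patterns against three sample Strings. The class name AnchorDemo and the helper found are hypothetical, chosen for illustration. It uses the Matcher's find method, so that only the anchors written into the pattern tie the match down.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original set of Reg examples.
class AnchorDemo {

    // True if the regular expression is found anywhere in the text.
    // find() is used so that no implicit anchoring is applied.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String[] samples = { "cat", "catalogue", "vindicated" };
        String[] regexes = { "cat", "^cat", "cat$", "^cat$" };
        for (String regex : regexes) {
            for (String sample : samples) {
                System.out.println(regex + " in " + sample + ": "
                        + found(regex, sample));
            }
        }
    }
}
```

With find, "cat" is located in all three samples, while "^cat$" is located only in the first.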
Example:

    Pattern feline = Pattern.compile("^cat");

matches:
-- cat
-- catalogue
but not:
-- The use of regular expressions is vindicated

Example:

    Pattern feline = Pattern.compile("cat$");

matches:
-- cat
but not:
-- catalogue
-- The use of regular expressions is vindicated

Important note: It might appear that Java regular expressions are anchored by default with both a ^ and a $ character. This is how the matches method that we're using at present works. With alternative methods such as find, anchoring reverts to the traditional (but more confusing) "default off" status that it has in other programming languages.

CHARACTER GROUPS

With anchors and literal characters, you can look for a String that starts with, contains, ends with, or exactly matches another String. The mechanism is clear enough, but you could (if you think about it) have used methods such as startsWith just as easily. The power of regular expressions really comes into its own when you start adding in character groups.

If you write [abcdef] in your regular expression, then you're matching any one character from the list given (a b c d e or f). You can expand this capability further by using a minus sign to specify a character range, thus [a-f] matches any one of the same six letters.
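A minimal sketch of character ranges follows. The class RangeDemo and its method names are hypothetical, and the hexadecimal example is chosen purely for illustration. It shows a single range and several ranges combined within one pair of square brackets.

```java
import java.util.regex.Pattern;

// Hypothetical demo class showing character ranges inside [ ].
class RangeDemo {

    // Any one letter from a to f, written as a range.
    static final Pattern HEX_LETTER = Pattern.compile("[a-f]");

    // Several ranges in one list: a hexadecimal digit in either case.
    static final Pattern HEX_DIGIT = Pattern.compile("[0-9a-fA-F]");

    // matches() tests the whole input, so these accept exactly one character.
    static boolean isHexLetter(String s) {
        return HEX_LETTER.matcher(s).matches();
    }

    static boolean isHexDigit(String s) {
        return HEX_DIGIT.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = { "c", "g", "7", "F", "ab" };
        for (String sample : samples) {
            System.out.println(sample + ": hex letter " + isHexLetter(sample)
                    + ", hex digit " + isHexDigit(sample));
        }
    }
}
```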
And if you want to match any character except one from a list, you can start the character list with a ^ character, for example [^abcdef].
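A short sketch of a negated list, assuming a hypothetical class NegatedListDemo and using vowels as the excluded characters. Note that inside square brackets the ^ means "not", which is quite different from its role as an anchor at the start of a pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class showing a negated character list.
class NegatedListDemo {

    // Any single character that is NOT a lower case vowel.
    static final Pattern NOT_VOWEL = Pattern.compile("[^aeiou]");

    // True if the whole input is exactly one non-vowel character.
    static boolean isSingleNonVowel(String s) {
        return NOT_VOWEL.matcher(s).matches();
    }

    // Counts the non-vowel characters by calling find() repeatedly.
    static int countNonVowels(String text) {
        Matcher m = NOT_VOWEL.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(isSingleNonVowel("x"));
        System.out.println(isSingleNonVowel("a"));
        System.out.println(countNonVowels("regex"));
    }
}
```

A space or a digit counts as "not a vowel" too; a negated list accepts everything that is not listed.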
There are some very common character groups you may want to specify; you could write "any white space character" as: [ \t\n\r\f\x0B] but that would get messy really fast, so there are some common groupings available in Java's regular expressions, such as \d (a digit), \s (a white space character) and \w (a word character: a letter, digit or underscore).
If you want any character except one of these, use a capital letter: \D, \S or \W.
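The following sketch exercises the predefined groups \d, \s and \w together with their capital letter opposites. The class ShorthandDemo and the helper wholeMatch are hypothetical; Pattern.matches is the standard static shortcut that compiles a pattern and tests the whole input in one call.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for the predefined character groups.
class ShorthandDemo {

    // Compiles the regex and tests it against the complete text.
    static boolean wholeMatch(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        // Each backslash is doubled to get it past the Java compiler.
        System.out.println("\\d on 7:   " + wholeMatch("\\d", "7"));
        System.out.println("\\D on 7:   " + wholeMatch("\\D", "7"));
        System.out.println("\\s on tab: " + wholeMatch("\\s", "\t"));
        System.out.println("\\S on tab: " + wholeMatch("\\S", "\t"));
        System.out.println("\\w on _:   " + wholeMatch("\\w", "_"));
        System.out.println("\\W on -:   " + wholeMatch("\\W", "-"));
    }
}
```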
Sequences such as \s will be familiar to you if you use Perl's regular expressions, but there are other character groups too; these use a POSIX standard definition of the character groups, but it's extended and the format isn't taken from Perl, nor PHP, nor Tcl nor SQL!
You can negate these groups using \P rather than \p, thus \P{Lower} matches any character that is not a lower case letter.
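As a sketch, here are a few of the POSIX-style groups that java.util.regex accepts (\p{Lower}, \p{Upper}, \p{Digit} and \p{Punct}) and the \P negation. Which groups to show is an assumption made for illustration, and the class name PosixDemo is hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for \p{...} groups and their \P{...} negation.
class PosixDemo {

    // Tests a single-group pattern against the complete text.
    static boolean wholeMatch(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        System.out.println(wholeMatch("\\p{Lower}", "a"));  // lower case letter
        System.out.println(wholeMatch("\\p{Upper}", "A"));  // upper case letter
        System.out.println(wholeMatch("\\p{Digit}", "5"));  // decimal digit
        System.out.println(wholeMatch("\\p{Punct}", "!"));  // punctuation
        System.out.println(wholeMatch("\\P{Lower}", "A"));  // NOT lower case
        System.out.println(wholeMatch("\\P{Lower}", "a"));  // fails: it is lower case
    }
}
```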
One final grouping, the ultimate group if you like, is the "." (full stop or period) character, which matches virtually any character (by default, anything except a line terminator).

COUNTS

The fourth main group (after anchors, literal characters, and character groups) is the counts; you use these in regular expressions if you want to give a quantity to a literal character or group, and you add the count character into your pattern directly after the element to which it applies. There are three very common counts: ?, * and +.
You might find it easier to read these as "0 or 1" (?), "0 or more" (*) and "1 or more" (+).
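A small sketch of the ?, * and + counts, each applied to the single character that precedes it. The class CountDemo and the sample words are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for the three common counts.
class CountDemo {

    // Tests the regex against the complete text.
    static boolean wholeMatch(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        // ? : the u may appear zero times or once
        System.out.println(wholeMatch("colou?r", "color"));
        System.out.println(wholeMatch("colou?r", "colour"));
        // * : the b may appear any number of times, including none
        System.out.println(wholeMatch("ab*c", "ac"));
        System.out.println(wholeMatch("ab*c", "abbbc"));
        // + : the b must appear at least once
        System.out.println(wholeMatch("ab+c", "ac"));
        System.out.println(wholeMatch("ab+c", "abc"));
    }
}
```

Every call above returns true except the test of "ab+c" against "ac", which fails because + demands at least one b.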
Remember the example we started this section with?

    "^\\S+@\\S+$"

Well, we can now read it through from start to end...

AN EXAMPLE

Here's a sample program that lets you run a regular expression engine against all the lines from a file. We've really rewritten the "grep" utility in Java, but our handler will take the more powerful regular expressions that Java supports:

    import java.util.regex.*;
    import java.io.*;

    public class Reg2 {
        public static void main (String [] args) throws IOException {
            // args[0] is the regular expression, args[1] the file to search
            File in = new File(args[1]);
            BufferedReader get = new BufferedReader(
                new FileReader( in ));
            Pattern hunter = Pattern.compile(args[0]);
            String line;
            int lines = 0;
            int matches = 0;
            System.out.print("Looking for "+args[0]);
            System.out.println(" in "+args[1]);
            while ((line = get.readLine()) != null) {
                lines++;
                // matches() tests the pattern against the whole line
                Matcher fit = hunter.matcher(line);
                if (fit.matches()) {
                    System.out.println ( "" + lines +": "+line);
                    matches++;
                }
            }
            get.close();
            if (matches == 0) {
                System.out.println("No matches in "+lines+" lines");
            }
        }
    }

And in use:

    $ java Reg2 ".*dog.*" /usr/share/dict/words
    Looking for .*dog.* in /usr/share/dict/words
    6459: bulldog
    6460: bulldogs
    13394: dog
    13396: dogged
    13397: doggedly
    13398: doggedness
    13399: dogging
    13400: doghouse
    13401: dogma
    13402: dogmas
    13403: dogmatic
    13404: dogmatism
    13405: dogs
    $ java Reg2 "[dD][aeiou]gg.*" /usr/share/dict/words
    Looking for [dD][aeiou]gg.* in /usr/share/dict/words
    11229: dagger
    12597: digger
    12598: diggers
    12599: digging
    12600: diggings
    13396: dogged
    13397: doggedly
    13398: doggedness
    13399: dogging
    $

FLAGS

How did we match to the word "dog"? We wrote ".*dog.*", but alas that would not have matched "Dog" or "DOG" as it's case sensitive. You can specify one or more flags (or'd together) as a second parameter to Pattern.compile. Flags available include Pattern.CASE_INSENSITIVE, Pattern.COMMENTS, Pattern.MULTILINE and Pattern.DOTALL.
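A minimal sketch of flags passed to Pattern.compile, assuming a hypothetical class FlagDemo. It compares the case sensitive ".*dog.*" pattern with a CASE_INSENSITIVE version, and then or's CASE_INSENSITIVE with COMMENTS so that white space and a trailing # comment can be written inside the pattern itself.

```java
import java.util.regex.Pattern;

// Hypothetical demo class showing flags given to Pattern.compile.
class FlagDemo {

    // Default behaviour: case sensitive.
    static final Pattern SENSITIVE = Pattern.compile(".*dog.*");

    // A single flag.
    static final Pattern INSENSITIVE =
        Pattern.compile(".*dog.*", Pattern.CASE_INSENSITIVE);

    // Two flags or'd together. With COMMENTS, white space in the
    // pattern is ignored and # starts a comment.
    static final Pattern COMMENTED = Pattern.compile(
        ".* dog .*   # any line that mentions a dog",
        Pattern.CASE_INSENSITIVE | Pattern.COMMENTS);

    public static void main(String[] args) {
        String[] samples = { "bulldog", "DOG", "Hot DOGS", "cat" };
        for (String sample : samples) {
            System.out.println(sample
                + ": sensitive " + SENSITIVE.matcher(sample).matches()
                + ", insensitive " + INSENSITIVE.matcher(sample).matches()
                + ", commented " + COMMENTED.matcher(sample).matches());
        }
    }
}
```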
SPLITTING

If you want to divide an incoming String at a particular regular expression, the split method allows you to do so. It's an alternative to the StringTokenizer, and you can use it without a loop and with a more complex separator. split returns an array of Strings. An optional additional parameter allows you to specify a limit to the number of strings that you want returned.

Here's a data file which has a mixture of spaces (sometimes several of them) and tabs between each field:

    passwd:     files nisplus nis
    shadow:     files nisplus nis
    group:      files nisplus nis
    hosts:      files dns
    bootparams: nisplus [NOTFOUND=return] files
    ethers:     files
    netmasks:   files
    networks:   files
    protocols:  files nisplus nis
    rpc:        files
    services:   files nisplus nis
    netgroup:   files nisplus nis
    publickey:  nisplus
    automount:  files nisplus nis
    aliases:    files nisplus

And we want to write an application which lets us find a list of all the lookups (the first word on each line) that may be handled by a particular service (the following words).
    import java.util.regex.*;
    import java.io.*;

    public class Reg3 {
        public static void main (String [] args) throws IOException {
            File in = new File("confdata");
            BufferedReader get = new BufferedReader(
                new FileReader( in ));
            // The service name to look for, in any mix of case
            Pattern hunter = Pattern.compile(args[0],
                Pattern.CASE_INSENSITIVE);
            // Field separator: an optional colon, then white space
            Pattern divisor = Pattern.compile(":?\\s+ # any white spaces",
                Pattern.COMMENTS);
            String line;
            while ((line = get.readLine()) != null) {
                String [] parts = divisor.split(line);
                // parts[0] is the lookup; the rest are the services
                for (int j=1; j<parts.length; j++) {
                    if (hunter.matcher(parts[j]).matches())
                        System.out.println("Used for "+parts[0]);
                }
            }
        }
    }

And the results:

    $ java Reg3 Nis
    Used for passwd
    Used for shadow
    Used for group
    Used for protocols
    Used for services
    Used for netgroup
    Used for automount
    $ java Reg3 DNS
    Used for hosts
    $

You'll notice how the separator characters have been stripped out of the array of strings that has been returned, a feature we've used to our benefit to strip off the excess colon on the first field of each line of our incoming data file.

CAPTURING THE STRING THAT MATCHED A PATTERN

The Pattern object is only half of the equation. We've already made lightweight use of the Matcher object, but it turns out that there's a lot more that we may want to do.

Recall our first example of matching email addresses? For sure, it's useful to have the facility that allows us to match against a regular expression and see whether or not we have something of the format of an email address. We may want to go a stage further and save the user name and domain name (the bits before and after the @ character) into separate variables.
Now, when the matching has actually been performed, it's clear that work has been done internally to see which bits of the incoming pattern match which bits of the String that we're matching against; all we need to add is:
-- a way to say "this is a bit that I'm interested in" and
-- a way to get back these interesting bits

Firstly, we indicate the "interesting bits" in our regular expression by grouping them in round brackets. Round brackets have a dual function in that a count can also be added directly after the brackets to repeat a pattern. We then use the group method to return the group(s) to us.

    import java.util.regex.*;

    public class Reg4 {
        public static void main (String [] args) {
            // Two capturing groups: before and after the @
            Pattern email = Pattern.compile("(\\S+)@(\\S+)");
            Matcher fit = email.matcher(args[0]);
            if (fit.find()) {
                // Group 0 is the whole match; 1 and 2 are the brackets
                for (int i=0; i<=fit.groupCount(); i++) {
                    System.out.println("We have "+ fit.group(i));
                }
            }
        }
    }

Let's run that:

    $ java Reg4 "At home, graham@wellho.net but away ..."
    We have graham@wellho.net
    We have graham
    We have wellho.net
    $

Note:
-- use of "find" to look within the string
-- use of () capturing brackets for subsequences
-- The whole match is returned as group number 0.

If you call the find method a second and subsequent times on the same Matcher, then you can make a series of successive matches. You'll get a false return when it runs out. Thus, simply by changing "if" to "while" in the previous example, you can look for a whole series of email addresses in a line of text.

    $ java Reg5 "Use graham@wellho.net or lisa@wellho.net to reach us"
    We have graham@wellho.net
    We have graham
    We have wellho.net
    We have lisa@wellho.net
    We have lisa
    We have wellho.net
    $

Further methods are available to have find start from a particular position, to reset it to look from the start, etc. There are also methods available that will return the start and end positions in the incoming string of the match, rather than the match string itself.
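The position methods mentioned above can be sketched as follows, using the Matcher methods start(), end(), find(int) and reset(). The class PositionDemo and its helper names are hypothetical; the pattern is the one used in Reg4.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for match positions.
class PositionDemo {

    static final Pattern EMAIL = Pattern.compile("(\\S+)@(\\S+)");

    // Returns a "start-end" entry for every match in the text.
    // end() is the index just past the last matched character.
    static List<String> spans(String text) {
        List<String> found = new ArrayList<String>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            found.add(m.start() + "-" + m.end());
        }
        return found;
    }

    // Start of the first match at or after the given index, or -1.
    static int firstStartFrom(String text, int from) {
        Matcher m = EMAIL.matcher(text);
        return m.find(from) ? m.start() : -1;
    }

    public static void main(String[] args) {
        String text = "Use graham@wellho.net or lisa@wellho.net";
        System.out.println(spans(text));
        System.out.println(firstStartFrom(text, 22));

        // reset() makes the next find() begin from the start again
        Matcher m = EMAIL.matcher(text);
        m.find();
        m.find();
        m.reset();
        m.find();
        System.out.println(m.group(1) + " starts at " + m.start(1));
    }
}
```

The start and end values follow the same convention as substring, so text.substring(m.start(), m.end()) gives back the same String as m.group().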
USING REGULAR EXPRESSIONS TO REPLACE ONE STRING BY ANOTHER

There are methods available in the Matcher that will let you replace a matched pattern with a specific string of text. These are replaceFirst and replaceAll. Let's change a phone number from a UK number into a full international one:

    import java.util.regex.*;

    public class Reg6 {
        public static void main (String [] args) {
            // A white space character followed by a leading zero
            Pattern phone = Pattern.compile("\\s0");
            Matcher action = phone.matcher(args[0]);
            String worldwide = action.replaceAll(" +44 (0) ");
            System.out.println(worldwide);
        }
    }

Which runs as:

    $ java Reg6 "phone 01225 708225 or fax 01225 707126"
    phone +44 (0) 1225 708225 or fax +44 (0) 1225 707126
    $

OTHER REGULAR EXPRESSION TOPICS

There's a whole book on regular expressions written a number of years ago before POSIX sequences, before Java ... before all these extras. Here are a few extra pointers to start you on your way to using the power of regular expressions, should they be the tool of choice for you:

EXCEPTIONS

The regular expression engine can throw exceptions if you look to define or match against ill-formed expressions. Under most circumstances, you will not wish to let your user provide a regular expression to your program, but if there's a chance of problems you should catch PatternSyntaxException.

EXTENDING REGULAR EXPRESSIONS

There are a number of additional constructions available within a regular expression that may be of use to you:
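As a sketch only: three constructions that java.util.regex does support are alternation (|), bounded counts ({n,m}) and non-capturing groups ((?: )). The choice of these three is an assumption made for illustration, and the class ExtrasDemo and its helpers are hypothetical. Since patterns like these are easy to mistype, the helper also catches PatternSyntaxException, as recommended in the EXCEPTIONS section.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class; the constructs shown are illustrative picks.
class ExtrasDemo {

    // Compiles a pattern; reports the problem and returns null
    // if the expression is ill-formed.
    static Pattern compileOrNull(String regex) {
        try {
            return Pattern.compile(regex);
        } catch (PatternSyntaxException e) {
            System.out.println("Bad pattern: " + e.getDescription()
                    + " near index " + e.getIndex());
            return null;
        }
    }

    // True only if the pattern compiles and matches the complete text.
    static boolean wholeMatch(String regex, String text) {
        Pattern p = compileOrNull(regex);
        return p != null && p.matcher(text).matches();
    }

    public static void main(String[] args) {
        // Alternation: either one word or the other
        System.out.println(wholeMatch("cat|dog", "dog"));
        // Bounded count: between three and five digits
        System.out.println(wholeMatch("\\d{3,5}", "1234"));
        // Non-capturing group: repeated as a unit, but not saved
        System.out.println(wholeMatch("(?:ab)+", "ababab"));
        // Ill-formed: the opening bracket is never closed
        System.out.println(wholeMatch("(unclosed", "anything"));
    }
}
```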
There are also "reluctant" quantifiers. When we were just matching and not capturing, the regular expression "<.*>" would correctly identify a tag within a String. But if we use that same regular expression when capturing, we need to be more careful. Consider the text:

    <i><b>Bold, Italic</b></i>

What would we get if we matched that against "<.*>"? You might hope to get a match to <i>, but you would not; you would match the whole of the incoming string. There is a subtext to the "*" count that reads "as many characters as possible please". In other words, it is what we call a "greedy" match.

Java supports reluctant or sparse counts as well as greedy ones. You can simply add an extra ? after the count character. Bear in mind that the easier greedy match is what you want in 90% of cases, but using this last example, you can match a single tag by writing "<.*?>"

OPERATIONS ON STRINGBUFFERS

The Matcher matches against a String, as you have seen in this module, but it can also be used to match against any other object that implements the CharSequence interface. Other standard classes that implement this interface are StringBuffer and CharBuffer.