Original poster: hanszhu

Parsing and Formatting Text using Java

Parsing and Formatting Text

Parsing and formatting text is a large, open-ended topic. So far in this chapter, we’ve looked at only primitive operations on strings—creation, basic editing, searching, and turning simple values into strings. Now we’d like to move on to more structured forms of text. Java has a rich set of APIs for parsing and printing formatted strings, including numbers, dates, times, and currency values. We’ll cover most of these topics in this chapter, but we’ll wait to discuss date and time formatting until Chapter 11.

We’ll start with parsing—reading primitive numbers and values as strings and chopping long strings into tokens. Then we’ll go the other way and look at formatting strings and the java.text package. We’ll revisit the topic of internationalization to see how Java can localize parsing and formatting of text, numbers, and dates for particular locales. Finally, we’ll take a detailed look at regular expressions, the most powerful text-parsing tool Java offers. Regular expressions let you define your own patterns of arbitrary complexity, search for them, and parse them from text.

We should mention that you’re going to see a great deal of overlap between the new formatting and parsing APIs (printf and Scanner) introduced in Java 5.0 and the older APIs of the java.text package. The new APIs effectively replace many of the old ones and in some ways are easier to use. Nonetheless, it’s good to know about both because so much existing code uses the older APIs.

Parsing Primitive Numbers
In Java, numbers and booleans are primitive types, not objects. But for each primitive type, Java also defines a primitive wrapper class. Specifically, the java.lang package includes the following classes: Byte, Short, Integer, Long, Float, Double, and Boolean. We talked about these in Chapter 1, but we bring them up now because these classes hold static utility methods that know how to parse their respective types from strings. Each of these primitive wrapper classes has a static “parse” method that reads a String and returns the corresponding primitive type. For example:

    byte b = Byte.parseByte( "16" );
    int n = Integer.parseInt( "42" );
    long l = Long.parseLong( "99999999999" );
    float f = Float.parseFloat( "4.2" );
    double d = Double.parseDouble( "99.99999999" );
    boolean bool = Boolean.parseBoolean( "true" );
    // Prior to Java 5.0 use:
    // boolean bool = new Boolean( "true" ).booleanValue();

Alternatively, java.util.Scanner provides a single API for not only parsing individual primitive types from strings, but also reading them from a stream of tokens. This example shows how to use it in place of the preceding wrapper classes:

    byte b = new Scanner( "16" ).nextByte();
    int n = new Scanner( "42" ).nextInt();
    long l = new Scanner( "99999999999" ).nextLong();
    float f = new Scanner( "4.2" ).nextFloat();
    double d = new Scanner( "99.99999999" ).nextDouble();
    boolean bool = new Scanner( "true" ).nextBoolean();
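As a rough sketch of how the two approaches differ in practice (the class and method names here are made up for illustration and are not from the original text): each wrapper-class call parses one isolated string, whereas a single Scanner can pull several typed values in sequence from whitespace-separated text.

```java
import java.util.Scanner;

class PrimitiveParsing {
    /** Reads a byte, an int, and a boolean, in that order, from one piece of text. */
    static String describe(String text) {
        try (Scanner sc = new Scanner(text)) {
            byte b = sc.nextByte();          // first token
            int n = sc.nextInt();            // second token
            boolean flag = sc.nextBoolean(); // third token
            return b + "/" + n + "/" + flag;
        }
    }

    public static void main(String[] args) {
        // Wrapper classes: one static call per string.
        int n = Integer.parseInt("42");
        System.out.println(n);                               // 42

        // parseBoolean() is true only for "true" (ignoring case).
        System.out.println(Boolean.parseBoolean("TRUE"));    // true
        System.out.println(Boolean.parseBoolean("yes"));     // false

        // Scanner: successive typed tokens from the same text.
        System.out.println(describe("16 42 true"));          // 16/42/true
    }
}
```

Note that Boolean.parseBoolean() never throws: any string other than "true" (in any letter case) simply yields false.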

Working with alternate bases
It’s easy to parse integer type numbers (byte, short, int, long) in alternate numeric bases. You can use the parse methods of the primitive wrapper classes by simply specifying the base as a second parameter:

    long l = Long.parseLong( "CAFEBABE", 16 );  // l = 3405691582
    byte b = Byte.parseByte( "12", 8 );  // b = 10

The integer methods of the Java 5.0 Scanner class described earlier (nextByte(), nextShort(), nextInt(), and nextLong()) also accept a base as an optional argument:

    long l = new Scanner( "CAFEBABE" ).nextLong( 16 );  // l = 3405691582
    byte b = new Scanner( "12" ).nextByte( 8 );  // b = 10

You can go the other way and convert a long or int value to a string in a specified base using the static two-argument toString() methods of the Integer and Long classes:

    String s = Long.toString( 3405691582L, 16 );  // s = "cafebabe"

For convenience, each class also has a static toHexString() method for working with base 16:

    String s = Integer.toHexString( 255 ).toUpperCase();  // s = "FF"
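Putting the two directions together, here is a minimal sketch of a base converter built only from Long.parseLong() and Long.toString(). The RadixDemo class and its convert() helper are hypothetical names chosen for this illustration.

```java
class RadixDemo {
    /** Re-expresses an integer written in one base as a string in another base. */
    static String convert(String digits, int fromBase, int toBase) {
        long value = Long.parseLong(digits, fromBase); // string -> number
        return Long.toString(value, toBase);           // number -> string
    }

    public static void main(String[] args) {
        System.out.println(convert("CAFEBABE", 16, 10)); // 3405691582
        System.out.println(convert("12", 8, 10));        // 10
        System.out.println(convert("255", 10, 2));       // 11111111
    }
}
```

Long.toString() produces lowercase letters for digits above 9, so call toUpperCase() on the result if you want output like "FF".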

Number formats
The preceding wrapper class parser methods handle the case of numbers formatted using only the simplest English conventions with no frills. If these parse methods do not understand the string, either because it’s simply not a valid number or because the number is formatted in the convention of another language, they throw a NumberFormatException:

    // Italian formatting
    double d = Double.parseDouble( "1.234,56" );  // NumberFormatException

The Scanner API is smarter and can use Locales to parse numbers in specific languages with more elaborate conventions. For example, in a default locale that groups digits with commas, such as U.S. English, the Scanner can handle comma-formatted numbers:

    int n = new Scanner( "99,999,999" ).nextInt();

You can specify a Locale other than the default with the useLocale() method. Let’s parse that value in Italian now:

    double d = new Scanner( "1.234,56" ).useLocale( Locale.ITALIAN ).nextDouble();

If the Scanner cannot parse a string, it throws a runtime InputMismatchException:

    double d = new Scanner( "garbage" ).nextDouble();  // InputMismatchException

Prior to Java 5.0, this kind of parsing was accomplished using the java.text package with the NumberFormat class. The classes of the java.text package also allow you to parse additional types, such as dates, times, and localized currency values, that aren’t handled by the Scanner. We’ll look at these later in this chapter.
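Because the Scanner's behavior depends on the default locale unless you say otherwise, it can be safer to pass the locale explicitly. The following is only a sketch under that assumption; the LocalizedNumbers class and its parse() helper are invented for this example, and returning an empty OptionalDouble is just one possible way to report a failed parse.

```java
import java.util.InputMismatchException;
import java.util.Locale;
import java.util.OptionalDouble;
import java.util.Scanner;

class LocalizedNumbers {
    /** Parses the first token of text as a double using the given locale's conventions. */
    static OptionalDouble parse(String text, Locale locale) {
        try (Scanner sc = new Scanner(text).useLocale(locale)) {
            return OptionalDouble.of(sc.nextDouble());
        } catch (InputMismatchException e) {
            // The token is not a number in this locale's format.
            return OptionalDouble.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("1.234,56", Locale.ITALIAN)); // Italian grouping and decimal marks
        System.out.println(parse("1,234.56", Locale.US));      // U.S. grouping and decimal marks
        System.out.println(parse("garbage", Locale.US));       // not a number
    }
}
```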

Tokenizing Text

A common programming task involves parsing a string of text into words or “tokens” that are separated by some set of delimiter characters, such as spaces or commas. The first example contains words separated by single spaces. The second, more realistic problem involves comma-delimited fields.

   
    Now is the time for all good men (and women)...

    4231,         Java Programming, 1000.00

Java has several (unfortunately overlapping) APIs for handling situations like this. The most powerful and useful are the String split() and Scanner APIs. Both utilize regular expressions to allow you to break the string on arbitrary patterns. We haven’t talked about regular expressions yet, but in order to show you how this works we’ll just give you the necessary magic and explain in detail later in this chapter. We’ll also mention a legacy utility, java.util.StringTokenizer, which uses simple character sets to split a string. StringTokenizer is not as powerful, but doesn’t require an understanding of regular expressions.

The String split() method accepts a regular expression that describes a delimiter and uses it to chop the string into an array of Strings:

    String text = "Now is the time for all good men";
    String [] words = text.split("\\s");
    // words = "Now", "is", "the", "time", ...

    text = "4231,         Java Programming, 1000.00";
    String [] fields = text.split("\\s*,\\s*");
    // fields = "4231", "Java Programming", "1000.00"

In the first example, we used the regular expression \\s, which matches a single whitespace character (such as a space, tab, newline, or carriage return). The split() method returned an array of eight strings. In the second example, we used a more complicated regular expression, \\s*,\\s*, which matches a comma surrounded by any number of contiguous whitespace characters (possibly zero). This reduced our text to three nice, tidy fields.
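One consequence of \\s matching exactly one whitespace character is that two spaces in a row produce an empty string between them. The sketch below (SplitDemo and its helpers are hypothetical names, not part of the original text) contrasts that with \\s+, which treats a run of whitespace as a single delimiter.

```java
import java.util.Arrays;

class SplitDemo {
    /** Splits on runs of whitespace; trim() avoids a leading empty string. */
    static String[] words(String text) {
        return text.trim().split("\\s+");
    }

    /** Splits comma-delimited fields, discarding whitespace around each comma. */
    static String[] fields(String line) {
        return line.trim().split("\\s*,\\s*");
    }

    public static void main(String[] args) {
        String text = "Now  is the"; // note the two spaces after "Now"

        // A single \s leaves an empty string between the two spaces.
        System.out.println(Arrays.toString(text.split("\\s"))); // [Now, , is, the]
        System.out.println(Arrays.toString(words(text)));       // [Now, is, the]

        System.out.println(Arrays.toString(
            fields("4231,         Java Programming, 1000.00")));
        // [4231, Java Programming, 1000.00]
    }
}
```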

With the new Scanner API, we could go a step further and parse the numbers of our second example as we extract them:

   
    String text = "4231,         Java Programming, 1000.00";
    Scanner scanner = new Scanner( text ).useDelimiter("\\s*,\\s*");
    int checkNumber = scanner.nextInt();  // 4231
    String description = scanner.next();  // "Java Programming"
    float amount = scanner.nextFloat();  // 1000.00

Here, we’ve told the Scanner to use our regular expression as the delimiter and then called it repeatedly to parse each field as its corresponding type. The Scanner is convenient because it can read not only from Strings but directly from stream sources, such as InputStreams, Files, and Channels:

   
    Scanner fileScanner = new Scanner( new File("spreadsheet.csv") );
    fileScanner.useDelimiter( "\\s*,\\s*" );
    // ...
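To make the stream case a little more concrete, here is a sketch that totals the amount column of records shaped like our sample line. It assumes every non-blank line has exactly three fields (check number, description, amount) and U.S. number formatting; the CheckTotals class and totalAmount() method are names invented for this illustration. It reads from an InputStream so it can be fed by a file, a socket, or an in-memory buffer alike.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Scanner;

class CheckTotals {
    /** Sums the third field of each "number, description, amount" line. */
    static double totalAmount(InputStream in) {
        double total = 0;
        try (Scanner lines = new Scanner(in, "UTF-8")) {
            while (lines.hasNextLine()) {
                String line = lines.nextLine().trim();
                if (line.isEmpty()) continue; // skip blank lines
                Scanner fields = new Scanner(line)
                        .useDelimiter("\\s*,\\s*")
                        .useLocale(Locale.US); // assumption: '.' is the decimal mark
                fields.nextInt();              // check number (ignored here)
                fields.next();                 // description (ignored here)
                total += fields.nextDouble();  // amount
            }
        }
        return total;
    }

    public static void main(String[] args) {
        String data = "4231,         Java Programming, 1000.00\n"
                    + "4232, Coffee, 2.50\n";
        InputStream in = new ByteArrayInputStream(data.getBytes(StandardCharsets.UTF_8));
        System.out.println(totalAmount(in)); // 1002.5
    }
}
```

To read a real file, you could pass a java.io.FileInputStream instead of the in-memory stream used in main().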

Another thing that you can do with the Scanner is to look ahead with the “hasNext” methods to see if another item is coming:

    while ( scanner.hasNextInt() ) {
        int n = scanner.nextInt();
        ...
    }
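The look-ahead methods are also handy when the input mixes types. As a small sketch (the LookAhead class and sumInts() method are hypothetical), the loop below adds up the integer tokens in a string and uses hasNextInt() to decide whether to consume each token as a number or to skip it:

```java
import java.util.Scanner;

class LookAhead {
    /** Adds up the integer tokens in text, skipping any token that is not an int. */
    static int sumInts(String text) {
        int sum = 0;
        try (Scanner sc = new Scanner(text)) {
            while (sc.hasNext()) {
                if (sc.hasNextInt()) {
                    sum += sc.nextInt(); // safe: we already checked the type
                } else {
                    sc.next();           // discard the non-integer token
                }
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumInts("3 apples and 4 pears")); // 7
    }
}
```

Checking first means the loop never triggers an InputMismatchException, since nextInt() is called only when an integer token is waiting.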

StringTokenizer


Even though the StringTokenizer class that we mentioned is now a legacy item, it’s good to know that it’s there because it’s been around since the beginning of Java and is used in a lot of code. StringTokenizer allows you to specify a delimiter as a set of characters and matches any number or combination of those characters as a delimiter between tokens. The following snippet reads the words of our first example:

    String text = "Now is the time for all good men (and women)...";
    StringTokenizer st = new StringTokenizer( text );

    while ( st.hasMoreTokens() )  {
        String word = st.nextToken();
        ...
    }

We invoke the hasMoreTokens() and nextToken() methods to loop over the words of the text. By default, the StringTokenizer class uses standard whitespace characters (space, tab, newline, carriage return, and form feed) as delimiters. You can also specify your own set of delimiter characters in the StringTokenizer constructor. Any contiguous combination of the specified characters that appears in the target string is skipped between tokens:

    String text = "4231,     Java Programming, 1000.00";
    StringTokenizer st = new StringTokenizer( text, "," );

    while ( st.hasMoreTokens() )  {
        String word = st.nextToken();
        // word = "4231", "     Java Programming", " 1000.00"
    }

This isn’t as clean as our regular expression example. Here we used a comma as the delimiter, so we get extra leading whitespace in our description and amount fields. If we had added space to our delimiter string, the StringTokenizer would have broken our description into two words, “Java” and “Programming,” which is not what we wanted. A solution here would be to use trim() to remove the leading and trailing space on each element.
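The trim() approach might look like the following sketch. The TokenizerDemo class and fields() helper are made-up names for illustration; the only APIs used are StringTokenizer and String.trim().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class TokenizerDemo {
    /** Splits a line on commas and trims the whitespace around each field. */
    static List<String> fields(String line) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line, ",");
        while (st.hasMoreTokens()) {
            out.add(st.nextToken().trim());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(fields("4231,     Java Programming, 1000.00"));
        // [4231, Java Programming, 1000.00]

        // Consecutive delimiters are skipped as one, so an empty field disappears:
        System.out.println(fields("a,,b").size());   // 2
        System.out.println("a,,b".split(",").length); // 3
    }
}
```

The last two lines show a difference worth keeping in mind: because StringTokenizer skips any run of delimiter characters, it cannot report empty fields, while split() keeps the empty string between two adjacent commas.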
