=pod Code submission from: Paul Schaap (schaappa@ozemail.com.au) parseName 1.0.0 Release Notes WHY === WARNING the company dictionary "Company" is biased for Australia, Sorry. Somebody volunteer to modify for other countries ? Well I have often found myself wading through alot of data at work that is poory structured. Often a name field contains "MR P SCHAAP" but I really need to break it up into "MR" & "P" & "SCHAAP" so for example I can address a letter "DEAR MR SCHAAP" not quite as easy as you think when your data quality is poor, missing titles, full of Businesses etc. INSTALLATION ============ ï Install MacPerl 5.1.4r4 or greater.(Thanx Matthias) USAGE ===== 1) Drop a file with a file with names in it onto 'parseName Droplet'. 2) parseName presumes a pipe "|" delimited file with the name field as the first field. If you want to use a different delimiter and/or field alter the $fieldDelimiter and $fieldPosition values (Remember perl counts from zero).1) Drop a file with a file with names in it onto 'businessFinder Droplet'. CHANGES ======= V1.0.0 ï Well none really, first cut. CHEERS schaapp@ozemail.com.au - I Only read email once a fortnight, sorry. =cut #!/usr/bin/perl ##################################################################### # PROGRAM : parseName.pl # BY : Paul Schaap # DATED : 05/01/1998 # PURPOSE : To Parse Names ##################################################################### $|=1; print "STARTING parseName.pl\n"; $start_time = time; # Load Company Words #==================== $c = 0; open(COMPANY,"company") or die "Cannot open company !"; while(){ chop; s/(\W)/\\$1/g; tr/a-z/A-Z/; #print $_,"\n"; $C[$c]=$_; $c++; } $routime_time = $start_time; $current_time = time; $time_elapse = $current_time - $routime_time; $routime_time = $current_time; print "$c rows into company array ($time_elapse elapsed)\n"; # Load Title Words #================== @N=("MR","MRS","MISS","MS","DR","SIR","MADAM","LADY","MESSRS"); # BUSFINDER ROUTINE #=================== sub busfinder { $low = 0; $size = @C; $high = $size - 1; while($low <= $high){ $mid = int(($low + $high) / 2); if($C[$mid] eq $_[0]){ return $mid; } else { if($C[$mid] gt $_[0]){ $high = $mid - 1; } else { $low = $mid + 1; } } } return -1; } # PROPER CASING SUB-ROUTINE #=========================== sub propup{ my @p = @_; $p[2] =~ tr/A-Z/a-z/; $r = join('',$p[0],$p[1],$p[2]); return $r; } $i = 0; while(<>){ $i++; chomp; $record = $_; # Start Parsing #=============== s/'/\\\'/g; # Deal with quotes s/[ ]*\|/|/g; # Deal with trailing spaces s/ AND / \& /g; #s/(\W)/\\$1/g; tr/a-z/A-Z/; $cmp = ""; $tit = ""; $fnm = ""; $sur = ""; SWITCH: { foreach my $a (@N) { foreach my $b (@N) { # MR & MRS SCHAAP if(/^($a) \& ($b) ([\w]*)$/){ $tit = $1; $fnm = ""; $sur = $3; #print "1.$_-$tit-$fnm-$sur\n"; last SWITCH; } # MR & MRS PAUL SCHAAP if(/^($a) \& ($b) ([\w]*) ([\w]*)$/){ $tit = $1; $fnm = $3; $sur = $4; #print "2.$_-$tit-$fnm-$sur\n"; last SWITCH; } # MR & MRS PAUL & SALLY SCHAAP if(/^($a) \& ($b) ([\w]*) \& ([\w]*) ([\w]*)$/){ $tit = $1; $fnm = $3; $sur = $5; #print "3.$_-$tit-$fnm-$sur\n"; last SWITCH; } } # MR SCHAAP if(/^($a) ([\w]*)$/){ $tit = $1; $fnm = ""; $sur = $2; #print "4.$_-$tit-$fnm-$sur\n"; last SWITCH; } # MR PAUL SCHAAP if(/^($a) ([\w]*) ([\w]*)$/){ $tit = $1; $fnm = $2; $sur = $3; #print "5.$_-$tit-$fnm-$sur\n"; last SWITCH; } # MR PAUL A SCHAAP if(/^($a) ([\w]* [\w]*) ([\w]*)$/){ $tit = $1; $fnm = $2; $sur = $3; #print "6.$_-$tit-$fnm-$sur\n"; last SWITCH; } # MR PAUL A G SCHAAP if(/^($a) ([\w]* [\w]* [\w]*) ([\w]*)$/){ $tit = $1; $fnm = $2; $sur = $3; #print "7.$_-$tit-$fnm-$sur\n"; last SWITCH; } } # DO COMPANY SEARCHING #====================== @NAME = split(/\W/); foreach $d (@NAME){ $res = &busfinder($d); if($res != -1){ $cmp = $_; last SWITCH; } } # ILL TRY ANYTHING #================== # PAUL SCHAAP if(/^([\w]*) ([\w]*)$/){ $tit = ""; $fnm = $1; $sur = $2; last SWITCH; } # PAUL A SCHAAP if(/^([\w]* [\w]*) ([\w]*)$/){ $tit = ""; $fnm = $1; $sur = $2; last SWITCH; } # PAUL & SALLY SCHAAP if(/^([\w]*) \& [\w]* ([\w]*)$/){ $tit = ""; $fnm = $1; $sur = $2; last SWITCH; } # P A & SALLY A SCHAAP if(/^([\w]* \w) \& [\w]* [\w]* ([\w]*)$/){ $tit = ""; $fnm = $1; $sur = $2; last SWITCH; } # P A & SALLY SCHAAP if(/^([\w]* \w) \& [\w]* ([\w]*)$/){ $tit = ""; $fnm = $1; $sur = $2; last SWITCH; } } $cmp =~ tr/a-z/A-Z/; # Print Record Structure #======================== print "$cmp,$tit,$fnm,$sur,$record\n"; } print time - $start_time, " elapsed\n"; __END__ # END #####################################################################