Perl parsing field value pairs without separation

Perl's string functions make it possible to solve some interesting parsing problems.

Suppose that you have a data set in which a processing error has removed some required delimiters, leaving lines such as the following:

some_data=some_valuesome_data=some_valuesomedata=somevalue

Now suppose that you have multiple lines of data with varying sets of field and value pairs, where each field appears at most once per line, such as the following:

some_data1=some_value1some_data2=some_value2

some_data1=some_other_value1

some_data1=some_other_value1somedata2=some_other_value2

To complicate matters further, the fields may appear in a different order on each line:

some_data1=some_value1some_data2=some_value2

some_data1=some_other_value1

somedata2=some_other_value2some_data1=some_other_value1

The final complication is the possibility of a trailing field with no data:

some_data1=some_value1some_data2=some_value2

some_data1=some_other_value1

somedata2=some_other_value2some_data1=some_other_value1

somedata2=some_other_value2some_data1=

The solution is to read the data from a file or stream line by line and split each line on =, which is the separator in this example. You can then go through the resulting pieces, looking for pieces that end with a field name. Stripping that field name from the end of a piece leaves the value of the previous field, and the stripped name becomes the next field name.
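To make the splitting step concrete, here is a minimal sketch of what split returns for two of the damaged lines shown above. The variable names are only illustrative.

```perl
use strict;
use warnings;

# Two damaged lines taken from the examples above.
my $line     = 'some_data1=some_value1some_data2=some_value2';
my $trailing = 'somedata2=some_other_value2some_data1=';

my @pieces          = split /=/, $line;
my @trailing_pieces = split /=/, $trailing;

print join(' | ', @pieces), "\n";
# prints: some_data1 | some_value1some_data2 | some_value2

# split discards trailing empty fields, so the dangling "=" adds no extra piece.
print scalar(@trailing_pieces), "\n";
# prints: 2
```

Every piece except the first and the last holds a value glued to the name of the next field, which is why the field names have to be peeled off the end of each piece.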

The following example opens a file, reads lines one at a time until the end, and runs a foreach over the line split on the equals character:

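open(MYINPUTFILE, "<", "test.dat") or die "Cannot open test.dat: $!";
while(<MYINPUTFILE>){
    my($line) = $_;
    chomp($line);
    # split the line on the = separator
    my @pieces = split( /=/,$line);
    foreach ( @pieces ){
        my($p) = $_;
        # examine each piece here
    }
}
close(MYINPUTFILE);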

To test whether a piece ends with a field name, we go through our list of fields and use a regular expression match such as the following:

if ( $piece =~ /fieldname$/ ){ 
    # do some action
}
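One way to run that test against every field name is a small loop. This is only a sketch: the @fields list and the trailing_field helper are hypothetical names, and the field names are the ones used in the sample further down.

```perl
use strict;
use warnings;

# Assumed list of known field names.
my @fields = qw(name sal sex age);

# Hypothetical helper: returns the field name found at the end of a piece,
# or undef when the piece does not end with any known field name.
sub trailing_field {
    my ($piece) = @_;
    for my $field (@fields) {
        # \Q...\E quotes any regex metacharacters in the field name.
        return $field if $piece =~ /\Q$field\E$/;
    }
    return;
}

print trailing_field('rajsex'), "\n";                        # prints: sex
print defined trailing_field('10000') ? "yes" : "no", "\n";  # prints: no
```

Note that this technique cannot tell a field name apart from a value that happens to end with the same letters, so it assumes values never end with a field name.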

Once we know the piece ends with a field name, we save the text before it as a token (the value of the previous field) and set the last field name to the field we just found:

$tok = substr($piece,0,-1 * length("fieldname"));
$last = "fieldname";
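As a small worked instance of that substr call, the sketch below uses a piece from the sample data further down. A negative length tells substr to leave that many characters off the end of the string.

```perl
use strict;
use warnings;

my $piece = '19name';   # value "19" glued to the next field name
my $field = 'name';

# Keep everything except the last length($field) characters.
my $tok  = substr($piece, 0, -length($field));
my $last = $field;

print "$tok $last\n";   # prints: 19 name

# When the piece is only a field name, the token is the empty string.
my $empty = substr('name', 0, -length('name'));
print length($empty), "\n";   # prints: 0
```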

If no field name matches, the piece is the final value on the line and matching ends. Now that we understand the outline of the solution, let's use a real-world sample:

Data File:

name=rajsex=mage
sal=10000
sex=fage=19name=somyasal=15000
sal=8000name=ritusex=fage
name=muksex=msal=19000age=30

Fields:

name, sal, sex, age

Code:

open(MYINPUTFILE, "<", "test.dat") or die "Cannot open test.dat: $!";
while(<MYINPUTFILE>){
        my($line) = $_;
        chomp($line);
        my @pieces = split( /=/,$line);
        my($last) = $pieces[0] ;
        my($tok) = "" ;
        my($current) = "" ;
        my($count) = 0 ;

        my($name) = "(empty)" ;
        my($sal) = "(empty)" ;
        my($age) = "(empty)" ;
        my($sex) = "(empty)" ;

        foreach ( @pieces ){
                my($p) = $_;
                $current = $last ;
                if ( $p =~ /name$/ ){
                        $tok = substr ( $p,0,-4);
                        $last = "name" ;
                }elsif ( $p =~ /sal$/ ){
                        $tok = substr ( $p,0,-3);
                        $last = "sal" ;
                }elsif ( $p =~ /sex$/ ){
                        $tok = substr ( $p,0,-3);
                        $last = "sex" ;
                }elsif ( $p =~ /age$/ ){
                        $tok = substr ( $p,0,-3);
                        $last = "age" ;
                }else{
                        # must be end ;
                        if ( $count > 0 ){
                                if ( $last eq "name" ){
                                        $name = $p ;
                                }elsif ( $last eq "sal" ){
                                        $sal = $p ;
                                }elsif ( $last eq "age" ){
                                        $age = $p ;
                                }elsif ( $last eq "sex" ){
                                        $sex = $p ;
                                }
                        }
                        last ;
                }
                if ( $tok ne "" ){
                        if ( $current eq "name" ){
                                $name = $tok ;
                        }elsif ( $current eq "sal" ){
                                $sal = $tok ;
                        }elsif ( $current eq "age" ){
                                $age = $tok ;
                        }elsif ( $current eq "sex" ){
                                $sex = $tok ;
                        }
                }else{
                        if ( $count > 0 ){
                                if ( $current eq "name" ){
                                        $name = "(blank)" ;
                                }elsif ( $current eq "sal" ){
                                        $sal = "(blank)" ;
                                }elsif ( $current eq "age" ){
                                        $age = "(blank)" ;
                                }elsif ( $current eq "sex" ){
                                        $sex = "(blank)" ;
                                }
                        }
                }
                $count++;
                $tok = "" ;
        }
        print "$name $sal $age $sex\n" ;
}
close(MYINPUTFILE);

Output:

raj (empty) (empty) m
(empty) 10000 (empty) (empty)
somya 15000 19 f
ritu 8000 (empty) f
muk 19000 30 m
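The same logic can also be driven by a list of field names instead of one elsif branch per field. The following is a minimal sketch rather than a replacement for the code above: parse_line is a hypothetical helper, and the sample lines are embedded in the script instead of being read from test.dat.

```perl
use strict;
use warnings;

# Column order matches the print statement in the code above.
my @fields   = qw(name sal age sex);
my $field_re = join '|', map { quotemeta($_) } @fields;

# Hypothetical helper: returns a hash reference mapping every field
# name to its value, "(empty)" or "(blank)".
sub parse_line {
    my ($line) = @_;
    my %rec = map { ($_ => '(empty)') } @fields;
    my $current;    # field whose value is expected next

    for my $piece (split /=/, $line) {
        if ($piece =~ /^(.*?)($field_re)$/) {
            # Text before the field name is the value of the previous field.
            my ($value, $next) = ($1, $2);
            if (defined $current) {
                $rec{$current} = length($value) ? $value : '(blank)';
            }
            $current = $next;
        }
        else {
            # No field name at the end: this is the last value on the line.
            $rec{$current} = $piece if defined $current;
            last;
        }
    }
    return \%rec;
}

my @lines = (
    'name=rajsex=mage',
    'sal=10000',
    'sex=fage=19name=somyasal=15000',
    'sal=8000name=ritusex=fage',
    'name=muksex=msal=19000age=30',
);

for my $line (@lines) {
    my $rec = parse_line($line);
    print join(' ', @{$rec}{@fields}), "\n";
}
```

Run against the sample data, this sketch prints the same five rows as the output above. Adding a field then only requires a new entry in @fields.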

Such output can now be fed into something like awk, cut, or Tcl to process the columns as fields.

ttessier

About ttessier

Professional Developer and Operator of SwhistleSoft