Comparing two files using awk

Friday, December 28, 2012 , , 23 Comments


I have two files
File 1 contains 3 fields
File 2 contains 4 fields

The number of rows of File 1 is much smaller than that of File 2

I would like to compare between two files based on 1st field with the following operation
If the first field in any row of file 1 appears in the first field of a row in file 2,
don't print that row for file 2.

Input File 1
S13109 3739 31082
S45002 3800 31873
S43722 3313 26638

Input File 2
S13109 3738 31081 0
S13109 3737 31080 0
S00033 3008 29985 0
S00033 3007 29984 0
S00022 4130 31838 0
S00022 4129 31837 0
S00188 3317 27372 0
S45002 3759 31832 0
S45002 3758 31831 0
S45002 3757 31830 0
S43722 3020 26345 0
S43722 3019 26344 0
S00371 3737 33636 0
S00371 3736 33635 0

Desired Output
S00033 3008 29985 0
S00033 3007 29984 0
S00022 4130 31838 0
S00022 4129 31837 0
S00188 3317 27372 0
S00371 3737 33636 0
S00371 3736 33635 0

solution
---------
awk 'FNR==NR{a[$1]++;next}!a[$1]' file1 file2

How it works:

FNR==NR

When you have two (or more) input files to awk, FNR will reset back to 1
on the first line of the next file whereas NR will continuing incrementing
from where it left off. By checking FNR==NR we are essentially checking
to see if we are currently parsing the first file.

a[$1]++

If we are parsing the first file (see above) then create an associative
array with the first field $1 as the key and post increment the value by 1.
This essentially lets us create a 'seen' list.


next

This command tells awk not to process any further commands and to read in
the next record and start over.
We do this because file1 is only meant to set the associative array

!a[$1]

This line only executes when FNR==NR is false, i.e. we are not parsing
file1 and thus must be parsing file2. We then use the first field $1 of
file2 as the key to index into our 'seen' list created earlier. If the
value returned is 0 it means we didn't see it in file1 and therefore we
should print this line. Conversely, if the value is non-zero then we
did see it in file1 and thus we should not print its value.
Note that !a[$1] is equivalent to !a[$1]{print} because the default action
when one is not given is to print the entire line.

23 comments:

  1. In the same way, what could be the change in the above command to print lines which are present in both?

    ReplyDelete
  2. For hai,
    You simply change !a[$1] to a[$1] so that when the value returned is 0,i.e., when the values do not match, it will not print and when it is otherwise, it will print. This will print the similar values in both and exclude the ones which do not match in the first column.

    ReplyDelete
  3. what if all of the line in one line. how to separate each field in multiline back.
    could you explain it in code?

    ReplyDelete
  4. Good post
    How about, for instance, if you wanted to match column 1 in file 1 to column 2 and then 3 and then 4 in file 2, and only output any matching columns from file 2?

    ReplyDelete
  5. Nice example. My requirement is identify the line by line difference between the two files and create a third file containing the identified delta. If any line is new in file 2, then 'I' should be appended to the text of that line and if any line is updated in file2 (i.e. value of any field is changed), then line text should be appended with 'U' in the end.

    ReplyDelete
  6. @Sachin..It quite complex and if you provide me an input. I can try for something

    ReplyDelete
  7. hi,

    Do we need to have files sorted or will it still work if the fields are not sorted?

    Regards.
    Ritesh

    ReplyDelete
  8. My requirement some like this
    File1
    123ABCDEFGH
    456IJKLMNOPQ
    789RSTUVWXY

    File2
    ABC123DEF
    GHI456JKL
    MNO789PQR

    I need take the first three charcters of the file1 and serch it in the second file (file2) then prints the corrresponding line from the file2 that is matching this in separate file

    Hope I have given my requirement clear enough.

    Thanks for your help!

    ReplyDelete
  9. Use a perl script for this:

    open(F1,"File1");
    while(){
    push @a, $1 if(/^(...)/)
    }
    open(F2,"File2");
    while(){
    my $line=$_;
    for(@a){
    if($line=~/$_/){
    print $line;
    break;
    }
    }
    }

    ReplyDelete
  10. Hey Vijay Sarathi, the fact you said about NR & FNR is wrong...!!! It is actually the opposite...

    NR - This is the number of input records awk has processed since the beginning of the program's execution. NR is set each time a new record is read.

    FNR - FNR is the current record number in the current file. FNR is incremented each time a new record is read. It is reinitialized to zero each time a new input file is started

    ReplyDelete
  11. This Command doesn't work.

    ReplyDelete
  12. Thanks for your useful explanations.

    I didn't understand the ++ in a[$1]++ so I tried your script with just a[$1] instead. It yielded the same result.

    In awk, an index can be an integer as well as a string so perhaps the increment isn't needed since the index and value of each item in the array are automatically updated with every new record.

    What do you think?

    ReplyDelete
  13. I want to cmpare two files columnwise in unix using shell script

    file1
    datasrid BMStrid Mersionid country curr
    Met_CCD V14121011081 Recent US USD
    Met_CCD V14121011082 Recent US USD
    Met_CCD V14121011083 Recent GB GDB
    Met_CCD V14121011084 Recent IE GDB
    Met_CCD V14121011085 Recent GB GDB
    Met_CCD V14121011086 Recent AU AUD
    Met_CCD V14121011086 Recent HK HKD
    Met_CCD V14121011087 Recent IE GDB

    file2
    datasrid BMStrid Mersionid country curr
    Met_CCD V14121011081 Recent US USD
    Met_CCD V14121011082 Recent US USD
    Met_CCD V14121011083 Recent GB GDB
    Met_CCD V14121011088 Recent IE GDB
    Met_CCD V14121011085 Recent HK GDB
    Met_CCD V14121011086 Recent AU AUD
    Met_CCD V14121011086 Recent HK HKD
    Met_CCD V14121011087 Recent IE GDB

    Outputfile
    need to compare file2 wrt file1.
    change in any cell should get highlighted in output file.
    like
    o/p file should contain

    Met_CCD V14121011088 Recent IE GDB
    Met_CCD V14121011085 Recent HK GDB

    original files may contain hundreds of columns..i need a general solution

    ReplyDelete
  14. Ho do you exclude items that are not present in file 1.

    ReplyDelete
  15. Hi..I have two files file1 and file2 with content as follows
    File1:
    Hai
    Welcome
    Server1 10.11.12.13
    Server2 10.11.12.14

    File2:
    Goodbye
    Server1 18.19.23.66
    Server2 10.11.12.14

    Now I want to extract lines from file1 and file2 which has Server* and if the server names are same then check if the ip's are same or not.if they are same no action is required.
    If they are not same then update file2 IP with the corresponding IP from file1..can someone answer pls

    ReplyDelete
  16. I need to compare two excel files, using unix script.. could anyone sort out?

    ReplyDelete
  17. Thanks, vijay sarathi for so usefull post, it helped me a lot. I was trying to implement this code using a file .awk
    I have the files input1.txt and input2.txt with the content:

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20

    and input2.txt

    2
    4
    6
    8
    10
    12
    14
    16
    18
    20
    22
    24
    26
    28
    30

    I've tried this in the file:

    BEGIN {
    FS = "\t";
    LINEA = ""

    #sw = 1
    }

    FNR == NR # verdadero sólo en el primer archivo
    {
    a[$1]++; # construye un array asociativo de la primera columna del archivo
    next # salta todos los bloques de procedimiento y procesa la siguiente línea
    }
    ($1 in a) # comprueba si el valor en la columna uno del segundo archivo está en el array
    {
    print $0 # si es que sí, imprime la línea
    } # estas llaves y su contenido se pueden omitir (es la acción por defecto)

    I call it by: awk -f compara.awk input1.txt input2.txt > output.txt
    and I got this output:

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20

    but when I execute this: awk 'FNR==NR{a[$1]++;next}($1 in a){print $0}' input1.txt input2.txt > output.txt
    in shell I get this:

    2
    4
    6
    8
    10
    12
    14
    16
    18
    20

    I would be so thankful if somebody could help me or give me a tip.
    Cheers.

    ReplyDelete
  18. Hi, I have two files
    cat a1.csv
    one
    two
    three
    four
    one
    two
    five
    six
    seven

    and cat a2.csv
    eight
    nine
    ten
    one
    two
    ten

    #awk 'FNR==NR{a[$1]++;next}!a[$1]' a1.csv a2.csv
    eight
    nine
    ten
    ten

    #awk 'FNR==NR{a[$1]++;next}a[$1]' a1.csv a2.csv
    one
    two


    As per your commands I am getting above output.
    But I want when output like
    1)(Without Sort) in one command repeated entries from A1.csv itself and A2.csv itself and then repeated entries from Both A1.csv and A2.csv
    2) I want to print repeated entries to repeated.csv
    3) and non repeated entries to non_repeated.csv
    Please help to resolve.

    ReplyDelete
  19. Thanks for providing this informative information you may also refer.
    http://www.s4techno.com/blog/2016/07/11/aix-paging-space-commands/

    ReplyDelete
  20. we have to print the old entries only Delta(Difference) in File-1 and File-2. newly added enties in files-1 no need to print.

    I have NR=FNR logic to do so, but its not working for below case

    File-1

    7! J9AA-50! LHR! 35!
    8! J9BB-50! LHW! 22!
    7! test3! test3! 8
    7! test3! test3! 8
    7! JWZZ-50! LHN! 15!
    7! J9AA-50! LHR! 34!
    8! J9BB-50! LHW! 21!

    File-2

    7! J9AA-50! LHR! 34!
    8! J9BB-50! LHW! 22!
    7! JWZZ-50! LHN! 14!
    7! J9AA-50! LHR! 34!
    8! J9BB-50! LHW! 21!

    command to compare and print,

    /usr/xpg4/bin/awk -F'!' 'NR==FNR{++a[$2,$4,FNR];next} !a[$2,$4]++ ' File-1 File-2

    Current Result:

    7! J9AA-50! LHR! 34!
    8! J9BB-50! LHW! 22!
    7! JWZZ-50! LHN! 14!
    8! J9BB-50! LHW! 21!

    Expected result:

    7! J9AA-50! LHR! 34!
    7! JWZZ-50! LHN! 14!

    can anyone please suggest the changes/new logic to get Expected output. if its not possible by using NR=FNR then how can we get it by using shell/perl script?.

    ReplyDelete