Comparing two files using awk
I have two files
File 1 contains 3 fields
File 2 contains 4 fields
The number of rows of File 1 is much smaller than that of File 2
I would like to compare between two files based on 1st field with the following operation
If the first field in any row of file 1 appears in the first field of a row in file 2,
don't print that row for file 2.
Input File 1
S13109 3739 31082 S45002 3800 31873 S43722 3313 26638
Input File 2
S13109 3738 31081 0 S13109 3737 31080 0 S00033 3008 29985 0 S00033 3007 29984 0 S00022 4130 31838 0 S00022 4129 31837 0 S00188 3317 27372 0 S45002 3759 31832 0 S45002 3758 31831 0 S45002 3757 31830 0 S43722 3020 26345 0 S43722 3019 26344 0 S00371 3737 33636 0 S00371 3736 33635 0
Desired Output
S00033 3008 29985 0 S00033 3007 29984 0 S00022 4130 31838 0 S00022 4129 31837 0 S00188 3317 27372 0 S00371 3737 33636 0 S00371 3736 33635 0
solution
---------
awk 'FNR==NR{a[$1]++;next}!a[$1]' file1 file2
How it works:
FNR==NR
When you have two (or more) input files to awk, FNR will reset back to 1
on the first line of the next file whereas NR will continuing incrementing
from where it left off. By checking FNR==NR we are essentially checking
to see if we are currently parsing the first file.
a[$1]++
If we are parsing the first file (see above) then create an associative
array with the first field $1 as the key and post increment the value by 1.
This essentially lets us create a 'seen' list.
next
This command tells awk not to process any further commands and to read in
the next record and start over.
We do this because file1 is only meant to set the associative array
!a[$1]
This line only executes when FNR==NR is false, i.e. we are not parsing
file1 and thus must be parsing file2. We then use the first field $1 of
file2 as the key to index into our 'seen' list created earlier. If the
value returned is 0 it means we didn't see it in file1 and therefore we
should print this line. Conversely, if the value is non-zero then we
did see it in file1 and thus we should not print its value.
Note that !a[$1] is equivalent to !a[$1]{print} because the default action
when one is not given is to print the entire line.
In the same way, what could be the change in the above command to print lines which are present in both?
ReplyDeletenice explanation..
ReplyDeleteFor hai,
ReplyDeleteYou simply change !a[$1] to a[$1] so that when the value returned is 0,i.e., when the values do not match, it will not print and when it is otherwise, it will print. This will print the similar values in both and exclude the ones which do not match in the first column.
what if all of the line in one line. how to separate each field in multiline back.
ReplyDeletecould you explain it in code?
Good post
ReplyDeleteHow about, for instance, if you wanted to match column 1 in file 1 to column 2 and then 3 and then 4 in file 2, and only output any matching columns from file 2?
Nice example. My requirement is identify the line by line difference between the two files and create a third file containing the identified delta. If any line is new in file 2, then 'I' should be appended to the text of that line and if any line is updated in file2 (i.e. value of any field is changed), then line text should be appended with 'U' in the end.
ReplyDelete@Sachin..It quite complex and if you provide me an input. I can try for something
ReplyDeletehi,
ReplyDeleteDo we need to have files sorted or will it still work if the fields are not sorted?
Regards.
Ritesh
sorting is not required.
ReplyDeleteMy requirement some like this
ReplyDeleteFile1
123ABCDEFGH
456IJKLMNOPQ
789RSTUVWXY
File2
ABC123DEF
GHI456JKL
MNO789PQR
I need take the first three charcters of the file1 and serch it in the second file (file2) then prints the corrresponding line from the file2 that is matching this in separate file
Hope I have given my requirement clear enough.
Thanks for your help!
Use a perl script for this:
ReplyDeleteopen(F1,"File1");
while(){
push @a, $1 if(/^(...)/)
}
open(F2,"File2");
while(){
my $line=$_;
for(@a){
if($line=~/$_/){
print $line;
break;
}
}
}
Hey Vijay Sarathi, the fact you said about NR & FNR is wrong...!!! It is actually the opposite...
ReplyDeleteNR - This is the number of input records awk has processed since the beginning of the program's execution. NR is set each time a new record is read.
FNR - FNR is the current record number in the current file. FNR is incremented each time a new record is read. It is reinitialized to zero each time a new input file is started
This Command doesn't work.
ReplyDeleteThanks for your useful explanations.
ReplyDeleteI didn't understand the ++ in a[$1]++ so I tried your script with just a[$1] instead. It yielded the same result.
In awk, an index can be an integer as well as a string so perhaps the increment isn't needed since the index and value of each item in the array are automatically updated with every new record.
What do you think?
testing notification
DeleteI want to cmpare two files columnwise in unix using shell script
ReplyDeletefile1
datasrid BMStrid Mersionid country curr
Met_CCD V14121011081 Recent US USD
Met_CCD V14121011082 Recent US USD
Met_CCD V14121011083 Recent GB GDB
Met_CCD V14121011084 Recent IE GDB
Met_CCD V14121011085 Recent GB GDB
Met_CCD V14121011086 Recent AU AUD
Met_CCD V14121011086 Recent HK HKD
Met_CCD V14121011087 Recent IE GDB
file2
datasrid BMStrid Mersionid country curr
Met_CCD V14121011081 Recent US USD
Met_CCD V14121011082 Recent US USD
Met_CCD V14121011083 Recent GB GDB
Met_CCD V14121011088 Recent IE GDB
Met_CCD V14121011085 Recent HK GDB
Met_CCD V14121011086 Recent AU AUD
Met_CCD V14121011086 Recent HK HKD
Met_CCD V14121011087 Recent IE GDB
Outputfile
need to compare file2 wrt file1.
change in any cell should get highlighted in output file.
like
o/p file should contain
Met_CCD V14121011088 Recent IE GDB
Met_CCD V14121011085 Recent HK GDB
original files may contain hundreds of columns..i need a general solution
Ho do you exclude items that are not present in file 1.
ReplyDeleteHi..I have two files file1 and file2 with content as follows
ReplyDeleteFile1:
Hai
Welcome
Server1 10.11.12.13
Server2 10.11.12.14
File2:
Goodbye
Server1 18.19.23.66
Server2 10.11.12.14
Now I want to extract lines from file1 and file2 which has Server* and if the server names are same then check if the ip's are same or not.if they are same no action is required.
If they are not same then update file2 IP with the corresponding IP from file1..can someone answer pls
I need to compare two excel files, using unix script.. could anyone sort out?
ReplyDeleteThanks, vijay sarathi for so usefull post, it helped me a lot. I was trying to implement this code using a file .awk
ReplyDeleteI have the files input1.txt and input2.txt with the content:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
and input2.txt
2
4
6
8
10
12
14
16
18
20
22
24
26
28
30
I've tried this in the file:
BEGIN {
FS = "\t";
LINEA = ""
#sw = 1
}
FNR == NR # verdadero sólo en el primer archivo
{
a[$1]++; # construye un array asociativo de la primera columna del archivo
next # salta todos los bloques de procedimiento y procesa la siguiente línea
}
($1 in a) # comprueba si el valor en la columna uno del segundo archivo está en el array
{
print $0 # si es que sí, imprime la línea
} # estas llaves y su contenido se pueden omitir (es la acción por defecto)
I call it by: awk -f compara.awk input1.txt input2.txt > output.txt
and I got this output:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
but when I execute this: awk 'FNR==NR{a[$1]++;next}($1 in a){print $0}' input1.txt input2.txt > output.txt
in shell I get this:
2
4
6
8
10
12
14
16
18
20
I would be so thankful if somebody could help me or give me a tip.
Cheers.
Hi, I have two files
ReplyDeletecat a1.csv
one
two
three
four
one
two
five
six
seven
and cat a2.csv
eight
nine
ten
one
two
ten
#awk 'FNR==NR{a[$1]++;next}!a[$1]' a1.csv a2.csv
eight
nine
ten
ten
#awk 'FNR==NR{a[$1]++;next}a[$1]' a1.csv a2.csv
one
two
As per your commands I am getting above output.
But I want when output like
1)(Without Sort) in one command repeated entries from A1.csv itself and A2.csv itself and then repeated entries from Both A1.csv and A2.csv
2) I want to print repeated entries to repeated.csv
3) and non repeated entries to non_repeated.csv
Please help to resolve.
Thanks for providing this informative information you may also refer.
ReplyDeletehttp://www.s4techno.com/blog/2016/07/11/aix-paging-space-commands/
we have to print the old entries only Delta(Difference) in File-1 and File-2. newly added enties in files-1 no need to print.
ReplyDeleteI have NR=FNR logic to do so, but its not working for below case
File-1
7! J9AA-50! LHR! 35!
8! J9BB-50! LHW! 22!
7! test3! test3! 8
7! test3! test3! 8
7! JWZZ-50! LHN! 15!
7! J9AA-50! LHR! 34!
8! J9BB-50! LHW! 21!
File-2
7! J9AA-50! LHR! 34!
8! J9BB-50! LHW! 22!
7! JWZZ-50! LHN! 14!
7! J9AA-50! LHR! 34!
8! J9BB-50! LHW! 21!
command to compare and print,
/usr/xpg4/bin/awk -F'!' 'NR==FNR{++a[$2,$4,FNR];next} !a[$2,$4]++ ' File-1 File-2
Current Result:
7! J9AA-50! LHR! 34!
8! J9BB-50! LHW! 22!
7! JWZZ-50! LHN! 14!
8! J9BB-50! LHW! 21!
Expected result:
7! J9AA-50! LHR! 34!
7! JWZZ-50! LHN! 14!
can anyone please suggest the changes/new logic to get Expected output. if its not possible by using NR=FNR then how can we get it by using shell/perl script?.
Hi,
ReplyDeleteI would like to compare two files and print down the value which is not common in both the files. How can I achieve that?
I'm trying my best to learn linux shell and hack things out myself. But this part has me stuck.
ReplyDeleteI have two CSV files with hundreds of rows each.
CSV 1 looks like this:
name, email, interest
CSV 2 looks like this:
email, name only
I'm trying to write a script to compare the files with email looking for duplicates and remove it. But as you can see, CSV 2 only contains an email, name. If an email in CSV 1 DOES NOT EXIST in CSV 2, then the row containing that email in CSV 1 should be deleted.
The end result can either give output in another location with 2 new file
I would be grateful for the help.
I tried something along these lines with no luck
This comment has been removed by the author.
ReplyDeleteHi,
ReplyDeleteI would like to compare two csv files and get output as two file one as same in csv1 and second file csv2 which is not common in both the files. How can I achieve that using shell script?
rks
ReplyDeleteHi,
I would like to compare two csv files and get output as two file one as same in csv1 and second file csv2 which is not common in both the files. How can I achieve that using shell script?