codetoad.com
  merge two files by ID  pangpang at 04:20 on Friday, February 27, 2009
 

File #1
x MA
y FL
z UT

File #2
x
y
u
w

Output
x MA
y FL


File #1 and File #2 are very large datasets. Please advise! Thank you.

  Re: merge two files by ID  hermanningjaldsson at 07:04 on Friday, February 27, 2009
 

You take each file and put it into a hash.

Then you have two hashes. You go through the keys of the first hash,
look up that same key in the second hash, and add what you find to the first hash.
When that is done, the first hash represents both hashes,
and you then write that hash to a new file.
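The steps above could be sketched in Perl roughly as follows. This is only a minimal sketch, not the poster's own code: the helper name merge_by_id and the sample data are made up for illustration, and it assumes the goal is the Output shown in the question (only IDs present in both files are kept).

```perl
#!perl
use strict;
use warnings;

# Hypothetical helper: takes the lines of file #1 ("ID STATE") and the
# lines of file #2 ("ID") and returns the file #1 entries whose ID is in both.
sub merge_by_id {
    my ($lines1, $lines2) = @_;

    # Hash for file #1: ID => state
    my %state_for;
    foreach my $line (@$lines1) {
        my ($id, $state) = split ' ', $line;
        $state_for{$id} = $state if defined $state;
    }

    # Hash for file #2: ID => 1
    my %wanted;
    foreach my $line (@$lines2) {
        my ($id) = split ' ', $line;
        $wanted{$id} = 1 if defined $id;
    }

    # Go through the keys of the first hash and look each one up in the second
    my @merged;
    foreach my $id (sort keys %state_for) {
        push @merged, "$id $state_for{$id}" if exists $wanted{$id};
    }
    return @merged;
}

# Sample data copied from the question
my @file1 = ("x MA\n", "y FL\n", "z UT\n");
my @file2 = ("x\n", "y\n", "u\n", "w\n");
print "$_\n" foreach merge_by_id(\@file1, \@file2);
```

With the sample data this prints "x MA" and "y FL". The keys are sorted only to make the output order predictable; hash keys have no fixed order on their own.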



  Re: merge two files by ID  S_Flex at 11:05 on Friday, February 27, 2009
 

You could put each file in a hash, but that way would be a lot slower, and most people who write that script would use an array.

Basically you want to match data from 2 files and make a new file/list with the matched data.

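#!perl

use strict;
use warnings;
use Fcntl qw(:DEFAULT :flock);

# Full path to each file, including the file name
my $file = ''; #1
my $file2 = ''; #2
my $output = ''; #Output

# Read file #1 ("ID STATE" on each line); a shared lock is enough for reading
sysopen(my $fh, $file, O_RDONLY) or die "Cannot open $file: $!";
flock($fh, LOCK_SH) or die "Cannot lock $file: $!";
chomp(my @content = <$fh>);
close($fh);

# Read file #2 (one ID per line); chomp so the IDs compare equal
sysopen(my $fh2, $file2, O_RDONLY) or die "Cannot open $file2: $!";
flock($fh2, LOCK_SH) or die "Cannot lock $file2: $!";
chomp(my @content2 = <$fh2>);
close($fh2);

my @Output = ();
foreach my $line (@content) {
    my ($letter, $state) = split(' ', $line);
    next unless defined $state;    # skip blank or incomplete lines
    foreach my $id (@content2) {
        push(@Output, join(" ", $letter, $state)) if $letter eq $id;
    }
}

# One matched record per line
print "$_\n" foreach @Output;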


Please make backups of the files you are using.

  Re: merge two files by ID  hermanningjaldsson at 11:54 on Friday, February 27, 2009
 

My thought on the hashes is that they have a constant lookup time.

And since we're comparing every element of the first array against every element of the second array,
the big-O time should be to the second power, O(n^2).

Whereas using hashes it should be to the first power, O(n), because for every element in the first array we look up that exact element in constant time.
That's because we already know the key of the hash; we don't have to wade through anything.

Meaning hashes should be faster.

I worry about the memory usage though, in both solutions.
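One possible way to ease the memory worry, sketched here under assumptions rather than taken from the thread: keep only the IDs of file #2 in a hash and read file #1 one line at a time, so file #1 is never held in memory. The helper name stream_matches is hypothetical, and the demo uses in-memory filehandles in place of the real files.

```perl
#!perl
use strict;
use warnings;

# Hypothetical helper: $ids_fh yields one ID per line (file #2),
# $data_fh yields "ID STATE" lines (file #1), $out_fh receives the matches.
sub stream_matches {
    my ($ids_fh, $data_fh, $out_fh) = @_;

    # Only the IDs of file #2 are kept in memory
    my %wanted;
    while (my $line = <$ids_fh>) {
        my ($id) = split ' ', $line;
        $wanted{$id} = 1 if defined $id;
    }

    # File #1 is read one line at a time and never stored
    my $count = 0;
    while (my $line = <$data_fh>) {
        my ($id, $state) = split ' ', $line;
        next unless defined $state && exists $wanted{$id};
        print {$out_fh} "$id $state\n";
        $count++;
    }
    return $count;
}

# Demo with in-memory filehandles standing in for the real files
my $file1_text = "x MA\ny FL\nz UT\n";
my $file2_text = "x\ny\nu\nw\n";
open(my $data_fh, '<', \$file1_text) or die "Cannot open in-memory file #1: $!";
open(my $ids_fh,  '<', \$file2_text) or die "Cannot open in-memory file #2: $!";

my $result = '';
open(my $out_fh, '>', \$result) or die "Cannot open in-memory output: $!";
my $matched = stream_matches($ids_fh, $data_fh, $out_fh);
close($out_fh);

print $result;
```

The output follows the order of file #1. If file #2 is itself too large for a hash, this sketch does not help, and sorting both files first and walking them in step would be another option.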


  Re: merge two files by ID  S_Flex at 12:37 on Friday, February 27, 2009
 

I'm sorry, but what you said makes no sense to me. Maybe if you produce some code it would help me understand what you mean.

As for memory usage, you should have nothing to worry about unless both files are 1 GB+ or the system doing the task is low on resources.

  Re: merge two files by ID  hermanningjaldsson at 13:08 on Friday, February 27, 2009
 

The array solution:
By using arrays we have to go through the first array, and for each element in that array,
walk through the second array asking whether this is the element matching the one from the first array.

Double the size of the arrays and we're going to have:
-twice as many elements to check,
-twice as many elements to wade through for each.
So when the lists double in size, the strain on the CPU quadruples.
Make the lists 10 times larger and we'll have a 10*10 = 100 times larger strain on the CPU.

It's what can be called a 'big-O second power' solution, O(n^2).



The hash solution:
By using hashes we go through the first hash, and for each key in it, we look up the corresponding key in the second hash and do our stuff. Hashes have a constant access time (on average), so we don't have to wade through anything, for we already know exactly where the entry is to be found.

So if the files double in size, the strain on the CPU will double.
If the files quadruple in size, the strain on the CPU will quadruple.

It's what can be called a 'big-O first power' solution, O(n).



Big-O is basically a measure of scalability: how the amount of work grows as the input grows.
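The two growth rates described above can be checked with a small step-counting sketch. The helpers count_array_steps and count_hash_steps are hypothetical names; they only count comparisons or hash operations and do not build a merged list, and the hash count assumes each insert or lookup costs one step, which holds on average.

```perl
#!perl
use strict;
use warnings;

# Nested loops: every element of the first list is compared with
# every element of the second list.
sub count_array_steps {
    my ($first, $second) = @_;
    my $steps = 0;
    foreach my $a_id (@$first) {
        foreach my $b_id (@$second) {
            $steps++;    # one string comparison would happen here
        }
    }
    return $steps;
}

# Hash: one insertion per element of the second list, then one lookup
# per element of the first list.
sub count_hash_steps {
    my ($first, $second) = @_;
    my $steps = 0;
    my %seen;
    foreach my $b_id (@$second) { $seen{$b_id} = 1; $steps++; }
    foreach my $a_id (@$first)  { my $hit = exists $seen{$a_id}; $steps++; }
    return $steps;
}

foreach my $n (10, 20, 100) {
    my @ids = (1 .. $n);
    printf "n=%d arrays=%d hash=%d\n",
        $n, count_array_steps(\@ids, \@ids), count_hash_steps(\@ids, \@ids);
}
# prints:
# n=10 arrays=100 hash=20
# n=20 arrays=400 hash=40
# n=100 arrays=10000 hash=200
```

Doubling n from 10 to 20 quadruples the array count (100 to 400) but only doubles the hash count (20 to 40), which matches the argument in the post.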




  Re: merge two files by ID  S_Flex at 13:42 on Friday, February 27, 2009
 

I believe you are overthinking the task.
This post is about merging 2 files, not optimizing the merger of 2 files.

What you say could be true, but you have not given a workable example to reinforce your claims.



  Re: merge two files by ID  hermanningjaldsson at 13:49 on Friday, February 27, 2009
 

True, I only felt like giving a pseudocode solution.



  Re: merge two files by ID  pangpang at 17:08 on Friday, February 27, 2009
 

Thank you very much.
Hash is a great idea.
I am a new Perl user (2 days old :) ), so it is a little bit hard to write the real stuff. I will study more about hash syntax.
Thank you again!!!

  Re: merge two files by ID  pangpang at 17:10 on Friday, February 27, 2009
 

I do not know what to say.
You saved my life :) I've been upset since last night.
Thank you so much!!








© Copyright codetoad.com 2001-2009