Find and replace text in a very large file fast
I needed to replace a specific text string in a 6.5 GB file for one of my projects. This is a pretty easy task if you are on Linux (using a tool like sed), but it is not that easy if you are on Windows.
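For comparison, on Linux the whole job is a one-liner with sed, editing the file in place and replacing every occurrence:
sed -i 's/aaa/bbb/g' input.csv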
First, I tried my favorite PowerShell. I stumbled upon a comment made by Rob Campbell here and quickly created this script in PowerShell ISE (love it!):
$filepath = "input.csv"
$newfilepath = "input_fixed.csv"

# Filter that rewrites each batch of lines as it flows down the pipeline
filter num2x { $_ -replace "aaa","bbb" }

Measure-Command {
    # Read 1000 lines at a time, replace, and append to the new file
    Get-Content -ReadCount 1000 $filepath | num2x | Add-Content $newfilepath
}
It took 19 minutes to run on my laptop, which was not too bad. Rob mentioned that using a filter and reading the file in batches (via -ReadCount) gives very good performance. I could have used the .NET StreamReader class to read lines one by one, but a lot of people agreed that -ReadCount works better since it does reads and writes in batches.
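For reference, the StreamReader version I had in mind would look roughly like this. It is just a sketch I have not timed, and the paths are made up; note that .NET resolves relative paths against the process working directory rather than your PowerShell location, so full paths are safer here:
# Sketch of the line-by-line StreamReader approach (hypothetical paths)
$reader = New-Object System.IO.StreamReader("C:\data\input.csv")
$writer = New-Object System.IO.StreamWriter("C:\data\input_fixed.csv")
try {
    # Process one line at a time instead of batching
    while ($null -ne ($line = $reader.ReadLine())) {
        $writer.WriteLine($line -replace "aaa","bbb")
    }
}
finally {
    $reader.Dispose()
    $writer.Dispose()
}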
Then I came across a free little command-line tool called FART - Find And Replace Text. Leaving the name of the tool aside, I was a bit skeptical about it but gave it a try. Well, it did the same thing in just 3 minutes! Quite a difference! All I had to do was download it and run this command, which replaces all occurrences of aaa with bbb:
fart.exe -c input.csv "aaa" "bbb"
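Since FART edits the file in place, a dry run first can be worth it; if I am reading its help right, the -p switch previews the changes without writing anything:
fart.exe -p input.csv "aaa" "bbb"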
Then I processed another big file - 21 GB this time. It took FART 21 minutes - still not bad!
Certainly a great little tool to add to your toolbox!
P.S. I also tried a trial version of EmEditor, whose authors claim it can work with very large 200 GB+ files. EmEditor opened my file just fine, but when I tried to do Find and Replace, it froze my laptop for 40 minutes and I had to kill the process.