I'm a total beginner to Python, or to any serious programming language for that matter. I got a prototype to work, but I think it will be very slow.
My goal is to search and replace some Chinese characters in all the files (they are CSV) in a directory, according to a CSV file that I have. The files are numbered by year and month, for example 2000-01.csv, and they will be the only files in that directory.
I will be looping over about 25 files in the neighborhood of 500 MB each (and about 100,000 lines each). The dictionary I will be using has about 300 elements, and I will be changing Unicode (Chinese characters) into integers. I tried a test run and, assuming everything scales up linearly (?), it looks like it would take about a week to finish.
Thanks in advance. Here is my code (don't laugh!):
    # -*- coding: utf-8 -*-
    import os, codecs

    dir = "C:/Users/Roy/Desktop/Files/"
    Dict = {'hello' : 'good', 'world' : 'bad'}

    for dirs, subdirs, files in os.walk(dir):
        for file in files:
            inFile = codecs.open(dir + file, "r", "utf-8")
            inFileStr = inFile.read()
            inFile.close()
            inFile = codecs.open(dir + file, "w", "utf-8")
            for key in Dict:
                inFileStr = inFileStr.replace(key, Dict[key])
            inFile.write(inFileStr)
            inFile.close()
In your current code you read the whole file into memory at once. Since they are 500 MB files, that means 500 MB strings. And then you do repeated replacements on them, which means Python has to create a new 500 MB string for the first replacement, then destroy the first string, then create another 500 MB string for the second replacement, then destroy the second string, and so on for every replacement. That is a lot of data copied back and forth, not to mention a lot of memory used.
If you know the replacements will always be contained within a single line, you can read the file line by line by looping over it. Python will buffer the reads, which means the loop will be fairly well optimized. You should open a new file, under a new name, for writing at the same time; perform the replacements on each line in turn and write it out immediately. Doing this will greatly reduce the amount of memory used, and the amount of data copied back and forth as you do the replacements:
    for file in files:
        fname = os.path.join(dir, file)
        inFile = codecs.open(fname, "r", "utf-8")
        outFile = codecs.open(fname + ".new", "w", "utf-8")
        for line in inFile:
            newline = do_replacements_on(line)
            outFile.write(newline)
        inFile.close()
        outFile.close()
        os.rename(fname + ".new", fname)
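Here do_replacements_on is just a placeholder. In the simplest version it could be the same loop over the dictionary as in the question (a sketch, assuming the 300-entry dict is called Dict as above); the one-pass alternatives further down are faster:

    def do_replacements_on(line):
        # try each key in turn; simple, but makes one pass over the line per key
        for key in Dict:
            line = line.replace(key, Dict[key])
        return line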
If you cannot be certain that the matches will always be on a single line, things get a little harder: you would have to read in blocks manually, using inFile.read(blocksize), and keep careful track of whether there might be a partial match at the end of a block. That is not as easy to do, but it is still worth it to avoid the 500 MB strings.
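For completeness, here is one way that partial-match bookkeeping could look. This is my own sketch, not part of the answer above; replace_in_blocks and the 1 MB block size are invented. It holds back the last maxlen - 1 characters of each block, since a key could start there and finish in the next block:

    import codecs, os, re

    def replace_in_blocks(fname, d, blocksize=1024 * 1024):
        # one regex alternation over all keys; note that re tries the
        # alternatives in the order given, which matters if one key is a
        # prefix of another
        pattern = re.compile(u"|".join(re.escape(k) for k in d))
        maxlen = max(len(k) for k in d)
        inFile = codecs.open(fname, "r", "utf-8")
        outFile = codecs.open(fname + ".new", "w", "utf-8")
        carry = u""
        while True:
            block = inFile.read(blocksize)
            data = carry + block
            if not block:
                # end of file: no more text can arrive, so replace the rest
                outFile.write(pattern.sub(lambda m: d[m.group(0)], data))
                break
            # a key starting in the last maxlen - 1 characters might continue
            # into the next block, so only handle matches starting before cut
            cut = max(len(data) - (maxlen - 1), 0)
            pieces = []
            pos = 0
            for m in pattern.finditer(data):
                if m.start() >= cut:
                    break
                pieces.append(data[pos:m.start()])
                pieces.append(d[m.group(0)])
                pos = m.end()
            cut = max(cut, pos)   # never cut inside text we already replaced
            pieces.append(data[pos:cut])
            carry = data[cut:]
            outFile.write(u"".join(pieces))
        inFile.close()
        outFile.close()
        os.rename(fname + ".new", fname)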
Another big improvement would be to do all the replacements in one pass, rather than trying a whole series of replacements in order. There are several ways to do that, but which fits best depends entirely on what you are replacing and with what. For translating single characters into something else, the translate method of Unicode objects can be convenient. You pass it a dict that maps Unicode codepoints (as integers) to Unicode strings:
    >>> u"\xff and \ubd23".translate({0xff : u"255", 0xbd23 : u"something else"})
    u'255 and something else'
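Since the question replaces individual Chinese characters with integers, translate fits that case directly; for example (a sketch with made-up codepoints and values):

    >>> u"\u4e2d\u6587".translate({0x4e2d : u"1", 0x6587 : u"2"})
    u'12'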
For replacing substrings (and not just single characters), you can use the re module. The re.sub function (and the sub method of a compiled pattern) can take a callable (a function) as the first argument, which will then be called for each match:
    >>> import re
    >>> d = {u'spam' : u'spam, ham, spam and eggs', u'eggs' : u'sausages'}
    >>> p = re.compile(u"|".join(re.escape(k) for k in d))
    >>> def repl(m):
    ...     return d[m.group(0)]
    ...
    >>> p.sub(repl, u"spam, vikings, eggs and vikings")
    u'spam, ham, spam and eggs, vikings, sausages and vikings'
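Applied to the question, the compiled pattern could serve directly as the do_replacements_on used in the line-by-line loop above. A sketch (the Chinese-to-integer mapping here is invented for illustration):

    import re

    # hypothetical mapping: each Chinese string is replaced by an integer code
    Dict = {u'\u4e2d\u6587' : u'42', u'\u6c49\u5b57' : u'7'}
    p = re.compile(u"|".join(re.escape(k) for k in Dict))

    def do_replacements_on(line):
        # a single pass over the line, no matter how many keys Dict has
        return p.sub(lambda m: Dict[m.group(0)], line)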