I will be extracting certain bits of data from log files using regular expressions. Initially I was going to do this with Python, but I later started to think about the fastest way to perform this task. This led me to parallel programming. I remember hearing somewhere that Python can't be truly parallel. I'm far from an expert programmer. I have been playing with Java for a bit, and I was considering parallel programming in Java for this task. What would be the fastest way to do it?
-
Sharing your research helps everyone. Tell us what you've tried and why it didn’t meet your needs. This demonstrates that you’ve taken the time to try to help yourself, it saves us from reiterating obvious answers, and most of all it helps you get a more specific and relevant answer. Also see How to Ask. – gnat, commented Jan 17, 2014 at 22:17
-
Why are you worrying about performance first? The first thing to worry about is solving the problem. Only then should you worry about performance, and only if needed. You didn't give us nearly enough info to help you solve the problem. As for Python not being parallelizable, see wiki.python.org/moin/ParallelProcessing. – David Hammen, commented Jan 17, 2014 at 22:27
-
@gnat I haven't tried this yet because, before I joined this site, the gist I was given was "Q&A for professional programmers interested in conceptual questions about software development". So I am merely asking for a theoretical answer, not necessarily concrete examples. Sorry if this question was inappropriate or too abstract. – Liondancer, commented Jan 17, 2014 at 22:36
1 Answer
Assuming you are talking about speed of execution and not speed of development, C might be hard to beat. But I don't think it would beat Java or even Python by much these days. In the standard Python implementation, the regex engine and much of the standard library are implemented in C, so Python often acts as a thin wrapper over compiled code, albeit with a more concise syntax and a richer library.
Ultimately, I think your program will be I/O bound. Adding more threads or processes probably won't get the data off the disk or network any faster.
To minimize the CPU time of your program (as opposed to the execution time of it), you should use the fastest regex library you can find, and simplify the log format and regexes as much as possible. A multiple pass approach to parsing the log lines might help, too, whereby you use something simple and fast to break up the log lines into phrases, and then pass the phrases to separate regexes that would collectively be simpler and faster than a single regex to parse the entire line.
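As a rough sketch of the multi-pass idea, the Python below first splits each line with a plain string operation and then applies two small precompiled regexes to the remainder. The log format, the field names, and the function parse_line are all assumptions made up for illustration; adapt them to whatever your logs actually look like.

```python
import re

# Hypothetical log format, assumed for illustration only:
#   2014-01-17 22:17:03 ERROR user=alice latency=120ms
# Pass 1: cheap string splitting breaks the line into phrases.
# Pass 2: small precompiled regexes each parse one phrase.
USER_RE = re.compile(r"user=(\w+)")
LATENCY_RE = re.compile(r"latency=(\d+)ms")


def parse_line(line):
    """Return a dict of extracted fields, or None if the line is malformed."""
    parts = line.rstrip("\n").split(" ", 3)  # date, time, level, rest
    if len(parts) != 4:
        return None
    date, time_of_day, level, rest = parts
    user = USER_RE.search(rest)
    latency = LATENCY_RE.search(rest)
    return {
        "date": date,
        "time": time_of_day,
        "level": level,
        "user": user.group(1) if user else None,
        "latency_ms": int(latency.group(1)) if latency else None,
    }
```

Whether this beats a single regex over the whole line depends on the format and the patterns, so it is worth timing both variants on real data.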
To parallelize, I would have the main thread vend lines to a pool of line processors, each running on a separate thread, perhaps with n queues for the n threads. You might have lines processed out of order, so beware. And if one processing thread is still faster than disk or network I/O, then you are adding complexity for no real gain.
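A minimal sketch of that worker-pool layout in Python, using the standard library's concurrent.futures module. The pattern ERROR_RE and the functions process_line and process_lines are hypothetical names for illustration. Note that Executor.map returns results in input order, which sidesteps the out-of-order concern at the cost of waiting for slower lines.

```python
import re
from concurrent.futures import ThreadPoolExecutor

ERROR_RE = re.compile(r"\bERROR\b")  # hypothetical pattern, for illustration


def process_line(line):
    """Return the stripped line if it matches the pattern, else None."""
    return line.strip() if ERROR_RE.search(line) else None


def process_lines(lines, workers=4):
    """Hand lines to a pool of workers and collect the matching ones."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, even if workers finish
        # out of order.
        return [hit for hit in pool.map(process_line, lines) if hit is not None]
```

Two caveats. In the standard CPython interpreter, the global interpreter lock generally prevents threads from running regex matching in parallel, so a thread pool mostly helps when workers wait on I/O. ProcessPoolExecutor offers the same map interface for CPU-bound work, though sending each line to another process has its own overhead. Also, map() consumes its input eagerly, so for very large files you would want to feed the pool in batches.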
Above all, when looking to improve speed, profile and measure before and after every change. You need to be able to prove that you are making the program faster.
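One simple way to measure a change, sketched in Python with time.perf_counter. The helper time_call, the synthetic log lines, and the two patterns being compared are assumptions for illustration only; with real logs you would time your actual parsing function before and after each change.

```python
import re
import time


def time_call(func, *args, repeat=5):
    """Return the best wall-clock time in seconds over `repeat` runs."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best


def count_matches(pattern, lines):
    """Count the lines in which the compiled pattern is found."""
    return sum(1 for line in lines if pattern.search(line))


# Synthetic data, made up for this example.
lines = ["2014-01-17 ERROR disk full", "2014-01-17 INFO ok"] * 10000

before = time_call(count_matches, re.compile(r".*ERROR.*"), lines)
after = time_call(count_matches, re.compile(r"ERROR"), lines)
print(f"before: {before:.4f}s  after: {after:.4f}s")
```

Taking the best of several runs reduces noise from other activity on the machine. For a breakdown of where the time goes inside a larger program, the standard library's cProfile module is the usual next step.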
-
Thank you for your answer! I appreciate this! It gave me a bit of insight! Thanks! Commented Jan 17, 2014 at 22:48