
I am currently using Python to write a compiler manager. I have to detect whether given source code is C, even if the code has a few syntax errors. I am currently using the file's extension to tell, but I want a smarter option.

Would using regex be the best (and simplest) solution to this problem?

I am also looking to minimize dependencies. I don't want to use a 3rd party package for this.

Any suggestions?

  • 2
    Write a simple C parser. If it can tokenize a majority of your source text without errors, it's probably C. Commented Nov 18, 2016 at 16:13
  • C as opposed to what other languages? C++? Objective-C? Good luck.
    – Doc Brown
    Commented Nov 18, 2016 at 16:15
  • @Doc Brown Just if it is C or not. Commented Nov 18, 2016 at 16:17
  • 1
    You can write C++ code that's 100% C. So would you just send it to a C compiler instead? You should explain your issue.
    – user188153
    Commented Nov 18, 2016 at 16:26
  • 4
    Using the extension will give you far more accurate results than trying to guess at it, is a well-accepted convention, and takes almost no time to handle. An empty file is a valid C translation unit when compilers are not set to be pedantic; how would you differentiate that from an empty text file?
    – Blrfl
    Commented Nov 18, 2016 at 16:38

1 Answer


A regex would be the simplest, but not necessarily the best. A regex might be fooled by C code residing in a comment or string of another language. It also might miss perfectly valid C code that doesn't follow the whitespace convention or use the constructs you expect.
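To make both failure modes concrete, here is a minimal sketch using only the standard `re` module. The patterns in `C_HINTS`, the "two hits" threshold, and the function name `looks_like_c` are assumptions chosen for illustration, not a recommended rule set.

```python
import re

# Hypothetical hint patterns: what counts as "C-looking" is an assumption.
C_HINTS = [
    re.compile(r'^\s*#\s*include\s*[<"][^>"]+[>"]', re.MULTILINE),
    re.compile(r'\b(?:int|void)\s+main\s*\('),
    re.compile(r'\b(?:printf|malloc|sizeof)\s*\('),
]

def looks_like_c(source: str) -> bool:
    """Return True if at least two of the hint patterns match."""
    hits = sum(1 for pattern in C_HINTS if pattern.search(source))
    return hits >= 2

# Ordinary C: detected.
c_code = '#include <stdio.h>\nint main(void) { printf("hi\\n"); return 0; }\n'

# Python file with C inside a docstring: a false positive.
python_with_c = '"""\n#include <stdio.h>\nint main(void) { return 0; }\n"""\nprint("hello")\n'

# Valid C that uses none of the expected constructs: a false negative.
terse_c = 'long f(long x){return x*2;}\n'

for sample in (c_code, python_with_c, terse_c):
    print(looks_like_c(sample))
```

Adding more patterns shifts where the mistakes happen, but the regex still has no notion of which language's comment or string it is looking inside.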

A C parser, as suggested, is probably the best you can do. There are tools such as ANTLR that can make this job easier for you (ANTLR can generate Python lexers and parsers for C and other languages). Once you've run the code through your parser, you can compare the number of syntax errors to the number of valid tokens and the size of the file. Based on these statistics, you can make a guess.
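As a dependency-free stand-in for a generated parser, the sketch below uses a hand-rolled tokenizer and reports the fraction of non-whitespace characters that fall inside recognizable C tokens. The token set, the name `c_token_ratio`, and the idea of counting each unrecognized character as one error are all assumptions. It works at the token level only, so it cannot see grammar errors, and text in many other languages will also score highly.

```python
import re

# A deliberately small token set; real C has more (literal prefixes, digraphs, ...).
TOKEN_RE = re.compile(r"""
    (?P<comment>/\*.*?\*/|//[^\n]*)
  | (?P<preproc>^[ \t]*\#[^\n]*)
  | (?P<string>"(?:\\.|[^"\\\n])*")
  | (?P<char>'(?:\\.|[^'\\\n])+')
  | (?P<number>(?:0[xX][0-9a-fA-F]+|\d+\.?\d*(?:[eE][+-]?\d+)?)[uUlLfF]*)
  | (?P<ident>[A-Za-z_]\w*)
  | (?P<punct>->|\+\+|--|<<=|>>=|<<|>>|[<>=!+\-*/%&|^]=|&&|\|\||\.\.\.|[{}\[\]().,;:?~!%^&*+\-/<>=|])
  | (?P<space>\n|[ \t\r\f\v]+)
""", re.VERBOSE | re.DOTALL | re.MULTILINE)

def c_token_ratio(source: str) -> float:
    """Fraction of non-whitespace characters covered by recognizable C tokens."""
    pos = 0
    good = bad = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            bad += 1          # unrecognized character, e.g. '@', '$' or a backtick
            pos += 1
            continue
        if match.lastgroup != "space":
            good += len(match.group())
        pos = match.end()
    total = good + bad
    return good / total if total else 0.0

c_sample = ('#include <stdio.h>\n'
            'int main(void) {\n'
            '    printf("hi %d\\n", 0x2A); /* greet */\n'
            '    return 0;\n'
            '}\n')
perl_sample = 'my @names = ("a", "b");\nforeach my $n (@names) { print "$n\\n"; }\n'

print(c_token_ratio(c_sample))     # every character belongs to a known token
print(c_token_ratio(perl_sample))  # '@' and '$' sigils count as errors
```

A threshold on this ratio is still only a guess. A real parser would add grammar-level error counts to the same kind of statistic.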

As others have noted--you still won't be able to tell the difference between a valid C++ file and a C file with some syntax errors (unless you also run your code through a C++ parser, and an Objective-C parser, and...).

If you're already making an auto-compiler, your best bet might be to compile the code in all your supported languages and pick the one that actually compiles. If the input file may actually have syntax errors, pick the language that compiles with the fewest syntax errors. You'll still run into the problem where a given input file compiles just fine in multiple languages (as mentioned, C and C++ are one example). You'll end up with multiple compilers that work just fine for the given input file--the only way you'll be able to tell which one to use is by... looking at the file extension.
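A rough sketch of that strategy follows, assuming `gcc` and `g++` are on the `PATH` and that their diagnostics contain the text `error:` (localized or non-GCC compilers may differ). The `LANGUAGES` table, the function names, and the injectable `counter` parameter are hypothetical and exist only for illustration.

```python
import os
import subprocess

# Hypothetical table: language name -> (syntax-check command, known extensions).
LANGUAGES = {
    "c":   (["gcc", "-fsyntax-only", "-x", "c"],   {".c", ".h"}),
    "c++": (["g++", "-fsyntax-only", "-x", "c++"], {".cpp", ".cc", ".cxx", ".hpp"}),
}

def count_errors(command, path):
    """Run a syntax-only compile and count 'error:' lines on stderr."""
    try:
        result = subprocess.run(command + [path], capture_output=True, text=True)
    except FileNotFoundError:   # compiler not installed
        return None
    return sum(1 for line in result.stderr.splitlines() if "error:" in line)

def guess_language(path, counter=count_errors):
    """Pick the language with the fewest errors; break ties by extension."""
    scores = {}
    for name, (command, _extensions) in LANGUAGES.items():
        errors = counter(command, path)
        if errors is not None:
            scores[name] = errors

    if scores:
        best = min(scores.values())
        candidates = [name for name, errors in scores.items() if errors == best]
    else:
        candidates = list(LANGUAGES)    # no compiler available: all still possible

    if len(candidates) == 1:
        return candidates[0]

    # Several languages fit equally well, so the extension decides.
    extension = os.path.splitext(path)[1].lower()
    for name in candidates:
        if extension in LANGUAGES[name][1]:
            return name
    return None                         # genuinely ambiguous
```

Passing a different `counter` makes it possible to exercise the tie-breaking logic without a compiler installed. Note that any file accepted by both compilers ends up being classified by its extension, which is the circularity described above.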

  • I settled on just using the extension, but this answer would have been the best way to automatically determine what I wanted. Commented Nov 30, 2016 at 18:57
