Compiler Fingerprinting in EVM Bytecode


Despite what you might think, compilers are not black boxes. They are complex, deterministic systems that produce machine code from high-level programming languages through a series of well-defined steps. This means that the output of a compiler is not just a random sequence of bytes, but a structured and predictable representation of the original source code. In fact, the output of a compiler is just as much a reflection of the compiler itself as it is of the source code it was given.

In this experimental paper, we will dive into EVM bytecode and examine distinct patterns and markers left by different major EVM compilers. We will also explore the potential for using these patterns to identify the compiler used to generate a given contract's bytecode.

Brief: EVM Bytecode and Compilers

Compilers generate EVM bytecode by translating high-level code, such as Solidity, into a series of instructions (opcodes) that represent the program's logic. Compilers are extremely complex systems that can be broken down into several stages:

Lexical Analysis: The compiler reads the source code and converts it into a stream of tokens. This may also be referred to as tokenization.
Syntax Analysis: The compiler parses the tokens and builds an abstract syntax tree (AST) that represents the structure of the program.
Semantic Analysis: The compiler checks the AST for semantic errors and performs type checking.
IR Generation: The compiler translates the AST into an intermediate representation (IR) that is closer to the target machine code. In the case of EVM, this intermediate representation is typically in the form of EVM assembly or an IR such as Yul.
Optimization: The compiler optimizes the IR to improve the efficiency of the generated code.
Code Generation: The compiler translates the optimized IR into machine code. In the case of EVM, this machine code is EVM bytecode.

Through this process, the compiler leaves distinct patterns and markers in the generated bytecode that can be used to identify the compiler which generated it.

Existing Known Heuristics

One of the most well-known heuristics for identifying the compiler used to generate a given contract's bytecode is to examine the first few operations in the bytecode, as different compilers set up program execution differently. For example:

Solidity

The Solidity compiler, solc, typically uses the following sequences of opcodes as the first few instructions in the bytecode:

0x60 0x80 0x60 0x40 0x52 (indicates solc 0.4.22+)
0x60 0x60 0x60 0x40 0x52 (indicates solc 0.4.11-0.4.21)

The Solidity compiler begins execution by initializing memory that the program will use. For those interested, the exact Solidity memory layout can be found here.

Vyper

The Vyper compiler typically uses the following sequences of opcodes as the first few instructions in the bytecode:

0x60 0x04 0x36 0x10 0x15 (indicates vyper 0.2.0-0.2.4,0.2.11-0.3.3)
0x34 0x15 0x61 0x00 0x0a (indicates vyper 0.2.5-0.2.8)

The Vyper compiler begins execution immediately in its dispatcher, which is why the first few opcodes are different from Solidity.

CBOR Encoding

When a contract is compiled, the compiler may include metadata in the bytecode that can be used to identify the exact compiler version used to generate the bytecode. This metadata is encoded in a partial CBOR format:

Encode vyper or solc as a hex string, followed by a one-byte CBOR header: 0x767970657283 and 0x736f6c6343 respectively.
Append the version as a 3-byte hex string: 0x000817 for version 0.8.23. For example, the metadata 0x736f6c6343000817 would be equivalent to solc 0.8.23.

However, this metadata is not always present in the bytecode as users can opt to exclude it from the deployed bytecode.
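To make the two steps above concrete, here is a minimal sketch that searches bytecode for either marker and decodes the three version bytes that follow. The helper names `find_subslice` and `decode_version` are hypothetical and exist only for this illustration; the marker bytes are the ones described above. It assumes each version component fits in a single byte, which is what the 3-byte encoding above implies.

```rust
/// ASCII "solc" followed by 0x43 (CBOR header for a 3-byte byte string).
const SOLC_MARKER: [u8; 5] = [0x73, 0x6f, 0x6c, 0x63, 0x43];
/// ASCII "vyper" followed by 0x83 (CBOR header for a 3-element array).
const VYPER_MARKER: [u8; 6] = [0x76, 0x79, 0x70, 0x65, 0x72, 0x83];

/// Hypothetical helper: index of the first occurrence of `needle` in `haystack`.
fn find_subslice(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        return None;
    }
    haystack.windows(needle.len()).position(|w| w == needle)
}

/// Hypothetical helper: returns the compiler name and "major.minor.patch"
/// if one of the markers is present and is followed by three bytes.
fn decode_version(bytecode: &[u8]) -> Option<(&'static str, String)> {
    let markers = [("solc", &SOLC_MARKER[..]), ("vyper", &VYPER_MARKER[..])];
    for (name, marker) in markers {
        if let Some(pos) = find_subslice(bytecode, marker) {
            let start = pos + marker.len();
            if let Some(v) = bytecode.get(start..start + 3) {
                return Some((name, format!("{}.{}.{}", v[0], v[1], v[2])));
            }
        }
    }
    None
}
```

Calling `decode_version` on bytecode containing 0x736f6c6343000817 yields `("solc", "0.8.23")`, matching the example above; bytecode without either marker yields `None`.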

Methodology

If we can already roughly identify the compiler used to generate a contract's bytecode by examining the first few operations in the bytecode, how much more accurate can we be if we examine the entire bytecode? The process we will take to answer this question is as follows:

Data Collection: We collect a random sample of $5,000$ verified contracts for each of Solidity and Vyper from Etherscan, and fetch their bytecode.
Data Classification: Using the known heuristics and patterns, we classify the contracts into three groups: Solidity, Vyper, and Unknown.
Pattern Analysis: We analyze the bytecode of the contracts in each group to identify distinct patterns and markers that can be used to fingerprint the compiler.
Results: Using the patterns and markers identified, we reclassify the contracts and evaluate the accuracy of our classification algorithm against known compiler versions.

Note: I've opted not to take an AI/ML approach to this problem, as I would rather be able to reason about the patterns and markers left by the compilers than rely on a black-box model which simply outputs a prediction.

1. Data Collection

We collected a random sample of $5,000$ verified contracts each for Solidity and Vyper ( $10,000$ in total) from Etherscan, saving their exact compiler version in a CSV.

Those interested can view the full raw data here, but here's a slice of the data:

snippet.txt
address,compiler_version
0xef672bd94913cb6f1d2812a6e18c1ffded8eff5c,vyper:0.3.1
0x10ac65a9f710c3d607d213784e5b8632c77d5d4f,vyper:0.3.1
0x0199429171bce183048dccf1d5546ca519ea9717,vyper:0.3.1
0x1c3a367f8b2e921d2476870576fcf91670017897,vyper:0.3.9
...
0xa21a59cc2375368fceb08898403fa7331b6531ad,v0.5.10+commit.5a6ea5b1
0xeb08b206271350fcc9ae1cad1e27f348a2055600,v0.5.14+commit.1f1aaa4
0x118cd20b58b069a2df45531cae31d1121fa4c310,v0.4.17+commit.bdeb9e52
0xa6ead154167d2e712936b8ebc22b66903c46047c,v0.5.17+commit.d19bba13

We can then fetch the bytecode for each contract using JSON-RPC, and we will prune pushed bytes from the bytecode since they make pattern discovery more complicated. For example, 0x60 0x80 0x60 0x40 (PUSH1 0x80 PUSH1 0x40) would become 0x60 0x60 (PUSH1 PUSH1).
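The pruning step can be sketched as follows. This is a minimal illustration rather than the exact routine used for the analysis: `prune_push_data` is a hypothetical name, and the sketch treats every byte as code, including any trailing metadata.

```rust
/// Hypothetical helper: drop the immediate data that follows every
/// PUSH1..PUSH32 instruction, keeping only the opcode bytes.
fn prune_push_data(bytecode: &[u8]) -> Vec<u8> {
    let mut pruned = Vec::with_capacity(bytecode.len());
    let mut i = 0;
    while i < bytecode.len() {
        let op = bytecode[i];
        pruned.push(op);
        // PUSH1 (0x60) through PUSH32 (0x7f) carry 1 to 32 bytes of data.
        if (0x60..=0x7f).contains(&op) {
            i += (op - 0x5f) as usize;
        }
        i += 1;
    }
    pruned
}
```

A PUSH whose data is truncated at the end of the bytecode simply ends the loop, so the sketch does not fail on malformed input.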

2. Data Classification

Now that we have a list of contracts and their bytecode, we will use the known heuristics to classify the contracts, and then compare the results to the actual compiler version to determine the accuracy of our initially known heuristics.

View `detect_compiler.rs`

snippet.rs
/// Detect the compiler used to generate the given bytecode.
pub fn detect_compiler(bytecode: &[u8]) -> (Compiler, String) {
    let mut compiler = Compiler::Unknown;
    let mut version = "unknown".to_string();

    // check the prefix of the bytecode against known compiler patterns
    if bytecode.starts_with(&[
        0x36, 0x60, 0x00, 0x60, 0x00, 0x37, 0x61, 0x10, 0x00, 0x60, 0x00, 0x36, 0x60, 0x00, 0x73,
    ]) {
        compiler = Compiler::Vyper;
        version = "proxy".to_string();
    } else if bytecode.starts_with(&[0x60, 0x04, 0x36, 0x10, 0x15]) {
        compiler = Compiler::Vyper;
        version = "0.2.0-0.2.4,0.2.11-0.3.3".to_string();
    } else if bytecode.starts_with(&[0x34, 0x15, 0x61, 0x00, 0x0a]) {
        compiler = Compiler::Vyper;
        version = "0.2.5-0.2.8".to_string();
    } else if bytecode.starts_with(&[0x73, 0x1b, 0xf7, 0x97]) {
        compiler = Compiler::Solc;
        version = "0.4.10-0.4.24".to_string();
    } else if bytecode.starts_with(&[0x60, 0x80, 0x60, 0x40, 0x52]) {
        compiler = Compiler::Solc;
        version = "0.4.22+".to_string();
    } else if bytecode.starts_with(&[0x60, 0x60, 0x60, 0x40, 0x52]) {
        compiler = Compiler::Solc;
        version = "0.4.11-0.4.21".to_string();
    } else if bytecode.contains_slice(&[0x76, 0x79, 0x70, 0x65, 0x72]) {
        compiler = Compiler::Vyper;
    } else if bytecode.contains_slice(&[0x73, 0x6f, 0x6c, 0x63]) {
        compiler = Compiler::Solc;
    }

    // check for cbor encoded compiler metadata
    // https://cbor.io
    if bytecode.contains_slice(&[0x73, 0x6f, 0x6c, 0x63, 0x43]) {
        let compiler_version = bytecode.split_by_slice(&[0x73, 0x6f, 0x6c, 0x63, 0x43]);

        if compiler_version.len() > 1 {
            if let Some(encoded_version) = compiler_version.get(1).and_then(|last| last.get(0..3)) {
                version = encoded_version
                    .iter()
                    .map(|v| v.to_string())
                    .collect::<Vec<String>>()
                    .join(".");
                compiler = Compiler::Solc;
            }

            trace!(
                "exact compiler version match found due to cbor encoded metadata: {}",
                version
            );
        }
    } else if bytecode.contains_slice(&[0x76, 0x79, 0x70, 0x65, 0x72, 0x83]) {
        let compiler_version = bytecode.split_by_slice(&[0x76, 0x79, 0x70, 0x65, 0x72, 0x83]);

        if compiler_version.len() > 1 {
            if let Some(encoded_version) = compiler_version.get(1).and_then(|last| last.get(0..3)) {
                version = encoded_version
                    .iter()
                    .map(|v| v.to_string())
                    .collect::<Vec<String>>()
                    .join(".");
                compiler = Compiler::Vyper;
            }

            trace!("exact compiler version match found due to cbor encoded metadata");
        }
    }

    debug!("detected compiler {compiler} {version}.");

    (compiler, version.trim_end_matches('.').to_string())
}

After running our classification function on the $10,000$ contracts, we can generate a mapping of the contracts to their detected compiler and version:

Note: we also save unpruned bytecode in an additional mapping of similar structure.

snippet.json
{
    "Proxy": {
        "0x3fc90d031eecc364c620166ee7a791a151a16062": "0x3660603761603660735a...",
        ...
    },
    "Unknown": {
        "0xdf1b41413eafccfc6e98bb905feaeb271d307af3": "0x5f35601c60608216601b...",
        ...
    },
    "Solc": {
        "0x29109547921fb1978bbbe192f37e546de454dcdb": "0x60605236156157637c6...",
        ...
    },
    "Vyper": {
        "0x8d0f9c9fa4c1b265cd5032fe6ba4fefc9d94badb": "0x603611615761565b603...",
        ...
    }
}

This initial classification function correctly detects the compiler of $6,254$ contracts out of $6,599$ non-proxy contracts; an accuracy of $94.8\%$ . This is already a great start, but I believe we can do better.

Interestingly, if we remove CBOR encoded metadata detection from our classification function entirely, we still get the same result of $94.8\%$ , as CBOR encoded metadata is not necessary for our classification function to work and can only help determine the exact compiler version used.

Out of curiosity, I also ran the classification function using only CBOR encoded metadata detection, which resulted in a classification accuracy of $13.8\%$ , showing that the metadata is not present in the majority of contracts. Interestingly, the metadata was present in nearly three times as many Solidity contracts ( $671$ ) as Vyper contracts ( $238$ ).
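The accuracy figures in this section come down to comparing each detected compiler against the label saved in the CSV. A rough sketch of that comparison is below. The `Label` enum and both helpers are hypothetical names for this illustration, and the label parsing assumes only the two formats visible in the CSV slice above (`vyper:…` and `v…+commit.…`).

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Label {
    Solc,
    Vyper,
}

/// Hypothetical helper: map a CSV `compiler_version` field to a label.
/// Assumes Vyper rows start with "vyper:" and Solidity rows start with "v".
fn label_from_csv(field: &str) -> Option<Label> {
    if field.starts_with("vyper:") {
        Some(Label::Vyper)
    } else if field.starts_with('v') {
        Some(Label::Solc)
    } else {
        None
    }
}

/// Hypothetical helper: fraction of (expected, detected) pairs that agree.
/// A detection of `None` (unknown) counts as incorrect.
fn accuracy(pairs: &[(Label, Option<Label>)]) -> f64 {
    if pairs.is_empty() {
        return 0.0;
    }
    let correct = pairs
        .iter()
        .filter(|(expected, detected)| *detected == Some(*expected))
        .count();
    correct as f64 / pairs.len() as f64
}
```

With the numbers reported above, $6,254$ correct detections out of $6,599$ contracts gives $6254 / 6599 \approx 0.948$ .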

3. Pattern Analysis

In order to improve our classification algorithm's accuracy, we will analyze the pruned bytecode of each contract for both Solidity and Vyper with the hope of identifying distinct patterns that can be used to fingerprint the compiler. We will focus on sequences of five operations, as these are long enough to be unique but short enough to be common. Here's the general process we will follow:

1. Given a list of contracts generated by known compilers, for each contract:
   a. Extract all unique sequences of five operations from the bytecode.
   b. Count the frequency of each sequence in all contracts generated by this compiler. So, if a sequence occurs in $1000$ out of $5000$ contracts, its frequency would be $20\%$ .
2. We want the most compiler-specific sequences, so we will calculate the percentage of contracts generated by each compiler that contain each sequence, and sort the sequences by this percentage, filtering out sequences that are not strong enough heuristics to be used confidently.
3. We will then compare the sequences for Solidity and Vyper to see if there are any distinct patterns that can be used to fingerprint the compiler. For example, sequences which occur frequently in Solidity contracts but rarely in Vyper contracts could be used as a fingerprint for Solidity.

Note: we don't look for longer sequences, as if a sequence of six operations exists, a subset sequence of five operations will also exist and is more likely to be found in other contracts. For example, if 0x60 0x80 0x60 0x40 0x52 0x60 exists in the bytecode, then 0x60 0x80 0x60 0x40 0x52 also must exist within the bytecode, and is more likely to be found in other contracts due to its shorter length.
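The process described in this section can be sketched roughly as below. All three function names are hypothetical, the input is assumed to be already-pruned bytecode, and the `min_frequency` cutoff is an assumed parameter rather than a value taken from the analysis. For simplicity, the sketch keeps only sequences that never appear in the other compiler's contracts.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical helper: every unique window of five consecutive opcodes.
fn unique_sequences(pruned: &[u8]) -> HashSet<[u8; 5]> {
    pruned
        .windows(5)
        .map(|w| [w[0], w[1], w[2], w[3], w[4]])
        .collect()
}

/// Hypothetical helper: fraction of contracts that contain each sequence.
fn sequence_frequency(contracts: &[Vec<u8>]) -> HashMap<[u8; 5], f64> {
    let mut counts: HashMap<[u8; 5], usize> = HashMap::new();
    for pruned in contracts {
        // Each contract contributes at most once per sequence.
        for seq in unique_sequences(pruned) {
            *counts.entry(seq).or_insert(0) += 1;
        }
    }
    let total = contracts.len() as f64;
    counts
        .into_iter()
        .map(|(seq, n)| (seq, n as f64 / total))
        .collect()
}

/// Hypothetical helper: sequences at or above `min_frequency` in `ours`
/// that never appear in `theirs`, sorted by descending frequency.
fn exclusive_sequences(
    ours: &HashMap<[u8; 5], f64>,
    theirs: &HashMap<[u8; 5], f64>,
    min_frequency: f64,
) -> Vec<([u8; 5], f64)> {
    let mut out: Vec<([u8; 5], f64)> = ours
        .iter()
        .filter(|(seq, freq)| **freq >= min_frequency && !theirs.contains_key(*seq))
        .map(|(seq, freq)| (*seq, *freq))
        .collect();
    out.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal));
    out
}
```

Running `sequence_frequency` once per compiler and passing the two maps to `exclusive_sequences` in both directions gives one candidate list per compiler.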

4. Results

After performing pattern analysis, we are left with the following sequences, along with their frequency in all contracts and the percentage of contracts generated by each compiler that contain the sequence:

Sequence	Assembly	Frequency	Vyper	Solc
`0x5460526060`	`SLOAD PUSH1 MSTORE PUSH1 PUSH1`	9161	31.03%	0.00%
`0x6054605260`	`PUSH1 SLOAD PUSH1 MSTORE PUSH1`	6801	30.54%	0.00%
`0x6152615161`	`PUSH2 MSTORE PUSH2 MLOAD PUSH2`	30249	28.94%	0.00%
`0x6151615260`	`PUSH2 MLOAD PUSH2 MSTORE PUSH1`	6718	28.16%	0.00%
`0x6152606152`	`PUSH2 MSTORE PUSH1 PUSH2 MSTORE`	10146	27.34%	0.00%
`0x9050905081`	`SWAP1 POP SWAP1 POP DUP2`	8968	27.27%	0.00%
`0x61527f6152`	`PUSH2 MSTORE PUSH32 PUSH2 MSTORE`	5651	26.56%	0.00%
`0x8063146157`	`DUP1 PUSH4 EQ PUSH2 JUMPI`	27780	0.00%	94.47%
`0x1461578063`	`EQ PUSH2 JUMPI DUP1 PUSH4`	23576	0.00%	93.71%
`0x6157806314`	`PUSH2 JUMPI DUP1 PUSH4 EQ`	25464	0.00%	93.71%
`0x5780631461`	`JUMPI DUP1 PUSH4 EQ PUSH2`	25464	0.00%	93.71%

Findings

Given our set of sequences, we can now modify our classification function to use them when detecting the compiler used to generate a contract's bytecode. We will do this with a simple confidence heuristic: each sequence found in a contract adds its observed frequency to the confidence score of the compiler it is associated with, and the compiler with the higher total score wins. Luckily, our sequences are compiler-exclusive in our sample, so we can be reasonably confident in our classification.

View `detect_compiler_new.rs`

snippet.rs
/// Detect the compiler used to generate the given bytecode.
pub fn detect_compiler_new(bytecode: &[u8]) -> (Compiler, String) {
    let mut compiler = Compiler::Unknown;
    let mut version = "unknown".to_string();

    // Previously known heuristic: perform prefix check for rough version matching
    if bytecode.starts_with(&[
        0x36, 0x60, 0x00, 0x60, 0x00, 0x37, 0x61, 0x10, 0x00, 0x60, 0x00, 0x36, 0x60, 0x00, 0x73,
    ]) {
        compiler = Compiler::Vyper;
        version = "proxy".to_string();
    } else if bytecode.starts_with(&[0x60, 0x04, 0x36, 0x10, 0x15]) {
        compiler = Compiler::Vyper;
        version = "0.2.0-0.2.4,0.2.11-0.3.3".to_string();
    } else if bytecode.starts_with(&[0x34, 0x15, 0x61, 0x00, 0x0a]) {
        compiler = Compiler::Vyper;
        version = "0.2.5-0.2.8".to_string();
    } else if bytecode.starts_with(&[0x73, 0x1b, 0xf7, 0x97]) {
        compiler = Compiler::Solc;
        version = "0.4.10-0.4.24".to_string();
    } else if bytecode.starts_with(&[0x60, 0x80, 0x60, 0x40, 0x52]) {
        compiler = Compiler::Solc;
        version = "0.4.22+".to_string();
    } else if bytecode.starts_with(&[0x60, 0x60, 0x60, 0x40, 0x52]) {
        compiler = Compiler::Solc;
        version = "0.4.11-0.4.21".to_string();
    } else if bytecode.contains_slice(&[0x76, 0x79, 0x70, 0x65, 0x72]) {
        compiler = Compiler::Vyper;
    } else if bytecode.contains_slice(&[0x73, 0x6f, 0x6c, 0x63]) {
        compiler = Compiler::Solc;
    }

    // Remove `PUSHN [u8; n]` bytes so we are left with only operations
    let pruned_bytecode = remove_pushbytes_from_bytecode(Bytes::from_iter(bytecode.iter()))
        .expect("invalid bytecode");

    // heuristics are in the form of (sequence, solc confidence, vyper confidence)
    let heuristics = [
        // Solidity
        ([0x80, 0x63, 0x14, 0x61, 0x57], 0.9447, 0.0),
        ([0x14, 0x61, 0x57, 0x80, 0x63], 0.9371, 0.0),
        ([0x61, 0x57, 0x80, 0x63, 0x14], 0.9371, 0.0),
        ([0x57, 0x80, 0x63, 0x14, 0x61], 0.9371, 0.0),
        // Vyper
        ([0x54, 0x60, 0x52, 0x60, 0x60], 0.00, 0.3103),
        ([0x60, 0x54, 0x60, 0x52, 0x60], 0.00, 0.3054),
        ([0x61, 0x52, 0x61, 0x51, 0x61], 0.00, 0.2894),
        ([0x61, 0x51, 0x61, 0x52, 0x60], 0.00, 0.2816),
        ([0x61, 0x52, 0x60, 0x61, 0x52], 0.00, 0.2734),
        ([0x90, 0x50, 0x90, 0x50, 0x81], 0.00, 0.2727),
        ([0x61, 0x52, 0x7f, 0x61, 0x52], 0.00, 0.2656),
    ];

    // for each heuristic, check if the bytecode contains the sequence and increment the confidence for that compiler.
    // the compiler with the highest confidence is chosen
    let (mut solc_confidence, mut vyper_confidence) = (0.0, 0.0);
    for (sequence, solc, vyper) in heuristics.iter() {
        if pruned_bytecode.contains_slice(sequence) {
            solc_confidence += solc;
            vyper_confidence += vyper;
        }
    }

    // classify the compiler based on the confidence levels
    if solc_confidence != 0.0 && solc_confidence > vyper_confidence {
        compiler = Compiler::Solc;
    } else if vyper_confidence != 0.0 && vyper_confidence > solc_confidence {
        compiler = Compiler::Vyper;
    }

    // Previously known heuristic: check for cbor encoded compiler metadata
    // https://cbor.io
    if bytecode.contains_slice(&[0x73, 0x6f, 0x6c, 0x63, 0x43]) {
        let compiler_version = bytecode.split_by_slice(&[0x73, 0x6f, 0x6c, 0x63, 0x43]);

        if compiler_version.len() > 1 {
            if let Some(encoded_version) = compiler_version.get(1).and_then(|last| last.get(0..3)) {
                version = encoded_version
                    .iter()
                    .map(|v| v.to_string())
                    .collect::<Vec<String>>()
                    .join(".");
                compiler = Compiler::Solc;
            }

            trace!(
                "exact compiler version match found due to cbor encoded metadata: {}",
                version
            );
        }
    } else if bytecode.contains_slice(&[0x76, 0x79, 0x70, 0x65, 0x72, 0x83]) {
        let compiler_version = bytecode.split_by_slice(&[0x76, 0x79, 0x70, 0x65, 0x72, 0x83]);

        if compiler_version.len() > 1 {
            if let Some(encoded_version) = compiler_version.get(1).and_then(|last| last.get(0..3)) {
                version = encoded_version
                    .iter()
                    .map(|v| v.to_string())
                    .collect::<Vec<String>>()
                    .join(".");
                compiler = Compiler::Vyper;
            }

            trace!("exact compiler version match found due to cbor encoded metadata");
        }
    }

    debug!("detected compiler {compiler} {version}.");

    (compiler, version.trim_end_matches('.').to_string())
}

With our new classification function in place, we reanalyze the $6,599$ non-proxy contracts and find that we correctly classify $6,476$ of them, an improved accuracy of $98.1\%$ ! While this is only a modest improvement over our initial classification algorithm, it's still a step in the right direction, leaving $123$ contracts that are not correctly classified.

Proxy Contracts

Through our analysis, it also became easy to detect proxy contracts, which are minimal contracts that delegate their logic to another contract. The pruned bytecode of these contracts is almost always:

snippet.txt

0x363d3d373d3d3d363d735af43d82803e903d916057fd5bf3

We can therefore modify our classification function to detect these contracts with near-perfect accuracy. These contracts are typically hand-written assembly rather than compiler output, so they are not classified as Solidity or Vyper contracts.
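As a worked example, the pruned sequence above is what remains of the EIP-1167 minimal proxy runtime code once its push data (the 20-byte target address and one jump offset) is removed. The sketch below checks this. The helper names are hypothetical, and the pruning routine is a simplified stand-in that is restated here only so the block is self-contained.

```rust
/// The pruned opcode sequence quoted above.
const MINIMAL_PROXY_PRUNED: [u8; 24] = [
    0x36, 0x3d, 0x3d, 0x37, 0x3d, 0x3d, 0x3d, 0x36, 0x3d, 0x73, 0x5a, 0xf4, 0x3d, 0x82, 0x80,
    0x3e, 0x90, 0x3d, 0x91, 0x60, 0x57, 0xfd, 0x5b, 0xf3,
];

/// Hypothetical helper: drop the data bytes following PUSH1..PUSH32.
fn prune_push_data(bytecode: &[u8]) -> Vec<u8> {
    let mut pruned = Vec::new();
    let mut i = 0;
    while i < bytecode.len() {
        let op = bytecode[i];
        pruned.push(op);
        if (0x60..=0x7f).contains(&op) {
            i += (op - 0x5f) as usize;
        }
        i += 1;
    }
    pruned
}

/// Hypothetical helper: EIP-1167 runtime code delegating to `target`.
fn eip1167_runtime(target: [u8; 20]) -> Vec<u8> {
    let mut code = vec![0x36, 0x3d, 0x3d, 0x37, 0x3d, 0x3d, 0x3d, 0x36, 0x3d, 0x73];
    code.extend_from_slice(&target);
    code.extend_from_slice(&[
        0x5a, 0xf4, 0x3d, 0x82, 0x80, 0x3e, 0x90, 0x3d, 0x91, 0x60, 0x2b, 0x57, 0xfd, 0x5b, 0xf3,
    ]);
    code
}

/// Hypothetical helper: true if the pruned bytecode is exactly the sequence above.
fn is_minimal_proxy(bytecode: &[u8]) -> bool {
    prune_push_data(bytecode) == MINIMAL_PROXY_PRUNED
}
```

Because the address is pruned away, `is_minimal_proxy(&eip1167_runtime(addr))` holds for any 20-byte `addr`, which is why a single exact comparison is enough to catch these contracts.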

View `detect_compiler_new_with_proxies.rs`

snippet.rs
/// Detect the compiler used to generate the given bytecode.
pub fn detect_compiler_new(bytecode: &[u8]) -> (Compiler, String) {
    let mut compiler = Compiler::Unknown;
    let mut version = "unknown".to_string();

    // Previously known heuristic: perform prefix check for rough version matching
    if bytecode.starts_with(&[
        0x36, 0x60, 0x00, 0x60, 0x00, 0x37, 0x61, 0x10, 0x00, 0x60, 0x00, 0x36, 0x60, 0x00, 0x73,
    ]) {
        compiler = Compiler::Vyper;
        version = "proxy".to_string();
    } else if bytecode.starts_with(&[0x60, 0x04, 0x36, 0x10, 0x15]) {
        compiler = Compiler::Vyper;
        version = "0.2.0-0.2.4,0.2.11-0.3.3".to_string();
    } else if bytecode.starts_with(&[0x34, 0x15, 0x61, 0x00, 0x0a]) {
        compiler = Compiler::Vyper;
        version = "0.2.5-0.2.8".to_string();
    } else if bytecode.starts_with(&[0x73, 0x1b, 0xf7, 0x97]) {
        compiler = Compiler::Solc;
        version = "0.4.10-0.4.24".to_string();
    } else if bytecode.starts_with(&[0x60, 0x80, 0x60, 0x40, 0x52]) {
        compiler = Compiler::Solc;
        version = "0.4.22+".to_string();
    } else if bytecode.starts_with(&[0x60, 0x60, 0x60, 0x40, 0x52]) {
        compiler = Compiler::Solc;
        version = "0.4.11-0.4.21".to_string();
    } else if bytecode.contains_slice(&[0x76, 0x79, 0x70, 0x65, 0x72]) {
        compiler = Compiler::Vyper;
    } else if bytecode.contains_slice(&[0x73, 0x6f, 0x6c, 0x63]) {
        compiler = Compiler::Solc;
    }

    // Remove `PUSHN [u8; n]` bytes so we are left with only operations
    let pruned_bytecode = remove_pushbytes_from_bytecode(Bytes::from_iter(bytecode.iter()))
        .expect("invalid bytecode");

    // detect minimal proxies
    if pruned_bytecode.eq(&vec![
        0x36, 0x3d, 0x3d, 0x37, 0x3d, 0x3d, 0x3d, 0x36, 0x3d, 0x73, 0x5a, 0xf4, 0x3d, 0x82, 0x80,
        0x3e, 0x90, 0x3d, 0x91, 0x60, 0x57, 0xfd, 0x5b, 0xf3,
    ]) {
        compiler = Compiler::Proxy;
        version = "minimal".to_string();
    }

    // heuristics are in the form of (sequence, solc confidence, vyper confidence)
    let heuristics = [
        // Solidity
        ([0x80, 0x63, 0x14, 0x61, 0x57], 0.9447, 0.0),
        ([0x14, 0x61, 0x57, 0x80, 0x63], 0.9371, 0.0),
        ([0x61, 0x57, 0x80, 0x63, 0x14], 0.9371, 0.0),
        ([0x57, 0x80, 0x63, 0x14, 0x61], 0.9371, 0.0),
        // Vyper
        ([0x54, 0x60, 0x52, 0x60, 0x60], 0.00, 0.3103),
        ([0x60, 0x54, 0x60, 0x52, 0x60], 0.00, 0.3054),
        ([0x61, 0x52, 0x61, 0x51, 0x61], 0.00, 0.2894),
        ([0x61, 0x51, 0x61, 0x52, 0x60], 0.00, 0.2816),
        ([0x61, 0x52, 0x60, 0x61, 0x52], 0.00, 0.2734),
        ([0x90, 0x50, 0x90, 0x50, 0x81], 0.00, 0.2727),
        ([0x61, 0x52, 0x7f, 0x61, 0x52], 0.00, 0.2656),
    ];

    // for each heuristic, check if the bytecode contains the sequence and increment the confidence for that compiler.
    // the compiler with the highest confidence is chosen
    let (mut solc_confidence, mut vyper_confidence) = (0.0, 0.0);
    for (sequence, solc, vyper) in heuristics.iter() {
        if pruned_bytecode.contains_slice(sequence) {
            solc_confidence += solc;
            vyper_confidence += vyper;
        }
    }

    // classify the compiler based on the confidence levels
    if solc_confidence != 0.0 && solc_confidence > vyper_confidence {
        compiler = Compiler::Solc;
    } else if vyper_confidence != 0.0 && vyper_confidence > solc_confidence {
        compiler = Compiler::Vyper;
    }

    // Previously known heuristic: check for cbor encoded compiler metadata
    // https://cbor.io
    if bytecode.contains_slice(&[0x73, 0x6f, 0x6c, 0x63, 0x43]) {
        let compiler_version = bytecode.split_by_slice(&[0x73, 0x6f, 0x6c, 0x63, 0x43]);

        if compiler_version.len() > 1 {
            if let Some(encoded_version) = compiler_version.get(1).and_then(|last| last.get(0..3)) {
                version = encoded_version
                    .iter()
                    .map(|v| v.to_string())
                    .collect::<Vec<String>>()
                    .join(".");
                compiler = Compiler::Solc;
            }

            trace!(
                "exact compiler version match found due to cbor encoded metadata: {}",
                version
            );
        }
    } else if bytecode.contains_slice(&[0x76, 0x79, 0x70, 0x65, 0x72, 0x83]) {
        let compiler_version = bytecode.split_by_slice(&[0x76, 0x79, 0x70, 0x65, 0x72, 0x83]);

        if compiler_version.len() > 1 {
            if let Some(encoded_version) = compiler_version.get(1).and_then(|last| last.get(0..3)) {
                version = encoded_version
                    .iter()
                    .map(|v| v.to_string())
                    .collect::<Vec<String>>()
                    .join(".");
                compiler = Compiler::Vyper;
            }

            trace!("exact compiler version match found due to cbor encoded metadata");
        }
    }

    debug!("detected compiler {compiler} {version}.");

    (compiler, version.trim_end_matches('.').to_string())
}

Potential Applications

The ability to fingerprint the compiler used to generate a contract's bytecode has several potential applications, including:

Vulnerability Scope Analysis: In July 2023, a critical vulnerability was discovered in the Vyper compiler which led to a series of exploits, affecting contracts compiled with Vyper versions 0.2.15, 0.2.16, and 0.3.0. A heuristic to identify contracts compiled with these versions may have helped to identify and mitigate the impact of the vulnerability sooner.

Note: A bytecode-specific heuristic would be more effective than searching for all verified contracts as it would also be able to identify unverified contracts.
Smart-Contract Analysis: When working with unverified contract bytecode, it can be useful to know which compiler was used to generate the bytecode. Tools such as heimdall's decompiler can use this information to provide more accurate decompilation results.
Compiler Optimization and Development: Understanding the specific patterns left by different compilers can help in the optimization and development of new compilers. Developers can analyze these patterns to identify inefficiencies and areas for improvement, leading to more efficient compiler designs.

Future Work

While our current approach is able to classify contracts with a high degree of accuracy, there are several areas for future work:

Memory Layout Analysis: By analyzing the memory layout of contracts generated by different compilers, we may be able to identify additional patterns that can be used to fingerprint the compiler.
Machine Learning: While we opted not to take an AI/ML approach to this problem, it may be interesting to see how well a model could perform at classifying contracts based on their bytecode.
Additional Compilers: Our current analysis focused on Solidity and Vyper, but there are many other compilers, such as Huff, that generate EVM bytecode. By analyzing contracts generated by these compilers, we may be able to identify additional patterns that can be used to fingerprint the compiler.

Conclusion

In this paper, we've explored the problem of fingerprinting the compiler used to generate a contract's bytecode. By analyzing the bytecode of contracts generated by Solidity and Vyper, we were able to identify distinct patterns that can be used to fingerprint the compiler, and implemented a classification algorithm that can detect the compiler used to generate a given contract's bytecode with a high degree of accuracy. Our approach not only enhances our understanding of the compilation process but also provides a practical tool for smart contract analysis and security, and will be replacing heimdall's current classification algorithm which is used to improve decompilation accuracy. Future work can further refine these techniques as well as extend them to additional compilers, improving the robustness and applicability of compiler fingerprinting in the EVM ecosystem.

Acknowledgements

Etherscan for providing the data used in this analysis.
Ian Guimaraes for brainstorming ideas and providing feedback on this paper.
bantg for his initial work on compiler detection for vyper contracts.