Jonathan Becker · May 30, 2024

Compiler Fingerprinting in EVM Bytecode

Despite what you might think, compilers are not black boxes. They are complex, deterministic systems that produce machine code from high-level programming languages through a series of well-defined steps. This means that the output of a compiler is not just a random sequence of bytes, but a structured and predictable representation of the original source code. In fact, the output of a compiler is just as much a reflection of the compiler itself as it is of the source code it was given.

In this experimental paper, we will dive into EVM bytecode and examine distinct patterns and markers left by different major EVM compilers. We will also explore the potential for using these patterns to identify the compiler used to generate a given contract's bytecode.

Brief: EVM Bytecode and Compilers

Compilers generate EVM bytecode by translating high-level code, such as Solidity, into a series of instructions (opcodes) that represent the program's logic. Compilers are extremely complex systems that can be broken down into several stages:

  1. Lexical Analysis: The compiler reads the source code and converts it into a stream of tokens. This may also be referred to as tokenization.
  2. Syntax Analysis: The compiler parses the tokens and builds an abstract syntax tree (AST) that represents the structure of the program.
  3. Semantic Analysis: The compiler checks the AST for semantic errors and performs type checking.
  4. IR Generation: The compiler translates the AST into an intermediate representation (IR) that is closer to the target machine code. In the case of EVM, this intermediate representation is typically in the form of EVM assembly or an IR such as Yul.
  5. Optimization: The compiler optimizes the IR to improve the efficiency of the generated code.
  6. Code Generation: The compiler translates the optimized IR into machine code. In the case of EVM, this machine code is EVM bytecode.

Through this process, the compiler leaves distinct patterns and markers in the generated bytecode that can be used to identify the compiler which generated it.

Existing Known Heuristics

One of the most well-known heuristics for identifying the compiler used to generate a given contract's bytecode is to examine the first few operations in the bytecode, as different compilers take different approaches to program execution. For example:

Solidity

The Solidity compiler, solc, typically uses the following sequences of opcodes as the first few instructions in the bytecode:

  • 0x60 0x80 0x60 0x40 0x52 (indicates solc 0.4.22+)
  • 0x60 0x60 0x60 0x40 0x52 (indicates solc 0.4.11-0.4.21)

The Solidity compiler begins execution by initializing memory that the program will use. For those interested, the exact Solidity memory layout can be found here.

Vyper

The Vyper compiler typically uses the following sequences of opcodes as the first few instructions in the bytecode:

  • 0x60 0x04 0x36 0x10 0x15 (indicates vyper 0.2.0-0.2.4,0.2.11-0.3.3)
  • 0x34 0x15 0x61 0x00 0x0a (indicates vyper 0.2.5-0.2.8)

The Vyper compiler begins execution immediately in its dispatcher, which is why the first few opcodes are different from Solidity.

CBOR Encoding

When a contract is compiled, the compiler may include metadata in the bytecode that can be used to identify the exact compiler version used to generate the bytecode. This metadata is encoded in a partial CBOR format:

  1. Encode vyper or solc as a hex string followed by a one-byte CBOR header: 0x767970657283 and 0x736f6c6343 respectively.
  2. Append the version as a 3-byte hex string: 0x000817 for version 0.8.23. For example, the metadata 0x736f6c6343000817 would be equivalent to solc 0.8.23.

However, this metadata is not always present in the bytecode as users can opt to exclude it from the deployed bytecode.
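The decoding described above can be sketched in a few lines of Rust. This is a minimal illustration and not heimdall's implementation: the helper names `version_after_marker` and `metadata_version` are hypothetical, and the sketch assumes the three bytes after the marker are the major, minor, and patch numbers, as in the 0.8.23 example.

```rust
/// ASCII "solc" followed by 0x43, the CBOR header for a 3-byte byte string.
const SOLC_MARKER: [u8; 5] = [0x73, 0x6f, 0x6c, 0x63, 0x43];
/// ASCII "vyper" followed by 0x83, the CBOR header for a 3-element array.
const VYPER_MARKER: [u8; 6] = [0x76, 0x79, 0x70, 0x65, 0x72, 0x83];

/// Returns "major.minor.patch" decoded from the three bytes that follow the
/// first occurrence of `marker`, or None if the marker or the bytes are missing.
fn version_after_marker(bytecode: &[u8], marker: &[u8]) -> Option<String> {
    // index of the first window that equals the marker
    let start = bytecode.windows(marker.len()).position(|w| w == marker)?;
    let from = start + marker.len();
    let version = bytecode.get(from..from + 3)?;
    Some(format!("{}.{}.{}", version[0], version[1], version[2]))
}

/// Hypothetical helper: (compiler name, version) when metadata is present.
fn metadata_version(bytecode: &[u8]) -> Option<(&'static str, String)> {
    if let Some(version) = version_after_marker(bytecode, &SOLC_MARKER) {
        return Some(("solc", version));
    }
    version_after_marker(bytecode, &VYPER_MARKER).map(|version| ("vyper", version))
}
```

Calling `metadata_version` on bytecode that contains 0x736f6c6343000817 yields `("solc", "0.8.23")`, and bytecode without either marker yields `None`, which is the case the paragraph above warns about.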

Methodology

If we can already roughly identify the compiler used to generate a contract's bytecode by examining the first few operations in the bytecode, how much more accurate can we be if we examine the entire bytecode? The process we will take to answer this question is as follows:

  1. Data Collection: We collect a random sample of 5,000 verified contracts for each of Solidity and Vyper from Etherscan.
  2. Data Classification: Using the known heuristics and patterns, we classify the contracts into three groups: Solidity, Vyper, and Unknown.
  3. Pattern Analysis: We analyze the bytecode of the contracts in each group to identify distinct patterns and markers that can be used to fingerprint the compiler.
  4. Results: Using the patterns and markers identified, we reclassify the contracts and evaluate the accuracy of our classification algorithm against known compiler versions.

Note: I've opted not to take an AI/ML approach to this problem, as I would rather be able to reason about the patterns and markers left by the compilers than rely on a black-box model which simply outputs a prediction.

1. Data Collection

We collected a random sample of 5,000 verified contracts for each of Solidity and Vyper from Etherscan, saving their exact compiler version in a CSV.

Those interested can view the full raw data here, but here's a slice of the data:

    address,compiler_version
    0xef672bd94913cb6f1d2812a6e18c1ffded8eff5c,vyper:0.3.1
    0x10ac65a9f710c3d607d213784e5b8632c77d5d4f,vyper:0.3.1
    0x0199429171bce183048dccf1d5546ca519ea9717,vyper:0.3.1
    0x1c3a367f8b2e921d2476870576fcf91670017897,vyper:0.3.9
    ...
    0xa21a59cc2375368fceb08898403fa7331b6531ad,v0.5.10+commit.5a6ea5b1
    0xeb08b206271350fcc9ae1cad1e27f348a2055600,v0.5.14+commit.1f1aaa4
    0x118cd20b58b069a2df45531cae31d1121fa4c310,v0.4.17+commit.bdeb9e52
    0xa6ead154167d2e712936b8ebc22b66903c46047c,v0.5.17+commit.d19bba13

We can then fetch the bytecode for each contract using JSON-RPC, and we will prune pushed bytes from the bytecode since they make pattern discovery more complicated. For example, 0x60 0x80 0x60 0x40 (PUSH1 0x80 PUSH1 0x40) would become 0x60 0x60 (PUSH1 PUSH1).
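The pruning step can be sketched as follows. This is an illustrative stand-in, not the `remove_pushbytes_from_bytecode` function used later in this article; the name `prune_push_data` is hypothetical. It relies only on the EVM encoding rule that PUSH1 (0x60) through PUSH32 (0x7f) are followed by 1 to 32 immediate bytes.

```rust
/// Removes the immediate data that follows every PUSH1..=PUSH32 opcode,
/// leaving only the opcodes themselves.
fn prune_push_data(bytecode: &[u8]) -> Vec<u8> {
    let mut pruned = Vec::with_capacity(bytecode.len());
    let mut i = 0;
    while i < bytecode.len() {
        let opcode = bytecode[i];
        pruned.push(opcode);
        // PUSH1 (0x60) through PUSH32 (0x7f) carry 1 to 32 immediate bytes.
        // PUSH0 (0x5f) carries none, so it falls through to the default arm.
        let immediate_len = match opcode {
            0x60..=0x7f => (opcode - 0x5f) as usize,
            _ => 0,
        };
        // skip the opcode and its immediates; a truncated trailing PUSH
        // simply ends the loop
        i += 1 + immediate_len;
    }
    pruned
}
```

Note that the example in the text shows only the first four bytes; with the trailing MSTORE included, 0x60 0x80 0x60 0x40 0x52 prunes to 0x60 0x60 0x52. A linear walk like this also treats any trailing metadata or data section as opcodes, which is an accepted simplification in this sketch.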

2. Data Classification

Now that we have a list of contracts and their bytecode, we will use the known heuristics to classify the contracts, and then compare the results to the actual compiler version to determine the accuracy of our initially known heuristics.

`detect_compiler.rs`:

    /// Detect the compiler used to generate the given bytecode.
    pub fn detect_compiler(bytecode: &[u8]) -> (Compiler, String) {
        let mut compiler = Compiler::Unknown;
        let mut version = "unknown".to_string();

        // check the prefix of the bytecode against known compiler patterns
        if bytecode.starts_with(&[
            0x36, 0x60, 0x00, 0x60, 0x00, 0x37, 0x61, 0x10, 0x00, 0x60, 0x00, 0x36, 0x60, 0x00, 0x73,
        ]) {
            compiler = Compiler::Vyper;
            version = "proxy".to_string();
        } else if bytecode.starts_with(&[0x60, 0x04, 0x36, 0x10, 0x15]) {
            compiler = Compiler::Vyper;
            version = "0.2.0-0.2.4,0.2.11-0.3.3".to_string();
        } else if bytecode.starts_with(&[0x34, 0x15, 0x61, 0x00, 0x0a]) {
            compiler = Compiler::Vyper;
            version = "0.2.5-0.2.8".to_string();
        } else if bytecode.starts_with(&[0x73, 0x1b, 0xf7, 0x97]) {
            compiler = Compiler::Solc;
            version = "0.4.10-0.4.24".to_string();
        } else if bytecode.starts_with(&[0x60, 0x80, 0x60, 0x40, 0x52]) {
            compiler = Compiler::Solc;
            version = "0.4.22+".to_string();
        } else if bytecode.starts_with(&[0x60, 0x60, 0x60, 0x40, 0x52]) {
            compiler = Compiler::Solc;
            version = "0.4.11-0.4.21".to_string();
        } else if bytecode.contains_slice(&[0x76, 0x79, 0x70, 0x65, 0x72]) {
            compiler = Compiler::Vyper;
        } else if bytecode.contains_slice(&[0x73, 0x6f, 0x6c, 0x63]) {
            compiler = Compiler::Solc;
        }

        // check for cbor encoded compiler metadata
        // https://cbor.io
        if bytecode.contains_slice(&[0x73, 0x6f, 0x6c, 0x63, 0x43]) {
            let compiler_version = bytecode.split_by_slice(&[0x73, 0x6f, 0x6c, 0x63, 0x43]);

            if compiler_version.len() > 1 {
                if let Some(encoded_version) = compiler_version.get(1).and_then(|last| last.get(0..3)) {
                    version = encoded_version
                        .iter()
                        .map(|v| v.to_string())
                        .collect::<Vec<String>>()
                        .join(".");
                    compiler = Compiler::Solc;
                }

                trace!(
                    "exact compiler version match found due to cbor encoded metadata: {}",
                    version
                );
            }
        } else if bytecode.contains_slice(&[0x76, 0x79, 0x70, 0x65, 0x72, 0x83]) {
            let compiler_version = bytecode.split_by_slice(&[0x76, 0x79, 0x70, 0x65, 0x72, 0x83]);

            if compiler_version.len() > 1 {
                if let Some(encoded_version) = compiler_version.get(1).and_then(|last| last.get(0..3)) {
                    version = encoded_version
                        .iter()
                        .map(|v| v.to_string())
                        .collect::<Vec<String>>()
                        .join(".");
                    compiler = Compiler::Vyper;
                }

                trace!("exact compiler version match found due to cbor encoded metadata");
            }
        }

        debug!("detected compiler {compiler} {version}.");

        (compiler, version.trim_end_matches('.').to_string())
    }

After running our classification function on the 10,000 contracts, we can generate a mapping of the contracts to their detected compiler and version:

Note: we also save unpruned bytecode in an additional mapping of similar structure.

    {
      "Proxy": {
        "0x3fc90d031eecc364c620166ee7a791a151a16062": "0x3660603761603660735a...",
        ...
      },
      "Unknown": {
        "0xdf1b41413eafccfc6e98bb905feaeb271d307af3": "0x5f35601c60608216601b...",
        ...
      },
      "Solc": {
        "0x29109547921fb1978bbbe192f37e546de454dcdb": "0x60605236156157637c6...",
        ...
      },
      "Vyper": {
        "0x8d0f9c9fa4c1b265cd5032fe6ba4fefc9d94badb": "0x603611615761565b603...",
        ...
      }
    }

This initial classification function correctly detects the compiler of 6,254 contracts out of 6,599 non-proxy contracts: an accuracy of 94.8%. This is already a great start, but I believe we can do better.

Interestingly, if we remove CBOR encoded metadata detection from our classification function entirely, we still get the same result of 94.8%, as CBOR encoded metadata is not necessary for our classification function to work and can only help determine the exact compiler version used.

Out of curiosity, I also ran the classification function using only CBOR encoded metadata detection, which resulted in a classification accuracy of 13.8%, showing that the metadata is not present in the majority of contracts. Interestingly, the metadata was present in nearly three times as many Solidity contracts (671) as Vyper contracts (238).

3. Pattern Analysis

In order to improve our classification algorithm's accuracy, we will analyze the pruned bytecode of each contract for both Solidity and Vyper with the hope of identifying distinct patterns that can be used to fingerprint the compiler. We will focus on sequences of five operations, as these are long enough to be unique but short enough to be common. Here's the general process we will follow:

  1. Given a list of contracts generated by known compilers, for each contract:
    1. Extract all unique sequences of five operations from the bytecode.
    2. Count the frequency of each sequence in all contracts generated by this compiler. So, if a sequence occurs in 1,000 out of 5,000 contracts, its frequency would be 20%.
  2. We want the most compiler-specific sequences, so we will calculate the percentage of contracts generated by each compiler that contain each sequence, and sort the sequences by this percentage, filtering out sequences that are not strong enough heuristics to be used confidently.
  3. We will then compare the sequences for Solidity and Vyper to see if there are any distinct patterns that can be used to fingerprint the compiler. For example, sequences which occur frequently in Solidity contracts but rarely in Vyper contracts could be used as a fingerprint for Solidity.

Note: we don't look for longer sequences because, if a sequence of six operations exists, each five-operation subsequence of it also exists and is more likely to be found in other contracts. For example, if 0x60 0x80 0x60 0x40 0x52 0x60 exists in the bytecode, then 0x60 0x80 0x60 0x40 0x52 must also exist within the bytecode, and is more likely to be found in other contracts due to its shorter length.
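The process above can be sketched roughly as follows. This is a simplified illustration rather than the exact analysis code used for this article: the function names, the `min_share` threshold, and the strict "never seen in the other compiler" filter are all assumptions made for the example.

```rust
use std::collections::{HashMap, HashSet};
use std::convert::TryFrom;

/// Length of the opcode windows we look for.
const WINDOW: usize = 5;

/// For one compiler's set of pruned contracts, returns the fraction of
/// contracts that contain each five-opcode sequence at least once.
fn sequence_frequencies(contracts: &[Vec<u8>]) -> HashMap<[u8; WINDOW], f64> {
    let mut counts: HashMap<[u8; WINDOW], usize> = HashMap::new();
    for code in contracts {
        // deduplicate within a contract so each contract counts a sequence once
        let unique: HashSet<[u8; WINDOW]> = code
            .windows(WINDOW)
            .map(|w| <[u8; WINDOW]>::try_from(w).expect("window has WINDOW bytes"))
            .collect();
        for seq in unique {
            *counts.entry(seq).or_insert(0) += 1;
        }
    }
    // avoid dividing by zero when no contracts were supplied
    let total = contracts.len().max(1) as f64;
    counts.into_iter().map(|(seq, n)| (seq, n as f64 / total)).collect()
}

/// Keeps sequences seen in at least `min_share` of `ours` and never in
/// `theirs`, sorted by descending share.
fn exclusive_sequences(
    ours: &HashMap<[u8; WINDOW], f64>,
    theirs: &HashMap<[u8; WINDOW], f64>,
    min_share: f64,
) -> Vec<([u8; WINDOW], f64)> {
    let mut picked: Vec<([u8; WINDOW], f64)> = ours
        .iter()
        .filter(|(seq, share)| **share >= min_share && !theirs.contains_key(*seq))
        .map(|(seq, share)| (*seq, *share))
        .collect();
    picked.sort_by(|a, b| b.1.total_cmp(&a.1));
    picked
}
```

Running `sequence_frequencies` once per compiler and then calling `exclusive_sequences` in both directions gives two candidate fingerprint lists. A softer filter (for example, allowing a small nonzero share in the other compiler) would be a reasonable variation.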

4. Results

After performing pattern analysis, we are left with the following sequences, along with their frequency in all contracts and the percentage of contracts generated by each compiler that contain the sequence:

    Sequence      Assembly                          Frequency  Vyper    Solc
    0x5460526060  SLOAD PUSH1 MSTORE PUSH1 PUSH1    9161       31.03%   0.00%
    0x6054605260  PUSH1 SLOAD PUSH1 MSTORE PUSH1    6801       30.54%   0.00%
    0x6152615161  PUSH2 MSTORE PUSH2 MLOAD PUSH2    30249      28.94%   0.00%
    0x6151615260  PUSH2 MLOAD PUSH2 MSTORE PUSH1    6718       28.16%   0.00%
    0x6152606152  PUSH2 MSTORE PUSH1 PUSH2 MSTORE   10146      27.34%   0.00%
    0x9050905081  SWAP1 POP SWAP1 POP DUP2          8968       27.27%   0.00%
    0x61527f6152  PUSH2 MSTORE PUSH32 PUSH2 MSTORE  5651       26.56%   0.00%
    0x8063146157  DUP1 PUSH4 EQ PUSH2 JUMPI         27780      0.00%    94.47%
    0x1461578063  EQ PUSH2 JUMPI DUP1 PUSH4         23576      0.00%    93.71%
    0x6157806314  PUSH2 JUMPI DUP1 PUSH4 EQ         25464      0.00%    93.71%
    0x5780631461  JUMPI DUP1 PUSH4 EQ PUSH2         25464      0.00%    93.71%

Findings

Given our set of sequences, we can now modify our classification function to use these sequences when detecting the compiler used to generate a contract's bytecode. We will do this with a simple confidence heuristic: each sequence found in a contract adds its observed frequency to the confidence score of the compiler it belongs to, and the contract is classified as whichever compiler ends up with the higher score. Luckily, our sequences are pretty much compiler-specific and exclusive, so we can be confident in our classification.

`detect_compiler_new.rs`:

    /// Detect the compiler used to generate the given bytecode.
    pub fn detect_compiler_new(bytecode: &[u8]) -> (Compiler, String) {
        let mut compiler = Compiler::Unknown;
        let mut version = "unknown".to_string();

        // Previously known heuristic: perform prefix check for rough version matching
        if bytecode.starts_with(&[
            0x36, 0x60, 0x00, 0x60, 0x00, 0x37, 0x61, 0x10, 0x00, 0x60, 0x00, 0x36, 0x60, 0x00, 0x73,
        ]) {
            compiler = Compiler::Vyper;
            version = "proxy".to_string();
        } else if bytecode.starts_with(&[0x60, 0x04, 0x36, 0x10, 0x15]) {
            compiler = Compiler::Vyper;
            version = "0.2.0-0.2.4,0.2.11-0.3.3".to_string();
        } else if bytecode.starts_with(&[0x34, 0x15, 0x61, 0x00, 0x0a]) {
            compiler = Compiler::Vyper;
            version = "0.2.5-0.2.8".to_string();
        } else if bytecode.starts_with(&[0x73, 0x1b, 0xf7, 0x97]) {
            compiler = Compiler::Solc;
            version = "0.4.10-0.4.24".to_string();
        } else if bytecode.starts_with(&[0x60, 0x80, 0x60, 0x40, 0x52]) {
            compiler = Compiler::Solc;
            version = "0.4.22+".to_string();
        } else if bytecode.starts_with(&[0x60, 0x60, 0x60, 0x40, 0x52]) {
            compiler = Compiler::Solc;
            version = "0.4.11-0.4.21".to_string();
        } else if bytecode.contains_slice(&[0x76, 0x79, 0x70, 0x65, 0x72]) {
            compiler = Compiler::Vyper;
        } else if bytecode.contains_slice(&[0x73, 0x6f, 0x6c, 0x63]) {
            compiler = Compiler::Solc;
        }

        // Remove `PUSHN [u8; n]` bytes so we are left with only operations
        let pruned_bytecode = remove_pushbytes_from_bytecode(Bytes::from_iter(bytecode.iter()))
            .expect("invalid bytecode");

        // heuristics are in the form of (sequence, solc confidence, vyper confidence)
        let heuristics = [
            // Solidity
            ([0x80, 0x63, 0x14, 0x61, 0x57], 0.9447, 0.0),
            ([0x14, 0x61, 0x57, 0x80, 0x63], 0.9371, 0.0),
            ([0x61, 0x57, 0x80, 0x63, 0x14], 0.9371, 0.0),
            ([0x57, 0x80, 0x63, 0x14, 0x61], 0.9371, 0.0),
            // Vyper
            ([0x54, 0x60, 0x52, 0x60, 0x60], 0.00, 0.3103),
            ([0x60, 0x54, 0x60, 0x52, 0x60], 0.00, 0.3054),
            ([0x61, 0x52, 0x61, 0x51, 0x61], 0.00, 0.2894),
            ([0x61, 0x51, 0x61, 0x52, 0x60], 0.00, 0.2816),
            ([0x61, 0x52, 0x60, 0x61, 0x52], 0.00, 0.2734),
            ([0x90, 0x50, 0x90, 0x50, 0x81], 0.00, 0.2727),
            ([0x61, 0x52, 0x7f, 0x61, 0x52], 0.00, 0.2656),
        ];

        // for each heuristic, check if the bytecode contains the sequence and increment the
        // confidence for that compiler. the compiler with the highest confidence is chosen
        let (mut solc_confidence, mut vyper_confidence) = (0.0, 0.0);
        for (sequence, solc, vyper) in heuristics.iter() {
            if pruned_bytecode.contains_slice(sequence) {
                solc_confidence += solc;
                vyper_confidence += vyper;
            }
        }

        // classify the compiler based on the confidence levels
        if solc_confidence != 0.0 && solc_confidence > vyper_confidence {
            compiler = Compiler::Solc;
        } else if vyper_confidence != 0.0 && vyper_confidence > solc_confidence {
            compiler = Compiler::Vyper;
        }

        // Previously known heuristic: check for cbor encoded compiler metadata
        // https://cbor.io
        if bytecode.contains_slice(&[0x73, 0x6f, 0x6c, 0x63, 0x43]) {
            let compiler_version = bytecode.split_by_slice(&[0x73, 0x6f, 0x6c, 0x63, 0x43]);

            if compiler_version.len() > 1 {
                if let Some(encoded_version) = compiler_version.get(1).and_then(|last| last.get(0..3)) {
                    version = encoded_version
                        .iter()
                        .map(|v| v.to_string())
                        .collect::<Vec<String>>()
                        .join(".");
                    compiler = Compiler::Solc;
                }

                trace!(
                    "exact compiler version match found due to cbor encoded metadata: {}",
                    version
                );
            }
        } else if bytecode.contains_slice(&[0x76, 0x79, 0x70, 0x65, 0x72, 0x83]) {
            let compiler_version = bytecode.split_by_slice(&[0x76, 0x79, 0x70, 0x65, 0x72, 0x83]);

            if compiler_version.len() > 1 {
                if let Some(encoded_version) = compiler_version.get(1).and_then(|last| last.get(0..3)) {
                    version = encoded_version
                        .iter()
                        .map(|v| v.to_string())
                        .collect::<Vec<String>>()
                        .join(".");
                    compiler = Compiler::Vyper;
                }

                trace!("exact compiler version match found due to cbor encoded metadata");
            }
        }

        debug!("detected compiler {compiler} {version}.");

        (compiler, version.trim_end_matches('.').to_string())
    }

With our new classification function in place, we reanalyze the 6,599 non-proxy contracts and find that we correctly classify 6,476 of them, an improved accuracy of 98.1%! While this is only a marginal improvement over our initial classification algorithm, it's still a step in the right direction and only a few contracts away from perfect accuracy.

Proxy Contracts

Through our analysis, it also became easy to detect proxy contracts, which are minimal contracts that delegate their logic to another contract. The pruned bytecode of these contracts is almost always:

    0x363d3d373d3d3d363d735af43d82803e903d916057fd5bf3

so we can modify our classification function to detect these contracts with near-perfect accuracy. These contracts are typically not generated by a compiler, but rather by manually written assembly, so they are not classified as Solidity or Vyper contracts.

`detect_compiler_new_with_proxies.rs`:

    /// Detect the compiler used to generate the given bytecode.
    pub fn detect_compiler_new(bytecode: &[u8]) -> (Compiler, String) {
        let mut compiler = Compiler::Unknown;
        let mut version = "unknown".to_string();

        // Previously known heuristic: perform prefix check for rough version matching
        if bytecode.starts_with(&[
            0x36, 0x60, 0x00, 0x60, 0x00, 0x37, 0x61, 0x10, 0x00, 0x60, 0x00, 0x36, 0x60, 0x00, 0x73,
        ]) {
            compiler = Compiler::Vyper;
            version = "proxy".to_string();
        } else if bytecode.starts_with(&[0x60, 0x04, 0x36, 0x10, 0x15]) {
            compiler = Compiler::Vyper;
            version = "0.2.0-0.2.4,0.2.11-0.3.3".to_string();
        } else if bytecode.starts_with(&[0x34, 0x15, 0x61, 0x00, 0x0a]) {
            compiler = Compiler::Vyper;
            version = "0.2.5-0.2.8".to_string();
        } else if bytecode.starts_with(&[0x73, 0x1b, 0xf7, 0x97]) {
            compiler = Compiler::Solc;
            version = "0.4.10-0.4.24".to_string();
        } else if bytecode.starts_with(&[0x60, 0x80, 0x60, 0x40, 0x52]) {
            compiler = Compiler::Solc;
            version = "0.4.22+".to_string();
        } else if bytecode.starts_with(&[0x60, 0x60, 0x60, 0x40, 0x52]) {
            compiler = Compiler::Solc;
            version = "0.4.11-0.4.21".to_string();
        } else if bytecode.contains_slice(&[0x76, 0x79, 0x70, 0x65, 0x72]) {
            compiler = Compiler::Vyper;
        } else if bytecode.contains_slice(&[0x73, 0x6f, 0x6c, 0x63]) {
            compiler = Compiler::Solc;
        }

        // Remove `PUSHN [u8; n]` bytes so we are left with only operations
        let pruned_bytecode = remove_pushbytes_from_bytecode(Bytes::from_iter(bytecode.iter()))
            .expect("invalid bytecode");

        // detect minimal proxies
        if pruned_bytecode.eq(&vec![
            0x36, 0x3d, 0x3d, 0x37, 0x3d, 0x3d, 0x3d, 0x36, 0x3d, 0x73, 0x5a, 0xf4, 0x3d, 0x82, 0x80,
            0x3e, 0x90, 0x3d, 0x91, 0x60, 0x57, 0xfd, 0x5b, 0xf3,
        ]) {
            compiler = Compiler::Proxy;
            version = "minimal".to_string();
        }

        // heuristics are in the form of (sequence, solc confidence, vyper confidence)
        let heuristics = [
            // Solidity
            ([0x80, 0x63, 0x14, 0x61, 0x57], 0.9447, 0.0),
            ([0x14, 0x61, 0x57, 0x80, 0x63], 0.9371, 0.0),
            ([0x61, 0x57, 0x80, 0x63, 0x14], 0.9371, 0.0),
            ([0x57, 0x80, 0x63, 0x14, 0x61], 0.9371, 0.0),
            // Vyper
            ([0x54, 0x60, 0x52, 0x60, 0x60], 0.00, 0.3103),
            ([0x60, 0x54, 0x60, 0x52, 0x60], 0.00, 0.3054),
            ([0x61, 0x52, 0x61, 0x51, 0x61], 0.00, 0.2894),
            ([0x61, 0x51, 0x61, 0x52, 0x60], 0.00, 0.2816),
            ([0x61, 0x52, 0x60, 0x61, 0x52], 0.00, 0.2734),
            ([0x90, 0x50, 0x90, 0x50, 0x81], 0.00, 0.2727),
            ([0x61, 0x52, 0x7f, 0x61, 0x52], 0.00, 0.2656),
        ];

        // for each heuristic, check if the bytecode contains the sequence and increment the
        // confidence for that compiler. the compiler with the highest confidence is chosen
        let (mut solc_confidence, mut vyper_confidence) = (0.0, 0.0);
        for (sequence, solc, vyper) in heuristics.iter() {
            if pruned_bytecode.contains_slice(sequence) {
                solc_confidence += solc;
                vyper_confidence += vyper;
            }
        }

        // classify the compiler based on the confidence levels
        if solc_confidence != 0.0 && solc_confidence > vyper_confidence {
            compiler = Compiler::Solc;
        } else if vyper_confidence != 0.0 && vyper_confidence > solc_confidence {
            compiler = Compiler::Vyper;
        }

        // Previously known heuristic: check for cbor encoded compiler metadata
        // https://cbor.io
        if bytecode.contains_slice(&[0x73, 0x6f, 0x6c, 0x63, 0x43]) {
            let compiler_version = bytecode.split_by_slice(&[0x73, 0x6f, 0x6c, 0x63, 0x43]);

            if compiler_version.len() > 1 {
                if let Some(encoded_version) = compiler_version.get(1).and_then(|last| last.get(0..3)) {
                    version = encoded_version
                        .iter()
                        .map(|v| v.to_string())
                        .collect::<Vec<String>>()
                        .join(".");
                    compiler = Compiler::Solc;
                }

                trace!(
                    "exact compiler version match found due to cbor encoded metadata: {}",
                    version
                );
            }
        } else if bytecode.contains_slice(&[0x76, 0x79, 0x70, 0x65, 0x72, 0x83]) {
            let compiler_version = bytecode.split_by_slice(&[0x76, 0x79, 0x70, 0x65, 0x72, 0x83]);

            if compiler_version.len() > 1 {
                if let Some(encoded_version) = compiler_version.get(1).and_then(|last| last.get(0..3)) {
                    version = encoded_version
                        .iter()
                        .map(|v| v.to_string())
                        .collect::<Vec<String>>()
                        .join(".");
                    compiler = Compiler::Vyper;
                }

                trace!("exact compiler version match found due to cbor encoded metadata");
            }
        }

        debug!("detected compiler {compiler} {version}.");

        (compiler, version.trim_end_matches('.').to_string())
    }

Potential Applications

The ability to fingerprint the compiler used to generate a contract's bytecode has several potential applications, including:

  1. Vulnerability Scope Analysis: In July 2023, a critical vulnerability was discovered in the Vyper compiler which led to a series of exploits, affecting contracts compiled with Vyper versions 0.2.15, 0.2.16, and 0.3.0. A heuristic to identify contracts compiled with these versions may have helped to identify and mitigate the impact of the vulnerability sooner.

    Note: A bytecode-specific heuristic would be more effective than searching for all verified contracts as it would also be able to identify unverified contracts.

  2. Smart-Contract Analysis: When working with unverified contract bytecode, it can be useful to know which compiler was used to generate the bytecode. Tools such as heimdall's decompiler can use this information to provide more accurate decompilation results.

  3. Compiler Optimization and Development: Understanding the specific patterns left by different compilers can help in the optimization and development of new compilers. Developers can analyze these patterns to identify inefficiencies and areas for improvement, leading to more efficient compiler designs.
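To make the vulnerability-scope use case from the first item concrete, here is a minimal sketch of how a detected compiler and version could be checked against the affected Vyper releases named above. The function names are hypothetical, and the sketch assumes the version string is an exact "major.minor.patch" value such as one recovered from CBOR metadata; range strings produced by the prefix heuristics (for example "0.2.11-0.3.3") are deliberately treated as not matching.

```rust
/// Vyper releases named in the text as affected by the July 2023 vulnerability.
const AFFECTED_VYPER: [(u8, u8, u8); 3] = [(0, 2, 15), (0, 2, 16), (0, 3, 0)];

/// Parses an exact "major.minor.patch" string into a tuple. Returns None for
/// ranges such as "0.2.11-0.3.3", for "unknown", or for any other shape.
fn parse_version(version: &str) -> Option<(u8, u8, u8)> {
    let mut parts = version.split('.').map(|p| p.parse::<u8>().ok());
    // the first `?` handles a missing component, the second a non-numeric one
    let parsed = (parts.next()??, parts.next()??, parts.next()??);
    // reject trailing components such as "0.3.0.1"
    if parts.next().is_some() {
        return None;
    }
    Some(parsed)
}

/// Returns true only when the compiler is Vyper and the exact version is one
/// of the affected releases.
fn is_affected_vyper(compiler: &str, version: &str) -> bool {
    compiler == "vyper"
        && parse_version(version).map_or(false, |v| AFFECTED_VYPER.contains(&v))
}
```

Contracts whose version is only known as a range that overlaps the affected releases would need a separate "possibly affected" bucket, which this sketch leaves out.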

Future Work

While our current approach is able to classify contracts with a high degree of accuracy, there are several areas for future work:

  • Memory Layout Analysis: By analyzing the memory layout of contracts generated by different compilers, we may be able to identify additional patterns that can be used to fingerprint the compiler.
  • Machine Learning: While we opted not to take an AI/ML approach to this problem, it may be interesting to see how well a model could perform at classifying contracts based on their bytecode.
  • Additional Compilers: Our current analysis focused on Solidity and Vyper, but there are many other compilers, such as Huff, that generate EVM bytecode. By analyzing contracts generated by these compilers, we may be able to identify additional patterns that can be used to fingerprint the compiler.

Conclusion

In this paper, we've explored the problem of fingerprinting the compiler used to generate a contract's bytecode. By analyzing the bytecode of contracts generated by Solidity and Vyper, we were able to identify distinct patterns that can be used to fingerprint the compiler, and implemented a classification algorithm that can detect the compiler used to generate a given contract's bytecode with a high degree of accuracy. Our approach not only enhances our understanding of the compilation process but also provides a practical tool for smart contract analysis and security, and will be replacing heimdall's current classification algorithm which is used to improve decompilation accuracy. Future work can further refine these techniques as well as extend them to additional compilers, improving the robustness and applicability of compiler fingerprinting in the EVM ecosystem.
