...And you will know me by the trail of bits
Can you pass The Rekt Test?
Audits from Trail of Bits give organizations ways to fix their current issues and provide them with a roadmap to prevent the same security problems from occurring in the future. However, product teams don’t always implement the recommendations. This may be due to a lack of resources, a lack of institutional motivation, or several other internal reasons. Even when action is taken, progress can be difficult to measure.
We now have a way organizations can measure that progress. The Rekt Test provides a framework for an informed conversation that, if moderated by experts, can allow an organization to get a fuller picture of the risk presented by a blockchain project. Modeled after The Joel Test, the Rekt Test asks an organization to answer ‘Yes’ to as many of the questions below as possible before going public with a new blockchain venture. The more questions an organization can answer ‘Yes’ to, the more stable its security foundation should be.
The Rekt Test
- Do you have all actors, their roles, and privileges documented?
- Do you employ a multi-factor human and physical key management system?
- Do you have a written and tested incident response plan?
- Do you perform identity verification and background checks on all employees?
- Do you have a team member that has security defined in their role?
- Do you require hardware security keys for production systems?
- Do you use the best automated tools for discovering security issues in your code?
- Do you define key invariants for your system and test them on every commit?
- Do you undergo external audits and maintain a vulnerability disclosure or bug bounty program?
- Do you keep documentation of all the external services, contracts, and oracles you rely on?
- Do you document the best ways to attack your own system?
- Do you identify and address potential avenues for users who abuse your system?
1. Do you have all actors, their roles, and privileges documented? It’s crucial to understand who is operating in your environment — users, developers, miners, validators, and other stakeholders — and the breakdown of their unique responsibilities, tasks, permissions, and other duties that may be performed. This should be documented in a clear and concise manner, so an organization can easily identify when a security incident is happening.

2. Do you employ a multi-factor human and physical key management system? The keys to your wallets are always going to have a target on them. For protection, we recommend implementing a two-factor authentication system for users, using both physical and digital backups, and applying a multi-signature system for transactions. For multi-signatures, we suggest a consensus requirement for a sensitive action, like spending funds out of a wallet. This mitigates the risks of any single party’s judgment overruling others, any single party’s mistake causing a failure, and any single credential compromise causing a failure.

3. Do you have a written and tested incident response plan? While security in the blockchain space is different from that of traditional enterprises, an incident response plan is a proven practice that will enable your organization to be resilient in the face of a security incident. The plan should include steps to identify, contain, and remediate the incident, through both automated and manual procedures. All team members should be familiar with the plan, including both internal and external communication channels. The plan should be regularly tested to ensure that it is up to date and effective, especially given how quickly the blockchain security world can change. As an example of what this traditionally looks like, Carnegie Mellon University has posted its plan online.

4. Do you perform identity verification and background checks on all employees? This is where your human resources officers can help with your security posture. While organizations should be conducting the standard checks — criminal background, employment history, credit check, reference check — a social media background check can unearth behavior that could be problematic from a security perspective. A little OSINT can go a long way.

5. Do you have a team member that has security defined in their role? Answering “yes” to this isn’t as simple as hiring a CISO. (Although that’s better than nothing!) Whether it’s developing something new or updating something that already exists, security cannot be a side gig tacked onto someone’s job responsibilities. There needs to be a dedicated person in the organization who can take the lead on making sure a product or service is as secure as possible. This person should also be deeply involved in getting the organization to “yes” on the other questions in this list, particularly incident response plans and using the best tools currently on the market.

6. Do you require hardware security keys for production systems? Let’s not mince words: if your internal credentials are leaked or stolen, the fallout will be harsh. Using SMS codes for two-factor authentication is not good enough for the credentials guarding your code. There are several options that support a wide range of authentication protocols and will make your business-critical code repositories highly resistant to breach attempts. YubiKey and Google Titan are two good choices for hardware keys.

7. Do you use the best automated tools for discovering security issues in your code? If organizations are going to be constantly testing their security, they are going to need tools that can keep up with the industry’s best practices. Trail of Bits has a wide suite of tools and documentation that will allow you to stay current in your security posture (a short scripting sketch follows this list):
- Echidna, a smart contract fuzzer
- Slither, a static analyzer for Solidity
- Circomspect, a static analyzer and linter for the Circom programming language
- Amarna, a static analyzer and linter for the Cairo programming language
- Building Secure Contracts, guidelines and best practices to write secure smart contracts
- ZKDocs, comprehensive, detailed, and interactive documentation on zero-knowledge proof systems and related primitives.
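As referenced above, these tools can be scripted as well as run from the command line. Here is a minimal sketch of driving Slither from Python to enumerate externally reachable functions for review; the target path and the follow-up checks are placeholders, a compatible solc must be available (for example via solc-select), and the exact API surface can vary between Slither versions.

# Minimal sketch: enumerate externally reachable functions with Slither's Python API.
# The target path is a placeholder; a compatible solc must be installed (e.g., via solc-select).
from slither.slither import Slither

slither = Slither("contracts/Vault.sol")  # hypothetical target contract

for contract in slither.contracts:
    print(f"Contract: {contract.name}")
    for function in contract.functions:
        # Externally reachable functions deserve the closest review.
        if function.visibility in ("public", "external"):
            print(f"  reachable: {function.full_name}")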
8. Do you define key invariants for your system and test them on every commit? For those unfamiliar, an invariant is a condition or value that remains unchanged throughout a program’s execution. It is used to ensure that the program behaves as expected and that the results are consistent. As you develop new features, your organization should be testing whether the new code upholds these invariants. These could be tied to things like cryptographic hash functions, consensus mechanisms, transaction validation, transaction privacy, or other network security functions. By continually testing, you are constantly given the opportunity to find issues in your system before an attacker does.
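To make this concrete, here is a minimal sketch of an invariant (“total supply never changes on transfer”) expressed as a Python property-based test with Hypothesis. SimpleLedger is a hypothetical stand-in for on-chain accounting logic; in practice you would express the same property in Solidity, hand it to a fuzzer like Echidna, and wire it into CI so it runs on every commit.

# Minimal sketch: an invariant expressed as a property-based test.
# SimpleLedger is a hypothetical stand-in for on-chain accounting logic.
from hypothesis import given, strategies as st

class SimpleLedger:
    def __init__(self, balances):
        self.balances = dict(balances)

    def transfer(self, src, dst, amount):
        if self.balances.get(src, 0) < amount:
            raise ValueError("insufficient balance")
        self.balances[src] -= amount
        self.balances[dst] = self.balances.get(dst, 0) + amount

@given(amount=st.integers(min_value=0, max_value=2_000))
def test_total_supply_is_invariant(amount):
    ledger = SimpleLedger({"alice": 1_000, "bob": 1_000})
    before = sum(ledger.balances.values())
    try:
        ledger.transfer("alice", "bob", amount)
    except ValueError:
        pass  # a rejected transfer is fine; the invariant must still hold
    assert sum(ledger.balances.values()) == before  # total supply is unchanged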
9. Do you undergo external audits and maintain a vulnerability disclosure or bug bounty program? Getting an external audit will positively compound your security posture, both internally and externally. An audit will allow experts with a fresh perspective to search for and fix vulnerabilities that an internal security practice may not be aware of or have the expertise to uncover. Whether the audit covers the actual smart contract or the components supporting the contract, it will also provide assurances to the public that your organization is taking the necessary steps to independently verify that your security is as good as it can possibly be. A vulnerability disclosure or bug bounty program is an additional way to enhance your security posture. Leading blockchain and web3 companies have set up these programs as a way to tap into the public’s security expertise and address potential issues before they become a problem. Some examples are:
- Uniswap has a bug bounty program that incentivizes responsible bug disclosure and offers rewards of up to 2,250,000 USDC.
- Primitive has a program with ImmuneFi that rewards up to $1 million for vulnerabilities that impact loss of user funds or governance funds.
- MakerDAO has a $10 million program with ImmuneFi that has a large library of smart contract code for researchers to look over.
10. Do you keep documentation of all the external services, contracts, and oracles you rely on? Incidents at the third parties you depend on can quickly become your own incidents:
- Gemini customers were the target of phishing campaigns that were the result of an incident at a third-party vendor.
- Cryptocurrency hardware wallet provider Trezor saw its users targeted with phishing attempts after a security incident at MailChimp.
- Tens of millions of dollars’ worth of assets have been taken in incidents where attackers targeted data oracles.
12. Do you identify and address potential avenues for users who abuse your system? Mitigations to consider include the following (a small on-chain monitoring sketch follows this list):
- Implement a KYC process to verify user identity
- Consider a reputation system to track user behavior
- Establish escrow accounts to ensure funds are held securely in the event of a security incident
- Monitor relevant chains for suspicious activity and shut down users in the event of bad behavior
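To illustrate the last recommendation, here is a deliberately simplified monitoring sketch that polls new blocks and flags unusually large ETH transfers. The RPC endpoint, threshold, and alert handling are placeholders, and the calls assume the web3.py v6 API; a production monitor would watch contract events and many more signals than transfer size.

# Minimal sketch: poll new blocks and flag unusually large ETH transfers.
# RPC URL, threshold, and alerting are placeholders; assumes web3.py v6.
import time
from web3 import Web3

w3 = Web3(Web3.HTTPProvider("https://example-rpc.invalid"))  # placeholder endpoint
LARGE_TRANSFER_ETH = 500  # arbitrary example threshold

def scan_block(block_number: int) -> None:
    block = w3.eth.get_block(block_number, full_transactions=True)
    for tx in block.transactions:
        value_eth = Web3.from_wei(tx["value"], "ether")
        if value_eth >= LARGE_TRANSFER_ETH:
            # Replace with real alerting (pager, chat webhook, case ticket).
            print(f"large transfer: {value_eth} ETH from {tx['from']} to {tx['to']}")

last_seen = w3.eth.block_number
while True:
    head = w3.eth.block_number
    for number in range(last_seen + 1, head + 1):
        scan_block(number)
    last_seen = head
    time.sleep(12)  # roughly one Ethereum block interval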
Codex (and GPT-4) can’t beat humans on smart contract audits
By Artem Dinaburg, Chief Technology Officer; Josselin Feist, Principal Engineer; and Riccardo Schirone, Security Engineer
Is artificial intelligence (AI) capable of powering software security audits? Over the last four months, we piloted a project called Toucan to find out. Toucan was intended to integrate OpenAI’s Codex into our Solidity auditing workflow. This experiment went far beyond writing “where is the bug?” in a prompt and expecting sound and complete results.
Our multi-functional team, consisting of auditors, developers, and machine learning (ML) experts, put serious work into prompt engineering and developed a custom prompting framework that worked around some frustrations and limitations of current large language model (LLM) tooling, such as working with incorrect and inconsistent results, handling rate limits, and creating complex, templated chains of prompts. At every step, we evaluated how effective Toucan was and whether it would make our auditors more productive or slow them down with false positives.
The technology is not yet ready for security audits for three main reasons:
- The models are not able to reason well about certain higher-level concepts, such as ownership of contracts, re-entrancy, and fee distribution.
- The software ecosystem around integrating large language models with traditional software is too crude, and everything about it is cumbersome; there are virtually no developer-oriented tools, libraries, or type systems that work with uncertainty (see the sketch after this list).
- There is a lack of development and debugging tools for prompt creation. To develop the libraries, language features, and tooling that will integrate core LLM technologies with traditional software, far more resources will be required.
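To make the second point concrete, here is a minimal sketch (not Toucan code) of the kind of typed “uncertain value” wrapper we kept wishing existed off the shelf. The Uncertain class, its confidence field, and the 0.9 threshold are illustrative assumptions, not an existing library API.

# Minimal sketch: a typed wrapper for probabilistic answers from an LLM.
# Uncertain, confidence, and the threshold below are illustrative assumptions.
from dataclasses import dataclass
from typing import Generic, TypeVar

T = TypeVar("T")

@dataclass
class Uncertain(Generic[T]):
    value: T
    confidence: float  # 0.0 to 1.0, e.g., derived from token logprobs

    def expect(self, threshold: float) -> T:
        # Return the value only if the model's confidence clears the threshold.
        if self.confidence < threshold:
            raise ValueError(f"{self.value!r} only has confidence {self.confidence:.2f}")
        return self.value

# An LLM-backed detector would return Uncertain[bool] rather than a bare bool,
# forcing callers to decide how much confidence they require.
is_reentrant = Uncertain(value=True, confidence=0.62)
try:
    verdict = is_reentrant.expect(0.9)
except ValueError as err:
    print(f"needs human review: {err}")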
Figure 1: True positives (green) and false positives (red) found by Slither, Toucan, and both on some simple re-entrancy tests. The Toucan results are not encouraging.

We tried looking at the true/false positive sets of Slither and Toucan, and found that each tool had a different set of false positives/false negatives, with some overlap (Figure 1). Codex was not able to effectively reduce the false positive rate from a prototype Slither integer overflow detector. Overall, we noticed a tendency to reply affirmatively to our questions, increasing the number of positives discovered by Toucan. Codex can perform basic static analysis tasks, but the rate of failure is too high to be useful and too difficult to characterize. This capability to perform successful analysis, even on short program fragments, is very impressive and should not be discounted! For languages that Codex understands but for which no suitable tooling exists, this capability could be extremely valuable—after all, some analysis could be much better than nothing. But the benchmark for Solidity is not nothing; we already have existing static analysis tooling that works very well.

How we framed our framework
During Toucan’s development, we created a custom prompting framework, a web-based front end, and rudimentary debugging and testing tools to evaluate prompts and to aid in unit and integration tests. The most important of these was the prompting framework.

Prompting framework
If we were making Toucan today, we’d probably just use LangChain. But at the time, LangChain did not have the features we needed. Frustratingly, neither OpenAI nor Microsoft offered an official, first-party prompting framework. This led us to develop a custom framework, with the goal that it should be possible for auditors to create new prompts without ever modifying Toucan’s code.

requires = ["emit-ownership-doc", "emit-target-contract"]
name = "Contract Ownership"
scope = "contract"
instantiation_condition = "any('admin' in s.name.lower() or 'owner' in s.name.lower() for s in contract.state_variables)"

[[questions]]
name = "can-change"
query = "Is it possible to change the `{{ contract | owner_variable }}` variable by calling a function in the `{{ contract.name }}` contract without aborting the transaction? Think through it step by step, and answer as 'Yes', 'No', or 'Unknown'. If 'Yes', please specify the function."
is_decision = true

[[questions]]
name = "who-can-call"
runtime_condition = "questions['can-change'].is_affirmative()"
query = """To reason about ownership:
1) First, carefully consider the code of the function
2) Second, reason step by step about the question.
Who can call the function successfully, that is, without aborting or reverting the transaction?"""
answer_start = """1) First, carefully consider the code of the function:"""

[[questions]]
name = "can-non-owner-call"
runtime_condition = "questions['can-change'].is_affirmative()"
query = "Can any sender who is not the current owner call the function without reverting or aborting?"
is_decision = true
finding_condition = "question.is_affirmative()"

Figure 2: Sample question chain asking about contract ownership. Before questions are emitted, the prompting framework also emits a specific explanation of what ownership means, with examples and information about the target contract.

Our framework supported chaining multiple questions together to support Chain of Thought and similar prompting techniques (Figure 2). Since GPT models like Codex are multi-shot learners, our framework also supported adding background information and examples before forming a prompt. The framework also supported filtering on a per-question basis, as some questions are relevant only to specific kinds of contracts (say, only ERC-20 tokens), and other questions may have a specific scope (e.g., a contract, function, or file scope). Finally, each question could be optionally routed to a different model.
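As a rough illustration of how a configuration-driven chain like the one above can be evaluated, here is a simplified sketch. It is not Toucan’s actual implementation: ask_model() returns a canned answer, the owner_variable filter and the contract metadata are stand-ins for data a real run would derive from static analysis, and only the first question of the chain is shown.

# Simplified sketch of evaluating a configuration-driven question chain.
# Not Toucan's actual code: ask_model(), the owner_variable filter, and the
# contract metadata are stand-ins. Requires Python 3.11+ (tomllib) and Jinja2.
import tomllib
from jinja2 import Environment

CHAIN_TOML = """
name = "Contract Ownership"
scope = "contract"

[[questions]]
name = "can-change"
query = "Is it possible to change the `{{ contract | owner_variable }}` variable by calling a function in the `{{ contract.name }}` contract without aborting the transaction? Answer 'Yes', 'No', or 'Unknown'."
is_decision = true
"""

def ask_model(prompt: str) -> str:
    # Stand-in for a Codex/GPT completion call; returns a canned answer.
    return "Yes, the transferOwnership() function can change it."

def is_affirmative(answer: str) -> bool:
    return answer.strip().lower().startswith("yes")

env = Environment()
# Hypothetical filter naming the owner-like state variable of the target contract.
env.filters["owner_variable"] = lambda contract: contract["owner_variable"]

# Stand-in for target-contract metadata that would come from static analysis.
contract = {"name": "Vault", "owner_variable": "owner"}

answers = {}
for question in tomllib.loads(CHAIN_TOML)["questions"]:
    prompt = env.from_string(question["query"]).render(contract=contract)
    answers[question["name"]] = ask_model(prompt)
    # Decision questions gate the rest of the chain; this is a simplified
    # version of the runtime_condition checks in the full configuration.
    if question.get("is_decision") and not is_affirmative(answers[question["name"]]):
        break

print(answers)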
The prompting framework also went to great lengths to abide by OpenAI’s API limitations, including batching questions into one API invocation and keeping track of both the token count and API invocation rate limits. We hit these limits often and were very thankful the Codex model was free while in beta.

Test data
One of our development goals was that we would never compromise customer data by sending it to an OpenAI API endpoint. We had a strict policy of running Toucan only against open-source projects on GitHub (which would already have been indexed by Codex) with published reports, like those on our Publications page. We were also able to use the rather extensive test set that comes with Slither, and our Building Secure Contracts reference materials as additional test data. It is important to note that some of these tests and reference materials may have been a part of the Codex training set, which explains why we saw very good results on smaller test cases.

The missing tools
The lack of tooling from both OpenAI and Microsoft has been extremely disappointing, although that looks to be changing: Microsoft has a prompting library, and OpenAI recently released OpenAI Evals. The kinds of tools we’d have loved to see include a prompt debugger; a tree-graph visualization of tokens in prompts and responses with logprobs of each token; tools for testing prompts against massive data sets to evaluate quality; ways to ask the same question and combine results from counterexamples; and some plugins to common unit testing frameworks. Surely someone is thinking of the developers and making these tools?

Current programming languages lack the facilities for interfacing with neural architecture computers like LLMs or similar models. A core issue is the lack of capability to work with nondeterminism and uncertainty. When using LLMs, every answer has some built-in uncertainty: the outputs are inherently probabilistic, not discrete quantities. This uncertainty should be handled at the type system level so that one does not have to explicitly deal with probabilities until it is necessary. A pioneering project from Microsoft Research called Infer.NET does this for .NET-based languages, but there seem to be few concrete examples and no real tooling to combine this with LLMs.

Prompt engineering, and surrounding tooling, are still in their infancy. The biggest problem is that you never know when you are done: even now, it is always possible that we were just one or two prompts away from making Toucan a success. But at some point, you have to give up in the face of costs and schedules. With this in mind, the $300K salary for a fantastic prompt engineer does not seem absurd: if the only difference between a successful LLM deployment and a failure is a few prompts, the job quickly pays for itself. Fundamentally, though, this reflects a lack of tooling to assess prompt quality and evaluate responses. There is no particularly good way to determine if one prompt is better than another or if you’re on the right track. Similarly, when a prompt fails against an input, it is frustratingly difficult to figure out why and to determine, programmatically, which prompts are merely returning the wrong result versus completely hallucinating and misbehaving. Unit tests are also problematic; the results are not guaranteed to be the same across runs, and newer models may not provide the same results as prior ones. There is certainly a solution here, but again, the tooling developers expect just wasn’t present. OpenAI Evals is likely going to improve this situation.

Overall, the tooling ecosystem is lacking, and surprisingly, the biggest names in the field have not released anything substantial to improve the adoption and integration of LLMs into real software projects that people use. However, we are excited that the open-source community is stepping up with really cool projects like LangChain and LlamaIndex.

Humans still reign supreme
OpenAI’s Codex is not yet ready to take over the job of software security auditors. It lacks the ability to reason about the proper concepts and produces too many false positives for practical usage in audit tasks. However, there is clearly a nascent capability to perform interesting analysis tasks, and the underlying models should quickly get more capable. We are very excited to keep using the technology as it improves. For example, the new larger context window with GPT-4 may allow us to provide enough context and direction to handle complex tasks.

Even though Codex (and GPT-4) do not currently match mature algorithmic tools, LLM-based tools—even those of lower quality—may have interesting uses. For languages for which no analysis tooling exists, developers can bootstrap something from LLMs relatively quickly. The ability to provide some reasonable analysis where none previously existed may be considerably better than nothing at all.

We hope the ability to integrate language models into existing programs improves quickly, as there is currently a severe lack of languages, libraries, type systems, and other tooling for the integration of LLMs into traditional software. Disappointingly, the main organizations releasing LLMs have not released much tooling to enable their use. Thankfully, open-source projects are filling the gap. There is still enormous work to be done, and whoever can make a wonderful developer experience working with LLMs stands to capture developer mindshare.

LLM capability is rapidly improving, and if it continues to do so, the next generation of LLMs may serve as capable assistants to security auditors. Before developing Toucan, we used Codex to take an internal blockchain assessment occasionally used in hiring. It didn’t pass—but if it were a candidate, we’d ask it to take some time to develop its skills and return in a few months. It did return—we had GPT-4 take the same assessment—and it still didn’t pass, although it did better. Perhaps the large context window version with proper prompting could pass our assessment. We’re very eager to find out!