October 19th 2023

What is Semgrep?

Semgrep is an open-source static analysis tool designed for identifying and preventing software vulnerabilities and code quality issues in source code. It is particularly well suited to application security work.

The name "Semgrep" is a fusion of "semantic" and "grep," signifying that Semgrep is a command-line utility for text searching that possesses an understanding of source code semantics.

Semgrep uses a pattern-oriented matching methodology to look for particular code patterns or criteria within the source code. These patterns are expressed using a unique syntax that mirrors the code you seek, facilitating custom rule creation.

The tool has an active community that contributes to its development and maintains a repository of community rules. These rules cover a wide range of programming languages and security concerns, making it easier for users to get started with meaningful checks. The registry on the Semgrep website also contains rule sets created by the Semgrep developers. For example, one rule set matches patterns of OWASP Top 10 vulnerabilities.

You can also create your own custom rules, contribute them to the Semgrep registry, and share them with others. So, let's take a closer look at the basics of Semgrep rules.

What are the basics of Semgrep rules?

Here is an example of a Semgrep rule that looks for multiplication with the asterisk operator in JavaScript code. Imagine we have agreed to use the Math namespace for all such operations.

// Prompt the user for a number
var userInput = prompt("Enter a number:");

// Convert the user input to a number
const number = parseFloat(userInput);

if (!isNaN(number)) {
  // Calculate the square
  var square = number * number;

  // Display the result
  console.log(`The square of ${number} is ${square}`);
} else {
  console.log("Invalid input. Please enter a valid number.");
}

Here is a Semgrep rule that looks for places where an asterisk is used instead of the Math namespace:

rules:
  - id: multiplication_rule
    pattern: $VAR1 * $VAR2;
    message: Use  instead of asterisks.
    languages:
      - javascript
    severity: INFO

Let's get to the bottom of it:

The Semgrep rules are written using the YAML syntax.
All required fields must be present at the top level of a rule, immediately under the “rules” key.
The required fields are:

The “id” key: a descriptive identifier of the rule.
The rule pattern: in our case, we use the “pattern” key, but there are other keys that you can use.
The “message” key: text that explains why Semgrep matched this pattern and how to remediate the issue.
The “languages” key: lists the programming languages this rule applies to.
The “severity” key: specifies how critical the issues that the rule detects are. It takes one of the following values: INFO (low severity), WARNING (medium severity), or ERROR (high severity).

I will run this rule in the built-in playground on the Semgrep website. The Semgrep Playground is an interactive code editor designed for crafting and experimenting with rule patterns applied to example code. You can use it to test rules quickly:

Fig. 1. Running the Semgrep rule in its playground

As we can see, after processing the code with the provided rule, Semgrep highlights the found piece of code where the pattern matched. Next, we must understand why it highlighted this line of code. For that, we need to analyze the rule pattern.

We are looking for the following code: $VAR1 * $VAR2; that is, two metavariables with the multiplication operator between them. You can choose the metavariable names yourself; the important thing is that a name starts with $ and is written in uppercase.
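For illustration, here is a sketch of how the sample could be rewritten so that it no longer contains the $VAR1 * $VAR2 shape. The article does not say which Math function the team agreed on, so Math.pow and the function name squareOfInput are assumptions made only for this example.

```javascript
// Squares a user-supplied string without using the "*" operator.
// Math.pow is an assumed choice; the article does not name a Math function.
function squareOfInput(userInput) {
  const number = parseFloat(userInput);
  if (Number.isNaN(number)) {
    return 'Invalid input. Please enter a valid number.';
  }
  const square = Math.pow(number, 2);
  return `The square of ${number} is ${square}`;
}

console.log(squareOfInput('4'));
console.log(squareOfInput('abc'));
```

The input is passed as a parameter instead of being read with prompt(), so the sketch also runs outside a browser.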

Semgrep rules: variables and operators

Now, let’s dive into the Semgrep rules' syntax and discover the most common syntax units.

Exact pattern matching. It searches the source code for a given exact pattern. For example, when you want to find the line of code that contains console.log("Hello, Semgrep!"), Semgrep looks through all the lines for an exact match.
Metavariables. They serve to identify and capture code elements when their specific values or contents are not known in advance, much like capture groups in regular expressions. This encompasses various code elements, such as variables, functions, arguments, classes, object methods, imports, exceptions, etc.
Ellipsis operator. This operator (...) is an abstraction that represents a sequence of zero or more items, such as arguments, statements, parameters, fields, and characters. Its most common use is to represent the arguments of a function or method in its definition or call. For example, when you have the function sample_function(arg1, arg2, arg3), you can use the ellipsis operator to match all of its arguments: sample_function(...).
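To make the "zero or more items" part concrete, here is a small sketch of call sites that all share the sample_function(...) shape. The function body is hypothetical and exists only so that the snippet runs.

```javascript
// Hypothetical function used only to illustrate call shapes.
function sample_function(...args) {
  return args.length;
}

// Each of these calls has the shape sample_function(...):
const noArgs = sample_function();            // zero arguments
const oneArg = sample_function('a');         // one argument
const threeArgs = sample_function(1, 2, 3);  // three arguments

console.log(noArgs, oneArg, threeArgs);
```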

Another common use of the operator is to match everything in the current code scope. It is especially useful when you want to capture the code within the scope of a pattern. For example, if we modify the previous multiplication rule as in Fig. 2, it also matches everything that follows the matched pattern inside the if statement:

Fig. 2. The use case of the ellipsis operator

The next thing we have to figure out is the rule pattern fields. So far we have used only the pattern operator; however, there are others that we can use in rules. Let’s enumerate them:

pattern: it looks for code matching its expression.
pattern-not: it serves as the inverse of the pattern operator. It identifies code that does not conform to its expression, which is valuable for filtering out typical false positives.
patterns: this operator executes a logical AND operation on a set of child patterns, making it valuable for linking multiple patterns.
pattern-either: this operator carries out a logical OR operation on a collection of child patterns, offering a convenient way to connect multiple patterns in a manner where any one of them can be true.
pattern-regex: locates substrings that match the provided PCRE pattern.
pattern-not-regex: the inverse of the pattern-regex. It employs a PCRE regular expression to refine and filter the results selectively.
pattern-inside: retains matched discoveries that are located within its defined expression. This is particularly handy for identifying code segments within other code constructs, such as functions.
pattern-not-inside: it is the opposite of pattern-inside and retains matched findings that are not found within its specified expression.

There are some advanced operators:

metavariable-regex: checks the value of a metavariable against a PCRE regular expression, which is valuable for refining results according to the values captured in metavariables. It requires the metavariable and regex keys.

Want to see how it works? Here is code that creates a new object of a Person class and calls some of its methods.

Fig. 3. The example of usage of metavariable-regex

In this rule, we define a pattern that looks for the person object calling one of its methods. Because we use a metavariable for the method name, the pattern alone matches any method call. The metavariable-regex operator then narrows down which methods qualify: the metavariable key names the metavariable from our pattern, and the regex key defines the regular expression that its value has to match. In our case, the rule finds calls on the person object to methods whose names start with print.
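The figure itself is not reproduced here, so the following is only a sketch of the kind of code such a rule would be run against. The Person class, its method names, and the /^print/ expression are hypothetical. The filter at the end is a plain-JavaScript stand-in for what the regex check does to the captured method name; it is not Semgrep itself.

```javascript
// Hypothetical Person class; the article's figure is not reproduced here.
class Person {
  constructor(name, age) {
    this.name = name;
    this.age = age;
  }
  printName() { return `Name: ${this.name}`; }
  printAge() { return `Age: ${this.age}`; }
  getName() { return this.name; }
}

const person = new Person('Alice', 30);
person.printName(); // method name starts with "print"
person.printAge();  // method name starts with "print"
person.getName();   // method name does not start with "print"

// Stand-in for the regex filter applied to the method metavariable.
const calledMethods = ['printName', 'printAge', 'getName'];
const flagged = calledMethods.filter((name) => /^print/.test(name));
console.log(flagged);
```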

We can also reference metavariables inside the message block; their values are then printed in the message (the “Matches” section in the previous figure). In short, use this to write meaningful instructions or to debug metavariable values and see what they match.

metavariable-pattern: this operator examines metavariables using a pattern formula, which makes it a valuable tool for refining results based on the value of a metavariable. To use it, you must provide the metavariable key and exactly one of the following keys: pattern, patterns, pattern-either, or pattern-regex. In this example, we will match an event listener that has an axios get request inside of it.

Fig. 4. The example of usage of metavariable-pattern

We narrow the findings in this pattern by filtering the $FOO metavariable values by the pattern axios.get(...).

How to Detect NoSQL Injection Issues using Semgrep

By now, you know the basic Semgrep techniques, so we can write something more advanced and practical. As an example, let's write a Semgrep rule that looks for potential NoSQL injection points. Suppose that we have implemented a function that takes data originating from user input and processes it by removing dangerous symbols that can cause a NoSQL injection.

First of all, let’s install Semgrep locally. I use a GNU/Linux distribution and install it with python3: python3 -m pip install semgrep.

To check that we successfully installed it, let’s check the version:

Fig. 5. Checking the version of Semgrep

Overview of the vulnerable application

In the GitHub repository, I wrote JavaScript code that simulates processing user data and querying MongoDB in a web application.

Let’s briefly analyze the source code. The application uses Express and Mongoose. It defines a middleware named sanitizeUserInput, which simulates code that takes the user input and processes it by removing symbols that are potentially dangerous in NoSQL syntax. I’ve deliberately added it to some routes and not to others.
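The article does not show the middleware itself, so here is a minimal sketch of what such a sanitizer could look like. Everything in it is an assumption: the helper name stripDangerousKeys, the choice to drop keys that start with "$" or contain ".", and the expectation that req.body, req.params and req.query are plain objects that can be reassigned.

```javascript
// Hypothetical sketch: the article's real sanitizeUserInput is not shown.
// Recursively drops object keys that start with "$" or contain ".",
// which is how MongoDB operators and nested paths are usually spelled.
function stripDangerousKeys(value) {
  if (Array.isArray(value)) {
    return value.map(stripDangerousKeys);
  }
  if (value !== null && typeof value === 'object') {
    const clean = {};
    for (const [key, inner] of Object.entries(value)) {
      if (key.startsWith('$') || key.includes('.')) continue;
      clean[key] = stripDangerousKeys(inner);
    }
    return clean;
  }
  return value;
}

// Express-style middleware signature: (req, res, next).
// Assumes the three request properties are plain, assignable objects.
function sanitizeUserInput(req, res, next) {
  for (const source of ['body', 'params', 'query']) {
    if (req[source]) {
      req[source] = stripDangerousKeys(req[source]);
    }
  }
  next();
}
```

With this sketch, a body such as { name: 'a', age: { $gt: 0 } } would reach the route handler as { name: 'a', age: {} }.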

Here are the routes of the application; some of them were added specifically to check that the matching patterns work as expected:

POST /users: creates a new user and saves it to the database. The user data comes from req.body.
GET /users: obtains all users from the database.
GET /users/by-age: gets the users by age. The user data comes from req.query.
GET /users/:userId: gets a specific user by ID. The data comes from req.params.
PUT /users/:userId: updates a specific user by ID. The data arrives from req.params and req.body.
DELETE /users/:userId: deletes a specific user by ID. The data comes from req.params.
GET /server-info: sends information about the server.
POST /echo: echoes back the user's input data. The user data comes from req.body.

Create Semgrep rule based on code review results

Based on the application architecture, the user data can come from req.body, req.params and req.query. So, we have to write a rule matching the route handlers that process the provided user’s data by querying the database. Let’s see how exactly this works:

For the ID of our rule, we can use, for instance, nosql_injection_prevention.
The language is JavaScript.
For the severity, we can specify WARNING (medium severity).
For the message, we can write: “Use sanitizeUserInput() middleware for sanitizing user input!”.

We have now defined all the required fields except the matching pattern. Here is what it looks like:

rules:
  - id: nosql_injection_prevention
    # here will be the pattern
    message: Use sanitizeUserInput() middleware for sanitizing user input!
    languages:
      - javascript
    severity: WARNING

We have to match the route handlers because the database querying happens inside of them and not somewhere else. For that, we will use pattern-inside, but because there will be other patterns that will narrow down the matching, we will use the patterns operator at the top:

rules:
  - id: nosql_injection_prevention
    patterns:
      - pattern-inside: app.$REQUEST('$PATH', (req, res) => {...})
    message: Use sanitizeUserInput() middleware for sanitizing user input!
    languages:
      - javascript
    severity: WARNING

At this stage, we are looking for route handlers without middleware calls before the request data is processed. If we run the rule, we will see that it works. Still, it also matches route handlers that do not interact with the database. You may be wondering: “So what? We can match all the routes.” However, when there is no interaction with the database, the data that comes from the user can contain such symbols without causing a NoSQL injection, so flagging those routes would only produce false positives. Therefore, we have to narrow the matches down by using other patterns.
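To see why this pattern-inside expression singles out routes without middleware, compare the two call shapes below. The sketch uses a tiny stand-in for an Express app that only records how many handlers each route received; the stand-in object, the placeholder middleware, and the handler bodies are hypothetical.

```javascript
// Minimal stand-in for an Express app (hypothetical; it only records routes).
const app = {
  routes: [],
  register(method, path, ...handlers) {
    this.routes.push({ method, path, handlerCount: handlers.length });
  },
  get(path, ...handlers) { this.register('get', path, ...handlers); },
  put(path, ...handlers) { this.register('put', path, ...handlers); },
};

// Placeholder middleware standing in for sanitizeUserInput.
const sanitizeUserInput = (req, res, next) => next();

// Two arguments (path, handler): the same shape as the pattern-inside expression.
app.get('/server-info', (req, res) => {
  res.json({ status: 'ok' });
});

// Three arguments: the middleware sits between the path and the handler,
// so this call does not have the two-argument shape of the pattern.
app.put('/users/:userId', sanitizeUserInput, (req, res) => {
  res.json({ updated: req.params.userId });
});

console.log(app.routes);
```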

The next step is to write the pattern that will match the call of the Mongoose method:

rules:
  - id: nosql_injection_prevention
    patterns:
      - pattern-inside: app.$REQUEST('$PATH', (req, res) => {...})
      - pattern: $MODEL.$METHOD(...)
    message: Use sanitizeUserInput() middleware for sanitizing user input!
    languages:
      - javascript
    severity: WARNING

However, this one matches any method call inside the route handler. So, once again, we have to narrow the pattern down.

To refine the matches, we can use a metavariable-regex pattern that specifies the names of the Mongoose methods:

rules:
  - id: nosql_injection_prevention
    patterns:
      - pattern-inside: app.$REQUEST('$PATH', (req, res) => {...})
      - pattern: $MODEL.$METHOD(...)
      - metavariable-regex:
          metavariable: $METHOD
          regex: save|find|findById|findByIdAndUpdate|findByIdAndRemove
    message: Use sanitizeUserInput() middleware for sanitizing user input!
    languages:
      - javascript
    severity: WARNING

For this example, I specified the ones used in our example code, but you can enumerate all the methods you need.
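When you enumerate method names in an alternation like this, it helps to check how the expression behaves on names you did not list. The sketch below does that with plain JavaScript regular expressions; it is not Semgrep, and the explicit ^ and $ anchors are my own additions, since how Semgrep anchors the expression may depend on its version. The candidate names are hypothetical examples of calls that can appear in a route handler.

```javascript
// Same alternation as in the rule, with explicit anchors added for the demo.
const names = 'save|find|findById|findByIdAndUpdate|findByIdAndRemove';
const prefixRegex = new RegExp(`^(?:${names})`);  // anchored at the start only
const exactRegex = new RegExp(`^(?:${names})$`);  // anchored at both ends

// Method names that might be called inside a handler (hypothetical list).
const candidates = ['save', 'find', 'findOne', 'findById', 'json', 'send'];

const prefixMatches = candidates.filter((name) => prefixRegex.test(name));
const exactMatches = candidates.filter((name) => exactRegex.test(name));

console.log(prefixMatches);
console.log(exactMatches);
```

In this sketch, the start-anchored version also accepts findOne because it begins with find, while res.json and res.send style calls are rejected by both versions.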

Now the matches are more accurate, but what if the database query contains no user input, only constant values? We have to take this into account as well, so we must match only code that works with user input. As mentioned before, in our case the user input can come from req.body, req.params and req.query. These values can be passed directly into the querying method or be preprocessed first, which we must also consider. With that in mind, here is the modified rule:

rules:
  - id: nosql_injection_prevention
    patterns:
      - pattern-inside: |
          app.$REQUEST('$PATH', (req, res) => {
              ...
              req.$DATA
              ...
          })
      - metavariable-regex:
          metavariable: $DATA
          regex: body|params|query
      - pattern: $MODEL.$METHOD(...)
      - metavariable-regex:
          metavariable: $METHOD
          regex: save|find|findById|findByIdAndUpdate|findByIdAndRemove
    message: |
      Use sanitizeUserInput() middleware for sanitizing user input in $PATH route!
    languages:
      - javascript
    severity: WARNING

I’ve changed the pattern-inside section. Now Semgrep looks for the main pattern that we defined in the last two steps somewhere inside an application route handler that also mentions req.body, req.params or req.query. I’ve also modified the message field so that the path of the unsanitized route is displayed there.

Reviewing rule findings

At last, we can run the Semgrep rule and check how it works:

Fig. 6. Running the final Semgrep rule.

Now we can analyze the results:

POST /users: captured by Semgrep because there is no sanitizeUserInput middleware and req.body is processed.
GET /users: not captured; there is no sanitizeUserInput middleware and the database is called, but no user input is involved.
GET /users/by-age: captured by Semgrep because there is no sanitizeUserInput middleware and the database is queried with user input. Note that the parseInt() function acts as extra protection here and does not allow bad input to reach the query.
GET /users/:userId: captured by Semgrep because there is no sanitizeUserInput middleware and the database is queried with user input.
PUT /users/:userId: not captured; the sanitizeUserInput middleware is present, so the database is queried with sanitized user data.
DELETE /users/:userId: captured by Semgrep because there is no sanitizeUserInput middleware and the database is queried with user input.
GET /server-info: not captured; there is no sanitizeUserInput middleware, but there is no database querying either.
POST /echo: not captured; there is no sanitizeUserInput middleware and user data is processed, but the database is not queried.
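The remark about parseInt() in the by-age route can be illustrated with a short sketch. It assumes that the query string is parsed into nested objects, so that a request like ?age[$gt]=0 arrives as an object instead of a string; the helper toAgeFilter and both sample values are hypothetical.

```javascript
// Assumed input shapes: an operator object from ?age[$gt]=0 and a plain string.
const maliciousAge = { $gt: '0' };
const honestAge = '42';

// Builds a query filter only when the raw value parses to a number.
function toAgeFilter(rawAge) {
  const age = parseInt(rawAge, 10);
  if (Number.isNaN(age)) {
    return null; // operator objects and other non-numeric input end up here
  }
  return { age };
}

console.log(toAgeFilter(maliciousAge));
console.log(toAgeFilter(honestAge));
```

The rule still reports this route because it matches on code shape only and does not know what parseInt() does to the value; findings like this one are what a human reviewer would triage.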

Summing up

So what's the point? Imagine your application consists of hundreds of constantly changing routes. How much time would you spend checking all of them by hand? This was only one rule, but there can be many of them, each controlling some aspect of the quality and security of your code. That is the kind of work Semgrep can take over.

This post first appeared on TechMagic; please read the original post there.