I made my own PyPI repository with authorization and S3. On Nginx

In this article, I want to share my experience with NJS, a JavaScript interpreter for Nginx developed by Nginx Inc, and describe its main features using a real example. NJS is a subset of JavaScript that allows you to extend the functionality of Nginx. The question "why your own interpreter???" was answered in detail by Dmitry Volyntsev. In short: NJS is the nginx-way, and JavaScript is more progressive, "native" and without a GC, unlike Lua.

A long time ago…

At my last job, I inherited GitLab with a number of motley CI/CD pipelines using docker-compose, dind and other delights, which were being moved onto kaniko rails. The images previously used in CI were carried over in their original form. They worked properly until the day our GitLab changed its IP and CI turned into a pumpkin. The problem was that one of the Docker images involved in CI contained git, which pulled Python modules over ssh. ssh needs a private key, and... it was in the image along with known_hosts. So every CI run ended with a key verification error because the real IP did not match the one in known_hosts. A new image was quickly assembled from the existing Dockerfiles, and the StrictHostKeyChecking no option was added. But the unpleasant aftertaste remained, and there was a desire to move the packages to a private PyPI repository. An additional bonus of switching to a private PyPI was a simpler pipeline and a normal requirements.txt.

The choice is made, gentlemen!

We run everything in the cloud and Kubernetes, so in the end we wanted a small service: a stateless container with external storage. Well, since we already use S3, it got priority. And, if possible, with authentication against GitLab (you can add it yourself if necessary).

A quick search yielded several results: s3pypi, pypicloud, and the option of "manually" creating HTML files for the repo. The last option was dismissed right away.

s3pypi: a CLI for S3-based hosting. We upload the files, generate the HTML and upload it to the same bucket. Suitable for home use.

pypicloud: Seemed like an interesting project, but after reading the docs I was disappointed. Despite the good documentation and the possibility of extending it for your own tasks, in practice it turned out to be redundant and difficult to configure. By my estimates at the time, adapting the code to our tasks would have taken 3-5 days. The service also needs a database. We kept it in reserve in case we found nothing else.

A more in-depth search turned up a module for Nginx, ngx_aws_auth. Testing it produced XML in the browser showing the contents of the S3 bucket. The last commit, at the time of the search, was a year old. The repository looked abandoned.

Turning to the primary source and reading PEP 503, I realized that the XML can be converted to HTML on the fly and given to pip. A little more googling on Nginx and S3 turned up an example of S3 authentication written in JS for Nginx. That is how I got to know NJS.

Taking this example as a basis, an hour later I observed the same XML in my browser as when using the ngx_aws_auth module, but everything was already written in JS.

I really liked the Nginx-based solution. Firstly, good documentation and many examples; secondly, we get all the Nginx goodies for working with files out of the box; thirdly, anyone who can write Nginx configs will be able to figure out what is what. Minimalism is also a plus for me, compared to Python or Go (if written from scratch), not to mention Nexus.

TL;DR After 2 days, the test version of PyPI was already in use in CI.

How does it work?

The ngx_http_js_module module is loaded in Nginx and is included in the official Docker image. We import our script into the Nginx config with the js_import directive. A function is invoked with the js_content directive. Variables are set with the js_set directive, which takes only a function described in the script as its argument. But we can execute subrequests in NJS only by means of Nginx, no XMLHttpRequest for you. To do this, the appropriate location must be added to the Nginx configuration, and a subrequest to this location must be described in the script. To be able to access a function from the Nginx config, its name must be exported in the script itself with export default.

nginx.conf

load_module modules/ngx_http_js_module.so;
http {
  js_import   imported_name  from script.js;

server {
  listen 8080;
  ...
  location = /sub-query {
    internal;

    proxy_pass http://upstream;
  }

  location / {
    js_content imported_name.request;
  }
}

script.js

function request(r) {
  function call_back(resp) {
    // handler's code
    r.return(resp.status, resp.responseBody);
  }

  r.subrequest('/sub-query', { method: r.method }, call_back);
}

export default {request}

When we request http://localhost:8080/ in a browser, we land in location /, where the js_content directive calls the request function described in our script script.js. In turn, the request function makes a subrequest to location = /sub-query, with the method (GET in the current example) taken from the argument (r) that is implicitly passed when this function is called. The subrequest response is handled in the call_back function.

Trying S3

To make a request to a private S3 storage, we need:

ACCESS_KEY

SECRET_KEY

S3_BUCKET

From the HTTP method used, the current date/time, the bucket name (S3_BUCKET) and the URI, a string of a certain form is generated and signed (HMAC-SHA1) with SECRET_KEY. The resulting string, of the form AWS $ACCESS_KEY:$HASH, can be used in the Authorization header. The same date/time that was used to generate the string in the previous step must be added to the X-amz-date header. In code it looks like this:

nginx.conf

load_module modules/ngx_http_js_module.so;
http {
  js_import   s3      from     s3.js;

  js_set      $s3_datetime     s3.date_now;
  js_set      $s3_auth         s3.s3_sign;

server {
  listen 8080;
  ...
  location ~* /s3-query/(?<s3_path>.*) {
    internal;

    proxy_set_header    X-amz-date     $s3_datetime;
    proxy_set_header    Authorization  $s3_auth;

    proxy_pass          $s3_endpoint/$s3_path;
  }

  location ~ "^/(?<prefix>[w-]*)[/]?(?<postfix>[w-.]*)$" {
    js_content s3.request;
  }
}

s3.js (an example of AWS Signature v2 authorization; the scheme has since been moved to deprecated status)

var crypt = require('crypto');

var s3_bucket = process.env.S3_BUCKET;
var s3_access_key = process.env.S3_ACCESS_KEY;
var s3_secret_key = process.env.S3_SECRET_KEY;
var _datetime = new Date().toISOString().replace(/[:\-]|\.\d{3}/g, '');

function date_now() {
  return _datetime
}

function s3_sign(r) {
  var s2s = r.method + '\n\n\n\n';

  s2s += `x-amz-date:${date_now()}\n`;
  s2s += '/' + s3_bucket;
  s2s += r.uri.endsWith('/') ? '/' : r.variables.s3_path;

  return `AWS ${s3_access_key}:${crypt.createHmac('sha1', s3_secret_key).update(s2s).digest('base64')}`;
}

function request(r) {
  var v = r.variables;

  function call_back(resp) {
    r.return(resp.status, resp.responseBody);
  }

  var _subrequest_uri = r.uri;
  if (r.uri === '/') {
    // root
    _subrequest_uri = '/?delimiter=/';

  } else if (v.prefix !== '' && v.postfix === '') {
    // directory
    var slash = v.prefix.endsWith('/') ? '' : '/';
    _subrequest_uri = '/?prefix=' + v.prefix + slash;
  }

  r.subrequest(`/s3-query${_subrequest_uri}`, { method: r.method }, call_back);
}

export default {request, s3_sign, date_now}

A little explanation about _subrequest_uri: depending on the original URI, this variable forms the request to S3. If you need the contents of the "root", you form a URI with a delimiter, which returns a list of CommonPrefixes XML elements corresponding to directories (in the case of PyPI, the list of all packages). If you need the contents of a specific directory (the list of all versions of a package), the URI must contain a prefix field with the name of the directory (package), and it must end with a slash /. Otherwise collisions are possible when requesting the contents of a directory. For example, there are aiohttp-request and aiohttp-requests directories; if the request specifies /?prefix=aiohttp-request, the response will contain the contents of both directories, whereas with a trailing slash, /?prefix=aiohttp-request/, only the desired directory is returned. And if we request a file, the resulting URI should not differ from the original one.
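
To make the mapping concrete, here is a rough sketch of how a few incoming URIs turn into subrequests (my own illustration based on the request() code above; the wheel file name is hypothetical):

// incoming URI                               -> subrequest made by request()
// /                                          -> /s3-query/?delimiter=/                (list of all packages)
// /aiohttp-request                           -> /s3-query/?prefix=aiohttp-request/    (all versions of one package)
// /aiohttp-request/aiohttp_request-0.1.0.whl -> /s3-query/aiohttp-request/aiohttp_request-0.1.0.whl  (the file itself)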

Save and restart Nginx. In the browser, we enter the address of our Nginx; the result of the request will be XML, for example:

Directory List

<?xml version="1.0" encoding="UTF-8"?>
<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>myback-space</Name>
  <Prefix></Prefix>
  <Marker></Marker>
  <MaxKeys>10000</MaxKeys>
  <Delimiter>/</Delimiter>
  <IsTruncated>false</IsTruncated>
  <CommonPrefixes>
    <Prefix>new/</Prefix>
  </CommonPrefixes>
  <CommonPrefixes>
    <Prefix>old/</Prefix>
  </CommonPrefixes>
</ListBucketResult>

From the directory listing, only the CommonPrefixes elements will be needed.

By adding the directory we need to the address in the browser, we get its contents, also in the form of XML:

List files in a directory

<?xml version="1.0" encoding="UTF-8"?>
<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name> myback-space</Name>
  <Prefix>old/</Prefix>
  <Marker></Marker>
  <MaxKeys>10000</MaxKeys>
  <Delimiter></Delimiter>
  <IsTruncated>false</IsTruncated>
  <Contents>
    <Key>old/giphy.mp4</Key>
    <LastModified>2020-08-21T20:27:46.000Z</LastModified>
    <ETag>&#34;00000000000000000000000000000000-1&#34;</ETag>
    <Size>1350084</Size>
    <Owner>
      <ID>02d6176db174dc93cb1b899f7c6078f08654445fe8cf1b6ce98d8855f66bdbf4</ID>
      <DisplayName></DisplayName>
    </Owner>
    <StorageClass>STANDARD</StorageClass>
  </Contents>
  <Contents>
    <Key>old/hsd-k8s.jpg</Key>
    <LastModified>2020-08-31T16:40:01.000Z</LastModified>
    <ETag>&#34;b2d76df4aeb4493c5456366748218093&#34;</ETag>
    <Size>93183</Size>
    <Owner>
      <ID>02d6176db174dc93cb1b899f7c6078f08654445fe8cf1b6ce98d8855f66bdbf4</ID>
      <DisplayName></DisplayName>
    </Owner>
    <StorageClass>STANDARD</StorageClass>
  </Contents>
</ListBucketResult>

From the file listing, we take only the Key elements.

It remains to parse the resulting XML and return it as HTML, after changing the Content-Type header to text/html.

function request(r) {
  var v = r.variables;

  function call_back(resp) {
    var body = resp.responseBody;

    if (r.method !== 'PUT' && resp.status < 400 && v.postfix === '') {
      r.headersOut['Content-Type'] = "text/html; charset=utf-8";
      body = toHTML(body);
    }

    r.return(resp.status, body);
  }
  
  var _subrequest_uri = r.uri;
  ...
}

function toHTML(xml_str) {
  var keysMap = {
    'CommonPrefixes': 'Prefix',
    'Contents': 'Key',
  };

  var pattern = `<k>(?<v>.*?)</k>`;
  var out = [];

  for(var group_key in keysMap) {
    var reS;
    var reGroup = new RegExp(pattern.replace(/k/g, group_key), 'g');

    while(reS = reGroup.exec(xml_str)) {
      var data = new RegExp(pattern.replace(/k/g, keysMap[group_key]), 'g');
      var reValue = data.exec(reS);
      var a_text = '';

      if (group_key === 'CommonPrefixes') {
        a_text = reValue.groups.v.replace(/\//g, '');
      } else {
        a_text = reValue.groups.v.split('/').slice(-1);
      }

      out.push(`<a href="/${reValue.groups.v}">${a_text}</a>`);
    }
  }

  return '<html><body>\n' + out.join('</br>\n') + '\n</body></html>'
}
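
For reference, here is roughly what toHTML() would produce for the root listing shown earlier (my own sketch, assuming the real S3 response arrives without the pretty-printing used in the XML examples above, since the .*? patterns do not match across line breaks). pip essentially only looks at the anchor elements, which is what PEP 503 describes:

<html><body>
<a href="/new/">new</a></br>
<a href="/old/">old</a>
</body></html>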

Trying PyPI

First we check that nothing breaks anywhere on known-good packages.

# Create a new environment for testing
python3 -m venv venv
. ./venv/bin/activate

# Download working packages.
pip download aiohttp

# Upload to the private repo
for wheel in *.whl; do curl -T $wheel http://localhost:8080/${wheel%%-*}/$wheel; done

rm -f *.whl

# УстанавливаСм ΠΈΠ· ΠΏΡ€ΠΈΠ²Π°Ρ‚Π½ΠΎΠΉ Ρ€Π΅ΠΏΡ‹
pip install aiohttp -i http://localhost:8080

We repeat with our libs.

# Create a new environment for testing
python3 -m venv venv
. ./venv/bin/activate

pip install setuptools wheel
python setup.py bdist_wheel
for wheel in dist/*.whl; do curl -T $wheel http://localhost:8080/${wheel%%-*}/$wheel; done

pip install our_pkg --extra-index-url http://localhost:8080

In CI, building and uploading a package looks like this:

pip install setuptools wheel
python setup.py bdist_wheel

curl -sSfT dist/*.whl -u "gitlab-ci-token:${CI_JOB_TOKEN}" "https://pypi.our-domain.com/${CI_PROJECT_NAME}"

Authentication

In GitLab it is possible to use JWT for authentication/authorization of external services. Using the auth_request directive in Nginx, we redirect the authentication data to a subrequest that calls a function in the script. In the script, one more subrequest is made to the GitLab URL, and if the authentication data is correct, GitLab returns code 200 and uploading/downloading the package is allowed. Why not use a single subrequest and send the data straight to GitLab? Because then we would have to edit the Nginx configuration file every time something changes in authorization, and this is a rather dreary task. Also, if a read-only root filesystem policy is used in Kubernetes, this adds even more complexity when replacing nginx.conf via a configmap. And it becomes completely impossible to configure Nginx via a configmap when there are policies that prohibit mounting volumes (pvc) together with a read-only root filesystem (this also happens).

With NJS as an intermediary, we get the ability to change these parameters in the Nginx config via environment variables and to do some checks in the script (for example, rejecting an incorrectly specified URL; a sketch of such a check is shown after the auth example below).

nginx.conf

location = /auth-provider {
  internal;

  proxy_pass $auth_url;
}

location = /auth {
  internal;

  proxy_set_header Content-Length "";
  proxy_pass_request_body off;
  js_content auth.auth;
}

location ~ "^/(?<prefix>[w-]*)[/]?(?<postfix>[w-.]*)$" {
  auth_request /auth;

  js_content s3.request;
}

auth.js

var env = process.env;
var env_bool = new RegExp(/[Tt]rue|[Yy]es|[Oo]n|[TtYy]|1/);
var auth_disabled  = env_bool.test(env.DISABLE_AUTH);
var gitlab_url = env.AUTH_URL;

function url() {
  return `${gitlab_url}/jwt/auth?service=container_registry`
}

function auth(r) {
  if (auth_disabled) {
    r.return(202, '{"auth": "disabled"}');
    return null
  }

  r.subrequest('/auth-provider',
                {method: 'GET', body: ''},
                function(res) {
                  r.return(res.status, "");
                });
}

export default {auth, url}
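
As an example of the kind of check mentioned above, here is a minimal sketch of validating the AUTH_URL environment variable before it is used in a subrequest (my own illustration, not taken from the article's repository):

var gitlab_url = process.env.AUTH_URL;

function check_auth_url(r) {
  // Fail early if the auth provider URL is missing or malformed.
  if (!gitlab_url || !/^https?:\/\//.test(gitlab_url)) {
    r.error(`AUTH_URL is not set or malformed: "${gitlab_url}"`);
    r.return(500, '{"auth": "misconfigured"}');
    return false;
  }
  return true;
}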

Most likely, a question is brewing: why not use ready-made modules? Everything has already been done there! For example, var AWS = require('aws-sdk'), and there is no need to reinvent the wheel with S3 authentication!

Let's move on to the cons

For me, the inability to import external JS modules was an unpleasant but expected limitation. The require('crypto') shown in the example above refers to a built-in module, and require only works for built-ins. There is also no way to reuse code between scripts, so you have to copy-paste it into different files. I hope that someday this functionality will be implemented.

Also, for the current project, compression has to be disabled in Nginx: gzip off;

This is because there is no gzip module in NJS and there is no way to plug one in, so there is no way to work with compressed data. True, this is not much of a minus for this case. There is not much text, the transferred files are already compressed, and additional compression would not help them much. Nor is this service so busy or latency-critical that it is worth fussing over returning content a few milliseconds faster.

Debugging a script takes a long time and is only possible through "prints" in error.log. Depending on the configured logging level (info, warn or error), three methods can be used: r.log, r.warn and r.error respectively. I try to debug some scripts in Chrome (V8) or with the njs console tool, but not everything can be checked there. When debugging code, a.k.a. functional testing, the workflow looks something like this:

docker-compose restart nginx
curl localhost:8080/
docker-compose logs --tail 10 nginx

and there can be hundreds of such iterations.
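
For reference, a minimal sketch of what those "prints" look like inside an NJS handler (my own illustration, not from the article's repository); which of them reach error.log depends on the error_log level configured in nginx.conf:

function request(r) {
  r.log(`incoming URI: ${r.uri}`);                   // logged at level "info"
  r.warn(`prefix variable: ${r.variables.prefix}`);  // logged at level "warning"

  r.subrequest('/s3-query' + r.uri, { method: r.method }, function (resp) {
    if (resp.status >= 400) {
      r.error(`S3 replied with status ${resp.status}`);  // logged at level "error"
    }
    r.return(resp.status, resp.responseBody);
  });
}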

Writing code with subrequests, and variables for them, turns into a tangled mess. Sometimes you find yourself jumping between different IDE windows trying to figure out the sequence of actions in your code. It is not difficult, but sometimes very annoying.

There is no full support for ES6.

There may be other shortcomings, but I have not come across anything else. Share the info if you have had a negative experience with NJS.

Conclusion

NJS is a lightweight open-source interpreter that allows you to implement various JavaScript scripts in Nginx. During its development, a lot of attention was paid to performance. Of course, a lot is still missing, but the project is developed by a small team that actively adds new features and fixes bugs. I hope that someday NJS will allow connecting external modules, which would make the functionality of Nginx almost unlimited. But there is NGINX Plus, so most likely these features will never come!

Repository with the full code for the article

njs-pypi with AWS Sign v4 support

Description of ngx_http_js_module module directives

Official NJS repository and documentation

Examples of using NJS by Dmitry Volyntsev

njs - native JavaScript scripting in nginx / Talk by Dmitry Volyntsev at Saint HighLoad++ 2019

NJS in production / Presentation by Vasily Soshnikov at HighLoad++ 2019

Signing and Authenticating REST Requests in AWS

Source: habr.com