Why libraries need to get with apps and APIs

As an open education supporter, I’ve been following the progress of FOLIO, a community effort built around the idea that “the future of libraries is open” (a phrase that also forms the group’s acronym, FOLIO). FOLIO is a partnership between libraries and vendors that is developing an open source library services platform (LSP). FOLIO’s platform is built on the idea that library management software should be flexible, modular, extensible, modern, and affordable, and it’s gathered a number of partners and contributors to help make that vision a reality.

These partners have grown to include the leading academic research libraries of the Open Library Environment (OLE), part of the Open Library Foundation, which also hosts the FOLIO project; library and research services provider EBSCO Information Services (EBSCO); library software and services providers including Index Data, Stacks, Qulto, and @CULT; and library IT solutions and support provider ByWater Solutions.

To learn more about the initiative, with the help of FOLIO’s Kathleen McEvoy, I asked the project’s collaborators to share their insights on its purpose and future. Their answers are edited for length and clarity. 

Why was FOLIO created?

Kathleen McEvoy, VP of Communications, EBSCO, FOLIO Spokesperson: We believe we can save libraries the money they now invest in the lengthy process of implementing an LSP, which can take years to complete. With FOLIO, we believe we are shifting the paradigm for how libraries manage content, whether traditional print or the newer requirements of e-content, while creating a system that provides librarians with choice: apps and services created to prevent libraries from being locked into a monolithic system they had no involvement in creating.

If librarians can help create a platform that does what they need it to do, rather than spend time creating workarounds for current monolithic systems, we can create a true workflow that operates the library and helps the library expand its footprint within the academic institution.

Peter Murray, Open Source Community Advocate, Index Data: Open source in libraries has a long history. The inclination of libraries to share books, journals, and knowledge naturally mirrors what happens in open source communities. FOLIO came together as a deliberate strategic investment by libraries, library organizations, and service providers. It seeks to leverage the dynamics of an open source community to drive innovation in library services and rejuvenate the relationship between libraries and service providers. We expect the result will be new business models for service providers, new ideas coming to market faster and at lower risk, and more vibrant discussions between users of library software and creators of library software.

What is FOLIO’s technology model?

Gar Sydnor, Senior VP, Resource Management and FOLIO Services, EBSCO: The FOLIO architecture is based on microservices, which make it easy to quickly add new capabilities. Typical monolithic systems may be replaced every seven to 10 years, a process that takes a long time and is very costly. With a microservice architecture, there is no need to replace an entire system; rather, apps and modules can be continually renewed and replaced.

Sebastian Hammer, Founder, Index Data: FOLIO is inspired by cloud technologies and by open-ended platforms such as smartphones and operating systems. It is essentially a web-based system for administering complex workflows in and between organizations, in which the individual work tasks are supported by “apps” on the platform.

The FOLIO platform provides shared facilities such as notifications, permissions, and task management, but the organization of the workflows is entirely up to the library, and the possible tasks that FOLIO can support depend on the apps that are installed. In this way, FOLIO can be extremely adaptable to the needs of individual libraries, and it can support libraries that seek to extend the services they provide beyond their traditional role.

What are the advantages of FOLIO’s app-based platform?

Murray: Similar to how users have the freedom to choose from a variety of smartphone apps, libraries can choose which FOLIO apps to use. A library could continue to use its existing automation system (such as Koha or Evergreen) and make use of apps, such as electronic resource management. As long as the FOLIO RESTful API contracts are met, higher-level modules manipulating data from the proxy modules can successfully interact with legacy systems.

Hammer: The “app” model makes it considerably easier for individual developers or organizations to contribute to FOLIO since they only need to worry about the functionality of their own software, not the entire system. This model not only makes it possible for the FOLIO project to progress very rapidly, through the collaboration of many independent, loosely organized teams; it also inspires us to think about ways the FOLIO platform can be applied in entirely new areas. For instance, FOLIO team members are building apps to support research data management, and some medical libraries have even contemplated applying the platform to support certain clinical functions.

FOLIO supports “linked open data.” What is that and what does it provide?

Eric Frierson, Senior Director of Field Engineering, North America, EBSCO: Linked data describes a method of publishing structured data so that it can be interlinked and become more useful. A linked open data repository allows us to build tools that can show users connections across traditional boundaries.

This is easier to demonstrate in an example: Imagine a results list from a library database—a list of articles. A researcher clicks on an article by a particular author. If the database creator also includes the ORCID (Open Researcher and Contributor ID), which is an open linked data system, the database user interface can send a call and pull back all that author’s other writings. The database could then reliably show a list of “other works by this author.” Further, those works might have other linked data elements, co-authors, etc. This information helps to show that this author publishes frequently on particular topics and collaborates with certain professors.

How is FOLIO different from existing open source library solutions, such as Koha or Evergreen?

Michael Winkler, Managing Director, Open Library Environment (OLE): FOLIO is a somewhat unique collaboration of commercial and non-commercial partners, working together with a vision for creating a shared infrastructure for sustained innovation in library services. Community-based systems like Koha and Evergreen are focused on providing open source versions of traditional library management systems and related integrations. FOLIO seeks to build an innovation platform that takes a more comprehensive view of library management, not just bibliographic control and license management.

Brendan Gallagher, CEO and Founder, ByWater Solutions: An open source project such as Koha or Evergreen would not have to move to FOLIO to be part of the FOLIO project. Ideally, each project would invest in integrating the open source software they already love. Each project should concentrate on REST APIs that would work with FOLIO. Communication between the projects and software will only make a stronger platform. One of the core ideas of FOLIO is a RESTful messaging layer that will allow you to integrate many different applications.

Nathan Curulla, Owner/CRO, ByWater Solutions: FOLIO will be very much like Koha in many ways. The main difference is functionality as it pertains to electronic resources and the backend architecture that lends itself to a more “app-based” platform. [ByWater will also] create integration between FOLIO and Koha. Koha is the most used ILS on the planet and is supported by one of the largest and most vibrant communities in the library world. Creating a bridge between these two systems will help expand both the FOLIO and Koha communities and foster growth in markets already using business-critical open source software systems.

How will the partnership between ByWater Solutions and EBSCO help the project grow?

Sydnor: Many vendors, services providers, consortia, and other organizations will be offering hosting and service support. Because FOLIO is an open source platform, libraries at smaller two- and four-year colleges, public libraries, and school libraries can take advantage of FOLIO and will not have to provide dedicated IT staff.

Curulla: With ByWater providing the technical means for smaller institutions to adopt FOLIO, a more diverse group of libraries will have access to this system and will be able to take full advantage of the functionality it contains without the need to hire additional staff at a higher cost than a support contract. For larger institutions with internal staff to support the project, ByWater will be able to take care of the minor, yet time consuming, issues that come up with FOLIO, as well as provide upgrade and other maintenance services, thus freeing up staff time to focus on more important projects, such as custom development initiatives and community building. This will result in more feature development coming from users, thus compounding the speed of growth for the project.

What can you tell me about the community supporting FOLIO?

Sydnor: The community represents a variety of libraries and organizations working with or for libraries as well as developers, subject matter experts, vendors/service providers, user experience (UX) experts, and advocates for issues including accessibility, localization, privacy, etc. The community grew out of the participation of vendors, services providers, and developers and operates under the Open Library Foundation.

FOLIO is the first project for the Open Library Foundation. The initial organizations within the FOLIO project included the Open Library Environment (OLE) libraries, Index Data, and EBSCO. The community has grown rapidly since the project’s inception. More than 4,000 librarians, service providers, and developers are now following the project. Since January 2017, global FOLIO meetups—events that are hosted by libraries—have attracted more than 2,000 attendees. Today the project includes more than 70 developers, 200+ subject matter experts, and more than 10 vendors.

The community has set up an elaborate and entirely transparent infrastructure for managing development and community activities. Special interest groups (SIGs), which consist of subject matter experts, discuss system needs. The FOLIO product council then prioritizes the desired functionality before user interface (UI) designers translate these needs into UI elements. Product owners convert desired functionality into detailed requirements before handing off to developers. Meeting notes, schedules, group members, communications channels, and more are all available to those interested in the project and can be accessed via the FOLIO website and the wiki.

Are there ways to get involved?

Sydnor: Opportunities exist to review the roadmap, ask questions, offer suggestions, receive updates, and suggest or join SIGs. This transparency allows individuals to track the project, receive updates, or get as involved as they’d like.

Hammer: Developers interested in learning more about the project can go to dev.folio.org. All source code and documentation are assigned to the Open Library Foundation, which holds the intellectual property on behalf of the community. Code is maintained in the folio-org organization on GitHub. The license used is version 2 of the Apache License, which has been chosen specifically to allow for the broadest possible engagement with the community and reuse of the software.

Anatomy of a perfect pull request

Writing clean code is just one of many factors you should care about when creating a pull request.

Large pull requests cause significant overhead during code review and make it easier for bugs to slip into the codebase.

That’s why you need to care about the pull request itself. It should be short, have a clear title and description, and do only one thing.

Why should you care?

  • A good pull request will be reviewed quickly
  • It reduces the introduction of bugs into the codebase
  • It makes onboarding easier for new developers
  • It does not block other developers
  • It speeds up the code review process and, consequently, product development

The size of the pull request

The first step to identifying problematic pull requests is to look for big diffs.

Several studies show that it is harder to find bugs when reviewing a lot of code.

In addition, large pull requests will block other developers who may be depending on the code.

How can we determine the perfect pull request size?

A study of a Cisco Systems programming team revealed that a review of 200-400 LOC over 60 to 90 minutes should yield 70-90% defect discovery.

With this number in mind, a good pull request should not have more than 250 lines of code changed.

Image from Small Business Programming.

As shown in the chart above, pull requests with more than 250 lines of changes usually take more than one hour to review.
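One quick way to keep yourself honest is to check the size of a change before opening the pull request. A minimal sketch with plain git (the branch names and the numbers in the output are just an example; here the target branch is master and the work lives on my-feature):

$ git diff --shortstat master...my-feature
 12 files changed, 240 insertions(+), 31 deletions(-)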

Break down large pull requests into smaller ones

Feature breakdown is an art. The more you do it, the easier it gets.

What do I mean by feature breakdown?

Feature breakdown is understanding a big feature and breaking it into small pieces that make sense and that can be merged into the codebase piece by piece without breaking anything.

Learning by doing

Let’s say that you need to create a subscribe feature on your app. It’s just a form that accepts an email address and saves it.

Without knowing how your app works, I can already break it into eight pull requests:

  • Create a model to save emails
  • Create a route to receive requests
  • Create a controller
  • Create a service to save it in the database (business logic)
  • Create a policy to handle access control
  • Create a subscribe component (frontend)
  • Create a button to call the subscribe component
  • Add the subscribe button in the interface

As you can see, I broke this feature into many parts, most of which can be done simultaneously by different developers.
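In practice, each of those pieces can live on its own short-lived branch with its own small pull request. A rough sketch with git (the branch names are made up for illustration):

$ git checkout -b subscribe-model     # piece 1: just the email model
# ...implement, commit, and push, then open a small PR for this piece alone
$ git checkout master
$ git checkout -b subscribe-route     # piece 2: just the route
# ...and so on for the controller, service, policy, and frontend pieces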

Single responsibility principle

The single responsibility principle (SRP) is a computer programming principle that states that every module or class should have responsibility for a single part of the functionality provided by the software, and that responsibility should be entirely encapsulated by the class.

Just like classes and modules, pull requests should do only one thing.

Following the SRP reduces the overhead caused by reviewing code that attempts to solve several problems.

Before submitting a PR for review, try applying the single responsibility principle. If the code does more than one thing, break it into other pull requests.

Title and description matter

When creating a pull request, you should care about the title and the description.

Imagine that the code reviewer is joining your team today without knowing what is going on. They should be able to understand the changes from the title and description alone.

The image above shows what a good title and description look like.

The title of the pull request should be self-explanatory

The title should make clear what is being changed.

Here are some examples:
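For illustration (these titles are invented, not taken from a real project):

  • Too vague: “Fix bug” or “Update code”
  • Self-explanatory: “Add email validation to the subscribe form” or “Limit subscription requests to one per email address”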

Make a useful description

  • Describe what was changed in the pull request
  • Explain why this PR exists
  • Make it clear how it does what it sets out to do—for example, does it change a column in the database? How is this done? What happens to the old data?
  • Use screenshots to demonstrate what has changed.
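Putting those points together, a description can follow a simple template like the one below (the content is invented; adapt the headings to your team’s conventions):

What: Add server-side validation for the subscribe form’s email field.
Why: Invalid addresses were being saved and breaking the weekly mailing job.
How: New validation in the subscribe service; the email column and existing rows are unchanged.
Screenshots: Before and after of the form’s error message.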

Recap

Pull request size

The pull request must have a maximum of 250 lines of change.

Feature breakdown

Whenever possible, break pull requests into smaller ones.

Single Responsibility Principle

The pull request should do only one thing.

Title

Create a self-explanatory title that describes what the pull request does.

Description

Detail what was changed, why it was changed, and how it was changed.

This article was originally posted at Medium. Reposted with permission.

A friendly alternative to the find tool in Linux

fd is a super fast, Rust-based alternative to the Unix/Linux find command. It does not mirror all of find's powerful functionality; however, it does provide just enough features to cover 80% of the use cases you might run into. Features like a well-thought-out and convenient syntax, colorized output, smart case, regular expressions, and parallel command execution make fd a more than capable successor.

Installation

Head over to the fd GitHub page and check out the section on installation. It covers how to install the application on macOS, Debian/Ubuntu, Red Hat, and Arch Linux. Once installed, you can get a complete overview of all available command-line options by running fd -h for concise help, or fd --help for more detailed help.

Simple search

fd is designed to help you easily find files and folders in your operating system’s filesystem. The simplest search you can perform is to run fd with a single argument, that argument being whatever it is that you’re searching for. For example, let’s assume that you want to find a Markdown document that has the word services as part of the filename:

$ fd services
downloads/services.md

If called with just a single argument, fd searches the current directory recursively for any files and/or directories that match your argument. The equivalent search using the built-in find command looks something like this:

$ find . -name '*services*'
downloads/services.md

As you can see, fd is much simpler and requires less typing. Getting more done with less typing is always a win in my book.

Files and folders

You can restrict your search to files or directories by using the -t argument, followed by the letter that represents what you want to search for. For example, to find all files in the current directory that have services in the filename, you would use:

$ fd -tf services
downloads/services.md

And to find all directories in the current directory that have services in the filename:

$ fd -td services
applications/services
library/services
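For comparison, the rough find equivalents need an explicit type flag and wildcards (shown here for GNU find; other implementations may differ slightly):

$ find . -type f -name '*services*'
$ find . -type d -name '*services*'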

How about listing all documents with the .md extension in the current folder?

$ fd .md
administration/administration.md
development/elixir/elixir_install.md
readme.md
sidebar.md
linux.md

As you can see from the output, fd not only found and listed files from the current folder, but it also found files in subfolders. Pretty neat. You can even search for hidden files using the -H argument:

$ fd -H sessions .
.bash_sessions
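One thing to keep in mind is that the search pattern is treated as a regular expression, so the dot in the earlier .md search actually matches any character. If you want to filter specifically by file extension, fd also offers an -e option:

$ fd -e md

This should return only files whose extension is .md.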

Specifying a directory

If you want to search a specific directory, the name of the directory can be given as a second argument to fd:

$ fd passwd /etc
/etc/default/passwd
/etc/pam.d/passwd
/etc/passwd

In this example, we’re telling fd that we want to search for all instances of the word passwd in the etc directory.

Global searches

What if you know part of the filename but not the folder? Let’s say you downloaded a book on Linux network administration but you have no idea where it was saved. No problem:

$ fd Administration /
/Users/pmullins/Documents/Books/Linux/Mastering Linux Network Administration.epub
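The feature list at the top also mentions parallel command execution. fd can run a command for each search result via its -x option. As a sketch (the convert invocation and placeholder syntax are just an illustration), something like this would convert every JPEG it finds into a PNG, running the jobs in parallel:

$ fd -e jpg -x convert {} {.}.png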

Wrapping up

The fd utility is an excellent replacement for the find command, and I’m sure you’ll find it just as useful as I do. To learn more about the command, simply explore the rather extensive man page.

Getting started with Buildah

Buildah is a command-line tool for building Open Container Initiative-compatible (that means Docker- and Kubernetes-compatible, too) images quickly and easily. It can act as a drop-in replacement for the Docker daemon’s docker build command (i.e., building images with a traditional Dockerfile) but is flexible enough to allow you to build images with whatever tools you prefer to use. Buildah is easy to incorporate into scripts and build pipelines, and best of all, it doesn’t require a running container daemon to build its image.

A drop-in replacement for docker build

You can get started with Buildah immediately, dropping it into place where images are currently built using a Dockerfile and docker build. Buildah's build-using-dockerfile (or bud) subcommand makes it behave just like docker build does, so it's easy to incorporate into existing scripts or build pipelines.

As with previous articles I’ve written about Buildah, I like to use the example of installing “GNU Hello” from source. Consider this Dockerfile:

FROM fedora:28
LABEL maintainer Chris Collins

RUN dnf install -y tar gzip gcc make \
        && dnf clean all

ADD http://ftpmirror.gnu.org/hello/hello-2.10.tar.gz /tmp/hello-2.10.tar.gz

RUN tar xvzf /tmp/hello-2.10.tar.gz -C /opt

WORKDIR /opt/hello-2.10

RUN ./configure
RUN make
RUN make install
RUN hello -v
ENTRYPOINT "/usr/local/bin/hello"

Buildah can create an image from this Dockerfile as easily as buildah bud -t hello ., replacing docker build -t hello .:

[chris@krang] $ sudo buildah bud -t hello .
STEP 1: FROM fedora:28
Getting image source signatures
Copying blob sha256:e06fd16225608e5b92ebe226185edb7422c3f581755deadf1312c6b14041fe73
 81.48 MiB / 81.48 MiB [====================================================] 8s
Copying config sha256:30190780b56e33521971b0213810005a69051d720b73154c6e473c1a07ebd609
 2.29 KiB / 2.29 KiB [======================================================] 0s
Writing manifest to image destination
Storing signatures
STEP 2: LABEL maintainer Chris Collins
STEP 3: RUN dnf install -y tar gzip gcc make    && dnf clean all

<snip>

Once the build is complete, you can see the new image with the buildah images command:

[chris@krang] $ sudo buildah images
IMAGE ID        IMAGE NAME                              CREATED AT              SIZE
30190780b56e    docker.io/library/fedora:28             Mar 7, 2018 16:53       247 MB
6d54bef73e63    docker.io/library/hello:latest           May 3, 2018 15:24       391.8 MB

The new image, tagged hello:latest, can be pushed to a remote image registry or run using CRI-O or other Kubernetes CRI-compatible runtimes. If you’re testing it as a replacement for Docker build, you will probably want to copy the image to the Docker daemon’s local image storage so it can be run by Docker. This is easily accomplished with the buildah push command:

[chris@krang] $ sudo buildah push hello:latest docker-daemon:hello:latest
Getting image source signatures
Copying blob sha256:72fcdba8cff9f105a61370d930d7f184702eeea634ac986da0105d8422a17028
 247.02 MiB / 247.02 MiB [==================================================] 2s
Copying blob sha256:e567905cf805891b514af250400cc75db3cb47d61219750e0db047c5308bd916
 144.75 MiB / 144.75 MiB [==================================================] 1s
Copying config sha256:6d54bef73e638f2e2dd8b7bf1c4dfa26e7ed1188f1113ee787893e23151ff3ff
 1.59 KiB / 1.59 KiB [======================================================] 0s
Writing manifest to image destination
Storing signatures

[chris@krang] $ sudo docker images | head -n2
REPOSITORY              TAG             IMAGE ID        CREATED                 SIZE
docker.io/hello         latest          6d54bef73e63    2 minutes ago           398 MB

[chris@krang] $ sudo docker run -t hello:latest
Hello, world!

A few differences

Unlike Docker build, Buildah doesn’t commit changes to a layer automatically for every instruction in the Dockerfile—it builds everything from top to bottom, every time. On the positive side, this means non-cached builds (for example, those you would do with automation or build pipelines) end up being somewhat faster than their Docker build counterparts, especially if there are a lot of instructions. This is great for getting new changes into production quickly from an automated deployment or continuous delivery standpoint.

Practically speaking, however, the lack of caching may not be quite as useful for image development, where caching layers can save significant time when doing builds over and over again. This applies only to the build-using-dockerfile command, however. When using Buildah native commands, as we’ll see below, you can choose when to commit your changes to disk, allowing for more flexible development.

Buildah native commands

Where Buildah really shines is in its native commands, which you can use to interact with container builds. Rather than using build-using-dockerfile/bud for each build, Buildah has commands to actually interact with the temporary container created during the build process. (Docker uses temporary, or intermediate containers, too, but you don’t really interact with them while the image is being built.)

Using the “GNU Hello” example again, consider this image build using Buildah commands:

#!/usr/bin/env bash

set -o errexit

# Create a container
container=$(buildah from fedora:28)

# Labels are part of the "buildah config" command
buildah config --label maintainer="Chris Collins <collins.christopher@gmail.com>" $container

# Grab the source code outside of the container
curl -sSL http://ftpmirror.gnu.org/hello/hello-2.10.tar.gz -o hello-2.10.tar.gz

buildah copy $container hello-2.10.tar.gz /tmp/hello-2.10.tar.gz

buildah run $container dnf install -y tar gzip gcc make
buildah run $container dnf clean all
buildah run $container tar xvzf /tmp/hello-2.10.tar.gz -C /opt

# Workingdir is also a "buildah config" command
buildah config --workingdir /opt/hello-2.10 $container

buildah run $container ./configure
buildah run $container make
buildah run $container make install
buildah run $container hello -v

# Entrypoint, too, is a "buildah config" command
buildah config --entrypoint /usr/local/bin/hello $container

# Finally, save the running container to an image
buildah commit --format docker $container hello:latest

One thing that should be immediately obvious is the fact that this is a Bash script rather than a Dockerfile. Using Buildah’s native commands makes it easy to script, in whatever language or automation context you like to use. This could be a makefile, a Python script, or whatever tools you like to use.

So what is going on here? The first Buildah command, container=$(buildah from fedora:28), creates a running container from the fedora:28 image and stores the container name (the output of the command) in a variable for later use. All the rest of the Buildah commands use the $container variable to specify which container to act upon. For the most part those commands are self-explanatory: buildah copy moves a file into the container, and buildah run executes a command in the container. It is easy to match them to their Dockerfile equivalents.

The final command, buildah commit, commits the container to an image on disk. When building images with Buildah commands rather than from a Dockerfile, you can use the commit command to decide when to save your changes. In the example above, all of the changes are committed at once, but intermediate commits could be included too, allowing you to choose cache points from which to start. (For example, it would be particularly useful to cache to disk after the dnf install, as that can take a long time, but is also reliably the same each time.)
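As a sketch of what such a cache point might look like (hello-builddeps is just a hypothetical tag), you could commit the container right after the package installation and let later builds start from that image:

# Hypothetical cache point: save the container right after the slow dnf install
buildah commit --format docker $container hello-builddeps:latest

# A later build could then start from the cached image instead of fedora:28
container=$(buildah from hello-builddeps:latest)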

Mountpoints, install directories, and chroots

Another useful Buildah command opens the door to a lot of flexibility in building images. buildah mount mounts the root directory of a container to a mountpoint on your host. For example:

[chris@krang] $ container=$(sudo buildah from fedora:28)
[chris@krang] $ mountpoint=$(sudo buildah mount ${container})
[chris@krang] $ echo $mountpoint
/var/lib/containers/storage/overlay2/463eda71ec74713d8cebbe41ee07da5f6df41c636f65139a7bd17b24a0e845e3/merged
[chris@krang] $ cat ${mountpoint}/etc/redhat-release
Fedora release 28 (Twenty Eight)
[chris@krang] $ ls ${mountpoint}
bin   dev  home  lib64       media  opt   root  sbin  sys  usr
boot  etc  lib   lost+found  mnt    proc  run   srv   tmp  var

This is great because now you can interact with the mountpoint to make changes to your container image. This allows you to use tools on your host to build and install software, rather than including those tools in the container image itself. For example, in the Bash script above, we needed to install the tar, Gzip, GCC, and make packages to compile “GNU Hello” inside the container. Using a mountpoint, we can build an image with the same software, but the downloaded tarball and tar, Gzip, etc., RPMs are all on the host machine rather than in the container and resulting image:

#!/usr/bin/env bash

set -o errexit

# Create a container
container=$(buildah from fedora:28)
mountpoint=$(buildah mount $container)

buildah config --label maintainer="Chris Collins <collins.christopher@gmail.com>" $container

curl -sSL http://ftpmirror.gnu.org/hello/hello-2.10.tar.gz \
     -o /tmp/hello-2.10.tar.gz
tar xvzf /tmp/hello-2.10.tar.gz -C ${mountpoint}/opt

pushd ${mountpoint}/opt/hello-2.10
./configure
make
make install DESTDIR=${mountpoint}
popd

chroot $mountpoint bash -c "/usr/local/bin/hello -v"

buildah config --entrypoint "/usr/local/bin/hello" $container
buildah commit --format docker $container hello
buildah unmount $container

Take note of a few things in the script above:

  1. The curl command downloads the tarball to the host, not the image

  2. The tar command (running from the host itself) extracts the source code from the tarball into /opt inside the container.

  3. Configure, make, and make install are all running from a directory inside the mountpoint, mounted to the host rather than running inside the container itself.

  4. The chroot command here is used to change root into the mountpoint itself and test that “hello” is working, similar to the buildah run command used in the previous example.

This script is shorter, it uses tools most Linux folks are already familiar with, and the resulting image is smaller (no tarball, no extra packages, etc). You could even use the package manager for the host system to install software into the container. For example, let’s say you wanted to install NGINX into the container with GNU Hello (for whatever reason):

[chris@krang] $ mountpoint=$(sudo buildah mount ${container})
[chris@krang] $ sudo dnf install nginx --installroot $mountpoint
[chris@krang] $ sudo chroot $mountpoint nginx -v
nginx version: nginx/1.12.1

In the example above, DNF is used with the --installroot flag to install NGINX into the container, which can be verified with chroot.

Try it out!

Buildah is a lightweight and flexible way to create container images without running a full Docker daemon on your host. In addition to offering out-of-the-box support for building from Dockerfiles, Buildah is easy to use with scripts or build tools of your choice and can help build container images using existing tools on the build host. The result is leaner images that use less bandwidth to ship around, require less storage space, and have a smaller surface area for potential attackers. Give it a try!

[See our related story, Creating small containers with Buildah]

Getting started with regular expressions

Regular expressions can be one of the most powerful tools in your toolbox as a Linux user, system administrator, or even as a programmer. They can also be one of the most daunting things to learn, but it doesn’t have to be that way! While there are an infinite number of ways to write an expression, you don’t have to learn every single switch and flag. In this short how-to, I’ll show you a few simple ways to use regex that will have you up and running in no time and share some follow-up resources that will make you a regex master if you want to be.

A quick overview

Regular expressions, also referred to as “regex” patterns or even “regular statements,” are in simple terms “a sequence of characters that define a search pattern.” The idea came about in the 1950s when Stephen Cole Kleene wrote a description of an idea he called a “regular language,” of which part came to be known as “Kleene’s theorem.” At a very high level, it says if the elements of the language can be defined, then an expression can be written to match patterns within that language.

Since then, regular expressions have been part of even the earliest Unix programs, including vi, sed, awk, grep, and others. In fact, the word grep is derived from the command that was used in the earliest “ed” editor, namely g/re/p, which essentially means “do a global search for this regular expression and print the lines.” Cool!

Why we need regular expressions

As mentioned above, regular expressions are used to define a pattern to help us match on or “find” objects that match that pattern. Those objects can be files in a filesystem when using the find command for instance, or a block of text in a file which we might search using grep, awk, vi, or sed, for example.

Start with the basics

Let’s start at the very beginning; it’s a very good place to start.

The first regex everyone seems to learn is probably one you already know and didn’t realize what it was. Have you ever wanted to print out a list of files in a directory, but it was too long? Maybe you’ve seen someone type *.gif to list GIF images in a directory, like:

$ ls *.gif

That’s a regular expression!

When writing regular expressions, certain characters have special meaning to allow us to move beyond matching just characters to matching entire sets of characters. In this case, the * character, also called “star” or “splat,” stands in for any sequence of characters and allows you to match all files ending with .gif.

Search for patterns in a file

The next step in your regex training is searching for patterns within a file, especially using the replace pattern to make quick changes.

Two common ways to do this are:

  1. Use vi to open the file, search for a pattern, and make the change (even automatically using replace).
  2. Use the “stream editor,” aka sed, to programmatically search within the file and make the change.

Let’s start by learning some regex by using vi to edit the following file:

The quick brown fox jumped over the lazy dog.
Simple test
Harder test
Extreme test case
ABC 123 abc 567
The dog is lazy

Now, with this file open in vi, let’s look at some regex examples that will help us find some matching strings inside and even replace them automatically.

To make things easier, let’s set vi to ignore case. Type :set ic to enable case-insensitive searching.

Now, to start searching in vi, type the / character followed by your search pattern.

Search for things at the beginning or end of a line

To find a line that starts with “Simple,” use this regex pattern:
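/^Simple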

Notice in the image below that only the line starting with “Simple” is highlighted. The carat symbol (^) is the regex equivalent of “starts with.”

'Simple' highlighted

Next, let’s use the $ symbol, which in regex speak is “ends with.”
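/test$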

'Test' highlighted

See how it highlights both lines that end in “test”? Also, notice that the fourth line has the word test in it, but not at the end, so this line is not highlighted.

This is the power of regular expressions, giving you the ability to quickly look across a great number of matches with ease but specifically drill down on only exact matches.

Test for the frequency of occurrence

To further extend your skills in regular expressions, let’s take a look at some more common special characters that allow us to look for not just matching text, but also patterns of matches.

Frequency matching characters:

Character  Meaning                                          Example
*          Zero or more                                     ab* – the letter a followed by zero or more b’s
+          One or more                                      ab+ – the letter a followed by one or more b’s
?          Zero or one                                      ab? – the letter a followed by zero or one b
{n}        Given a number, find exactly that number         ab{2} – the letter a followed by exactly two b’s
{n,}       Given a number, find at least that number        ab{2,} – the letter a followed by at least two b’s
{n,y}      Given two numbers, find a range between them     ab{1,3} – the letter a followed by between one and three b’s

Find classes of characters

The next step in regex training is to use classes of characters in our pattern matching. What’s important to note here is that these classes can be combined either as a list, such as [a,d,x,z], or as a range, such as [a-z], and that characters are usually case sensitive.

To see this work in vi, we’ll need to turn off the ignore case we set earlier. Let’s type :set noic to turn ignore case off again.

Some common classes of characters that are used as ranges are:

  • a-z – all lowercase characters
  • A-Z – all UPPERCASE characters
  • 0-9 – numbers

Now, let’s try a search similar to one we ran earlier:
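/tT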

Do you notice that it finds nothing? That’s because the previous regex looks for exactly “tT.” If we replace this with:
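/[tT]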

We’ll see that both the lowercase and UPPERCASE T’s are matched across the document.

Letter 't' highlighted

Now, let’s chain a couple of class ranges together and see what we get. Try:

capital letters and 123 are highlighted

Notice that the capital letters and 123 are highlighted, but not the lowercase letters (including the end of line five).

Flags

The last step in your beginning regex training is to understand flags that exist to search for special types of characters without needing to list them in a range.

  • . – any character
  • \s – whitespace
  • \w – word character
  • \d – digit (number)

For example, to find all digits in the example text, use:
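/\d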

Notice in the example below that all of the numbers are highlighted.

numbers are highlighted

To match on the opposite, you usually use the same flag, but in UPPERCASE. For example:

  • \S – not a space
  • \W – not a word character
  • \D – not a digit

Notice in the example below that by using \D, all characters EXCEPT the numbers are highlighted.

all characters EXCEPT the numbers are highlighted

Searching with sed

A quick note on sed: It’s a stream editor, which means you don’t interact with a user interface. It takes the stream coming in one side and writes it out the other side.

Using sed is very similar to vi, except that you give it the regex to search and replace, and it returns the output. For example:

sed s/dog/cat/ examples

will return the following to the screen:

The quick brown fox jumped over the lazy cat.
Simple test
Harder test
Extreme test case
ABC 123 abc 567
The cat is lazy

If you want to save that file, it’s only slightly more tricky. You’ll need to chain a couple of commands together to a) write that file, and b) copy it over the top of the first file.

To do this, try:

sed s/dog/cat/ examples > temp.out; mv temp.out examples

Now, if you look at your examples file, you’ll see that the word “dog” has been replaced.

The quick brown fox jumped over the lazy cat.
Simple test
Harder test
Extreme test case
ABC 123 abc 567
The cat is lazy
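If your version of sed supports in-place editing (GNU sed does, via the -i option), you can skip the temporary file entirely:

sed -i s/dog/cat/ examples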

For more information

I hope this was a helpful overview of regular expressions. Of course, this is just the tip of the iceberg, and I hope you’ll continue to learn about this powerful tool by reviewing additional resources and examples.


Choosing the right open source tool for movie project management

One thing artists, engineers, and hackers have in common is their antipathy for management. So when the time comes that we actually need project management, it can be a painful growing experience.

For the Lunatics! animated open movie project, we started by using basic tools popular with open source software projects, like a version control system (Subversion), a wiki (MediaWiki), and a bug-tracker and online browser for the source code (Trac). This is viable for a team of a half-dozen people and an unhurried schedule on a volunteer project. But it quickly becomes unmanageable for larger teams and tighter schedules.

Fortunately, there are plenty of open source project management software packages, which can provide structural guidance and hold a lot more information about your project than you can comfortably keep in your own head, freeing you to apply yourself more creatively. The challenge is choosing the right package. And for that, we need to think more carefully about what we want from it.

My previous article dealt with the first and most concrete aspect of this problem: digital asset management. But even more important are the people working on the project and how they apply their time and resources to it, so we have to define what we need for that.

Defining what we need to manage

Breakdown

Project management starts with breaking down the big goal into lots of smaller goals—ideally down to the individual assets needed. This is called the breakdown.

Breakdown can be done by reviewing the script and identifying and listing the elements needed to produce each scene. At its simplest, it can be done in a text editor, but more streamlined solutions can speed things up.

Workflow

Once you have broken down the film into its individual assets, each asset will have to go through key phases of production—for example, a 3D model will need to be designed, modeled, textured, rigged, and animated. Each step might be done by a different person with specialized skills, so the asset will have to move from person to person.

Since asset formats (such as our Blender files) often can’t be merged if two people try to work on a file at once, it’s important to keep track of each asset’s phase and who has control of it. If you mess up and produce two parallel, out-of-sync versions of the file, you’ll probably have to ditch one of them and redo that work.

Scheduling and time management

Productions run on timetables. You want to be able to tell people when you will be finished, and you want to finish first things first.

You also may need to identify specific times when you can meet to discuss the project, and—depending on the terms of collaboration—you may need to keep track of the time spent by collaborators on the project.

Until now, we’ve handled most of these tasks through simple text files or LibreOffice Calc spreadsheets, in some cases shared through a MediaWiki site.

Communications

A key problem to solve for a team mediated by the internet is how to maintain context for conversations: you need everyone involved to know what you are talking about.

Much of the time spent on communications involves communicating the context of the conversation—what project, asset, or task are we talking about? We’ve done that using GIMP or Inkscape to produce quick markup images that we share by chat, email, or a phpBB forum.

Things can be done to speed that up. Blender contains its own internal markup system, called Grease Pencil. It isn’t much faster to use than sketching over a screen capture, although it does work better in 3D; in fact, it’s so sophisticated that people have produced short animated films using it artistically.

We’ve considered using videoconferencing and digital whiteboard package Big Blue Button (on GitHub) for team communications, but it’s probably overkill for our project.

New platform options

To step up from our existing Trac site, we might first consider Trac-like alternatives for managing the project, such as Redmine, which would add several new project management tools, including search, workflow, and scheduling features in addition to handling multiple projects.

We could also look at what other projects are using. Blender Foundation runs a Software-as-a-Service subscription platform for open movies called Blender Cloud. Its core project management software is Attract (see its development site). This is tightly integrated with Blender and provides an API that can be accessed from Blender. It’s definitely an attractive option for a Blender-centered project.

Morevna Project has experimented with dotProject (development on GitHub) in the past and more recently with Open Project.

Urchn.org’s “Tube” project has been using Helga for years, but it is essentially orphaned now (see its development on the Internet Archive).

For business reasons, we are also considering installing an open source enterprise platform called Odoo (previously known as OpenERP), which includes the Odoo Project (with development on GitHub). That would potentially be an easy add for us as well.

Wikipedia offers a comparison of project management packages, of which 31 are open source. Aside from the ones mentioned above, a few stand out as interesting.

ProjeQtor and TACTIC are among the most full-featured options on that list.

As mentioned in my previous Opensource.com article about asset management, TACTIC was a competitor for the production-management software used with the Blender Gooseberry Project (producing Cosmos Laundromat) before Blender Foundation decided to create a custom solution.

We chose the TACTIC platform because it is:

  • Designed specifically for animation production
  • Highly flexible in terms of workflow, scheduling, and collaboration features and allows template-based, per-project assignment of workflows and asset types
  • Tightly coupled with the digital asset management system, automatically associating tickets, workflow, schedules, and conversations within the context of each asset
  • Neutral on the choice of creative application (web-based interfaces)
  • Easy to integrate with clients through its web API
  • Written in Python, which is a clearly understandable language we have the skills to work with
  • Quite complete in available project management reports and features

Combined with Odoo for business commerce applications and Mumble for real-time voice communications, our new TACTIC platform should allow us to meet our goals of speeding production and growing our team to manage it.

How the four components of a distributed tracing system work together

Ten years ago, essentially the only people thinking hard about distributed tracing were academics and a handful of large internet companies. Today, it’s turned into table stakes for any organization adopting microservices. The rationale is well-established: microservices fail in surprising and often spectacular ways, and distributed tracing is the best way to describe and diagnose those failures.

That said, if you set out to integrate distributed tracing into your own application, you’ll quickly realize that the term “Distributed Tracing” means different things to different people. Furthermore, the tracing ecosystem is crowded with partially-overlapping projects with similar charters. This article describes the four (potentially) independent components in distributed tracing, and how they fit together.

Distributed tracing: A mental model

Most mental models for tracing descend from Google’s Dapper paper. OpenTracing uses similar nouns and verbs, so we will borrow the terms from that project:

Tracing

  • Trace: The description of a transaction as it moves through a distributed system.
  • Span: A named, timed operation representing a piece of the workflow. Spans accept key:value tags as well as fine-grained, timestamped, structured logs attached to the particular span instance.
  • Span context: Trace information that accompanies the distributed transaction, including when it passes from service to service over the network or through a message bus. The span context contains the trace identifier, span identifier, and any other data that the tracing system needs to propagate to the downstream service.

If you would like to dig into a detailed description of this mental model, please check out the OpenTracing specification.

The four big pieces

From the perspective of an application-layer distributed tracing system, a modern software system looks like the following diagram:

Tracing

The components in a modern software system can be broken down into three categories:

  • Application and business logic: Your code.
  • Widely shared libraries: Other people’s code.
  • Widely shared services: Other people’s infrastructure.

These three components have different requirements and drive the design of the distributed tracing system that is tasked with monitoring the application. The resulting design yields four important pieces:

  • A tracing instrumentation API: What decorates application code.
  • Wire protocol: What gets sent alongside application data in RPC requests.
  • Data protocol: What gets sent asynchronously (out-of-band) to your analysis system.
  • Analysis system: A database and interactive UI for working with the trace data.

To explain this further, we’ll dig into the details which drive this design. If you just want my suggestions, please skip to the four big solutions at the bottom.

Requirements, details, and explanations

Application code, shared libraries, and shared services have notable operational differences, which heavily influence the requirements for instrumenting them.

Instrumenting application code and business logic

In any particular microservice, the bulk of the code written by the microservice developer is the application or business logic. This is the code that defines domain-specific operations; typically, it contains whatever special, unique logic justified the creation of a new microservice in the first place. Almost by definition, this code is usually not shared or otherwise present in more than one service.

That said, you still need to understand it, and that means it needs to be instrumented somehow. Some monitoring and tracing analysis systems auto-instrument code using black-box agents, and others expect explicit “white-box” instrumentation. For the latter, abstract tracing APIs offer many practical advantages for microservice-specific application code:

  • An abstract API allows you to swap in new monitoring tools without re-writing instrumentation code. You may want to change cloud providers, vendors, and monitoring technologies, and a huge pile of non-portable instrumentation code would add meaningful overhead and friction to that procedure.
  • It turns out there are other interesting uses for instrumentation, beyond production monitoring. There are existing projects that use this same tracing instrumentation to power testing tools, distributed debuggers, “chaos engineering” fault injectors, and other meta-applications.
  • But most importantly, what if you wanted to extract an application component into a shared library? That leads us to:

Instrumenting shared libraries

The utility code present in most applications—code that handles network requests, database calls, disk writes, threading, queueing, concurrency management, and so on—is often generic and not specific to any particular application. This code is packaged up into libraries and frameworks which are then installed in many microservices, and deployed into many different environments.

This is the real difference: with shared code, someone else is the user. Most users have different dependencies and operational styles. If you attempt to instrument this shared code, you will note a couple of common issues:

  • You need an API to write instrumentation. However, your library does not know what analysis system is being used. There are many choices, and all the libraries running in the same application cannot make incompatible choices.
  • The task of injecting and extracting span contexts from request headers often falls on RPC libraries, since those packages encapsulate all network-handling code. However, a shared library cannot know which tracing protocol is being used by each application.
  • Finally, you don’t want to force conflicting dependencies on your user. Even if they use gRPC, will it be the same version of gRPC you are binding to? So any monitoring API your library brings in for tracing must be free of dependencies.

So, an abstract API which (a) has no dependencies, (b) is wire protocol agnostic, and (c) works with popular vendors and analysis systems should be a requirement for instrumenting shared library code.

Instrumenting shared services

Finally, sometimes entire services—or sets of microservices—are general-purpose enough that they are used by many independent applications. These shared services are often hosted and managed by third parties. Examples might be cache servers, message queues, and databases.

It’s important to understand that shared services are essentially “black boxes” from the perspective of application developers. It is not possible to inject your application’s monitoring solution into a shared service. Instead, the hosted service often runs its own monitoring solution.

The four big solutions

So, an abstracted tracing API would help libraries emit data and inject/extract Span Context. A standard wire protocol would help black-box services interconnect, and a standard data format would help separate analysis systems consolidate their data. Let’s have a look at some promising options for solving these problems.

Tracing API: The OpenTracing project

As shown above, in order to instrument application code, a tracing API is required. And in order to extend that instrumentation to shared libraries, where most of the Span Context injection and extraction occurs, the API must be abstracted in certain critical ways.

The OpenTracing project aims to solve this problem for library developers. OpenTracing is a vendor-neutral tracing API which comes with no dependencies, and is quickly gaining support from a large number of monitoring systems. This means that, increasingly, if libraries ship with native OpenTracing instrumentation baked in, tracing will automatically be enabled when a monitoring system connects at application startup.

Personally, as someone who has been writing, shipping, and operating open source software for over a decade, I find it profoundly satisfying to work on the OpenTracing project and finally scratch this observability itch.

In addition to the API, the OpenTracing project maintains a growing list of contributed instrumentation, some of which can be found here. If you would like to get involved, either by contributing an instrumentation plugin, natively instrumenting your own OSS libraries, or just want to ask a question, please find us on Gitter and say hi.

Wire Protocol: The trace-context HTTP headers

In order for monitoring systems to interoperate, and to mitigate migration issues when changing from one monitoring system to another, a standard wire protocol is needed for propagating Span Context.

The W3C Distributed Trace Context Community Group is hard at work defining this standard. Currently, the focus is on defining a set of standard HTTP headers. The latest draft of the specification can be found here. If you have questions for this group, the mailing list and Gitter chatroom are great places to go for answers.
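To give a sense of the approach (this is illustrative only; the header names and exact format may change as the draft evolves), the propagated span context boils down to a small set of HTTP headers that carry the trace and parent span identifiers, along these lines:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01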

Data protocol (Doesn’t exist yet!!)

For black-box services, where it is not possible to install a tracer or otherwise interact with the program, a data protocol is needed to export data from the system.

Work on this data format and protocol is currently at an early stage, and mostly happening within the context of the W3C Distributed Trace Context Working Group. There is particular interest in defining higher-level concepts, such as RPC calls, database statements, etc., in a standard data schema. This would allow tracing systems to make assumptions about what kind of data would be available. The OpenTracing project is also working on this issue by starting to define a standard set of tags. The plan is for these two efforts to dovetail with each other.

Note that there is a middle ground available at the moment. For “network appliances” that the application developer operates, but does not want to compile or otherwise perform code modifications to, dynamic linking can help. The primary examples of this are service meshes and proxies, such as Envoy or NGINX. For this situation, an OpenTracing-compliant tracer can be compiled as a shared object, and then dynamically linked into the executable at runtime. This option is currently provided by the C++ OpenTracing API. For Java, an OpenTracing Tracer Resolver is also under development.

These solutions work well for services that support dynamic linking, and are deployed by the application developer. But in the long run, a standard data protocol may solve this problem more broadly.

Analysis system: A service for extracting insights from trace data

Last but not least, there is now a cornucopia of tracing and monitoring solutions. A list of monitoring systems known to be compatible with OpenTracing can be found here, but there are many more options out there. I would encourage you to research your options, and I hope you find the framework provided in this article to be useful when comparing options. In addition to rating monitoring systems based on their operational characteristics (not to mention whether you like the UI and features), make sure you think about the three big pieces above, their relative importance to you, and how the tracing system you are interested in provides a solution to them.

Conclusion

In the end, how important each piece is depends heavily on who you are and what kind of system you are building. For example, open source library authors are very interested in the OpenTracing API, while service developers tend to be more interested in the Trace-Context specification. When someone says one piece is more important than the other, they usually mean “one piece is more important to me than the other.”

However, the reality is this: Distributed tracing has become a necessity for monitoring modern systems. In designing the building blocks for these systems, the age-old approach of “decouple where you can” still holds true. Cleanly decoupled components are the best way to maintain flexibility and forward compatibility when building something as cross-cutting as a distributed monitoring system.

Thanks for reading! Hopefully, when you’re ready to implement tracing in your own application, this guide will help you understand which pieces people are talking about and how they fit together.


Want to learn more? Sign up to attend KubeCon EU in May or KubeCon North America in December.

A brief history of bad passwords

IT-mandated password policies seem like a good idea—after all, what are the chances that an attacker will guess your exact passcode out of the 782 million potential combinations in an eight-character string with at least one upper-case letter, one lower-case letter, two numerals, and one symbol? 

Those odds are not in your favor because most IT password policies don’t consider how people select and use passwords in the real world, says Kyle Rankin, chief security officer at Purism and author of Linux Hardening in Hostile Networks. Password policies don’t work because hackers do consider how people think.

Watch Kyle’s Lightning Talk, “Sex, Secret, and God: A Brief History of Bad Passwords,” from the 16th annual Southern California Linux Expo (SCALE) to learn the history of bad passcode policies and what we must do instead to secure our data.

During the UpSCALE Lightning Talks hosted by Opensource.com at the 16th annual Southern California Linux Expo (SCALE) in March 2018, eight presenters shared quick takes on interesting open source topics, projects, and ideas. Watch all of the UpSCALE Lightning Talks on the Opensource.com YouTube channel.

Linux video editing, open source ERP systems, Windows apps, password managers, and more

Our biggest hit last week was Don Watkins’ article on why System76 will start making its Linux computers in the U.S. Here’s more of what readers were talking about the week of April 9-15:

  1. Linux computer maker to move manufacturing to the U.S., by Don Watkins
  2. Top 9 open source ERP systems to consider, by Opensource.com
  3. The current state of Linux video editing 2018, by Seth Kenlon
  4. 3 password managers for the Linux command line, by Scott Nesbitt
  5. 3 open source apps for Windows, by Jeff Macharyas
  6. Getting started with Jenkins Pipelines, by Miguel Suarez
  7. Replicate your custom Linux settings with DistroTweaks, by David Spring
  8. How to create LaTeX documents with Emacs, by Sachin Patil
  9. Git turns 13, Linux and SSH commands to know, Python programming, and more, by Rikki Endsley
  10. Build your first Redis Hello World application in Python, by Tague Griffith

Win a year of AdaBox

AdaBox is a US$60-per-quarter service that delivers hand-picked Adafruit products, unique collectibles, and exclusive discounts to your door. Enter our giveaway by Sunday, April 29 at 11:59 p.m. Eastern Time for a chance to win.

Free 2017 Open Source Yearbook download

Our third annual open source community yearbook rounds up the top projects, technologies, and stories from 2017.

Call for articles

We want to see your JavaScript story ideas. Send article proposals, along with brief outlines, to rikki@opensource.com.

Stay up on what’s going on with Opensource.com by subscribing to our highlights newsletter.

Check out the editorial calendar for a preview of what’s ahead. Got a story idea? Send us a proposal!

LISA18 CFP now open

The CFP for LISA18 is open, and Brendan Gregg (Netflix) and I will co-chair this year’s event, which will be held Oct 29-31 in downtown Nashville. Do you have something to say about the present and future of Ops? If so, send in your talk proposal by May 24th. Follow LISA on Twitter to stay updated on deadlines and announcements. If you have questions or feedback, contact us at lisa18chairs@usenix.org.

Using less to view text files at the Linux command line

If there’s one thing you’re sure to find on a Linux system, it’s text files. A lot of them. Readme files, configuration files, documents, and more.

Most of the time, you probably open text files using a text editor. But there is a faster and, I think, better way of reading text files: a utility called less. Standard kit with all Linux distributions (at least the ones I’ve used), less is a command-line text file viewer with some useful features.

Don’t let the fact that it’s a command-line tool scare you. less is very easy to use and has a very shallow learning curve.

Let’s take a look at some of the things that you can do with less.

Getting started

Crack open a terminal window and navigate to a directory containing one or more text files that you want to view. Then run the command less filename, where filename is the name of the file you want to view.

The file takes over your terminal window, and you’ll notice a colon (:) at the bottom of the window. The colon is where you can type any of the internal commands you use with less. More on these in a moment.

Moving around

Chances are that the text file you’re perusing is more than a couple of lines long; it’s probably a page or more. With less, you can move forward in the file in a few ways:

  • Move down a page by pressing the spacebar or the PgDn key
  • Move down one line at a time by pressing the Down arrow key

less also allows you to move backward in a file. To do that, press the PgUp key (to move up a page at a time) or the Up arrow key (to move up one line at a time).

Finding text

If you have a large text file or are trying to find a specific piece of text, you can do that easily in less. To find a word or phrase, press / on your keyboard and type what you want to find.

Note that the search function in less is case-sensitive. Typing “the silence” isn’t the same as typing “The Silence.”

less also highlights the words or phrases you search for. That’s a nice touch that makes it easier for you to scan the text.

You can press n on your keyboard to find the next instance of the word or phrase. Press N (that is, Shift+n) to find the previous instance.

Getting out of there

Once you get to the end of a text file and you’re done viewing it, how do you exit less? That’s easy. Just press q on your keyboard. (You can also press q at any time to leave the program.)

As I mentioned at the beginning of this post, less is easy to use. Once you use it, you’ll wonder how you ever did without it.