Marill

Marill -- Automated Site Testing Utility

Marill is an automated site testing utility meant to make administrators’ lives easier by taking much of the legwork out of testing. It’s intended to be lightweight, flexible, and easy to use, while still being powerful.

Goal
Features
Limitations
How does it work?
- Examples
Installation
Usage
Getting started
Frequently Asked Questions
Contributing

Goal

Often, during server administration, migrations, and large server changes, things can and will go wrong. Servers are complex systems with many moving parts, and with that comes a lot of breakage.

Creating an automated site testing utility, like Marill, allows:

Less human interaction to test sites.
Integration and flexibility to be built into other systems.
Clients to be at ease; you know they won’t test all of their sites.
Administrators or developers to hate you less.
You to be witty and say “time is money!”.

Features

Disclaimer: Marill is still in early development, and this list is subject to change drastically. (code, libraries, tools, cli-args, etc)

Cross platform: Marill can compile across many platforms: Linux, NetBSD, OpenBSD, FreeBSD, and more. (Windows too, possibly!)
Configurable output. Only output what you need.
Many cli flags to configure input, output, what is tested, what isn’t, etc.
Ability to test cPanel based servers, Apache, Nginx (coming soon!), and others! (any can be scanned with --domains)
Flexible testing system. You can even write your own tests! Load them from a URL in JSON format, or from a directory! (see marill/tests)

Limitations

There are a few limitations with Marill, due to how the utility was developed. Marill was meant to be lightweight, and portable. This means it cannot work exactly like a normal browser. Below are a few examples:

Marill currently isn’t able to take a screenshot of the site. However, there are many external resources for this. (use case: pixel-by-pixel comparison, to easily tell if CSS is broken)
Marill can’t execute JavaScript. If your site relies heavily on JavaScript, this tool may not be the best fit. (However, a site’s JavaScript rarely breaks during a migration or move unless a resource fails to load, which Marill should catch.)
Marill cannot and will not load certain resources, e.g. videos, iframes, embeds, ftp links, etc. This would make crawling the site very complex and convoluted. (However, embed plugins within things like WordPress could possibly be caught if a test were written to search for bad tags.)
Marill cannot search through webserver, PHP, or other misc. logs to determine what the issue may be. This will likely never change, because adding this functionality would make the utility fragile and clunky. If there is an error, you should be able to find out what is causing it.

How does it work?

The general idea is that you place Marill on the server you would like to test. By default, Marill will then figure out the list of domains that server is hosting. It then acts much like a browser, crawling each site (and all resources like images/css/javascript/etc. if --assets is used), and passes each resource it fetches through the list of built-in or external tests. Each domain is given a starting score of 10, and each test has a pre-defined weight. If a test matches, its weight is applied to the domain’s score. If the score falls below the minimum configured score, the domain is considered failed.
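
The scoring described above can be sketched roughly as follows. This is a minimal illustration in Python, not Marill's actual implementation: the check names and weights are hypothetical, and it assumes that the weight of each matching test is simply added to the starting score of 10 (so negative weights lower it), with the default minimum of 8 taken from --min-score.

```python
from dataclasses import dataclass

STARTING_SCORE = 10.0  # each domain starts at 10
MIN_SCORE = 8.0        # default value of --min-score


@dataclass
class CheckResult:
    name: str      # hypothetical test name, for illustration only
    weight: float  # assumed: negative weights lower the score
    matched: bool  # whether the test matched a fetched resource


def final_score(results, start=STARTING_SCORE):
    """Apply the weight of every matched test to the starting score."""
    return start + sum(r.weight for r in results if r.matched)


def is_failed(score, min_score=MIN_SCORE):
    """A domain fails once its score drops below the configured minimum."""
    return score < min_score


results = [
    CheckResult("php-warning-in-body", -3.0, matched=True),
    CheckResult("blank-page", -10.0, matched=False),
    CheckResult("slow-response", -1.0, matched=True),
]
score = final_score(results)
print(score, is_failed(score))
```

In this sketch, two matching tests pull the score from 10 down to 6, which is below the minimum of 8, so the domain would be reported as failed.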

Examples

Here are a few examples of tests that are useful:

Visual PHP errors on the page. For example: Warning: Invalid argument supplied for function() in /path/to/some/file.php
Invalid status codes. For example: Forbidden, Internal Server Error, Payment Required
Blank pages generated by PHP (common if PHP has display_errors disabled)
MySQL or PostgreSQL related errors.
cPanel “Sorry!” related pages (common if the incorrect IP is configured for example).
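
As a rough illustration of checks like the ones listed above, the sketch below flags a response by its status code and by patterns in its body. The function name, the regular expression, and the returned labels are all hypothetical; Marill's real tests are defined in its own test format (see marill/tests), which this does not attempt to reproduce.

```python
import re

# Hypothetical pattern for PHP messages rendered into a page body.
PHP_ERROR = re.compile(r"(Warning|Notice|Fatal error|Parse error): .+ in /\S+\.php")


def find_problems(status, body):
    """Return a list of illustrative problem labels for one HTTP response."""
    problems = []
    if status >= 400:
        problems.append(f"bad-status-{status}")
    if PHP_ERROR.search(body):
        problems.append("php-error")
    if status == 200 and not body.strip():
        problems.append("blank-page")
    return problems


body = "Warning: Invalid argument supplied for function() in /path/to/some/file.php"
print(find_problems(200, body))
print(find_problems(500, ""))
print(find_problems(200, "   "))
```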

Example running from my workstation (though, this would be best suited running from the server itself):

Installation

Check out the releases page for prebuilt versions. Example install commands for some of the more popular OSes/distros are provided below; more builds are available on the releases page.

Ubuntu/Debian

$ wget https://liam.sh/ghr/marill_0.1.1_linux_amd64.deb
$ dpkg -i marill_0.1.1_linux_amd64.deb

CentOS/Redhat

$ yum localinstall https://liam.sh/ghr/marill_0.1.1_linux_amd64.rpm

Manual Install

$ wget https://liam.sh/ghr/marill_0.1.1_linux_amd64.tar.gz
$ tar -C /usr/bin/ -xzvf marill_0.1.1_linux_amd64.tar.gz marill
$ chmod +x /usr/bin/marill

Build From Source

Dependencies (to build from source only):

Go (1.9 or greater, though latest preferred). Ensure your $GOPATH is set up.

# you can "git clone" the repo too, just make sure it's following this directory
# structure.
$ go get -d -u github.com/lrstanley/marill
$ cd $GOPATH/src/github.com/lrstanley/marill
# this will show you all of the available options (to fetch dependencies,
# run in debug mode, etc.)
$ make help
$ make
$ ./marill --help

Usage

This is very likely to change quite a bit until we’re out of beta. Please use wisely.

$ marill --help
NAME:
   marill - Automated website testing utility

USAGE:
   marill [global options] command [command options] [arguments...]

VERSION:
   git revision XXXXXX

AUTHOR(S):
   Liam Stanley <[email protected]>

COMMANDS:
     scan           [DEFAULT] Start scan for all domains on server
     urls, domains  Print the list of urls as if they were going to be scanned
     tests          Print the list of tests that are loaded and would be used
     help, h        Shows a list of commands or help for one command

GLOBAL OPTIONS:
   -d, --debug              Print debugging information to stdout
   -q, --quiet              Do not print regular stdout messages
   --no-color               Do not print with color
   --no-banner              Do not print the colorful banner
   --show-warnings          Show a warning if one or more test failed, even if it didn't drop below min-score
   --exit-on-fail           Send exit code 1 if any domains fail tests
   --log FILE               Log information to FILE
   --debug-log FILE         Log debugging information to FILE
   --result-file FILE       Dump result template into FILE (will overwrite!)
   --no-updates             Don't check to see if there are updates
   --threads n              Use n threads to fetch data (0 defaults to server cores/2) (default: 0)
   --delay DURATION         Delay DURATION before each resource is crawled (e.g. 5s, 1m, 100ms) (default: 0s)
   --http-timeout DURATION  DURATION before an http request is timed out (e.g. 5s, 10s, 1m) (default: 10s)
   --domains DOMAIN:IP ...  Manually specify list of domains to scan in form: DOMAIN:IP ..., or DOMAIN:IP:PORT
   --min-score value        Minimum score for domain (default: 8)
   -a, --assets             Crawl assets (css/js/images) for each page
   --ignore-success         Only print results if they are considered failed
   --allow-insecure         Don't check to see if an SSL certificate is valid
   --tmpl value             Golang text/template string template for use with formatting scan output
   --json PATH              Optional PATH to output json results to
   --json-pretty            Used with [--json], pretty-prints the output json
   --ignore-http            Ignore http-based URLs during domain search
   --ignore-https           Ignore https-based URLs during domain search
   --ignore-remote          Ignore all resources that resolve to a remote IP (use with --assets)
   --ignore-domains GLOB    Ignore URLS during domain search that match GLOB, pipe separated list
   --match-domains GLOB     Allow URLS during domain search that match GLOB, pipe separated list
   --ignore-test GLOB       Ignore tests that match GLOB, pipe separated list
   --match-test GLOB        Allow tests that match GLOB, pipe separated list
   --tests-url URL          Import tests from a specified URL
   --tests-path PATH        Import tests from a specified file-system PATH
   --ignore-std-tests       Ignores all built-in tests (useful with --tests-url)
   --pass-text GLOB         Give sites a +10 score if body matches GLOB
   --fail-text GLOB         Give sites a -10 score if body matches GLOB
   --help, -h               show help
   --version, -v            print the version

COPYRIGHT:
   (c) 2016 Liam Stanley

Getting Started

Getting started with Marill should be fairly easy. Since Marill is a single binary, there are no dependencies that are needed for the utility to run.

Head to this page and download the top item in the list. For example, using the latest version:

$ wget -q -O- https://release.liam.sh/marill/latest.tar.gz | tar -zx -C /root/tmp/

You should now see a file named marill in /root/tmp/. Feel free to look over the current flags and arguments:

$ /root/tmp/marill --help

The main arguments that may be useful are:

-a or --assets: This will fetch all of the assets for the page (css/javascript/images, etc)
-d or --debug: This will enable debugging. It doesn’t provide a whole lot more information, but can help if something isn’t working.
--delay: Utilize this if the load caused by the crawling is too high. E.g. --delay 10s.
--threads: This is the number of parallel scans that will run at a single time. By default it will be half the number of cores on the server.
--ignore-domains and --match-domains: utilize these to skip or only scan certain domains during the crawl. E.g. --ignore-domains "*domain.com|someotherdomain.com"
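
To get a feel for how a pipe-separated glob list selects domains, here is a minimal sketch. It assumes shell-style globs matched against the whole domain and uses Python's fnmatch for that; Marill's own glob semantics may differ, and the function names are hypothetical.

```python
from fnmatch import fnmatchcase


def matches_any(domain, patterns):
    """True if domain matches at least one glob in a pipe-separated list."""
    return any(fnmatchcase(domain, p) for p in patterns.split("|") if p)


def filter_domains(domains, ignore="", match=""):
    """Keep domains that match `match` (if given) and do not match `ignore`."""
    kept = []
    for d in domains:
        if match and not matches_any(d, match):
            continue
        if ignore and matches_any(d, ignore):
            continue
        kept.append(d)
    return kept


domains = ["domain.com", "www.domain.com", "someotherdomain.com", "example.org"]
print(filter_domains(domains, ignore="*domain.com|someotherdomain.com"))
```

Note that under these assumptions a leading `*` also matches longer names ending in the same text, so `*domain.com` covers `someotherdomain.com` as well.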

So, for example, to start off with:

$ /root/tmp/marill -a

cPanel/Apache based servers

Marill has out of the box support for cPanel based servers (through things like /var/cpanel/users/<user> and /var/cpanel/userdata/<domain>).

For Apache, Marill will find the current running httpd instance, and run <binary> -S, which pulls information about all virtual host entries. Note that this isn’t supported on all Apache versions (see here for more information).
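
For a sense of what can be extracted from the `-S` output, the sketch below pulls host names and ports out of a sample. The sample text is only illustrative (the real output varies by Apache version and configuration), and the parsing shown here is an assumption for demonstration, not Marill's actual logic.

```python
import re

# Illustrative sample of what `httpd -S` may print; not taken from a real server.
SAMPLE = """\
VirtualHost configuration:
*:80                   is a NameVirtualHost
         default server example.com (/etc/httpd/conf/httpd.conf:10)
         port 80 namevhost example.com (/etc/httpd/conf/httpd.conf:10)
                 alias www.example.com
         port 80 namevhost blog.example.com (/etc/httpd/conf/httpd.conf:20)
*:443                  is a NameVirtualHost
         port 443 namevhost example.com (/etc/httpd/conf/httpd.conf:30)
"""


def parse_vhosts(output):
    """Collect (hostname, port) pairs from name-based vhost and alias lines."""
    hosts = []
    port = None  # port of the most recent "namevhost" line
    for line in output.splitlines():
        m = re.match(r"\s*port (\d+) namevhost (\S+)", line)
        if m:
            port = int(m.group(1))
            hosts.append((m.group(2), port))
            continue
        m = re.match(r"\s*alias (\S+)", line)
        if m and port is not None:
            hosts.append((m.group(1), port))
    return hosts


for host, port in parse_vhosts(SAMPLE):
    print(host, port)
```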

Alternatives (Nginx, Caddy, etc)

If your web server does not match the above description, you can utilize the manual domain list flag of Marill. The current syntax for this is as follows:

$ marill --domains "<items>"

Replace <items> with one of the following list of inputs:

DOMAIN:IP:PORT
DOMAIN:IP
DOMAIN:PORT
DOMAIN

DOMAIN can take any of the following forms, for example:

domain.com
www.domain.com
random.subdomain.domain.com
http://some-example.com/
https://some-example.com/some-login.php

So, to put it all together, you can do something like:

$ marill -a --domains "somedomain.com:443 domain.com:1234 example.com:123.456.7.89:80 https://domain.com/"
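
The input forms above can be told apart with a small parser. The following is a sketch under stated assumptions, not Marill's actual parsing: it treats an all-digit field as a port and anything else as an IP, does not handle IPv6 addresses, and leaves the scheme empty when none is given. The `Target` type and function names are hypothetical.

```python
from typing import NamedTuple, Optional


class Target(NamedTuple):
    scheme: str          # "" when the item had no http:// or https:// prefix
    host_path: str       # domain, optionally followed by a path
    ip: Optional[str]    # custom IP to connect to, if given
    port: Optional[int]  # custom port, if given


def parse_item(item):
    """Parse one DOMAIN, DOMAIN:IP, DOMAIN:PORT, or DOMAIN:IP:PORT item."""
    scheme, rest = "", item
    if "://" in item:
        scheme, rest = item.split("://", 1)
    parts = rest.split(":")
    if len(parts) > 3:
        raise ValueError(f"too many ':' separated fields: {item!r}")
    ip = port = None
    if len(parts) == 3:            # DOMAIN:IP:PORT
        ip, port = parts[1], int(parts[2])
    elif len(parts) == 2:          # DOMAIN:IP or DOMAIN:PORT
        if parts[1].isdigit():
            port = int(parts[1])
        else:
            ip = parts[1]
    return Target(scheme, parts[0], ip, port)


def parse_domains(value):
    """Split a space-separated --domains value into parsed targets."""
    return [parse_item(item) for item in value.split()]


for target in parse_domains("somedomain.com:443 example.com:1.2.3.4:80 https://domain.com/"):
    print(target)
```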

Things to note/Troubleshooting

If there are any problems or bugs, PLEASE LET ME KNOW! You can submit bugs here if you have a GitHub account, or here if you do not.

FAQ

Will it cause high load?
- Marill was written with servers under maintenance, or new servers being migrated to, in mind. That said, Marill does run scans in parallel: by default, the number of concurrent crawls is the number of cores divided by 2 (8-core server: 4 concurrent crawls; 2-core server: 1 crawl at a time). If Marill still causes too much load, you can utilize --delay and --threads.
How long does Marill take to crawl sites (e.g. 1,000 sites on a server)?
- On a cPanel server, note that along with the input (default http) version of a domain, the https version of the site will be scanned as well if cPanel has a certificate for it. Furthermore, Marill will also attempt to crawl www.domain.com, not just domain.com. For other web servers, it all depends on the input. Please note that Marill will take longer when using --assets (-a), because this fetches all resources for each site being crawled. If you would like Marill to crawl faster, don’t use -a.
- Generally speaking, crawling without -a is fairly fast.
Is it better to run Marill from inside of the server, or from a remote location?
- Running remotely ensures there are no IP or firewall related issues. However, if you are crawling quite a few sites, many servers may assume, due to the high connection count, that your connections are malicious.
- If run from inside of the server, Marill can scan and determine what the server is hosting (by checking Apache, cPanel, etc).
Can I give Marill a custom IP address for which to crawl a site (before it goes live and DNS is updated)?
- Yes! For example, rather than --domains "domain.com domain2.com", you can do something like:
```
$ marill --domains "domain.com:1.2.3.4 domain2.com:2.3.4.5"
```
- Also note that you can run scans on alternate ports:
```
$ marill --domains "domain.com:1.2.3.4:8080 domain.com:9000"
```
Can I give Marill a custom port for which to crawl a site?
- Yes! See the previous question.

Can Marill crawl sub-domains and sub-folders?

Yes! You can pass any url into --domains as necessary. For example:

$ marill --domains "https://domain.com/sub/folder/some-page"

or with a custom ip as well:

$ marill --domains "https://domain.com/sub/folder/some-page:1.2.3.4"

Contributing

Please review the CONTRIBUTING doc for guidance on submitting issues and pull requests, and on helping out.

License

LICENSE: The MIT License (MIT)
Copyright (c) 2016 Liam Stanley <[email protected]>

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.