Tuesday, February 28, 2023

Reference Server

Phil Eaton made a repository of Barebones UNIX socket servers with this description:

I find myself writing this server in some language every few months. Each time I have to scour the web for a good reference. Use this as a reference to write your own bare server in C or other languages with a UNIX API (Python, OCaml, etc).

Many developers learning network programming will encounter Beej's Guide to Network Programming which uses the sockets API, has been ported to many platforms, and explains the intricacies of making computers talk to each other in this manner.

C

We can take a look at his C implementation of a server that listens on port 15000, accepts client connections, reads up to 1024 bytes which are printed to the screen, then writes hello world back to the client and disconnects them:

#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

int main() {
    int server, client;
    socklen_t addrlen;

    int bufsize = 1024;
    char *buffer = malloc(bufsize);
    struct sockaddr_in address;

    server = socket(AF_INET, SOCK_STREAM, 0);

    address.sin_family = AF_INET;
    address.sin_addr.s_addr = INADDR_ANY;
    address.sin_port = htons(15000);

    bind(server, (struct sockaddr *) &address, sizeof(address));

    while (1) {
        listen(server, 10);
        client = accept(server, (struct sockaddr *) &address, &addrlen);
        recv(client, buffer, bufsize, 0);
        printf("%s\n", buffer);
        write(client, "hello world\n", 12);
        close(client);
    }

    close(server);
    return 0;
}

Factor

A direct Factor translation — without any error checking, like in the original example — using the C library interface might look something like this:

USING: accessors alien.c-types alien.data byte-arrays
classes.struct io io.encodings.string io.encodings.utf8 kernel
sequences unix.ffi unix.types ;

:: reference-server ( -- )
    1024 <byte-array> :> buffer
    AF_INET SOCK_STREAM 0 socket :> server
    sockaddr-in malloc-struct
        AF_INET >>family
        0 >>addr
        15000 htons >>port :> address

    server address sockaddr-in heap-size bind drop

    [
        server 10 listen drop
        server address 0 socklen_t <ref> accept :> client
        client buffer 1024 0 recv
        buffer swap head-slice utf8 decode print flush
        client $[ "hello world\n" >byte-array ]
        dup length unix.ffi:write drop
        client close drop
        t
    ] loop

    server close drop ;

I noticed that some of his examples are more idiomatic to the language, so we could rewrite this using threaded servers — gaining the benefit of working on Windows as well as error handling and logging — using a handler quotation to implement the read/print/write/disconnect logic.

USING: accessors io io.encodings.binary io.encodings.string
io.encodings.utf8 io.servers kernel namespaces ;

: reference-server ( -- )
    binary <threaded-server>
        15000 >>insecure
        [
            1024 read-partial [
                [ utf8 decode print flush ] with-global
                $[ "hello world\n" >byte-array ] io:write flush
            ] when*
        ] >>handler
    start-server wait-for-server ;

This is available on my GitHub.

Sunday, February 26, 2023

Weighted Random

Some time ago, I implemented a way to generate weighted random values from a discrete distribution in Factor. It ended up being a pretty satisfyingly simple word that builds a cumulative probability table, generates a random probability, then searches the table to find which value to return:

: weighted-random ( histogram -- obj )
    unzip cum-sum [ last >float random ] keep bisect-left swap nth ;

Is It Fast?

We can define a simple discrete distribution of values:

CONSTANT: dist H{ { "A" 1 } { "B" 2 } { "C" 3 } { "D" 4 } }

And it seems to work — we can make a few random values from it:

IN: scratchpad dist weighted-random .
"C"
IN: scratchpad dist weighted-random .
"C"
IN: scratchpad dist weighted-random .
"D"
IN: scratchpad dist weighted-random .
"B"

After generating a lot of random values, we can see the histogram matches our distribution:

IN: scratchpad 10,000,000 [ dist weighted-random ] replicate histogram .
H{
    { "A" 998403 }
    { "B" 2000400 }
    { "C" 3001528 }
    { "D" 3999669 }
}

But, how fast is it?

IN: scratchpad [ 10,000,000 [ dist weighted-random ] replicate ] time 
Running time: 3.02998325 seconds

Okay, so it's not that fast... generating around 3.3 million per second on one of my computers.

Improvements

We can make two quick improvements to this:

  1. First, we can factor out the initial step from the random number generation.
  2. Second, we can take advantage of a recent improvement to the random vocabulary, mainly to change the random word that was previously implemented for different types to instead get the current random-generator and then pass it to the random* implementation instead. This allows a few speedups where we can lookup the dynamic variable once and then use it many times.

That results in this definition:

: weighted-randoms ( length histogram -- seq )
    unzip cum-sum swap
    [ [ last >float random-generator get ] keep ] dip
    '[ _ _ random* _ bisect-left _ nth ] replicate ;

That gives us a nice speedup, just over 10 million per second:

IN: scratchpad [ 10,000,000 dist weighted-randoms ] time histogram .
Running time: 0.989039625 seconds

H{
    { "A" 1000088 }
    { "B" 1999445 }
    { "C" 3000688 }
    { "D" 3999779 }
}

That's pretty nice, but it turns out that we can do better.

Vose Alias Method

Keith Schwarz wrote a fascinating blog post about some better algorithms for sampling from a discrete distribution. One of those algorithms is the Vose Alias Method which creates a data structure of items, probabilities, and an alias table that is used to return an alternate choice:

TUPLE: vose
    { n integer }
    { items array }
    { probs array }
    { alias array } ;

We construct a vose tuple by splitting the distribution into items and their probabilities, and then processing the probabilities into lists of small (less than 1) or large (greater than or equal to 1), iteratively aliasing the index of smaller items to larger items.

:: <vose> ( dist -- vose )
    V{ } clone :> small
    V{ } clone :> large
    dist assoc-size :> n
    n f <array> :> alias

    dist unzip dup [ length ] [ sum ] bi / v*n :> ( items probs )
    probs [ swap 1 < small large ? push ] each-index

    [ small empty? not large empty? not and ] [
        small pop :> s
        large pop :> l
        l s alias set-nth
        l dup probs [ s probs nth + 1 - dup ] change-nth
        1 < small large ? push
    ] while

    1 large [ probs set-nth ] with each
    1 small [ probs set-nth ] with each

    n items probs alias vose boa ;

We can implement the random* generic to select a random item from the vose tuple — choosing a random item index, check it's probability against a random number between 0.0 and 1.0, and if it is over a threshold we return the aliased item instead:

M:: vose random* ( obj rnd -- elt )
    obj n>> rnd random*
    dup obj probs>> nth rnd (random-unit) >=
    [ obj alias>> nth ] unless
    obj items>> nth ;

It's much faster, over 14.4 million per second:

IN: scratchpad [ 10,000,000 dist <vose> randoms ] time 
Running time: 0.693588458 seconds

This is available now in the math.extras vocabulary in the current development version, along with a few tweaks that brings the performance over 21.7 million per second...

Wednesday, February 15, 2023

DuckDuckGo

The conversation around the current quality of web search engines, the doomsday prediction about various incumbents, and the equal parts inspiring and challenging rollout of large language models to improve search has been fascinating to watch. There are many challengers in the search engine space including companies like Kagi and Neeva among many search engine startups. One privacy-focused startup that has been fun to follow for awhile has been DuckDuckGo.

You can see an example of the DuckDuckGo API that is available on api.duckduckgo.com. This does not provide access to their full search results, but instead provides access to their instant answers. Regardless, I thought it would be neat if we could use this from Factor.

We can take a search query and turn it into a URL object:

: duckduckgo-url ( query -- url )
    URL" http://api.duckduckgo.com"
        swap "q" set-query-param
        "json" "format" set-query-param
        "1" "pretty" set-query-param
        "1" "no_redirect" set-query-param
        "1" "no_html" set-query-param
        "1" "skip_disambig" set-query-param ;

Using the http.client vocabulary and the json vocabulary we can retrieve a result set:

: duckduckgo ( query -- results )
    duckduckgo-url http-get nip utf8 decode json> ;

We can make a word that prints out the abstract response with clickable links:

: abstract. ( results -- )
    dup "Heading" of [ drop ] [
        swap {
            [ "AbstractURL" of >url write-object nl ]
            [ "AbstractText" of print ]
            [ "AbstractSource" of "- " write print ]
        } cleave nl
    ] if-empty ;

And then a word that prints out a result response, parsing the HTML using the html.parser vocabulary and output as text using the html.parser.printer vocabulary:

: result. ( result -- )
    "Result" of [
        "<a href=\"" ?head drop "\">" split1 "</a>" split1
        [ swap >url write-object ]
        [ parse-html html-text. nl ] bi*
    ] when* ;

There are more aspects to the response from the API, but we can initially print out the abstract, the results, and the related topics:

: duckduckgo. ( query -- )
    duckduckgo {
        [ abstract. ]
        [ "Results" of [ result. ] each ]
        [ "RelatedTopics" of [ result. ] each ]
    } cleave ;

We can try it out on a topic that this particular blog likes to discuss:

IN: scratchpad "factorcode" duckduckgo.
Factor (programming language)
Factor is a stack-oriented programming language created by Slava
Pestov. Factor is dynamically typed and has automatic memory
management, as well as powerful metaprogramming features. The
language has a single implementation featuring a self-hosted
optimizing compiler and an interactive development environment.
The Factor distribution includes a large standard library.
- Wikipedia

Official site - Factor (programming language)
Concatenative programming languages
Stack-oriented programming languages
Extensible syntax programming languages
Function-level languages
High-level programming languages
Programming languages
Software using the BSD license

This is available on my GitHub.

Sunday, February 12, 2023

Magic

Ever wonder what the type of a particular binary file is? Or wonder how a program knows that a particular binary file is in a compatible file format? One way is to look at the magic number used by the file format in question. You can see some examples in a list of file signatures.

The libmagic library commonly supports the file command on Unix systems, other than Apple macOS which has its own implementation, and uses magic numbers and other techniques to identify file types. You can see how it works through a few examples:

$ file vm/factor.hpp
vm/factor.hpp: C++ source text, ASCII text

$ file Factor.app/Contents/Info.plist 
Factor.app/Contents/Info.plist: XML  document text

$ file factor
factor: Mach-O 64-bit executable x86_64

$ file factor.image
factor.image: data

Wrapping the C library

I am going to show how to wrap a C library using the alien vocabulary which provides an FFI capability in Factor. The man pages for libmagic show us some of the functions available in magic.h.

The libmagic library needs to be made available to the Factor instance:

"magic" {
    { [ os macosx? ] [ "libmagic.dylib" ] }
    { [ os unix? ] [ "libmagic.so" ] }
} cond cdecl add-library 

We start by defining an opaque type for magic_t:

TYPEDEF: void* magic_t

Some functions are available for opening, loading, and then closing the magic_t:

FUNCTION: magic_t magic_open ( int flags )

FUNCTION: int magic_load ( magic_t magic, c-string path )

FUNCTION: void magic_close ( magic_t magic )

It is convenient to wrap the close function as a destructor for use in a with-destructors form.

DESTRUCTOR: magic_close

A function that "returns a textual description of the contents of the filename argument", which gives us the file command ability above:

FUNCTION: c-string magic_file ( magic_t magic, c-string path )

That should be everything we need to continue...

Using the C library

Now that we have the raw C library made available as Factor words, we can create a simpler interface by wrapping some of the words into a simple word that guesses the type of a file:

: guess-file ( path -- result )
    [
        normalize-path
        0 magic_open &magic_close
        [ f magic_load drop ]
        [ swap magic_file ] bi
    ] with-destructors ;

And we can then try it on a few files:

IN: scratchpad "vm/factor.hpp" guess-file .
"C++ source, ASCII text"

IN: scratchpad "Factor.app/Contents/Info.plist" guess-file .
"XML 1.0 document, Unicode text, UTF-8 text"

IN: scratchpad "factor" guess-file .
"symbolic link to Factor.app/Contents/MacOS/factor"

IN: scratchpad "factor.image" guess-file .
"data"

This has been available for awhile in the magic vocabulary with improved error checking and some options to guess the MIME type of files.

Wednesday, February 8, 2023

Hipku

Once upon a time, there was a Javascript project called Hipku. The original post that described it was lost somewhere in the series of tubes, but thankfully the "full documentation and a working demo" was saved by the Wayback Machine. It is also still available on npm for installation.

Hipku is a small javascript library to encode IP addresses as haiku. It can express any IPv4 or IPv6 address as a Western-style 5/7/5 syllable haiku.

An implementation in Python was created called PyHipku. It is still available on PyPi for installation, but the website associated with it was also lost to history and not even the Great Wayback Machine seems able to recover it. I think of programming as aspiring to a kind of poetic result — and wonder what kind of language could run Waka Waka Bang Splat — well, the haiku style caught my interest, so I ported the hipku algorithm to the Factor programming language.

At it's core, we encode an IPv4 address or IPv6 address into a series of numerical values and then make a poem by looking up each word from a word list. Some symbols are defined to help us know to start a sentence with an uppercase letter or end a sentence with a period:

SYMBOLS: Octet octet octet. ;

For example, an IPv4 key specifies the word lists to use for each octet and an IPv4 schema specify how the octets form into a hipku — an f indicates a newline:

CONSTANT: ipv4-key ${
    animal-adjectives animal-colors animal-nouns animal-verbs
    nature-adjectives nature-nouns plant-nouns plant-verbs
}

CONSTANT: ipv4-schema ${
    "The" octet octet octet f
    octet "in the" octet octet. f
    Octet octet.
}

To create the hipku, we iterate across the key, choosing words numerically by looking up the octet value, and then composing them into the ordering specified by the schema.

You can see a couple examples below:

IN: scratchpad "127.0.0.1" >hipku print
The hungry white ape
aches in the ancient canyon.
Autumn colors crunch.

IN: scratchpad "2001:db8:3333:4444:5555:6666:7777:8888" >hipku print
Chilled apes and blunt seas
clap dear firm firm grim grim gnomes.
Large holes grasp pained mares.

We support both encoding into a hipku as well as decoding back into an IPv4/IPv6 address. This is available as the hipku vocabulary in a recent nightly build.

Tuesday, February 7, 2023

Proquint

A few days ago, Ciprian Dorin Craciun wrote a binary to text encoding blog post about the "state of the art and missed opportunities" in various encoding schemes. In that post, I was introduced to the Proquint encoding which stands for "PRO-nouncable QUINT-uplets".

In the Factor programming language, we have enjoyed implementing many encoding/decoding methods including: base16, base24, base32, base32hex, base32-crockford, base36, base58, base62, base64, base85, base91, uu, and many others. I thought it would be fun to add a quick implementation of Proquint.

Like other encodings, it makes use of an alphabet — grouped as consonants and vowels:

CONSTANT: consonant "bdfghjklmnprstvz"

CONSTANT: vowel "aiou"

Numbers are grouped into 5-character blocks representing a 16-bit number, with alternating consonants representing 4 bits and vowels representing 2 bits:

: >quint16 ( m -- str )
    5 [
        even? [
            [ -4 shift ] [ 4 bits consonant nth ] bi
        ] [
            [ -2 shift ] [ 2 bits vowel nth ] bi
        ] if
    ] "" map-integers-as reverse nip ;

Encoding a 32-bit number is made by joining two 16-bit blocks:

: >quint32 ( m -- str )
    [ -16 shift ] keep [ 16 bits >quint16 ] bi@ "-" glue ;

Decoding numbers looks up each consonant or vowel, skipping separators:

: quint> ( str -- m )
    0 [
        dup $[ consonant alphabet-inverse ] nth [
            nip [ 4 shift ] [ + ] bi*
        ] [
            dup $[ vowel alphabet-inverse ] nth [
                nip [ 2 shift ] [ + ] bi*
            ] [
                CHAR: - assert=
            ] if*
        ] if*
    ] reduce ;

We can use this to make a random password that might be more memorable — but perhaps more secure if using more random-bits:

: quint-password ( -- quint )
    32 random-bits >quint32 ;

And we could use our ip-parser vocabulary to make IPv4 addresses more memorable:

: ipv4>quint ( ipv4 -- str )
    ipv4-aton >quint32 ;

: quint>ipv4 ( str -- ipv4 )
    quint> ipv4-ntoa ;

You can see how this might work by building a test suite to show roundtrips work:

{ t } [
    {
        { "127.0.0.1"       "lusab-babad" }
        { "63.84.220.193"   "gutih-tugad" }
        { "63.118.7.35"     "gutuk-bisog" }
        { "140.98.193.141"  "mudof-sakat" }
        { "64.255.6.200"    "haguz-biram" }
        { "128.30.52.45"    "mabiv-gibot" }
        { "147.67.119.2"    "natag-lisaf" }
        { "212.58.253.68"   "tibup-zujah" }
        { "216.35.68.215"   "tobog-higil" }
        { "216.68.232.21"   "todah-vobij" }
        { "198.81.129.136"  "sinid-makam" }
        { "12.110.110.204"  "budov-kuras" }
    } [
        [ quint>ipv4 = ] [ swap ipv4>quint = ] 2bi and
    ] assoc-all?
] unit-test

This is now available as the proquint vocabulary in a recent nightly build.