Thursday, April 27, 2023

Migrating...

I am migrating this blog to a new site available on the domain re.factorcode.org.

While I do joke a lot about all of the Killed by Google apps, products, and features... it seems inevitable that at some point Google will kill the Blogger product and I wanted more control of this website.

For anyone curious, the new Re: Factor is built currently on Hugo the "world’s fastest framework for building websites" and water.css semantic CSS framework. It is a lot simpler to use than the Blogger interface, plus a bit faster since it's just a simple static website.

This blog will remain as a historical archive, but new posts will be on the new website...

Tuesday, April 25, 2023

OpenAI

It’s been pretty hard to avoid all of the incredible stories about artificial intelligence over the past few months. There seem to be incredible applications to the area of generative AI occurring on a daily basis. Image generation with Midjourney is pretty next-level. Code generation using GitHub Copilot seems pretty amazing. Interacting with large language models like GPT-4 or Bard or Bing Chat or Facebook LLaMA or StableLM and so many others seems like science fiction. Audio models like Whisper used for audio transcription even make popular audio assistants look pretty dated.

With all of the hype, it seemed inevitable that Factor would gain some kind of AI functionality.

Despite the non-profit vs for-profit controversy of OpenAI, they do seem to have a momentary lead in the race to make tools that others can build upon. One of those tools is the OpenAI API, which is made available using JSON and HTTP. Besides the OpenAI API Reference, there is an OpenAI Cookbook and popular libraries such as OpenAI Python for building systems using it.

Recently, I contributed the openai vocabulary which allows using all the methods currently made available by OpenAI. You will need an OpenAI API Key.

Once you obtain that, you can set it in the listener.

IN: scratchpad USE: openai

IN: scratchpad "sk-....................." openai-api-key set-global

And now you have enough to try it out…

IN: scratchpad "text-davinci-003"
               "what is the factor programming language"
               <completion> 100 >>max_tokens create-completion
               "choices" of first "text" of print

Factor is a stack-oriented programming language, designed for creating
flexible, reusable software components. It combines elements from both
object-oriented and functional programming, and provides powerful features,
including static typing and static type checking, an interactive program
development environment, built-in automated testing, and a wide range of
built-in data types. The language is designed to be easy to use, yet provide
a high degree of flexibility.

Cool!

We even have a Discord bot using OpenAI that answers on the Factor Discord server.

Tuesday, March 7, 2023

ASCII Table PDF

Vasudev Ram has a blog with many different posts about various programming topics including Python, Linux, SQL, and PDFs. On the topic of PDF generation, they have a blog post about making an ASCII Table to PDF with xtopdf.

Recently, I had the need for an ASCII table lookup, which I searched for and found, thanks to the folks here:

www.ascii-code.com

That gave me the idea of writing a simple program to generate an ASCII table in PDF. Here is the code for a part of that table - the first 32 (0 to 31) ASCII characters, which are the control characters:

It might not be widely known, but Factor has built-in support for writing to PDF Streams using the formatted output protocol. This supports text styles including changing font names, bold and italic styles, foreground and background colors, etc.

We start by defining the symbols and descriptions of the first 32 ASCII characters. These are all non-printable control character, which is why we use this array of strings to render them in a table.

CONSTANT: ASCII {
    "NUL Null char"
    "SOH Start of Heading"
    "STX Start of Text"
    "ETX End of Text"
    "EOT End of Transmission"
    "ENQ Enquiry"
    "ACK Acknowledgment"
    "BEL Bell"
    "BS Back Space"
    "HT Horizontal Tab"
    "LF Line Feed"
    "VT Vertical Tab"
    "FF Form Feed"
    "CR Carriage Return"
    "SO Shift Out / X-On"
    "SI Shift In / X-Off"
    "DLE Data Line Escape"
    "DC1 Device Control 1 (oft. XON)"
    "DC2 Device Control 2"
    "DC3 Device Control 3 (oft. XOFF)"
    "DC4 Device Control 4"
    "NAK Negative Acknowledgement"
    "SYN Synchronous Idle"
    "ETB End of Transmit Block"
    "CAN Cancel"
    "EM End of Medium"
    "SUB Substitute"
    "ESC Escape"
    "FS File Separator"
    "GS Group Separator"
    "RS Record Separator"
    "US Unit Separator"
}

The core printing logic is a header, followed by rows for each character, formatted into a table of decimal, octal, hexadecimal, and binary values along with their symbol and description from the array above:

: ascii. ( -- )
    "ASCII Control Characters - 0 to 31" print nl
    ASCII [
        1 + swap [
            {
                [ >dec ]
                [ >oct 3 CHAR: 0 pad-head ]
                [ >hex 2 CHAR: 0 pad-head ]
                [ >bin 8 CHAR: 0 pad-head ]
            } cleave
        ] dip " " split1 6 narray
    ] map-index {
        "DEC" "OCT" "HEX" "BIN" "Symbol" "Description"
    } prefix format-table unclip
    H{ { font-style bold } } format nl
    [ print ] each ;

Since the UI listener supports formatted streams, you can see it from the listener:

Outputting this to a PDF file is now easy. We make sure to set the font to monospace and then run ascii. with our PDF writer, saving the generated PDF output into a file.

: ascii-pdf ( path -- )
    [
        H{ { font-name "monospace" } } [ ascii. ] with-style
    ] with-pdf-writer pdf>string swap utf8 set-file-contents ;

We also support writing to HTML streams in a similar manner, so it would be pretty easy to create an ascii-html word to output an HTML file with the same printing logic above but instead using our HTML writer.

Friday, March 3, 2023

Short UUID

The shortuuid project is a “simple python library that generates concise, unambiguous, URL-safe UUIDs”. I thought it would be a fun exercise to implement this in Factor.

What is a “short UUID”?

You can read the original announcement, but basically it is a string representation of a number using a reduced alphabet that can be used in places like URLs where conciseness is desirable. The author mentions that it provides security by “not divulging information (such as how many rows there are in that particular table, the time difference between one item and the next, etc.)”. However, I think it is more security through obscurity than real security.

In any event, the alphabet used are these 57 characters:

CONSTANT: alphabet
"23456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

We encode a numeric input by repeatedly “divmod”, indexing into an alphabet, until exhausted.

: encode-uuid ( uuid -- shortuuid )
    [ dup 0 > ] [
        alphabet [ length /mod ] [ nth ] bi
    ] "" produce-as nip reverse ;

We decode using a reverse process, looking up the position of each character in the alphabet, re-assembling the numeric input for each character in the shortuuid.

: decode-uuid ( shortuuid -- uuid )
    0 [
        alphabet index [ alphabet length * ] dip +
    ] reduce ;

This is available on my GitHub, including features to deal with legacy values generated before version 1.0.0 as well as supporting different alphabets being used.

Wednesday, March 1, 2023

Geo Timezones

Brad Fitzpatrick wrote a Go package called latlong which efficiently maps a latitude/longitude to a timezone. The original post describing it was on Google+ and is likely lost forever — unless it made it into the Google+ archive before Google+ joined the Google Graveyard.

It tries to have a small binary size (~360 KB), low memory footprint (~1 MB), and incredibly fast lookups (~0.5 microseconds). It does not try to be perfectly accurate when very close to borders.

It’s available in other languages, too!

Huon Wilson ported the library to the Rust Programming Language, making the code available on GitHub and installable via Cargo. There is even a wrapper made for NodeJs that is installable via NPM that uses a command-line executable written in Go.

When it was announced in 2015, I had ported the library to Factor, but missed the opportunity to blog about it. Below we discuss some details about the implementation, starting with its use of a shapefile of the TZ timezones of the world to divide the world into zones that are assigned timezone values — looking something like this:

The world is divided into 6 zoom levels of tiles (represented by a key and an index value) that allow us to search from a very large area first, then down to the more specific geographic area. Note: we represent the struct as a big endian struct with structure packing to minimize wasted space in the files.

The zoom levels are then cached using literal syntax into a zoom-levels constant.

BE-PACKED-STRUCT: tile
    { key uint }
    { idx ushort } ;

SPECIALIZED-ARRAY: tile

CONSTANT: zoom-levels $[
    6 <iota> [
        number>string
        "vocab:geo-tz/zoom" ".dat" surround
        binary file-contents tile cast-array
    ] map
]

Each of the zoom levels reference indexes into a leaves data structure that contains 14,110 items — each represented by one of three data types:

  1. Type S is a string.
  2. Type 2 is a one bit tile.
  3. Type P is a pixmap thats 128 bytes long.

These we load and cache into a unique-leaves constant.

CONSTANT: #leaves 14110

BE-PACKED-STRUCT: one-bit-tile
    { idx0 ushort }
    { idx1 ushort }
    { bits ulonglong } ;

CONSTANT: unique-leaves $[
    "vocab:geo-tz/leaves.dat" binary [
        #leaves [
            read1 {
                { CHAR: S [ { 0 } read-until drop utf8 decode ] }
                { CHAR: 2 [ one-bit-tile read-struct ] }
                { CHAR: P [ 128 read ] }
            } case
        ] replicate
    ] with-file-reader
]

The core logic involves looking up a leaf (which is one of three types, loaded above), given an (x, y) coordinate. If it is a string type, we are done. If it is a one-bit-tile, we defer to the appropriate leaf specified by idx0 or idx1. And if it is pixmap, we have a smidge more logic to detect oceans or defer again to a different leaf.

CONSTANT: ocean-index 0xffff

GENERIC#: lookup-leaf 2 ( leaf x y -- zone/f )

M: string lookup-leaf 2drop ;

M:: one-bit-tile lookup-leaf ( leaf x y -- zone/f )
    leaf bits>> y 3 bits 3 shift x 3 bits bitor bit?
    [ leaf idx1>> ] [ leaf idx0>> ] if
    unique-leaves nth x y lookup-leaf ;

M:: byte-array lookup-leaf ( leaf x y -- zone/f )
    y 3 bits 3 shift x 3 bits bitor 2 * :> i
    i leaf nth 8 shift i 1 + leaf nth +
    dup ocean-index = [ drop f ] [
        unique-leaves nth x y lookup-leaf
    ] if ;

We’re almost done! Given a zoom level, a tile-key helps us find a specific tile that we then can lookup the leaf for, hopefully finding the timezone associated with the coordinate.

:: lookup-zoom-level ( zoom-level x y tile-key -- zone/f )
    zoom-level [ key>> tile-key >=< ] search swap [
        dup key>> tile-key = [
            idx>> unique-leaves nth x y lookup-leaf
        ] [ drop f ] if
    ] [ drop f ] if ;

Each coordinate is effectively a pixel in the image, so our logic searches from the outermost zoom level to the innermost, trying to lookup a timezone in each one using the coordinate and level as a tile-key.

:: tile-key ( x y level -- tile-key )
    level dup 3 + neg :> n
    y x [ n shift 14 bits ] bi@
    { 0 14 28 } bitfield ;

:: lookup-pixel ( x y -- zone )
    6 <iota> [| level |
        level zoom-levels nth
        x y 2dup level tile-key
        lookup-zoom-level
    ] map-find-last drop ;

Finally, we have enough to implement our public API, converting a given latitude/longitude coordinate to a pixel value, deferring to the word we just defined above.

CONSTANT: deg-pixels 32

:: lookup-zone ( lat lon -- zone )
    lon 180 + deg-pixels * 0 360 deg-pixels * 1 - clamp
    90 lat - deg-pixels * 0 180 deg-pixels * 1 - clamp
    [ >integer ] bi@ lookup-pixel ;

And then a couple of test cases to show it’s working:

{ "America/Los_Angeles" } [ 37.7833 -122.4167 lookup-zone ] unit-test

{ "Australia/Sydney" } [ -33.8885 151.1908 lookup-zone ] unit-test

Performance is pretty good, we can generate over 3 million lookups per second, putting our cost per lookup around 0.33 microseconds. And all of that in less than 70 lines of code.

This is available on my GitHub.

Tuesday, February 28, 2023

Reference Server

Phil Eaton made a repository of Barebones UNIX socket servers with this description:

I find myself writing this server in some language every few months. Each time I have to scour the web for a good reference. Use this as a reference to write your own bare server in C or other languages with a UNIX API (Python, OCaml, etc).

Many developers learning network programming will encounter Beej's Guide to Network Programming which uses the sockets API, has been ported to many platforms, and explains the intricacies of making computers talk to each other in this manner.

C

We can take a look at his C implementation of a server that listens on port 15000, accepts client connections, reads up to 1024 bytes which are printed to the screen, then writes hello world back to the client and disconnects them:

#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

int main() {
    int server, client;
    socklen_t addrlen;

    int bufsize = 1024;
    char *buffer = malloc(bufsize);
    struct sockaddr_in address;

    server = socket(AF_INET, SOCK_STREAM, 0);

    address.sin_family = AF_INET;
    address.sin_addr.s_addr = INADDR_ANY;
    address.sin_port = htons(15000);

    bind(server, (struct sockaddr *) &address, sizeof(address));

    while (1) {
        listen(server, 10);
        client = accept(server, (struct sockaddr *) &address, &addrlen);
        recv(client, buffer, bufsize, 0);
        printf("%s\n", buffer);
        write(client, "hello world\n", 12);
        close(client);
    }

    close(server);
    return 0;
}

Factor

A direct Factor translation — without any error checking, like in the original example — using the C library interface might look something like this:

USING: accessors alien.c-types alien.data byte-arrays
classes.struct io io.encodings.string io.encodings.utf8 kernel
sequences unix.ffi unix.types ;

:: reference-server ( -- )
    1024 <byte-array> :> buffer
    AF_INET SOCK_STREAM 0 socket :> server
    sockaddr-in malloc-struct
        AF_INET >>family
        0 >>addr
        15000 htons >>port :> address

    server address sockaddr-in heap-size bind drop

    [
        server 10 listen drop
        server address 0 socklen_t <ref> accept :> client
        client buffer 1024 0 recv
        buffer swap head-slice utf8 decode print flush
        client $[ "hello world\n" >byte-array ]
        dup length unix.ffi:write drop
        client close drop
        t
    ] loop

    server close drop ;

I noticed that some of his examples are more idiomatic to the language, so we could rewrite this using threaded servers — gaining the benefit of working on Windows as well as error handling and logging — using a handler quotation to implement the read/print/write/disconnect logic.

USING: accessors io io.encodings.binary io.encodings.string
io.encodings.utf8 io.servers kernel namespaces ;

: reference-server ( -- )
    binary <threaded-server>
        15000 >>insecure
        [
            1024 read-partial [
                [ utf8 decode print flush ] with-global
                $[ "hello world\n" >byte-array ] io:write flush
            ] when*
        ] >>handler
    start-server wait-for-server ;

This is available on my GitHub.

Sunday, February 26, 2023

Weighted Random

Some time ago, I implemented a way to generate weighted random values from a discrete distribution in Factor. It ended up being a pretty satisfyingly simple word that builds a cumulative probability table, generates a random probability, then searches the table to find which value to return:

: weighted-random ( histogram -- obj )
    unzip cum-sum [ last >float random ] keep bisect-left swap nth ;

Is It Fast?

We can define a simple discrete distribution of values:

CONSTANT: dist H{ { "A" 1 } { "B" 2 } { "C" 3 } { "D" 4 } }

And it seems to work — we can make a few random values from it:

IN: scratchpad dist weighted-random .
"C"
IN: scratchpad dist weighted-random .
"C"
IN: scratchpad dist weighted-random .
"D"
IN: scratchpad dist weighted-random .
"B"

After generating a lot of random values, we can see the histogram matches our distribution:

IN: scratchpad 10,000,000 [ dist weighted-random ] replicate histogram .
H{
    { "A" 998403 }
    { "B" 2000400 }
    { "C" 3001528 }
    { "D" 3999669 }
}

But, how fast is it?

IN: scratchpad [ 10,000,000 [ dist weighted-random ] replicate ] time 
Running time: 3.02998325 seconds

Okay, so it's not that fast... generating around 3.3 million per second on one of my computers.

Improvements

We can make two quick improvements to this:

  1. First, we can factor out the initial step from the random number generation.
  2. Second, we can take advantage of a recent improvement to the random vocabulary, mainly to change the random word that was previously implemented for different types to instead get the current random-generator and then pass it to the random* implementation instead. This allows a few speedups where we can lookup the dynamic variable once and then use it many times.

That results in this definition:

: weighted-randoms ( length histogram -- seq )
    unzip cum-sum swap
    [ [ last >float random-generator get ] keep ] dip
    '[ _ _ random* _ bisect-left _ nth ] replicate ;

That gives us a nice speedup, just over 10 million per second:

IN: scratchpad [ 10,000,000 dist weighted-randoms ] time histogram .
Running time: 0.989039625 seconds

H{
    { "A" 1000088 }
    { "B" 1999445 }
    { "C" 3000688 }
    { "D" 3999779 }
}

That's pretty nice, but it turns out that we can do better.

Vose Alias Method

Keith Schwarz wrote a fascinating blog post about some better algorithms for sampling from a discrete distribution. One of those algorithms is the Vose Alias Method which creates a data structure of items, probabilities, and an alias table that is used to return an alternate choice:

TUPLE: vose
    { n integer }
    { items array }
    { probs array }
    { alias array } ;

We construct a vose tuple by splitting the distribution into items and their probabilities, and then processing the probabilities into lists of small (less than 1) or large (greater than or equal to 1), iteratively aliasing the index of smaller items to larger items.

:: <vose> ( dist -- vose )
    V{ } clone :> small
    V{ } clone :> large
    dist assoc-size :> n
    n f <array> :> alias

    dist unzip dup [ length ] [ sum ] bi / v*n :> ( items probs )
    probs [ swap 1 < small large ? push ] each-index

    [ small empty? not large empty? not and ] [
        small pop :> s
        large pop :> l
        l s alias set-nth
        l dup probs [ s probs nth + 1 - dup ] change-nth
        1 < small large ? push
    ] while

    1 large [ probs set-nth ] with each
    1 small [ probs set-nth ] with each

    n items probs alias vose boa ;

We can implement the random* generic to select a random item from the vose tuple — choosing a random item index, check it's probability against a random number between 0.0 and 1.0, and if it is over a threshold we return the aliased item instead:

M:: vose random* ( obj rnd -- elt )
    obj n>> rnd random*
    dup obj probs>> nth rnd (random-unit) >=
    [ obj alias>> nth ] unless
    obj items>> nth ;

It's much faster, over 14.4 million per second:

IN: scratchpad [ 10,000,000 dist <vose> randoms ] time 
Running time: 0.693588458 seconds

This is available now in the math.extras vocabulary in the current development version, along with a few tweaks that brings the performance over 21.7 million per second...

Wednesday, February 15, 2023

DuckDuckGo

The conversation around the current quality of web search engines, the doomsday prediction about various incumbents, and the equal parts inspiring and challenging rollout of large language models to improve search has been fascinating to watch. There are many challengers in the search engine space including companies like Kagi and Neeva among many search engine startups. One privacy-focused startup that has been fun to follow for awhile has been DuckDuckGo.

You can see an example of the DuckDuckGo API that is available on api.duckduckgo.com. This does not provide access to their full search results, but instead provides access to their instant answers. Regardless, I thought it would be neat if we could use this from Factor.

We can take a search query and turn it into a URL object:

: duckduckgo-url ( query -- url )
    URL" http://api.duckduckgo.com"
        swap "q" set-query-param
        "json" "format" set-query-param
        "1" "pretty" set-query-param
        "1" "no_redirect" set-query-param
        "1" "no_html" set-query-param
        "1" "skip_disambig" set-query-param ;

Using the http.client vocabulary and the json vocabulary we can retrieve a result set:

: duckduckgo ( query -- results )
    duckduckgo-url http-get nip utf8 decode json> ;

We can make a word that prints out the abstract response with clickable links:

: abstract. ( results -- )
    dup "Heading" of [ drop ] [
        swap {
            [ "AbstractURL" of >url write-object nl ]
            [ "AbstractText" of print ]
            [ "AbstractSource" of "- " write print ]
        } cleave nl
    ] if-empty ;

And then a word that prints out a result response, parsing the HTML using the html.parser vocabulary and output as text using the html.parser.printer vocabulary:

: result. ( result -- )
    "Result" of [
        "<a href=\"" ?head drop "\">" split1 "</a>" split1
        [ swap >url write-object ]
        [ parse-html html-text. nl ] bi*
    ] when* ;

There are more aspects to the response from the API, but we can initially print out the abstract, the results, and the related topics:

: duckduckgo. ( query -- )
    duckduckgo {
        [ abstract. ]
        [ "Results" of [ result. ] each ]
        [ "RelatedTopics" of [ result. ] each ]
    } cleave ;

We can try it out on a topic that this particular blog likes to discuss:

IN: scratchpad "factorcode" duckduckgo.
Factor (programming language)
Factor is a stack-oriented programming language created by Slava
Pestov. Factor is dynamically typed and has automatic memory
management, as well as powerful metaprogramming features. The
language has a single implementation featuring a self-hosted
optimizing compiler and an interactive development environment.
The Factor distribution includes a large standard library.
- Wikipedia

Official site - Factor (programming language)
Concatenative programming languages
Stack-oriented programming languages
Extensible syntax programming languages
Function-level languages
High-level programming languages
Programming languages
Software using the BSD license

This is available on my GitHub.