alt.hn

1/13/2025 at 5:44:29 AM

The history and use of /etc/glob in early Unixes

https://utcc.utoronto.ca/~cks/space/blog/unix/EtcGlobHistory

by zdw

1/13/2025 at 2:06:31 PM

User defined functions were implemented similarly as external execs in early shells. As the script was parsed, functions were dropped into /tmp without their wrappings and then called as external programs. Since they would still reference parameters as $1, $2 etc, it just worked: function bodies and standalone sh scripts had the same interface! Such a clever idea to avoid managing an interpreted call stack in the parent.
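A rough sketch of the mechanism as described here, in Python rather than a period shell: the function body is written to a temporary file without its wrapper and run as an ordinary `sh` script, so `$1` and `$2` resolve to the call's arguments. The helper name `call_as_script` is made up for illustration, and the sketch assumes a POSIX `sh` is on the PATH.

```python
import os
import subprocess
import tempfile


def call_as_script(body: str, *args: str) -> str:
    """Run a 'function body' as a standalone sh script and return its stdout."""
    # Write the body, without any function wrapper, to a temporary file.
    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
        f.write(body)
        path = f.name
    try:
        # The arguments become $1, $2, ... inside the script, exactly as
        # they would for any other external command.
        result = subprocess.run(
            ["sh", path, *args], capture_output=True, text=True, check=True
        )
        return result.stdout
    finally:
        os.unlink(path)


if __name__ == "__main__":
    print(call_as_script('echo "hello $1 and $2"\n', "alice", "bob"), end="")
```

The parent keeps no call stack of its own here; each call is just another child process.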

by imglorp

1/13/2025 at 4:23:47 PM

The linked C source file is an excellent example of ancient C, from when it was still closer to high-level assembly:

https://www.tuhs.org/cgi-bin/utree.pl?file=V2/cmd/glob.c

by miohtama

1/13/2025 at 10:55:09 PM

I assume that you are referring to the liberal use of “goto”. Of course, “if”, “while”, and even “switch” are also used. Quite the mix.

Directly calling into system calls (“write”) is interesting.

by LeFantome

1/14/2025 at 1:19:33 AM

write(2) is POSIX. That's not "directly calling into a system call"; it's a normal C API from the POSIX header <unistd.h>.

by quuxplusone

1/14/2025 at 1:15:13 AM

The modern form of stdio only appeared in Seventh Edition Unix.

by kps

1/13/2025 at 5:52:52 PM

And when buffer overflows were (attempted to be) avoided by guesstimating a large enough buffer size.

by tomtomtom777

1/13/2025 at 9:49:27 AM

Why is there a period after etc in the title? Another example of HN's stupid automated title editing?

by ginko

1/13/2025 at 9:55:35 AM

Probably the submitter typed it on a phone instead of copy-paste and "etc" got autoincorrected.

by mkl

1/13/2025 at 6:53:46 AM

> PS: I don't know why expanding shell wildcards used a separate program in V6 and earlier, but part of it may have been to keep the shell smaller and more minimal so that it required less memory.

See, I thought it was a nice separation of concerns and wondered why we lost such a nice approach, until I read:

> How escaping wildcards works in the V5 and V6 shell is that all characters in commands and arguments are restricted to being seven-bit ASCII. The shell and /etc/glob both use the 8th bit to mark quoted characters, which means that such quoted characters don't match their unquoted versions and won't be seen as wildcards by either the shell

at which point I suddenly became a fan of ditching it. I do wonder if there's not some better way to factor that functionality out...
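To make the quoted scheme concrete, here is a small Python sketch of high-bit quoting. It is not the V5/V6 code; the names `QUOTE_BIT`, `quote`, `is_wildcard` and `strip_quotes` are invented for illustration, and the wildcard set is assumed to be `*`, `?` and `[`.

```python
QUOTE_BIT = 0x80  # the 8th bit, unused by seven-bit ASCII


def quote(text: str) -> bytes:
    """Mark every character as quoted by setting its high bit."""
    return bytes(b | QUOTE_BIT for b in text.encode("ascii"))


def is_wildcard(byte: int) -> bool:
    """Only unmarked '*', '?' and '[' count as wildcards."""
    return byte in b"*?["


def strip_quotes(data: bytes) -> str:
    """Clear the marker bit once matching is done."""
    return bytes(b & 0x7F for b in data).decode("ascii")


if __name__ == "__main__":
    word = b"a" + quote("*") + b"*"  # 'a', a quoted star, an unquoted star
    print([is_wildcard(b) for b in word])
    print(strip_quotes(word))
```

The cost shows up in `quote`: `text.encode("ascii")` raises on anything outside seven bits, so the marker bit is only free as long as no filename needs it.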

by yjftsjthsd-h

1/13/2025 at 9:32:08 AM

An important thing to remember is that even after the move to the PDP-11, early Unix systems had to deal with 32kB as the entire space available to a userland program, both code and data (including the stack).

by p_l

1/13/2025 at 4:15:16 PM

You mean 32k words, not 32k bytes, right[1]? And AFAIK by V5 or V6, Unix could use split instruction and data if the MMU supported it giving a bit more headroom. But, yeah, memory was very tight, and a lot of very clever tricks were used to get around it.

[1] Even worse, the top 4kW/8kB was reserved for I/O.
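The byte/word figures are easy to mix up, so here is the arithmetic spelled out as a Python sketch. It only converts units for a 16-bit address space with two-byte words; how much of that space a user program actually gets depends on the CPU model and kernel, which this does not model.

```python
ADDRESS_BITS = 16    # PDP-11 virtual addresses are 16 bits wide
BYTES_PER_WORD = 2   # a PDP-11 word is two bytes

address_space_bytes = 2 ** ADDRESS_BITS
address_space_words = address_space_bytes // BYTES_PER_WORD
io_page_bytes = 4 * 1024 * BYTES_PER_WORD  # the 4kW figure from the footnote

print(address_space_bytes // 1024, "kB")   # 64 kB
print(address_space_words // 1024, "kW")   # 32 kW
print(io_page_bytes // 1024, "kB")         # 8 kB
```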

by kjs3

1/14/2025 at 8:54:43 PM

I meant 32k bytes - PDP-11 was byte-addressed, not word-addressed. The 64kB address space was split in half between kernel and userland in the so-called "high and low moby" scheme (as it required minimal logic latching onto a single address line).

And for I/D split you needed appropriate CPU model.

The top 8kB "I/O page" is reserved as part of the kernel space, not userspace, so it does not affect the userspace part as much.

by p_l

1/15/2025 at 3:07:49 PM

Ah, I misunderstood your point. And while the PDP-11 was byte addressable, the doco often talked about memory size in words. Carry on.

by kjs3

1/13/2025 at 9:05:50 AM

Why would I want to factor out some syntactic functionality of one specific (and not very well thought out) shell to reuse, again?

But if you really insist, you can write your own glob(1) that would invoke glob(3) for you, sure. There is also wordexp(3) although I believe its implementations had security problems for quite some time?
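A standalone expander of that kind is only a few lines. The sketch below uses Python's `glob` module as a stand-in for glob(3) (it is a separate implementation, not a wrapper around the C function), and `expand_all` is a hypothetical helper name.

```python
import glob
import sys


def expand_all(patterns: list[str]) -> list[str]:
    """Expand each pattern in turn and return one flat list of matches."""
    results: list[str] = []
    for pattern in patterns:
        # Sort per pattern so the output order is stable, as shells do.
        results.extend(sorted(glob.glob(pattern)))
    return results


if __name__ == "__main__":
    # Print one match per line for every pattern given on the command line.
    for name in expand_all(sys.argv[1:]):
        print(name)
```

Patterns must be quoted on the command line, otherwise the calling shell expands them before this program ever sees them.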

by Joker_vD

1/13/2025 at 7:45:29 AM

The way Murex works is each parameter is first compiled into an AST, and then globbing only works against the unquoted tokens.

Globbing is also a separate built in, which allows for other types of wildcard matches like regex too. Eg https://murex.rocks/tour.html#filesystem-wildcards-globbing

So you have the best of both worlds: inline globbing for convenience and also wildcard matching as a function too.

by hnlmorg

1/13/2025 at 8:30:08 AM

> at which point I suddenly became a fan of ditching it. I do wonder if there's not some better way to factor that functionality out...

Just use backslash escaping like we do practically everywhere else in the Unix world?

by eru

1/13/2025 at 10:51:16 AM

That's kind of a cure worse than the disease. Just ditch escaping completely.

by rini17

1/13/2025 at 11:43:06 PM

Why? This is just for communication between the shell and its helper programs; the user wouldn't even see it.

What do you not like about escaping?

Of course, for program-to-program communication you can also use different techniques, instead of escaping. Escaping is just the most human-readable and human-producible.

(As a simple example, to be able to represent all characters in a string, you can either escape quotes like \" or you can prefix the string with its length.

Computers can work with either convention, but humans will hate you if they have to prefix every string literal with its length and keep that length in sync with the string.)
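The two conventions side by side, as a Python sketch. The function names are invented for illustration, and the length prefix counts characters rather than bytes to keep the example short.

```python
def escape(s: str) -> str:
    """Wrap s in double quotes, backslash-escaping backslashes and quotes."""
    return '"' + s.replace("\\", "\\\\").replace('"', '\\"') + '"'


def length_prefix(s: str) -> str:
    """Frame s as '<length>:<payload>'; the payload is left untouched."""
    return f"{len(s)}:{s}"


def read_length_prefixed(data: str) -> tuple[str, str]:
    """Return (payload, remaining input) without inspecting the payload."""
    size, _, rest = data.partition(":")
    n = int(size)
    return rest[:n], rest[n:]


if __name__ == "__main__":
    s = 'say "hi"'
    print(escape(s))         # "say \"hi\""
    print(length_prefix(s))  # 8:say "hi"
```

The reader of the length-prefixed form never scans the payload for special characters, which is what makes it attractive between programs and tedious for a person typing it.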

by eru

1/14/2025 at 8:31:16 AM

Are you aware that the main issue here is not with string literals, but with glob expansions? Literals are quite easy to check statically as mistakes usually cause havoc with surrounding code syntax. Even so, I avoid nontrivial use of them.

But expansions and substitutions with escaping are the can of worms.

by rini17

1/13/2025 at 9:16:10 PM

If you completely ditch escaping, how do you handle filenames that contain special characters (in this context, mostly ? and *, but ()[] are also perennial favorites)? And to preempt the most obvious answer: No, you can't just ban them, because existing OSs and filesystems allow them and you need interoperability.

by yjftsjthsd-h

1/13/2025 at 10:24:31 PM

There are ways, no idea why doing anything here is so reviled.

Find and xargs can delimit filenames by NUL, which is not allowed in filenames. Best practice in SQL was to abandon parameter escaping completely and pass them out of band. For internal representation, use array data structures with length information.
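Both ideas in a minimal Python sketch: splitting a NUL-terminated list such as `find -print0` produces, and passing a filename to SQL as a bound parameter. The `files` table and both function names are hypothetical.

```python
import sqlite3


def split_nul(data: bytes) -> list[bytes]:
    """Split NUL-terminated names; no other byte needs special treatment."""
    return [name for name in data.split(b"\0") if name]


def find_by_name(conn: sqlite3.Connection, name: str) -> list[tuple]:
    # The name travels as a bound parameter, never as part of the SQL text,
    # so '*', '?', quotes and spaces need no escaping.
    return conn.execute(
        "SELECT name FROM files WHERE name = ?", (name,)
    ).fetchall()


if __name__ == "__main__":
    print(split_nul(b"a b\0c*d?\0"))
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE files (name TEXT)")
    conn.execute("INSERT INTO files VALUES (?)", ("it's *odd*?",))
    print(find_by_name(conn, "it's *odd*?"))
```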

Actually, would it be that bad, to ban * and ? in filenames? If you accept them in the name of interop, something inevitably breaks later. Better to fail upfront. Many applications do sanitize filenames already and when they need to use binary data as file name, convert it to hex instead. It's a hassle otherwise.

by rini17

1/13/2025 at 11:52:59 PM

> Actually, would it be that bad, to ban * and ? in filenames?

That's possible, if you design your filesystem from scratch.

But if you take your filesystem as given for now (with its ability to represent all kinds of interesting characters), and just want to design globbing, you have to solve this problem. Otherwise you have a tool that can only handle some files. That's what GNU Make does, btw. Try handling any file or output with whitespace in the name in Make, if you want some frustration.

Yes, null-termination works for the specific problem of termination. Though if you just use program-to-program communication, you can also prefix your strings with their length.

> If you accept them in the name of interop, something inevitably breaks later.

Why? That's only the case when you have legacy software written by less than careful people. There's no reason to expect breakage when you are designing new software, just like the people in the article were doing. (Of course, back then they didn't know what they were doing, so we have a lot of breakage historically.)

But for the very specific purpose of the shell talking to a helper program for globbing, you can control exactly what's happening, including all the encoding and decoding (or escaping and unescaping). So there's no unexpected breakage.

And btw, you also need to give the human a way to specify a literal * in a filename, too. Not just for communication between programs.

> Best practice in SQL was to abandon parameters escaping completely and pass them out of band.

Yes, that's partially because SQL is such a complicated language, and because you are talking about program-to-program communication anyway, so you don't need to be human-friendly there. So communicating them on a separate channel is the simplest thing that covers all cases.

by eru

1/13/2025 at 9:27:26 AM

Sweet.

I use xterm.js a lot and have a "shell backbone" that I use to make shell based access to APIs, S3 and other things "cloud." This is essentially how I implement globbing as well. The convenience is that you can run glob by itself to get an idea of exactly what kind of automated nightmare you are about to kick off.

Anyways.. mine currently has V3 behavior. My shell command exec routine could actually benefit from that hack. What's old is new again?

by timewizard

1/13/2025 at 3:38:46 PM

Recent versions of Bash don't expand the * (et cetera) patterns when there is no match, which, although sometimes useful, still feels like a hack to me.

by amelius

1/13/2025 at 6:21:15 PM

The action to take upon no match is configurable in recent Bash versions.

The 'failglob' shopt option will cause an error to be generated if a glob matches nothing.

The 'nullglob' shopt option toggles between no match expanding to an empty string and the traditional default of no match leaving the glob characters untouched.
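The three outcomes can be modelled in a few lines. This is a Python sketch of the behaviour for a single word, not Bash's implementation; `expand` and its `mode` argument are invented names, and `fnmatch` stands in for the shell's own matcher.

```python
import fnmatch


def expand(pattern: str, names: list[str], mode: str = "default") -> list[str]:
    """Model what one glob word turns into under each setting."""
    matches = sorted(fnmatch.filter(names, pattern))
    if matches:
        return matches
    if mode == "failglob":
        # The command is not run at all.
        raise LookupError(f"no match: {pattern}")
    if mode == "nullglob":
        return []       # the word disappears from the command line
    return [pattern]    # default: the pattern is passed through literally


if __name__ == "__main__":
    names = ["foo1bar", "baz"]
    print(expand("foo*bar", names))
    print(expand("qux*", names))
    print(expand("qux*", names, "nullglob"))
```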

by pwg

1/13/2025 at 4:25:10 PM

That's been around since the original Bourne shell; /etc/glob, from what I can see from its source, would refuse to run the command if the resulting expansion turned out completely empty; and the globs with no matches would be simply removed.

by Joker_vD

1/13/2025 at 8:16:26 PM

That's not how it works in recent Ubuntu releases. If there is no match, the command runs with the wildcard chars not substituted.

    # echo foo*bar
    foo*bar

by amelius

1/14/2025 at 6:48:13 AM

Yes, this current behaviour was introduced by the original Bourne shell and then it stuck for some silly reason or another (it probably has some fringe use cases but they elude me). Thompson's original shell, or rather, /etc/glob, at various versions implemented the mix of behaviours that would later be reintroduced as nullglob and failglob options in Bash.

by Joker_vD

1/13/2025 at 11:24:08 AM

binaries in /etc/ -- i mean __really__

by gjvc

1/13/2025 at 3:00:05 PM

Even now you'll come across this, for example "/etc/rmt" probably exists, and other tape-related binaries if installed.

by stevekemp

1/13/2025 at 2:53:00 PM

Yes, really. That's what /etc was for.

by tedunangst

1/13/2025 at 6:11:24 PM

I know. I'm saying it's sick. I hate computers.

by gjvc

1/13/2025 at 6:46:08 PM

Why sick? That was the directory for binaries that weren't meant to be run directly — `getty`, `login`, etc.

Today there's much more software, so some things got moved into finer-grained locations like /libexec and /sbin. That wasn't the case in the /etc/glob era when the entire UNIX system was smaller than today's average web page.

by kps

1/13/2025 at 9:45:53 PM

and /sbin was full

by gjvc

1/13/2025 at 6:47:40 AM

[flagged]

by JimmyWilliams1

1/13/2025 at 7:38:35 AM

This reads like slop for some reason; even to my non-native brain.

by rednafi

1/13/2025 at 2:20:55 PM

I wonder what the point of these accounts is - they show up in almost every post now. If the goal is farming karma, they aren't doing a very good job.

by marcus0x62

1/13/2025 at 10:47:26 AM

This is php.ini level of madness, and I'm glad it's gone from (semi-)modern shells. A formal (e.g. programming) language should be defined in its entirety by its formal grammar, its semantics by a formal spec, etc. There's barely any good reason to let the system administrator change the logic and semantics of deployed code.

You could argue that Lisp reader macros also somewhat violate this rule. As a longtime Lisp fan, I dislike reader macros, but I'm more conflicted about macros in general. A good macro system should aim to provide enough context for IDEs and LSPs to aid the developer, but Lisp macros are entirely about just transforming the AST. It's usually just better to evolve the language.

by rollcat

1/13/2025 at 11:10:56 AM

It's not there to give the system administrator flexibility. It's there because early Unix was heavily constrained, and doing things with lots of little overlays (and what was decades later known as "Bernstein chaining") rather than 1 big program was the way to architect stuff. exit(1), goto(1), and if(1) were all external commands in the Thompson shell.

* https://v6sh.org

by JdeBP

1/13/2025 at 1:54:00 PM

I would argue with almost anyone else, that this is a poor design, but...

Thank you for your perspective, work, and contributions.

by rollcat

1/13/2025 at 6:49:53 PM

You are likely looking at this design from a modern system perspective.

But the PDP-11 system that many of these designs were made upon had a minimum memory size of 4K bytes and with varying models that had different maximum memory sizes that are smaller than a single JPEG photo in today's world: PDP 11/45 max memory 256kbyte - PDP 11/70 max memory 4Mbyte.

And this was the total memory for everything, the OS, and the users, and the system supported multiple users sharing the same machine at the same time.

With those resource constraints, the design rules that determine good from poor are radically different than with one of today's systems with multiple GB of RAM.

by pwg

1/14/2025 at 4:16:34 AM

Also remember that in the early days Unix did segment swapping, with demand paging only coming in with BSD and the VAX. So there was no paging in just a tiny part of a big executable.

by JdeBP

1/13/2025 at 3:17:41 PM

The other thing to bear in mind is that it’s undergone literally decades of evolution while still being backwards compatible.

The shells weren’t originally intended to be Turing complete. They were just a job launcher. What you use today would have been unimaginable when these shells were first designed.

Whereas all other programming languages have had a drastically smaller evolution in comparison and yet still had a worse compatibility story.

It’s very easy to be critical of the Bourne shell (and compatible shells too) because they are archaic by modern standards. But they weren’t written to solve modern problems. So it’s like looking at a bicycle and complaining that the designers didn’t design a sports car, while ignoring the fact that the technology didn’t exist and that push bikes are still good enough for millions to use daily.

by hnlmorg

1/13/2025 at 9:53:26 PM

What in the world is "php.ini level of madness"?

If you are trying to attack php you are not doing a good job of it, especially because there were good reasons for using a separate program for glob.

by ars

1/17/2025 at 3:38:50 PM

I didn't consider the historical context, which several people in this thread provided. I already knew that "/etc" used to literally mean "etcetera" - "whatever doesn't fit elsewhere", but didn't immediately connect the dots that "/etc/glob" was still considered a fixed part of the system, and wasn't meant to be substituted by the administrator.

I won't argue about PHP. I've dealt with it while there was money to be made from that, and moved on as soon as I had the chance. ¯\_(ツ)_/¯

by rollcat