Discussion:
file(1) / magic(5) database vs shared-mime-info database
Kip Warner
2018-02-24 07:39:06 UTC
Hey list,

Unless I'm misunderstanding things, it seems to me that typical
GNU distros ship two different systems for determining
a file's magic.

There is the older file(1) / libmagic / magic(5) mechanism. One can
test it via running file(1) on a particular file and it will identify
it and can also provide the MIME type.

The second way is via /usr/share/mime/packages/<mypackage>.xml
mechanism as described by the fd.o shared-mime-info-spec. Typical
desktop shells like Gnome's Nautilus, Xfce's Thunar, pcmanfm, etc.,
seem to identify files via this way.
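
As a rough illustration of what such a package file contains, the sketch below generates one with Python's standard library. The MIME type, comment, glob and magic string are all hypothetical placeholders, and only the basic elements (comment, magic/match, glob) are shown; the shared-mime-info specification remains the authority on the format.

```python
import xml.etree.ElementTree as ET

# XML namespace used by shared-mime-info package files.
NS = "http://www.freedesktop.org/standards/shared-mime-info"


def build_mime_package(mime_type, comment, glob, magic_value, priority=50):
    """Return the XML text of a package file describing a single MIME type."""
    # Serialize NS as the default namespace rather than with an ns0: prefix.
    ET.register_namespace("", NS)
    root = ET.Element(f"{{{NS}}}mime-info")
    entry = ET.SubElement(root, f"{{{NS}}}mime-type", {"type": mime_type})
    ET.SubElement(entry, f"{{{NS}}}comment").text = comment
    # One magic rule: a literal string expected at offset 0.
    magic = ET.SubElement(entry, f"{{{NS}}}magic", {"priority": str(priority)})
    ET.SubElement(
        magic,
        f"{{{NS}}}match",
        {"type": "string", "offset": "0", "value": magic_value},
    )
    # One filename pattern.
    ET.SubElement(entry, f"{{{NS}}}glob", {"pattern": glob})
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # All four values are made up for the example.
    print(build_mime_package(
        "application/x-example", "Example file", "*.example", "EXMP"))
```

A file like this would typically be installed under /usr/share/mime/packages/ and followed by a run of update-mime-database on the MIME directory so that the cache files are regenerated.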

The amount of redundant work in specifying file magic patterns for both
systems is substantial considering how many different types of files
there are out there in the wild. I am assuming, though I don't know this
for certain, that the latter's
/usr/share/mime/packages/freedesktop.org.xml might have been initially
generated from file(1)'s upstream magic source database (ships as
libmagic-mgc on my distro).

Ironically, neither of these two mechanisms seems to communicate with
the other. If the second mechanism is intended to superannuate the
first, would it not make sense to provide an fd.o replacement for
file(1) which queries the system's shared-mime-info database instead? I
would think this should be trivial to implement.

Yours truly,
--
Kip Warner | Senior Software Engineer
OpenPGP signed/encrypted mail preferred
https://www.cartesiantheatre.com
Emmanuele Bassi
2018-02-24 12:10:51 UTC
Post by Kip Warner
Hey list,
Unless I'm misunderstanding things, it seems to me that typical
GNU distros ship two different systems for determining
a file's magic.
The file(1) command predates not just Linux by about 20 years, but
the whole of freedesktop.org, including the shared-mime database, by
about 30 years.

From a portability perspective, file(1) is probably the best option if
you find yourself stranded in the past, or on Unix-like systems like
macOS. I think it'd be kind of unreasonable to make file(1) depend on
the shared-mime database, considering the tool's history and the fact
that the people who tend to use file(1) exclusively are a fairly
conservative bunch. Nevertheless, you could ask the author to detect
if the shared-mime database is installed, and use that instead of the
magic numbers database:

http://www.darwinsys.com/file/

Ideally, though, you should ignore file(1) and magic(5) altogether on
Linux, if you are dealing with MIME types.
Post by Kip Warner
If the second mechanism is intended to superannuate the
first, would it not make sense to provide an fd.o replacement for
file(1) which queries the system's shared-mime-info database instead? I
would think this should be trivial to implement.
Considering that every single xdg-utils utility is a shell script that
calls existing binaries, you can very likely write an
"xdg-content-type" that calls things like `gio info -a
standard::content-type` on a file under GNOME, or any other utility
under other environments, and propose it for inclusion in the xdg-utils
suite:

https://cgit.freedesktop.org/xdg/xdg-utils
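
A minimal sketch of that idea follows, written in Python rather than shell so that the parsing step is easy to test. The name content_type() and the fallback to file(1) are assumptions made for illustration, and the parser assumes `gio info` prints the attribute as an indented "standard::content-type: VALUE" line.

```python
import shutil
import subprocess


def parse_gio_content_type(gio_output):
    """Extract the standard::content-type value from `gio info` output."""
    for line in gio_output.splitlines():
        key, sep, value = line.strip().partition(": ")
        if sep and key == "standard::content-type":
            return value.strip()
    return None


def content_type(path):
    """Hypothetical "xdg-content-type": ask gio first, then fall back to file(1)."""
    if shutil.which("gio"):
        result = subprocess.run(
            ["gio", "info", "-a", "standard::content-type", path],
            capture_output=True, text=True, check=False,
        )
        found = parse_gio_content_type(result.stdout)
        if found:
            return found
    if shutil.which("file"):
        result = subprocess.run(
            ["file", "--brief", "--mime-type", path],
            capture_output=True, text=True, check=False,
        )
        return result.stdout.strip() or None
    # Neither tool is installed.
    return None
```

Note that xdg-utils already ships `xdg-mime query filetype FILE`, which covers similar ground, so a new script would mostly differ in which backends it prefers.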

Ciao,
Emmanuele.
--
https://www.bassi.io
[@] ebassi [@gmail.com]
Kip Warner
2018-02-25 21:51:17 UTC
Post by Emmanuele Bassi
The file(1) command predates not just Linux by about 20 years, but
the whole of freedesktop.org, including the shared-mime database, by
about 30 years.
That's what I thought. My Debian-based distro appears to ship a
BSD variant.
Post by Emmanuele Bassi
Ideally, though, you should ignore file(1) and magic(5) altogether on
Linux, if you are dealing with MIME types.
Out of curiosity, though, what about the web world, where php(1) or
some other ensemble of tools used to serve browser scripts needs to
check MIME types? At this time I think they probably all rely on
file(1) and its API?
Post by Emmanuele Bassi
Considering that every single xdg-utils utility is a shell script that
calls existing binaries, you can very likely write an
"xdg-content-type" that calls things like `gio info -a
standard::content-type` on a file under GNOME, or any other utility
under other environments, and propose it for inclusion in the
xdg-utils suite:
https://cgit.freedesktop.org/xdg/xdg-utils
Yes, that could work. People who already have traditional file(1)
installed could keep both on their system and select which one
to use via update-alternatives or some such.
Kip Warner
2018-02-25 22:18:07 UTC
No, Apache has its own mechanism that mostly relies on filename
conventions, as do most other Web servers.
Interesting. They probably also have different requirements where
sometimes the MIME type has to be detected quickly and purely based on
the file name before there is an opportunity to actually probe the
input stream.
Bastien Nocera
2018-02-24 13:17:42 UTC
Post by Kip Warner
Hey list,
Unless I'm misunderstanding things, it seems to me that typical
GNU distros ship two different systems for determining
a file's magic.
There is the older file(1) / libmagic / magic(5) mechanism. One can
test it via running file(1) on a particular file and it will identify
it and can also provide the MIME type.
The second way is via /usr/share/mime/packages/<mypackage>.xml
mechanism as described by the fd.o shared-mime-info-spec. Typical
desktop shells like Gnome's Nautilus, Xfce's Thunar, pcmanfm, etc.,
seem to identify files via this way.
The amount of redundant work in specifying file magic patterns for both
systems is substantial considering how many different types of files
there are out there in the wild. I am assuming, though I don't know this
for certain, that the latter's
/usr/share/mime/packages/freedesktop.org.xml might have been initially
generated from file(1)'s upstream magic source database (ships as
libmagic-mgc on my distro).
Ironically, neither of these two mechanisms seems to communicate with
the other. If the second mechanism is intended to superannuate the
first, would it not make sense to provide an fd.o replacement for
file(1) which queries the system's shared-mime-info database instead? I
would think this should be trivial to implement.
No, it wouldn't make sense. They have very different use cases and
restrictions. In shared-mime-info:
- mime-types don't have to have a magic associated
- and mime-types have globs associated
- the magic length is limited to avoid seeking through huge files
- descriptions are translated, acronyms can be split-off and expanded
- mime-types have inheritance

There's probably others, but that's already a good chunk of the
problems we'd encounter if we used file's database.
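
To make a few of those points concrete, here is a toy lookup showing glob-only types, a cap on how much content the magic matcher inspects, and type inheritance. The entries and the 256-byte cap are invented for the example, and the real matching order (which also weighs magic priorities against globs) is simplified to "globs first, then magic".

```python
import fnmatch

# Hypothetical cap on how many leading bytes the magic matcher may inspect.
MAX_MAGIC_BYTES = 256

# Hand-written example entries; none are taken from freedesktop.org.xml.
TYPES = {
    "text/plain": {"globs": ["*.txt"], "magic": [], "parents": []},
    # A type with a glob but no magic, inheriting from text/plain.
    "text/x-example-log": {
        "globs": ["*.exlog"], "magic": [], "parents": ["text/plain"],
    },
    # A type with magic but no glob.
    "application/x-example": {
        "globs": [], "magic": [(0, b"EXMP")], "parents": [],
    },
}


def guess_type(filename, head):
    """Guess a type from the file name and at most MAX_MAGIC_BYTES of content."""
    head = head[:MAX_MAGIC_BYTES]
    for name, entry in TYPES.items():
        if any(fnmatch.fnmatch(filename, pattern) for pattern in entry["globs"]):
            return name
    for name, entry in TYPES.items():
        for offset, value in entry["magic"]:
            if head[offset:offset + len(value)] == value:
                return name
    return None


def is_a(mime_type, ancestor):
    """Return True if mime_type equals ancestor or inherits from it."""
    if mime_type == ancestor:
        return True
    parents = TYPES.get(mime_type, {}).get("parents", [])
    return any(is_a(parent, ancestor) for parent in parents)
```

With these entries, a file named notes.exlog resolves to text/x-example-log without any content being read, and is_a() lets an application that handles text/plain accept it as well.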
Kip Warner
2018-02-25 22:13:19 UTC
Post by Bastien Nocera
No, it wouldn't make sense. They have very different use cases and
restrictions. In shared-mime-info:
- mime-types don't have to have a magic associated
True, but I believe file(1)'s magic(5) database doesn't need to either.
It can "detect" a MIME type based on just a file extension if that's
what the rule writer provided.
Post by Bastien Nocera
- and mime-types have globs associated
- the magic length is limited to avoid seeking through huge files
magic(5) can also handle this with rules that check specific
file offsets.
Post by Bastien Nocera
- descriptions are translated, acronyms can be split-off and expanded
Yes, that's a good point.
Post by Bastien Nocera
- mime-types have inheritance
Another good point.
Post by Bastien Nocera
There's probably others, but that's already a good chunk of the
problems we'd encounter if we used file's database.
One capability file(1)'s magic(5) method has that nobody has mentioned
is the ability to identify not only the MIME type, but also a more
descriptive comment on the file's contents. As an example, consider the
magic I wrote to detect Maxis Database Packed Files.

https://github.com/file/file/blob/master/magic/Magdir/dbpf

If I just want to know the MIME type:

$ file --mime-type SimCity_Audio_Banks.package
SimCity_Audio_Banks.package: application/x-maxis-dbpf

But if I want to see a more descriptive comment:

$ file SimCity_Audio_Banks.package
SimCity_Audio_Banks.package: Maxis Database Packed File, version: 3.0, files: 83

I think what I've started to figure out is that file(1) / magic(5) are
meant to be used directly by users, with the API for magic(5) also
available to other applications, whereas shared-mime-info is designed
to be used primarily by other applications. Users rarely try to
identify a file with 'gio info foo'.

That might justify the two mechanisms' distinct existence, but at the
very least I think the redundant magic itself should be
consolidated.
Kip Warner
2018-02-25 22:24:42 UTC
Post by Bastien Nocera
There's probably others, but that's already a good chunk of the
problems we'd encounter if we used file's database.
In case it's of interest to anyone, I asked file(1) / magic(5)'s
upstream maintainer whether his mechanism and shared-mime-info
were redundant or whether they were intended to solve
different problems. This was Christos Zoulas's (cc'd) response
regarding shared-mime-info:

I have not looked at the code but for the most part getting the
majority of the file formats that need MIME handling does not require
the complexity that file requires to parse all the weird and
recursive cases... Also they don't have any performance constraints
(they don't need to be able to process thousands of files per
second). So for them, it is perhaps simpler to parse and maintain a
separate database.