Trash spec: directory size cache

Discussion:

Trash spec: directory size cache

David Faure

2013-04-14 21:48:40 UTC

To implement a maximum size for the trash directory, one needs to check the
size every time a new item is being trashed. With the current spec, the only
solution is to do a recursive traversal, which is pretty expensive.
To make this efficient, we need a cache.
My initial idea of a global "total size" cache doesn't work well with older
implementations which don't update that value, so it gets out of date quickly.

Instead, Ryan Lortie and I came up with the following idea, which we would
like to standardize into the trash spec:

For files, we get the file from stat. For dirs, we use a cache:
in every trash directory, a metadata file is created, with one entry per
directory (that was trashed by the user).
That entry contains the total size in bytes of the directory, and the
modification time of the trashinfo file [*].

The metadata file uses desktop file syntax, where the key is the directory
name, and the value is a pair: size, and mtime.

However the desktop file standard restricts the available characters for keys,
so instead of just writing out the directory name, we write the sha1 of the
directory name (a bit like the thumbnail spec uses sha1s too).

In summary, it would look like this:

[Directories]
# One entry per sub-directory of the "files" directory
# key = sha1 of the directory name
# value = size in bytes, timestamp of the trashinfo file, in UTC
cb58e5c11a6802db43fd82ca8d3c7393353c0eab=25383,2009-07-11T20:18:30
f1d2d2f924e986ac86fdf7b36c94bcdf32beec15=2315,2012-04-12T10:05:20

To determine size of the trash directory, this leads to the following
algorithm:

totalsize = 0
prepare empty set of sha1s
list "files" directory, and for each entry:
stat the entry
if a file, totalsize += file size
if a directory,
stat the trashinfo file to get its mtime
calculate sha1 of the directory name
read entry from metadata file
if entry found
extract cached_size and cached_mtime
if mtime != cached_mtime
re-calculate directory size
update entry (size of directory, mtime of trashinfo file)
else
calculate directory size
write entry (size of directory, mtime of trashinfo file)
totalsize += directory size
add sha1 to set of seen sha1s
done

for each entry in the metadata file,
if entry key is not in the set of seen sha1s
remove entry

[*] This way, if an older trash implementation deletes and recreates this
entry, we can detect that the cache entry is stale [even if the directory got
restored and trashed again, so the mtime of the directory itself didn't
change, this is why we use the mtime of the trashinfo file, instead].

If there is no objection, I will make a patch for the trash spec.

--
David Faure, ***@kde.org, http://www.davidfaure.fr
Working on KDE, in particular KDE Frameworks 5

Alexander Larsson

2013-04-15 08:07:40 UTC

Post by David Faure
To implement a maximum size for the trash directory, one needs to check the
size every time a new item is being trashed. With the current spec, the only
solution is to do a recursive traversal, which is pretty expensive.
To make this efficient, we need a cache.
My initial idea of a global "total size" cache doesn't work well with older
implementations which don't update that value, so it gets out of date quickly.
Instead, Ryan Lortie and I came up with the following idea, which we would
in every trash directory, a metadata file is created, with one entry per
directory (that was trashed by the user).
That entry contains the total size in bytes of the directory, and the
modification time of the trashinfo file [*].
The metadata file uses desktop file syntax, where the key is the directory
name, and the value is a pair: size, and mtime.
However the desktop file standard restricts the available characters for keys,
so instead of just writing out the directory name, we write the sha1 of the
directory name (a bit like the thumbnail spec uses sha1s too).
[Directories]
# One entry per sub-directory of the "files" directory
# key = sha1 of the directory name
# value = size in bytes, timestamp of the trashinfo file, in UTC
cb58e5c11a6802db43fd82ca8d3c7393353c0eab=25383,2009-07-11T20:18:30
f1d2d2f924e986ac86fdf7b36c94bcdf32beec15=2315,2012-04-12T10:05:20

In general this sounds good to me. I have two minor objections:

1: Using sha1 seems wrong to me. There is no need to get an even
distribution of the keys (like for thumbnail subdirectories), and a sha1
is slow to calculate. Also, if you ever look at the file manually its
says very little. I would much prefer simple character escape model, say
you allow A-Za-z0-9 and everyting else you escape as "-" + the hex
digits (like "-2d" for "-"). This is valid desktop file keys, are cheap
to calculate and makes most files readable by humans.

2: Don't store the mtime in a format that needs parsing. Time and date
parsing is a very complicated area that is easy to get wrong. And the
source is always a stat which is in epoch format, why not just save it
in the same format to avoid any day/month order issues, timezone
weirdnesses or whatnot.

Bastien Nocera

2013-04-15 08:19:55 UTC

Post by Alexander Larsson

Post by David Faure
To implement a maximum size for the trash directory, one needs to check the
size every time a new item is being trashed. With the current spec, the only
solution is to do a recursive traversal, which is pretty expensive.
To make this efficient, we need a cache.
My initial idea of a global "total size" cache doesn't work well with older
implementations which don't update that value, so it gets out of date quickly.
Instead, Ryan Lortie and I came up with the following idea, which we would
in every trash directory, a metadata file is created, with one entry per
directory (that was trashed by the user).
That entry contains the total size in bytes of the directory, and the
modification time of the trashinfo file [*].
The metadata file uses desktop file syntax, where the key is the directory
name, and the value is a pair: size, and mtime.
However the desktop file standard restricts the available characters for keys,
so instead of just writing out the directory name, we write the sha1 of the
directory name (a bit like the thumbnail spec uses sha1s too).
[Directories]
# One entry per sub-directory of the "files" directory
# key = sha1 of the directory name
# value = size in bytes, timestamp of the trashinfo file, in UTC
cb58e5c11a6802db43fd82ca8d3c7393353c0eab=25383,2009-07-11T20:18:30
f1d2d2f924e986ac86fdf7b36c94bcdf32beec15=2315,2012-04-12T10:05:20

1: Using sha1 seems wrong to me. There is no need to get an even
distribution of the keys (like for thumbnail subdirectories), and a sha1
is slow to calculate. Also, if you ever look at the file manually its
says very little. I would much prefer simple character escape model, say
you allow A-Za-z0-9 and everyting else you escape as "-" + the hex
digits (like "-2d" for "-"). This is valid desktop file keys, are cheap
to calculate and makes most files readable by humans.

Or base-64 encode the directory name. Even if that fails the "readable
by humans" test, at least it'll avoid slightly differently buggy escape
sequences).

Post by Alexander Larsson
2: Don't store the mtime in a format that needs parsing. Time and date
parsing is a very complicated area that is easy to get wrong. And the
source is always a stat which is in epoch format, why not just save it
in the same format to avoid any day/month order issues, timezone
weirdnesses or whatnot.

ISO 8601 time format requires very little work to parse, and uses
GMT/UTC. It also passes the human-readable test ;)

David Faure

2013-04-15 09:11:42 UTC

Post by Bastien Nocera

Post by Alexander Larsson

Post by David Faure
To implement a maximum size for the trash directory, one needs to check the
size every time a new item is being trashed. With the current spec, the
only solution is to do a recursive traversal, which is pretty
expensive. To make this efficient, we need a cache.
My initial idea of a global "total size" cache doesn't work well with older
implementations which don't update that value, so it gets out of date quickly.
Instead, Ryan Lortie and I came up with the following idea, which we would
in every trash directory, a metadata file is created, with one entry per
directory (that was trashed by the user).
That entry contains the total size in bytes of the directory, and the
modification time of the trashinfo file [*].
The metadata file uses desktop file syntax, where the key is the directory
name, and the value is a pair: size, and mtime.
However the desktop file standard restricts the available characters for
keys, so instead of just writing out the directory name, we write the
sha1 of the directory name (a bit like the thumbnail spec uses sha1s
too).
[Directories]
# One entry per sub-directory of the "files" directory
# key = sha1 of the directory name
# value = size in bytes, timestamp of the trashinfo file, in UTC
cb58e5c11a6802db43fd82ca8d3c7393353c0eab=25383,2009-07-11T20:18:30
f1d2d2f924e986ac86fdf7b36c94bcdf32beec15=2315,2012-04-12T10:05:20

1: Using sha1 seems wrong to me. There is no need to get an even
distribution of the keys (like for thumbnail subdirectories), and a sha1
is slow to calculate. Also, if you ever look at the file manually its
says very little. I would much prefer simple character escape model, say
you allow A-Za-z0-9 and everyting else you escape as "-" + the hex
digits (like "-2d" for "-"). This is valid desktop file keys, are cheap
to calculate and makes most files readable by humans.

Or base-64 encode the directory name. Even if that fails the "readable
by humans" test, at least it'll avoid slightly differently buggy escape
sequences).

Base-64 leads to huge output. I think escaping with '-'+hex is fine (possibly
faster, more readable, and not as big). I don't see the "buggy" argument.

--
David Faure, ***@kde.org, http://www.davidfaure.fr
Working on KDE, in particular KDE Frameworks 5

Jerome Leclanche

2013-04-15 10:17:14 UTC

Do the keys accept %? uri-encoding them would be the obvious solution here.
I'm also very uncomfortable with the sha1 bit, but apart from that
everything looks good.

J. Leclanche

Post by David Faure

Post by Bastien Nocera

Post by Alexander Larsson

Post by David Faure
To implement a maximum size for the trash directory, one needs to

check

Post by Bastien Nocera

Post by Alexander Larsson

Post by David Faure
the
size every time a new item is being trashed. With the current spec,

the

Post by Bastien Nocera

Post by Alexander Larsson

Post by David Faure
only solution is to do a recursive traversal, which is pretty
expensive. To make this efficient, we need a cache.
My initial idea of a global "total size" cache doesn't work well with older
implementations which don't update that value, so it gets out of date quickly.
Instead, Ryan Lortie and I came up with the following idea, which we would
in every trash directory, a metadata file is created, with one entry

per

Post by Bastien Nocera

Post by Alexander Larsson

Post by David Faure
directory (that was trashed by the user).
That entry contains the total size in bytes of the directory, and the
modification time of the trashinfo file [*].
The metadata file uses desktop file syntax, where the key is the directory
name, and the value is a pair: size, and mtime.
However the desktop file standard restricts the available characters

for

Post by Bastien Nocera

Post by Alexander Larsson

Post by David Faure
keys, so instead of just writing out the directory name, we write the
sha1 of the directory name (a bit like the thumbnail spec uses sha1s
too).
[Directories]
# One entry per sub-directory of the "files" directory
# key = sha1 of the directory name
# value = size in bytes, timestamp of the trashinfo file, in UTC
cb58e5c11a6802db43fd82ca8d3c7393353c0eab=25383,2009-07-11T20:18:30
f1d2d2f924e986ac86fdf7b36c94bcdf32beec15=2315,2012-04-12T10:05:20

1: Using sha1 seems wrong to me. There is no need to get an even
distribution of the keys (like for thumbnail subdirectories), and a

sha1

Post by Bastien Nocera

Post by Alexander Larsson
is slow to calculate. Also, if you ever look at the file manually its
says very little. I would much prefer simple character escape model,

say

Post by Bastien Nocera

Post by Alexander Larsson
you allow A-Za-z0-9 and everyting else you escape as "-" + the hex
digits (like "-2d" for "-"). This is valid desktop file keys, are cheap
to calculate and makes most files readable by humans.

Or base-64 encode the directory name. Even if that fails the "readable
by humans" test, at least it'll avoid slightly differently buggy escape
sequences).

Base-64 leads to huge output. I think escaping with '-'+hex is fine (possibly
faster, more readable, and not as big). I don't see the "buggy" argument.
--
Working on KDE, in particular KDE Frameworks 5
_______________________________________________
xdg mailing list
http://lists.freedesktop.org/mailman/listinfo/xdg

David Faure

2013-04-15 10:31:45 UTC

Post by Jerome Leclanche
Do the keys accept %? uri-encoding them would be the obvious solution here.

Yes, it would have been, but this isn't a supported character...
see last section of http://standards.freedesktop.org/desktop-entry-spec/latest/ar01s02.html

IMHO we could also add more allowed characters to the spec, but experience tells me
that trying to get stuff into the desktop entry spec is pretty hard. Maybe at the next freedesktop
meeting, when we have more people involved with that...

(KDE supports any character in there, and uses '\' for escaping, and even supports \x2d escaping).

--
David Faure, ***@kde.org, http://www.davidfaure.fr
Working on KDE, in particular KDE Frameworks 5

Jerome Leclanche

2013-04-15 10:35:20 UTC

Adding the % character in there is probably not a big deal; is there any
software in the world that actually validates trashinfo files?
Additionally, the current trash spec only says "The format of this file is
similar to the format of a desktop entry file", so we can make a note of it
being supported if changing the desktop entry spec isn't an option.

J. Leclanche

Post by David Faure

Post by Jerome Leclanche
Do the keys accept %? uri-encoding them would be the obvious solution

here.
Yes, it would have been, but this isn't a supported character...
see last section of
http://standards.freedesktop.org/desktop-entry-spec/latest/ar01s02.html
IMHO we could also add more allowed characters to the spec, but experience tells me
that trying to get stuff into the desktop entry spec is pretty hard. Maybe
at the next freedesktop
meeting, when we have more people involved with that...
(KDE supports any character in there, and uses '\' for escaping, and even
supports \x2d escaping).
--
Working on KDE, in particular KDE Frameworks 5

Vincent Untz

2013-04-15 14:10:50 UTC

Post by David Faure
IMHO we could also add more allowed characters to the spec, but experience tells me
that trying to get stuff into the desktop entry spec is pretty hard. Maybe at the next freedesktop
meeting, when we have more people involved with that...

Hrm, I was hoping that you would discuss maintenance of the specs during
that meeting (I think I mailed Ryan about that last month). I don't
think anyone is particularly happy with the current state of things;
having a small group of well-defined people from different projects
taking ownership of the specs would be a good first step.

Cheers,

Vincent

--
Les gens heureux ne sont pas pressés.

Alexander Larsson

2013-04-15 11:43:40 UTC

hi,

Post by Alexander Larsson
1: Using sha1 seems wrong to me. There is no need to get an even
distribution of the keys (like for thumbnail subdirectories), and a sha1
is slow to calculate. Also, if you ever look at the file manually its
says very little. I would much prefer simple character escape model, say
you allow A-Za-z0-9 and everyting else you escape as "-" + the hex
digits (like "-2d" for "-"). This is valid desktop file keys, are cheap
to calculate and makes most files readable by humans.

I think you overestimate how slow sha1 is...
Even still, this is fine to me but there was a concern that the length
of the keys could easily get out of hand. In the 'normal case' of
reasonably-named files it will be more readable, though, and faster.

Yeah, i don't think its actually gonna affect real performance, it just
seems unnecessary and makes the files harder to read.

Post by Alexander Larsson
2: Don't store the mtime in a format that needs parsing. Time and date
parsing is a very complicated area that is easy to get wrong. And the
source is always a stat which is in epoch format, why not just save it
in the same format to avoid any day/month order issues, timezone
weirdnesses or whatnot.

We already have a parser for this in the trashinfo file (the original
deletion time) although I guess it would be easy to implement the spec
without ever parsing this key (since it is strictly informational).
It may not have been clear in David's email but it is the intention
(from our discussions last night) that this timestamp would always be UTC.
No strong objections to turning it into unix-timestamp-in-UTC though.

NOOOO! The timestamp timezone of a mtime is undefined, and may be
different from the actual timezone of the computer doing the stat()
call, for instance for NFS mounts. There is no reason whatsoever to do
anything to the value returned from stat, as all we do with it is
compare it to a later stat value. Any kind of timezone conversion on it
will not help and may hurt.

Ryan Lortie

2013-04-15 12:25:53 UTC

hi,

Post by Alexander Larsson

No strong objections to turning it into unix-timestamp-in-UTC though.

NOOOO! The timestamp timezone of a mtime is undefined, and may be
different from the actual timezone of the computer doing the stat()
call, for instance for NFS mounts. There is no reason whatsoever to do
anything to the value returned from stat, as all we do with it is
compare it to a later stat value. Any kind of timezone conversion on it
will not help and may hurt.

The kernel time is in UTC...

Cheers

Alexander Larsson

2013-04-15 12:36:56 UTC

Post by Ryan Lortie
hi,

Post by Alexander Larsson

No strong objections to turning it into unix-timestamp-in-UTC though.

NOOOO! The timestamp timezone of a mtime is undefined, and may be
different from the actual timezone of the computer doing the stat()
call, for instance for NFS mounts. There is no reason whatsoever to do
anything to the value returned from stat, as all we do with it is
compare it to a later stat value. Any kind of timezone conversion on it
will not help and may hurt.

The kernel time is in UTC...

The kernel will report whatever time_t value was sent over the network
via the NFS protocol. Whether this is UTC or not depends on the server
setup and OS. But sure, if what you mean by "unix-timemstamp-in-UTC" is
"whatever value the kernel gave you in stat()" then that is fine. Just
don't ever try to do anything on that value that depends on the timezone
of the NFS client.

This is exactly where using a string for time makes things tricky. You
need to parse the string, and its easy to accidentally pick some API
that assumes that the string is in local timezone rather than UTC, and
thus get things wrong. If the file just lists a time_t integer value it
is less likely that things go wrong when you compare it with the
stat().st_mtime value.

Bastien Nocera

2013-04-15 14:28:45 UTC

Post by Alexander Larsson

Post by Ryan Lortie
hi,

Post by Alexander Larsson

No strong objections to turning it into unix-timestamp-in-UTC though.

NOOOO! The timestamp timezone of a mtime is undefined, and may be
different from the actual timezone of the computer doing the stat()
call, for instance for NFS mounts. There is no reason whatsoever to do
anything to the value returned from stat, as all we do with it is
compare it to a later stat value. Any kind of timezone conversion on it
will not help and may hurt.

The kernel time is in UTC...

The kernel will report whatever time_t value was sent over the network
via the NFS protocol. Whether this is UTC or not depends on the server
setup and OS. But sure, if what you mean by "unix-timemstamp-in-UTC" is
"whatever value the kernel gave you in stat()" then that is fine. Just
don't ever try to do anything on that value that depends on the timezone
of the NFS client.
This is exactly where using a string for time makes things tricky. You
need to parse the string, and its easy to accidentally pick some API
that assumes that the string is in local timezone rather than UTC, and
thus get things wrong. If the file just lists a time_t integer value it
is less likely that things go wrong when you compare it with the
stat().st_mtime value.

How is it any better defined than ISO8601 with the timezone information?

From g_time_val_to_iso8601():
"
ISO 8601 allows a large number of date/time formats, with or without
punctuation and optional elements. The format returned by this function
is a complete date and time, with optional punctuation included, the UTC
time zone represented as "Z", and the tv_usec part included if and only
if it is nonzero, i.e. either "YYYY-MM-DDTHH:MM:SSZ" or
"YYYY-MM-DDTHH:MM:SS.fffffZ".
"

Alexander Larsson

2013-04-15 14:46:53 UTC

Post by Bastien Nocera

Post by Alexander Larsson

Post by Ryan Lortie
hi,

Post by Alexander Larsson

No strong objections to turning it into unix-timestamp-in-UTC though.

NOOOO! The timestamp timezone of a mtime is undefined, and may be
different from the actual timezone of the computer doing the stat()
call, for instance for NFS mounts. There is no reason whatsoever to do
anything to the value returned from stat, as all we do with it is
compare it to a later stat value. Any kind of timezone conversion on it
will not help and may hurt.

The kernel time is in UTC...

The kernel will report whatever time_t value was sent over the network
via the NFS protocol. Whether this is UTC or not depends on the server
setup and OS. But sure, if what you mean by "unix-timemstamp-in-UTC" is
"whatever value the kernel gave you in stat()" then that is fine. Just
don't ever try to do anything on that value that depends on the timezone
of the NFS client.
This is exactly where using a string for time makes things tricky. You
need to parse the string, and its easy to accidentally pick some API
that assumes that the string is in local timezone rather than UTC, and
thus get things wrong. If the file just lists a time_t integer value it
is less likely that things go wrong when you compare it with the
stat().st_mtime value.

How is it any better defined than ISO8601 with the timezone information?
"
ISO 8601 allows a large number of date/time formats, with or without
punctuation and optional elements. The format returned by this function
is a complete date and time, with optional punctuation included, the UTC
time zone represented as "Z", and the tv_usec part included if and only
if it is nonzero, i.e. either "YYYY-MM-DDTHH:MM:SSZ" or
"YYYY-MM-DDTHH:MM:SS.fffffZ".
"

Its not less well defined, its just vastly less complicated (no spec
needed, no parser, no whatever). Less moving parts means less chance for
errors.

For instance, if someone forgot the add the Z in the timestamp the
string is immediately non-utc and things will not work in some
timezones.

David Faure

2013-04-15 16:47:27 UTC

It turns out that the trash spec already uses %-encoding: for the path stored
in the trashinfo file.

So we discussed it a bit more and came up with this update:

Let's not use the desktop entry standard for this, but a simple text format,
where each line is:
<size> <mtime> <%-encoded-path>

Technically % encoding is really only necessary for control characters, but if
we use URL escaping and escape other chars too, who cares, that'll work too.

To resolve the timezone discussion, we'll just use the mtime in number format.
So the file will look like:

16950 15803468 Documents
2467 15803582 Another_Folder

Due to the clash with the KDE "metadata" file (our earlier attempt at caching
the size, using INI format), this change of format means we should call it
something else after all, say:

TRASH_DIR/directorysizes

--
David Faure, ***@kde.org, http://www.davidfaure.fr
Working on KDE, in particular KDE Frameworks 5

Jerome Leclanche

2013-04-15 17:36:34 UTC

Why can't the current <name>.trashinfo files be reused, with new keys added?

J. Leclanche

Post by David Faure
It turns out that the trash spec already uses %-encoding: for the path stored
in the trashinfo file.
Let's not use the desktop entry standard for this, but a simple text format,
<size> <mtime> <%-encoded-path>
Technically % encoding is really only necessary for control characters, but if
we use URL escaping and escape other chars too, who cares, that'll work too.
To resolve the timezone discussion, we'll just use the mtime in number format.
16950 15803468 Documents
2467 15803582 Another_Folder
Due to the clash with the KDE "metadata" file (our earlier attempt at caching
the size, using INI format), this change of format means we should call it
TRASH_DIR/directorysizes
--
Working on KDE, in particular KDE Frameworks 5
_______________________________________________
xdg mailing list
http://lists.freedesktop.org/mailman/listinfo/xdg

David Faure

2013-04-15 21:26:19 UTC

Post by Jerome Leclanche
Why can't the current <name>.trashinfo files be reused, with new keys added?

So that we open and parse only one file, instead of 10000.

--
David Faure, ***@kde.org, http://www.davidfaure.fr
Working on KDE, in particular KDE Frameworks 5

Jerome Leclanche

2013-04-15 21:40:52 UTC

Ah sorry, I thought you wanted to store it in multiple separate files :)

The simple file format is nice however I have a small concern about
extensibility; any new value added is going to break backwards
compatibility, unless we specify that additional values may be eventually
added.
Performance-wise, does it really matter that much? There should/will be an
ini parser because of the trashinfo files anyway.

J. Leclanche

Post by Jerome Leclanche

Post by Jerome Leclanche
Why can't the current <name>.trashinfo files be reused, with new keys

added?
So that we open and parse only one file, instead of 10000.
--
Working on KDE, in particular KDE Frameworks 5

David Faure

2013-04-15 21:56:21 UTC

Post by Jerome Leclanche
Ah sorry, I thought you wanted to store it in multiple separate files :)

No, TRASH_DIR/directorysizes will contain the multiple lines I showed in my
email.

Post by Jerome Leclanche
The simple file format is nice however I have a small concern about
extensibility; any new value added is going to break backwards
compatibility, unless we specify that additional values may be eventually
added.

There isn't much to add in order to solve the question 'how big is this
directory'. No extensibility is needed here.

Post by Jerome Leclanche
Performance-wise, does it really matter that much? There should/will be an
ini parser because of the trashinfo files anyway.

Yes, parsing 1000 files is really a lot slower than parsing one file.
Due to I/O, and due to cache locality.
The whole point of a cache is to be fast, otherwise we wouldn't be doing this
in the first place.

--
David Faure, ***@kde.org, http://www.davidfaure.fr
Working on KDE, in particular KDE Frameworks 5

Jerome Leclanche

2013-04-15 22:24:45 UTC

Post by David Faure

Post by Jerome Leclanche
Ah sorry, I thought you wanted to store it in multiple separate files :)

No, TRASH_DIR/directorysizes will contain the multiple lines I showed in my
email.

Post by Jerome Leclanche
The simple file format is nice however I have a small concern about
extensibility; any new value added is going to break backwards
compatibility, unless we specify that additional values may be eventually
added.

There isn't much to add in order to solve the question 'how big is this
directory'. No extensibility is needed here.

Post by Jerome Leclanche
Performance-wise, does it really matter that much? There should/will be

an

Post by Jerome Leclanche
ini parser because of the trashinfo files anyway.

Yes, parsing 1000 files is really a lot slower than parsing one file.
Due to I/O, and due to cache locality.
The whole point of a cache is to be fast, otherwise we wouldn't be doing this
in the first place.

I meant does it matter that much to avoid an ini-based format there
(keeping, of course, the single file)

Post by David Faure
--
Working on KDE, in particular KDE Frameworks 5

J. Leclanche

Alexander Larsson

2013-04-15 19:18:53 UTC

Post by David Faure
It turns out that the trash spec already uses %-encoding: for the path stored
in the trashinfo file.
Let's not use the desktop entry standard for this, but a simple text format,
<size> <mtime> <%-encoded-path>
Technically % encoding is really only necessary for control characters, but if
we use URL escaping and escape other chars too, who cares, that'll work too.
To resolve the timezone discussion, we'll just use the mtime in number format.
16950 15803468 Documents
2467 15803582 Another_Folder
Due to the clash with the KDE "metadata" file (our earlier attempt at caching
the size, using INI format), this change of format means we should call it
TRASH_DIR/directorysizes

This seems excellent to me. Simple, efficient, compatible.

Ryan Lortie

2013-04-15 22:15:40 UTC

hi David,

Post by David Faure
16950 15803468 Documents
2467 15803582 Another_Folder

One thing I forgot to ask for a clarification on earlier, and certainly
something that we should spell out in the spec: what do we mean by
'size'? Sum of byte-sizes of all files, or 'disk space used' sizes of
all files and directories?

I guess the second one makes more sense, but the sizes you show here
don't seem to be multiples of disk block sizes, which is usually the
case for this type of sizes.

Your thoughts?

Alexander Larsson

2013-04-16 07:03:22 UTC

Post by Ryan Lortie
hi David,

Post by David Faure
16950 15803468 Documents
2467 15803582 Another_Folder

One thing I forgot to ask for a clarification on earlier, and certainly
something that we should spell out in the spec: what do we mean by
'size'? Sum of byte-sizes of all files, or 'disk space used' sizes of
all files and directories?
I guess the second one makes more sense, but the sizes you show here
don't seem to be multiples of disk block sizes, which is usually the
case for this type of sizes.
Your thoughts?

Sum of block sizes isn't a perfect measurement either. For one it
doesn't count the disk space of the directories themseleves, nor does it
handle things like tailpacking, hardlinks, etc.

However, it is more reliable than just the sum of the sizes (i.e. it
handles sparse files and many-small-files better), and its tractable to
compute, so I'd say go with it.

David Faure

2013-04-17 08:12:22 UTC

Post by Alexander Larsson

Post by Ryan Lortie
hi David,

Post by David Faure
16950 15803468 Documents
2467 15803582 Another_Folder

One thing I forgot to ask for a clarification on earlier, and certainly
something that we should spell out in the spec: what do we mean by
'size'? Sum of byte-sizes of all files, or 'disk space used' sizes of
all files and directories?
I guess the second one makes more sense, but the sizes you show here
don't seem to be multiples of disk block sizes, which is usually the
case for this type of sizes.
Your thoughts?

Sum of block sizes isn't a perfect measurement either. For one it
doesn't count the disk space of the directories themseleves, nor does it
handle things like tailpacking, hardlinks, etc.
However, it is more reliable than just the sum of the sizes (i.e. it
handles sparse files and many-small-files better), and its tractable to
compute, so I'd say go with it.

OK. Thanks everyone for the input.
I have combined it all into the attached patch for the trash specification.

Note that I largely rewrote the non-normative algorithm compared to the
initial email; please check.

If you find html-in-a-patch hard to read, you can look for "Directory size
cache" at this page: http://www.davidfaure.fr/2013/trashspec_proposal.html

--
David Faure, ***@kde.org, http://www.davidfaure.fr
Working on KDE, in particular KDE Frameworks 5

Ryan Lortie

2013-04-17 08:32:37 UTC

hi,

Post by David Faure
OK. Thanks everyone for the input.
I have combined it all into the attached patch for the trash specification.
Note that I largely rewrote the non-normative algorithm compared to the
initial email; please check.
If you find html-in-a-patch hard to read, you can look for "Directory size
cache" at this page: http://www.davidfaure.fr/2013/trashspec_proposal.html

Looks good except for one very tiny detail: you go out of your way to
mention that URI encoding is only needed for dealing with '\n' in such a
way as it might lead someone to write a buggy implementation that fails
to encode '%' itself...

Other than that, ACK from me.

Cheers

Diggory Hardy

2013-04-18 07:28:32 UTC

Post by David Faure

Post by Alexander Larsson

Post by Ryan Lortie
hi David,

Post by David Faure
16950 15803468 Documents
2467 15803582 Another_Folder

One thing I forgot to ask for a clarification on earlier, and certainly
something that we should spell out in the spec: what do we mean by
'size'? Sum of byte-sizes of all files, or 'disk space used' sizes of
all files and directories?
I guess the second one makes more sense, but the sizes you show here
don't seem to be multiples of disk block sizes, which is usually the
case for this type of sizes.
Your thoughts?

Sum of block sizes isn't a perfect measurement either. For one it
doesn't count the disk space of the directories themseleves, nor does it
handle things like tailpacking, hardlinks, etc.
However, it is more reliable than just the sum of the sizes (i.e. it
handles sparse files and many-small-files better), and its tractable to
compute, so I'd say go with it.

OK. Thanks everyone for the input.
I have combined it all into the attached patch for the trash specification.
Note that I largely rewrote the non-normative algorithm compared to the
initial email; please check.
If you find html-in-a-patch hard to read, you can look for "Directory size
cache" at this page: http://www.davidfaure.fr/2013/trashspec_proposal.html

Hello,

Looks pretty good. Only thing that occurs to me is if too jobs are updating
the directorysizes file simultaneously two problems are possible: a read
clashing with a write or two clashing writes. Requiring atomic writes (i.e. to
a temporary file followed by rename) avoids the first problem and is relatively
simple. Clashed writes are probably not worth worrying about (worst case a
slow/paused job overwrites a recent file with an old one, but this seems
neither particularly bad nor likely from my POV).

David Faure

2013-04-19 15:13:56 UTC

Post by Diggory Hardy

Post by David Faure
If you find html-in-a-patch hard to read, you can look for "Directory size
cache" at this page: http://www.davidfaure.fr/2013/trashspec_proposal.html

Hello,
Looks pretty good. Only thing that occurs to me is if too jobs are updating
the directorysizes file simultaneously two problems are possible: a read
clashing with a write or two clashing writes. Requiring atomic writes (i.e.
to a temporary file followed by rename) avoids the first problem and is
relatively simple. Clashed writes are probably not worth worrying about
(worst case a slow/paused job overwrites a recent file with an old one, but
this seems neither particularly bad nor likely from my POV).

Right. We had mentionned this during the meeting, but I forgot to spell it out
in the spec. Done now, I added:

<P>Implementations MUST use a temporary file followed by an atomic rename()
operation in order to avoid corruption due to two implementations writing to
the file at the same time. The fact that the changes from one of the writers
could get lost isn't an issue, the cache can be updated again later on to add
that entry.</P>

Post by Diggory Hardy
Looks good except for one very tiny detail: you go out of your way to
mention that URI encoding is only needed for dealing with '\n' in such a
way as it might lead someone to write a buggy implementation that fails
to encode '%' itself...

Indeed. Added an explicit mention of the '%' character in that sentence.

http://www.davidfaure.fr/2013/trashspec_proposal.html updated.

I think we're good to go now. Cc'ing the original author of the spec and other
implementors that I know about, to inform them of the change (and give them a
chance to react :).

--
David Faure, ***@kde.org, http://www.davidfaure.fr
Working on KDE, in particular KDE Frameworks 5

Mikhail Ramendik

2013-04-19 21:48:05 UTC

Post by David Faure
I think we're good to go now. Cc'ing the original author of the spec and other
implementors that I know about, to inform them of the change (and give them a
chance to react :).

Thanks and here is my reaction :) I'm that technical writer who compiled
the original version.

I see an ambiguity in the text, as currently written. The ambiguity applies
in cases a multilevel directory is trashed. Let us say that the directory
foo was trashed; it contains the subdirectory bar.

In $trash/directorysizes there will be an entry for foo. But what about an
entry for foo/bar? The problem is that "trashing a directory" is not
exactly defined. I guess there will only be an entry for foo; can you
confirm that?

Also, the term "directory size" is used but not defined. Is "the sum of
sizes of all files in the directory, including files in any subdirectories"
a correct definition?

After you confirm, and if you don't mind, I'll make a revised release
candidate. I'd like to fix a couple of old glitches as well (like a "note
the dot" endnote where the dot no longer exists". I'll make sure to have a
diff against the released version too.

--
Yours, Mikhail Ramendik

Unless explicitly stated, all opinions in my mail are my own and do not
reflect the views of any organization

David Faure

2013-04-24 16:26:20 UTC

Hi Mikhail!

Glad to hear from you.

Post by Mikhail Ramendik
I see an ambiguity in the text, as currently written. The ambiguity applies
in cases a multilevel directory is trashed. Let us say that the directory
foo was trashed; it contains the subdirectory bar.
In $trash/directorysizes there will be an entry for foo. But what about an
entry for foo/bar? The problem is that "trashing a directory" is not
exactly defined. I guess there will only be an entry for foo; can you
confirm that?

Correct, this is only about direct children of $trash/files/.
This is actually clarified by the sentence

<P>The character '/' is not allowed in the directory name (even as %2F), since
all these directories must be direct children of the "files" directory, and
absolute paths are not allowed.</P>

Post by Mikhail Ramendik
Also, the term "directory size" is used but not defined. Is "the sum of
sizes of all files in the directory, including files in any subdirectories"
a correct definition?

The way to calculate sizes is defined further down:

<P>The size is calculated to be the disk space used by the directory and its
contents, i.e. the size of the blocks, in bytes (like `du -B1` would
calculate).</P>

Post by Mikhail Ramendik
After you confirm, and if you don't mind, I'll make a revised release
candidate. I'd like to fix a couple of old glitches as well (like a "note
the dot" endnote where the dot no longer exists". I'll make sure to have a
diff against the released version too.

Sure, please do. Maybe do that as an additional commit on top of mine, and in
the end one of us pushes both commits together? This way it'll be easier to
review your changes (without diff'ing diffs).

--
David Faure, ***@kde.org, http://www.davidfaure.fr
Working on KDE, in particular KDE Frameworks 5

Mikhail Ramendik

2013-04-24 16:33:45 UTC

Post by David Faure
Sure, please do. Maybe do that as an additional commit on top of mine, and in
the end one of us pushes both commits together? This way it'll be easier to
review your changes (without diff'ing diffs).

I'm not sure I know how to commit or have the access, but I suggest we
handle this later. I will try to have my version ready no later than the
weekend and post it at a temporary location. Once everything is approved I
am sure we can sort out the commit part without bothering the entire list
with it.

--
Yours, Mikhail Ramendik

Unless explicitly stated, all opinions in my mail are my own and do not
reflect the views of any organization

David Faure

2014-02-02 09:52:49 UTC

Post by Mikhail Ramendik

Post by David Faure
Sure, please do. Maybe do that as an additional commit on top of mine, and in
the end one of us pushes both commits together? This way it'll be easier to
review your changes (without diff'ing diffs).

I'm not sure I know how to commit or have the access, but I suggest we
handle this later. I will try to have my version ready no later than the
weekend and post it at a temporary location. Once everything is approved I
am sure we can sort out the commit part without bothering the entire list
with it.

Wrapping this up almost a year later :)

Given that everyone was in agreement with this spec change, I have now pushed
these two commits, after receiving the style update by Mikhail:

commit 41d01202368d5296da77cda72a0f18f3590e9ca9
Author: Mikhail Ramendik <***@ramendik.ru>
Date: Sun Feb 2 10:47:52 2014 +0100

Style review of the language in the trash spec

commit 4636efe0c39c332886b8ba704e2345eba1c2a7a3
Author: David Faure <***@kde.org>
Date: Wed Apr 24 18:36:22 2013 +0200

Add "directorysizes" spec, bump version to 1.0

and I updated the information in web-export/specs.idx (how do I trigger an
update on the website now?)

--
David Faure, ***@kde.org, http://www.davidfaure.fr
Working on KDE, in particular KDE Frameworks 5

PCMan

2013-04-23 09:21:07 UTC

I have some questions about the proposal.

Post by David Faure

Post by Diggory Hardy

Post by David Faure
If you find html-in-a-patch hard to read, you can look for "Directory size
cache" at this page: http://www.davidfaure.fr/2013/trashspec_proposal.html

Hello,
Looks pretty good. Only thing that occurs to me is if too jobs are updating
the directorysizes file simultaneously two problems are possible: a read
clashing with a write or two clashing writes. Requiring atomic writes (i.e.
to a temporary file followed by rename) avoids the first problem and is
relatively simple. Clashed writes are probably not worth worrying about
(worst case a slow/paused job overwrites a recent file with an old one, but
this seems neither particularly bad nor likely from my POV).

Right. We had mentionned this during the meeting, but I forgot to spell it out
<P>Implementations MUST use a temporary file followed by an atomic rename()
operation in order to avoid corruption due to two implementations writing to
the file at the same time. The fact that the changes from one of the writers
could get lost isn't an issue, the cache can be updated again later on to add
that entry.</P>

The cache can be updated again later.
How, when, and who should do the update?
If multiple file manager implementations are updating the file at the
same time, only one of their change will be kept.
How can we know there are directories missing from the cache?
Should implementation enumerate the whole trash dir everytime to see
is there any one missing from the cache? If that's the case, the
implementation needs to calculate the total size of the missing
folders, and add them to the cache.
In the current proposal, we need to check frequently if there are dirs
not included in the cache and fix the errors. This looks odds.
Would you please clarify this part?
Thank you.

Post by David Faure

Post by Diggory Hardy
Looks good except for one very tiny detail: you go out of your way to
mention that URI encoding is only needed for dealing with '\n' in such a
way as it might lead someone to write a buggy implementation that fails
to encode '%' itself...

Indeed. Added an explicit mention of the '%' character in that sentence.
http://www.davidfaure.fr/2013/trashspec_proposal.html updated.
I think we're good to go now. Cc'ing the original author of the spec and other
implementors that I know about, to inform them of the change (and give them a
chance to react :).

Diggory Hardy

2013-04-24 12:25:03 UTC

Post by PCMan

Post by David Faure
<P>Implementations MUST use a temporary file followed by an atomic rename()
operation in order to avoid corruption due to two implementations writing
to the file at the same time. The fact that the changes from one of the
writers could get lost isn't an issue, the cache can be updated again
later on to add that entry.</P>

The cache can be updated again later.
How, when, and who should do the update?
If multiple file manager implementations are updating the file at the
same time, only one of their change will be kept.
How can we know there are directories missing from the cache?
Should implementation enumerate the whole trash dir everytime to see
is there any one missing from the cache? If that's the case, the
implementation needs to calculate the total size of the missing
folders, and add them to the cache.
In the current proposal, we need to check frequently if there are dirs
not included in the cache and fix the errors. This looks odds.
Would you please clarify this part?
Thank you.
Note that if the cache is recalculated by two processes simultaneously, one

of the updates will be lost; this is not considered an issue since from the
perspective of either of the updating processes the value obtained was correct
at some point in their executation and from the point of view of a future
process reading the cache, the worst case is that the cache read is a little
older than the value last calculated.

David Faure

2013-04-24 14:50:51 UTC

[Reducing CC's, now that people have been notified]

Post by Diggory Hardy

Post by PCMan

Post by David Faure
<P>Implementations MUST use a temporary file followed by an atomic rename()
operation in order to avoid corruption due to two implementations writing
to the file at the same time. The fact that the changes from one of the
writers could get lost isn't an issue, the cache can be updated again
later on to add that entry.</P>

The cache can be updated again later.
How, when, and who should do the update?
If multiple file manager implementations are updating the file at the
same time, only one of their change will be kept.
How can we know there are directories missing from the cache?

Well, if the directory exists and is not in the cache, it's missing :)

Post by Diggory Hardy

Post by PCMan
Should implementation enumerate the whole trash dir everytime to see
is there any one missing from the cache?

Yes. Please read the spec, the suggested algorithm does exactly that.
This is necessary anyway, to cope with older implementations, not only due to
the race condition.

Post by Diggory Hardy

Post by PCMan
If that's the case, the
implementation needs to calculate the total size of the missing
folders, and add them to the cache.

Yes, as the algorithm does.

Post by Diggory Hardy

Post by PCMan
In the current proposal, we need to check frequently if there are dirs
not included in the cache and fix the errors. This looks odds.

That's the same as the previous paragraph, I don't see the difference.

Post by Diggory Hardy

Post by PCMan
Note that if the cache is recalculated by two processes simultaneously, one

of the updates will be lost; this is not considered an issue since from the
perspective of either of the updating processes the value obtained was
correct at some point in their executation and from the point of view of a
future process reading the cache, the worst case is that the cache read is
a little older than the value last calculated.

This sounds less clear to me :)
You seem to look at the issue that a directory might "grow" over time. But
since the spec already suggests using rename() to trash directories (since it
suggests to only do it for the same partition), then the directory is either
there (and full) or it's not. So there is no "old value" in the cache.
Either a directory is in the cache, at the time you read it, or it's not (and
you need to calculate its size "manually"), and then add it to the cache (and
hope another process doesn't throw that away, but if it does, it will be
adding it to the cache, or it will happen even later, on the next trash
operation).

--
David Faure, ***@kde.org, http://www.davidfaure.fr
Working on KDE, in particular KDE Frameworks 5

32 Replies
2 Views
Permalink to this page
Disable enhanced parsing

Thread Navigation

David Faure 2013-04-14 21:48:40 UTC

Alexander Larsson 2013-04-15 08:07:40 UTC

Bastien Nocera 2013-04-15 08:19:55 UTC

David Faure 2013-04-15 09:11:42 UTC

Jerome Leclanche 2013-04-15 10:17:14 UTC

David Faure 2013-04-15 10:31:45 UTC

Jerome Leclanche 2013-04-15 10:35:20 UTC

Vincent Untz 2013-04-15 14:10:50 UTC

Alexander Larsson 2013-04-15 11:43:40 UTC

Ryan Lortie 2013-04-15 12:25:53 UTC

Alexander Larsson 2013-04-15 12:36:56 UTC

Bastien Nocera 2013-04-15 14:28:45 UTC

Alexander Larsson 2013-04-15 14:46:53 UTC

David Faure 2013-04-15 16:47:27 UTC

Jerome Leclanche 2013-04-15 17:36:34 UTC

David Faure 2013-04-15 21:26:19 UTC

Jerome Leclanche 2013-04-15 21:40:52 UTC

David Faure 2013-04-15 21:56:21 UTC

Jerome Leclanche 2013-04-15 22:24:45 UTC

Alexander Larsson 2013-04-15 19:18:53 UTC

Ryan Lortie 2013-04-15 22:15:40 UTC

Alexander Larsson 2013-04-16 07:03:22 UTC

David Faure 2013-04-17 08:12:22 UTC

Ryan Lortie 2013-04-17 08:32:37 UTC

Diggory Hardy 2013-04-18 07:28:32 UTC

David Faure 2013-04-19 15:13:56 UTC

Mikhail Ramendik 2013-04-19 21:48:05 UTC

David Faure 2013-04-24 16:26:20 UTC

Mikhail Ramendik 2013-04-24 16:33:45 UTC

David Faure 2014-02-02 09:52:49 UTC

PCMan 2013-04-23 09:21:07 UTC

Diggory Hardy 2013-04-24 12:25:03 UTC

David Faure 2013-04-24 14:50:51 UTC

about - legalese

Loading...