Discussion:
GOP2: 2 - Stable releases and roadmap (radical change)
Graham Percival
2012-06-26 20:55:42 UTC
Permalink
Not quite up to the ideal standard of GOP proposals, but there's a
lot of interest, and this should be enough to see which way the
wind is blowing.

html-formatted version:
http://lilypond.org/~graham/gop/gop_3.html


*** Summary

Let’s drop the “any unintended change” thing, and go totally with
the regression tests. Tests pass? We can make a stable release.
Also, let’s have an official roadmap.


*** Motivation

There seems to be widespread frustration with the current system.
At the moment, any “unintended change” blocks a release (plus a
few extra conditions), so we’re at the mercy of all sorts of
behaviour that isn’t covered by the regtests. This makes it hard
to plan ahead for everybody: developers wanting to work on large
features or refactoring, users, Linux distribution packagers, etc.


*** Details: Critical issues

A type-Critical issue will still block a stable release, but the
definition becomes:

- a reproducible failure of either make or make doc, from an
empty build tree, on a first run, provided that configure does not
report any errors.

- anything which stops contributors from helping out (e.g.
lily-git.tcl not working, source tree(s) not being available,
LilyDev being unable to compile git master, inaccurate
instructions in the Contributor’s Guide section 2, “Quick start”).

To limit the scope of this point, we will assume that the
contributor is using the latest LilyDev and has read the relevant
part(s) of the Contributor’s Guide. Problems in other chapters of
the CG are not sufficient to qualify as Type-Critical.

- any regression test which fails to compile or shows incorrect
output.

The only change is to the third point, namely the “regression test
failure” as opposed to “any unintentional change”.


*** Details: Regtests

The current regtests don’t cover enough – that’s why we keep on
finding new regression-Critical issues. I think it’s worth
expanding the regtests and splitting them into multiple
categories.

These names deliberately don’t match any specific testing
methodology. If they do match one, then it’s probably a mistake
and we should rename them.

Crash: we don’t care about the output of these; we just want
to make sure that lilypond doesn’t crash with this input.
Tiny: these files would test individual features, such as
printing accidentals or slurs, with a minimum of shared features.
Integration: these are constructed examples which combine
multiple features together.
Pieces: musically-interesting fragments of music, such as a
few systems from a Bach sonata or Debussy piano work.
Syntax: short fragments of music for which the .ly files are
“frozen” – we never run convert-ly on these files until LilyPond
4.0. (see below, “roadmap”)

I figure that we’ll double the total number of regtests. There’s
probably some old ones that can be eliminated (or combined with
newer ones), but we’ll be adding a lot more.
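The category split above could be reflected in how the test files are organised. As a minimal sketch (the directory layout input/regression/&lt;category&gt;/&lt;name&gt;.ly is purely an invented illustration, not how the LilyPond source tree is actually arranged):

```python
from pathlib import PurePosixPath

# Category names taken from the proposal above; the on-disk layout
# they map to here is an assumption for illustration only.
CATEGORIES = {"crash", "tiny", "integration", "pieces", "syntax"}

def category_of(test_path):
    """Return the proposed category of a regtest, assuming tests live
    under input/regression/<category>/<name>.ly (a hypothetical layout).
    Files that predate the split fall back to 'uncategorized'."""
    for part in PurePosixPath(test_path).parts:
        if part in CATEGORIES:
            return part
    return "uncategorized"
```

A build target could then run, say, only the "crash" tests by filtering on this function.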


*** Programming regtests

To avoid slowing down programming to a crawl, I figure that we’ll
identify some subset of these regtests and have a separate make
regtests-quick command which only evaluates that subset.

As a rule of thumb, I’d say that the regtests-quick target should
take as long as a make from scratch. I’m sympathetic to developers
with limited computing resources, but I think it’s reasonable to
ask everybody submitting programming patches to “double” the time
it takes to test their patch (since obviously everybody would run
make before submitting anything).
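The selection half of such a regtests-quick target could be as simple as a hand-maintained manifest file. A sketch, assuming a hypothetical manifest name like quick-tests.txt (the make rule itself would then run only the listed .ly files through lilypond):

```python
# Hypothetical helper for a "make regtests-quick" target: the quick
# subset is listed by hand in a manifest file, one filename per line.
# The manifest name and format are assumptions, not existing LilyPond
# build-system conventions.
def parse_quick_manifest(text):
    """Return the regtest filenames listed in a manifest, skipping
    blank lines and '#' comments."""
    tests = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            tests.append(line)
    return tests
```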

The patchy test-patches will still run the full regtest checks.


*** When breakage occurs

There will of course be functionality which breaks. When that
happens, we file a normal bug. A new regtest can only be added for
that bug when it is fixed – we won’t add the regtest first, then
try to fix it.

In other words, git master should always pass all regtests. If it
doesn’t, then reverting should be the first option.


*** Roadmap

With this change, we would no longer be committed to the same kind
of stability that we were before. As such, I think it’s worth
bumping the version up to 3.0.

The 3.x series will bring random breakage, both from
functionality not covered by the existing regtests and from
manual .ly changes required by GLISS. This is intentional – or
rather, we don’t intend to break stuff, but the policy accepts
that this will happen. Somebody may offer to maintain the 2.x
series to cater to users who want additional stability.

Over the next 3 months or so, we’ll discuss a number of syntax
changes in GLISS. Then discussion will cease until all the changes
have been implemented. We’ll then have release 3.2, which will
almost certainly require manual changes to all .ly files.

We’ll then have another few months of GLISS discussions, then a
pause for implementations, then 3.4. Repeat as necessary.

LilyPond 4.0 will mark the ending of GLISS, and by that point we
should have much improved regtest coverage. We can’t really plan
too much for this, since it’s likely two years away.


- Graham
Keith OHara
2012-06-27 05:33:47 UTC
Post by Graham Percival
Let’s drop the “any unintended change” thing, and go totally with
the regression tests. Tests pass? We can make a stable release.
I don't know. Maybe that would be alright. I'm not sure.

The 'Regression' label would become more important, because we will
want to keep track of and fix regressions before too many stable
releases go by. For this purpose, I guess a regression would be a
failure of something that worked on purpose in /any/ stable release.
Post by Graham Percival
- any regression test which fails to compile or shows incorrect
output.
For any changed test then, it is probably worth reading the header, to
see if a subtle change that looks harmless happens to be the point of
the test (and would presumably cause other trouble).

The "incorrect output" should only count where the previous stable
release gave correct output. Lots of tests show behavior that some
people think is wrong (‘accidental-ledger.ly’ ‘ambitus.ly’) or that
looks bad because it is a stress test (‘break.ly’
‘prefatory-separation.ly’ ‘spacing-strict-spacing-grace.ly’).
Post by Graham Percival
*** Details: Regtests
The current regtests don’t cover enough – that’s why we keep on
finding new regression-Critical issues. I think it’s worth
expanding the regtests and splitting them into multiple
categories.
This cannot be done quickly. Adding a few pieces of music in
various styles might help, but I remember from the regression I
caused this cycle that my patch worked fine on the score and parts
of a full symphony, but then version 2.15.37 failed spectacularly
on two pages of guitar music.
Post by Graham Percival
Tiny: these files would test individual features, such as
printing accidentals or slurs, with a minimum of shared features.
As a goal, I suggest "Targeted" instead of "Tiny" -- testing performance
in one narrow area, but thoroughly. More often than not, after I fix
a bug, I find a regtest that should have caught it, and expand that
test rather than add a new one.
Graham Percival
2012-06-27 09:57:41 UTC
Post by Keith OHara
Post by Graham Percival
- any regression test which fails to compile or shows incorrect
output.
For any changed test then, it is probably worth reading the header, to
see if a subtle change that looks harmless happens to be the point of
the test (and would presumably cause other trouble).
Hmm. It could be necessary either to add the texidoc to the
regtest comparison page, or to add markup to the score itself to
highlight any subtle points which are vital.
Post by Keith OHara
The "incorrect output" should only count where the previous stable
release gave correct output. Lots of tests show behavior that some
people think is wrong (‘accidental-ledger.ly’ ‘ambitus.ly’) or that
looks bad because it is a a stress test (‘break.ly’ ‘prefatory-
separation.ly’ ‘spacing-strict-spacing-grace.ly’).
Yes; there is a hidden assumption that the existing regtests have
been exhaustively checked and we agree that they pass (again,
possibly only after making a note in the texidoc that the output
may look "too squished" or something).

I'll make that assumption explicit.
Post by Keith OHara
Post by Graham Percival
*** Details: Regtests
The current regtests don’t cover enough – that’s why we keep on
finding new regression-Critical issues. I think it’s worth
expanding the regtests and splitting them into multiple
categories.
This cannot be done quickly.
Yes; I should have noted that it will likely take 100+ hours.
Post by Keith OHara
Adding a few pieces of music in various styles might help, but I
remember from my regression this cycle that my patch worked fine
on score and parts of a full symphony, but then version 2.15.37
failed spectacularly on two pages of guitar music.
Yes. Under this proposal, version 2.15.37 could have become
2.16.0, even with that spectacular failure known, because that
particular edge case was not checked in the regtests. The fix
would not occur until 2.16.1 or later -- possibly years later.
Post by Keith OHara
Post by Graham Percival
Tiny: these files would test individual features, such as
printing accidentals or slurs, with a minimum of shared features.
As a goal, I suggest "Targeted" instead of "Tiny" -- testing performance
in one narrow area, but thoroughly. More often than not, after I fix
a bug, I find a regtest that should have caught it, and expand that
test rather than add a new one.
good idea! I've modified the name.

- Graham
Janek Warchoł
2012-07-11 15:48:48 UTC
Hi All, Graham,

first, let me apologise for not responding promptly.
Secondly, here's my reply to Graham's almost-original proposition;
i'll send a reply to current discussion ("Clear policy discussions")
separately.

On Tue, Jun 26, 2012 at 10:55 PM, Graham Percival
Post by Graham Percival
Let’s drop the “any unintended change” thing, and go totally with
the regression tests. Tests pass? We can make a stable release.
Also, let’s have an official roadmap.
In general, sounds reasonable but not perfect. See below.
Post by Graham Percival
*** Motivation
There seems to be widespread frustration with the current system.
At the moment, any “unintended change” blocks a release (plus a
few extra conditions), so we’re at the mercy of all sorts of
behaviour that isn’t covered by the regtests. This makes it hard
to plan ahead for everybody: developers wanting to work on large
features or refactoring, users, linux distribution packagers, etc.
Yup. The problem with the current system is the "surprise" part:
we know things break, but it's annoying when they disrupt our plans.
Post by Graham Percival
*** Details: Regtests
The current regtests don’t cover enough – that’s why we keep on
finding new regression-Critical issues. I think it’s worth
expanding the regtests and splitting them into multiple
categories.
[several types described in http://lilypond.org/~graham/gop/gop_3.html]
This is a really good idea, Graham! +10
Post by Graham Percival
In cases where the output may look bad because it is a stress test
(e.g., ‘break.ly’, ‘spacing-strict-spacing-grace.ly’), this fact will be noted
in either the texidoc or as a markup inside the score.
i vote for markup inside the score. Much more convenient imo.
Post by Graham Percival
*** Programming regtests
To avoid slowing down programming to a crawl, I figure that we’ll
identify some subset of these regtests and have a separate make
regtests-quick command which only evaluates that subset.
As a rule of thumb, I’d say that the regtests-quick target should
take as long as a make from scratch.
*very* good idea! +20
Post by Graham Percival
git master should always pass all regtests. If it
doesn’t, then reverting should be the first option.
+1
Post by Graham Percival
*** Roadmap
The 3.x series will consist of a series of random breakage from
functionality not covered under the existing regtests and from
manual .ly changes required by GLISS.
[..]
We’ll then have another few months of GLISS discussions, then a
pause for implementations, then 3.4. Repeat as necessary.
That's ok with me.
Post by Graham Percival
So far there have been c. 75 critical regressions under the
current definition of 'critical' since 2.14. All but one have been
fixed, many of them promptly. This prompt attention IMO
is due only to the fact that they were deemed to block a
stable release. If the only criterion is that the release compiles
the (extended) regtests satisfactorily, then I doubt that adequate
attention will be directed to bugs discovered after the release
that would be deemed critical on the current definition. That
would seriously degrade the quality of our stable releases.
This is a valid concern.
What about something like this:
when a regression against latest stable is found, it's not marked as
critical (as Graham suggests). However, when we make a stable
release, all regressions present in the tracker become critical. In
other words, if current unstable is, say, 2.17.x, regressions against
2.16 aren't critical (don't prevent releasing 2.18), but still-unfixed
regressions against 2.14 are critical.
This way we are "forced" to fix all regressions (sooner or later), but
we eliminate the element of surprise that is so annoying in the
current system. Things may break, but we won't leave them broken too long.
Since the lack of "surprises" should allow frequent stable releases,
all regressions should be fixed pretty quickly. We could actually
adopt a policy of aiming to have a stable release each 3 months, which
should help with that.
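The rule sketched above can be written down mechanically. A hedged illustration (the version parsing is an assumption; it only handles "major.minor[.patch]" strings like those in the example):

```python
# Sketch of the proposed rule: with current unstable 2.17.x, the
# current stable is 2.16, so a regression against 2.14 blocks the
# next stable release while a regression against 2.16 does not.
# Function names and parsing are invented for illustration.
def minor(version):
    """Extract the minor number from a 'major.minor[.patch]' string."""
    return int(version.split(".")[1])

def is_release_blocking(regressed_against, current_unstable):
    """A regression blocks the next stable release iff it regressed
    relative to a stable series older than the current stable."""
    current_stable_minor = minor(current_unstable) - 1
    return minor(regressed_against) < current_stable_minor
```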

How do you like it?
Janek
David Kastrup
2012-07-11 16:07:41 UTC
Post by Janek Warchoł
Post by Trevor Daniels
So far there have been c. 75 critical regressions under the
current definition of 'critical' since 2.14. All but one have been
fixed, many of them promptly. This prompt attention IMO
is due only to the fact that they were deemed to block a
stable release. If the only criterion is that the release compiles
the (extended) regtests satisfactorily, then I doubt that adequate
attention will be directed to bugs discovered after the release
that would be deemed critical on the current definition. That
would seriously degrade the quality of our stable releases.
This is a valid concern.
when a regression against latest stable is found, it's not marked as
critical (as Graham suggests). However, when we make a stable
release, all regressions present in the tracker become critical. In
other words, if current unstable is, say, 2.17.x, regressions against
2.16 aren't critical (don't prevent releasing 2.18), but still-unfixed
regressions against 2.14 are critical.
I don't think that makes much sense. It means that regressions become
important _after_ stable releases. Also it means that it becomes hard
to classify newly discovered regressions: at the current point of time,
LilyPond 2.12 does not even compile on current GCC compilers.

I still maintain that the main problem right now is that all regressions
are considered equally important, and uniformly more important than
bugs.

For example, when the stems don't reach shapenote heads on quarternotes
(while the situation before the stem-inducing regression was that they
did, but instead did not connect on eighths), that is enough to not give
users access to LilyPond 2.16. That typesetting in connection with
grace timing is still broken is much more severe, but it does not
preclude us from making a release.
Post by Janek Warchoł
This way we are "forced" to fix all regressions (sooner or later), but
we eliminate the element of surprise that is so annoying in the
current system.
Regressions are something we want to keep in check, but not at any cost.
Of course, the way to absolutely guarantee no regression is to never
release. That is the course we are currently taking.
Post by Janek Warchoł
Things may break, but we won't leave them broken too long. Since the
lack of "surprises" should allow frequent stable releases, all
regressions should be fixed pretty quickly. We could actually adopt a
policy of aiming to have a stable release each 3 months, which should
help with that.
How do you like it?
3 months is not a useful cycle for a stable release. How do you expect
something like the skyline patches to stabilize in such an amount of
time? It needs retuning a host of parameters, as one consequence. It
would also mean that if we want to provide reasonable update paths to
software distributions with a life time of a year, we would need to
maintain backports for about four stable versions at a time.
--
David Kastrup
Janek Warchoł
2012-07-11 17:04:42 UTC
Post by David Kastrup
Post by Janek Warchoł
when a regression against latest stable is found, it's not marked as
critical (as Graham suggests). However, when we make a stable
release, all regressions present in the tracker become critical. In
other words, if current unstable is, say, 2.17.x, regressions against
2.16 aren't critical (don't prevent releasing 2.18), but still-unfixed
regressions against 2.14 are critical.
I don't think that makes much sense. It means that regressions become
important _after_ stable releases.
Well, this isn't what i meant to be the "spirit" of my proposal.
My idea for the "spirit" of the new policy is:
"Regressions are bad, we cannot ignore them - we have to fix them.
But we aren't their slaves, we don't have to fix them immediately."
Post by David Kastrup
Also it means that it becomes hard
to classify newly discovered regressions: at the current point of time,
LilyPond 2.12 does not even compile on current GCC compilers.
I still maintain that the main problem right now is that all regressions
are considered equally important, and uniformly more important than
bugs.
For example, when the stems don't reach shapenote heads on quarternotes
(while the situation before the stem-inducing regression was that they
did, but instead did not connect on eighths), that is enough to not give
users access to LilyPond 2.16. That typesetting in connection with
grace timing is still broken is much more severe, but it does not
preclude us from making a release.
ok, good points. So, i suggest additionally giving the Project
Manager the power to "upgrade" any newly discovered regression
against the current stable to critical.

Let me rephrase my suggestion then:
Regressions can be critical (prevent new stable) or not. The default
behaviour is to mark regressions against current stable as not
critical (?), and regressions against previous stable as critical.
(This is to ensure that regressions won't disrupt release process, but
also won't stay around forever.)
Any developer (i.e. anyone with push access) is free to upgrade/downgrade a
regression's status to critical/non-critical. Other developers are
free to discuss this. If in doubt, the decision should be made by
Project Manager, who is advised to ask senior developers for their
opinions. Project Manager's decision should be considered final for
at least a month (to avoid endless debates, but also allow reopening
if necessary).

What about this? It gives people power, yet allows
"i'm-feeling-monkeyish" following of guidelines (with hopefully
reasonable outcome nevertheless). It also gives Project Manager
control, but not too much of it and not too much responsibility.
I feel this is a significantly better proposal than my first one.
Post by David Kastrup
3 months is not a useful cycle for a stable release. How do you expect
something like the skyline patches to stabilize in such an amount of
time? It needs retuning hosts of parameters as one consequence. It
would also mean that if we want to provide reasonable update paths to
software distributions with a life time of a year, we would need to
maintain backports for about four stable versions at a time.
Good point. I suggest 6 months, then, with preferred release times to
be May and November (that's because summer tends to be most active
time - let's give big summer projects time to mature and stabilise).
Janek
Graham Percival
2012-07-14 04:31:04 UTC
Post by Janek Warchoł
On Tue, Jun 26, 2012 at 10:55 PM, Graham Percival
Post by Graham Percival
To avoid slowing down programming to a crawl, I figure that we’ll
identify some subset of these regtests and have a separate make
regtests-quick command which only evaluates that subset.
As a rule of thumb, I’d say that the regtests-quick target should
take as long as a make from scratch.
*very* good idea! +20
Well, there's no reason why this needs to be tied to a specific
release policy. There's certainly no harm in implementing the
basic infrastructure of "make regtests-quick", leaving aside any
debate about exactly which files qualify for the "quick" test.

Problem is, somebody needs to sit down and do it. Do you feel
like adding that?
Post by Janek Warchoł
We could actually adopt a policy of aiming to have a stable
release each 3 months, which should help with that.
I definitely think we should have stable releases much faster,
although I'd target every 3-4 months rather than exactly every 3
months. But that won't happen until/unless the regression tests
have much better coverage.

This problem breaks down to:
- 10-20 hours of build system work. (to add the regtests-quick
target)
- 10-100 (?) hours of programmers to investigate+discuss code
paths (?)
- 10-100 (?) hours of helpful users organizing and writing .ly
files to cover all (? or most?) functionality
- possibly 10-20 hours of python programming to extend Patchy
and/or the Paris university server
- somebody to organize the entire effort

I'm sadly not volunteering for any of those tasks. I'm happy to
organize policy discussions about what to do with the results of
those tests (specifically for the 2.18 or 3.0 or other releases),
but I think we need some real effort on the fundamental
infrastructure before it makes sense to have policy discussions on
this.

- Graham
Janek Warchoł
2012-07-14 08:02:48 UTC
On Sat, Jul 14, 2012 at 6:31 AM, Graham Percival
Post by Graham Percival
Post by Janek Warchoł
On Tue, Jun 26, 2012 at 10:55 PM, Graham Percival
Post by Graham Percival
To avoid slowing down programming to a crawl, I figure that we’ll
identify some subset of these regtests and have a separate make
regtests-quick command which only evaluates that subset.
As a rule of thumb, I’d say that the regtests-quick target should
take as long as a make from scratch.
*very* good idea! +20
Well, there's no reason why this needs to be tied to a specific
release policy. There's certainly no harm in implementing the
basic infrastructure of "make regtests-quick", leaving any debate
about exactly which files qualify for the "quick" test.
Problem is, somebody needs to sit down and do it. Do you feel
like adding that?
ok, but not until GSoC finishes. If anyone else is interested, go for it.
Post by Graham Percival
- 10-100 (?) hours of helpful users organizing and writing .ly
files to cover all (? or most?) functionality
I can help here.
Post by Graham Percival
- somebody to organize the entire effort
I can help here.

cheers,
Janek
Janek Warchoł
2012-07-14 11:09:45 UTC
Hi,
Post by Keith OHara
For any changed test then, it is probably worth reading the header, to
see if a subtle change that looks harmless happens to be the point of
the test (and would presumably cause other trouble).
I was thinking about it, too (see the "regtests about very small
differences" thread on devel). My conclusion is that using a bigger
staff (or font) size should be a good solution: it makes the changes
more visible and immediately focuses the reader's attention. See
commit e8fc7813b17822c138150807484197ef8d4e7c21.

Oh, and there's one more thing about regtests that came to my mind: we
should perhaps have some sort of "special" category for regtests
requiring manual attention (example: issue 2656 concerns only
Windows). Of course these tests would be run seldom (only when a new
dev release happens, perhaps?).

cheers,
Janek

Trevor Daniels
2012-07-10 21:50:45 UTC
Graham Percival wrote Tuesday, June 26, 2012 9:55 PM
Post by Graham Percival
*** Summary
Let’s drop the “any unintended change” thing, and go totally with
the regression tests. Tests pass? We can make a stable release.
Also, let’s have an official roadmap.
Rather than discussing each point separately below I prefer to
make some general observations. The present proposal is too
detailed and not sufficiently radical. We need to consider
wider options and consequences first, before honing the details.

1. So far there have been c. 75 critical regressions under the
current definition of 'critical' since 2.14. All but one have been
fixed, many of them promptly. This prompt attention IMO
is due only to the fact that they were deemed to block a
stable release. If the only criterion is that the release compiles
the (extended) regtests satisfactorily, then I doubt that adequate
attention will be directed to bugs discovered after the release
that would be deemed critical on the current definition. That
would seriously degrade the quality of our stable releases.

2. To complete the discussion David and I were having about
the possibility of using revert as an option to fix a critical bug,
I looked at a few recent critical regressions, namely those
which caused Release Candidates 6 and 7 to be abandoned.
None of these could have been easily fixed by reversion,
either because the fix was complicated, the original source
was too old for revert to be safe, or the cause was external
to LP. So reversion offers no easy answer.

3. I rather like the idea of leaving the decision about the time
to make a stable release to an individual. That's what Han-Wen
used to do in the old days, IIRC. That individual could use
tests of his own devising to help him make a decision, but I
would expect him/her to at least canvas views from developers
before deciding. But I worry that this would still suffer from
the problem I outlined in (1) above. Perhaps that release-
meister could identify bugs which (s)he considers are blocking
a new stable. That would get round (1), and ensure serious
bugs are attended to promptly.

4. The other possibility is to adopt timed releases. Say every
6 months. The .0 release would be made with a statement
about any critical bugs, which would be fixed in a .1 release.
Still suffers from (1), so I don't favour this.

On balance, assuming such a person could be found, I would
favour a solution along the lin