Discussion:
[openstack-dev] [keystone][nova][neutron][all] Rolling upgrades: database triggers and oslo.versionedobjects
Steve Martinelli
2016-08-25 17:13:35 UTC
Permalink
The keystone team is pursuing a trigger-based approach to support rolling,
zero-downtime upgrades. The proposed operator experience is documented here:

http://docs.openstack.org/developer/keystone/upgrading.html

This differs from Nova and Neutron's approaches to solve for rolling
upgrades (which use oslo.versionedobjects), however Keystone is one of the
few services that doesn't need to manage communication between multiple
releases of multiple service components talking over the message bus (which
is the original use case for oslo.versionedobjects, and for which it is
aptly suited). Keystone simply scales horizontally and every node talks
directly to the database.

Database triggers are obviously a new challenge for developers to write,
honestly challenging to debug (being side effects), and are made even more
difficult by having to hand write triggers for MySQL, PostgreSQL, and
SQLite independently (SQLAlchemy offers no assistance in this case), as
seen in this patch:

https://review.openstack.org/#/c/355618/
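
To give a concrete feel for that, here is a rough, hypothetical sketch
(Alembic-style, with made-up trigger and column names rather than the ones
in the patch above) of the kind of per-dialect DDL such a migration ends up
carrying by hand:

from alembic import op

MYSQL_TRIGGER = (
    "CREATE TRIGGER credential_insert_sync BEFORE INSERT ON credential "
    "FOR EACH ROW SET NEW.encrypted_blob = NEW.blob")

PG_FUNCTION = (
    "CREATE OR REPLACE FUNCTION credential_sync() RETURNS trigger AS $$ "
    "BEGIN NEW.encrypted_blob := NEW.blob; RETURN NEW; END "
    "$$ LANGUAGE plpgsql")

PG_TRIGGER = (
    "CREATE TRIGGER credential_insert_sync BEFORE INSERT ON credential "
    "FOR EACH ROW EXECUTE PROCEDURE credential_sync()")


def upgrade():
    # SQLAlchemy abstracts none of this, so the migration branches on the
    # dialect and emits hand-written DDL for each backend.
    dialect = op.get_bind().dialect.name
    if dialect == 'mysql':
        op.execute(MYSQL_TRIGGER)
    elif dialect == 'postgresql':
        op.execute(PG_FUNCTION)
        op.execute(PG_TRIGGER)
    # SQLite needs yet another variant, or has to be skipped entirely.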

However, implementing an application-layer solution with
oslo.versionedobjects is not an easy task either; refer to Neutron's
implementation:


https://review.openstack.org/#/q/topic:bp/adopt-oslo-versioned-objects-for-db

Our primary concern at this point is how to effectively test the triggers
we write against our supported database systems, and their various
deployment variations. We might be able to easily drop SQLite support (as
it's only supported for our own test suite), but should we expect variation
in support and/or actual behavior of triggers across the MySQLs, MariaDBs,
Perconas, etc, of the world that would make it necessary to test each of
them independently? If you have operational experience working with
triggers at scale: are there landmines that we need to be aware of? What is
it going to take for us to say we support *zero* downtime upgrades with
confidence?

Steve & Dolph
Sean M. Collins
2016-08-25 17:59:51 UTC
Permalink
Personally I had very bad experiences with stored procedures and
triggers in previous jobs, where the amount of side effects that
occurred and the overall lack of maintainability of triggers and stored
procedures scared me off.

We handed off changes to stored procedures and
triggers to the DBAs, who had a tendency to not apply them correctly or
forget to apply them at a site. Then it was a total nightmare to try to
figure out why things wouldn't work, until we discovered that the
changes to an SP or trigger hadn't actually been applied.

Now, I don't think OpenStack as a project suffers from the same
organizational dysfunction as my previous jobs, but overall triggers are
hard to debug and maintain and I don't like to use them.

/rant
--
Sean M. Collins
Matt Fischer
2016-08-25 18:13:21 UTC
Permalink
Post by Steve Martinelli
The keystone team is pursuing a trigger-based approach to support rolling,
http://docs.openstack.org/developer/keystone/upgrading.html
This differs from Nova and Neutron's approaches to solve for rolling
upgrades (which use oslo.versionedobjects), however Keystone is one of the
few services that doesn't need to manage communication between multiple
releases of multiple service components talking over the message bus (which
is the original use case for oslo.versionedobjects, and for which it is
aptly suited). Keystone simply scales horizontally and every node talks
directly to the database.
Database triggers are obviously a new challenge for developers to write,
honestly challenging to debug (being side effects), and are made even more
difficult by having to hand write triggers for MySQL, PostgreSQL, and
SQLite independently (SQLAlchemy offers no assistance in this case), as
https://review.openstack.org/#/c/355618/
However, implementing an application-layer solution with
oslo.versionedobjects is not an easy task either; refer to Neutron's
https://review.openstack.org/#/q/topic:bp/adopt-oslo-versioned-objects-for-db
Our primary concern at this point is how to effectively test the triggers
we write against our supported database systems, and their various
deployment variations. We might be able to easily drop SQLite support (as
it's only supported for our own test suite), but should we expect variation
in support and/or actual behavior of triggers across the MySQLs, MariaDBs,
Perconas, etc, of the world that would make it necessary to test each of
them independently? If you have operational experience working with
triggers at scale: are there landmines that we need to be aware of? What is
it going to take for us to say we support *zero* downtime upgrades with
confidence?
Steve & Dolph
No experience to add for triggers, but I'm happy to help test this on a
MySQL Galera cluster. I'd also like to add thanks for looking into this. A
keystone outage is a cloud outage, and being able to eliminate such outages from
upgrades will be beneficial to everyone.
Dan Smith
2016-08-25 18:32:24 UTC
Permalink
Post by Steve Martinelli
This differs from Nova and Neutron's approaches to solve for rolling
upgrades (which use oslo.versionedobjects), however Keystone is one of
the few services that doesn't need to manage communication between
multiple releases of multiple service components talking over the
message bus (which is the original use case for oslo.versionedobjects,
and for which it is aptly suited). Keystone simply scales horizontally
and every node talks directly to the database.
Yeah, o.vo gives you nothing really if all you want is a facade behind
which to hide the application-level migrations. That doesn't mean it
would be a bad thing to use, but maybe overkill vs. just writing a
couple wrappers.
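
For what it's worth, a minimal sketch of that "couple wrappers" idea - a
plain object facade with no o.vo inheritance, using hypothetical column
names (an old "blob" column and a new "encrypted_blob" column):

class Credential(object):
    def __init__(self, db_row):
        # db_row: dict-like row as returned by the data access layer.
        self._row = db_row

    @property
    def encrypted_blob(self):
        # Prefer the new column; fall back to the old one for rows
        # written by the previous release.
        if self._row.get('encrypted_blob') is not None:
            return self._row['encrypted_blob']
        return self._row.get('blob')

The rest of the code only ever sees the facade, so it doesn't care which
release wrote the row.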
Post by Steve Martinelli
Database triggers are obviously a new challenge for developers to write,
honestly challenging to debug (being side effects), and are made even
more difficult by having to hand write triggers for MySQL, PostgreSQL,
and SQLite independently (SQLAlchemy offers no assistance in this case),
https://review.openstack.org/#/c/355618/
However, implementing an application-layer solution with
oslo.versionedobjects is not an easy task either; refer to Neutron's
Yeah, it's not trivial at the application level either but at least it
is in python and write-once for any kind of compatible backend. My
(extremely limited) experience with stored procedures is that they are
very difficult to get right, even as an expert in the technology, which
almost none of us are. Application-level migrations are significantly
simpler and exist closer to the domain of the rest of the code for a
specific new feature.

I will offer one bit of anecdotal information that may be relevant:
Several of the migrations that nova has done in the past have required
things like parsing/generating JSON, and making other API calls to look
up information needed to translate from one format to another. That
would (AFAIK) be quite difficult to do in the database itself, and may
mean you end up with a combination of both approaches in the long run.

I don't think that keystone necessarily needs to adopt the same approach
as the other projects (especially in the absence of things like
cross-version RPC compatibility) and so if stored procedures are really
the best fit then that's cool. They will themselves be a landmine in
front of me should I ever have to debug such a problem, but if they are
significantly better for the most part then so be it.

--Dan
gordon chung
2016-08-25 19:51:39 UTC
Permalink
On 25/08/16 01:13 PM, Steve Martinelli wrote:
The keystone team is pursuing a trigger-based approach to support rolling, zero-downtime upgrades. The proposed operator experience is documented here:

http://docs.openstack.org/developer/keystone/upgrading.html

This differs from Nova and Neutron's approaches to solve for rolling upgrades (which use oslo.versionedobjects), however Keystone is one of the few services that doesn't need to manage communication between multiple releases of multiple service components talking over the message bus (which is the original use case for oslo.versionedobjects, and for which it is aptly suited). Keystone simply scales horizontally and every node talks directly to the database.


just curious, but does Keystone have any IPC or is it still just a single service interacting with the db? if the latter, you should be able to just apply migrations with no downtime as long as you don't modify/delete existing columns. similar experience to others: haven't really used stored procedures in a while, but they're a pain wrt portability. considering OpenStack has a habit of supporting every driver under the sun, i'm guessing driver-specific solutions will get more difficult over time.

cheers,


--
gord
Sean Dague
2016-08-25 21:14:13 UTC
Permalink
Post by Steve Martinelli
The keystone team is pursuing a trigger-based approach to support
rolling, zero-downtime upgrades. The proposed operator experience is
http://docs.openstack.org/developer/keystone/upgrading.html
This differs from Nova and Neutron's approaches to solve for rolling
upgrades (which use oslo.versionedobjects), however Keystone is one of
the few services that doesn't need to manage communication between
multiple releases of multiple service components talking over the
message bus (which is the original use case for oslo.versionedobjects,
and for which it is aptly suited). Keystone simply scales horizontally
and every node talks directly to the database.
Database triggers are obviously a new challenge for developers to write,
honestly challenging to debug (being side effects), and are made even
more difficult by having to hand write triggers for MySQL, PostgreSQL,
and SQLite independently (SQLAlchemy offers no assistance in this case),
https://review.openstack.org/#/c/355618/
However, implementing an application-layer solution with
oslo.versionedobjects is not an easy task either; refer to Neutron's
https://review.openstack.org/#/q/topic:bp/adopt-oslo-versioned-objects-for-db
Our primary concern at this point is how to effectively test the
triggers we write against our supported database systems, and their
various deployment variations. We might be able to easily drop SQLite
support (as it's only supported for our own test suite), but should we
expect variation in support and/or actual behavior of triggers across
the MySQLs, MariaDBs, Perconas, etc, of the world that would make it
necessary to test each of them independently? If you have operational
experience working with triggers at scale: are there landmines that we
need to be aware of? What is it going to take for us to say we support
*zero* downtime upgrades with confidence?
I would really hold off doing anything triggers related until there was
sufficient testing for that, especially with potentially dirty data.

Triggers also really bring in a whole new DSL that people need to learn
and understand, not just across this boundary, but when debugging
issues in the future. And it means that any errors happening here are now in
a place outside of normal logging / recovery mechanisms.

There is a lot of value in keeping to common patterns between projects in
hard problem spaces like zero-downtime upgrades, because there are limited
folks with the domain knowledge, and splitting that even further makes it
harder to make this universal among projects.

-Sean
--
Sean Dague
http://dague.net
Monty Taylor
2016-08-31 22:18:03 UTC
Permalink
Post by Sean Dague
Post by Steve Martinelli
The keystone team is pursuing a trigger-based approach to support
rolling, zero-downtime upgrades. The proposed operator experience is
http://docs.openstack.org/developer/keystone/upgrading.html
This differs from Nova and Neutron's approaches to solve for rolling
upgrades (which use oslo.versionedobjects), however Keystone is one of
the few services that doesn't need to manage communication between
multiple releases of multiple service components talking over the
message bus (which is the original use case for oslo.versionedobjects,
and for which it is aptly suited). Keystone simply scales horizontally
and every node talks directly to the database.
Database triggers are obviously a new challenge for developers to write,
honestly challenging to debug (being side effects), and are made even
more difficult by having to hand write triggers for MySQL, PostgreSQL,
and SQLite independently (SQLAlchemy offers no assistance in this case),
https://review.openstack.org/#/c/355618/
However, implementing an application-layer solution with
oslo.versionedobjects is not an easy task either; refer to Neutron's
https://review.openstack.org/#/q/topic:bp/adopt-oslo-versioned-objects-for-db
Our primary concern at this point is how to effectively test the
triggers we write against our supported database systems, and their
various deployment variations. We might be able to easily drop SQLite
support (as it's only supported for our own test suite), but should we
expect variation in support and/or actual behavior of triggers across
the MySQLs, MariaDBs, Perconas, etc, of the world that would make it
necessary to test each of them independently? If you have operational
experience working with triggers at scale: are there landmines that we
need to be aware of? What is it going to take for us to say we support
*zero* downtime upgrades with confidence?
I would really hold off doing anything triggers related until there was
sufficient testing for that, especially with potentially dirty data.
Triggers also really bring in a whole new DSL that people need to learn
and understand, not just across this boundary, but in the future
debugging issues. And it means that any errors happening here are now in
a place outside of normal logging / recovery mechanisms.
There is a lot of value in keeping to common patterns between projects in
hard problem spaces like zero-downtime upgrades, because there are limited
folks with the domain knowledge, and splitting that even further makes it
harder to make this universal among projects.
I said this the other day in the IRC channel, and I'm going to say it
again here. I'm going to do it as bluntly as I can - please keep in
mind that I respect all of the humans involved.

I think this is a monstrously terrible idea.

There are MANY reasons for this - but I'm going to limit myself to two.

OpenStack is One Project
========================

Nova and Neutron have an approach for this. It may or may not be ideal -
but it exists right now. While it can be satisfying to discount the
existing approach and write a new one, I do not believe that is in the
best interests of OpenStack as a whole. To diverge in _keystone_ - which
is one of the few projects that must exist in every OpenStack install -
when there exists an approach in the two other most commonly deployed
projects - is such a terrible example of the problems inherent in
Conway's Law that it makes me want to push up a proposal to dissolve all
of the individual project teams and merge all of the repos into a single
repo.

Make the oslo libraries Nova and Neutron are using better. Work with the
Nova and Neutron teams on a consolidated approach. We need to be driving
more towards an OpenStack that behaves as if it wasn't written by
warring factions of developers who barely communicate.

Even if the idea was one I thought was good technically, the above would
still trump that. Work with Nova and Neutron. Be more similar.

PLEASE

BUT - I also don't think it's a good technical solution. That isn't
because triggers don't work in MySQL (they do) - but because we've spent
the last six years explicitly NOT writing raw SQL. We've chosen an
abstraction layer (SQLAlchemy) which does its job well.

IF this were going to be accompanied by a corresponding shift in
approach to not support any backends but MySQL and to start writing our
database interactions directly in SQL in ALL of our projects - I could
MAYBE be convinced. Even then I think doing it in triggers is the wrong
place to put logic.

"Database triggers are obviously a new challenge for developers to
write, honestly challenging to debug (being side effects), and are made
even more difficult by having to hand write triggers for MySQL,
PostgreSQL, and SQLite independently (SQLAlchemy offers no assistance in
this case)"

If you look at:

https://review.openstack.org/#/c/355618/40/keystone/common/sql/expand_repo/versions/002_add_key_hash_and_encrypted_blob_to_credential.py

You will see the three different SQL dialects required to do this. Not only that, but
some of the more esoteric corners of those backends. We can barely get
_indexes_ right in our database layers ... now we think we're going to
get triggers right? Consistently? And handle things like Galera?

The other option is app level, which is what nova and neutron are doing.
It's a good option, because it puts the logic in python, which is a
thing we have 2500 developers fairly well versed in. It's also scalable,
as the things executing whatever the logic is are themselves a scale-out
set of servers. Finally, it's a known and accepted pattern in large
scale MySQL shops ... Roll out a new version of the app code which
understands both the old and the new schema version, then roll out a
no-downtime additive schema change to the database, then have the app
layer process and handle on the fly transformation if needed.

SO ...

Just do what Nova and Neutron are doing - and if it's not good enough,
fix it. Having some projects use triggers and other projects not use
triggers is one of the more epically crazypants things I've heard around
here ... and I lived through the twisted/eventlet argument.

Monty
David Stanek
2016-09-01 13:57:40 UTC
Permalink
Post by Monty Taylor
Nova and Neutron have an approach for this. It may or may not be ideal -
but it exists right now. While it can be satisfying to discount the
existing approach and write a new one, I do not believe that is in the
best interests of OpenStack as a whole. To diverge in _keystone_ - which
is one of the few projects that must exist in every OpenStack install -
when there exists an approach in the two other most commonly deployed
projects - is such a terrible example of the problems inherent in
Conway's Law that it makes me want to push up a proposal to dissolve all
of the individual project teams and merge all of the repos into a single
repo.
That's a bit overly dramatic. I think having some innovation is a good
thing. Specifically in this case where our needs appear to be a little
simpler than those of nova.
Post by Monty Taylor
Make the oslo libraries Nova and Neutron are using better. Work with the
Nova and Neutron teams on a consolidated approach. We need to be driving
more towards an OpenStack that behaves as if it wasn't written by
warring factions of developers who barely communicate.
I believe we tried to keep to the same expand/migrate/contract
patterns. Sure our implementation differs, but I don't see operators
caring about that as long as it works.
Post by Monty Taylor
Even if the idea was one I thought was good technically, the above would
still trump that. Work with Nova and Neutron. Be more similar.
PLEASE
BUT - I also don't think it's a good technical solution. That isn't
because triggers don't work in MySQL (they do) - but because we've spent
the last six years explicitly NOT writing raw SQL. We've chosen an
abstraction layer (SQLAlchemy) which does its job well.
IF this were going to be accompanied by a corresponding shift in
approach to not support any backends but MySQL and to start writing our
database interactions directly in SQL in ALL of our projects - I could
MAYBE be convinced. Even then I think doing it in triggers is the wrong
place to put logic.
"Database triggers are obviously a new challenge for developers to
write, honestly challenging to debug (being side effects), and are made
even more difficult by having to hand write triggers for MySQL,
PostgreSQL, and SQLite independently (SQLAlchemy offers no assistance in
this case)"
https://review.openstack.org/#/c/355618/40/keystone/common/sql/expand_repo/versions/002_add_key_hash_and_encrypted_blob_to_credential.py
You will see the three different SQL dialects required to do this. Not only that, but
some of the more esoteric corners of those backends. We can barely get
_indexes_ right in our database layers ... now we think we're going to
get triggers right? Consistently? And handle things like Galera?
The other option is app level, which is what nova and neutron are doing.
It's a good option, because it puts the logic in python, which is a
thing we have 2500 developers fairly well versed in. It's also scalable,
as the things executing whatever the logic is are themselves a scale-out
set of servers. Finally, it's a known and accepted pattern in large
scale MySQL shops ... Roll out a new version of the app code which
understands both the old and the new schema version, then roll out a
no-downtime additive schema change to the database, then have the app
layer process and handle on the fly transformation if needed.
I've done both types of migrations in the past, but with one important
exception. We could roll out our application on Tuesday and then the
cleanup on Thursday. We didn't carry baggage for 6 months to a year. My
fear with keystone is that we'd slow development even more by adding
more cruft, and cruft on top of cruft.
Post by Monty Taylor
SO ...
Just do what Nova and Neutron are doing - and if it's not good enough,
fix it. Having some projects use triggers and other projects not use
triggers is one of the more epically crazypants things I've heard around
here ... and I lived through the twisted/eventlet argument.
--
David Stanek
web: http://dstanek.com
blog: http://traceback.org
Mike Bayer
2016-09-01 14:39:09 UTC
Permalink
Post by Monty Taylor
I said this the other day in the IRC channel, and I'm going to say it
again here. I'm going to do it as bluntly as I can - please keeping in
mind that I respect all of the humans involved.
I think this is a monstrously terrible idea.
There are MANY reasons for this -but I'm going to limit myself to two.
OpenStack is One Project
========================
Nova and Neutron have an approach for this. It may or may not be ideal -
but it exists right now. While it can be satisfying to discount the
existing approach and write a new one, I do not believe that is in the
best interests of OpenStack as a whole. To diverge in _keystone_ - which
is one of the few projects that must exist in every OpenStack install -
when there exists an approach in the two other most commonly deployed
projects - is such a terrible example of the problems inherent in
Conway's Law that it makes me want to push up a proposal to dissolve all
of the individual project teams and merge all of the repos into a single
repo.
So that is fine. However, correct me if I'm wrong but you're proposing
just that these projects migrate to also use a new service layer with
oslo.versionedobjects, because IIUC Nova/Neutron's approach is dependent
on that area of indirection being present. Otherwise, if you meant
something like, "use an approach that's kind of like what Nova does w/
versionedobjects but without actually having to use versionedobjects",
that still sounds like, "come up with a new idea".

I suppose if you're thinking more at the macro level, where "current
approach" means "do whatever you have to on the app side", then your
position is consistent, but I think there's still a lot of confusion in
that area when the indirection of a versioned service layer is not
present. It gets into the SQL nastiness I was discussing w/ Clint and
I don't see anyone doing anything like that yet.

Triggers aside, since they clearly are "triggering" (ahem) allergic
reactions, what's the approach when new approaches are devised that are
alternatives to what "exists right now"? E.g. I have yet another
proposal in the works that allows for SQL-level translations but runs in
the Python application space and does not use triggers. Should I stop
right now because Nova/Neutron already have a system that's "good
enough"? This would be fine. I find it uncomfortable working in this
ambiguous space where some projects rightly proclaim they've solved a
problem, and others continue to disregard that and plow forward with
other approaches without a universally accepted reason why the current
solution is not feasible.
Post by Monty Taylor
BUT - I also don't think it's a good technical solution. That isn't
because triggers don't work in MySQL (they do) - but because we've spent
the last six years explicitly NOT writing raw SQL. We've chosen an
abstraction layer (SQLAlchemy) which does its job well.
There's a canard in there which is that all along I've been proposing to
start adding systems to oslo.db to help produce and maintain triggers
which certainly would have among its goals that consuming projects
wouldn't be writing raw SQL. That part of the discomfort is more
manageable than Clint's, which is that he doesn't want the database
doing things with the data other than storing it, and I totally know
where he's coming from on that.

The "be more similar" argument would be the only one you have to make.
It basically says, "problem X is 'solved', other approaches are now
unnecessary". I'm skeptical that I am reading that correctly. I have
another approach to the issue of "rolling upgrades where we really need
to translate at the SQL layer" that is in some ways similar to what
triggers do, but entirely within the abstraction layer that you so
appropriately appreciate :). I have a binary decision to make here,
"do i work on this new idea that Glance has already expressed an
interest in and Keystone might like also? Or do I not, because this
problem is solved?". I have other projects to work on, so it's not
like I'm looking for more. It's just I'd like to see Glance and others
have their rolling upgrades problem solved, at least with the benefit of
a fixed and predictable pattern, rather than every schema change being
an ongoing seat-of-the-pants type of operation as it is right now.

Post by Monty Taylor
Finally, it's a known and accepted pattern in large
scale MySQL shops ... Roll out a new version of the app code which
understands both the old and the new schema version, then roll out a
no-downtime additive schema change to the database, then have the app
layer process and handle on the fly transformation if needed.
Right, as I've mentioned previously, I only take issue with the
"monolithic app code that speaks both versions of the schema" part.
Assuming there's no layer of service indirection where migration issues
can be finessed outside of the SQL interaction layer, it means every
migration has to be accompanied by a game plan for all the places the
SQL interaction logic must be changed and complicated, in totally ad-hoc
ways that need to be devised and tested from scratch each time. "Large
scale MySQL shops" at least have the benefit that they themselves are
running "staging" and "production" in-house and can just push out live
patches all day long to stumble around this kind of thing.
Dan Smith
2016-09-01 15:52:01 UTC
Permalink
Post by Mike Bayer
So that is fine. However, correct me if I'm wrong but you're
proposing just that these projects migrate to also use a new service
layer with oslo.versionedobjects, because IIUC Nova/Neutron's
approach is dependent on that area of indirection being present.
Otherwise, if you meant something like, "use an approach that's kind
of like what Nova does w/ versionedobjects but without actually
having to use versionedobjects", that still sounds like, "come up
with a new idea".
If you don't need the RPC bits, versionedobjects is nothing more than an
object facade for you to insulate your upper layers from such change.
Writing your facade using versionedobjects just means inheriting from a
superclass that does a bunch of stuff you don't need. So I would not say
that taking the same general approach without that inheritance is "come
up with a new idea".

Using triggers and magic to solve this instead of an application-level
facade is a substantially different approach to the problem.
Post by Mike Bayer
I suppose if you're thinking more at the macro level, where "current
approach" means "do whatever you have to on the app side", then your
position is consistent, but I think there's still a lot of
confusion in that area when the indirection of a versioned service
layer is not present. It gets into the SQL nastiness I was discussing
w/ Clint and I don't see anyone doing anything like that yet.
The indirection service is really unrelated to this discussion, IMHO. If
you take RPC out of the picture, all you have left is a
direct-to-the-database facade to handle the fact that schema has
expanded underneath you. As Clint (et al) have said -- designing the
application to expect schema expansion (and avoiding unnecessary
contraction) is the key here.

--Dan
Mike Bayer
2016-09-01 17:50:14 UTC
Permalink
Post by Dan Smith
The indirection service is really unrelated to this discussion, IMHO. If
you take RPC out of the picture, all you have left is a
direct-to-the-database facade to handle the fact that schema has
expanded underneath you. As Clint (et al) have said -- designing the
application to expect schema expansion (and avoiding unnecessary
contraction) is the key here.
pretty much. there's no fixed pattern in how to do these. Every
version of a data access API will be weighed down with baggage from the
previous version and an inability to take full advantage of new
improvements until the next release, and background migrations are
complicated by the old application undoing their work. Even small
migrations mean all these issues have to be considered each time on a
case-by-case basis. These are the problems people are hoping to
improve upon if possible. The spec at
https://review.openstack.org/#/c/331740/ is discussing these issues in
detail and is the first such specification I've seen that tries to get
into it at this level.
Jeremy Stanley
2016-09-01 18:01:13 UTC
Permalink
[...]
Post by Mike Bayer
Post by Monty Taylor
OpenStack is One Project
========================
Nova and Neutron have an approach for this. It may or may not be
ideal - but it exists right now. While it can be satisfying to
discount the existing approach and write a new one, I do not
believe that is in the best interests of OpenStack as a whole.
To diverge in _keystone_ - which is one of the few projects that
must exist in every OpenStack install - when there exists an
approach in the two other most commonly deployed projects - is
such a terrible example of the problems inherent in Conway's Law
that it makes me want to push up a proposal to dissolve all of
the individual project teams and merge all of the repos into a
single repo.
[...]
Post by Mike Bayer
The "be more similar" argument would be the only one you have to
make. It basically says, "problem X is 'solved', other approaches
are now unnecessary". I'm skeptical that I am reading that
correctly. I have another approach to the issue of "rolling
upgrades where we really need to translate at the SQL layer" that
is in some ways similar to what triggers do, but entirely within
the abstraction layer that you so appropriately appreciate :). I
have a binary decision to make here, "do i work on this new idea
that Glance has already expressed an interest in and Keystone
might like also? Or do I not, because this problem is solved?". I
have other projects to work on, so it's not like I'm looking for
more. It's just I'd like to see Glance and others have their
rolling upgrades problem solved, at least with the benefit of a
fixed and predictable pattern, rather than every schema change
being an ongoing seat-of-the-pants type of operation as it is
right now.
[...]

You (presumably accidentally) snipped the next paragraph of context,
Post by Mike Bayer
Post by Monty Taylor
Make the oslo libraries Nova and Neutron are using better. Work
with the Nova and Neutron teams on a consolidated approach.
[...]

I don't read that at all as suggesting "the problem is solved, go
away" but rather "help us make it better for everyone, don't just
take one project off in a new direction and leave the others
behind."
--
Jeremy Stanley
Michael Bayer
2016-09-01 18:40:37 UTC
Permalink
Post by Jeremy Stanley
I don't read that at all as suggesting "the problem is solved, go
away" but rather "help us make it better for everyone, don't just
take one project off in a new direction and leave the others
behind."
I can clarify. I don't work directly on glance or keystone, I do oslo.db,
sqlalchemy, and alembic development. If it's decided that the approach is
"no special technique, just query more columns and tables in your data
access layer and straddle across API versions", that does not indicate any
new patterns or tools in Oslo or further up, hence "solved" in that the
techniques are already available. If OTOH we are getting into triggers, or
this idea I have of doing Python-level translation events on the write side,
that indicates the need for new library features and patterns.

I've been tasked with being ready to assist Nova and Neutron with online
migrations for over a year. Other than helping Neutron get
expand/contract going, I've not been involved at all, and not with anything
related to data migrations. There hasn't been any need.
Mike Bayer
2016-08-26 15:50:24 UTC
Permalink
Post by Steve Martinelli
The keystone team is pursuing a trigger-based approach to support
rolling, zero-downtime upgrades. The proposed operator experience is
http://docs.openstack.org/developer/keystone/upgrading.html
This differs from Nova and Neutron's approaches to solve for rolling
upgrades (which use oslo.versionedobjects), however Keystone is one of
the few services that doesn't need to manage communication between
multiple releases of multiple service components talking over the
message bus (which is the original use case for oslo.versionedobjects,
and for which it is aptly suited). Keystone simply scales horizontally
and every node talks directly to the database.
Hi Steve -

I'm a strong proponent of looking into the use of triggers to smooth
upgrades between database versions. Even in the case of projects
using versioned objects, it still means a SQL layer has to include
functionality for both versions of a particular schema change which
itself is awkward. I'm also still a little worried that not every case
of this can be handled by orchestration at the API level, rather than as a
single SQL-layer method that integrates both versions of a schema change.

Using triggers would resolve the issue of SQL-specific application code
needing to refer to two versions of a schema at once, at least for those
areas where triggers and SPs can handle it. In the "ideal", it means
all the Python code can just refer to one version of a schema, and nuts
and bolts embedded into database migrations would handle all the
movement between schema versions, including the phase between expand and
contract. Not that I think the "ideal" is ever going to be realized
100%, but maybe in some / many places, this can work.

So if Keystone wants to be involved in paving the way for working with
triggers, IMO this would benefit other projects in that they could
leverage this kind of functionality in those places where it makes sense.

The problem of "zero downtime database migrations" is an incredibly
ambitious goal and I think it would be wrong to exclude any one
particular technique in pursuing this. A real-world success story would
likely integrate many different techniques as they apply to specific
scenarios, and triggers and SPs IMO are a really major one which I
believe can be supported.
Post by Steve Martinelli
Database triggers are obviously a new challenge for developers to write,
honestly challenging to debug (being side effects), and are made even
more difficult by having to hand write triggers for MySQL, PostgreSQL,
and SQLite independently (SQLAlchemy offers no assistance in this case),
So I would also note that we've been working on the availability of
triggers and stored functions elsewhere, a very raw patch that is to be
largely rolled into oslo.db is here:

https://review.openstack.org/#/c/314054/

This patch makes use of an Alembic pattern called "replaceable object",
which is intended specifically as a means of versioning things like
triggers and stored procedures:

http://alembic.zzzcomputing.com/en/latest/cookbook.html#replaceable-objects
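
Condensed down, the heart of that recipe is just a named object holding
the DDL, which migrations can then create, drop, or replace as a unit.
Simplified here (plain op.execute() instead of the cookbook's registered
operations, and a hypothetical MySQL-flavored trigger):

from alembic import op


class ReplaceableObject(object):
    """Versionable container for a trigger or stored procedure."""
    def __init__(self, name, sqltext):
        self.name = name
        self.sqltext = sqltext


credential_sync = ReplaceableObject(
    "credential_insert_sync",
    "BEFORE INSERT ON credential FOR EACH ROW "
    "SET NEW.encrypted_blob = NEW.blob")


def upgrade():
    # The migration only refers to the named object; the DDL string lives
    # in exactly one place.
    op.execute("CREATE TRIGGER %s %s" % (
        credential_sync.name, credential_sync.sqltext))


def downgrade():
    op.execute("DROP TRIGGER %s" % credential_sync.name)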

Within the above Neutron patch, one thing I want to move towards is that
things like triggers and SPs would only need to be specified once, in
the migration layer, and not within the model. To achieve this, tests
that work against MySQL and Postgresql would need to ensure that the
test schema is built up using migrations, and not create_all. This is
already the case in some places and not in others. There is work
ongoing in oslo.db to provide a modernized fixture system that supports
enginefacade cleanly as well as allows for migrations to be used
efficiently (read: once per many tests) for all MySQL/Postgresql test
suites, at https://review.openstack.org/#/c/351411/ .

As far as SQLite, I have a simple opinion with SQLite which is that
migrations, triggers, and SPs should not be anywhere near a SQLite
database. SQLite should be used strictly for simple model unit tests,
the schema is created using create_all(), and that's it. The test
fixture system accommodates this as well.
Post by Steve Martinelli
Our primary concern at this point is how to effectively test the
triggers we write against our supported database systems, and their
various deployment variations. We might be able to easily drop SQLite
support (as it's only supported for our own test suite), but should we
expect variation in support and/or actual behavior of triggers across
the MySQLs, MariaDBs, Perconas, etc, of the world that would make it
necessary to test each of them independently? If you have operational
experience working with triggers at scale: are there landmines that we
need to be aware of? What is it going to take for us to say we support
*zero* downtime upgrades with confidence?
*zero* downtime is an extremely difficult goal. I appreciate that
people are generally nervous about making more use of relational
database features in order to help with this, however as long as the
goal includes an application that can communicate with a database that
is literally in flux as far as its schema, this is already an exotic
goal. Triggers and stored procedures are in fact very boring.

SQLAlchemy does support these features, it's just the specific trigger
and SP languages themselves are written as strings, not as elaborate
Python expressions. This should be seen as a good thing. Trigger and
SP languages are not like SQL in that they are not very declarative at
all, they are imperative. SQLAlchemy's abstraction of SQL into Python
expressions only works to the degree that the SQL itself is primarily
declarative. It would not be feasible to take on the task of producing
an imperative stored procedure / trigger language that compiles into
vendor-specific dialects.
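
To illustrate "the body is a string, but SQLAlchemy still helps you emit
it": something along these lines works today with DDL() and event.listen()
(MySQL-flavored DDL, hypothetical table and trigger names):

from sqlalchemy import Column, DDL, MetaData, String, Table, Text, event

metadata = MetaData()
credential = Table(
    'credential', metadata,
    Column('id', String(64), primary_key=True),
    Column('blob', Text()),
    Column('encrypted_blob', Text()))

sync_trigger = DDL(
    "CREATE TRIGGER credential_insert_sync BEFORE INSERT ON credential "
    "FOR EACH ROW SET NEW.encrypted_blob = NEW.blob")

# Emit the trigger whenever the table is created, but only on MySQL;
# other dialects would register their own string variants.
event.listen(credential, 'after_create',
             sync_trigger.execute_if(dialect='mysql'))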

For the use case of Openstack database migrations, I would hope that a
subset of the triggers and SPs that support live migrations would fall
into a fixed "vocabulary", each meeting various requirements, that are
of general use across projects and could be maintained in oslo.db
itself. For more elaborate cases, like "version A has the data in
three tables, version B has them in a single JSON blob", that probably
needs more specific logic.

I would encourage project teams to please loop me in on discussions
about triggers, SPs and migrations.
Post by Steve Martinelli
Steve & Dolph
Clint Byrum
2016-08-30 13:57:57 UTC
Permalink
Post by Mike Bayer
Post by Steve Martinelli
The keystone team is pursuing a trigger-based approach to support
rolling, zero-downtime upgrades. The proposed operator experience is
http://docs.openstack.org/developer/keystone/upgrading.html
This differs from Nova and Neutron's approaches to solve for rolling
upgrades (which use oslo.versionedobjects), however Keystone is one of
the few services that doesn't need to manage communication between
multiple releases of multiple service components talking over the
message bus (which is the original use case for oslo.versionedobjects,
and for which it is aptly suited). Keystone simply scales horizontally
and every node talks directly to the database.
Hi Steve -
I'm a strong proponent of looking into the use of triggers to smooth
upgrades between database versions. Even in the case of projects
using versioned objects, it still means a SQL layer has to include
functionality for both versions of a particular schema change which
itself is awkward. I'm also still a little worried that not every case
of this can be handled by orchestration at the API level, and not as a
single SQL layer method that integrates both versions of a schema change.
Speaking as an operator, I'd rather have awkwardness happen in safe, warm
development, rather than in the cold, dirty, broken world of operations.

Speaking as a former DBA: Triggers introduce emergent behaviors and
complicate scaling and reasonable debugging in somewhat hidden ways that
can frustrate even the most experienced DBA. We've discussed FK's before,
and how they are a 1:1 trade-off of integrity vs. performance, and thus
deserve more scrutiny than they're typically given. Well IMO, triggers are
a 1:10 trade-off between development complexity and debugging complexity.

Speaking as a developer: Every case can in fact be handled simply and
in code without the database's help if we're willing to accept a small
level of imperfection and redundancy.
Post by Mike Bayer
Using triggers would resolve the issue of SQL-specific application code
needing to refer to two versions of a schema at once, at least for those
areas where triggers and SPs can handle it. In the "ideal", it means
all the Python code can just refer to one version of a schema, and nuts
and bolts embedded into database migrations would handle all the
movement between schema versions, including the phase between expand and
contract. Not that I think the "ideal" is ever going to be realized
100%, but maybe in some / many places, this can work.
As someone else brought up, this is an unnecessarily bleak view of how database
migrations work.

It's simple, these are the holy SQL schema commandments:

Don't delete columns, ignore them.
Don't change columns, create new ones.
When you create a column, give it a default that makes sense.
Do not add new foreign key constraints.

Following these commandments, one can run schema changes at any time. A
new schema should be completely ignorable by older code, because the
columns it relies on keep working, and no new requirements are introduced.
New code can deal with defaulted new columns gracefully.
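
As an illustration, a purely additive Alembic-style migration in that
spirit (table and column names are made up) looks like this:

import sqlalchemy as sa
from alembic import op


def upgrade():
    # Add a new column rather than altering or dropping an old one, and
    # give it a server default so rows written by old code stay valid.
    op.add_column(
        'book',
        sa.Column('edition', sa.Integer(), nullable=False,
                  server_default='0'))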

Of course, once one can be certain that all app code is updated, one can
drop old columns and tables, and add FK constraints (if you so desire,
I personally think they're a waste of precious DB resources, but that
is a much more religious debate and I accept that it's not part of
this debate).
Post by Mike Bayer
So if Keystone wants to be involved in paving the way for working with
triggers, IMO this would benefit other projects in that they could
leverage this kind of functionality in those places where it makes sense.
The problem of "zero downtime database migrations" is an incredibly
ambitious goal and I think it would be wrong to exclude any one
particular technique in pursuing this. A real-world success story would
likely integrate many different techniques as they apply to specific
scenarios, and triggers and SPs IMO are a really major one which I
believe can be supported.
I don't think it's all that ambitious to think we can just use tried and
tested schema evolution techniques that work for everyone else.
Post by Mike Bayer
Post by Steve Martinelli
Database triggers are obviously a new challenge for developers to write,
honestly challenging to debug (being side effects), and are made even
more difficult by having to hand write triggers for MySQL, PostgreSQL,
and SQLite independently (SQLAlchemy offers no assistance in this case),
So I would also note that we've been working on the availability of
triggers and stored functions elsewhere, a very raw patch that is to be
https://review.openstack.org/#/c/314054/
This patch makes use of an Alembic pattern called "replaceable object",
which is intended specifically as a means of versioning things like
http://alembic.zzzcomputing.com/en/latest/cookbook.html#replaceable-objects
Within the above Neutron patch, one thing I want to move towards is that
things like triggers and SPs would only need to be specified once, in
the migration layer, and not within the model. To achieve this, tests
that work against MySQL and Postgresql would need to ensure that the
test schema is built up using migrations, and not create_all. This is
already the case in some places and not in others. There is work
ongoing in oslo.db to provide a modernized fixture system that supports
enginefacade cleanly as well as allows for migrations to be used
efficiently (read: once per many tests) for all MySQL/Postgresql test
suites, at https://review.openstack.org/#/c/351411/ .
As far as SQLite, I have a simple opinion with SQLite which is that
migrations, triggers, and SPs should not be anywhere near a SQLite
database. SQLite should be used strictly for simple model unit tests,
the schema is created using create_all(), and that's it. The test
fixture system accommodates this as well.
Agreed on this. Spin up a DB server in the functional tests if you want
to test any actual data manipulation.
Post by Mike Bayer
Post by Steve Martinelli
Our primary concern at this point is how to effectively test the
triggers we write against our supported database systems, and their
various deployment variations. We might be able to easily drop SQLite
support (as it's only supported for our own test suite), but should we
expect variation in support and/or actual behavior of triggers across
the MySQLs, MariaDBs, Perconas, etc, of the world that would make it
necessary to test each of them independently? If you have operational
experience working with triggers at scale: are there landmines that we
need to be aware of? What is it going to take for us to say we support
*zero* downtime upgrades with confidence?
*zero* downtime is an extremely difficult goal. I appreciate that
people are generally nervous about making more use of relational
database features in order to help with this, however as long as the
goal includes an application that can communicate with a database that
is literally in flux as far as its schema, this is already an exotic
goal. Triggers and stored procedures are in fact very boring.
Even more boring is just testing upgraded schemas with old versions of
code.
Dan Smith
2016-08-30 15:56:29 UTC
Permalink
Post by Clint Byrum
Even in the case of projects using versioned objects, it still
means a SQL layer has to include functionality for both versions of
a particular schema change which itself is awkward.
That's not true. Nova doesn't have multiple models to straddle a
particular change. We just...
Post by Clint Byrum
Don't delete columns, ignore them.
Don't change columns, create new ones.
When you create a column, give it a default that makes sense.
Do not add new foreign key constraints.
...do this ^ :)

We can drop columns once they're long-since-unused, but we still don't
need duplicate models for that.

--Dan
Mike Bayer
2016-08-30 18:56:15 UTC
Permalink
Post by Clint Byrum
As someone else brought up, this is an unnecessarily bleak view of how database
migrations work.
We aren't talking about database migrations. We are talking about
*online* database migrations, where we would like both the *old* and
*new* versions of the code talking to the database at the same time.


If I write code that does this:


SELECT foo, bar FROM table

then I do a migration that replaces "bar" with some new table, the new
SQL is:

SELECT table.foo, othertable.bar FROM table JOIN othertable ON
table.id = othertable.foo_id

Those two SQL statements are incompatible. The "new" version of the
code must expect and maintain the old "bar" column for the benefit of
the "old" version of the code still reading and writing to it. To me,
this seems to contradict your suggestion "don't delete columns, ignore
them". We can't ignore "bar" above.
Post by Clint Byrum
Following these commandments, one can run schema changes at any time. A
new schema should be completely ignorable by older code, because their
columns keep working, and no new requirements are introduced. New code
can deal with defaulted new columns gracefully.
You need to specify how new code deals with the above two totally
different SQL statements "gracefully", except that it has to accommodate
both versions of the schema at the same time. This may be
"graceful" in operator land, but in developer land there is no easy answer.
Post by Clint Byrum
I don't think it's all that ambitious to think we can just use tried and
tested schema evolution techniques that work for everyone else.
People have been asking me for over a year how to do this, and I have no
easy answer, I'm glad that you do. I would like to see some examples of
these techniques.

If you can show me the SQL access code that deals with the above change,
that would help a lot.

If the answer is, "oh well just don't do a schema change like that",
then we're basically saying we aren't really changing our schemas
anymore except for totally new features that otherwise aren't accessed
by the older version of the code. That's fine. It's not what people
coming to me are saying, though.
Dan Smith
2016-08-30 19:48:03 UTC
Permalink
Post by Mike Bayer
Post by Clint Byrum
I don't think it's all that ambitious to think we can just use
tried and tested schema evolution techniques that work for everyone
else.
People have been asking me for over a year how to do this, and I have
no easy answer, I'm glad that you do. I would like to see some
examples of these techniques.
I'm not sure how to point you at the examples we have today because
they're not on a single line (or set of lines) in a single file. Nova
has moved a lot of data around at runtime using this approach in the
last year or so with good success.
Post by Mike Bayer
If you can show me the SQL access code that deals with the above
change, that would help a lot.
We can't show you that, because as you said, there isn't a way to do
it...in SQL. That is in fact the point though: don't do it in SQL.
Post by Mike Bayer
If the answer is, "oh well just don't do a schema change like that",
then we're basically saying we aren't really changing our schemas
anymore except for totally new features that otherwise aren't
accessed by the older version of the code.
We _are_ saying "don't change schema like that", but it's not a very
limiting requirement. It means you can't move things in a schema
migration, but that's all. Nova changes schema all the time.

In the last year or so, off the top of my head, nova has:

1. Moved instance flavors from row=value metadata storage to a JSON
blob in another table
2. Moved core flavors, aggregates, keypairs and other structures from
the cell database to the api database
3. Added uuid to aggregates
4. Added a parent_addr linkage in PCI device

...all online. Those are just the ones I have in my head that have
required actual data migrations. We've had dozens of schema changes that
enable new features that are all just new data and don't require any of
this.
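
For anyone following along, the general shape of those online data
migrations (illustrative only, not nova's actual code) is: new code reads
the old location when the new one is empty, and writes the result back so
rows converge on the new schema over time.

def load_flavor(instance_row, legacy_metadata_rows, save_flavor_json):
    """Return the instance's flavor, migrating it on read if needed.

    instance_row: dict-like row from the instances table.
    legacy_metadata_rows: rows from the old row=value metadata table.
    save_flavor_json: callable that persists the flavor to the new JSON
        column (a hypothetical helper, injected here for brevity).
    """
    flavor = instance_row.get('flavor_json')
    if flavor is None:
        # Row predates the change: rebuild the flavor from the old
        # row=value storage...
        flavor = {row['key']: row['value'] for row in legacy_metadata_rows}
        # ...and write it to the new location so this row is migrated.
        save_flavor_json(instance_row['uuid'], flavor)
    return flavor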
Post by Mike Bayer
That's fine. It's not what people coming to me are saying, though.
Not sure who is coming to you or what they're saying, but.. okay :)

If keystone really wants to use triggers to do this, then that's fine.
But I think the overwhelming response from this thread (which is asking
people's opinions on the matter) seems to be that they're an unnecessary
complication that will impede people debugging and working on that part
of the code base. We have such impediments elsewhere, but I think we
generally try to avoid doing one thing a hundred different ways to keep
the playing field as level as possible.

--Dan
Clint Byrum
2016-08-30 20:43:35 UTC
Permalink
Post by Mike Bayer
Post by Clint Byrum
As someone else brought up, this is an unnecessarily bleak view of how database
migrations work.
We aren't talking about database migrations. We are talking about
*online* database migrations, where we would like both the *old* and
*new* versions of the code, talking to the database at the same time.
SELECT foo, bar FROM table
then I do a migration that replaces "bar" with some new table, the new
SELECT table.foo, othertable.bar FROM table JOIN othertable ON
table.id = othertable.foo_id
Those two SQL statements are incompatible. The "new" version of the
code must expect and maintain the old "bar" column for the benefit of
the "old" version of the code still reading and writing to it. To me,
this seems to contradict your suggestion "don't delete columns, ignore
them". We can't ignore "bar" above.
It's hard to think about what you're saying without concrete examples,
but I'll try.

As I said, don't remove columns, ignore them. Of course, you can't ignore
them on writes, they still exist. If you have a new relationship for that
data, then yes, you still have to write to the old columns and tables so
that older versions of the code can find the data it needs.

Your join needs to be a left join, so that you get the data from the old
table when it was written by old code.
Post by Mike Bayer
Post by Clint Byrum
Following these commandments, one can run schema changes at any time. A
new schema should be completely ignorable by older code, because their
columns keep working, and no new requirements are introduced. New code
can deal with defaulted new columns gracefully.
You need to specify how new code deals with the above two totally
different SQL statements "gracefully", except that it has to accommodate
both versions of the schema at the same time. This may be
"graceful" in operator land, but in developer land there is no easy answer.
Correct, it is harder for development. Since the database server has all
of the potential for the worst problems, being a stateful service,
I believe moving complexity _out_ of it is generally an operational
win, at the expense of some development effort. The development effort,
however, is mostly on the front of the pipeline where timelines can be
longer. Operations typically is operating under SLA's and with
requirements to move slowly in defense of peoples' data and performance
of the system. So I suggest that paying costs in dev, vs. at the
database is usually the highest value choice.

This is of course not the case if timelines are short for development as
well, but I can't really answer the question in that case. For OpenStack,
we nearly always find ourselves with more time to develop, than operators
do to operate.
Post by Mike Bayer
Post by Clint Byrum
I don't think it's all that ambitious to think we can just use tried and
tested schema evolution techniques that work for everyone else.
People have been asking me for over a year how to do this, and I have no
easy answer, I'm glad that you do. I would like to see some examples of
these techniques.
If you can show me the SQL access code that deals with the above change,
that would help a lot.
So schema changes fall into several categories. But basically, the only
one that is hard is a relationship change - a new PK. Here's
an example:

Book.isbn was the PK, but we want to have a record per edition, so the
new primary key is (isbn, edition).

Solution: Maintain two tables. You have created an entirely new object!

CREATE TABLE book (
isbn varchar(30) not null primary key,
description text
)

CREATE TABLE book_editions (
isbn varchar(30) not null,
edition int not null,
description text,
primary key (isbn, edition)
)

And now on read, your new code has to do this:

SELECT b.isbn,
COALESCE(be.edition, 0) AS edition,
COALESCE(be.description, b.description) AS description
FROM book b
LEFT OUTER JOIN book_editions be
ON b.isbn = be.isbn
WHERE b.isbn = 'fooisbn'

And now, if a book has only ever been written by old code, you get one
record with a 0 edition. And if it were written by the new system, the
new system would need to go ahead and duplicate the book description into
the old table for as long as we have code that might expect it.
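
Spelled out, the new code's write path might look roughly like this
(MySQL-flavored upsert syntax; the literal values are obviously just for
illustration):

BEGIN;
-- write the shape the new code actually cares about
INSERT INTO book_editions (isbn, edition, description)
VALUES ('fooisbn', 2, 'revised text')
ON DUPLICATE KEY UPDATE description = VALUES(description);
-- and mirror the description into the old table so old code still finds it
INSERT INTO book (isbn, description)
VALUES ('fooisbn', 'revised text')
ON DUPLICATE KEY UPDATE description = VALUES(description);
COMMIT;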

Most other things are simpler and have quite obvious solutions.
Post by Mike Bayer
If the answer is, "oh well just don't do a schema change like that",
then we're basically saying we aren't really changing our schemas
anymore except for totally new features that otherwise aren't accessed
by the older version of the code. That's fine. It's not what people
coming to me are saying, though.
I mean, yes and no. We should pay some respect to operators who have to
deal with our desire for the schema to be "Right". If it scales well,
maintains integrity, and is 98% clear and well formed, then that 2%
where we store the "project id" redundantly in the "tenant_id" column
for a few releases, that isn't really a bother to me.
Mike Bayer
2016-08-30 22:15:14 UTC
Permalink
Post by Clint Byrum
Correct, it is harder for development. Since the database server has all
of the potential for the worst problems, being a stateful service, then
I believe moving complexity _out_ of it, is generally an operational
win, at the expense of some development effort. The development effort,
however, is mostly on the front of the pipeline where timelines can be
longer. Operations typically is operating under SLA's and with
requirements to move slowly in defense of peoples' data and performance
of the system. So I suggest that paying costs in dev, vs. at the
database is usually the highest value choice.
This is of course not the case if timelines are short for development as
well, but I can't really answer the question in that case. For OpenStack,
we nearly always find ourselves with more time to develop, than operators
do to operate.
So the idea of triggers is hey, for easy things like column X is now
column Y elsewhere, instead of complicating the code, use a trigger to
maintain that value. Your argument against triggers is: "Triggers
introduce emergent behaviors and complicate scaling and reasonable
debugging in somewhat hidden ways that
can frustrate even the most experienced DBA."

I'd wager that triggers probably work a little more smoothly in modern
MySQL/PostgreSQL than a more classical "DBA" platform like a crusty old
MS SQL Server or Oracle, but more examples on these emergent behaviors
would be useful, as well as evidence that they apply to current versions
of database software that are in use within OpenStack, and are
disruptive enough that even the most clear-cut case for triggers vs.
in-application complexity should favor in-app complexity without question.
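
To make the kind of trigger I have in mind concrete, here is a rough sketch
for the "column X is now column Y" case (MySQL syntax; the table is made up
and the tenant_id/project_id pair is only borrowed as a familiar example,
this is not lifted from any of the patches under review):

-- old code fills in tenant_id, new code fills in project_id; the trigger
-- keeps whichever one was omitted in sync with the other
CREATE TRIGGER widget_project_id_sync BEFORE INSERT ON widget
FOR EACH ROW
  SET NEW.project_id = COALESCE(NEW.project_id, NEW.tenant_id),
      NEW.tenant_id = COALESCE(NEW.tenant_id, NEW.project_id);
-- (a matching BEFORE UPDATE trigger does the same for updates; both are
-- dropped once only new code remains)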
Post by Clint Byrum
Post by Mike Bayer
Post by Clint Byrum
I don't think it's all that ambitious to think we can just use tried and
tested schema evolution techniques that work for everyone else.
People have been asking me for over a year how to do this, and I have no
easy answer, I'm glad that you do. I would like to see some examples of
these techniques.
If you can show me the SQL access code that deals with the above change,
that would help a lot.
So schema changes fall into several categories. But basically, the only
one that is hard, is a relationship change. Basically, a new PK. Here's
Book.isbn was the PK, but we want to have a record per edition, so the
new primary key is (isbn, edition).
Solution: Maintain two tables. You have created an entirely new object!
CREATE TABLE book (
isbn varchar(30) not null primary key,
description text,
)
CREATE TABLE book_editions (
isbn varchar(30) not null,
edition int not null,
description text,
primary key (isbn, edition),
)
SELECT b.isbn,
COALESCE(be.edition, 0) AS edition,
COALESCE(be.description, b.description) AS description
FROM book b
LEFT OUTER JOIN book_editions be
ON b.isbn = be.isbn
WHERE b.isbn = 'fooisbn'
And now, if a book has only ever been written by old code, you get one
record with a 0 edition. And if it were written by the new system, the
new system would need to go ahead and duplicate the book description into
the old table for as long as we have code that might expect it.
So some pain points here are:

1. you really can't ever trust what's in book_editions.description as
long as any "old" application is running, since it can put new data into
book.description at any time. You shouldn't bother reading from it at
all, just write to it. You won't be able to use it until the next
version of the application, e.g. "new" + 1. Or if you support some kind
of "old app is gone! " flag that modifies the behavior of "new" app to
modify all its queries, which is even more awkward.

2. deletes by "old" app of entries in "book" have to be synchronized
offline by a background script of some kind. You at least need to run a
final, authoritative "clean up all the old book deletions" job before
you go into "old app is gone" mode and the new app begins reading from
book_editions alone.

3. LEFT OUTER JOINs can be a major performance hit. You can't turn it
off here until you go to version "new + 1" (bad performance locked in
for a whole release cycle) or your app has a "turn off old app mode"
flag (basically you have to write two different database access layers).

Contrast to the trigger approach, which removes all the SELECT pain and
moves it all to writes (a rough sketch of such triggers follows the list below):

1. new application has no code whatsoever referring to old application

2. no performance hit on SELECT

3. no "wait til version "new+1"" and/or "old app is gone" switch

If we have evidence that triggers are always, definitely, universally
going to make even this extremely simple use case non-feasible, great,
let's measure and test for that. But in a case like this they look
very attractive and I'd hate to just dispense with them unilaterally
without a case-by-case examination.

As I wrote this, I did begin to come up with yet another approach. I'd
be pessimistic about acceptance here because instead of using
scary-and-mistrusted triggers, it would use
even-scarier-and-more-mistrusted SQLAlchemy. That is, write a
triggering system as a SQLAlchemy library that embeds into either "old"
or "new" application as a plugin. Pull it in via the SQLAlchemy URL in
the .conf file, and it would apply events to all tables that do the same
things that server-side triggers would do, except you can write the
usual platform-agnostic SQLAlchemy Core / ORM code to do it. The code
within this layer would either be created custom for each app migration,
or perhaps some of it could be somewhat abstracted into a series of
common "triggers" like "column move", "column rename", etc.

Since the real trick is, how do we get data written from the "old" app
to be compatible with "new". Triggers are one way to put something "in
the middle", but an in-process Python plugin could be another. The
"new" app would use the same technique and the plugin could be removed
once the application is fully migrated.

This does meet your criteria of the transition happening in "safe, warm
development". It meets mine of, "the new application can look
forwards and SELECT at will without worrying about backwards
compatibility". I'm not too optimistic that I could convince anyone of
this approach though. As Dan notes, the kinds of migrations Nova is
doing don't have these problems at all. For others, like the one
Keystone is working on, it apparently is more about getting a DEFAULT to
work on all platforms (MySQL is being picky) and definitely a DEFAULT is
better than any other approach if it is all that's needed.

Going to put this one in my back pocket though, stay tuned...
Post by Clint Byrum
Most other things are simpler and have quite obvious solutions.
Post by Mike Bayer
If the answer is, "oh well just don't do a schema change like that",
then we're basically saying we aren't really changing our schemas
anymore except for totally new features that otherwise aren't accessed
by the older version of the code. That's fine. It's not what people
coming to me are saying, though.
I mean, yes and no. We should pay some respect to operators who have to
deal with our desire for the schema to be "Right". If it scales well,
maintains integrity, and is 98% clear and well formed, then that 2%
where we store the "project id" redundantly in the "tenant_id" column
for a few releases, that isn't really a bother to me.
I've no problem with downvoting changes that incur painful schema
changes for no clear operational benefit and I don't think anyone else
has a problem with that either. IMO an application that wants to do
online upgrades has to be mostly done with their "getting the schema
right" problems.
Lance Bragstad
2016-08-30 23:44:25 UTC
Permalink
Since the encrypted credential work is currently based on triggers, I spent
most of today documenting a walk-through migration from Mitaka to Newton
[0]. Regardless of the outcome discussed here - figured it would be worth
sharing since it's relevant to the thread. Most of the gist contains stuff
not directly related to the upgrade from Mitaka to Newton, like config
files and install processes. I included them anyway since I started with a
green field deployment on Ubuntu 16.04.

Everything is technically still up for review so if you notice anything
fishy about the implementation via the walkthrough feel free to leave a
comment.

[0]
https://gist.github.com/lbragstad/ddfb10f9f9048414d1f781ba006e95d1#file-migration-md
Post by Mike Bayer
Post by Clint Byrum
Correct, it is harder for development. Since the database server has all
of the potential for the worst problems, being a stateful service, then
I believe moving complexity _out_ of it, is generally an operational
win, at the expense of some development effort. The development effort,
however, is mostly on the front of the pipeline where timelines can be
longer. Operations typically is operating under SLA's and with
requirements to move slowly in defense of peoples' data and performance
of the system. So I suggest that paying costs in dev, vs. at the
database is usually the highest value choice.
This is of course not the case if timelines are short for development as
well, but I can't really answer the question in that case. For OpenStack,
we nearly always find ourselves with more time to develop, than operators
do to operate.
So the idea of triggers is hey, for easy things like column X is now
column Y elsewhere, instead of complicating the code, use a trigger to
maintain that value. Your argument against triggers is: "Triggers
introduce emergent behaviors and complicate scaling and reasonable
debugging in somewhat hidden ways that
can frustrate even the most experienced DBA."
I'd wager that triggers probably work a little more smoothly in modern
MySQL/Postgresql than a more classical "DBA" platform like a crusty old MS
SQL Server or Oracle, but more examples on these emergent behaviors would
be useful, as well as evidence that they apply to current versions of
database software that are in use within Openstack, and are disruptive
enough that even the most clear-cut case for triggers vs. in-application
complexity should favor in-app complexity without question.
Post by Clint Byrum
I don't think it's all that ambitious to think we can just use tried and
Post by Mike Bayer
Post by Clint Byrum
tested schema evolution techniques that work for everyone else.
People have been asking me for over a year how to do this, and I have no
easy answer, I'm glad that you do. I would like to see some examples of
these techniques.
If you can show me the SQL access code that deals with the above change,
that would help a lot.
So schema changes fall into several categories. But basically, the only
one that is hard, is a relationship change. Basically, a new PK. Here's
Book.isbn was the PK, but we want to have a record per edition, so the
new primary key is (isbn, edition).
Solution: Maintain two tables. You have created an entirely new object!
CREATE TABLE book (
isbn varchar(30) not null primary key,
description text,
)
CREATE TABLE book_editions (
isbn varchar(30) not null,
edition int not null,
description text,
primary key (isbn, edition),
)
SELECT b.isbn,
COALESCE(be.edition, 0) AS edition,
COALESCE(be.description, b.description) AS description
FROM book b
LEFT OUTER JOIN book_editions be
ON b.isbn = be.isbn
WHERE b.isbn = 'fooisbn'
And now, if a book has only ever been written by old code, you get one
record with a 0 edition. And if it were written by the new system, the
new system would need to go ahead and duplicate the book description into
the old table for as long as we have code that might expect it.
1. you really can't ever trust what's in book_editions.description as long
as any "old" application is running, since it can put new data into
book.description at any time. You shouldn't bother reading from it at all,
just write to it. You won't be able to use it until the next version of the
application, e.g. "new" + 1. Or if you support some kind of "old app is
gone! " flag that modifies the behavior of "new" app to modify all its
queries, which is even more awkward.
2. deletes by "old" app of entries in "book" have to be synchronized
offline by a background script of some kind. You at least need to run a
final, authoritative "clean up all the old book deletions" job before you
go into "old app is gone" mode and the new app begins reading from
book_editions alone.
3. LEFT OUTER JOINs can be a major performance hit. You can't turn it
off here until you go to version "new + 1" (bad performance locked in for a
whole release cycle) or your app has a "turn off old app mode" flag
(basically you have to write two different database access layers).
Contrast to the trigger approach, which removes all the SELECT pain and
1. new application has no code whatsoever referring to old application
2. no performance hit on SELECT
3. no "wait til version "new+1"" and/or "old app is gone" switch
If we have evidence that triggers are always, definitely, universally
going to make even this extremely simple use case non-feasible, great,
let's measure and test for that. But in a case like this they look very
attractive and I'd hate to just dispense with them unilaterally without a
case-by-case examination.
As I wrote this, I did begin to come up with yet another approach. I'd be
pessimistic about acceptance here because instead of using
scary-and-mistrusted triggers, it would use even-scarier-and-more-mistrusted
SQLAlchemy. That is, write a triggering system as a SQLAlchemy library
that embeds into either "old" or "new" application as a plugin. Pull it
in via the SQLAlchemy URL in the .conf file, and it would apply events to
all tables that do the same things that server-side triggers would do,
except you can write the usual platform-agnostic SQLAlchemy Core / ORM code
to do it. The code within this layer would either be created custom for
each app migration, or perhaps some of it could be somewhat abstracted into
a series of common "triggers" like "column move", "column rename", etc.
Since the real trick is, how do we get data written from the "old" app to
be compatible with "new". Triggers are one way to put something "in the
middle", but an in-process Python plugin could be another. The "new" app
would use the same technique and the plugin could be removed once the
application is fully migrated.
This does meet your criteria of the transition happening in "safe, warm
development". It meets mine of, "the new application can look forwards
and SELECT at will without worrying about backwards compatibility". I'm
not too optimistic that I could convince anyone of this approach though.
As Dan notes, the kinds of migrations Nova is doing don't have these
problems at all. For others, like the one Keystone is working on, it
apparently is more about getting a DEFAULT to work on all platforms (MySQL
is being picky) and definitely a DEFAULT is better than any other approach
if it is all that's needed.
Going to put this one in my back pocket though, stay tuned...
Post by Clint Byrum
Most other things are simpler and have quite obvious solutions.
If the answer is, "oh well just don't do a schema change like that",
Post by Mike Bayer
then we're basically saying we aren't really changing our schemas
anymore except for totally new features that otherwise aren't accessed
by the older version of the code. That's fine. It's not what people
coming to me are saying, though.
I mean, yes and no. We should pay some respect to operators who have to
deal with our desire for the schema to be "Right". If it scales well,
maintains integrity, and is 98% clear and well formed, then that 2%
where we store the "project id" redundantly in the "tenant_id" column
for a few releases, that isn't really a bother to me.
I've no problem with downvoting changes that incur painful schema changes
for no clear operational benefit and I don't think anyone else has a
problem with that either. IMO an application that wants to do online
upgrades has to be mostly done with their "getting the schema right"
problems.
Clint Byrum
2016-08-31 00:04:32 UTC
Permalink
Post by Mike Bayer
Post by Clint Byrum
Correct, it is harder for development. Since the database server has all
of the potential for the worst problems, being a stateful service, then
I believe moving complexity _out_ of it, is generally an operational
win, at the expense of some development effort. The development effort,
however, is mostly on the front of the pipeline where timelines can be
longer. Operations typically is operating under SLA's and with
requirements to move slowly in defense of peoples' data and performance
of the system. So I suggest that paying costs in dev, vs. at the
database is usually the highest value choice.
This is of course not the case if timelines are short for development as
well, but I can't really answer the question in that case. For OpenStack,
we nearly always find ourselves with more time to develop, than operators
do to operate.
So the idea of triggers is hey, for easy things like column X is now
column Y elsewhere, instead of complicating the code, use a trigger to
maintain that value. Your argument against triggers is: "Triggers
introduce emergent behaviors and complicate scaling and reasonable
debugging in somewhat hidden ways that
can frustrate even the most experienced DBA."
I'd wager that triggers probably work a little more smoothly in modern
MySQL/Postgresql than a more classical "DBA" platform like a crusty old
MS SQL Server or Oracle, but more examples on these emergent behaviors
would be useful, as well as evidence that they apply to current versions
of database software that are in use within Openstack, and are
disruptive enough that even the most clear-cut case for triggers vs.
in-application complexity should favor in-app complexity without question.
My direct experience with this was MySQL 5.0 and 5.1. They worked as
documented, and no I don't think they've changed much since then.

When they were actually installed into the schema and up to date with
the code that expected them, and the debugging individual was aware of them, things were fine.

However, every other imperative part of the code was asserted with git,
package managers, ansible, puppet, pick your choice of thing that puts
files on disk and restarts daemons. These things all have obvious entry
points too. X is where wsgi starts running code. Y is where flask hands
off to the app, etc. But triggers are special and go in the database at
whatever time they go in. This means you lose all the benefit of all of
the tools you're used to using to debug and operate on imperative code.
Post by Mike Bayer
Post by Clint Byrum
Post by Mike Bayer
Post by Clint Byrum
I don't think it's all that ambitious to think we can just use tried and
tested schema evolution techniques that work for everyone else.
People have been asking me for over a year how to do this, and I have no
easy answer, I'm glad that you do. I would like to see some examples of
these techniques.
If you can show me the SQL access code that deals with the above change,
that would help a lot.
So schema changes fall into several categories. But basically, the only
one that is hard, is a relationship change. Basically, a new PK. Here's
Book.isbn was the PK, but we want to have a record per edition, so the
new primary key is (isbn, edition).
Solution: Maintain two tables. You have created an entirely new object!
CREATE TABLE book (
isbn varchar(30) not null primary key,
description text,
)
CREATE TABLE book_editions (
isbn varchar(30) not null,
edition int not null,
description text,
primary key (isbn, edition),
)
SELECT b.isbn,
COALESCE(be.edition, 0) AS edition,
COALESCE(be.description, b.description) AS description
FROM book b
LEFT OUTER JOIN book_editions be
ON b.isbn = be.isbn
WHERE b.isbn = 'fooisbn'
And now, if a book has only ever been written by old code, you get one
record with a 0 edition. And if it were written by the new system, the
new system would need to go ahead and duplicate the book description into
the old table for as long as we have code that might expect it.
1. you really can't ever trust what's in book_editions.description as
long as any "old" application is running, since it can put new data into
book.description at any time. You shouldn't bother reading from it at
all, just write to it. You won't be able to use it until the next
version of the application, e.g. "new" + 1. Or if you support some kind
of "old app is gone! " flag that modifies the behavior of "new" app to
modify all its queries, which is even more awkward.
Of course, you can have books that get their edition 0 updated in book
while you're upgrading. But the editions feature code always treats
that old update as an update to edition 0. It's still the same object
it always was, your app just makes some assumptions about it. You can
use a union in some cases where you need to see them all for instance,
and just select a literal '0' for the edition column of your union.
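
Something like this, say - edition 0 is synthesized for rows that only the
old code ever touched, and the NOT EXISTS just guards against seeing a book
twice if the new code also mirrored it:

SELECT isbn, edition, description
  FROM book_editions
UNION ALL
SELECT b.isbn, 0 AS edition, b.description
  FROM book b
 WHERE NOT EXISTS (SELECT 1 FROM book_editions be
                   WHERE be.isbn = b.isbn AND be.edition = 0);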

And one can say "old app is gone" when one knows it's gone. At that point,
one can run a migration that inserts 0 editions into book_editions, and
drops the book table. For OpenStack, we can say "all releases that used
that old schema are EOL, so we can simplify the code now". Our 6 month
pace and short EOL windows are built for this kind of thing.
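
Roughly, that one-time migration is just (MySQL-ish sketch):

-- backfill an edition 0 for anything only the old code ever wrote...
INSERT INTO book_editions (isbn, edition, description)
SELECT b.isbn, 0, b.description
  FROM book b
  LEFT JOIN book_editions be
    ON be.isbn = b.isbn AND be.edition = 0
 WHERE be.isbn IS NULL;
-- ...then the old table, and anything still pointed at it, can go away
DROP TABLE book;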
Post by Mike Bayer
2. deletes by "old" app of entries in "book" have to be synchronized
offline by a background script of some kind. You at least need to run a
final, authoritative "clean up all the old book deletions" job before
you go into "old app is gone" mode and the new app begins reading from
book_editions alone.
You'll notice the query was a left join from book -> book_editions.
Deletes will work just fine. The migration mentioned above would need to
clean up any orphaned book editions that lack a 0 edition.
Post by Mike Bayer
3. LEFT OUTER JOINs can be a major performance hit. You can't turn it
off here until you go to version "new + 1" (bad performance locked in
for a whole release cycle) or your app has a "turn off old app mode"
flag (basically you have to write two different database access layers).
I'm not sure I agree that they're that big of a performance hit. When
they're returning lots and lots of nulled out rows and gumming up the
optimizer, yes. But this is a specific use case, where one is _always_
going to be doing a many to one lookup, and so, it results in a single
key lookup on a table's PK, which is the fastest thing we can do in a
database. This specific case is a generic solution to the problem though,
and works if you replace books with users and editions with domains,
for instance.
Post by Mike Bayer
Contrast to the trigger approach, which removes all the SELECT pain and
1. new application has no code whatsoever referring to old application
Except it does, hidden in the schema as triggers.
Post by Mike Bayer
2. no performance hit on SELECT
I'll take extra single key lookups that I can read the code for over
hidden triggering any day.
Post by Mike Bayer
3. no "wait til version "new+1"" and/or "old app is gone" switch
One will need to drop the triggers at some point.
Post by Mike Bayer
If we have evidence that triggers are always, definitely, universally
going to make even this extremely simple use case non-feasible, great,
let's measure and test for that. But in a case like this they look
very attractive and I'd hate to just dispense with them unilaterally
without a case-by-case examination.
It's pretty hard to measure and test complexity.

But we can implement a system with one way or another, and then measure
how often it breaks, how long it takes to fix it, and how skilled the
workers must be to address issues. Then after that, we should take the
approach that we found has the higher break/fix efficiency, and use
that. I'm suggesting that my experience has found triggers to have an
undesirable level of break/fix complexity compared to code that deals
with legacy schemas.
Post by Mike Bayer
As I wrote this, I did begin to come up with yet another approach. I'd
be pessimistic about acceptance here because instead of using
scary-and-mistrusted triggers, it would use
even-scarier-and-more-mistrusted SQLAlchemy. That is, write a
triggering system as a SQLAlchemy library that embeds into either "old"
or "new" application as a plugin. Pull it in via the SQLAlchemy URL in
the .conf file, and it would apply events to all tables that do the same
things that server-side triggers would do, except you can write the
usual platform-agnostic SQLAlchemy Core / ORM code to do it. The code
within this layer would either be created custom for each app migration,
or perhaps some of it could be somewhat abstracted into a series of
common "triggers" like "column move", "column rename", etc.
Since the real trick is, how do we get data written from the "old" app
to be compatible with "new". Triggers are one way to put something "in
the middle", but an in-process Python plugin could be another. The
"new" app would use the same technique and the plugin could be removed
once the application is fully migrated.
This does meet your criteria of the transition happening in "safe, warm
development". It meets mine of, "the new application can look
forwards and SELECT at will without worrying about backwards
compatibility". I'm not too optimistic that I could convince anyone of
this approach though. As Dan notes, the kinds of migrations Nova is
doing don't have these problems at all. For others, like the one
Keystone is working on, it apparently is more about getting a DEFAULT to
work on all platforms (MySQL is being picky) and definitely a DEFAULT is
better than any other approach if it is all that's needed.
Going to put this one in my back pocket though, stay tuned...
It does sound interesting, no doubt.
Mike Bayer
2016-08-31 02:31:22 UTC
Permalink
Post by Clint Byrum
My direct experience with this was MySQL 5.0 and 5.1. They worked as
documented, and no I don't think they've changed much since then.
When they were actually installed into the schema and up to date with
the code that expected them, and the debugging individual was aware of them, things were fine.
However, every other imperative part of the code was asserted with git,
package managers, ansible, puppet, pick your choice of thing that puts
file on disk and restarts daemons. These things all have obvious entry
points too. X is where wsgi starts running code. Y is where flask hands
off to the app, etc. But triggers are special and go in the database at
whatever time they go in. This means you lose all the benefit of all of
the tools you're used to using to debug and operate on imperative code.
to use your phrasing, I'd characterize this as "an unnecessarily bleak
view" of the use of triggers as a whole. I've no doubt you worked with
some horrible trigger code (just as I've worked with some horrible
application code, and with plenty of horrible stored procedure /
trigger stuff too).

The triggers that have been in play in the current Keystone proposals as
well as the one we were working with in Neutron were simple one liners
that essentially act as custom constraints - they check a condition then
raise an error if it fails. In particular, MySQL doesn't have support
for CHECK constraints, so if you want to assert that values going into a
row have some quality more exotic than "not null", you might have to use
a trigger to get this effect.
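
Something in this spirit, for instance (illustrative only, not the actual
Keystone trigger; the mysql command-line client would need a DELIMITER
change around it, but sent from a migration over the driver it is just one
statement):

CREATE TRIGGER book_editions_edition_check
BEFORE INSERT ON book_editions FOR EACH ROW
BEGIN
  -- MySQL has no CHECK constraints, so reject bad rows here instead
  IF NEW.edition < 0 THEN
    SIGNAL SQLSTATE '45000' SET MESSAGE_TEXT = 'edition must not be negative';
  END IF;
END;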

Clearly, a trigger that is so complex that it is invoking a whole series
of imperative steps is not a trigger any of us should be considering.
IMO these are not those triggers.
Post by Clint Byrum
Of course, you can have books that get their edition 0 updated in book
while you're upgrading. But the editions feature code always treats
that old update as an update to edition 0. It's still the same object
it always was, your app just makes some assumptions about it. You can
use a union in some cases where you need to see them all for instance,
and just select a literal '0' for the edition column of your union.
I find unions to be very awkward and really subject to poor performance.
Of course this can be made to work but I'm sticking to my preference
for getting the data in the right shape on the write side, not the read
side.
Post by Clint Byrum
And one can say "old app is gone" when one knows it's gone. At that point,
one can run a migration that inserts 0 editions into book_edition, and
drops the book table. For OpenStack, we can say "all releases that used
that old schema are EOL, so we can simplify the code now". Our 6 month
pace and short EOL windows are built for this kind of thing.
Assuming we aren't able to use Nova's approach and we're stuck
translating in the data access layer, we can simplify the code and put
out a new release, although that "simplified" code now has to be
"unsimplified" by all the *new* schema changes - code will always be
carrying along junk to try and adapt it to the previous version of the
software. There's no problem if projects in this situation want to do
it this way and I will gladly support everyone's efforts in going this
route. However, I still think it's worth looking into approaches that
can push the interaction between old and new app version into the write
side instead of the read side, and if that interaction can be removed
from the primary database access code into a separate layer.

To the degree that teams can just emulate Nova's finessing of the issue
at the service level, that's even better. This thread is just in
response to particular teams who *want* to use triggers for a specific
problem. Hopefully I will have time to flesh out my alternative
technique for "application-level translation triggers" and maybe those
folks might want to try that kind of thing too someday.
Robert Collins
2016-09-01 08:45:22 UTC
Permalink
Post by Clint Byrum
Don't delete columns, ignore them.
Don't change columns, create new ones.
When you create a column, give it a default that makes sense.
I'm sure you're aware of this but I think it's worth clarifying for non
DBAish folk: non-NULL values can change a DDL statement's execution
time from O(1) to O(N) depending on the DB in use. E.g. for Postgres
DDL requires an exclusive table lock, and adding a column with any
non-NULL value (including constants) requires calculating a new value
for every row, vs just updating the metadata - see
https://www.postgresql.org/docs/9.5/static/sql-altertable.html
"""
When a column is added with ADD COLUMN, all existing rows in the table
are initialized with the column's default value (NULL if no DEFAULT
clause is specified). If there is no DEFAULT clause, this is merely a
metadata change and does not require any immediate update of the
table's data; the added NULL values are supplied on readout, instead.
"""
Post by Clint Byrum
Do not add new foreign key constraints.
What's the reason for this - if it's to avoid exclusive locks, I'd
note that the other rules above don't avoid exclusive locks - again,
DB specific, and for better or worse we are now testing on multiple DB
engines via 3rd party testing.

https://dev.launchpad.net/Database/LivePatching has some info from our
experience doing online and very fast offline patches in Launchpad.

-Rob
Henry Nash
2016-09-01 12:29:19 UTC
Permalink
So as the person who drove the rolling upgrade requirements into keystone in this cycle (because we have real customers that need it), and having first written the keystone upgrade process to be “versioned object ready” (because I assumed we would do this the same as everyone else), and subsequently re-written it to be “DB Trigger ready”, and written migration scripts for both these cases for the (in fact very minor) DB changes that keystone has in Newton, I guess I should also weigh in here :-)

For me, the argument comes down to:

a) Is the pain that needs to cured by the rolling upgrade requirement broadly in the same place in the various projects (i.e. nova, glance, keystone etc.)? If it is, then working towards a common solution is always preferable (whatever that solution is)
b) I would characterise the difference between the trigger approach, the versioned objects approach and the “in-app” approach as: do we want a small amount of very nasty complexity vs. spreading that complexity out to be not as bad, but over a broader area. Probably fewer people can (successfully) write the nasty complexity trigger work, than they can, say, the “do it all in the app” work. LOC (which, of course, isn’t always a good measure) is also reflected in this characterisation, with the trigger code having probably the fewest LOC, and the app code having the greatest.
c) I don’t really follow the argument that somehow the trigger code in migrations is less desirable because we use higher level sqla abstractions in our main-line code - I’ve always seen migration as different and expected that we might have to do strange things there. Further, we should be aware of the time periods involved - the migration cycle is a small % of elapsed time the cloud is running (well, hopefully) - so again, do we solve the “issues of migration” as part of the migration cycle (which is what the trigger approach does) or make our code be (effectively) continually migration aware (using versioned objects or in-app code)?
d) The actual process (for an operator) is simpler for a rolling upgrade process with Triggers than the alternative (since you don’t require several of the checkpoints, e.g. when you know you can move out of compatibility mode etc.). Operator error is also a cause of problems in upgrades (especially as the complexity of a cloud increases).

From a purely keystone perspective, my gut feeling is that actually the trigger approach is likely to lead to a more robust, not less, solution - due to the fact that we solve the very specific problems of a given migration (i.e. the need to keep column A in sync with column B) for a short period of time, right at the point of pain, with well established techniques - albeit they be complex ones that need experienced coders in those techniques. I actually prefer the small locality of complexity (marked with “there be dragons there, be careful”), as opposed to spreading medium pain over a large area, which by definition is updated by many and may do the wrong thing inadvertently. It is simpler for operators.

I do recognise, however, the “let’s not do different stuff for a core project like keystone” as a powerful argument. I just don’t know how to square this with the fact that although I started in the “versioned objects camp”, having worked through many of the issues I have come to believe that the Trigger approach will be more reliable overall for this specific use case. From the other reaction to this thread, I don’t detect a lot of support for the Trigger approach becoming our overall, cross-project solution.

The actual migrations in Keystone needed for Newton are minor, so one possibility is we use keystone as a guinea pig for this approach in Newton - if we had to undo this in a subsequent release, we are not talking about rafts of migration code to redo.

Henry
Post by Robert Collins
Post by Clint Byrum
Don't delete columns, ignore them.
Don't change columns, create new ones.
When you create a column, give it a default that makes sense.
I'm sure you're aware of this but I think its worth clarifying for non
DBAish folk: non-NULL values can change a DDL statements execution
time from O(1) to O(N) depending on the DB in use. E.g. for Postgres
DDL requires an exclusive table lock, and adding a column with any
non-NULL value (including constants) requires calculating a new value
for every row, vs just updating the metadata - see
https://www.postgresql.org/docs/9.5/static/sql-altertable.html
"""
When a column is added with ADD COLUMN, all existing rows in the table
are initialized with the column's default value (NULL if no DEFAULT
clause is specified). If there is no DEFAULT clause, this is merely a
metadata change and does not require any immediate update of the
table's data; the added NULL values are supplied on readout, instead.
"""
Post by Clint Byrum
Do not add new foreign key constraints.
What's the reason for this - if it's to avoid exclusive locks, I'd
note that the other rules above don't avoid exclusive locks - again,
DB specific, and for better or worse we are now testing on multiple DB
engines via 3rd party testing.
https://dev.launchpad.net/Database/LivePatching has some info from our
experience doing online and very fast offline patches in Launchpad.
-Rob
Mike Bayer
2016-09-01 14:58:26 UTC
Permalink
Post by Henry Nash
From a purely keystone perspective, my gut feeling is that actually the
trigger approach is likely to lead to a more robust, not less, solution - due
to the fact that we solve the very specific problems of a given migration
(i.e. need to keep column A in sync with Column B) or a short period of time,
right at the point of pain, with well established techniques - albeit they be
complex ones that need experienced coders in those techniques.
this is really the same philosophy I'm going for, that is, make a schema
migration, then accompany it by a data migration, and then you're done.
The rest of the world need not be concerned.

It's not as much about "triggers" as it is, "handle the data difference
on the write side, not the read side". That is, writing data to a SQL
database is squeezed through exactly three very boring forms of
statement, the INSERT, UPDATE, and DELETE. These are easy to intercept
in the database, and since we use an abstraction like SQLAlchemy they
are easy to intercept in the application layer too (foreshadowing....).
When you put it on the read side, reading is of course (mostly)
through just one statement, the SELECT, but it is a crazy beast in
practice and it is all over the place in an unlimited number of forms.

If you can get your migrations to be, hey, we can just read JSON records
from version 1.0 of the service and pump them into version 2.0, then
you're doing read-side, but you've solved the problem at the service
layer. This only works for those situations where it "works", and the
dual-layer service architecture has to be feasibly present as well.
Jay Pipes
2016-09-14 22:08:33 UTC
Permalink
Post by Henry Nash
So as the person who drove the rolling upgrade requirements into
keystone in this cycle (because we have real customers that need it),
and having first written the keystone upgrade process to be
“versioned object ready” (because I assumed we would do this the same
as everyone else), and subsequently re-written it to be “DB Trigger
ready”
and written migration scripts for both these cases for the (in
fact very minor) DB changes that keystone has in Newton
I guess I
should also weigh in here :-)
Sorry for delayed response. PTO and all... I'd just like to make a
clarification here. Henry, you are not referring to *rolling upgrades*
but rather *online database migrations*. There's an important
distinction between the two concepts.

Online schema migrations, as discussed in this thread, are all about
minimizing the time that a database server is locked or otherwise busy
performing the tasks of changing SQL schemas and moving the underlying
stored data from their old location/name to their new location/name. As
noted in this thread, there's numerous ways of reducing the downtime
experienced during these data and schema migrations.

Rolling upgrades are not the same thing, however. What rolling upgrades
refer to is the ability of a *distributed system* to have its
distributed component services running different versions of the
software and still be able to communicate with the other components of
the system. This time period during which the components of the
distributed system may run different versions of the software may be
quite lengthy (days or weeks long). The "rolling" part of "rolling
upgrade" refers to the fact that in a distributed system of thousands of
components or nodes, the upgraded software must be "rolled out" to those
thousands of nodes over a period of time.

Glance and Keystone do not participate in a rolling upgrade, because
Keystone and Glance do not have a distributed component architecture.
Online data migrations will reduce total downtime experienced during an
*overall upgrade procedure* for an OpenStack cloud, but Nova, Neutron
and Cinder are the only parts of OpenStack that are going to participate
in a rolling upgrade because they are the services that are distributed
across all the many compute nodes.

Best,
-jay

Bashmakov, Alexander
2016-09-14 23:24:48 UTC
Permalink
Post by Jay Pipes
Glance and Keystone do not participate in a rolling upgrade, because
Keystone and Glance do not have a distributed component architecture.
Online data migrations will reduce total downtime experienced during an
*overall upgrade procedure* for an OpenStack cloud, but Nova, Neutron and
Cinder are the only parts of OpenStack that are going to participate in a rolling
upgrade because they are the services that are distributed across all the
many compute nodes.
Hi Jay,
I'd like to better understand why your definition of rolling upgrades excludes Glance and Keystone? Granted they don't run multiple disparate components over distributed systems, however, they can still run the same service on multiple distributed nodes. So a rolling upgrade can still be applied on a large cloud that has, for instance 50 Glance nodes. In this case different versions of the same service will run on different nodes simultaneously.
Regards,
Alex
Jay Pipes
2016-10-14 20:55:39 UTC
Permalink
Alex, so sorry for the long delayed response! :( This just crept to
the back of my inbox unfortunately. Answer inline...
Post by Jay Pipes
Glance and Keystone do not participate in a rolling upgrade,
because Keystone and Glance do not have a distributed component
architecture. Online data migrations will reduce total downtime
experienced during an *overall upgrade procedure* for an OpenStack
cloud, but Nova, Neutron and Cinder are the only parts of OpenStack
that are going to participate in a rolling upgrade because they are
the services that are distributed across all the many compute
nodes.
Hi Jay, I'd like to better understand why your definition of rolling
upgrades excludes Glance and Keystone? Granted they don't run
multiple disparate components over distributed systems, however, they
can still run the same service on multiple distributed nodes. So a
rolling upgrade can still be applied on a large cloud that has, for
instance 50 Glance nodes.
If you've seen a cloud with 50 Glance nodes, I would be astonished :)
That said, the number 50 doesn't really have to do with my definition of
rolling... lemme explain.

The primary thing that, to me at least, differentiates rolling upgrades
of distributed software is that different nodes can contain multiple
versions of the software and continue to communicate with other nodes in
the system without issue.

In the case of Glance, you cannot have different versions of the Glance
service running simultaneously within an environment, because those
Glance services each directly interface with the Glance database and
therefore expect the Glance DB schema to look a particular way for a
specific version of the Glance service software.

In contrast, Nova's distributed service nodes -- the nova-compute
services and (mostly) the nova-api services do *not* talk directly to
the Nova database. If those services need to get or set data in the
database, they communicate with the nova-conductor services which are
responsible for translating (called back-versioning) the most updated
object model schema that matches the Nova database to the schema that
the calling node understands. This means that Nova deployers can update
the Nova database schema and not have to at the same time update the
software on the distributed compute nodes. In this way deployers can
"roll out" an upgrade of the Nova software across many hundreds of
compute nodes over an extended period of time without needing to
restart/upgrade services all at once.

Hope this clarifies things.

Best,
-jay

p.s. I see various information on the web referring to "rolling updates"
or "rolling releases" as simply the process of continuously applying new
versions of software to a deployment. This is decidedly *not* what I
refer to as a "rolling upgrade". Perhaps we should invent a different
term from "rolling upgrade" to refer to the attributes involved in being
able to run multiple versions of distributed software with no impact on
the control plane? Is that what folks call a "partial upgrade"? Not sure...
In this case different versions of the
same service will run on different nodes simultaneously. Regards,
Alex
Clint Byrum
2016-10-15 17:57:44 UTC
Permalink
Post by Jay Pipes
Alex, so sorry for the long delayed response! :( This just crept to
the back of my inbox unfortunately. Answer inline...
Post by Jay Pipes
Glance and Keystone do not participate in a rolling upgrade,
because Keystone and Glance do not have a distributed component
architecture. Online data migrations will reduce total downtime
experienced during an *overall upgrade procedure* for an OpenStack
cloud, but Nova, Neutron and Cinder are the only parts of OpenStack
that are going to participate in a rolling upgrade because they are
the services that are distributed across all the many compute
nodes.
Hi Jay, I'd like to better understand why your definition of rolling
upgrades excludes Glance and Keystone? Granted they don't run
multiple disparate components over distributed systems, however, they
can still run the same service on multiple distributed nodes. So a
rolling upgrade can still be applied on a large cloud that has, for
instance 50 Glance nodes.
If you've seen a cloud with 50 Glance nodes, I would be astonished :)
That said, the number 50 doesn't really have to do with my definition of
rolling... lemme explain.
The primary thing that, to me at least, differentiates rolling upgrades
of distributed software is that different nodes can contain multiple
versions of the software and continue to communicate with other nodes in
the system without issue.
Databases are often (mis)used to communicate.
Post by Jay Pipes
In the case of Glance, you cannot have different versions of the Glance
service running simultaneously within an environment, because those
Glance services each directly interface with the Glance database and
therefore expect the Glance DB schema to look a particular way for a
specific version of the Glance service software.
That's not a constraint of Glance, but a constraint of the way Glance
has been interfacing with the database. The argument of the thread was
that one can make schema changes in such a way that one can have
multiple versions of the same component running during an update.
Post by Jay Pipes
In contrast, Nova's distributed service nodes -- the nova-compute
services and (mostly) the nova-api services do *not* talk directly to
the Nova database. If those services need to get or set data in the
database, they communicate with the nova-conductor services which are
responsible for translating (called back-versioning) the most updated
object model schema that matches the Nova database to the schema that
the calling node understands. This means that Nova deployers can update
the Nova database schema and not have to at the same time update the
software on the distributed compute nodes. In this way deployers can
"roll out" an upgrade of the Nova software across many hundreds of
compute nodes over an extended period of time without needing to
restart/upgrade services all at once.
Hope this clarifies things.
It clarifies your thinking, so thanks for that. However, I'm not so sure
there's any difference between components that are the same software,
and components that are different software, if they end up interacting
anyway because one version can write and read data that another version
also writes and reads.

What I think is important is understanding the interfaces, and how they
can be tested to ensure that rolling/partial/0-downtime updates can be
done safely.
Duncan Thomas
2016-10-16 09:52:07 UTC
Permalink
The primary thing that, to me at least, differentiates rolling upgrades of
distributed software is that different nodes can contain multiple versions
of the software and continue to communicate with other nodes in the system
without issue.
In the case of Glance, you cannot have different versions of the Glance
service running simultaneously within an environment, because those Glance
services each directly interface with the Glance database and therefore
expect the Glance DB schema to look a particular way for a specific version
of the Glance service software.
Cinder services can run N+-1 versions in a mixed manner, all talking to the
same database, no conductor required.
--
Duncan Thomas
MichaƂ Dulko
2016-10-17 09:00:10 UTC
Permalink
Post by Jay Pipes
The primary thing that, to me at least, differentiates rolling
upgrades of distributed software is that different nodes can
contain multiple versions of the software and continue to
communicate with other nodes in the system without issue.
In the case of Glance, you cannot have different versions of the
Glance service running simultaneously within an environment,
because those Glance services each directly interface with the
Glance database and therefore expect the Glance DB schema to look
a particular way for a specific version of the Glance service
software.
Cinder services can run N+-1 versions in a mixed manner, all talking
to the same database, no conductor required.
That's true, keeping DB schema compatible is possible even if all the
services are talking to the DB, but this requires very careful
programming (and reviewing) of the DB migrations. That's why our DB
migrations guidelines are quite complicated [1]. I think conductor-based
design is superior but definitely isn't the only way.

[1]
http://docs.openstack.org/developer/cinder/devref/rolling.upgrades.html#database-schema-and-data-migrations
Bashmakov, Alexander
2016-10-27 17:24:56 UTC
Permalink
Hi Jay,

Thanks for the explanation. While I agree that there is a distinction between a distributed architecture like Nova and a centralized one like Glance, I would respectfully disagree with the statement that Glance cannot participate in rolling upgrades in a very similar fashion. We are currently working on a rolling upgrade POC in Glance (https://review.openstack.org/331740/). To date, we've successfully been able to run through a simple scenario with two Glance nodes running Newton and Ocata code base respectively. The latter introduces schema changes which are reconciled in the DB via a two-way trigger.

Regards,
Alex

-----Original Message-----
From: Jay Pipes [mailto:***@gmail.com]
Sent: Friday, October 14, 2016 1:56 PM
To: openstack-***@lists.openstack.org
Subject: Re: [openstack-dev] [keystone][nova][neutron][all] Rolling upgrades: database triggers and oslo.versionedobjects

Alex, so sorry for the long delayed response! :( This just crept to the back of my inbox unfortunately. Answer inline...
Post by Jay Pipes
Glance and Keystone do not participate in a rolling upgrade, because
Keystone and Glance do not have a distributed component architecture.
Online data migrations will reduce total downtime experienced during
an *overall upgrade procedure* for an OpenStack cloud, but Nova,
Neutron and Cinder are the only parts of OpenStack that are going to
participate in a rolling upgrade because they are the services that
are distributed across all the many compute nodes.
Hi Jay, I'd like to better understand why your definition of rolling
upgrades excludes Glance and Keystone? Granted they don't run multiple
disparate components over distributed systems, however, they can still
run the same service on multiple distributed nodes. So a rolling
upgrade can still be applied on a large cloud that has, for instance
50 Glance nodes.
If you've seen a cloud with 50 Glance nodes, I would be astonished :) That said, the number 50 doesn't really have to do with my definition of rolling... lemme explain.

The primary thing that, to me at least, differentiates rolling upgrades of distributed software is that different nodes can contain multiple versions of the software and continue to communicate with other nodes in the system without issue.

In the case of Glance, you cannot have different versions of the Glance service running simultaneously within an environment, because those Glance services each directly interface with the Glance database and therefore expect the Glance DB schema to look a particular way for a specific version of the Glance service software.

In contrast, Nova's distributed service nodes -- the nova-compute services and (mostly) the nova-api services do *not* talk directly to the Nova database. If those services need to get or set data in the database, they communicate with the nova-conductor services which are responsible for translating (called back-versioning) the most updated object model schema that matches the Nova database to the schema that the calling node understands. This means that Nova deployers can update the Nova database schema and not have to at the same time update the software on the distributed compute nodes. In this way deployers can "roll out" an upgrade of the Nova software across many hundreds of compute nodes over an extended period of time without needing to restart/upgrade services all at once.

Hope this clarifies things.

Best,
-jay

p.s. I see various information on the web referring to "rolling updates"
or "rolling releases" as simply the process of continuously applying new versions of software to a deployment. This is decidedly *not* what I refer to as a "rolling upgrade". Perhaps we should invent a different term from "rolling upgrade" to refer to the attributes involved in being able to run multiple versions of distributed software with no impact on the control plane? Is that what folks call a "partial upgrade"? Not sure...
In this case different versions of the
same service will run on different nodes simultaneously. Regards, Alex
Henry Nash
2016-09-14 23:29:44 UTC
Permalink
Jay,

I agree with your distinction - and when I am referring to rolling upgrades for keystone I am referring to when you are running a cluster of keystones (for performance and/or redundancy), and you want to roll the upgrade across the cluster without creating downtime of the overall keystone service. Such a keystone cluster deployment will be common in large clouds - and prior to Newton, keystone did not support such a rolling upgrade (you had to take all the nodes down, upgrade the DB and then boot them all back up). In order to support such a rolling upgrade you either need to have code that can work on different DB versions (either explicitly or via versioned objects), or you hide the schema changes by “data synchronisation via Triggers”, which is where this whole thread came from.

Henry
Post by Henry Nash
So as the person who drove the rolling upgrade requirements into
keystone in this cycle (because we have real customers that need it),
and having first written the keystone upgrade process to be
“versioned object ready” (because I assumed we would do this the same
as everyone else), and subsequently re-written it to be “DB Trigger
ready”, and written migration scripts for both these cases for the (in
fact very minor) DB changes that keystone has in Newton, I guess I
should also weigh in here :-)
Sorry for the delayed response. PTO and all... I'd just like to make a clarification here. Henry, you are not referring to *rolling upgrades* but rather *online database migrations*. There's an important distinction between the two concepts.
Online schema migrations, as discussed in this thread, are all about minimizing the time that a database server is locked or otherwise busy performing the tasks of changing SQL schemas and moving the underlying stored data from their old location/name to their new location/name. As noted in this thread, there's numerous ways of reducing the downtime experienced during these data and schema migrations.
Rolling upgrades are not the same thing, however. What rolling upgrades refer to is the ability of a *distributed system* to have its distributed component services running different versions of the software and still be able to communicate with the other components of the system. This time period during which the components of the distributed system may run different versions of the software may be quite lengthy (days or weeks long). The "rolling" part of "rolling upgrade" refers to the fact that in a distributed system of thousands of components or nodes, the upgraded software must be "rolled out" to those thousands of nodes over a period of time.
Glance and Keystone do not participate in a rolling upgrade, because Keystone and Glance do not have a distributed component architecture. Online data migrations will reduce total downtime experienced during an *overall upgrade procedure* for an OpenStack cloud, but Nova, Neutron and Cinder are the only parts of OpenStack that are going to participate in a rolling upgrade because they are the services that are distributed across all the many compute nodes.
Best,
-jay
Clint Byrum
2016-09-14 23:45:49 UTC
Permalink
Post by Henry Nash
Jay,
I agree with your distinction - and when I am referring to rolling upgrades for keystone I am referring to when you are running a cluster of keystones (for performance and/or redundancy), and you want to roll the upgrade across the cluster without creating downtime of the overall keystone service. Such a keystone cluster deployment will be common in large clouds - and prior to Newton, keystone did not support such a rolling upgrade (you had to take all the nodes down, upgrade the DB and then boot them all back up). In order to support such a rolling upgrade you either need to have code that can work on different DB versions (either explicitly or via versioned objects), or you hide the schema changes by “data synchronisation via Triggers”, which is where this whole thread came from.
It doesn't always need to be explicit or through versioned objects. One
can often manipulate the schema and even migrate data without disturbing
old code.

Clint Byrum
2016-09-01 18:12:28 UTC
Permalink
Post by Robert Collins
Post by Clint Byrum
Don't delete columns, ignore them.
Don't change columns, create new ones.
When you create a column, give it a default that makes sense.
I'm sure you're aware of this but I think its worth clarifying for non
DBAish folk: non-NULL values can change a DDL statement's execution
time from O(1) to O(N) depending on the DB in use. E.g. for Postgres,
DDL requires an exclusive table lock, and adding a column with any
non-NULL value (including constants) requires calculating a new value
for every row, vs just updating the metadata - see
https://www.postgresql.org/docs/9.5/static/sql-altertable.html
"""
When a column is added with ADD COLUMN, all existing rows in the table
are initialized with the column's default value (NULL if no DEFAULT
clause is specified). If there is no DEFAULT clause, this is merely a
metadata change and does not require any immediate update of the
table's data; the added NULL values are supplied on readout, instead.
"""
InnoDB (via MySQL) has no such restrictions for online DDL:

https://dev.mysql.com/doc/refman/5.6/en/innodb-create-index-overview.html#innodb-online-ddl-summary-grid

Basically what the link above says is that anything except these
operations can be done without locking up the table:

- Fulltext index creation
- Change column data type
- Convert or specify column character sets

Specifically, defaults are only ever stored in the rows if they're
changed. The current default is kept in the table definition, so the
rows end up with NULL physically unless the default is changed. An alter
that does a default change is just like a big update to set the current
NULLs to the old default.
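To make that concrete, here is a hedged Alembic-style sketch; the table and column names are invented and not from any OpenStack project.

    # Illustrative Alembic migration fragment; 'widget' and 'colour' are
    # invented names.
    import sqlalchemy as sa
    from alembic import op


    def upgrade():
        # No DEFAULT clause and nullable: per the PostgreSQL 9.5 docs
        # quoted above this is a metadata-only change, and InnoDB can
        # perform it as online DDL.
        op.add_column('widget',
                      sa.Column('colour', sa.String(32), nullable=True))

        # By contrast, a NOT NULL column with a server default forces
        # PostgreSQL 9.x to rewrite every row under an exclusive lock:
        # op.add_column('widget',
        #               sa.Column('colour', sa.String(32),
        #                         nullable=False, server_default='blue'))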
Post by Robert Collins
Post by Clint Byrum
Do not add new foreign key constraints.
What's the reason for this - if it's to avoid exclusive locks, I'd
note that the other rules above don't avoid exclusive locks - again,
DB specific, and for better or worse we are now testing on multiple DB
engines via 3rd party testing.
https://dev.launchpad.net/Database/LivePatching has some info from our
experience doing online and very fast offline patches in Launchpad.
The reason is to avoid the old code running into new restrictions. If
you add a FK constraint to an existing table, old code will insert into
it and fail, because it doesn't create the referenced rows the new
constraint requires.
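A tiny, hypothetical illustration of that failure mode (all names and the connection URL are invented):

    # The new release added something like:
    #   ALTER TABLE widget ADD CONSTRAINT widget_owner_fk
    #       FOREIGN KEY (owner_id) REFERENCES owner (id);
    # while nodes on the old release keep doing the equivalent of:
    import sqlalchemy as sa

    engine = sa.create_engine('mysql+pymysql://user:secret@db.example/demo')
    with engine.begin() as conn:
        # The old code never creates the matching 'owner' row, so once
        # the constraint exists this insert fails with an IntegrityError.
        conn.execute(
            sa.text('INSERT INTO widget (id, owner_id) VALUES (:id, :oid)'),
            {'id': 1, 'oid': 42},
        )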
Flavio Percoco
2016-08-30 08:42:55 UTC
Permalink
Post by Steve Martinelli
The keystone team is pursuing a trigger-based approach to support rolling,
http://docs.openstack.org/developer/keystone/upgrading.html
This differs from Nova and Neutron's approaches to solve for rolling
upgrades (which use oslo.versionedobjects), however Keystone is one of the
few services that doesn't need to manage communication between multiple
releases of multiple service components talking over the message bus (which
is the original use case for oslo.versionedobjects, and for which it is
aptly suited). Keystone simply scales horizontally and every node talks
directly to the database.
Database triggers are obviously a new challenge for developers to write,
honestly challenging to debug (being side effects), and are made even more
difficult by having to hand write triggers for MySQL, PostgreSQL, and
SQLite independently (SQLAlchemy offers no assistance in this case), as
https://review.openstack.org/#/c/355618/
However, implementing an application-layer solution with
oslo.versionedobjects is not an easy task either; refer to Neutron's
https://review.openstack.org/#/q/topic:bp/adopt-oslo-versioned-objects-for-db
Our primary concern at this point are how to effectively test the triggers
we write against our supported database systems, and their various
deployment variations. We might be able to easily drop SQLite support (as
it's only supported for our own test suite), but should we expect variation
in support and/or actual behavior of triggers across the MySQLs, MariaDBs,
Perconas, etc, of the world that would make it necessary to test each of
them independently? If you have operational experience working with
triggers at scale: are there landmines that we need to be aware of? What is
it going to take for us to say we support *zero* dowtime upgrades with
confidence?
Hey Steve, Dolph,

Thanks for sending this out. There have been some discussions in the Glance
community about how we can implement rolling upgrades, and it seems like
Glance's case is very similar to keystone's.

I'll make sure folks in the glance community are aware of this thread and reach
out.

Flavio
--
@flaper87
Flavio Percoco
David Stanek
2016-09-01 13:45:44 UTC
Permalink
Post by Steve Martinelli
The keystone team is pursuing a trigger-based approach to support rolling,
http://docs.openstack.org/developer/keystone/upgrading.html
I wanted to mention a few things. One of the reasons I suggested this
approach for keystone is that I've had success in the past using a
combination of triggers and code to do live, online migrations, many
times between completely different schemas.

In keystone we are just talking about some simple data transformations
between columns and things like that. The triggers themselves shouldn't
get too complicated. If there are cases where triggers won't work, then
we won't force them. (A current example of this is encrypting
credentials.)

The online migrations are not required. Operators can still go the old
route and db_sync while others help test out the cutting edge features.

The triggers are not there during the entire lifecycle of the
application. The expand phase adds them and the contract removes them.
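For anyone who has not looked at the review, here is a rough sketch of what an expand-phase trigger migration can look like. MySQL syntax only; the table, columns, and trigger name are invented, and this is not the actual keystone code.

    # Sketch of an expand-phase migration that installs a MySQL trigger
    # to mirror writes from an old column into its replacement. Invented
    # names; the real keystone triggers (and their PostgreSQL/SQLite
    # variants) are in the review linked earlier in the thread.
    import sqlalchemy as sa
    from alembic import op

    MYSQL_INSERT_SYNC = """
    CREATE TRIGGER widget_insert_sync BEFORE INSERT ON widget
    FOR EACH ROW
    BEGIN
        IF NEW.new_colour IS NULL THEN
            SET NEW.new_colour = NEW.old_colour;
        END IF;
    END
    """


    def upgrade():
        # Expand: purely additive, so not-yet-upgraded nodes keep working.
        op.add_column('widget',
                      sa.Column('new_colour', sa.String(64), nullable=True))
        # The trigger keeps rows written by old-release nodes readable by
        # the new code; the contract phase later drops the trigger and
        # the old column.
        op.execute(MYSQL_INSERT_SYNC)

(In practice a matching BEFORE UPDATE trigger is usually needed as well.)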
--
David Stanek
web: http://dstanek.com
blog: http://traceback.org
Sean Dague
2016-09-01 14:52:52 UTC
Permalink
Post by David Stanek
Post by Steve Martinelli
The keystone team is pursuing a trigger-based approach to support rolling,
http://docs.openstack.org/developer/keystone/upgrading.html
I wanted to mention a few things. One of the reasons I suggested this
approach for keystone is that I've had success in the past using a
combination of triggers and code to do live, online migrations. Many
times using completely different schemas.
In keystone we are just talking about some simple data transformations
between columns and things like that. The triggers themselves shouldn't
get too complicated. If there are cases where triggers won't work, then
we won't force them. (A current example of this is encrypting
credentials.)
The online migrations are not required. Operators can still go the old
route and db_sync while others help test out the cutting edge features.
The triggers are not there during the entire lifecycle of the
application. The expand phase adds them and the contract removes them.
But you did that for an application where you were on call to handle any
issues, and you knew the data somewhat in advance.

In OpenStack this code would get committed. It would get executed 12 to
18 months later (the average current OpenStack level at the ops meetup
was Kilo/Liberty). It would be executed by people far away, possibly
running in different locales, without an idea about what's in the data set.

Part of OpenStack being a successful open source project is that the
mean expertise of our operators will keep decreasing over time. It will
be deployed and maintained by less and less skilled operators in each
release, because it will be deployed and maintained by more total
operators each release.

Putting DB trigger failure analysis into the toolkit required to manage
an upgrade failure is a really high bar for new ops.

-Sean
--
Sean Dague
http://dague.net
Thierry Carrez
2016-09-02 10:15:33 UTC
Permalink
Post by Sean Dague
Putting DB trigger failure analysis into the toolkit required to manage
an upgrade failure is a really high bar for new ops.
I agree with Sean: increasing the variety of technologies used increases
the system complexity, which in turn requires more skills to fully
understand and maintain operationally. It should only be done as a last
resort, with pros and cons carefully weighted. We really should involve
operators in this discussion to get the full picture of arguments for
and against.
--
Thierry Carrez (ttx)
Doug Hellmann
2016-09-02 17:53:19 UTC
Permalink
Post by Thierry Carrez
Post by Sean Dague
Putting DB trigger failure analysis into the toolkit required to manage
an upgrade failure is a really high bar for new ops.
I agree with Sean: increasing the variety of technologies used increases
the system complexity, which in turn requires more skills to fully
understand and maintain operationally. It should only be done as a last
resort, with pros and cons carefully weighted. We really should involve
operators in this discussion to get the full picture of arguments for
and against.
Yes, I would like to understand better what aspect of the approach
taken elsewhere is leading to the keystone team exploring other
options. So far I'm not seeing much upside to being different, and I'm
hearing a lot of cons.

Doug
Mike Bayer
2016-09-02 21:58:42 UTC
Permalink
Post by Doug Hellmann
Post by Thierry Carrez
Post by Sean Dague
Putting DB trigger failure analysis into the toolkit required to manage
an upgrade failure is a really high bar for new ops.
I agree with Sean: increasing the variety of technologies used increases
the system complexity, which in turn requires more skills to fully
understand and maintain operationally. It should only be done as a last
resort, with pros and cons carefully weighted. We really should involve
operators in this discussion to get the full picture of arguments for
and against.
Yes, I would like to understand better what aspect of the approach
taken elsewhere is leading to the keystone team exploring other
options. So far I'm not seeing much upside to being different, and I'm
hearing a lot of cons.
I continue to maintain that the problems themselves being discussed at
https://review.openstack.org/#/c/331740/ are different than what has
been discussed in detail before. To be "not different", this spec
would need to no longer discuss the concept of "we need N to be reading
from and writing to the old column to be compatible with N-1 as shown in
the below diagram...Once all the N-1 services are upgraded, N services
should be moved out of compatibility mode to use the new column. ".
To my knowledge, there are no examples of code in OpenStack that
straddle table and column changes directly in the SQL access layer as
this document describes. There's still a handful of folks, myself
included, who think this is a new kind of awkwardness we've not had to
deal with yet. My only idea on how to reduce it is to put the N-1/N
differences on the write side, not the read side, and triggers are *not*
the only way to do it. But if "being different" means "doing it on the
write side", then it seems like that overall concept is being vetoed.
Which I actually appreciate knowing up front, before I spend a lot of
time on it.
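As a sketch of what keeping the N-1/N difference on the write side in application code (rather than in a trigger) might look like, with invented table and column names:

    # Hypothetical write-side compatibility shim: the N release writes
    # both the old and the new column, so N-1 nodes that only know
    # old_colour keep working until the contract phase drops it.
    # Not keystone code.
    import sqlalchemy as sa

    metadata = sa.MetaData()
    widget = sa.Table(
        'widget', metadata,
        sa.Column('id', sa.Integer, primary_key=True),
        sa.Column('old_colour', sa.String(64)),  # what N-1 reads and writes
        sa.Column('new_colour', sa.String(64)),  # added by the expand phase
    )


    def create_widget(conn, widget_id, colour):
        # Dual-write during the transition window only; once every node
        # runs N, this compatibility write (and eventually old_colour
        # itself) goes away.
        conn.execute(widget.insert().values(
            id=widget_id,
            old_colour=colour,
            new_colour=colour,
        ))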
Post by Doug Hellmann
Doug
Steve Martinelli
2016-09-02 23:22:04 UTC
Permalink
Post by Thierry Carrez
I agree with Sean: increasing the variety of technologies used increases
the system complexity, which in turn requires more skills to fully
understand and maintain operationally. It should only be done as a last
resort, with pros and cons carefully weighted. We really should involve
operators in this discussion to get the full picture of arguments for
and against.
Two quick remarks about involving operators. First, see Matt Fischer's
reply to the thread; we have a great operator-developer experience with
Matt (he was one of the first folks looking at Fernet tokens), and he
volunteered to try out any triggers we write on his MySQL Galera cluster.
Secondly, the use of triggers was brought up at the OpenStack Ansible
midcycle, where several operators were present and, as I understand it,
felt positive about the idea.
Clint Byrum
2016-09-05 01:26:07 UTC
Permalink
Post by Mike Bayer
Post by Doug Hellmann
Post by Thierry Carrez
Post by Sean Dague
Putting DB trigger failure analysis into the toolkit required to manage
an upgrade failure is a really high bar for new ops.
I agree with Sean: increasing the variety of technologies used increases
the system complexity, which in turn requires more skills to fully
understand and maintain operationally. It should only be done as a last
resort, with pros and cons carefully weighted. We really should involve
operators in this discussion to get the full picture of arguments for
and against.
Yes, I would like to understand better what aspect of the approach
taken elsewhere is leading to the keystone team exploring other
options. So far I'm not seeing much upside to being different, and I'm
hearing a lot of cons.
I continue to maintain that the problems themselves being discussed at
https://review.openstack.org/#/c/331740/ are different than what has
been discussed in detail before. To be "not different", this spec
would need to no longer discuss the concept of "we need N to be reading
from and writing to the old column to be compatible with N-1 as shown in
the below diagram...Once all the N-1 services are upgraded, N services
should be moved out of compatibility mode to use the new column. ".
To my knowledge, there are no examples of code in OpenStack that
straddle table and column changes directly in the SQL access layer as
this document describes. There's still a handful of folks, myself
included, who think this is a new kind of awkwardness we've not had to
deal with yet. My only idea on how to reduce it is to put the N-1/N
differences on the write side, not the read side, and triggers are *not*
the only way to do it. But if "being different" means "doing it on the
write side", then it seems like that overall concept is being vetoed.
Which I actually appreciate knowing up front, before I spend a lot of
time on it.
The example for glance shows two entirely new objects being created for
the database (community and shared images). The compatibility mode flag
in config is cool; I think operators deal with things like that all the
time, like when a new API version arrives and they might not be ready to
support it. I'd hope that having it turned off would
also restrict the API microversion if such a thing exists so that the
community/shared image types aren't allowed yet. This seems
straightforward, and I feel like the spec was good except for the
addition of extra layers.

In this case, I'd just create the new column nullable, and maintain
both.

* Add visibility column to schema (in spec, 'glance-manage db_expand')

* upgrade all API nodes

* run the migration code to resolve the null visibility columns
(traditional "glance-manage db_migrate")

* advance compatibility mode to lowest commit that exists running
against DB

* set visibility to be not null (I think this would be 'glance-manage
db_contract latest_commit_desired')

Where, in this scheme, do triggers come in?
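For concreteness, that sequence could be sketched roughly as three migration functions. Alembic-style operations are used for brevity, and the backfill from an 'is_public' flag is assumed purely for illustration; the real glance-manage expand/migrate/contract implementation may differ.

    # Rough sketch of the steps above. Details are invented for
    # illustration, not taken from the glance spec.
    import sqlalchemy as sa
    from alembic import op


    def expand():
        # 'glance-manage db_expand': purely additive and nullable, so
        # old-release API nodes keep working and simply ignore the column.
        op.add_column('images',
                      sa.Column('visibility', sa.String(32), nullable=True))


    def migrate():
        # 'glance-manage db_migrate': resolve the NULL visibility values
        # online once the upgraded code is writing the new column.
        op.execute(
            "UPDATE images SET visibility = "
            "CASE WHEN is_public THEN 'public' ELSE 'private' END "
            "WHERE visibility IS NULL")


    def contract():
        # 'glance-manage db_contract': only after every node runs new code
        # and compatibility mode is lifted is the NOT NULL constraint added.
        op.alter_column('images', 'visibility',
                        existing_type=sa.String(32), nullable=False)

Nothing in this particular sequence requires a trigger; the only compatibility logic lives in the upgraded application code.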