Discussion:
[openstack-dev] [tripleo] tripleo upstream gate outage,
Matt Young
2018-05-13 20:09:46 UTC
Permalink
Re: resolving the network latency issue on the promotion server in the
tripleo-infra tenant, that's great news!

Re: retrospective on this class of issue, I'll reach out directly early
this week to get something on the calendar for our two teams. We clearly
need to brainstorm/hash out together how we can reduce the turbulence
moving forward.

In addition, as a result of working these issues over the past few days
we've identified a few pieces of low-hanging (tooling) fruit that are ripe
for improvements that will speed diagnosis and debugging in the future.
We'll capture these as RFEs and get them into our backlog.

Matt
2. Shortly after #1 was resolved, CentOS released 7.5, which flows
directly into the upstream repos untested and ungated. Additionally, the
associated qcow2 image and container-base images were not updated at the
same time as the yum repos. https://bugs.launchpad.net/tripleo/+bug/1770355
Why do we have this situation every time the OS is upgraded to a new
version? Can't we test the image before actually using it? Couldn't we have
experimental jobs testing the latest image and pin gate images to a
specific one?
For example, we could configure infra to deploy CentOS 7.4 in our gate and
7.5 in experimental, so we can take our time to fix any problems and make
the switch when we're ready, instead of dealing with fires (which usually
all come at once).
It would be great to hold a retrospective on this between the TripleO CI
and infra folks, and see how we can improve things.
I agree.
We need, in coordination with the infra team, to be able to pin / lock
content for production check and gate jobs, while also having the ability
to stage new content, e.g. CentOS 7.5, with experimental or periodic jobs.
In this particular case the CI team did check the TripleO deployment with
CentOS 7.5 updates; however, we did not stage or test what impact the
CentOS minor update would have on the upstream job workflow.
The key issue is that the base CentOS image used upstream cannot be pinned
by the CI team. If we could pin that image, the CI team could pin the
CentOS repos used in CI and run staging jobs on the latest CentOS content.
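As a rough illustration of what such pinning could look like on a CI node
(a hypothetical sketch, not existing TripleO or infra tooling; the repo ids,
file path and point release are assumptions, and it relies on the CentOS
vault layout at vault.centos.org):

#!/usr/bin/env python
# Hypothetical sketch: pin a CI node's yum configuration to one CentOS
# point release by pointing "base" and "updates" at the CentOS vault,
# then disabling the rolling repos. Repo ids and paths are illustrative.
import subprocess

PINNED = "7.4.1708"  # assumed point release to stay on
REPO_FILE = "/etc/yum.repos.d/centos-pinned.repo"

REPO_TEMPLATE = """\
[base-pinned]
name=CentOS-{ver} - Base (pinned)
baseurl=http://vault.centos.org/{ver}/os/$basearch/
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-7

[updates-pinned]
name=CentOS-{ver} - Updates (pinned)
baseurl=http://vault.centos.org/{ver}/updates/$basearch/
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-7
"""

with open(REPO_FILE, "w") as repo:
    repo.write(REPO_TEMPLATE.format(ver=PINNED))

# Hide the rolling mirrors so only the pinned content is visible.
subprocess.check_call(
    ["yum-config-manager", "--disable", "base", "updates", "extras"])
subprocess.check_call(["yum", "clean", "all"])

With the rolling repos disabled, jobs on that node would only ever see the
pinned package set until the pin is deliberately moved.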
I'm glad that you also see the need for some amount of coordination here;
I've been in contact with a few folks to initiate the conversation.
On an unrelated note, Sagi and I just fixed the network latency issue on
our promotion server; it was related to DNS. Automatic promotions should
be back online.
Thanks all.
--
Emilien Macchi
Jeremy Stanley
2018-05-13 15:24:35 UTC
Permalink
On 2018-05-13 08:25:25 -0600 (-0600), Wesley Hayutin wrote:
[...]
We need, in coordination with the infra team, to be able to pin / lock
content for production check and gate jobs, while also having the ability
to stage new content, e.g. CentOS 7.5, with experimental or periodic jobs.
[...]

It looks like adjustments would be needed to DIB's centos-minimal
element if we want to be able to pin it to specific minor releases.
However, having to rotate out images in the fashion described would
be a fair amount of manual effort and seems like it would violate
our support expectations in governance if we end up pinning to older
minor versions (for major LTS versions on the other hand, we expect
to undergo this level of coordination but they come at a much slower
pace with a lot more advance warning). If we need to add controlled
roll-out of CentOS minor version updates, this is really no better
than Fedora from the Infra team's perspective and we've already said
we can't make stable branch testing guarantees for Fedora due to the
complexity involved in using different releases for each branch and
the need to support our stable branches longer than the distros are
supporting the releases on which we're testing.

For example, how long would the distro maintainers have committed to
supporting RHEL 7.4 after 7.5 was released? Longer than we're
committing to extended maintenance on our stable/queens branches? Or
would you expect projects to still continue to backport support for
these minor platform bumps to all their stable branches too? And
what sort of grace period should we give them before we take away
the old versions? Also, how many minor versions of CentOS should we
expect to end up maintaining in parallel? (Remember, every
additional image means that much extra time to build and upload to
all our providers, as well as that much more storage on our builders
and in our Glance quotas.)
--
Jeremy Stanley
Wesley Hayutin
2018-05-14 02:44:25 UTC
Permalink
Post by Jeremy Stanley
[...]
We need, in coordination with the infra team, to be able to pin / lock
content for production check and gate jobs, while also having the ability
to stage new content, e.g. CentOS 7.5, with experimental or periodic jobs.
[...]
It looks like adjustments would be needed to DIB's centos-minimal
element if we want to be able to pin it to specific minor releases.
However, having to rotate out images in the fashion described would
be a fair amount of manual effort and seems like it would violate
our support expectations in governance if we end up pinning to older
minor versions (for major LTS versions on the other hand, we expect
to undergo this level of coordination but they come at a much slower
pace with a lot more advance warning). If we need to add controlled
roll-out of CentOS minor version updates, this is really no better
than Fedora from the Infra team's perspective and we've already said
we can't make stable branch testing guarantees for Fedora due to the
complexity involved in using different releases for each branch and
the need to support our stable branches longer than the distros are
supporting the releases on which we're testing.
This is good insight, Jeremy; thanks for replying.
Post by Jeremy Stanley
For example, how long would the distro maintainers have committed to
supporting RHEL 7.4 after 7.5 was released? Longer than we're
committing to extended maintenance on our stable/queens branches? Or
would you expect projects to still continue to backport support for
these minor platform bumps to all their stable branches too? And
what sort of grace period should we give them before we take away
the old versions? Also, how many minor versions of CentOS should we
expect to end up maintaining in parallel? (Remember, every
additional image means that much extra time to build and upload to
all our providers, as well as that much more storage on our builders
and in our Glance quotas.)
--
Jeremy Stanley
I think you may be describing a level of support that is far greater than
what I was thinking. I also don't want to tax the infra team with N+
versions of the base OS to support.
I do think it would be helpful to, say, have a one-week change window where
folks are given the opportunity to preflight check a new image and assess
the potential impact the updated image may have on the job workflow. If I
could update or create a non-voting job with the new image, that would
provide two things.

1. The heads-up: this new minor version of CentOS is coming into the
system and you have $x days to deal with it.
2. The ability to build a few non-voting jobs with the new image to see
what kind of impact it has on the workflow and deployments.
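As a rough illustration of the heads-up idea (a hypothetical sketch, not
existing CI tooling; the threshold is an assumption), such a canary could
simply compare what the image has against what the mirror offers:

#!/usr/bin/env python
# Hypothetical canary sketch: warn when the mirror carries a pending
# centos-release bump (i.e. a new point release is rolling in) or when
# the pending update set is large enough to blow up job run times.
import subprocess
import sys

THRESHOLD = 200  # assumed: alert when this many updates are pending

# "yum -q check-update" exits 100 when updates are available; its output
# lists one package per line (continuation lines start with whitespace).
proc = subprocess.run(["yum", "-q", "check-update"],
                      stdout=subprocess.PIPE, universal_newlines=True)
pending = [line for line in proc.stdout.splitlines()
           if line and not line.startswith(" ")]

release_bump = any(line.split()[0].startswith("centos-release")
                   for line in pending)

if release_bump:
    print("centos-release update pending: a new point release is arriving")
if len(pending) > THRESHOLD:
    print("%d package updates pending; expect long yum transactions"
          % len(pending))

sys.exit(1 if release_bump or len(pending) > THRESHOLD else 0)

A periodic, non-voting job running something like this would give the
"$x days" warning without requiring infra to hold images back.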

In this case the updated CentOS 7.5 image worked fine with TripleO;
however, it did cause our gates to go red because:
a. when we update containers with Zuul dependencies, all the base-OS
updates were pulled in and jobs timed out.
b. a kernel bug workaround using virt-customize failed to work because the
kernel packages changed (third-party job).
c. the containers we use were not yet at CentOS 7.5 but the bare-metal
image was, causing issues with Pacemaker.
d. there may be a few more that I am forgetting, but hopefully the point is
made.

We can fix a lot of these issues, and I'm not blaming anyone; if we
(TripleO) had thought of all the corner cases in our workflow we would have
been able to avoid some of them. However, it does seem like we get hit by
$something every time we update a minor version of the base OS. My
preference would be to have a heads-up and work through the issues rather
than go immediately red and be unable to merge patches. I don't know if
other teams get impacted in similar ways, and I understand this is a big
ship and updating CentOS may work just fine for everyone else.

Thanks all for your time and effort!
Jeremy Stanley
2018-05-14 03:29:45 UTC
Permalink
On 2018-05-13 20:44:25 -0600 (-0600), Wesley Hayutin wrote:
[...]
Post by Wesley Hayutin
I do think it would be helpful to, say, have a one-week change
window where folks are given the opportunity to preflight check a
new image and the potential impact on the job workflow the updated
image may have. If I could update or create a non-voting job w/
the new image that would provide two things.
1. The first is the head's up, this new minor version of centos is
coming into the system and you have $x days to deal with it.
2. The ability to build a few non-voting jobs w/ the new image to
see what kind of impact it has on the workflow and deployments.
[...]

While I can see where you're coming from, right now even the Infra
team doesn't know immediately when a new CentOS minor release starts
to be used. The packages show up in the mirrors automatically and
images begin to be built with them right away. There isn't a
conscious "switch" which is thrown by anyone. This is essentially
the same way we treat Ubuntu LTS point releases as well. If this is
_not_ the way RHEL/CentOS are intended to be consumed (i.e. just
upgrade to and run the latest packages available for a given major
release series) then we should perhaps take a step back and
reevaluate this model. For now we have some fairly deep-driven
assumptions in that regard which are reflected in the Linux
distributions support policy of our project testing interface as
documented in OpenStack governance.
--
Jeremy Stanley
Wesley Hayutin
2018-05-14 13:07:03 UTC
Permalink
Post by Jeremy Stanley
[...]
Post by Wesley Hayutin
I do think it would be helpful to, say, have a one-week change
window where folks are given the opportunity to preflight check a
new image and the potential impact on the job workflow the updated
image may have. If I could update or create a non-voting job w/
the new image that would provide two things.
1. The first is the head's up, this new minor version of centos is
coming into the system and you have $x days to deal with it.
2. The ability to build a few non-voting jobs w/ the new image to
see what kind of impact it has on the workflow and deployments.
[...]
While I can see where you're coming from, right now even the Infra
team doesn't know immediately when a new CentOS minor release starts
to be used. The packages show up in the mirrors automatically and
images begin to be built with them right away. There isn't a
conscious "switch" which is thrown by anyone. This is essentially
the same way we treat Ubuntu LTS point releases as well. If this is
_not_ the way RHEL/CentOS are intended to be consumed (i.e. just
upgrade to and run the latest packages available for a given major
release series) then we should perhaps take a step back and
reevaluate this model.
I think you may be conflating the notion that Ubuntu or RHEL/CentOS can be
updated without any issues for applications that run atop the distributions
with what it means to introduce a minor update into the upstream OpenStack
CI workflow.

If jobs could execute without a timeout, the TripleO jobs would not have
gone red. Since we do have constraints in the upstream, like timeouts and
others, we have to prepare containers, images, etc. to work efficiently in
the upstream. For example, if our jobs had the time to yum update the
roughly 120 containers in play in each job, the TripleO jobs would have
just worked. I am not advocating for not having timeouts or constraints on
jobs; however, I am saying this is an infra issue, not a distribution or
distribution support issue.

I think this is an important point to consider and I view it as mostly
unrelated to the support claims by the distribution. Does that make sense?
Thanks
Post by Jeremy Stanley
For now we have some fairly deep-driven
assumptions in that regard which are reflected in the Linux
distributions support policy of our project testing interface as
documented in OpenStack governance.
--
Jeremy Stanley
Jeremy Stanley
2018-05-14 14:35:24 UTC
Permalink
On 2018-05-14 07:07:03 -0600 (-0600), Wesley Hayutin wrote:
[...]
Post by Wesley Hayutin
I think you may be conflating the notion that ubuntu or rhel/cent
can be updated w/o any issues to applications that run atop of the
distributions with what it means to introduce a minor update into
the upstream openstack ci workflow.
If jobs could execute w/o a timeout the tripleo jobs would have
not gone red. Since we do have constraints in the upstream like
timeouts and others we have to prepare containers, images etc to
work efficiently in the upstream. For example, if our jobs had
the time to yum update the roughly 120 containers in play in each
job the tripleo jobs would have just worked. I am not advocating
for not having timeouts or constraints on jobs, however I am
saying this is an infra issue, not a distribution or distribution
support issue.
I think this is an important point to consider and I view it as
mostly unrelated to the support claims by the distribution. Does
that make sense?
[...]

Thanks, the thread jumped straight to suggesting costly fixes
(separate images for each CentOS point release, adding an evaluation
period or acceptance testing for new point releases, et cetera)
without coming anywhere close to exploring the problem space. Is
your only concern that when your jobs started using CentOS 7.5
instead of 7.4 they took longer to run? What was the root cause? Are
you saying your jobs consume externally-produced artifacts which lag
behind CentOS package updates? Couldn't a significant burst of new
packages cause the same symptoms even without it being tied to a
minor version increase?

This _doesn't_ sound to me like a problem with how we've designed
our infrastructure, unless there are additional details you're
omitting. It sounds like a problem with how the jobs are designed
and expectations around distros slowly trickling package updates
into the series without occasional larger bursts of package deltas.
I'd like to understand more about why you upgrade packages inside
your externally-produced container images at job runtime at all,
rather than relying on the package versions baked into them. It
seems like you're arguing that the existence of lots of new package
versions which aren't already in your container images is the
problem, in which case I have trouble with the rationalization of it
being "an infra issue" insofar as it requires changes to the
services as provided by the OpenStack Infra team.

Just to be clear, we didn't "introduce a minor update into the
upstream openstack ci workflow." We continuously pull CentOS 7
packages into our package mirrors, and continuously rebuild our
centos-7 images from whatever packages the distro says are current.
Our automation doesn't know that there's a difference between
packages which were part of CentOS 7.4 and 7.5 any more than it
knows that there's a difference between Ubuntu 16.04.2 and 16.04.3.
Even if we somehow managed to pause our CentOS image updates
immediately prior to 7.5, jobs would still try to upgrade those
7.4-based images to the 7.5 packages in our mirror, right?
--
Jeremy Stanley
Wesley Hayutin
2018-05-14 15:57:17 UTC
Permalink
Post by Jeremy Stanley
[...]
Post by Wesley Hayutin
I think you may be conflating the notion that ubuntu or rhel/cent
can be updated w/o any issues to applications that run atop of the
distributions with what it means to introduce a minor update into
the upstream openstack ci workflow.
If jobs could execute w/o a timeout the tripleo jobs would have
not gone red. Since we do have constraints in the upstream like
timeouts and others we have to prepare containers, images etc to
work efficiently in the upstream. For example, if our jobs had
the time to yum update the roughly 120 containers in play in each
job the tripleo jobs would have just worked. I am not advocating
for not having timeouts or constraints on jobs, however I am
saying this is an infra issue, not a distribution or distribution
support issue.
I think this is an important point to consider and I view it as
mostly unrelated to the support claims by the distribution. Does
that make sense?
[...]
Thanks, the thread jumped straight to suggesting costly fixes
(separate images for each CentOS point release, adding an evaluation
period or acceptance testing for new point releases, et cetera)
without coming anywhere close to exploring the problem space. Is
your only concern that when your jobs started using CentOS 7.5
instead of 7.4 they took longer to run?
Yes. If they had unlimited time to run, our workflow would have everything
updated to CentOS 7.5 in the job itself, and I would expect everything to
just work.
Post by Jeremy Stanley
What was the root cause? Are
you saying your jobs consume externally-produced artifacts which lag
behind CentOS package updates?
Yes. TripleO has externally produced overcloud images and containers, both
of which can be yum updated, but we try to ensure they are frequently
recreated so the yum transaction is small.
Post by Jeremy Stanley
Couldn't a significant burst of new
packages cause the same symptoms even without it being tied to a
minor version increase?
Yes, certainly this could happen outside of a minor update of the base OS.
Post by Jeremy Stanley
This _doesn't_ sound to me like a problem with how we've designed
our infrastructure, unless there are additional details you're
omitting.
So the only thing out of our control is the package set on the base
nodepool image.
If that suddenly gets updated with too many packages, then we have to
scramble to ensure the images and containers are also updated.
If there is a breaking change in the nodepool image, for example [a], we
have to react to and fix that as well.
Post by Jeremy Stanley
It sounds like a problem with how the jobs are designed
and expectations around distros slowly trickling package updates
into the series without occasional larger bursts of package deltas.
I'd like to understand more about why you upgrade packages inside
your externally-produced container images at job runtime at all,
rather than relying on the package versions baked into them.
We do that to ensure the Gerrit review itself and its dependencies are
built via RPM and injected into the build.
If we did not do this, the job would not be testing the change at all.
This is a result of being a package-based deployment, for better or worse.
Post by Jeremy Stanley
It
seems like you're arguing that the existence of lots of new package
versions which aren't already in your container images is the
problem, in which case I have trouble with the rationalization of it
being "an infra issue" insofar as it requires changes to the
services as provided by the OpenStack Infra team.
Just to be clear, we didn't "introduce a minor update into the
upstream openstack ci workflow." We continuously pull CentOS 7
packages into our package mirrors, and continuously rebuild our
centos-7 images from whatever packages the distro says are current.
Understood; I think that is fine and probably works for most projects.
An enhancement could be to stage the new images for, say, one week or so.
Do we need the CentOS updates immediately? Is there a possible path that
does not create a lot of work for infra, but also provides some space for
projects to prep for the consumption of the updates?
Post by Jeremy Stanley
Our automation doesn't know that there's a difference between
packages which were part of CentOS 7.4 and 7.5 any more than it
knows that there's a difference between Ubuntu 16.04.2 and 16.04.3.
Even if we somehow managed to pause our CentOS image updates
immediately prior to 7.5, jobs would still try to upgrade those
7.4-based images to the 7.5 packages in our mirror, right?
Understood. I suspect this will become a more widespread issue as more
projects start to use containers (not sure). It's my understanding that
there are some mechanisms in place to pin packages in the CentOS nodepool
image, so there has been some thought generally in the area of this issue.

TripleO may be the exception to the rule here, and that is fine; I'm more
interested in exploring the possibilities of delivering updates in a staged
fashion than anything. I don't have insight into what the possibilities
are, or whether other projects have similar issues or requests. Perhaps the
TripleO project could share the details of our job workflow with the
community and this would make more sense.

I appreciate your time, effort and thoughts you have shared in the thread.
Post by Jeremy Stanley
--
Jeremy Stanley
[a] https://bugs.launchpad.net/tripleo/+bug/1770298
Clark Boylan
2018-05-14 16:08:18 UTC
Permalink
Post by Wesley Hayutin
Post by Jeremy Stanley
[...]
snip
Post by Wesley Hayutin
Post by Jeremy Stanley
This _doesn't_ sound to me like a problem with how we've designed
our infrastructure, unless there are additional details you're
omitting.
So the only thing out of our control is the package set on the base
nodepool image.
If that suddenly gets updated with too many packages, then we have to
scramble to ensure the images and containers are also updated.
If there is a breaking change in the nodepool image for example [a], we
have to react to and fix that as well.
Aren't the container images independent of the hosting platform (e.g. what infra hosts)? I'm not sure I understand why the host platform updating implies all the container images must also be updated.
Post by Wesley Hayutin
Post by Jeremy Stanley
It sounds like a problem with how the jobs are designed
and expectations around distros slowly trickling package updates
into the series without occasional larger bursts of package deltas.
I'd like to understand more about why you upgrade packages inside
your externally-produced container images at job runtime at all,
rather than relying on the package versions baked into them.
We do that to ensure the gerrit review itself and its dependencies are
built via rpm and injected into the build.
If we did not do this the job would not be testing the change at all.
This is a result of being a package based deployment for better or worse.
You'd only need to do that for the change in review, not the entire system, right?
snip
Post by Wesley Hayutin
Post by Jeremy Stanley
Our automation doesn't know that there's a difference between
packages which were part of CentOS 7.4 and 7.5 any more than it
knows that there's a difference between Ubuntu 16.04.2 and 16.04.3.
Even if we somehow managed to pause our CentOS image updates
immediately prior to 7.5, jobs would still try to upgrade those
7.4-based images to the 7.5 packages in our mirror, right?
Understood, I suspect this will become a more widespread issue as
more projects start to use containers ( not sure ). It's my understanding
that
there are some mechanisms in place to pin packages in the centos nodepool
image so
there has been some thoughts generally in the area of this issue.
Again, I think we need to understand why containers would make this worse, not better. Seems like the big feature everyone talks about when it comes to containers is isolating packaging, whether that be Python packages (so that nova and glance can use a different version of oslo) or cohabiting software that would otherwise conflict. Why do the packages on the host platform so strongly impact your container package lists?
Post by Wesley Hayutin
TripleO may be the exception to the rule here and that is fine, I'm more
interested in exploring
the possibilities of delivering updates in a staged fashion than anything.
I don't have insight into
what the possibilities are, or if other projects have similar issues or
requests. Perhaps the TripleO
project could share the details of our job workflow with the community and
this would make more sense.
I appreciate your time, effort and thoughts you have shared in the thread.
Post by Jeremy Stanley
--
Jeremy Stanley
[a] https://bugs.launchpad.net/tripleo/+bug/1770298
I think understanding the questions above may be the important aspect of understanding what the underlying issue is here and how we might address it.

Clark

Wesley Hayutin
2018-05-14 17:11:14 UTC
Permalink
Post by Clark Boylan
Post by Wesley Hayutin
Post by Jeremy Stanley
[...]
snip
Post by Wesley Hayutin
Post by Jeremy Stanley
This _doesn't_ sound to me like a problem with how we've designed
our infrastructure, unless there are additional details you're
omitting.
So the only thing out of our control is the package set on the base
nodepool image.
If that suddenly gets updated with too many packages, then we have to
scramble to ensure the images and containers are also updated.
If there is a breaking change in the nodepool image for example [a], we
have to react to and fix that as well.
Aren't the container images independent of the hosting platform (eg what
infra hosts)? I'm not sure I understand why the host platform updating
implies all the container images must also be updated.
You make a fine point here; I think, as with anything, there are some bits
that are still being worked on. At this moment it's my understanding that
Pacemaker and possibly a few other components are not 100% containerized.
I'm not an expert on the subject and my understanding may not be correct.
Until you are 100% containerized there may still be some dependencies on
the base image and an impact from changes.
Post by Clark Boylan
Post by Wesley Hayutin
Post by Jeremy Stanley
It sounds like a problem with how the jobs are designed
and expectations around distros slowly trickling package updates
into the series without occasional larger bursts of package deltas.
I'd like to understand more about why you upgrade packages inside
your externally-produced container images at job runtime at all,
rather than relying on the package versions baked into them.
We do that to ensure the gerrit review itself and its dependencies are
built via rpm and injected into the build.
If we did not do this the job would not be testing the change at all.
This is a result of being a package based deployment for better or
worse.
You'd only need to do that for the change in review, not the entire system right?
Correct, there is no intention of updating the entire distribution at run
time; the intent is to have as much as possible updated in the jobs that
build the containers and images.
Only the RPMs built from the Zuul change should be included in the update;
however, some Zuul changes require a CentOS base package that was not
previously installed in the container, e.g. a new Python dependency
introduced in a Zuul change. Previously we had not enabled any CentOS repos
in the container update, but we found that was not viable 100% of the time.

We have a change [1] to further limit the scope of the update, which
should help, especially when facing a minor version update.

[1] https://review.openstack.org/#/c/567550/
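As a rough sketch of the general idea of a scoped update (hypothetical, and
not the content of the review above; the repo path, container name and use
of docker are assumptions), the update can be restricted to the package
names actually produced for the change:

#!/usr/bin/env python
# Hypothetical sketch: instead of a blanket "yum update" inside each
# container, update only the packages that were rebuilt for the Zuul
# change, so a large base-OS delta (e.g. a new point release) is not
# pulled in during the job.
import glob
import subprocess

GATING_REPO = "/opt/gating_repo"         # assumed path to the change's RPMs
CONTAINER = "example-tripleo-container"  # assumed container name


def built_package_names(repo_dir):
    """Return the package names provided by the RPMs built for the change."""
    names = set()
    for rpm in glob.glob("%s/*.rpm" % repo_dir):
        out = subprocess.check_output(
            ["rpm", "-qp", "--qf", "%{NAME}\n", rpm],
            universal_newlines=True)
        names.add(out.strip())
    return sorted(names)


names = built_package_names(GATING_REPO)
if names:
    # The gating repo is assumed to be configured inside the container;
    # only the rebuilt packages are touched, everything else keeps the
    # versions it was built with.
    subprocess.check_call(
        ["docker", "exec", CONTAINER, "yum", "-y", "update"] + names)

If a change introduces a brand-new dependency, a yum install of that name
would still be needed, which is roughly the corner case described above.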
Post by Clark Boylan
snip
Post by Wesley Hayutin
Post by Jeremy Stanley
Our automation doesn't know that there's a difference between
packages which were part of CentOS 7.4 and 7.5 any more than it
knows that there's a difference between Ubuntu 16.04.2 and 16.04.3.
Even if we somehow managed to pause our CentOS image updates
immediately prior to 7.5, jobs would still try to upgrade those
7.4-based images to the 7.5 packages in our mirror, right?
Understood, I suspect this will become a more widespread issue as
more projects start to use containers ( not sure ). It's my
understanding
Post by Wesley Hayutin
that
there are some mechanisms in place to pin packages in the centos nodepool
image so
there has been some thoughts generally in the area of this issue.
Again, I think we need to understand why containers would make this worse
not better. Seems like the big feature everyone talks about when it comes
to containers is isolating packaging whether that be python packages so
that nova and glance can use a different version of oslo or cohabitating
software that would otherwise conflict. Why do the packages on the host
platform so strongly impact your container package lists?
I'll let others comment on that; however, my thought is that you don't
move from A to Z in one step, and containers do not make everything easier
immediately. Like most things, it takes a little time.
Post by Clark Boylan
Post by Wesley Hayutin
TripleO may be the exception to the rule here and that is fine, I'm more
interested in exploring
the possibilities of delivering updates in a staged fashion than
anything.
Post by Wesley Hayutin
I don't have insight into
what the possibilities are, or if other projects have similar issues or
requests. Perhaps the TripleO
project could share the details of our job workflow with the community
and
Post by Wesley Hayutin
this would make more sense.
I appreciate your time, effort and thoughts you have shared in the
thread.
Post by Wesley Hayutin
Post by Jeremy Stanley
--
Jeremy Stanley
[a] https://bugs.launchpad.net/tripleo/+bug/1770298
I think understanding the questions above may be the important aspect of
understanding what the underlying issue is here and how we might address it.
Clark
Thanks, Clark; let me know if I did not get everything on your list there.
Thanks again for your time.
Clark Boylan
2018-05-14 18:05:29 UTC
Permalink
snip
Post by Wesley Hayutin
Post by Wesley Hayutin
Post by Wesley Hayutin
Post by Jeremy Stanley
Our automation doesn't know that there's a difference between
packages which were part of CentOS 7.4 and 7.5 any more than it
knows that there's a difference between Ubuntu 16.04.2 and 16.04.3.
Even if we somehow managed to pause our CentOS image updates
immediately prior to 7.5, jobs would still try to upgrade those
7.4-based images to the 7.5 packages in our mirror, right?
Understood, I suspect this will become a more widespread issue as
more projects start to use containers ( not sure ). It's my
understanding
Post by Wesley Hayutin
that
there are some mechanisms in place to pin packages in the centos nodepool
image so
there has been some thoughts generally in the area of this issue.
Again, I think we need to understand why containers would make this worse
not better. Seems like the big feature everyone talks about when it comes
to containers is isolating packaging whether that be python packages so
that nova and glance can use a different version of oslo or cohabitating
software that would otherwise conflict. Why do the packages on the host
platform so strongly impact your container package lists?
I'll let others comment on that, however my thought is you don't move from
A -> Z in one step and containers do not make everything easier
immediately. Like most things, it takes a little time.
If the main issue is being caught in a transition period at the same time a minor update happens, can we treat this as a temporary state? Rather than attempting to solve this particular case happening again in the future, we might be better served testing that upcoming CentOS releases won't break TripleO due to changes in the packaging, using the centos-release-cr repo as Tristan suggests. That should tell you if something like Pacemaker were to stop working. Note this wouldn't require any infra-side updates; you would just have these jobs configure the additional repo and go from there.

Then, on top of that, get through the transition period so that the containers isolate you from these changes in the way they should. Then, when 7.6 happens, you'll hopefully have identified all the broken packaging ahead of time and worked with upstream to address those problems (which should be important for a stable, long-term-support distro), and your containers can update at whatever pace they choose.

I don't think it would be appropriate for Infra to stage CentOS minor versions, for a couple of reasons. The first is that we don't support specific minor versions of CentOS/RHEL; we support the major version, and if it updates and OpenStack stops working, that is CI doing its job and providing that info. The other major concern is that CentOS specifically says "We are trying to make sure people understand they can NOT use older minor versions and still be secure." Similarly to how we won't support Ubuntu 12.04 because it is no longer supported, we shouldn't support CentOS 7.4 at this point. These are no longer secure platforms.

However, I think testing using the pre-release repo as proposed above should allow you to catch issues before updates happen just as well as a staged minor version update would. The added benefit of using this process is that you should know as soon as possible, and not after the release has been made (helping other users of CentOS by not releasing broken packages in the first place).

Clark

Jeremy Stanley
2018-05-14 16:37:03 UTC
Permalink
[...]
Post by Wesley Hayutin
Couldn't a significant burst of new packages cause the same
symptoms even without it being tied to a minor version increase?
Yes, certainly this could happen outside of a minor update of the base OS.
Thanks for confirming. So this is not specifically a CentOS minor
version increase issue, it's just more likely to occur at minor
version boundaries.
Post by Wesley Hayutin
So the only thing out of our control is the package set on the
base nodepool image. If that suddenly gets updated with too many
packages, then we have to scramble to ensure the images and
containers are also updated.
It's still unclear to me why the packages on the test instance image
(i.e. the "container host") are related to the packages in the
container guest images at all. That would seem to be the whole point
of having containers?
Post by Wesley Hayutin
If there is a breaking change in the nodepool image for example
[a], we have to react to and fix that as well.
I would argue that one is a terrible workaround which happened to
show its warts. We should fix DIB's pip-and-virtualenv element
rather than continue to rely on side effects of pinning RPM versions.
I've commented to that effect on https://launchpad.net/bugs/1770298
just now.
Post by Wesley Hayutin
It sounds like a problem with how the jobs are designed
and expectations around distros slowly trickling package updates
into the series without occasional larger bursts of package deltas.
I'd like to understand more about why you upgrade packages inside
your externally-produced container images at job runtime at all,
rather than relying on the package versions baked into them.
We do that to ensure the gerrit review itself and its
dependencies are built via rpm and injected into the build. If we
did not do this the job would not be testing the change at all.
This is a result of being a package based deployment for better or worse.
[...]

Now I'll risk jumping to proposing solutions, but have you
considered building those particular packages in containers too?
That way they're built against the same package versions as will be
present in the other container images you're using rather than to
the package versions on the host, right? Seems like it would
completely sidestep the problem.
Post by Wesley Hayutin
An enhancement could be to stage the new images for say one week
or so. Do we need the CentOS updates immediately? Is there a
possible path that does not create a lot of work for infra, but
also provides some space for projects to prep for the consumption
of the updates?
[...]

Nodepool builds new images constantly, but at least daily. Part of
this is to prevent the delta of available packages/indices and other
files baked into those images from being more than a day or so stale
at any given point in time. The older the image, the more packages
(on average) jobs will need to download if they want to test with
latest package versions and the more strain it will put on our
mirrors and on our bandwidth quotas/donors' networks.

There's also a question of retention if we're building images at
least daily but keeping them around for 7 days (storage on the
builders, tenant quotas for Glance in our providers), as well as the
explosion of additional nodes we'd need since we pre-boot nodes with
each of our images (and the idea as I understand it is that you
would want jobs to be able to select between any of them). One
option, I suppose, would be to switch to building images weekly
instead of daily, but that only solves the storage and node count
problem, not the additional bandwidth and mirror load. And of course,
nodepool would need to learn to be able to boot nodes from older
versions of an image on record, which is not a feature it has right
now.
Post by Wesley Hayutin
Understood, I suspect this will become a more widespread issue as
more projects start to use containers ( not sure ).
I'm still confused as to what makes this a container problem in the
general sense, rather than just a problem (leaky abstraction) with
how you've designed the job framework in which you're using them.
Post by Wesley Hayutin
It's my understanding that there are some mechanisms in place to
pin packages in the centos nodepool image so there has been some
thoughts generally in the area of this issue.
[...]

If this is a reference back to bug 1770298, as mentioned already I
think that's a mistake in diskimage-builder's stdlib which should be
corrected, not a pattern we should propagate.
--
Jeremy Stanley
Wesley Hayutin
2018-05-14 18:00:05 UTC
Permalink
Post by Jeremy Stanley
[...]
Post by Wesley Hayutin
Couldn't a significant burst of new packages cause the same
symptoms even without it being tied to a minor version increase?
Yes, certainly this could happen outside of a minor update of the base OS.
Thanks for confirming. So this is not specifically a CentOS minor
version increase issue, it's just more likely to occur at minor
version boundaries.
Correct, you got it
Post by Jeremy Stanley
Post by Wesley Hayutin
So the only thing out of our control is the package set on the
base nodepool image. If that suddenly gets updated with too many
packages, then we have to scramble to ensure the images and
containers are also updated.
It's still unclear to me why the packages on the test instance image
(i.e. the "container host") are related to the packages in the
container guest images at all. That would seem to be the whole point
of having containers?
You are right; just note that some services are not 100% containerized yet.
This doesn't happen overnight; it's a process, and we're getting there.
Post by Jeremy Stanley
Post by Wesley Hayutin
If there is a breaking change in the nodepool image for example
[a], we have to react to and fix that as well.
I would argue that one is a terrible workaround which happened to
show its warts. We should fix DIB's pip-and-virtualenv element
rather than continue rely on side effects of pinning RPM versions.
I've commented to that effect on https://launchpad.net/bugs/1770298
just now.
k.. thanks
Post by Jeremy Stanley
Post by Wesley Hayutin
It sounds like a problem with how the jobs are designed
and expectations around distros slowly trickling package updates
into the series without occasional larger bursts of package deltas.
I'd like to understand more about why you upgrade packages inside
your externally-produced container images at job runtime at all,
rather than relying on the package versions baked into them.
We do that to ensure the gerrit review itself and it's
dependencies are built via rpm and injected into the build. If we
did not do this the job would not be testing the change at all.
This is a result of being a package based deployment for better or worse.
[...]
Now I'll risk jumping to proposing solutions, but have you
considered building those particular packages in containers too?
That way they're built against the same package versions as will be
present in the other container images you're using rather than to
the package versions on the host, right? Seems like it would
completely sidestep the problem.
So, a little background: the containers and images used in TripleO are
rebuilt multiple times each day via periodic jobs; when they pass our
criteria they are pushed out and used upstream.
Each Zuul change and its dependencies can potentially impact a few or all
of the containers in play. We cannot rebuild all the containers due to time
constraints in each job, but we have been able to mount and yum update the
containers involved with the Zuul change.

The latest patch to fine-tune that process is here:
https://review.openstack.org/#/c/567550/
Post by Jeremy Stanley
Post by Wesley Hayutin
An enhancement could be to stage the new images for say one week
or so. Do we need the CentOS updates immediately? Is there a
possible path that does not create a lot of work for infra, but
also provides some space for projects to prep for the consumption
of the updates?
[...]
Nodepool builds new images constantly, but at least daily. Part of
this is to prevent the delta of available packages/indices and other
files baked into those images from being more than a day or so stale
at any given point in time. The older the image, the more packages
(on average) jobs will need to download if they want to test with
latest package versions and the more strain it will put on our
mirrors and on our bandwidth quotas/donors' networks.
Sure, that makes perfect sense. We do the same with our containers and
images.
Post by Jeremy Stanley
There's also a question of retention, if we're building images at
least daily but keeping them around for 7 days (storage on the
builders, tenant quotas for Glance in our providers) as well as the
explosion of additional nodes we'd need since we pre-boot nodes with
each of our images (and the idea as I understand it is that you
would want jobs to be able to select between any of them). One
option, I suppose, would be to switch to building images weekly
instead of daily, but that only solves the storage and node count
problem not the additional bandwidth and mirror load. And of course,
nodepool would need to learn to be able to boot nodes from older
versions of an image on record which is not a feature it has right
now.
OK, thanks for walking me through that. It totally makes sense to be
concerned with updating the image to save time, bandwidth, etc.
It would be interesting to see if we could come up with something to
protect projects from changes in new images while still maintaining images
with fresh updates.

Project non-voting check jobs on the nodepool image creation job could
perhaps be the canary in the coal mine we are seeking. Maybe we could see
if that would be something that could be useful to both infra and to
various OpenStack projects?
Post by Jeremy Stanley
Post by Wesley Hayutin
Understood, I suspect this will become a more widespread issue as
more projects start to use containers ( not sure ).
I'm still confused as to what makes this a container problem in the
general sense, rather than just a problem (leaky abstraction) with
how you've designed the job framework in which you're using them.
Post by Wesley Hayutin
It's my understanding that there are some mechanisms in place to
pin packages in the centos nodepool image so there has been some
thoughts generally in the area of this issue.
[...]
If this is a reference back to bug 1770298, as mentioned already I
think that's a mistake in diskimage-builder's stdlib which should be
corrected, not a pattern we should propagate.
Cool, good to know and thank you!
Post by Jeremy Stanley
--
Jeremy Stanley
Jeremy Stanley
2018-05-14 18:56:51 UTC
Permalink
On 2018-05-14 12:00:05 -0600 (-0600), Wesley Hayutin wrote:
[...]
Post by Wesley Hayutin
Project non-voting check jobs on the node-pool image creation job
perhaps could be the canary in the coal mine we are seeking. Maybe
we could see if that would be something that could be useful to
both infra and to various OpenStack projects?
[...]

This presumes that Nodepool image builds are Zuul jobs, which they
aren't (at least not today). Long, long ago in a CI system not so
far away, our DevStack-specific image builds were in fact CI jobs
and for a while back then we did run DevStack's "smoke" tests as an
acceptance test before putting a new image into service. At the time
we discovered that even deploying DevStack was too complex and racy
to make for a viable acceptance test. The lesson we learned is that
most of the image regressions we were concerned with preventing
required testing complex enough to be a significant regression
magnet itself (Gödel's completeness theorem at work, I expect?).

That said, the idea of turning more of Nodepool's tasks into Zuul
jobs is an interesting one worthy of lengthy discussion sometime.
--
Jeremy Stanley
Jeremy Stanley
2018-05-14 19:03:41 UTC
Permalink
On 2018-05-14 18:56:51 +0000 (+0000), Jeremy Stanley wrote:
[...]
Post by Jeremy Stanley
Gödel's completeness theorem at work
[...]

More accurately, Gödel's first incompleteness theorem, I suppose. ;)
--
Jeremy Stanley
Tristan Cacqueray
2018-05-14 03:50:01 UTC
Permalink
On May 14, 2018 2:44 am, Wesley Hayutin wrote:
[snip]
Post by Wesley Hayutin
I do think it would be helpful to, say, have a one-week change window where
folks are given the opportunity to preflight check a new image and the
potential impact on the job workflow the updated image may have.
[snip]

How about adding a periodic job that sets up centos-release-cr in a pre
task? This should highlight issues with upcoming updates:
https://wiki.centos.org/AdditionalResources/Repositories/CR

-Tristan
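As a rough sketch of what that pre task could boil down to (a hypothetical
illustration; the centos-release-cr package is the one documented on the
wiki page above, everything else is assumed):

#!/usr/bin/env python
# Hypothetical pre-task sketch for a periodic canary job: enable the
# CentOS CR ("continuous release") repo and apply the pending updates
# before the normal job payload runs, so breakage in an upcoming point
# release shows up ahead of its GA.
import subprocess

# centos-release-cr ships the CR repo definition (CentOS-CR.repo).
subprocess.check_call(["yum", "-y", "install", "centos-release-cr"])
subprocess.check_call(["yum", "-y", "update"])

A periodic, non-voting job carrying this step would go red when CR content
breaks the workflow, without touching the images used by check and gate.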
Wesley Hayutin
2018-05-14 21:42:20 UTC
Permalink
Post by Tristan Cacqueray
[snip]
Post by Wesley Hayutin
I do think it would be helpful to, say, have a one-week change window where
folks are given the opportunity to preflight check a new image and the
potential impact on the job workflow the updated image may have.
[snip]
How about adding a periodic job that sets up centos-release-cr in a pre
task? This should highlight issues with upcoming updates:
https://wiki.centos.org/AdditionalResources/Repositories/CR
-Tristan
Thanks for the suggestion, Tristan; I'm going to propose using this repo at
the next TripleO meeting.

Thanks
Sergii Golovatiuk
2018-05-15 07:57:57 UTC
Permalink
Wesley,

For Ubuntu I suggest enabling the 'proposed' repo to catch problems
before packages are moved to 'updates'.
Post by Wesley Hayutin
Post by Tristan Cacqueray
[snip]
Post by Wesley Hayutin
I do think it would be helpful to, say, have a one-week change window where
folks are given the opportunity to preflight check a new image and the
potential impact on the job workflow the updated image may have.
[snip]
How about adding a periodic job that sets up centos-release-cr in a pre
task? This should highlight issues with upcoming updates:
https://wiki.centos.org/AdditionalResources/Repositories/CR
-Tristan
Thanks for the suggestion Tristan, going to propose using this repo at the
next TripleO mtg.
Thanks
--
Best Regards,
Sergii Golovatiuk

Jeremy Stanley
2018-05-13 12:34:03 UTC
Permalink
On 2018-05-12 20:44:04 -0700 (-0700), Emilien Macchi wrote:
[...]
Why do we have this situation every time the OS is upgraded to a new
version? Can't we test the image before actually using it? Couldn't we have
experimental jobs testing the latest image and pin gate images to a
specific one?
For example, we could configure infra to deploy CentOS 7.4 in our gate and
7.5 in experimental, so we can take our time to fix any problems and make
the switch when we're ready, instead of dealing with fires (which usually
all come at once).
It would be great to hold a retrospective on this between the TripleO CI
and infra folks, and see how we can improve things.
In the past we've trusted statements from Red Hat that you should be
able to upgrade to newer point releases without experiencing
backward-incompatible breakage. Right now all our related tooling is
based on the assumption we made in governance that we can just
treat, e.g., RHEL/CentOS 7 as a long-term stable release
distribution similar to an Ubuntu LTS and not have to worry about
tracking individual point releases.

If this is not actually the case any longer, we should likely
reevaluate our support claims.
--
Jeremy Stanley