Discussion:
[openstack-dev] [ci][infra][tripleo] Multi-staged check pipelines for Zuul v3 proposal
Bogdan Dobrelya
2018-05-14 16:15:04 UTC
An update for your review please, folks.
Hello.
As the Zuul documentation [0] explains, the names "check", "gate", and
"post" may be altered for more advanced pipelines. Is it doable to
introduce, for particular openstack projects, multiple check
stages/steps as check-1, check-2 and so on? And is it possible to make
the subsequent steps reuse the environments that the previous steps
finished with?
Narrowing down to the tripleo CI scope, the problem I'd like us to solve
with this "virtual RFE", using such multi-staged check pipelines, is
reducing (ideally, de-duplicating) some of the common steps of the
existing CI jobs.
What you're describing sounds more like a job graph within a pipeline.
See: https://docs.openstack.org/infra/zuul/user/config.html#attr-job.dependencies
for how to configure a job to run only after another job has completed.
There is also a facility to pass data between such jobs.
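For illustration, a minimal sketch of such a job graph in a project's
.zuul.yaml (the job names here are hypothetical):

  - project:
      check:
        jobs:
          - base-deploy-job
          - long-scenario-job:
              dependencies:
                - base-deploy-job

With this, long-scenario-job starts (and consumes nodes) only after
base-deploy-job has completed successfully on the same change.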
... (skipped) ...
Creating a job graph to have one job use the results of the previous job
can make sense in a lot of cases. It doesn't always save *time*
however.
It's worth noting that in OpenStack's Zuul, we have made an explicit
choice not to have long-running integration jobs depend on shorter pep8
or tox jobs, and that's because we value developer time more than CPU
time. We would rather run all of the tests and return all of the
results so a developer can fix all of the errors as quickly as possible,
rather than forcing an iterative workflow where they have to fix all the
whitespace issues before the CI system will tell them which actual tests
broke.
-Jim
I proposed a few zuul dependencies [0], [1] to the tripleo CI pipelines
for undercloud deployments vs upgrades testing (and some more). Given
that those undercloud jobs do not have such high failure rates though, I
think Emilien is right in his comments and those would buy us nothing.

From the other side, what do you folks think of making
tripleo-ci-centos-7-3nodes-multinode depend on
tripleo-ci-centos-7-containers-multinode [2]? The former seems quite
failure-prone and long-running, and is non-voting. It deploys (see the
featureset configs [3]*) 3 nodes in an HA fashion. And it almost never
passes when containers-multinode fails - see the CI stats page [4].
I've found only 2 cases there of the opposite situation, where
containers-multinode fails but 3nodes-multinode passes. So cutting off
those predictable failures via the added dependency *would* buy us
something and allow other jobs to start sooner, at the reasonable price
of a somewhat extended run time for the main zuul pipeline. I think it
makes sense and that the extended CI time will not exceed the RDO CI
execution times enough to become a problem. WDYT?

[0] https://review.openstack.org/#/c/568275/
[1] https://review.openstack.org/#/c/568278/
[2] https://review.openstack.org/#/c/568326/
[3]
https://docs.openstack.org/tripleo-quickstart/latest/feature-configuration.html
[4] http://tripleo.org/cistatus.html

* ignore column 1, it's obsolete; all CI jobs now use configs
download, AFAICT...
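
For reference, the dependency proposed in [2] would boil down to
something like this sketch in the project-pipeline config (not
necessarily the exact content of that review):

  - project:
      check:
        jobs:
          - tripleo-ci-centos-7-containers-multinode
          - tripleo-ci-centos-7-3nodes-multinode:
              dependencies:
                - tripleo-ci-centos-7-containers-multinode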
--
Best regards,
Bogdan Dobrelya,
Irc #bogdando

Sagi Shnaidman
2018-05-14 19:15:06 UTC
Hi, Bogdan

I like the idea with the undercloud job. Actually, if the undercloud
fails, I'd stop all other jobs, because it doesn't make sense to run
them. Seeing the same failure in 10 jobs doesn't add too much. So maybe
adding the undercloud job as a dependency for all multinode jobs would
be a great idea. It's also worth checking how long it will delay jobs.
Will all jobs wait while the undercloud job is running? Or will they be
aborted when the undercloud job fails?

However, I'm very sceptical about the multinode containers and scenarios
jobs; they could fail for very different reasons, like race conditions
in the product or infra issues. Skipping some of them will lead to more
rechecks from devs trying to discover all the problems in a row, which
will delay the development process significantly.

Thanks
Post by Bogdan Dobrelya
... (skipped) ...
--
Best regards
Sagi Shnaidman
Bogdan Dobrelya
2018-05-15 08:43:10 UTC
Post by Sagi Shnaidman
Hi, Bogdan
I like the idea with the undercloud job. Actually, if the undercloud
fails, I'd stop all other jobs, because it doesn't make sense to run
them. Seeing the same failure in 10 jobs doesn't add too much. So maybe
adding the undercloud job as a dependency for all multinode jobs would
be a great idea.
I like that idea, I'll add another patch in the topic then.
Post by Sagi Shnaidman
It's also worth checking how long it will delay jobs. Will all jobs
wait while the undercloud job is running? Or will they be aborted when
the undercloud job fails?
That is a good question for the openstack-infra folks developing Zuul :)
But we could just try it and see how it works; happily, Zuul v3 allows
doing that just in the scope of the proposed patches! My expectation is
that all jobs will be delayed (and I mean the main zuul pipeline
execution time here) by the average time of the undercloud deploy job,
~80 min, which hopefully should not be a big deal given that there is a
separate RDO CI pipeline running in parallel, which normally *highly
likely* exceeds that extended time anyway :) And that is before counting
the high chance of the additional 'recheck rdo' runs we can observe
these days for patches on review. I wish we could introduce
inter-pipeline dependencies (zuul CI <-> RDO CI) for those as well...
Post by Sagi Shnaidman
However, I'm very sceptical about the multinode containers and
scenarios jobs; they could fail for very different reasons, like race
conditions in the product or infra issues. Skipping some of them will
lead to more rechecks from devs trying to discover all the problems in
a row, which will delay the development process significantly.
Right, I roughly estimated that the delay to the main zuul pipeline
execution time might be ~2.5h, which is not good. We could live with it
if it were only ~1h, like it takes in the undercloud containers job
dependency example.
Post by Sagi Shnaidman
... (skipped) ...
--
Best regards,
Bogdan Dobrelya,
Irc #bogdando

Wesley Hayutin
2018-05-15 20:35:48 UTC
Post by Sagi Shnaidman
... (skipped) ...
I agree on both counts w/ Sagi here.
Thanks Sagi
Post by Sagi Shnaidman
... (skipped) ...
Alex Schultz
2018-05-14 20:06:30 UTC
Post by Bogdan Dobrelya
... (skipped) ...
I'm not sure it makes sense to add a dependency on other deployment
tests. It's going to add additional time to the CI run because the
upgrade won't start until well over an hour after the rest of the
jobs. The only case I could think of where this makes more sense is
to delay the deployment tests until the pep8/unit tests pass, e.g.
let's not burn resources when the code is bad. There might be
arguments about the lack of information from a deployment when
developing things, but I would argue that the patch should be vetted
properly first in a local environment before taking CI resources.

Thanks,
-Alex
Bogdan Dobrelya
2018-05-15 08:54:37 UTC
Post by Alex Schultz
Post by Bogdan Dobrelya
... (skipped) ...
I'm not sure it makes sense to add a dependency on other deployment
tests. It's going to add additional time to the CI run because the
upgrade won't start until well over an hour after the rest of the jobs.
Things are not so simple. There is also a significant
time-to-wait-in-queue before jobs start, and that probably takes even
longer than the time to execute the jobs. That delay is a function of
the available HW resources and the zuul queue length, and the proposed
change affects those parameters as well, assuming jobs with failed
dependencies won't run at all. So we could expect longer execution
times compensated by shorter wait times! I'm not sure how to estimate
that, though. You folks have all the numbers and knowledge, let's use
that please.
Post by Alex Schultz
The only case I could think of where this makes more sense is
to delay the deployment tests until the pep8/unit tests pass, e.g.
let's not burn resources when the code is bad. There might be
arguments about the lack of information from a deployment when
developing things, but I would argue that the patch should be vetted
properly first in a local environment before taking CI resources.
I support this idea as well, though I'm sceptical about having that
blessed in the end :) I'll add a patch though.
Post by Alex Schultz
Thanks,
-Alex
--
Best regards,
Bogdan Dobrelya,
Irc #bogdando

Bogdan Dobrelya
2018-05-15 09:39:35 UTC
Added a few more patches [0], [1] based on the discussion results.
PTAL, folks. Wrt the remaining ones in the topic, I'd propose to give
it a try and revert it if it proves to do more harm than good.
Thank you for the feedback!

The next step could be reusing artifacts, like DLRN repos and containers
built for patches and hosted undercloud, in the subsequent pipelined
jobs. But I'm not sure how to even approach that.

[0] https://review.openstack.org/#/c/568536/
[1] https://review.openstack.org/#/c/568543/
Post by Bogdan Dobrelya
... (skipped) ...
--
Best regards,
Bogdan Dobrelya,
Irc #bogdando

James E. Blair
2018-05-15 14:30:03 UTC
Post by Bogdan Dobrelya
... (skipped) ...
The next step could be reusing artifacts, like DLRN repos and
containers built for patches and hosted undercloud, in the subsequent
pipelined jobs. But I'm not sure how to even approach that.
In order to use an artifact in a dependent job, you need to store it
somewhere and retrieve it.

In the parent job, I'd recommend storing the artifact on the log server
(in an "artifacts/" directory) next to the job's logs. The log server
is essentially a time-limited artifact repository keyed on the zuul
build UUID.

Pass the URL to the child job using the zuul_return Ansible module.

Have the child job fetch it from the log server using the URL it gets.
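
For illustration, a rough sketch of both halves as job playbooks (the
variable name and the URL composition here are assumptions; the real
log URL layout is site-specific):

  # Parent job (post-run): return the artifact URL to Zuul, so that
  # dependent jobs receive it as an Ansible variable.
  - hosts: localhost
    tasks:
      - name: Pass the artifact URL to dependent jobs
        zuul_return:
          data:
            artifact_url: "http://logs.openstack.org/{{ zuul.build }}/artifacts/all.tar.gz"

  # Child job (pre-run): fetch the artifact using the returned variable.
  - hosts: all
    tasks:
      - name: Fetch the artifact from the log server
        get_url:
          url: "{{ artifact_url }}"
          dest: /tmp/all.tar.gz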

However, don't do that if the artifacts are very large -- more than a
few MB -- we'll end up running out of space quickly.

In that case, please volunteer some time to help the infra team set up a
swift container to store these artifacts. We don't need to *run*
swift -- we have clouds with swift already. We just need some help
setting up accounts, secrets, and Ansible roles to use it from Zuul.

-Jim

Bogdan Dobrelya
2018-05-15 15:31:07 UTC
Post by James E. Blair
... (skipped) ...
Thank you, that's a good proposal! So when we have done that upstream
infra swift setup for tripleo, the 1st step in the job dependency graph
may be using quickstart to do something like:

* check out testing depends-on things,
* build repos and all tripleo docker images from these repos,
* upload into a swift container, with an automatic expiration set, the
de-duplicated and compressed tarball created with something like:
# docker save $(docker images -q) | gzip -1 > all.tar.gz
(I expect it will be something like a 2G file)
* something similar for DLRN repos prolly, I'm not an expert for this part.

Those stored artifacts would then be picked up by the next step in the
graph, deploying undercloud and overcloud in a single step, like:
* fetch the swift containers with repos and container images
* docker load -i all.tar.gz
* populate images into a local registry, as usual
* something similar for the repos. Includes an offline yum update (we
already have a compressed repo, right? profit!)
* deploy UC
* deploy OC, if a job wants it

And if the OC deployment is brought into a separate step, we do not
need local registries; just 'docker load -i all.tar.gz' issued on the
overcloud nodes should replace the image prep workflows and registries,
AFAICT. Not sure about the repos for that case.
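
A sketch of the save/load halves of that plan as Ansible tasks (the
paths are illustrative):

  # Parent job: export all local images into one gzip-compressed tarball.
  # Note: saving by image ID, as here, drops the repository tags; saving
  # by name:tag preserves them.
  - name: Save the built container images
    shell: docker save $(docker images -q) | gzip -1 > /tmp/all.tar.gz

  # Child job: docker load can read gzip-compressed archives directly.
  - name: Load the container images from the fetched tarball
    shell: docker load -i /tmp/all.tar.gz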

I wish to assist with the upstream infra swift setup for tripleo, and
that plan, just need a blessing and more hands from tripleo CI squad ;)
--
Best regards,
Bogdan Dobrelya,
Irc #bogdando
Jeremy Stanley
2018-05-15 15:40:53 UTC
On 2018-05-15 17:31:07 +0200 (+0200), Bogdan Dobrelya wrote:
[...]
Post by Bogdan Dobrelya
* upload into a swift container, with an automatic expiration set, the
de-duplicated and compressed tarball created with something like:
# docker save $(docker images -q) | gzip -1 > all.tar.gz
(I expect it will be something like a 2G file)
* something similar for DLRN repos prolly, I'm not an expert for this part.
Those stored artifacts would then be picked up by the next step in the
graph,
* fetch the swift containers with repos and container images
[...]

I do worry a little about network fragility here, as well as
extremely variable performance. Randomly-selected job nodes could be
shuffling those files halfway across the globe so either upload or
download (or both) will experience high round-trip latency as well
as potentially constrained throughput, packet loss,
disconnects/interruptions and so on... all the things we deal with
when trying to rely on the Internet, except magnified by the
quantity of data being transferred about.

Ultimately still worth trying, I think, but just keep in mind it may
introduce more issues than it solves.
--
Jeremy Stanley
Wesley Hayutin
2018-05-15 20:52:26 UTC
Post by Jeremy Stanley
... (skipped) ...
Question... Say we were to build or update the containers that need an
update, and I'm assuming the overcloud images here as well, in a parent
job.

Would the content then sync to a swift file server at a central point
for ALL the openstack providers, or would it be sync'd to each cloud?

Not to throw too much cold water on the idea, but...
I wonder if the time to upload and download the containers and images
would significantly reduce any advantage this process has.

Although centralizing the container updates and images on a
per-check-job basis sounds attractive, I get the sense we need to be
very careful and fully vet the idea. At the moment it's also an
optimization (maybe), so I don't see this as a very high priority atm.

Let's bring the discussion to the tripleo meeting next week. Thanks all!
Jeremy Stanley
2018-05-15 21:01:21 UTC
On 2018-05-15 14:52:26 -0600 (-0600), Wesley Hayutin wrote:
[...]
Post by Wesley Hayutin
The content would then sync to a swift file server on a central
point for ALL the openstack providers or it would be sync'd to
each cloud?
[...]

We haven't previously requested that all the Infra provider donors
support Swift, and even for the ones who do I don't think we can
count on it being available in every region where we run jobs. I
assumed that implementation would be a single (central) Swift tenant
provided by one of our donors who has it, thus the reason for my
performance concerns at "large" artifact sizes.
--
Jeremy Stanley
James E. Blair
2018-05-15 16:40:28 UTC
Post by Bogdan Dobrelya
* check out testing depends-on things,
(Zuul should have done this for you, but yes.)
Post by Bogdan Dobrelya
... (skipped) ...
That sounds about right (at least the Zuul parts :).

We're also talking about making a new kind of job which can continue to
run after it's "finished" so that you could use it to do something like
host a container registry that's used by other jobs running on the
change. We don't have that feature yet, but if we did, would you prefer
to use that instead of the intermediate swift storage?

-Jim
Jeremy Stanley
2018-05-15 16:56:20 UTC
On 2018-05-15 09:40:28 -0700 (-0700), James E. Blair wrote:
[...]
Post by James E. Blair
We're also talking about making a new kind of job which can continue to
run after it's "finished" so that you could use it to do something like
host a container registry that's used by other jobs running on the
change. We don't have that feature yet, but if we did, would you prefer
to use that instead of the intermediate swift storage?
If the subsequent jobs depending on that one get nodes allocated
from the same provider, that could solve a lot of the potential
network performance risks as well.
--
Jeremy Stanley
James E. Blair
2018-05-15 17:28:14 UTC
Post by Jeremy Stanley
[...]
Post by James E. Blair
... (skipped) ...
If the subsequent jobs depending on that one get nodes allocated
from the same provider, that could solve a lot of the potential
network performance risks as well.
That's... tricky. We're *also* looking at affinity for buildsets, and
I'm optimistic we'll end up with something there eventually, but that's
likely to be a more substantive change and probably won't happen as
soon. I do agree it will be nice, especially for use cases like this.

-Jim
Wesley Hayutin
2018-05-15 20:31:16 UTC
Post by James E. Blair
... (skipped) ...
There is a lot here to unpack and discuss, but I really like the ideas
I'm seeing.
Nice work Bogdan! I've added it to the tripleo meeting agenda for next
week so we can continue socializing the idea and get feedback.

Thanks!

https://etherpad.openstack.org/p/tripleo-meeting-items
Bogdan Dobrelya
2018-05-16 09:31:30 UTC
Post by Wesley Hayutin
... (skipped) ...
Thank you for the feedback, folks. There are a lot of technical
caveats, right. I'm pretty sure, though, that with broader containers
adoption, openstack infra will catch up eventually, so we could all
benefit in our upstream CI jobs from affinity-based scheduling and
co-located data available for subsequent build steps.
--
Best regards,
Bogdan Dobrelya,
Irc #bogdando

Jeremy Stanley
2018-05-16 12:17:03 UTC
On 2018-05-16 11:31:30 +0200 (+0200), Bogdan Dobrelya wrote:
[...]
I'm pretty sure, though, that with broader containers adoption,
openstack infra will catch up eventually, so we could all benefit in
our upstream CI jobs from affinity-based scheduling and co-located
data available for subsequent build steps.
I still don't see what it has to do with containers. We've known
these were potentially useful features long before
container-oriented projects came into the picture. We simply focused
on implementing other, even more generally-applicable features
first.
--
Jeremy Stanley
Bogdan Dobrelya
2018-05-16 13:17:52 UTC
Post by Jeremy Stanley
[...]
I'm pretty sure though with broader containers adoption, openstack
infra will catch up eventually, so we all could benefit our
upstream CI jobs with affinity based and co-located data available
around for consequent build steps.
I still don't see what it has to do with containers. We've known
My understanding, and I may be totally wrong, is that unlike packages
and repos (not counting OSTree [0]), containers use layers and can be
exported into tarballs with built-in de-duplication. This makes the
idea of tossing those tarballs around much more attractive than doing
something similar for repositories with packages. Of course, container
images can be pre-built into nodepool images, just like packages, so CI
users can rebuild on top with fewer changes brought into the new
layers, which is another nice-to-have option, by the way.

[0] https://rpm-ostree.readthedocs.io/en/latest/
Post by Jeremy Stanley
these were potentially useful features long before
container-oriented projects came into the picture. We simply focused
on implementing other, even more generally-applicable features
first.
Right, I think this only confirms that it *does* have something to do
with containers, and priorities for containerized use cases will follow
containers adoption trends. For example, if everyone one day suddenly
asks for nodepool images with the latest kolla containers injected.
--
Best regards,
Bogdan Dobrelya,
Irc #bogdando
Jeremy Stanley
2018-05-16 15:25:17 UTC
On 2018-05-16 15:17:52 +0200 (+0200), Bogdan Dobrelya wrote:
[...]
Post by Bogdan Dobrelya
My understanding, and I may be totally wrong, is that unlike
packages and repos (not counting OSTree [0]), containers use
layers and can be exported into tarballs with built-in
de-duplication. This makes the idea of tossing those tarballs around
much more attractive than doing something similar for repositories
with packages.
[...]

Projects which utilize service VMs (e.g. Trove) were asking to do
precisely the same things and had nothing to do with containers. The
idea that you might build a VM image up from proposed source in one
job and then fire several other jobs which used that proposed image
well-predates similar requests from container-oriented projects.
--
Jeremy Stanley
Bogdan Dobrelya
2018-05-15 12:07:56 UTC
Let me clarify the problem I want to solve with pipelines.

It is getting *hard* to develop things and move patches to the Happy End
(merged):
- Patches wait too long for CI jobs to start. It should be minutes, not
hours.
- If a patch fails a job w/o a good reason, the consequent recheck
operation repeats all that waiting over again.

How may pipelines help solve it?
Pipelines only alleviate, not solve, the problem of waiting. We only
want to build pipelines for the main zuul check process, omitting
gating and RDO CI (for now).

There are two cases to consider:
- A patch succeeds all checks
- A patch fails a check with dependencies

The latter case benefits us the most when pipelines are designed as
proposed here, so that any jobs expected to fail when a dependency
fails are omitted from execution. This saves a lot of HW resources and
zuul queue places, making them available for other patches and allowing
those to have their CI jobs started faster (less waiting!). When we
have "recheck storms", say because of some known intermittent side
issue, that outcome is multiplied by the recheck storm, um... level,
and delivers even better and absolutely amazing results :) The zuul
queue will not grow insanely, overwhelmed by multiple clones of
rechecked jobs highly likely doomed to fail, blocking other patches
that might have a chance to pass the checks as they are unaffected by
that intermittent issue.

And for the first case, when a patch succeeds, it takes some extended
time, and that is the price to pay. How much time it takes to finish in
a pipeline fully depends on implementation.

The effectiveness can only be measured with numbers extracted from
elastic search data, like the average time to wait for a job to start,
success vs fail execution time percentiles per job, the average number
of rechecks, recheck storms history, et al. I don't have that data and
don't know how to get it. Any help with that is very much appreciated
and could really help to move the proposed patches forward, or to
decline them. And we could then compare "before" and "after" as well.

I hope that explains the problem scope and the methodology to address that.
Post by Bogdan Dobrelya
... (skipped) ...
--
Best regards,
Bogdan Dobrelya,
Irc #bogdando
Jeremy Stanley
2018-05-15 12:30:44 UTC
Permalink
On 2018-05-15 14:07:56 +0200 (+0200), Bogdan Dobrelya wrote:
[...]
Post by Bogdan Dobrelya
How can pipelines help solve it?
Pipelines only alleviate, not solve, the problem of waiting. We only
want to build pipelines for the main zuul check process, omitting gating
and RDO CI (for now). Two outcomes are possible:
- A patch succeeds all checks
- A patch fails a check job that others depend on
The latter case benefits us the most when pipelines are designed as
proposed here, so that any jobs expected to fail once their dependency
has failed are omitted from execution.
[...]

Your choice of terminology is making it hard to follow this
proposal. You seem to mean something other than
https://zuul-ci.org/docs/zuul/user/config.html#pipeline when you use
the term "pipeline" (which gets confusing very quickly for anyone
familiar with Zuul configuration concepts).
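
For reference, a pipeline in Zuul's configuration is a stanza along
these lines (a minimal sketch, abridged; see the linked docs for the
full schema):

    - pipeline:
        name: check
        manager: independent
        trigger:
          gerrit:
            - event: patchset-created
        success:
          gerrit:
            Verified: 1
        failure:
          gerrit:
            Verified: -1

In those terms "check" is already a single pipeline, and what is being
discussed here happens entirely inside it.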
--
Jeremy Stanley
Bogdan Dobrelya
2018-05-15 13:22:14 UTC
Permalink
Post by Jeremy Stanley
... (skipped) ...
Your choice of terminology is making it hard to follow this
proposal. You seem to mean something other than
https://zuul-ci.org/docs/zuul/user/config.html#pipeline when you use
the term "pipeline" (which gets confusing very quickly for anyone
familiar with Zuul configuration concepts).
Indeed, sorry for that confusion. I mean pipelines as jobs executed in
batches, ordered via defined dependencies, like gitlab pipelines [0].
And those batches can also be thought of as steps, or whatever we call
them.

[0] https://docs.gitlab.com/ee/ci/pipelines.html
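
To illustrate the analogy only, a hypothetical .gitlab-ci.yml sketch
(job names and scripts are made up):

    stages:
      - check-1
      - check-2

    undercloud-job:
      stage: check-1
      script: ./ci/deploy-undercloud.sh

    overcloud-job:
      stage: check-2   # starts only after all check-1 jobs succeed
      script: ./ci/deploy-overcloud.sh

That stage ordering is what I have been loosely calling a "pipeline"
here.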
--
Best regards,
Bogdan Dobrelya,
Irc #bogdando
Jeremy Stanley
2018-05-15 14:07:57 UTC
Permalink
On 2018-05-15 15:22:14 +0200 (+0200), Bogdan Dobrelya wrote:
[...]
I mean pipelines as jobs executed in batches, ordered via defined
dependencies, like gitlab pipelines [0]. And those batches can
also be thought of as steps, or whatever we call them.
[...]

Got it. So Zuul refers to that relationship as a job dependency:

https://zuul-ci.org/docs/zuul/user/config.html#attr-job.dependencies

To be clearer, you might refer to this as dependent job ordering or
a job dependency graph.
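
For instance (a minimal sketch with hypothetical job names; the
attribute can be set on a job definition or on an item in a project's
job list):

    - job:
        name: child-integration-job
        dependencies:
          - parent-smoke-job

Zuul will then start child-integration-job only after parent-smoke-job
has completed successfully.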
--
Jeremy Stanley
Sagi Shnaidman
2018-05-15 15:08:01 UTC
Permalink
Bogdan,

I think before making final decisions we need to know exactly what
price we would have to pay. Without exact numbers it will be difficult
to discuss. If we need to wait ~80 minutes for the undercloud-containers
job to finish before starting all the other jobs (serializing it in
front of multinode jobs that themselves run for hours), it will be
about 4.5 hours to wait for a result (plus another ~4.5 hours in the
gate), which is too big a price IMHO and isn't worth the effort.

What exact numbers are we talking about?

Thanks
Post by Bogdan Dobrelya
Let me clarify the problem I want to solve with pipelines.
... (skipped) ...
--
Best regards
Sagi Shnaidman
Bogdan Dobrelya
2018-05-15 15:54:42 UTC
Permalink
Post by Sagi Shnaidman
Bogdan,
I think before making final decisions we need to know exactly what
price we would have to pay. Without exact numbers it will be difficult
to discuss. If we need to wait ~80 minutes for the undercloud-containers
job to finish before starting all the other jobs (serializing it in
front of multinode jobs that themselves run for hours), it will be
about 4.5 hours to wait for a result (plus another ~4.5 hours in the
gate), which is too big a price IMHO and isn't worth the effort.
What exact numbers are we talking about?
I fully agree, but I don't have those numbers, sorry! As I noted above,
they are definitely sitting in openstack-infra's Elasticsearch DB and
just need to be extracted, with some assistance from folks who know
more about it!
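
For example, something in the spirit of elastic-recheck's query files
could be a starting point (a sketch assuming the infra logstash indexes
the build_name and build_status fields the way elastic-recheck queries
use them; duration percentiles would still need aggregations on top):

    query: >
      build_name:"tripleo-ci-centos-7-3nodes-multinode" AND
      build_status:"FAILURE"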
Post by Sagi Shnaidman
Thanks
... (skipped) ...
--
Best regards,
Bogdan Dobrelya,
Irc #bogdando

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: OpenStack-dev-***@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cg