Discussion:
[openstack-dev] [Keystone][Oslo] Caching tokens in auth token middleware
Vishvananda Ishaya
2013-03-01 18:18:54 UTC
Hi Everyone,

So I've been doing some profiling of api calls against devstack and I've discovered that a significant portion of time spent is in the auth_token middleware validating the PKI token. There is code to turn on caching of the token if memcache is enabled, but this seems like overkill in most cases. We should be caching the token in memory by default. Fortunately, nova has some nifty code that will use an in-memory cache if memcached isn't available.
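For anyone curious what that fallback looks like, here is a minimal sketch of the pattern (names and details are illustrative, not nova's exact memorycache module): a fake client that mimics the memcache get/set interface with expiry, used only when no memcached servers are configured.

import time

class FakeMemcacheClient(object):
    """In-process stand-in for a memcache client (illustrative only)."""

    def __init__(self):
        self._cache = {}  # key -> (expiry_timestamp_or_0, value)

    def get(self, key):
        expiry, value = self._cache.get(key, (0, None))
        if expiry and expiry < time.time():
            # Entry has timed out; behave as a cache miss.
            del self._cache[key]
            return None
        return value

    def set(self, key, value, ttl=0):
        expiry = time.time() + ttl if ttl else 0
        self._cache[key] = (expiry, value)

def get_cache_client(memcached_servers=None):
    # Use real memcached when it is configured; otherwise fall back to the
    # in-process cache so validated tokens can still be reused.
    if memcached_servers:
        import memcache
        return memcache.Client(memcached_servers)
    return FakeMemcacheClient()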

I think it is important that we get this code into python-keystoneclient in time for the grizzly release. There are three basic options.

1) Shim the code into the wsgi stack using the configuration options designed for swift:

https://review.openstack.org/23236

This is my least favorite option since changing paste config is a pain for deployers and it doesn't help any of the other projects.

2) Copy the code into keystoneclient:

https://review.openstack.org/23307

3) Move memorycache into oslo and sync it to nova and keystoneclient:

https://review.openstack.org/23306
https://review.openstack.org/23308
https://review.openstack.org/23309

I think 3) is the right long-term move, but I'm not sure if this is appropriate considering how close we are to the grizzly release, so if we want to do 2) immediately and postpone 3) until H, that is fine with me.

Thoughts?

Vish
Jay Pipes
2013-03-01 22:59:30 UTC
Post by Vishvananda Ishaya
Hi Everyone,
So I've been doing some profiling of api calls against devstack and I've discovered that a significant portion of time spent is in the auth_token middleware validating the PKI token. There is code to turn on caching of the token if memcache is enabled, but this seems like overkill in most cases. We should be caching the token in memory by default. Fortunately, nova has some nifty code that will use an in-memory cache if memcached isn't available.
We gave up on PKI in Folsom after weeks of trouble with it:

* Unstable -- Endpoints would stay up fine for more than 24 hours, but at
around the 24-hour mark (sometimes sooner) the endpoint would stop
working properly, with the service user suddenly being returned a 401
when trying to validate a token. Restarting the endpoint with a "service
nova-api restart" gets rid of the 401 Unauthorized for a few hours, and
then it happens again.

* Unable to use memcache with PKI. The token was longer than the maximum
memcache key length and resulted in errors on every request. The solution
for this was to hash the CMS token and use the hash as the key in
memcache, but unfortunately that solution wasn't backported to Folsom
Keystone -- partly, I think, because the auth_token middleware was split
out into keystoneclient during Grizzly.
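For reference, the workaround amounts to something like the following sketch (the digest choice and key prefix are illustrative, not the exact patch): derive a short, fixed-length cache key from the CMS token, since memcached rejects keys longer than 250 bytes.

import hashlib

MEMCACHE_KEY_LIMIT = 250  # memcached's maximum key length in bytes

def memcache_key_for(token):
    # PKI/CMS tokens are far longer than 250 bytes, so cache the validation
    # result under a digest of the token rather than the raw token itself.
    if len(token) > MEMCACHE_KEY_LIMIT:
        return 'tokens/%s' % hashlib.sha256(token.encode('utf-8')).hexdigest()
    return 'tokens/%s' % token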

In any case, the above two things make PKI unusable in Folsom.

We fell back on UUID tokens -- the default in Folsom. Unfortunately,
there are serious performance issues with this approach as well. Every
single request to an endpoint results in multiple requests to Keystone,
which bogs down the system.
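To illustrate the round trip (a sketch; the exact call the middleware makes and the admin URL/port are assumptions here, based on the v2 validate-token API): with UUID tokens the middleware cannot verify a token locally, so every incoming API request triggers at least one call like this.

import requests

def validate_token_online(keystone_admin_url, service_token, user_token):
    # One extra HTTP round trip to Keystone per API request, unless the
    # validation result is cached somewhere.
    resp = requests.get(
        '%s/v2.0/tokens/%s' % (keystone_admin_url, user_token),
        headers={'X-Auth-Token': service_token})
    resp.raise_for_status()
    return resp.json()['access']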

In addition to the obvious roundtrip issues, with just 26 users in a
test cloud, in 3 weeks there are over 300K records in the tokens table
on a VERY lightly used cloud. Not good. Luckily, we use multi-master
MySQL replication (Galera) with excellent write rates spread across four
cluster nodes, but this scale of writes for such a small test cluster is
worrying to say the least.

Although not related to PKI, I've also noticed that the decision
to use a denormalized schema in the users table, with the "extra" column
storing a JSON-encoded blob of data including the user's default tenant
and enabled flag, is a horrible performance problem. Hope that v3
Keystone has corrected these issues in the SQL driver.
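To make the cost concrete (the blob contents below are only illustrative of the pattern described above): anything stored inside the JSON blob has to be decoded in Python for every candidate row, while a real column can be indexed and filtered directly in SQL.

import json

# Denormalized: 'enabled' lives inside the JSON 'extra' blob, so every row
# must be fetched and decoded before it can be filtered.
def enabled_users_denormalized(rows):
    return [r for r in rows
            if json.loads(r['extra']).get('enabled', True)]

# Normalized: 'enabled' is a real column, so the database can evaluate
#   SELECT ... FROM user WHERE enabled = 1
# using an index instead of scanning and decoding JSON per row.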
Post by Vishvananda Ishaya
https://review.openstack.org/23236
This is my least favorite option since changing paste config is a pain for deployers and it doesn't help any of the other projects.
Meh, whether you add options to a config file or a paste INI file it's
the same pain for deployers :) But generally agree with you.
Post by Vishvananda Ishaya
https://review.openstack.org/23307
https://review.openstack.org/23306
https://review.openstack.org/23308
https://review.openstack.org/23309
I think 3) is the right long term move, but I'm not sure if this appropriate considering how close we are to the grizzly release, so if we want to do 2) immediately and postpone 3) until H, that is fine with me.
Well, I think 3) is the right thing to do in any case, and can be done
in oslo regardless of Nova's RC status.

Not sure that 2) is really all that useful. If you are in any serious
production environment, you're going to be using memcached anyway.

Best,
-jay
Post by Vishvananda Ishaya
Thoughts?
Vish
Russell Bryant
2013-03-01 23:32:27 UTC
Post by Jay Pipes
Post by Vishvananda Ishaya
Hi Everyone,
So I've been doing some profiling of api calls against devstack and I've discovered that a significant portion of time spent is in the auth_token middleware validating the PKI token. There is code to turn on caching of the token if memcache is enabled, but this seems like overkill in most cases. We should be caching the token in memory by default. Fortunately, nova has some nifty code that will use an in-memory cache if memcached isn't available.
* Unstable -- Endpoints would stay up >24 hours but after around 24
hours (sometimes sooner), the endpoint would stop working properly with
the server user suddenly returned a 401 when trying to authenticate a
token. Restarting the endpoint with a service nova-api restart gets rid
of the 401 Unauthorized for a few hours and then it happens again.
* Unable to use memcache with PKI. The token was longer than the maximum
memcache key and resulted in errors on every request. The solution for
this was to hash the CMS token and use hash as a key in memcache, but
unfortunately this solution wasn't backported to Folsom Keystone --
partly I think because the auth_token middleware was split out into the
keystoneclient during Grizzly.
In any case, the above two things make PKI unusable in Folsom.
We fell back on UUID tokens -- the default in Folsom. Unfortunately,
there are serious performance issues with this approach as well. Every
single request to an endpoint results in multiple requests to Keystone,
which bogs down the system.
In addition to the obvious roundtrip issues, with just 26 users in a
test cloud, in 3 weeks there are over 300K records in the tokens table
on a VERY lightly used cloud. Not good. Luckily, we use multi-master
MySQL replication (Galera) with excellent write rates spread across four
cluster nodes, but this scale of writes for such a small test cluster is
worrying to say the least.
Although not related to PKI, I've also noticed that due to the decision
to use a denormalized schema in the users table with the "extra" column
storing a JSON-encoded blob of data including the user's default tenant
and enabled flag is a horrible performance problem. Hope that v3
Keystone has corrected these issues in the SQL driver.
This is really interesting feedback. Thanks for writing it up.
Post by Jay Pipes
Post by Vishvananda Ishaya
https://review.openstack.org/23236
This is my least favorite option since changing paste config is a pain for deployers and it doesn't help any of the other projects.
Meh, whether you add options to a config file or a paste INI file it's
the same pain for deployers :) But generally agree with you.
Post by Vishvananda Ishaya
https://review.openstack.org/23307
https://review.openstack.org/23306
https://review.openstack.org/23308
https://review.openstack.org/23309
I think 3) is the right long term move, but I'm not sure if this appropriate considering how close we are to the grizzly release, so if we want to do 2) immediately and postpone 3) until H, that is fine with me.
Well, I think 3) is the right thing to do in any case, and can be done
in oslo regardless of Nova's RC status.
Not sure that 2) is really all that useful. If you are in any serious
production environment, you're going to be using memcached anyway.
+1 that 3 is ideal. I think this should have been done with a FFE for
Oslo. It got merged in Oslo already anyway, though ...
--
Russell Bryant
Mark McLoughlin
2013-03-02 13:32:33 UTC
Post by Russell Bryant
Post by Jay Pipes
Post by Vishvananda Ishaya
https://review.openstack.org/23306
https://review.openstack.org/23308
https://review.openstack.org/23309
I think 3) is the right long term move, but I'm not sure if this appropriate considering how close we are to the grizzly release, so if we want to do 2) immediately and postpone 3) until H, that is fine with me.
Well, I think 3) is the right thing to do in any case, and can be done
in oslo regardless of Nova's RC status.
Not sure that 2) is really all that useful. If you are in any serious
production environment, you're going to be using memcached anyway.
+1 that 3 is ideal. I think this should have been done with a FFE for
Oslo. It got merged in Oslo already anyway, though ...
The Oslo commit merged quickly while I wasn't looking, but when I saw
it post-merge I figured it was fine without an FFE.

It's just moving code from nova.common to nova.openstack.common
essentially. No major regression risk, no impact on users, no massive
distraction of reviewers.

Adding it to keystoneclient also doesn't need an FFE since we don't do
feature freezes for clients.

Cheers,
Mark.
Dolph Mathews
2013-03-02 00:40:41 UTC
Post by Vishvananda Ishaya
Post by Vishvananda Ishaya
Hi Everyone,
So I've been doing some profiling of api calls against devstack and I've
discovered that a significant portion of time spent is in the auth_token
middleware validating the PKI token. There is code to turn on caching of
the token if memcache is enabled, but this seems like overkill in most
cases. We should be caching the token in memory by default. Fortunately,
nova has some nifty code that will use an in-memory cache if memcached
isn't available.
* Unstable -- Endpoints would stay up >24 hours but after around 24
hours (sometimes sooner), the endpoint would stop working properly with
the server user suddenly returned a 401 when trying to authenticate a
token. Restarting the endpoint with a service nova-api restart gets rid
of the 401 Unauthorized for a few hours and then it happens again.
Obviously that's not acceptable behavior; is there a bug tracking this
issue? I poked around but didn't see anything related to unexpected 401's.
Post by Vishvananda Ishaya
* Unable to use memcache with PKI. The token was longer than the maximum
memcache key and resulted in errors on every request. The solution for
this was to hash the CMS token and use hash as a key in memcache, but
unfortunately this solution wasn't backported to Folsom Keystone --
partly I think because the auth_token middleware was split out into the
keystoneclient during Grizzly.
I believe keystoneclient.middleware.auth_token supports this and is
backwards compatible with essex and folsom -- any reason why you couldn't
utilize the latest client / middleware?
Post by Vishvananda Ishaya
In any case, the above two things make PKI unusable in Folsom.
We fell back on UUID tokens -- the default in Folsom. Unfortunately,
there are serious performance issues with this approach as well. Every
single request to an endpoint results in multiple requests to Keystone,
which bogs down the system.
+1; the best solution to this is to have clients cache tokens until they're
expired. keystoneclient does this if keyring support is enabled.
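The pattern is roughly the following (a sketch of the idea, not keystoneclient's actual keyring code): remember the issued token and its expiry, and only go back to Keystone when the cached token is missing or about to expire.

import datetime

class TokenCache(object):
    """Reuse an issued token until shortly before it expires (illustrative)."""

    def __init__(self, authenticate, slack_seconds=30):
        self._authenticate = authenticate  # callable returning (token_id, expires_at)
        self._slack = datetime.timedelta(seconds=slack_seconds)
        self._token = None
        self._expires_at = None

    def get_token(self):
        now = datetime.datetime.utcnow()
        if self._token is None or now + self._slack >= self._expires_at:
            # Only this path costs a round trip to Keystone; everything else
            # reuses the cached token.
            self._token, self._expires_at = self._authenticate()
        return self._token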
Post by Vishvananda Ishaya
In addition to the obvious roundtrip issues, with just 26 users in a
test cloud, in 3 weeks there are over 300K records in the tokens table
on a VERY lightly used cloud. Not good. Luckily, we use multi-master
MySQL replication (Galera) with excellent write rates spread across four
cluster nodes, but this scale of writes for such a small test cluster is
worrying to say the least.
At the moment, you're free to do whatever you want with expired tokens and
you won't otherwise impact the deployment. Delete them, archive them,
whatever. Joe Breu suggested adding a keystone-manage command to copy
expired tokens into a tokens_archive table, which could be
run periodically to clean up... I'm certainly in favor of that. Perhaps
something like:

$ keystone-manage flush-expired-tokens [--strategy=archive|delete]
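As a rough sketch of what the archive strategy might do under the hood (table and column names are assumed from this thread, the SQL is MySQL-flavored, and this is not an actual keystone-manage implementation):

from sqlalchemy import create_engine, text

def flush_expired_tokens(db_url, strategy='archive'):
    engine = create_engine(db_url)
    # One transaction, so tokens are never dropped without being archived.
    with engine.begin() as conn:
        if strategy == 'archive':
            conn.execute(text(
                "INSERT INTO tokens_archive "
                "SELECT * FROM tokens WHERE expires < UTC_TIMESTAMP()"))
        conn.execute(text(
            "DELETE FROM tokens WHERE expires < UTC_TIMESTAMP()"))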
Post by Vishvananda Ishaya
Although not related to PKI, I've also noticed that due to the decision
to use a denormalized schema in the users table with the "extra" column
storing a JSON-encoded blob of data including the user's default tenant
and enabled flag is a horrible performance problem. Hope that v3
Keystone has corrected these issues in the SQL driver.
We've made several such improvements. We still use the 'extra' columns, but
hopefully not for anything that should be indexed.
Post by Vishvananda Ishaya
Post by Vishvananda Ishaya
1) Shim the code into the wsgi stack using the configuration options
https://review.openstack.org/23236
This is my least favorite option since changing paste config is a pain
for deployers and it doesn't help any of the other projects.
Meh, whether you add options to a config file or a paste INI file it's
the same pain for deployers :) But generally agree with you.
Post by Vishvananda Ishaya
https://review.openstack.org/23307
https://review.openstack.org/23306
https://review.openstack.org/23308
https://review.openstack.org/23309
I think 3) is the right long term move, but I'm not sure if this
appropriate considering how close we are to the grizzly release, so if we
want to do 2) immediately and postpone 3) until H, that is fine with me.
Well, I think 3) is the right thing to do in any case, and can be done
in oslo regardless of Nova's RC status.
Since the feature has merged to oslo, I take it that oslo isn't abiding by
feature freeze? I'm happy to see it utilized in keystoneclient.
Post by Vishvananda Ishaya
Not sure that 2) is really all that useful. If you are in any serious
production environment, you're going to be using memcached anyway.
Best,
-jay
Post by Vishvananda Ishaya
Thoughts?
Vish
Vishvananda Ishaya
2013-03-02 01:15:27 UTC
Post by Jay Pipes
Post by Vishvananda Ishaya
Hi Everyone,
So I've been doing some profiling of api calls against devstack and I've discovered that a significant portion of time spent is in the auth_token middleware validating the PKI token. There is code to turn on caching of the token if memcache is enabled, but this seems like overkill in most cases. We should be caching the token in memory by default. Fortunately, nova has some nifty code that will use an in-memory cache if memcached isn't available.
* Unstable -- Endpoints would stay up >24 hours but after around 24
hours (sometimes sooner), the endpoint would stop working properly with
the server user suddenly returned a 401 when trying to authenticate a
token. Restarting the endpoint with a service nova-api restart gets rid
of the 401 Unauthorized for a few hours and then it happens again.
Obviously that's not acceptable behavior; is there a bug tracking this issue? I poked around but didn't see anything related to unexpected 401's.
This bug was fixed quite a while ago:

https://bugs.launchpad.net/keystone/+bug/1074172

https://review.openstack.org/#/c/15242/

But it looks like it was never backported to stable/folsom. I've proposed it here:

https://review.openstack.org/#/c/23334/

If someone can target the bug to folsom that would be awesome.

Vish
Jay Pipes
2013-03-04 16:24:22 UTC
Post by Vishvananda Ishaya
https://bugs.launchpad.net/keystone/+bug/1074172
https://review.openstack.org/#/c/15242/
https://review.openstack.org/#/c/23334/
If someone can target the bug to folsom that would be awesome.
Thank you, Vish.

Best,
-jay
Adam Young
2013-03-02 03:17:11 UTC
Post by Jay Pipes
Post by Vishvananda Ishaya
Hi Everyone,
So I've been doing some profiling of api calls against devstack and I've discovered that a significant portion of time spent is in the auth_token middleware validating the PKI token. There is code to turn on caching of the token if memcache is enabled, but this seems like overkill in most cases. We should be caching the token in memory by default. Fortunately, nova has some nifty code that will use an in-memory cache if memcached isn't available.
It was committed late enough in the Folsom cycle that we didn't feel
comfortable going default with it. I knew that we wouldn't flush out
the bugs, though, until it was the default, which is why it was our
first task in the Grizzly cycle.
Post by Jay Pipes
* Unstable -- Endpoints would stay up >24 hours but after around 24
hours (sometimes sooner), the endpoint would stop working properly with
the server user suddenly returned a 401 when trying to authenticate a
token. Restarting the endpoint with a service nova-api restart gets rid
of the 401 Unauthorized for a few hours and then it happens again.
I assume there was no logging specifying what was failing. I can make a
guess, though, that there was some sort of glitch in getting the token
revocation list, and that the list was only fetched at start up.
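In other words, the suspected shape of the problem is something like this (purely illustrative of the guess above, not the middleware's actual code): if the cached revocation list can only be refreshed at one point and that fetch starts failing, validation starts failing across the board until the service is restarted.

import time

class RevocationChecker(object):
    """Illustrates the suspected failure mode; not the real middleware."""

    def __init__(self, fetch_revocation_list, max_age=300):
        self._fetch = fetch_revocation_list  # pulls the signed list from Keystone
        self._max_age = max_age
        self._revoked = set()
        self._fetched_at = 0.0

    def is_revoked(self, token_id):
        if time.time() - self._fetched_at > self._max_age:
            # If this call fails (signing cert problem, network glitch, ...)
            # and the failure is treated as "cannot validate", every request
            # starts getting a 401 even though the tokens themselves are fine.
            self._revoked = self._fetch()
            self._fetched_at = time.time()
        return token_id in self._revoked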
Post by Jay Pipes
* Unable to use memcache with PKI. The token was longer than the maximum
memcache key and resulted in errors on every request. The solution for
this was to hash the CMS token and use hash as a key in memcache, but
unfortunately this solution wasn't backported to Folsom Keystone --
partly I think because the auth_token middleware was split out into the
keystoneclient during Grizzly.
In any case, the above two things make PKI unusable in Folsom.
We fell back on UUID tokens -- the default in Folsom. Unfortunately,
there are serious performance issues with this approach as well. Every
single request to an endpoint results in multiple requests to Keystone,
which bogs down the system.
That right there was the original reason for the PKI tokens. Any hard
performance data?
Post by Jay Pipes
In addition to the obvious roundtrip issues, with just 26 users in a
test cloud, in 3 weeks there are over 300K records in the tokens table
on a VERY lightly used cloud. Not good. Luckily, we use multi-master
MySQL replication (Galera) with excellent write rates spread across four
cluster nodes, but this scale of writes for such a small test cluster is
worrying to say the least.
Did you consider using the memcached backend for tokens?
Memcached has an automatic timeout.
Post by Jay Pipes
Although not related to PKI, I've also noticed that due to the decision
to use a denormalized schema in the users table with the "extra" column
storing a JSON-encoded blob of data including the user's default tenant
and enabled flag is a horrible performance problem. Hope that v3
Keystone has corrected these issues in the SQL driver.
Normalization was done roughly at the start of the G3 cycle.
Post by Jay Pipes
Post by Vishvananda Ishaya
https://review.openstack.org/23236
This is my least favorite option since changing paste config is a pain for deployers and it doesn't help any of the other projects.
Meh, whether you add options to a config file or a paste INI file it's
the same pain for deployers :) But generally agree with you.
Post by Vishvananda Ishaya
https://review.openstack.org/23307
https://review.openstack.org/23306
https://review.openstack.org/23308
https://review.openstack.org/23309
I think 3) is the right long term move, but I'm not sure if this appropriate considering how close we are to the grizzly release, so if we want to do 2) immediately and postpone 3) until H, that is fine with me.
Well, I think 3) is the right thing to do in any case, and can be done
in oslo regardless of Nova's RC status.
Not sure that 2) is really all that useful. If you are in any serious
production environment, you're going to be using memcached anyway.
Best,
-jay
I'm all for 3. I thought this was underway already. I assume the Oslo
folks have no reservations?
Post by Jay Pipes
Post by Vishvananda Ishaya
Thoughts?
Vish
Jay Pipes
2013-03-04 17:04:49 UTC
Post by Adam Young
Post by Jay Pipes
Post by Vishvananda Ishaya
Hi Everyone,
So I've been doing some profiling of api calls against devstack and I've discovered that a significant portion of time spent is in the auth_token middleware validating the PKI token. There is code to turn on caching of the token if memcache is enabled, but this seems like overkill in most cases. We should be caching the token in memory by default. Fortunately, nova has some nifty code that will use an in-memory cache if memcached isn't available.
It was commited late enough in the Folsom cycle that we didn't feel
comfortable going default with it. I knew that we wouldn't flush out
the bugs, though, until it was the default, which is why it was our
first task in the Grizzly cycle.
Understood, and I believe I was clear in my post that I was specifically
referring to Folsom :)
Post by Adam Young
Post by Jay Pipes
* Unstable -- Endpoints would stay up >24 hours but after around 24
hours (sometimes sooner), the endpoint would stop working properly with
the server user suddenly returned a 401 when trying to authenticate a
token. Restarting the endpoint with a service nova-api restart gets rid
of the 401 Unauthorized for a few hours and then it happens again.
I assume there was no logging specifying what was failing. I can make a
guess, though, that there was some sort of glitch in getting the token
revocation list, and that the list was only fetched at start up.
Indeed, no logging other than the return of the 401. I will say that it
would have been much easier to determine that it was the *service* user
that was getting a 401 and not the authenticating user if something like
that was in the debug log message! :)
Post by Adam Young
Post by Jay Pipes
* Unable to use memcache with PKI. The token was longer than the maximum
memcache key and resulted in errors on every request. The solution for
this was to hash the CMS token and use hash as a key in memcache, but
unfortunately this solution wasn't backported to Folsom Keystone --
partly I think because the auth_token middleware was split out into the
keystoneclient during Grizzly.
In any case, the above two things make PKI unusable in Folsom.
We fell back on UUID tokens -- the default in Folsom. Unfortunately,
there are serious performance issues with this approach as well. Every
single request to an endpoint results in multiple requests to Keystone,
which bogs down the system.
That right there was the original reason for the PKI tokens. Any hard
performance data?
Nothing hard, no. Doing things via the keystone CLI tool is noticeably
slower, but I have not had the time to do any benchmarks. Deployments
and work have a habit of getting in the way of that ;)
Post by Adam Young
Post by Jay Pipes
In addition to the obvious roundtrip issues, with just 26 users in a
test cloud, in 3 weeks there are over 300K records in the tokens table
on a VERY lightly used cloud. Not good. Luckily, we use multi-master
MySQL replication (Galera) with excellent write rates spread across four
cluster nodes, but this scale of writes for such a small test cluster is
worrying to say the least.
Did you consider using the memcached backedn for Tokens?
Memcached has an automated timeout.
We did not, no. More likely we will be on Grizzly before long, so I will
be prototyping PKI + memcache soon enough. Not sure I'll have time to
change this before we are on Grizzly (which is likely a good thing ;)
Post by Adam Young
Post by Jay Pipes
Although not related to PKI, I've also noticed that due to the decision
to use a denormalized schema in the users table with the "extra" column
storing a JSON-encoded blob of data including the user's default tenant
and enabled flag is a horrible performance problem. Hope that v3
Keystone has corrected these issues in the SQL driver.
Normalization was done roughly at the start of the G3 cycle.
All good news.

Best,
-jay
