Current state of metrics in Theia

There’s been a longstanding issue open in Theia about metrics that doesn’t look like it gained much traction and I wanted to bring it back into the light: https://github.com/theia-ide/theia/issues/1328

For context: I’m currently working on https://github.com/eclipse/che/issues/14245 which seeks to implement language server metrics and collect information on things such as “The % of Language Servers requests that get a successful response” and “The % of Language Servers requests that get a successful response in under N seconds”. This is simple enough if we just want to output with prometheus’ format but I’m just trying to see if we should do this or try to look back into metrics first. Personally, I think the plugin extension should just export the prometheus format for now but we should still have discussions over what we should do for metrics.

From what I understand, the @theia/metrics extension right now is only exporting logs with the prometheus format. One question I have is whether we should stick with this, and only support metrics with the prometheus format or instead move to support a generic format that be converted into any other metrics format.

I’m currently working on a P.O.C that allows extensions to register a metrics format and then a user can specify that metrics format by entering a query param in the URL. When that url is accessed it converts a general metrics log format into the registered metrics log output format.

For example, if we register some new metrics log format called foo we could retrieve the metrics in the format foo by going to /metrics?format=foo. If format in the query param is not found or no query param is specified then it will fall back to prometheus.

I think the biggest issue with this approach is that metrics would have to be converted from a general log format to the specified log format on demand (with many metrics this might be really slow).

Another worry with the general format is that it would be hard to create a general format that would be compatible with everything. If we ever needed to retroactively change the general format it would probably be a breaking change with any extendors who contribute metrics to theia.

The last thing is logging to the /metrics endpoint from the frontend. We can use JSON-RPC to communicate the logs between the frontend and backend, but should that be set up on a per extension basis? Or live on the browser side of metrics extension?

[original thread by Josh Pinkney]

cc @svenefftinge

I think we did no want to make explicit dependencies to prometheus, since it is not needed in all cases, e.g. when you develop the desktop application.

there can be some abstract metrics datatypes and apis, which can be written down to prometheus by a special Theia extension.

I’m not so sure this should be generalized. I can imagine that different use cases require different data points as well as different data sinks. And as you mentioned there could be a lot of data being transported so it would be good to avoid any unnecessary conversions. Maybe we instead write documentation on how to generally do it and solve it on a case by case? Eventually something that is common might evolve and we can put that into the framework.

In other words it should be done with a custom extension hooking into the plugin system extension?

Can we get rid of metrics extension then? Does anyone using it? cc @marc-dumais

Hi Anton. I would not object eventually deprecating the current extension if a more generic solution comes about, where we can still generate Prometheus-formatted metrics. We are using the current extension internally, for the node memory and CPU usage stats, but have not yet extended it, though we intend-to.

I meant remove it from the scope of Theia to allow each product has own way to collect metrics. There can be an external prometheus metrics extension without abstract metric APIs under theia-ide org, but not as a part of vailla Theia.

Do you see a pressing need to do this now? It could be that @jpinkney will end-up using Prometheus as well?

No pressing, it is just not necessary for Electron and for other products one can use other kind of reporting. I don’t see a reason to pull prometheus and metrics in core. I thitnk @jpinkney will need to create a custom che extension and hook it into the plugin system by overriding some methods, then we don’t need to come up with abstractions over prometheus, reviewing and maintaining it. And @jpinkney can use the whole api surface of prometheus. We could have such extension outside of che, just under theia-ide org that other can reuse it as well. It will require the plugin system though.

[Josh Pinkney]

What I planned to do was something like this: https://github.com/theia-ide/theia/compare/master...JPinkney:plugin-metrics#diff-42e21ff47aae4b2d39ad311d392f4a26R111. It works by wrapping a language server call and resolving it then creating the metrics. From there, the metrics are passed to the backend via json-rpc and then when a user accesses the /metrics endpoint you can see the percentage of successful calls. We’re actually using prometheus on Che so i’d be completely okay with putting the logs into prometheus format and just doing it that way. I just wanted to bring up the state of metrics just to get a better of idea whether or not this would even be mergeable or not. Also, maybe we could detect whether we are running in electron and just not send metrics at all in that case?

I don’t think it should be done in the plugin extension, there can be another extension like plugin-metrics which should instrument plugin code.

in this way, only apps which needs it can add such extension and it can be completely skipped for Electron

we should probably do https://github.com/theia-ide/theia/issues/5761 first

that any plugin rpc service can be customized in the same way as the rest of Theia app

@jpinkney @anton-kosyakov It’s not quite clear to me right now if you are talking about something we need right now (before we can collect LSP metrics) or an improvement in the future? Because what you’re proposing seems a mighty big hill to cross.

It can be done paritally for now, i.e LanguagesMainImpl can be put in DI context via a factory receiving RPCProtocol and returning an instance of LanguagesMainImpl. Code which should be instrumented by plugin-metrics extension should be extracted to own methods. Then plugin mertics extension subclasses LanguagesMainImpl with instrumented methods and rebinds the factory to provide a custom instance of such class.

[Josh Pinkney]

I’ve run into a bit of a snag for metrics. I’ve done everything we’ve discussed – created plugin-ext-metrics extension, refactored and made languages-main use DI etc but now I’m not entirely sure if it’s possible to get language server metrics directly.

Basically what’s happening is when you manually call vscode.languages.registerCompletionItemProvider like in https://github.com/microsoft/vscode-extension-samples/blob/master/completions-sample/src/extension.ts#L11

you can easily catch any error that appears within something like the provideCodeCompletions method because an error would return a rejected promise – this is how I was originally testing everything

I later learned that if you test on a vsix that uses vscode-languageclient (like https://github.com/redhat-developer/vscode-yaml/blob/master/src/extension.ts#L60) you cannot catch errors and thus can’t create metrics because that library handles the errors for you: https://github.com/microsoft/vscode-languageserver-node/blob/master/client/src/client.ts#L1403.

I think I misunderstood everything that vscode-languageclient was doing and that’s why I was testing it the first way and thinking that would work but not testing the second way until later on.

Now, the only way I’ve been able to figure out how to know there’s an error when a language server is running is to using the outputchannel and checking if the message starts with “[Error”

Does anyone have any other ideas of things I can try that might be a little bit better?

@jpinkney @tsmaeder Could you elaborate what kind of errors you want to detect? Failed connections?