Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(python cdk): Update init files to expose classes that are part of the public interface #38297

Merged
merged 18 commits into from
May 17, 2024

Conversation

girarda
Copy link
Contributor

@girarda girarda commented May 16, 2024

What

Update CDK init files to expose classes that are currently imported by a connector.

How

  1. Run git grep 'from airbyte_cdk' -- . | cat | awk -F ':' '{print $2}' | sort | uniq > imports.txt to find all imports from airbyte-integrations/connectors
  2. Add all imported files to __all__
  3. The test package is exported as a package to make it clear that it's separate from the other classes
  4. I didn't export airbyte_cdk.sources.declarative.schema.json_file_schema_loader._default_file_path even though it's used by a connector since it's marked as private. We should instead update the connector
  5. I didn't export the classes in airbyte_cdk.sources.declarative.models even though two are used by a connector. We shouldn't need to import these classes since they're only used to parse from the YAML representation of low-code connectors to their python implementation. The connector should instead of updated to use the declarativeinterfaces.

The classes from the destination.vector_db_based aren't at the top level because the submodule depends on the vector-db-based extra.

The classes from source.file_based module aren't at the top level because the submodule depends on the file-based extra

Follow up issues to force connector modules to import from the top-level module

Can this PR be safely reverted and rolled back?

  • YES 💚

Copy link

vercel bot commented May 16, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Ignored Deployment
Name Status Preview Comments Updated (UTC)
airbyte-docs ⬜️ Ignored (Inspect) Visit Preview May 17, 2024 3:24am

@girarda girarda changed the title Alex/cdk init refactor: python cdk - Update init files to expose classes that are part of the public interface May 16, 2024
@girarda girarda marked this pull request as ready for review May 16, 2024 18:09
@girarda girarda requested a review from a team as a code owner May 16, 2024 18:09
airbyte-cdk/python/airbyte_cdk/__init__.py Show resolved Hide resolved
from .transformations import RecordTransformation
from .declarative_stream import DeclarativeStream

__all__ = ["ConnectionChecker",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

expose the interfaces of the declarative module since connector developer can create custom class inheriting them. Changing these interfaces should be considered a breaking change


__all__ = ["AirbyteTracedException", "SchemaInferrer", "is_cloud_environment"]
__all__ = ["AirbyteTracedException", "SchemaInferrer", "is_cloud_environment", "as_airbyte_message", "resolve_refs"]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolved_refs is used by a few destinations so it should be part of the public interface. We can revisit separately if we don't want it to be

@girarda girarda changed the title refactor: python cdk - Update init files to expose classes that are part of the public interface refactor(python cdk): Update init files to expose classes that are part of the public interface May 16, 2024
Copy link
Contributor

@erohmensing erohmensing left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a bit of digging to see if there are things being used that are being exposed. There are, but it is also super understandably hard to keep track! It's also hard to talk about a "top level interface" when we still encourage importing from all of these various init files. That's to say, I don't know how much it adds to expose things here when they're so close to the files that they're being imported from.

Most of the other python libraries we import from (dagster, pydantic, git, etc) put everything into their top level (e.g. most things we import from them we can or should be importing with a from dagster import X instead of from dagster.some.specific.path import X). WDYT about us working toward this? We could take 1 of 2 approaches:

  1. Put everything that's currently used in the top-level init. This is easier to track because as you add things to the init, you can temporarily find and replace the code in the connectors until everything is just from airbyte_cdk import X. People can move over to the new imports now, and then we start removing things as necessary in later majors. Less obvious visibility into what people are using that we want to stop supporting, but we can still search for usages I guess
  2. Use that top level file as our "ideal", such that we put everything we want exposed there, and don't expose anything else. Then if we get people to move to importing things from base airbyte_cdk when it's in the list, we start to have visibility into which things people are importing that we aren't intentionally exposing. Downside here is its a lot of work to update the "desired" imports and distinguish them from the ones that can't be updated

@@ -7,6 +7,7 @@
from .abstract_source import AbstractSource
from .config import BaseConfig
from .source import Source
from .types import Config, Record
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to add these to __all__?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kinda think we can go either way with these. We could go all in on the newly defined types. And try to normalize or encourage our Python sources to use stronger types like the StreamSlice concept. Or like Ella mentioned we just don't expose them. Until a week ago these didn't even exist in Python CDK so most connectors probably don't refrerence them

Two questions:

  • Do we want to expose the StreamSlice concept which might help add some more sanity to our Python susbstreams?
  • Right now our method interfaces don't account for a lot of these types so it's not intuitive to use the new classes. But if we're making the 1.0.0 bump, would it be a good idea to improve the method interfaces?

@girarda
Copy link
Contributor Author

girarda commented May 16, 2024

Did a bit of digging to see if there are things being used that are being exposed. There are, but it is also super understandably hard to keep track! It's also hard to talk about a "top level interface" when we still encourage importing from all of these various init files. That's to say, I don't know how much it adds to expose things here when they're so close to the files that they're being imported from.

Most of the other python libraries we import from (dagster, pydantic, git, etc) put everything into their top level (e.g. most things we import from them we can or should be importing with a from dagster import X instead of from dagster.some.specific.path import X). WDYT about us working toward this? We could take 1 of 2 approaches:

  1. Put everything that's currently used in the top-level init. This is easier to track because as you add things to the init, you can temporarily find and replace the code in the connectors until everything is just from airbyte_cdk import X. People can move over to the new imports now, and then we start removing things as necessary in later majors. Less obvious visibility into what people are using that we want to stop supporting, but we can still search for usages I guess
  2. Use that top level file as our "ideal", such that we put everything we want exposed there, and don't expose anything else. Then if we get people to move to importing things from base airbyte_cdk when it's in the list, we start to have visibility into which things people are importing that we aren't intentionally exposing. Downside here is its a lot of work to update the "desired" imports and distinguish them from the ones that can't be updated

@erohmensing I'll move classes to the top-level init. that's a good idea.

I have a soft-preference for the first approach because it'll make it clear that modifying these classes is a breaking change while others can be considered "internal".

@octavia-squidington-iii octavia-squidington-iii removed area/documentation Improvements or additions to documentation area/connectors Connector related issues labels May 17, 2024
Copy link
Contributor

@erohmensing erohmensing left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! If we wanted to, we could stop exporting the classes from the lower init files that aren't extras-based, which would start to force people to move to the top-level imports if they're using those and not the actual immediate file-based imports. I'm not sure its necessary, but it might also make one less import interface.

Can you please also update the 1.0 migration guide to tell people that they should now import directly from airbyte_cdk and that other imports will be considered non-stable (i.e. we can move them around)?

@girarda girarda merged commit be090e8 into alex/cdk_1.0.0 May 17, 2024
29 checks passed
@girarda girarda deleted the alex/cdk_init branch May 17, 2024 16:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants