Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failure to decode Python 2.7 Code Object #1

Open
landaire opened this issue Dec 11, 2020 · 3 comments
Open

Failure to decode Python 2.7 Code Object #1

landaire opened this issue Dec 11, 2020 · 3 comments

Comments

@landaire
Copy link

First off, thanks for doing the hard work for this crate. I have a project where I'd like to just read data from a marshalled Python object and this solution fits the bill.

The code object I'm trying to deserialize is from Python 2.7 which appears to have a slightly different layout for TYPE_CODE. If you try to deserialize this data currently, the code object will advance the stream beyond where it should be and you'll start deserializing incorrect data.

I wrote a patch to add basic support for Python 2.7:

diff --git a/Cargo.toml b/Cargo.toml
index aed85a2..4faa7ce 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -18,3 +18,7 @@ num-bigint = "0.2"
 num-complex = "0.2"
 error-chain = "0.12"
 owning_ref = { version = "0.4", optional = true }
+
+[features]
+default = []
+python27 = []
\ No newline at end of file
diff --git a/src/lib.rs b/src/lib.rs
index 73e9402..a4e6e27 100644
--- a/src/lib.rs
+++ b/src/lib.rs
@@ -115,12 +115,12 @@ pub struct Code {
     pub flags:           CodeFlags,
     pub code:            Arc<Vec<u8>>,
     pub consts:          Arc<Vec<Obj>>,
-    pub names:           Vec<Arc<String>>,
-    pub varnames:        Vec<Arc<String>>,
-    pub freevars:        Vec<Arc<String>>,
-    pub cellvars:        Vec<Arc<String>>,
-    pub filename:        Arc<String>,
-    pub name:            Arc<String>,
+    pub names:           Vec<Arc<Vec<u8>>>,
+    pub varnames:        Vec<Arc<Vec<u8>>>,
+    pub freevars:        Vec<Arc<Vec<u8>>>,
+    pub cellvars:        Vec<Arc<Vec<u8>>>,
+    pub filename:        Arc<Vec<u8>>,
+    pub name:            Arc<Vec<u8>>,
     pub firstlineno:     u32,
     pub lnotab:          Arc<Vec<u8>>,
 }
@@ -920,12 +920,12 @@ pub mod read {
                 re: r_float_bin(p)?,
                 im: r_float_bin(p)?,
             })),
-            Type::String => Some(Obj::Bytes(Arc::new(r_bytes(r_long(p)? as usize, p)?))),
+            Type::String => { let obj = Some(Obj::Bytes(Arc::new(r_bytes(r_long(p)? as usize, p)?))); println!("{:?}", obj); obj },
             Type::AsciiInterned | Type::Ascii | Type::Interned | Type::Unicode => {
-                Some(Obj::String(Arc::new(r_string(r_long(p)? as usize, p)?)))
+                Some(Obj::Bytes(Arc::new(r_bytes(r_long(p)? as usize, p)?)))
             }
             Type::ShortAsciiInterned | Type::ShortAscii => {
-                Some(Obj::String(Arc::new(r_string(r_byte(p)? as usize, p)?)))
+                Some(Obj::Bytes(Arc::new(r_bytes(r_byte(p)? as usize, p)?)))
             }
             Type::SmallTuple => Some(Obj::Tuple(Arc::new(r_vec(r_byte(p)? as usize, p)?))),
             Type::Tuple => Some(Obj::Tuple(Arc::new(r_vec(r_long(p)? as usize, p)?))),
@@ -948,8 +948,8 @@ pub mod read {
             Type::Dict => Some(Obj::Dict(Arc::new(RwLock::new(r_hashmap(p)?)))),
             Type::Code => Some(Obj::Code(Arc::new(Code {
                 argcount: r_long(p)?,
-                posonlyargcount: if p.has_posonlyargcount { r_long(p)? } else { 0 },
-                kwonlyargcount: r_long(p)?,
+                posonlyargcount: if cfg!(not(feature = "python27")) && p.has_posonlyargcount { r_long(p)? } else { 0 },
+                kwonlyargcount: if cfg!(not(feature = "python27"))  { r_long(p)? } else { 0 },
                 nlocals: r_long(p)?,
                 stacksize: r_long(p)?,
                 flags: CodeFlags::from_bits_truncate(r_long(p)?),
@@ -996,9 +996,9 @@ pub mod read {
     fn r_object_not_null(p: &mut RFile<impl Read>) -> Result<Obj> {
         Ok(r_object(p)?.ok_or(ErrorKind::IsNull)?)
     }
-    fn r_object_extract_string(p: &mut RFile<impl Read>) -> Result<Arc<String>> {
+    fn r_object_extract_string(p: &mut RFile<impl Read>) -> Result<Arc<Vec<u8>>> {
         Ok(r_object_not_null(p)?
-            .extract_string()
+            .extract_bytes()
             .map_err(ErrorKind::TypeError)?)
     }
     fn r_object_extract_bytes(p: &mut RFile<impl Read>) -> Result<Arc<Vec<u8>>> {
@@ -1011,15 +1011,15 @@ pub mod read {
             .extract_tuple()
             .map_err(ErrorKind::TypeError)?)
     }
-    fn r_object_extract_tuple_string(p: &mut RFile<impl Read>) -> Result<Vec<Arc<String>>> {
+    fn r_object_extract_tuple_string(p: &mut RFile<impl Read>) -> Result<Vec<Arc<Vec<u8>>>> {
         Ok(r_object_extract_tuple(p)?
             .iter()
             .map(|x| {
                 x.clone()
-                    .extract_string()
+                    .extract_bytes()
                     .map_err(|o: Obj| Error::from(ErrorKind::TypeError(o)))
             })
-            .collect::<Result<Vec<Arc<String>>>>()?)
+            .collect::<Result<Vec<Arc<Vec<u8>>>>>()?)
     }
 
     fn read_object(p: &mut RFile<impl Read>) -> Result<Obj> {

There's a lot going on here, but the basic change is this:

-                posonlyargcount: if p.has_posonlyargcount { r_long(p)? } else { 0 },
-                kwonlyargcount: r_long(p)?,
+                posonlyargcount: if cfg!(not(feature = "python27")) && p.has_posonlyargcount { r_long(p)? } else { 0 },
+                kwonlyargcount: if cfg!(not(feature = "python27"))  { r_long(p)? } else { 0 },

This on its own is not enough, however. At least for the object I'm trying to deserialize, things break again once the filename field is deserialized. r_object is invoked again to read the filename string, and then you hit this block:

Type::String => Some(Obj::Bytes(Arc::new(r_bytes(r_long(p)? as usize, p)?))),

Note the change to r_object_extract_string:

-    fn r_object_extract_string(p: &mut RFile<impl Read>) -> Result<Arc<String>> {
+    fn r_object_extract_string(p: &mut RFile<impl Read>) -> Result<Arc<Vec<u8>>> {
         Ok(r_object_not_null(p)?
-            .extract_string()
+            .extract_bytes()
             .map_err(ErrorKind::TypeError)?)
     }

Type::String currently returns an Obj::Bytes but extract_string expects an Obj::String. I'm not sure what the right approach is here. I ended up just changing all of the strings to be Vec<u8>s since, at least in my case, the strings may contain wildly invalid UTF-8.

I wanted to ask: are you interested in supporting Python 2.7 deserialization? I wouldn't mind cleaning up this patch a bit and adding tests. I would need to study the Python 2.7 marshal code a bit more though to understand what the right approach here is for handling these strings correctly.

@sollyucko
Copy link
Owner

I am so sorry for not noticing this until now, but if you're still interested...

Rather than adding a feature flag, I feel like it would make sense to just use the has_posonlyargcount in-code configuration option, and add another has_kwonlyargcount option, which are designed to accommodate for differences between Python versions. (The way they work is that if you use marshal_load_ex, you can specify a MarshalLoadExOptions struct with the configuration options describing the behavior of the relevant Python version.)

Re: strings vs bytes: any idea what encoding the strings you are encountering are in? Latin-1? Or something else? If the encoding is predictable, would converting to UTF-8 AKA String be okay? If so, the code for creating the Obj::Code (https://github.com/sollyucko/py-marshal/blob/master/src/lib.rs#L949-L966) could probably be changed to automatically convert a Obj::Bytes from the correct encoding into a String.

FYI: I'm happy to support as wide of a range of versions and use-cases as possible. However, I'm not really working on this much anymore myself, so feel free to submit a PR.

@landaire
Copy link
Author

landaire commented Feb 1, 2021

I am so sorry for not noticing this until now, but if you're still interested...

No worries, I know how it goes :)

For full context, I started using this crate as I didn't want to write my project in Python. I also was not aware of how different the Python 2.7 and Python 3 marshal format was. As I've worked on the project since creating this issue some other changes have been made. You can view them here: master...landaire:master

Note the change looks a lot larger than it really is since somewhere along the way I had energy to add support for serializing data, refactored some things, then quickly lost that energy. Here's a rough list of major changes:

  • String refs appear to be handled differently in Python 2
  • The types that can be deserialized may have a different marker (787c324)
  • This isn't committed and pushed but I ended up changing strings to be based on https://docs.rs/bstr/0.2.14/bstr/ as a cheap workaround

Rather than adding a feature flag, I feel like it would make sense to just use the has_posonlyargcount in-code configuration option, and add another has_kwonlyargcount option, which are designed to accommodate for differences between Python versions. (The way they work is that if you use marshal_load_ex, you can specify a MarshalLoadExOptions struct with the configuration options describing the behavior of the relevant Python version.)

This makes sense.

Re: strings vs bytes: any idea what encoding the strings you are encountering are in?

They should be UTF-8, but I think the problem is that in Python 2.7 strings aren't guaranteed to be UTF-8 -- I think they're just arbitrary bytes but I could be wrong. As I eluded to above, I ended up just using the bstr crate for Strings/Bytes -- and this has worked out pretty well. I'll push this change soon.

FYI: I'm happy to support as wide of a range of versions and use-cases as possible. However, I'm not really working on this much anymore myself, so feel free to submit a PR.

Totally understand. My work on the project has been sort of off-and-on, so when I get a chance I'll package things up and submit a PR. Hopefully you don't feel any pressure -- I've been moving along just fine with a git source specified for the crate.

@sollyucko
Copy link
Owner

Things I'm fine with:

  • Changing error-chain to thiserror
  • Renaming Result to ParseResult
  • Changing String to BString: I guess this is fine, due to Python 2's use of non-strict UTF-8
  • Moving mod read to its own file: Great idea, I was even thinking about this myself
  • Adding Python 2 type characters

Things I'm not fine with:

  • Removing Python 3 type characters: Is there any reason this is necessary?
  • Adding the fictitious Bytes type: Shouldn't the mapping be Obj::Bytes => Type::String and Obj::String => Type::Unicode?
  • Removing posonlyargcount and kwonlyargcount: Is there any reason these can't just be 0 in Python 2?
  • Modifying existing test data to remove the posonlyargcount and kwonlyargcount fields: I'd rather have multiple tests using different MarshalLoadExOptions values

AFAICT from my testing:

  • Python 2 displays str as UTF-8 and marshals it as s + possibly-invalid UTF-8
  • Python 2 displays unicode as UCS-1/2/4 (UCS-4 = UTF-32) and marshals it as u + UTF-8
  • Python 3 marshals bytes as s
  • Python 3 displays str as characters rather than escapes and marshals it as u/z (ignoring interning and FLAG_REF) + UTF-8
  • Basically, str/bytes is marshalled as s, unicode/str is typically marshalled as u, and everything is marshalled using UTF-8. However, Python 2 isn't strict about having proper UTF-8.

To summarize: I'm happy to support Python 2, but without removing support for Python 3. I'm fine with making breaking changes if needed, though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants